====== srun does not work on the login node ======

If, when trying to use srun from the login node, you get a message like this:

<code>
$ srun -N 1 -n 64 --partition=gpunode --pty /bin/bash
srun: job 4872 queued and waiting for resources
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Task launch for StepId=4872.0 failed on node cn065: Invalid job credential
srun: error: Application launch failed: Invalid job credential
srun: Job step aborted
</code>

the following errors appear in the Slurm log:

<code>
sched: _slurm_rpc_allocate_resources JobId=4872 NodeList=(null) usec=228
sched: Allocate JobId=4872 NodeList=cn065 #CPUs=64 Partition=gpunode
error: slurm_receive_msgs: [[snmgt01]:38727] failed: Zero Bytes were transmitted or received
Killing interactive JobId=4872: Communication connection failure
_job_complete: JobId=4872 WEXITSTATUS 1
_job_complete: JobId=4872 done
step_partial_comp: JobId=4872 StepID=0 invalid; this step may have already completed
_slurm_rpc_complete_job_allocation: JobId=4872 error Job/step already completing or completed
</code>

and scontrol ping shows:

<code>
$ scontrol ping
Slurmctld(primary) at mmgt01 is DOWN
</code>

If the setup is configless:

<code>
SlurmctldParameters=enable_configless
</code>

make sure the nodes do NOT have a slurm.conf file in /etc/slurm; otherwise the configuration files collide and end up referring to different MUNGE keys. Verify with:

<code>
scontrol show config | grep -i "hash_val"
cn080: HASH_VAL = Different Ours=<...> Slurmctld=<...>
</code>

In other words, if the configuration is configless, the nodes must not keep a slurm.conf in their /etc/slurm directory.
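
As a complement to the check above, here is a minimal shell sketch for spotting stray slurm.conf files and comparing MUNGE key checksums between the controller and the compute nodes. It assumes passwordless SSH as root and uses cn065 and cn080 only as example hostnames; adjust the node list to your cluster.

<code bash>
# Example node names; replace with your own compute nodes.
NODES="cn065 cn080"

# 1) Look for a leftover slurm.conf on each node. In a configless setup
#    the compute nodes should not ship this file at all.
for node in $NODES; do
  ssh "$node" '[ -e /etc/slurm/slurm.conf ] && echo "$(hostname): stray /etc/slurm/slurm.conf found"'
done

# 2) Compare the MUNGE key checksum on the controller with the one on each
#    node; they must match, otherwise "Security violation" errors appear.
md5sum /etc/munge/munge.key
for node in $NODES; do
  ssh "$node" md5sum /etc/munge/munge.key
done
</code>

If a node reports a stray slurm.conf, remove it and restart slurmd on that node so it picks up the configuration served by slurmctld.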