slurm_errors
srun does not work on the login node
If, when trying to use srun from the login node, you get a message like:
$ srun -N 1 -n 64 --partition=gpunode --pty /bin/bash
srun: job 4872 queued and waiting for resources
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Task launch for StepId=4872.0 failed on node cn065: Invalid job credential
srun: error: Application launch failed: Invalid job credential
srun: Job step aborted
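The "Security violation" / "Invalid job credential" messages usually point at a MUNGE credential mismatch between the submitting host and the compute node. A minimal check, assuming MUNGE is installed on both ends and that cn065 is the node reported above (adjust the host name):

# On the login node: verify the local munged daemon answers
$ munge -n | unmunge

# Cross-host check: a credential created on the login node must decode on the compute node
# (this fails if the munge.key files differ between the two hosts)
$ munge -n | ssh cn065 unmunge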
The following error appears in the Slurm controller (slurmctld) log:
sched: _slurm_rpc_allocate_resources JobId=4872 NodeList=(null) usec=228
sched: Allocate JobId=4872 NodeList=cn065 #CPUs=64 Partition=gpunode
error: slurm_receive_msgs: [[snmgt01]:38727] failed: Zero Bytes were transmitted or received
Killing interactive JobId=4872: Communication connection failure
_job_complete: JobId=4872 WEXITSTATUS 1
_job_complete: JobId=4872 done
step_partial_comp: JobId=4872 StepID=0 invalid; this step may have already completed
_slurm_rpc_complete_job_allocation: JobId=4872 error Job/step already completing or completed
scontrol ping shows:
$ scontrol ping
Slurmctld(primary) at mmgt01 is DOWN
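Since the primary controller is reported as DOWN, check the slurmctld service on the controller host before touching the compute nodes. A minimal sketch, assuming the controller runs under systemd and that mmgt01 is the controller host shown above:

# On the controller host (mmgt01):
$ systemctl status slurmctld
$ journalctl -u slurmctld --since "1 hour ago"

# Restart it if it is stopped or has crashed
$ systemctl restart slurmctld

# From the login node, confirm the controller answers again
$ scontrol ping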
If our setup is configless:
SlurmctldParameters=enable_configless
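In configless mode the compute nodes pull slurm.conf from slurmctld instead of reading a local copy; slurmd is pointed at the controller either with the --conf-server option or via a DNS SRV record. A minimal sketch of the option-file approach, assuming the packaged systemd unit reads /etc/sysconfig/slurmd and that mmgt01 is the controller (adjust host and path to your installation):

# /etc/sysconfig/slurmd (or /etc/default/slurmd on Debian-based systems)
SLURMD_OPTIONS="--conf-server mmgt01"

# Apply on the node
$ systemctl restart slurmd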
Make sure the nodes do NOT have a slurm.conf file in /etc/slurm; otherwise the configuration files collide and end up referring to different MUNGE keys. Verify with:
scontrol show config | grep -i "hash_val"
cn080: HASH_VAL = Different Ours=<...> Slurmctld=<...>
In other words, if the configuration is configless, the nodes must not have a slurm.conf in their /etc/slurm directory.
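A minimal cleanup sketch, assuming clush is available (pdsh works the same way) and that the compute nodes are named cn001-cn080 (an assumption based on the node names above; adjust the node list to your cluster):

# Move the stale local configuration aside and restart slurmd on the nodes
$ clush -w cn[001-080] 'mv /etc/slurm/slurm.conf /etc/slurm/slurm.conf.bak 2>/dev/null; systemctl restart slurmd'

# Check that every node now agrees with the controller on the configuration hash
$ clush -w cn[001-080] 'scontrol show config | grep -i hash_val'

# Also confirm all nodes share the same MUNGE key as the controller
$ clush -w cn[001-080] 'md5sum /etc/munge/munge.key'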
