slurm_errors

srun does not work on the login node

If, when trying to use srun from the login node, you get a message like the following:

$ srun -N 1 -n 64 --partition=gpunode --pty /bin/bash
srun: job 4872 queued and waiting for resources
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Task launch for StepId=4872.0 failed on node cn065: Invalid job credential
srun: error: Application launch failed: Invalid job credential
srun: Job step aborted
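
The repeated "Security violation, slurm message from uid ..." lines usually indicate a credential problem between hosts, typically a MUNGE key mismatch. A quick cross-check (a sketch, using the compute node cn065 from the output above and assuming SSH access to it) is to generate a credential locally and decode it on the remote node:

$ munge -n | ssh cn065 unmunge

If both hosts share the same MUNGE key, unmunge reports STATUS: Success (0); any other status points to mismatched keys.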

The following error appears in the slurm log:

sched: _slurm_rpc_allocate_resources JobId=4872 NodeList=(null) usec=228
sched: Allocate JobId=4872 NodeList=cn065 #CPUs=64 Partition=gpunode
error: slurm_receive_msgs: [[snmgt01]:38727] failed: Zero Bytes were transmitted or received
Killing interactive JobId=4872: Communication connection failure
_job_complete: JobId=4872 WEXITSTATUS 1
_job_complete: JobId=4872 done
step_partial_comp: JobId=4872 StepID=0 invalid; this step may have already completed
_slurm_rpc_complete_job_allocation: JobId=4872 error Job/step already completing or completed

scontrol ping shows:

$ scontrol ping
Slurmctld(primary) at mmgt01 is DOWN
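
Before touching the configuration, it is worth confirming the state of the controller daemon itself (a sketch, assuming the controller host mmgt01 reported above and a systemd-managed slurmctld):

$ ssh mmgt01 systemctl status slurmctld
$ ssh mmgt01 journalctl -u slurmctld -n 50

If slurmctld is actually running there, the DOWN report seen from the login node is likely just another symptom of the stale configuration described below.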

If our setup is configless:

SlurmctldParameters=enable_configless
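
To confirm that the controller really advertises configless operation, you can query the running configuration for the parameter above (run this somewhere scontrol can reach slurmctld, e.g. on the controller itself; the output should contain something like the second line):

$ scontrol show config | grep -i slurmctldparameters
SlurmctldParameters     = enable_configless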

Make sure the compute nodes do NOT have a slurm.conf file in /etc/slurm; otherwise the configuration files collide and the daemons end up referring to different MUNGE keys. Check on each node with:

scontrol show config | grep -i "hash_val"

cn080: HASH_VAL                = Different Ours=<...> Slurmctld=<...>

In other words, if the configuration is configless, the nodes must not have a slurm.conf in their /etc/slurm directory.
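
One way to clean this up (a sketch; pdsh and the node range cn[001-080] are assumptions, adapt them to your cluster) is to look for stale local configs, remove them, and restart slurmd so the nodes pull the configuration from slurmctld again:

$ pdsh -w cn[001-080] 'ls -l /etc/slurm/slurm.conf 2>/dev/null'
$ pdsh -w cn[001-080] 'rm -f /etc/slurm/slurm.conf && systemctl restart slurmd'

After the restart, repeating the HASH_VAL check above should no longer report a difference.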
