====== srun does not work on the login node ======
If running srun from the login node fails with a message like the following:
$ srun -N 1 -n 64 --partition=gpunode --pty /bin/bash
srun: job 4872 queued and waiting for resources
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Task launch for StepId=4872.0 failed on node cn065: Invalid job credential
srun: error: Application launch failed: Invalid job credential
srun: Job step aborted
the following error appears in the slurm log:
sched: _slurm_rpc_allocate_resources JobId=4872 NodeList=(null) usec=228
sched: Allocate JobId=4872 NodeList=cn065 #CPUs=64 Partition=gpunode
error: slurm_receive_msgs: [[snmgt01]:38727] failed: Zero Bytes were transmitted or received
Killing interactive JobId=4872: Communication connection failure
_job_complete: JobId=4872 WEXITSTATUS 1
_job_complete: JobId=4872 done
step_partial_comp: JobId=4872 StepID=0 invalid; this step may have already completed
_slurm_rpc_complete_job_allocation: JobId=4872 error Job/step already completing or completed
scontrol ping shows:
$ scontrol ping
Slurmctld(primary) at mmgt01 is DOWN
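Before touching the nodes, it is worth confirming on the controller itself that slurmctld and the MUNGE daemon are actually running and that a MUNGE credential round-trips locally. A minimal check, assuming systemd-managed services (service names may differ by distribution):
$ systemctl status slurmctld munge
$ munge -n | unmunge
unmunge should report STATUS: Success (0). If slurmctld is up on the controller but the login node still reports it as DOWN, the problem is most likely a configuration or MUNGE key mismatch between the login node and the controller rather than a controller failure, which matches the configless scenario described below.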
If our setup is configless:
SlurmctldParameters=enable_configless
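For reference, a minimal sketch of how the configless pieces usually fit together (the hostname mmgt01 is taken from the scontrol ping output above; the --conf-server option is the standard way for slurmd to fetch its configuration from slurmctld, and the SLURMD_OPTIONS variable in /etc/sysconfig/slurmd depends on how slurmd was packaged; a DNS SRV record can be used instead):
# slurm.conf on the controller only
SlurmctldParameters=enable_configless
# on each compute/login node, slurmd is pointed at the controller,
# e.g. via /etc/sysconfig/slurmd read by the systemd unit:
SLURMD_OPTIONS="--conf-server mmgt01"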
Make sure the nodes do NOT have a slurm.conf file in /etc/slurm; otherwise the configuration files collide and end up referring to different MUNGE keys. Verify with:
scontrol show config | grep -i "hash_val"
cn080: HASH_VAL = Different Ours=<...> Slurmctld=<...>
In other words, if the configuration is configless, the nodes must not have a slurm.conf in their /etc/slurm directory.
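A cleanup sketch for the affected nodes (cn065 and cn080 are the node names seen in the output above; the .bak suffix and the ssh loop are only illustrative, and systemd-managed slurmd is assumed): move the stray file aside and restart slurmd so the node pulls its configuration from slurmctld again.
for n in cn065 cn080; do
  # move any local slurm.conf out of the way, then restart slurmd
  ssh "$n" 'test -f /etc/slurm/slurm.conf && sudo mv /etc/slurm/slurm.conf /etc/slurm/slurm.conf.bak; sudo systemctl restart slurmd'
done
Afterwards, scontrol show config | grep -i "hash_val" on the nodes should report the same HASH_VAL as slurmctld.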