# Troubleshooting Common Slurm Job Issues
This document outlines common problems encountered with Slurm-managed jobs, especially in Snakemake workflows, and provides clear, actionable solutions.
## No Output Files Generated
### Symptoms
If no `slurm-*.out` files are generated after job submission, the issue is typically one of the following:

- Insufficient disk space in the output directory (e.g., `$HOME`, `/tmp`, or `/var/tmp`)
- Missing write permissions for the user in the target directory
- Filesystem not mounted on compute nodes (e.g., `/var/tmp` or `/var/spool` inaccessible)
### Solution
1. Check available disk space:

   ```bash
   df -h /tmp /var/tmp /var/spool
   ```

2. Verify write permissions:

   ```bash
   ls -ld /tmp /var/tmp
   ```

3. Ensure the filesystem is mounted on compute nodes:

   ```bash
   # On a compute node (via srun):
   srun -N 1 -n 1 --pty bash
   df -h | grep -E "(tmp|spool)"
   ```
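The same check can also be run non-interactively against a specific node; a minimal sketch (the node name `maestro-1054` is only an example):

```bash
# Run the disk check directly on one compute node (node name is an example)
srun -N 1 -n 1 -w maestro-1054 df -h /tmp /var/tmp /var/spool
```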
> **Tip:** `/var/spool` and `/var/tmp` typically share the same underlying storage. Filling one will affect the other.
## Jobs Fail Due to Full `/tmp` or `/var/tmp` (Fixed via `JobContainerType=job_container/tmpfs`)
### Problem
Jobs may fail with errors like:

```
error: write /var/spool/slurmd/cred_state.new error No space left on device
```

This occurs when `/var/tmp` or `/var/spool` becomes full due to large temporary files created during job execution.
> **Note:** `/var/spool` and `/var/tmp` typically share the same disk partition.
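A quick way to confirm this on a given node is to compare the filesystems reported by `df`; if the `Filesystem` column is identical for both paths, they share a partition:

```bash
# If both paths report the same filesystem/device, they share one partition
df -h /var/tmp /var/spool
```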
### Diagnosis
Use `ncdu` to inspect disk usage on the affected node:

```bash
# Access the log archive on maestro-log.maestro.pasteur.fr
ncdu -f <(zcat /var/log/cluster/ncdu/maestro-1054/ncdu-20220427_125001.gz)
```

This helps identify large or orphaned files consuming space.
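If no `ncdu` export is available for the node, a plain `du` scan gives a similar (if slower) overview; a minimal sketch, assuming you have shell access to the node:

```bash
# Show the largest directories under /var/tmp and /var/spool (run on the node)
du -xh --max-depth=2 /var/tmp /var/spool 2>/dev/null | sort -rh | head -n 20
```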
### Solution
To prevent future issues, the fix is the `JobContainerType=job_container/tmpfs` setting in `slurm.conf` (a cluster-level configuration applied by administrators, not a per-job option). With this plugin, each job gets a private `/tmp` (and `/dev/shm`) that is isolated from other jobs and removed automatically when the job ends, so temporary files cannot accumulate and fill shared node storage.

Benefits:

- Prevents disk exhaustion on shared nodes
- Improves job stability and performance
- Ideal for workflows with large intermediate files
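For reference, a minimal sketch of the administrator-side configuration, with an assumed node-local `BasePath` (the actual path is site-specific):

```ini
# slurm.conf: enable the per-job tmpfs container plugin
JobContainerType=job_container/tmpfs

# job_container.conf: where each job's private /tmp is created
# BasePath is a placeholder; the real path depends on the site's local scratch disk
BasePath=/local/scratch/slurm_containers
```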
## Snakemake Workflow Appears Stuck
### Problem
A Snakemake workflow may appear "frozen" despite ongoing activity. This is often due to a dependent job that has completed but remains marked as `RUNNING` in Slurm.
### Example Scenario (MicroSeek Workflow)
- Snakemake waits for job `13738020`.
- The job actually completed over 12 hours ago:

  ```
  2025-09-02T00:03:49.669815+02:00 maestro-sched slurmctld[3611801]: _job_complete: JobId=13738020 done
  ```

- Yet `sacct` still shows it as `RUNNING`:

  ```
  ID       Name       Partition  Cluster    State      TimeSubmit          TimeStart           TimeEnd
  -------- ---------- ---------- ---------- ---------- ------------------- ------------------- -------------------
  13738020 27f8840d-+ dedicated  maestro    RUNNING    2025-09-01T23:58:02 2025-09-01T23:58:03 Unknown
  ```
> **Root Cause:** MicroSeek relies on `sacct` to check job status. Since the job isn't marked as `COMPLETED`, the workflow stalls.
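To see the same stale state that the workflow engine sees, you can query the job directly; a minimal sketch (the exact query MicroSeek issues is an assumption, not taken from its code):

```bash
# Query the top-level job state as an accounting-based status check would see it
sacct -j 13738020 --format=JobID,State --noheader --parsable2
```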
### Solution: Fix Runaway Jobs Using `sacctmgr`
1. List runaway jobs:

   ```bash
   sacctmgr show runawayjobs
   ```

2. When prompted `Would you like to fix these runaway jobs? (y/N)`, answer `y`.

   This action:

   - Sets a completion time
   - Marks the job as `COMPLETED`

3. Verify the fix:

   ```bash
   sacct -j 13738020
   ```
Expected output:

```
JobID         JobName     Partition  Account  AllocCPUS  State      ExitCode
------------  ----------  ---------  -------  ---------  ---------  --------
13738020      27f8840d-+  dedicated  dpl      10         COMPLETED  0:0
13738020.ba+  batch                  dpl      10         COMPLETED  0:0
13738020.ex+  extern                 dpl      10         COMPLETED  0:0
13738020.0    python3.11             dpl      10         COMPLETED  0:0
```
The workflow resumes normally once the job state is updated.
## Pro Tips for Diagnosing Stuck Workflows
| Source | Use Case |
|---|---|
| `snakemake.log` (e.g., `<jobid>.log`) | Check job IDs of launched tasks, especially those in `RUNNING` or `PENDING` state |
| `sacctmgr show runawayjobs` | Identify jobs stuck in `RUNNING` despite completion |
| `sacct -u <username>` | List all jobs from a user that have been running for an unusually long time |
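For the last row, a concrete invocation might look like the following sketch (the username `jdoe` and the chosen columns are placeholders):

```bash
# List a user's jobs still in RUNNING state, with start time and elapsed time
sacct -u jdoe --state=RUNNING --format=JobID,JobName%20,Partition,Start,Elapsed,State
```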
## Summary
| Issue | Solution |
|---|---|
| No `slurm-*.out` files | Check disk space, permissions, and mount status |
| Jobs fail due to full `/tmp` or `/var/tmp` | Per-job `/tmp` via `JobContainerType=job_container/tmpfs` |
| Snakemake workflow stuck | Fix runaway jobs via `sacctmgr show runawayjobs` |
> **Best Practice:** For jobs with large temporary data, write it to the per-job `/tmp` provided by `job_container/tmpfs` to avoid disk contention and improve reliability.