# Troubleshooting Common Slurm Job Issues
This document outlines common problems encountered with Slurm-managed jobs, especially in Snakemake workflows, and provides clear, actionable solutions.
## No Output Files Generated

### Symptoms
When no `slurm-*.out` files are generated after job submission, the issue is typically due to one of the following:

- Insufficient disk space in the output directory (e.g., `$HOME`, `/tmp`, or `/var/tmp`)
- Missing write permissions for the user in the target directory
- Filesystem not mounted on compute nodes (e.g., `/var/tmp` or `/var/spool` inaccessible)
### Solution

- Check available disk space:

  ```bash
  df -h /tmp /var/tmp /var/spool
  ```

- Verify write permissions:

  ```bash
  ls -ld /tmp /var/tmp
  ```

- Ensure the filesystem is mounted on compute nodes:

  ```bash
  # On a compute node (via srun):
  srun -N 1 -n 1 --pty bash
  df -h | grep -E "(tmp|spool)"
  ```
**Tip:** `/var/spool` and `/var/tmp` typically share the same underlying storage. Filling one will affect the other.
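If the defaults look suspect, it can also help to point Slurm's output at a directory you have already verified is writable. A minimal sketch, assuming a writable `/scratch/$USER/logs` directory (the path and job name are placeholders, not site defaults):

```bash
# Create the log directory up front; Slurm does not create it for you.
mkdir -p /scratch/$USER/logs

cat > check_output.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=check-output
#SBATCH --output=/scratch/%u/logs/slurm-%j.out   # %u = user name, %j = job ID
#SBATCH --error=/scratch/%u/logs/slurm-%j.err
df -h /tmp /var/tmp /var/spool                   # same checks as above, run on the compute node
EOF

sbatch check_output.sbatch
```

If output appears under the explicit path but not with the defaults, the problem lies with the default output location (normally the submission directory) rather than with Slurm itself.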
## Jobs Fail Due to Full `/tmp` or `/var/tmp` (Fixed via `JobContainerType=job_container/tmpfs`)
### Problem

Jobs may fail with errors like:

```
error: write /var/spool/slurmd/cred_state.new error No space left on device
```

This occurs when `/var/tmp` or `/var/spool` becomes full due to large temporary files created during job execution.
**Note:** `/var/spool` and `/var/tmp` typically share the same disk partition.
### Diagnosis

Use `ncdu` to inspect disk usage on the affected node:

```bash
# Access the log archive on maestro-log.maestro.pasteur.fr
ncdu -f <(zcat /var/log/cluster/ncdu/maestro-1054/ncdu-20220427_125001.gz)
```
This helps identify large or orphaned files consuming space.
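If no `ncdu` archive is available for the node in question, a rough live equivalent is to inspect the node directly. A sketch, reusing the node name from the example above (the depth and line limit are arbitrary):

```bash
# Show the largest directories under /tmp and /var/tmp on the affected node
srun -w maestro-1054 -N 1 -n 1 bash -c \
  'du -xh --max-depth=2 /tmp /var/tmp 2>/dev/null | sort -rh | head -n 20'
```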
### Solution

To prevent future issues:

- The `job_container/tmpfs` plugin (`JobContainerType=job_container/tmpfs`) is enabled in the cluster's Slurm configuration. It gives every job a private `/tmp` mount that is torn down automatically when the job ends, so one job's scratch data can no longer fill the shared node disk.
- Because this is a cluster-level setting rather than an `sbatch` option, jobs pick it up automatically: simply write temporary data to `/tmp` inside the job (see the configuration sketch after the benefits list below).
**Benefits:**

- Prevents disk exhaustion on shared nodes
- Improves job stability and performance
- Ideal for workflows with large intermediate files
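For reference, a minimal sketch of what such a cluster-side configuration typically looks like (the paths below are illustrative, not Maestro's actual settings):

```
# slurm.conf (managed by the cluster administrators)
JobContainerType=job_container/tmpfs

# job_container.conf
AutoBasePath=true
BasePath=/var/tmp/slurm-containers   # illustrative; each job's private /tmp is created under this path
```

From a job's point of view nothing changes: write temporary files to `/tmp` as usual and they are removed automatically when the job ends.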
## Snakemake Workflow Appears Stuck

### Problem
A Snakemake workflow may appear "frozen" despite ongoing activity. This is often due to a dependent job that has completed but remains marked as RUNNING in Slurm.
### Example Scenario (MicroSeek Workflow)

- Snakemake waits for job `13738020`.
- The job actually completed over 12 hours ago, according to the `slurmctld` log:

  ```
  2025-09-02T00:03:49.669815+02:00 maestro-sched slurmctld[3611801]: _job_complete: JobId=13738020 done
  ```

- Yet `sacct` still shows it as `RUNNING`:

  ```
  ID       Name       Partition  Cluster  State    TimeSubmit          TimeStart           TimeEnd
  -------- ---------- ---------- -------- -------- ------------------- ------------------- -------------------
  13738020 27f8840d-+ dedicated  maestro  RUNNING  2025-09-01T23:58:02 2025-09-01T23:58:03 Unknown
  ```
**Root Cause:** MicroSeek relies on `sacct` to check job status. Since the job isn't marked as `COMPLETED`, the workflow stalls.
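To reproduce the kind of status check the workflow performs, query the job's state and end time directly (the format string below is just one reasonable choice, not necessarily what MicroSeek uses):

```bash
sacct -j 13738020 --format=JobID,JobName%20,Partition,State,Start,End
```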
### Solution: Fix Runaway Jobs Using sacctmgr

- List runaway jobs:

  ```bash
  sacctmgr show runawayjobs
  ```

- When prompted `Would you like to fix these runaway jobs? (y/N)`, answer `y`. This action:

  - Sets a completion time
  - Marks the job as `COMPLETED`

- Verify the fix:

  ```bash
  sacct -j 13738020
  ```
Expected output:

```
JobID        JobName    Partition  Account  AllocCPUS  State      ExitCode
------------ ---------- ---------- -------- ---------- ---------- --------
13738020     27f8840d-+ dedicated  dpl      10         COMPLETED  0:0
13738020.ba+ batch                 dpl      10         COMPLETED  0:0
13738020.ex+ extern                dpl      10         COMPLETED  0:0
13738020.0   python3.11            dpl      10         COMPLETED  0:0
```
The workflow resumes normally once the job state is updated.
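For scripted or repeated clean-up, `sacctmgr`'s immediate mode is often used to skip the confirmation prompt. Treat this as an assumption to verify on your cluster, since fixing runaway jobs may require operator or administrator privileges:

```bash
# -i / --immediate answers the "fix these runaway jobs?" prompt automatically
sacctmgr -i show runawayjobs
```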
## Pro Tips for Diagnosing Stuck Workflows

| Source | Use Case |
|---|---|
| `snakemake.log` (e.g., `<jobid>.log`) | Check job IDs of launched tasks, especially those in `RUNNING` or `PENDING` state |
| `sacctmgr show runawayjobs` | Identify jobs stuck in `RUNNING` despite completion |
| `sacct -u <username>` | List all jobs from a user running for an unusually long time (see the sketch after this table) |
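As a concrete starting point for the last row, the following lists a user's jobs that Slurm still considers `RUNNING` and how long they have been going (a sketch; the two-day window and format fields are arbitrary choices):

```bash
sacct -u $USER --state=RUNNING --starttime=now-2days \
      --format=JobID,JobName%30,Partition,State,Start,Elapsed
```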
## Summary

| Issue | Solution |
|---|---|
| No `slurm-*.out` files | Check disk space, permissions, and mount status |
| Jobs fail due to full `/tmp` or `/var/tmp` | Rely on the `job_container/tmpfs` plugin (cluster-level `JobContainerType` setting) |
| Snakemake workflow stuck | Fix runaway jobs via `sacctmgr show runawayjobs` |
**Best Practice:** For jobs with large temporary data, write scratch files to the per-job `/tmp` provided by `job_container/tmpfs` to avoid disk contention and improve reliability.