# Troubleshooting Common Slurm Job Issues
This document outlines common problems encountered with Slurm-managed jobs, especially in Snakemake workflows, and provides clear, actionable solutions.
## No Output Files Generated
### Symptoms
If no `slurm-*.out` files are generated after job submission, the issue is typically one of the following:

- Insufficient disk space in the output directory (e.g., `$HOME`, `/tmp`, or `/var/tmp`)
- Missing write permissions for the user in the target directory
- Filesystem not mounted on compute nodes (e.g., `/var/tmp` or `/var/spool` inaccessible)
### Solution
1. Check available disk space:

   ```bash
   df -h /tmp /var/tmp /var/spool
   ```

2. Verify write permissions:

   ```bash
   ls -ld /tmp /var/tmp
   ```

3. Ensure the filesystem is mounted on compute nodes:

   ```bash
   # On a compute node (via srun):
   srun -N 1 -n 1 --pty bash
   df -h | grep -E "(tmp|spool)"
   ```
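The same check can also be run non-interactively against a specific node; a minimal sketch (the node name `maestro-1054` is only an example):

```bash
# Run the disk check directly on one compute node (node name is an example)
srun -N 1 -n 1 -w maestro-1054 df -h /tmp /var/tmp /var/spool
```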
> **Tip:** `/var/spool` and `/var/tmp` typically share the same underlying storage. Filling one will affect the other.
## Jobs Fail Due to Full `/tmp` or `/var/tmp` (Fixed via `JobContainerType=job_container/tmpfs`)
### Problem
Jobs may fail with errors like:

```
error: write /var/spool/slurmd/cred_state.new error No space left on device
```

This occurs when `/var/tmp` or `/var/spool` becomes full due to large temporary files created during job execution.
> **Note:** `/var/spool` and `/var/tmp` typically share the same disk partition.
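A quick way to confirm this on a given node is to compare the filesystems reported by `df`; if the `Filesystem` column is identical for both paths, they share a partition:

```bash
# If both paths report the same filesystem/device, they share one partition
df -h /var/tmp /var/spool
```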
### Diagnosis
Use `ncdu` to inspect disk usage on the affected node:

```bash
# Access the log archive on maestro-log.maestro.pasteur.fr
ncdu -f <(zcat /var/log/cluster/ncdu/maestro-1054/ncdu-20220427_125001.gz)
```

This helps identify large or orphaned files consuming space.
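If no `ncdu` export is available for the node, a plain `du` scan gives a similar (if slower) overview; a minimal sketch, assuming you have shell access to the node:

```bash
# Show the largest directories under /var/tmp and /var/spool (run on the node)
du -xh --max-depth=2 /var/tmp /var/spool 2>/dev/null | sort -rh | head -n 20
```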
### Solution
To prevent future issues, the fix is the `JobContainerType=job_container/tmpfs` setting in `slurm.conf` (a cluster-level configuration applied by administrators, not a per-job option). With this plugin, each job gets a private `/tmp` (and `/dev/shm`) that is isolated from other jobs and removed automatically when the job ends, so temporary files cannot accumulate and fill shared node storage.

Benefits:

- Prevents disk exhaustion on shared nodes
- Improves job stability and performance
- Ideal for workflows with large intermediate files
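For reference, a minimal sketch of the administrator-side configuration, with an assumed node-local `BasePath` (the actual path is site-specific):

```ini
# slurm.conf: enable the per-job tmpfs container plugin
JobContainerType=job_container/tmpfs

# job_container.conf: where each job's private /tmp is created
# BasePath is a placeholder; the real path depends on the site's local scratch disk
BasePath=/local/scratch/slurm_containers
```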
## Snakemake Workflow Appears Stuck
### Problem
A Snakemake workflow may appear "frozen" despite ongoing activity. This is often due to a dependent job that has completed but remains marked as `RUNNING` in Slurm.
### Example Scenario (MicroSeek Workflow)
- Snakemake waits for job `13738020`.
- The job actually completed over 12 hours ago:

  ```
  2025-09-02T00:03:49.669815+02:00 maestro-sched slurmctld[3611801]: _job_complete: JobId=13738020 done
  ```

- Yet `sacct` still shows it as `RUNNING`:

  ```
  ID       Name       Partition  Cluster    State      TimeSubmit          TimeStart           TimeEnd
  -------- ---------- ---------- ---------- ---------- ------------------- ------------------- -------------------
  13738020 27f8840d-+ dedicated  maestro    RUNNING    2025-09-01T23:58:02 2025-09-01T23:58:03 Unknown
  ```
> **Root Cause:** MicroSeek relies on `sacct` to check job status. Since the job isn't marked as `COMPLETED`, the workflow stalls.
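To see the same stale state that the workflow engine sees, you can query the job directly; a minimal sketch (the exact query MicroSeek issues is an assumption, not taken from its code):

```bash
# Query the top-level job state as an accounting-based status check would see it
sacct -j 13738020 --format=JobID,State --noheader --parsable2
```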
### Solution: Fix Runaway Jobs Using `sacctmgr`
1. List runaway jobs:

   ```bash
   sacctmgr show runawayjobs
   ```

2. When prompted `Would you like to fix these runaway jobs? (y/N)`, answer `y`.

   This action:

   - Sets a completion time
   - Marks the job as `COMPLETED`

3. Verify the fix:

   ```bash
   sacct -j 13738020
   ```
Expected output:

```
JobID         JobName     Partition  Account  AllocCPUS  State      ExitCode
------------  ----------  ---------  -------  ---------  ---------  --------
13738020      27f8840d-+  dedicated  dpl      10         COMPLETED  0:0
13738020.ba+  batch                  dpl      10         COMPLETED  0:0
13738020.ex+  extern                 dpl      10         COMPLETED  0:0
13738020.0    python3.11             dpl      10         COMPLETED  0:0
```
The workflow resumes normally once the job state is updated.
## Pro Tips for Diagnosing Stuck Workflows
| Source | Use Case |
|---|---|
| `snakemake.log` (e.g., `<jobid>.log`) | Check job IDs of launched tasks, especially those in `RUNNING` or `PENDING` state |
| `sacctmgr show runawayjobs` | Identify jobs stuck in `RUNNING` despite completion |
| `sacct -u <username>` | List all jobs from a user that have been running for an unusually long time |
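For the last row, a concrete invocation might look like the following sketch (the username `jdoe` and the chosen columns are placeholders):

```bash
# List a user's jobs still in RUNNING state, with start time and elapsed time
sacct -u jdoe --state=RUNNING --format=JobID,JobName%20,Partition,Start,Elapsed,State
```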
## Summary
| Issue | Solution |
|---|---|
| No `slurm-*.out` files | Check disk space, permissions, and mount status |
| Jobs fail due to full `/tmp` or `/var/tmp` | Per-job `/tmp` via `JobContainerType=job_container/tmpfs` |
| Snakemake workflow stuck | Fix runaway jobs via `sacctmgr show runawayjobs` |
> **Best Practice:** For jobs with large temporary data, write it to the per-job `/tmp` provided by `job_container/tmpfs` to avoid disk contention and improve reliability.