
Troubleshooting Common Slurm Job Issues#

This document outlines common problems encountered with Slurm-managed jobs, especially in Snakemake workflows, and provides clear, actionable solutions.


❌ No Output Files Generated#

Symptoms#

No slurm-*.out files are generated after job submission. This is typically due to one of the following:

  • Insufficient disk space in the output directory (e.g., $HOME, /tmp, or /var/tmp)
  • Missing write permissions for the user in the target directory
  • Filesystem not mounted on compute nodes (e.g., /var/tmp or /var/spool inaccessible)
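
Before digging into the filesystem, it can also help to confirm whether the job ran at all. A quick sacct query shows its final state (the job ID below is a placeholder; use the one printed by sbatch):

# Replace 12345678 with the job ID reported at submission time
sacct -j 12345678 -X -o JobID,State,ExitCode,NodeList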

Solution#

  1. Check available disk space:

    df -h $HOME /tmp /var/tmp /var/spool
    

  2. Verify write permissions:

    ls -ld /tmp /var/tmp $PWD   # $PWD: the submission directory, where slurm-*.out files are written by default
    

  3. Ensure the filesystem is mounted on compute nodes:

    # On a compute node (via srun):
    srun -N 1 -n 1 --pty bash
    df -h | grep -E "(tmp|spool)"
    

⚠️ Tip: /var/spool and /var/tmp typically share the same underlying storage. Filling one will affect the other.
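
These checks can also be run non-interactively on a compute node by bundling them into a single throw-away batch job; the sketch below omits any partition or account options your site may require. If the job runs, the slurm-<jobid>.out file it leaves in the submission directory also confirms that output files can be written there:

sbatch --wrap '
  # Backing device and free space for the relevant paths
  df -h --output=source,size,avail,target /tmp /var/tmp /var/spool
  # Simple write test in the temporary directories
  for d in /tmp /var/tmp; do
    f="$d/.write_test_$SLURM_JOB_ID"
    touch "$f" && rm -f "$f" && echo "$d: writable" || echo "$d: NOT writable"
  done
'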


🛑 Jobs Fail Due to Full /tmp or /var/tmp (Fixed via JobContainerType=job_container/tmpfs)#

Problem#

Jobs may fail with errors like:

error: write /var/spool/slurmd/cred_state.new error No space left on device

This occurs when /var/tmp or /var/spool becomes full due to large temporary files created during job execution.

πŸ” Note: /var/spool and /var/tmp typically share the same disk partition.

Diagnosis#

Use ncdu to inspect disk usage on the affected node:

# Access the log archive on maestro-log.maestro.pasteur.fr
ncdu -f <(zcat /var/log/cluster/ncdu/maestro-1054/ncdu-20220427_125001.gz)

This helps identify large or orphaned files consuming space.
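
If no ncdu export is available for the affected node, a plain du scan run on the node itself (for example inside an srun shell, as shown earlier) gives a rough equivalent; unreadable directories are simply skipped:

# Largest entries directly under /var/tmp and /var/spool (same partition)
du -xh --max-depth=2 /var/tmp /var/spool 2>/dev/null | sort -h | tail -n 20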

Solution#

To prevent future issues, the cluster relies on Slurm's job_container/tmpfs plugin, enabled via JobContainerType=job_container/tmpfs. This is configured by the administrators in slurm.conf (not in individual job scripts): each job then gets its own private /tmp and /dev/shm, which are removed automatically when the job ends, so temporary files can no longer fill the shared partition. A short job-script sketch follows the benefits list below.

Example (cluster-side configuration; the base path shown is only a placeholder, the actual value is site-specific):

# slurm.conf
JobContainerType=job_container/tmpfs

# job_container.conf
BasePath=/local/scratch

✅ Benefits:

  • Prevents disk exhaustion on shared nodes
  • Improves job stability and performance
  • Ideal for workflows with large intermediate files
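
With the plugin active, job scripts only need to write their scratch data under /tmp (or $TMPDIR where the site defines it); the private mount disappears with the job. Below is a minimal sketch: the job name, resources, and file names (big_input.txt, sorted_output.txt) are hypothetical.

#!/bin/bash
#SBATCH --job-name=tmpfs-demo
#SBATCH --time=01:00:00
#SBATCH --mem=4G

# With job_container/tmpfs active, /tmp is private to this job and is removed
# automatically at job end, so large intermediates cannot fill the shared disk.
SCRATCH="${TMPDIR:-/tmp}/work_${SLURM_JOB_ID}"
mkdir -p "$SCRATCH"

# Hypothetical workload: sort a large input using job-local scratch space
sort --temporary-directory="$SCRATCH" big_input.txt > sorted_output.txt

# Cleanup is optional; the private /tmp disappears with the job anyway
rm -rf "$SCRATCH"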


🐢 Snakemake Workflow Appears Stuck#

Problem#

A Snakemake workflow may appear "frozen" despite ongoing activity. This is often caused by a job the workflow is waiting on that has actually completed but remains marked as RUNNING in Slurm.

Example Scenario (MicroSeek Workflow)#

  • Snakemake waits for job 13738020.
  • Job actually completed over 12 hours ago:
    2025-09-02T00:03:49.669815+02:00 maestro-sched slurmctld[3611801]: _job_complete: JobId=13738020 done
    
  • Yet sacct still shows it as RUNNING:
    ID          Name        Partition  Cluster    State      TimeSubmit           TimeStart             TimeEnd
    --------    ----------  ---------- ---------- ---------- ------------------- ------------------- -------------------
    13738020    27f8840d-+   dedicated  maestro    RUNNING    2025-09-01T23:58:02  2025-09-01T23:58:03   Unknown
    

🚩 Root Cause: MicroSeek relies on sacct to check job status. Since the job isn't marked as COMPLETED, the workflow stalls.
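
For reference, the kind of status check the workflow engine performs can be reproduced by hand (the exact fields MicroSeek queries are not documented here, so this is only an approximation):

# Overall state of the allocation only (-X), without header (-n)
sacct -j 13738020 -X -n -o State
# Prints RUNNING here even though slurmctld already logged the job as done,
# which is exactly why the workflow keeps waiting.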


✅ Solution: Fix Runaway Jobs Using sacctmgr#

  1. List runaway jobs:

    sacctmgr show runawayjobs
    

  2. When prompted:

    Would you like to fix these runaway jobs? (y/N)
    
    Answer y.

This action:

  • Sets a completion time
  • Marks the job as COMPLETED

  3. Verify the fix:

    sacct -j 13738020
    

Expected output:

JobID           JobName  Partition    Account  AllocCPUS      State     ExitCode
------------    -------- ---------- ---------- ---------- ---------- --------
13738020        27f8840d-+ dedicated    dpl         10      COMPLETED     0:0
13738020.ba+    batch                  dpl         10      COMPLETED     0:0
13738020.ex+    extern                 dpl         10      COMPLETED     0:0
13738020.0      python3.11             dpl         10      COMPLETED     0:0

✅ Workflow resumes normally once the job state is updated.


πŸ” Pro Tips for Diagnosing Stuck Workflows#

Source Use Case
snakemake.log (e.g., <jobid>.log) Check job IDs of launched tasks, especially those in RUNNING or PENDING state
sacctmgr show runawayjobs Identify jobs stuck in RUNNING despite completion
sacct -u <username> List all jobs from a user running for an unusually long time
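
As a concrete example combining the first two sources, the Slurm job IDs mentioned in a Snakemake log can be fed straight into sacct. The log wording varies with the Snakemake version and executor, so treat the path and grep pattern below as assumptions to adapt:

# Adjust the path and pattern to your actual log file and its wording
LOG=.snakemake/log/your_run.snakemake.log
grep -oiE 'jobid[ =:]+[0-9]+' "$LOG" | grep -oE '[0-9]+' | sort -u \
  | xargs -r -I{} sacct -j {} -X -n -o JobID,State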

✅ Summary#

  • No slurm-*.out files: check disk space, permissions, and mount status
  • Jobs fail because /tmp or /var/tmp is full: handled by the job_container/tmpfs plugin (cluster-side JobContainerType setting)
  • Snakemake workflow appears stuck: fix runaway jobs via sacctmgr show runawayjobs

💡 Best Practice: For jobs with large temporary data, write to the per-job /tmp provided by job_container/tmpfs to avoid disk contention and improve reliability.