Slurm
Slurm workload manager (originally the Simple Linux Utility for Resource Management).
sinfo
check information about the cluster: partitions and node states.
sinfo # partitions
sinfo -N # nodes
sinfo -N --states=idle # check all idle nodes
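to look at a single partition only (standard sinfo flags; the partition name is a placeholder):
sinfo -p <partition> # summary for one partition
sinfo -N -l -p <partition> # per-node detail: state, CPUs, memory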
to check more detailed GPU usage:
cinfo -p <partition>
cinfo -p <partition> occupy-reserved # only show reserved quota
squeue
check the current running (R) and pending (PD) jobs.
squeue -u <user name> # list your jobs
squeue -l # list all info
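a custom output format (standard squeue -o format string) makes the queue easier to scan; adjust the fields as needed:
squeue -u $(whoami) -o "%.10i %.9P %.20j %.2t %.10M %.6D %R" # jobid, partition, name, state, time, nodes, reason/nodelist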
srun/sbatch
launch/submit a job.
-J # --job-name=JOB_NAME
-p # --partition=xxx, choose which partition of clusters to use
-n # --ntasks=N, usually 1
-c # --cpus-per-task=16, CPUs allocated per task (with one task per node, this is the CPU count per node)
-N # --nodes=N, node count, 1 for single-node, or more for multi-node
--ntasks-per-node # must equal ntasks / nodes (n/N)
-o # --output=OUTPUT_FILENAME
-e # --error=ERROR_FILENAME
-w # --nodelist=node[1,2], request these specific nodes
-x # --exclude=node[3,5-6], nodes to avoid
--exclusive # request exclusive use of the allocated nodes
--gres # --gres=gpu:2, gpu allocation
--quotatype=reserved # [phoenix-feature] auto, reserved (will not be reallocated), spot (may be reallocated)
QUOTA mode [phoenix-feature] (a combined example follows this list):
- reserved: guaranteed GPU quota for this partition; allocated as long as the quota is idle, and will not be reallocated.
- spot: borrow idle resources from other partitions; the job may be reallocated when the owning partition needs them back.
- auto: try the reserved quota first; if that fails, fall back to spot mode.
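Putting the common flags together, a minimal single-GPU request might look like this (a sketch; the partition name and script are placeholders, and --quotatype only exists on clusters with the quota feature above):
srun -p <partition> -J test -N 1 -n 1 -c 8 --gres=gpu:1 --quotatype=auto python train.py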
srun
will start the job in the foreground; suitable for interactive runs and single-node training:
# quick alias (xxx is your partition)
alias srcpu="srun --ntasks=1 --ntasks-per-node=1 --cpus-per-task=16 -p xxx"
alias sr1gpu="srun --ntasks=1 --ntasks-per-node=1 --cpus-per-task=8 --gres=gpu:1 -p xxx"
alias sr8gpu="srun --ntasks=1 --ntasks-per-node=1 --cpus-per-task=64 --gres=gpu:8 -p xxx"
alias sr8gpu_spot="srun --ntasks=1 --ntasks-per-node=1 --cpus-per-task=64 --gres=gpu:8 --quotatype=spot -p xxx"
alias squ="squeue -u `whoami`"
# then use them like regular commands
srcpu bash some_script.sh
sr1gpu python test.py
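srun can also open an interactive shell on a compute node via the standard --pty flag (a sketch; the partition name xxx is a placeholder):
srun -p xxx --ntasks=1 --cpus-per-task=8 --gres=gpu:1 --pty bash # exit the shell to release the allocation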
sbatch
will submit the job to run in the background, and supports multi-node training.
For example, to launch training on 4 * 8 = 32 GPUs:
#!/bin/bash
#SBATCH --job-name=MY_JOB
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:8
#SBATCH --partition=3dobject_aigc_light
#SBATCH --quotatype=spot
#SBATCH --output=logs/%j_%x_out.log
#SBATCH --error=logs/%j_%x_err.log
#SBATCH --nodelist=xxx-[123-125],xxx-145
# configs
LOG_PATH="log.txt" # where all the printing goes
GPUS_PER_NODE=8 # align with --gres
echo "START TIME: $(date)"
# NCCL & AWS settings
export NCCL_PROTO=simple
export RDMAV_FORK_SAFE=1
export FI_EFA_FORK_SAFE=1
export FI_EFA_USE_DEVICE_RDMA=1
export FI_PROVIDER=efa
export FI_LOG_LEVEL=1
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=ib0
# proxy settings
unset http_proxy
unset https_proxy
# ip & port & rank
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=10231 # use a free 5-digit port
NNODES=$SLURM_NNODES
NUM_PROCESSES=$(expr $NNODES \* $GPUS_PER_NODE)
# Note: it is important to escape `$SLURM_PROCID` so that it is evaluated at run time by the srun task on each node (each node gets its own rank)
# the accelerate config can be the same as single-node training, we will override machine rank.
export LAUNCHER="accelerate launch \
--config_file acc_configs/gpu8.yaml \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--machine_rank \$SLURM_PROCID \
--num_processes $NUM_PROCESSES \
--num_machines $NNODES \
"
export PROGRAM="\
main.py vae \
--workspace workspce_resume \
--resume workspace/model.safetensors
"
export CMD="$LAUNCHER $PROGRAM"
srun --jobid $SLURM_JOBID bash -c "$CMD" 2>&1 | tee -a $LOG_PATH
echo "END TIME: $(date)"
make sure the --output and --error log directory (logs here) exists! Otherwise the job will fail immediately without any error message!
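A typical submission flow, assuming the script above is saved as train.slurm (a hypothetical filename):
mkdir -p logs # create the log directory referenced by --output/--error
sbatch train.slurm # prints "Submitted batch job <jobid>"
squeue -u $(whoami) # confirm the job is pending or running
tail -f log.txt # follow the training output (LOG_PATH in the script)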
scontrol
inspect, hold, and re-configure a pending job.
scontrol show job JOBID
scontrol hold JOBID # keep the job from entering the running state
scontrol update jobid=JOBID ...
scontrol release JOBID # release the hold so the job can be scheduled
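For example, to move a pending job to another partition before releasing it (a sketch; the jobid and partition are placeholders, and most fields can only be updated while the job is still pending):
scontrol hold 12345
scontrol update jobid=12345 partition=<partition> numnodes=2
scontrol release 12345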
scancel
stop/cancel a job.
scancel <jobid>
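scancel also accepts filters through standard flags, e.g.:
scancel -u $(whoami) # cancel all of your jobs
scancel -u $(whoami) --state=PENDING # cancel only your pending jobs
scancel --name=MY_JOB # cancel jobs by name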
sacct
show status of running and (recently) finished jobs.
sacct
# example output
JobID JobName PhxPriority UserPriority VirtualPartition Partition Account AllocGPUS AllocCPUS State ExitCode
------------ ---------- ----------- ------------ ---------------- ---------- ---------- ---------- ---------- ---------- --------
10739714 accelerate none none xxx llm2tmp research 8 64 CANCELLED+ 0:0
10740830 python normal none xxx llm2tmp research 1 16 RUNNING 0:0
10747428 accelerate none none xxx llm2tmp research 8 40 COMPLETED 0:0
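for a specific job or time window, the standard --format / -S flags are handy (a sketch; the jobid is a placeholder):
sacct -j <jobid> --format=JobID,JobName%20,Partition,AllocCPUS,State,Elapsed,ExitCode
sacct -u $(whoami) -S $(date +%F) # all of your jobs since midnight today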
swatch
identify why a job is pending.
swatch check_pending <jobid>