HTCONDOR batch system

Please read the CONDOR user manual, specifically the "submitting a job" and "managing a job" chapters. The important commands are condor_status, condor_q, condor_submit, condor_hold, and condor_rm.
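A typical session (a minimal sketch; myjob.cmd is a placeholder for your own job description file) looks like:

condor_status              # list available slots
condor_submit myjob.cmd    # submit the job described in myjob.cmd
condor_q -nobatch          # watch the queue
condor_hold <jobid>        # put a job on hold
condor_release <jobid>     # release a held job
condor_rm <jobid>          # remove a job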

The first job

CONDOR works with a special "job description file" (also called a "condor command file"), in which the user requests the resources required for a job. The job description file defines the type of the job (single-core, parallel, etc.) and can be used to pass parameters to the job and manage the transfer of input and output files.

In this example, we will submit a user's script condorJob1.sh using the job description file condorJob1.cmd.

  • login to one of the cluster interactive nodes (cms1, hpcm, or t3int0)
  • create a batch job project area :
    • mkdir -p /xdata/$USER/batch
    • cd /xdata/$USER/batch
  • Check the list of available nodes
    • condor_status
 Name                                               OpSys      Arch          State     Activity LoadAv Mem    ActvtyTime

 slot1@node1.nicadd.niu.edu      LINUX      EL8x64 Unclaimed Idle       0.000   918 19+16:42:18
 ...
 slot59@pt3wrk4.nicadd.niu.edu LINUX      EL8x64 Claimed     Idle      1.000  1835  0+00:02:57
 slot60@pt3wrk4.nicadd.niu.edu LINUX      EL8x64 Claimed     Busy    1.100  1835  0+00:01:32
 slot61@pt3wrk4.nicadd.niu.edu LINUX      EL8x64 Claimed     Idle      1.000  1835  0+00:02:26
 slot62@pt3wrk4.nicadd.niu.edu LINUX      EL8x64 Claimed     Idle      1.000  1835  0+00:02:26
 slot63@pt3wrk4.nicadd.niu.edu LINUX      EL8x64 Unclaimed Idle      1.000  1835  0+00:00:39
 slot64@pt3wrk4.nicadd.niu.edu LINUX      EL8x64 Unclaimed Idle      1.000  1835  0+00:02:49
                    Machines Owner Claimed Unclaimed Matched Preempting

        EL8x64/LINUX      248     0     168        80       0          0
        EL8x64/LINUX      450     0      11       439       0          0

               Total      698     0     179       519       0          0

Note1:: Specify (Arch == "EL8x64") to use only Alma Linux 8.X nodes.

Note2:: Pay attention to the "Mem" column (the available memory in MB for each slot). A job that requests 1024MB of memory can only run in slots with available memory greater than or equal to 1024MB. For example, to request slots with available memory greater than or equal to 1024MB, add to the condor command file:

  • request_memory = 1024MB
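To pre-check which slots could satisfy such a request, condor_status accepts ClassAd constraints; a hedged sketch (the 1024MB threshold is just the example above):

condor_status -constraint 'Memory >= 1024 && Arch == "EL8x64"'
# or list only unclaimed (free) slots
condor_status -avail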
  • Check the status of running jobs on the cluster
    • condor_q -run -all -global -nobatch
 -- Schedd: pt3int0.nicadd.niu.edu : <192.168.100.10:9618?... @ 07/30/19 14:17:31
  ID        OWNER            SUBMITTED     RUN_TIME HOST(S)
 115540.0   prudhvib        7/30 10:11   0+04:05:22 slot1@pt3wrk0.nicadd.niu.edu 
 115541.0   prudhvib        7/30 10:11   0+04:05:22 slot2@pt3wrk0.nicadd.niu.edu
 115542.0   prudhvib        7/30 10:11   0+04:05:22 slot3@pt3wrk0.nicadd.niu.edu
 ----------------------------------------------------------------------------------------------------------------
  -- Schedd: phpcm.nicadd.niu.edu : <192.168.132.199:9618?... @ 07/30/19 14:17:31
  ID        OWNER            SUBMITTED     RUN_TIME HOST(S)
 392097.0   dnrtwh          7/28 19:33   1+18:35:19 slot1@phpc2.nicadd.niu.edu
 392106.0   dnrtwh          7/30 01:22   0+12:49:17 slot1@pnc2.nicadd.niu.edu
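To list only your own jobs, condor_q also accepts an owner name; a sketch:

condor_q $USER -nobatch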
  • copy condorJob1.cmd and condorJob1.sh into /xdata/$USER/batch
  • inspect the condor command file. It requests that condorJob1.sh be run on any of the EL8x64 nodes in the CONDOR "vanilla" universe (a single-core job); defines where to write the job's error, log, and out files (produced by the CONDOR system during the run); asks for all files produced by condorJob1.sh to be transferred back to the current directory; and requests e-mail notification at the end of the job. Do not forget to uncomment the "notify_user" line and provide a working e-mail address.
    • cat condorJob1.cmd
##################################################            
##Condor command file example                                                                
##############################                                                                 
# 
# The requirement line specifies which machines we want to
# run this job on.  Any arbitrary classad expression can be used.                                                                                          
requirements              = (Arch == "EL8x64")
# 
#Request  the amount of memory needed for an application
#(use condor_status to check if there are enough slots to satisfy the request) 
# Example:: output of condor_status for  slot10@pt3wrk0  shows 1508MB is reserved for this slot
# slot10@pt3wrk0.nic LINUX      EL8x64 Unclaimed Idle      0.000 1508  0+19:17:29                                            
request_memory  =  1024MB
#                                                                                            
# The executable or script to run.                                                                                           
executable                = $ENV(PWD)/condorJob1.sh                                                        

# Where to  store the job's stdout, stderr, and condor log file for this job                                                                                                       
output                    = $ENV(PWD)/condorJob1.out                                                       
error                     = $ENV(PWD)/condorJob1.err                                                       
log                       = $ENV(PWD)/condorJob1.log

#
#Type of the submitted job (we use vanilla for single-CPU jobs and parallel for MPI jobs)
universe                  = vanilla

# Copy all of the user's current shell environment variables 
# at the time of job submission.
GetEnv          = True

#The two flags below will transfer all the necessary input files and the executable to the worker node.
#When the job is completed, they will transfer the output files back.
should_transfer_files     = YES
when_to_transfer_output   = ON_EXIT_OR_EVICT

#
# We want an e-mail when the job completes. This can
# be set to Always, Error, or Never.
notification              = always

#
#Uncomment and provide a working e-mail address
#notify_user              = user@gmail.com

#
# This should be the last command; it tells Condor to queue the
# job.  If a number N is placed after the "queue" command (like
# "queue 15"), the job will be submitted N times.  Use the $(Process)
# macro to make your input/output and log files unique.
#
queue
#
#############################################################################
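As the comments above note, "queue 15" would submit the job 15 times; the $(Process) macro (which expands to 0, 1, ..., 14) keeps the per-job files apart. A hedged fragment showing how the file above could be adapted:

# pass the job index to the script and keep the output files unique
arguments                 = $(Process)
output                    = $ENV(PWD)/condorJob1_$(Process).out
error                     = $ENV(PWD)/condorJob1_$(Process).err
log                       = $ENV(PWD)/condorJob1_$(Process).log
queue 15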
  • inspect the condorJob1.sh script (it just says "Hello", prints the environment variables defined on the remote node at run time, sleeps for 3 minutes, and creates two toy output files):
    • cat condorJob1.sh
#!/bin/bash
#A condor job script example 
UNAME="$(id -nu)"
SCRIPTNAME="condorJob1"
JOBID="$(echo $_CONDOR_SCRATCH_DIR | sed 's/\// /g'| sed 's/\_/ /g' | awk '{print $NF}')"
DATE="$(date | sed 's/:/ /g' | awk '{print $2$3"_"$4"_"$5"_"$6}')"
JOBNAME="${SCRIPTNAME}_${JOBID}_${DATE}"
###
# Here is the placeholder - to have some output 
##
echo "Greetings $UNAME! (from CONDOR on node $HOSTNAME at $DATE)"
echo "Job environment:"
env | sort
#Pretend to be doing something for 3 minutes
sleep 180
#
#Create some files
echo "Here can be a job output  data"   >> ${JOBNAME}.root 
echo "Here can be a job logging info"   >> ${JOBNAME}.info
#
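The script can be test-run locally before submission; since _CONDOR_SCRATCH_DIR is set by CONDOR only on the worker node, supply a dummy value (a sketch; note the script sleeps for 3 minutes):

_CONDOR_SCRATCH_DIR=/tmp/dir_12345 bash ./condorJob1.sh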
  • submit condor job
    • condor_submit condorJob1.cmd
  • check job status
    • condor_q -all -nobatch

-- Schedd: hpcm.nicadd.niu.edu : <192.168.100.2:9248?...
 ID      OWNER   SUBMITTED    RUN_TIME   ST PRI SIZE CMD
 1.0     user    9/8  23:57   0+00:00:19 R  0   0.0  condorJob1.sh

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

  • when a job ends (wait 3 min, check e-mail from the CONDOR system), inspect the job output files
    • ls -l condorJob1*
  • Read the CONDOR manual for more info about Submitting a Job
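Jobs that have finished no longer appear in condor_q; condor_history lists them (a sketch; -limit caps the number of records shown):

  • condor_history $USER -limit 5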


Job status inspection

If Condor cannot find a node that satisfies the command file requirements, it keeps the job in a waiting state, marked as I (IDLE) in the output of the
"condor_q -all" command. Users can wait until the requested resources become available or change the requirements to match available nodes/cores.
To check the status of an idle job (the same command can also be applied to a running job to get details of the job attributes in use):

  • condor_q -better-analyze jobid

If the job is running, the user can ssh into the remote node and inspect its progress.

  • condor_ssh_to_job jobid

All-in-one job

The example below will create the CONDOR command file and the shell script for the run and then submit the job.
This method is convenient for configuring complex wrappers to run hundreds of jobs.
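For example, if the script below is saved as submit_nicadd_job.sh (the file name is an assumption; it matches the usage text printed by the script), a test submission could look like:

chmod +x submit_nicadd_job.sh
./submit_nicadd_job.sh testJob /bdata/$USER/testproject           # run on any free EL8x64 node
./submit_nicadd_job.sh testJob /bdata/$USER/testproject pt3wrk0   # pin the job to pt3wrk0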

##################################################
#!/bin/bash
# A script to submit Nicadd condor jobs (serguei@nicadd.niu.edu)
# !!! t3int0.nicadd.niu.edu should be used for job submission; create /bdata/$USER/testproject for tests
# Arguments:  jobname  absolute-path-to-the-project_code_dir  [short_nodename]
# Note: the script will use absolute-path-to-the-project_code_dir/project_archive.tgz if it exists
# (tar -czf  /absolute-path-to-the-project_code_dir/project_archive.tgz --directory /absolute-path-to-the-project_code_dir ./ );
# otherwise a new archive will be created for each job
# 
# Customize ( defaults in parentheses )
# NOTIFY_USER ($USER@nicadd.niu.edu ), e-mail address for notification.
# BATCH_ROOT_DIR (/bdata/${USER}/batch), the default directory for batch jobs output.
# Script to be executed on a remote node - edit lines of this file  between  
#"cat > ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT}  << EOF_USER_SCRIPT"   
#  user script here
#"EOF_USER_SCRIPT"
#
# Attention - for a better understanding, please read the Condor manual
# http://research.cs.wisc.edu/htcondor/manual/v7.6.10/2_Users_Manual.html
# particularly to learn the condor_status, condor_q, condor_submit, condor_hold, and condor_rm commands
#=================================================================================
MYUSERNAME="$USER"
NOTIFY_USER="$MYUSERNAME@nicadd.niu.edu"
#Project root directory (somewhere on a large NFS disk)
BATCH_ROOT_DIR="/bdata/${MYUSERNAME}/batch"
#
#
JOBNAME="$1"
#
if [ "a$2" != "a" ]; then 
   PROJECT_CODE_DIR="$2"
#   echo "PROJECT_CODE_DIR  = $PROJECT_CODE_DIR" 
else 
#   echo "MYUSERNAME = $USER = $MYUSERNAME" 
   PROJECT_CODE_DIR=""; 
fi
#
if [ "a$3" != "a" ]; then 
   SHORT_NODE_NAME="$3"
   NODE_NAME="$3.nicadd.niu.edu"
else 
   NODE_NAME=""; 
fi
#
if [ "0$JOBNAME" == "0" ] ||  [ "0${PROJECT_CODE_DIR}" == "0" ]; then
    echo "Usage ( parameters in  square brackets are optional ):   $0  jobname   project_code_dir [short_nodename]"
    echo "Example: $0  testJob  /bdata/$USER/testproject [pt3wrk0] "
    echo "Note: Here we assume that source package is in  /bdata/$USER/batch/testproject folder "
    exit 0;
else
  if [ ! -d $PROJECT_CODE_DIR ]; then 
    echo "Error!! Project code directory $PROJECT_CODE_DIR is not found, exiting"
    echo "Usage ( parameters in  square brackets are optional ):   $0  jobname   project_code_dir [short_nodename]"
    echo "Example: $0  testJob  /bdata/$USER/testproject [pt3wrk0] "
    echo "Note: Here we assume that the source package is in  /bdata/$USER/batch/testproject folder."
    exit 0;
  fi
fi

##
#Batch jobs scratch areas
#PROJECT_SCRATCH_DIR="/disk/${MYUSERNAME}"
#Condor cmd, job shell scripts, and input archives are here
PROJECT_CONDOR_DIR="$BATCH_ROOT_DIR/condor"
#Store output from all jobs here
PROJECT_OUT_DIR="$BATCH_ROOT_DIR/results"
if [ ! -d $PROJECT_CONDOR_DIR ];  then mkdir -p $PROJECT_CONDOR_DIR; fi
if [ ! -d $PROJECT_OUT_DIR ];  then mkdir -p $PROJECT_OUT_DIR; fi
#
PROJECT_NAME="$(echo $PROJECT_CODE_DIR | sed 's/\// /g' | awk '{print $NF}')"
DATE="$(date | sed 's/:/ /g' | awk '{print $2$3"_"$4"_"$5"_"$6}')"
# 
JOBOUTDIR="$PROJECT_OUT_DIR/${JOBNAME}_${DATE}"
#
JOBLOGDIR="$JOBOUTDIR/log"
#To be submitted with condor_submit
JOBCMDCONDOR="${JOBNAME}_${DATE}.cmd"
#To be launched by $JOBCMDCONDOR
JOBCMDSCRIPT="${JOBNAME}_${DATE}.sh"
#
if [ ! -d $JOBLOGDIR ]; then  mkdir -p $JOBLOGDIR; fi
echo "Preparing job $JOBNAME for user $MYUSERNAME, job information will be sent to $NOTIFY_USER"
echo "Input code and data will be taken from $PROJECT_CODE_DIR"
echo "Creating job shell script         : ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT}"
echo "Creating condor description script: ${PROJECT_CONDOR_DIR}/${JOBCMDCONDOR}"
#
#SCRATCH area on a remote node
#JOBSCRATCH="${PROJECT_SCRATCH_DIR}/${JOBNAME}_${DATE}"
#Consumables (anything that will be used by the job shell script)
JOB_INPUT_ARCHIVE="${JOBNAME}_${DATE}.tgz"
#
#Create or use  an input archive 
if [ ! -f ${PROJECT_CODE_DIR}/project_archive.tgz ]; then 
 echo "Project archive (${PROJECT_CODE_DIR}/project_archive.tgz) is not found, creating"
 TAR_COMMAND="tar -czf $PROJECT_CONDOR_DIR/$JOB_INPUT_ARCHIVE --directory  $PROJECT_CODE_DIR ./ > /dev/null 2>&1"
 echo "Running $TAR_COMMAND" 
 tar -czf $PROJECT_CONDOR_DIR/$JOB_INPUT_ARCHIVE --directory  $PROJECT_CODE_DIR ./ > /dev/null 2>&1
 ls -ltrh $PROJECT_CONDOR_DIR/$JOB_INPUT_ARCHIVE
 else
 echo "Will use ${PROJECT_CODE_DIR}/project_archive.tgz for the batch job"
 cp -rfvp ${PROJECT_CODE_DIR}/project_archive.tgz $PROJECT_CONDOR_DIR/$JOB_INPUT_ARCHIVE
fi 
#
#Create shell script (will run on the remote node)
if [ -f ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT} ]; then rm -f ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT}; fi
#==============================================================================================
cat > ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT}  << EOF_USER_SCRIPT
#!/bin/bash
#===========Command script to be run on a remote node =========================================
#${JOBCMDSCRIPT} 
#The code below will be executed on a remote node - to be customized by a user  
#We are on the remote  node
HOSTNAME_SHORT="\$(hostname -s)"
PROJECT_SCRATCH_DIR="/nfs/work/\${HOSTNAME_SHORT}/${MYUSERNAME}"
JOBSCRATCH="\$PROJECT_SCRATCH_DIR/${JOBNAME}_${DATE}_\$_CONDOR_SLOT"
#
#Creating the scratch directory on a remote node
mkdir -p \$JOBSCRATCH
cd \$JOBSCRATCH
#
#Copy/unpack the input archive
rsync -av $PROJECT_CONDOR_DIR/$JOB_INPUT_ARCHIVE ./
tar -xzf $JOB_INPUT_ARCHIVE
#
#Set job environment variables here (e.g. \$LD_LIBRARY_PATH, \$ROOTSYS, etc.)
#
echo "LD_LIBRARY_PATH = \$LD_LIBRARY_PATH"
echo "=================================="
echo "PATH            = \$PATH"
# Here is the placeholder - to have some output 
##
echo "Hello Condor from node "\${HOSTNAME_SHORT}
echo "MY ENVIRONMENT"
env | sort
# Create dummy output.root and out.log
echo "Here should be my output data"   >>  output.root 
echo "Here could be my log         "   >>  out.log
cp -rfvp output.root  $JOBOUTDIR/
cp -rfvp out.log      $JOBOUTDIR/
#
ls -ltrh 
#
cd \${PROJECT_SCRATCH_DIR}
ls -l \${PROJECT_SCRATCH_DIR}
#Clean-up (please do not comment this out, except for debugging)
if [ "0\$JOBSCRATCH" != "0" ]; then 
 if [ -d \$JOBSCRATCH ];  then
   echo "All done; cleaning up job scratch \$JOBSCRATCH" 
   rm -rf \$JOBSCRATCH
 fi
fi
#
EOF_USER_SCRIPT
#
#Prepare the condor command file
if [ -f ${PROJECT_CONDOR_DIR}/${JOBCMDCONDOR} ]; then rm -f ${PROJECT_CONDOR_DIR}/${JOBCMDCONDOR}; fi
#==============================================================================================
if [ "0$NODE_NAME" == "0" ]; then
cat > ${PROJECT_CONDOR_DIR}/${JOBCMDCONDOR}  <<EOF
#==============================================================================================
## Condor job description file to launch ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT} on any free node
#==============================================================================================
requirements              = (Arch == "EL8x64")
#
#
executable                = ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT}
output                    = ${JOBLOGDIR}/${JOBNAME}.out
error                     = ${JOBLOGDIR}/${JOBNAME}.err
log                       = ${JOBLOGDIR}/${JOBNAME}.log
universe                  = vanilla
getenv                    = true
#
RequestMemory             = 1024
#
should_transfer_files     = YES
when_to_transfer_output   = ON_EXIT
notify_user               = ${NOTIFY_USER}
notification              = always
GetEnv                    = True
queue
EOF
#
else
echo "Remote working directory /nfs/work/${SHORT_NODE_NAME}/$MYUSERNAME/${JOBNAME}_${DATE}" 
cat > ${PROJECT_CONDOR_DIR}/${JOBCMDCONDOR} <<EOF
#==============================================================================================
## Condor job description file to launch ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT} on $NODE_NAME
#==============================================================================================
#
requirements              = (Arch == "EL8x64") && (Machine == "$NODE_NAME")
#
executable                = ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT}
output                    = ${JOBLOGDIR}/${JOBNAME}.out
error                     = ${JOBLOGDIR}/${JOBNAME}.err
log                       = ${JOBLOGDIR}/${JOBNAME}.log
universe                  = vanilla
getenv                    = true
#RequestMemory             = 1001
should_transfer_files     = YES
when_to_transfer_output   = ON_EXIT
notify_user               = ${NOTIFY_USER}
notification              = always
GetEnv                    = True
queue
EOF
fi
echo "Submitting job ${PROJECT_CONDOR_DIR}/${JOBCMDCONDOR} to run ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT}"
sleep 10
condor_submit  ${PROJECT_CONDOR_DIR}/${JOBCMDCONDOR}
echo "Done, job output will be copied to $JOBOUTDIR"
echo "Job stat will be written to $JOBLOGDIR"
echo "To check job status,  run :"
echo "condor_q -global -all -nobatch" 
echo "#------------#"



Parallel MPI jobs

Parallel jobs have to be submitted with the "universe = parallel" directive, from hpcm.nicadd.niu.edu or t3int0.nicadd.niu.edu.

The "machine_count = $NPROC" command defines the number of cores to use.
  • Nodes (cores) available for parallel use:
    • from hpcm:: pcms0-pcms6(12 cores each), phpc0-phpc2 (48 cores each), pnc0-pnc3 (64 cores each).
    • from t3int0 :: pt3wrk0-pt3wrk2(16 cores each), pt3wrk3-pt3wrk4 (64 cores each).

Cross-node jobs are not supported.
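This section does not show a complete parallel-universe description file, so here is a hedged sketch (the wrapper name mpiJob.sh and the core count are placeholders; the wrapper must implement the sshd interface shown below). Because cross-node jobs are not supported, pinning Machine keeps all machine_count slots on one node:

universe                  = parallel
executable                = mpiJob.sh
machine_count             = 16
requirements              = (Arch == "EL8x64") && (Machine == "pt3wrk0.nicadd.niu.edu")
output                    = mpiJob.out
error                     = mpiJob.err
log                       = mpiJob.log
should_transfer_files     = YES
when_to_transfer_output   = ON_EXIT
queue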

Additionally, it is very important that scripts running MPI jobs include the following interface
(see the COSY MPI job tutorial for a quick start).

##################################################
#!/bin/bash
#
DATE="$(date | sed 's/:/ /g' | awk '{print $2$3"_"$4"_"$5"_"$6}')"
#
# Project and job name  (customize)
mpiProjectName="mpiTest"
mpiJobName="${mpiProjectName}_$DATE"
#
#Path and options of the MPI app (customize)
MPIAPP_LONG="/path/to/my/mpi/app"
MPIAPP_OPTS=""
#
#We recommend using a special folder on a shared disk for mpi jobs (created below if it does not yet exist)
mpiAppsDir="/xdata/$USER/mpiAPPs"
#
#We assume that any input files are located in a shared folder (create it if needed)
MPIAPP_INPUTDIR="$mpiAppsDir/$mpiProjectName/INPUT" 
#
#Create an output directory for this job 
MPIAPP_OUTPUTDIR="$mpiAppsDir/$mpiProjectName/$mpiJobName"
mkdir  -p  $MPIAPP_OUTPUTDIR
#
#This is necessary to define the module environment
. /usr/share/Modules/init/bash
#
#Load the mpi module
module load openmpi/openmpi-1.8.1
#
#Service environment variables for condor job
_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS

CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh

SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh

. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS 

#Important!! Without the condition below, the script will launch $_CONDOR_NPROCS identical MPI jobs
#=============================================================================
# If this is not the master MPI process, just wait so that Condor keeps the remaining cores reserved
# 
if [ $_CONDOR_PROCNO -ne 0 ]
then
                wait
                sshd_cleanup
                exit 0
fi
#OK,  we are the $_CONDOR_PROCNO = 0, continue to start MPI app
#=============================================================================
#We are on the remote parent node
HOSTNAME_SHORT="$(hostname -s)"

#Create the local scratch area (Note!! - it has to be located under the /nfs/work/${HOSTNAME_SHORT} folder)
RUN_DIR="/nfs/work/${HOSTNAME_SHORT}/$USER/$mpiJobName"
mkdir -p  $RUN_DIR
#
echo  "User $USER at  NODE=$HOSTNAME_SHORT"
echo  "Run_Dir  = $RUN_DIR"
#
#Copy required input files 
rsync -av  ${MPIAPP_INPUTDIR}/*  $RUN_DIR/
#
#CONDOR_CONTACT_FILE is a special file that keeps information about reserved nodes and cores for this job 
rsync -av $_CONDOR_SCRATCH_DIR/contact $RUN_DIR/
#
CONDOR_CONTACT_FILE=$RUN_DIR/contact
export CONDOR_CONTACT_FILE

# Convert CONDOR_CONTACT_FILE into the format suitable for the MPI job
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines

#Check memlock limits (mpi likes unlimited)
echo "Condor ulimits (as set in /usr/lib/systemd/system/condor.service)"
ulimit -a
#
#Run the mpi application defined above
echo mpirun -v --prefix $MPI_HOME  -n $_CONDOR_NPROCS -hostfile machines ${MPIAPP_LONG} ${MPIAPP_OPTS}
mpirun -v --prefix $MPI_HOME  -n $_CONDOR_NPROCS -hostfile machines ${MPIAPP_LONG} ${MPIAPP_OPTS}

echo "Run finished"
rsync -av * ${MPIAPP_OUTPUTDIR}/

sshd_cleanup

/bin/ls -l  $RUN_DIR
#Be careful with this (make sure the path matches the RUN_DIR scratch area created above)
cd /nfs/work/${HOSTNAME_SHORT}
/bin/rm -rf  $USER/$mpiJobName
rm -f machines
 #

Parallel SCOOP jobs

SCOOP (Scalable COncurrent Operations in Python) is a distributed task module allowing concurrent parallel programming on various environments, from heterogeneous grids to supercomputers. Documentation is available at http://scoop.readthedocs.org/. SCOOP does not support the HTCondor scheduler (as of Oct 2019). Consequently, SCOOP jobs cannot be distributed over several nodes and should reserve an entire node to avoid conflicts with other HTCondor jobs.

The tutorial below explains how to run SCOOP jobs via HTCONDOR on multiprocessor nodes (pncX and phpcX) of the NICADD cluster.

1) login to hpcm.nicadd.niu.edu; create a project area on the /xdata or /bdata disks; and copy /opt/nicadd/contrib/beam_examples/impact2018_scoop:

[@phpcm ~]$
mkdir -p  /xdata/$USER/myscoop
cd /xdata/$USER/myscoop
rsync -av /opt/nicadd/contrib/beam_examples/impact2018_scoop /xdata/$USER/myscoop/
cd /xdata/$USER/myscoop/impact2018_scoop/

[@phpcm impact2018_scoop]$ ls
Input  Input_ecool48hr_nicadd  Input_ecool48hr_pbs  submit_condor_scoop.sh

Above "Input" is the link to the directory that holds all files required to run the Input/ga.py program.
(Input_ecool48hr_nicadd in this example). A user has to create his input directory, say MyJobInput
and replace the Input link.

 
[@phpcm impact2018_scoop]$ mkdir MyJobInput
(copy everything needed for the job into MyJobInput; name the main driver python script ./MyJobInput/ga.py)
[@phpcm impact2018_scoop]$ rm -f Input; ln -s MyJobInput  Input

For practice, one can submit a test job in the default configuration:

[@phpcm impact2018_scoop]$ ./submit_condor_scoop.sh 
Usage::   ./submit_condor_scoop.sh  short_node_name [#of_cores_to_use] ... please resubmit
Example:: ./submit_condor_scoop.sh  pnc1 64

If [#of_cores_to_use] is not provided, the job will use all available cores on the requested node.
One can try (use any free node to test; pnc1 is just an example):

[@phpcm impact2018_scoop]$  ./submit_condor_scoop.sh pnc1
We will use 64 cores on node pnc1
‘run_ga_pnc1_n64_Oct14_135027.cmd’ -> ‘/xdata/$USER/myscoop/impact2018_scoop/condor_out//ga_pnc1_n64_Oct14_135027/run_ga_pnc1_n64_Oct14_135027.cmd’
‘run_ga_pnc1_n64_Oct14_135027.sh’ -> ‘/xdata/$USER/myscoop/impact2018_scoop/condor_out//ga_pnc1_n64_Oct14_135027/run_ga_pnc1_n64_Oct14_135027.sh’
Submitting job:: condor_submit  run_ga_pnc1_n64_Oct14_135027.cmd
Submitting job(s).
1 job(s) submitted to cluster 396835.
Job stat will be written to /xdata/$USER/myscoop/impact2018_scoop/condor_out//ga_pnc1_n64_Oct14_135027/log
To check the job status,  run :
condor_q -all -global -nobatch
#------------#

As one can see, the command above creates the script run_ga_pnc1_n64_Oct14_135027.sh and submits it to the pnc1 node via the run_ga_pnc1_n64_Oct14_135027.cmd job description file.
The job will start after a delay of up to 10 minutes (depending on the cluster load).

[@phpcm impact2018_scoop]$ condor_q -all -global -nobatch
396835.0   $USER         10/14 13:50   0+00:00:25 slot1@pnc1.nicadd.niu.edu

Inspect /nfs/work/pnc1/$USER/ga_pnc1_n64_Oct14_135027 to check the job's progress.
It is also accessible via the "rundir" link in the job working directory on the remote node (use 'condor_ssh_to_job jobid').

[@phpcm impact2018_scoop]$ ls -l  /nfs/work/pnc1/$USER/ga_pnc1_n64_Oct14_135027
or
[@phpcm impact2018_scoop]$ condor_ssh_to_job 396835.0
[@pnc1 dir_4091373]$ ls
condor_exec.exe  _condor_stderr  _condor_stdout  contact  rundir  tmp
[@pnc1 dir_4091373]$ ls -ltrh rundir/
... All project files and folders....

If the job is killed with condor_rm (please do so for this tutorial), the job working directory will still be accessible.
It is recommended to copy it to ./condor_out/jobname/:

condor_rm  396835.0
rsync -av /nfs/work/pnc1/$USER/ga_pnc1_n64_Oct14_135027/     /xdata/$USER/myscoop/impact2018_scoop/condor_out//ga_pnc1_n64_Oct14_135027/

If the job is allowed to run until its expected completion, the remote working directory will be stored under ./condor_out/ automatically.

How to reserve extra memory for a job

For a job that requires more memory than is available in a single condor slot:

  • Use the Parallel SCOOP jobs instructions and modify the submit_condor_scoop.sh script to run your application.
    • Note1 - even if the application is not parallel, it has to be configured this way to reserve extra memory.
  • Set RequestMemory = 2000M and use the machine_count = X parameter to reserve the required amount of memory. Use "condor_status | grep nodeName" to get the required info. For example, the output below shows that pnc1 has 64 slots, each with 2011MB:
    • condor_status | grep pnc1
slot64@pnc1.nicadd.niu.edu    LINUX      EL8x64 Unclaimed Idle      0.000  2011  5+01:46:27
  • Correspondingly, to reserve 12GB of memory for a job, use "machine_count = 6" (6 slots x 2011MB > 12000MB); see the sketch after this list.
    • Note2 - In this mode, the reserved memory amount is not enforced - the job will not be put on hold if it consumes more. This means that if several such jobs are submitted to one node, they may run out of memory.
    • Note3 - Do not try to increase RequestMemory for jobs configured this way. On pnc1, condor will idle any job with RequestMemory > 2011. On the other nodes, it will idle jobs that request more memory than the maximum available in the configured slots.
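A hedged sketch of such a description file, using the pnc1 numbers above (the wrapper name bigMemJob.sh is a placeholder; as in the SCOOP wrapper, only the rank-0 process should run the application while the other ranks sleep):

universe                  = parallel
executable                = bigMemJob.sh
machine_count             = 6
# must not exceed the per-slot memory (2011MB on pnc1, see Note3)
RequestMemory             = 2000
requirements              = (Machine == "pnc1.nicadd.niu.edu")
queue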

Why is my job not running?

To check the status of submitted jobs, use

  • condor_q -all -global -nobatch

Jobs in "IDLE" and "HOLD" states can be inspected using

  • condor_q -analyze jobid

and/or

  • condor_q -better-analyze jobid

For example, the output below shows that job 1234.0 is in the HOLD status:

[pt3int0 ~]$ condor_q -all -global  -nobatch

-- Schedd: pt3int0.nicadd.niu.edu : <192.168.100.10:9618?... @ 10/10/19 23:50:15
OWNER   BATCH_NAME     SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

 userName  samplejob_xxx  01/01 21:02      _      _      _      1      1 1234.0

1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended

Now "condor_q -analyze" explains the reason for the hold

[pt3int0 ~]$ condor_q -analyze 1234.0

-- Schedd: pt3int0.nicadd.niu.edu : <192.168.100.10:9618?...
1234.000:  Job is held.
Hold reason: Error from slot64@pnc1.nicadd.niu.edu: Job used more resident memory than specified by request_memory.

And "condor_q -better-analyze 1234.0" provides a detailed analysis of job requirements. The job was put on hold
as the RequestMemory parameter was not defined in the condor description file. In this case, it must be
defined as "RequestMemory = 1600", where "1600" is the requested memory size in MB, which should be greater than the "ResidentSetSize = 1500" shown below.

[pt3int0 ~]$ condor_q -better-analyze 1234.0


-- Schedd: pt3int0.nicadd.niu.edu : <192.168.100.10:9618?...
The Requirements expression for job 1234.000 is            

    ( ( Arch == "EL8x64" ) ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) &&
    ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer )                            

Job 1234.000 defines the following attributes:

    DiskUsage = 15
    ImageSize = 1500
    MemoryUsage = ( ( ResidentSetSize + 1023 ) / 1024 )
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,( ImageSize + 1023 ) / 1024)
    ResidentSetSize = 1500

The Requirements expression for job 1234.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]         686  Arch == "EL8x64"
[1]         686  TARGET.OpSys == "LINUX"
[3]         686  TARGET.Disk >= RequestDisk
[5]         686  TARGET.Memory >= RequestMemory
[7]         686  TARGET.HasFileTransfer


1234.000:  Job is held.

Hold reason: Error from slot64@pnc1.nicadd.niu.edu: The job used more resident memory than specified by request_memory.

Last successful match: Thu Oct 10 21:02:05 2019

1234.000:  Run analysis summary ignoring user priority.  Of 686 machines,
      0 are rejected by your job's requirements
     10 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
    676 are available to run your job

Changing the Requirements expression (condor_qedit)

In the previous example, job "1234.0" can be released (restarted) by using condor_qedit
to edit the job's attributes:

[pt3int0 ~]$ condor_qedit -jobids "1234.0"  RequestMemory 1600
Set attribute "RequestMemory" for 1 matching jobs.
[pt3int0 ~]$ condor_release 1234.0
Job 1234.0 released

or for all "userName" 's jobs

[pt3int0 ~]$ condor_qedit -owner "userName"   RequestMemory 1600
[pt3int0 ~]$ condor_release userName
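To verify the new value before releasing a job, the attribute can be queried directly (a sketch; -af prints the raw attribute value):

[pt3int0 ~]$ condor_q 1234.0 -af RequestMemory
1600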

Inspecting job progress (condor_ssh_to_job)

This command allows the user to ssh into the compute node where the job is running. Once connected, the user lands in the job's working directory and can examine the job's environment and run commands. In the previous example, a user can ssh to job 1234.0 (the job should be in the running state):

[pt3int0 ~]$ condor_ssh_to_job  1234.0
Welcome to slot41@pnc1.nicadd.niu.edu!
Your condor job is running with pid(s) 3926302.
************************************************************************
[pnc1 dir_3926296]$ ls
condor_exec.exe  _condor_stderr  _condor_stdout

