HTCONDOR batch system
Please read the CONDOR user manual, specifically the "Submitting a Job" and "Managing a Job" chapters. The important commands are condor_status, condor_q, condor_submit, condor_hold, and condor_rm.
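For quick reference, typical invocations of these commands, as they are used later on this page (the job ID and file name are illustrative):

condor_status                    # list cluster nodes/slots and their state
condor_q -all -global -nobatch   # list queued and running jobs on all schedulers
condor_submit myJob.cmd          # submit the job described in myJob.cmd
condor_hold 1234.0               # put job 1234.0 on hold
condor_release 1234.0            # release a held job
condor_rm 1234.0                 # remove (kill) job 1234.0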
The first job
CONDOR works with a special "job description file" (or "condor command file"), in which the user requests the set of resources required for a job. The job description file defines the type of the job (single-core, parallel, etc.) and can be used to pass parameters to the job and manage the transfer of input and output files.
In this example, we will submit a user's script condorJob1.sh using the job description file condorJob1.cmd.
- login to one of the cluster interactive nodes (cms1, hpcm, or t3int0)
- create a batch job project area:
- mkdir -p /xdata/$USER/batch
- cd /xdata/$USER/batch
- Check the list of available nodes
- condor_status
Name                           OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime

slot1@node1.nicadd.niu.edu     LINUX  EL8x64  Unclaimed  Idle      0.000    918  19+16:42:18
...
slot59@pt3wrk4.nicadd.niu.edu  LINUX  EL8x64  Claimed    Idle      1.000   1835   0+00:02:57
slot60@pt3wrk4.nicadd.niu.edu  LINUX  EL8x64  Claimed    Busy      1.100   1835   0+00:01:32
slot61@pt3wrk4.nicadd.niu.edu  LINUX  EL8x64  Claimed    Idle      1.000   1835   0+00:02:26
slot62@pt3wrk4.nicadd.niu.edu  LINUX  EL8x64  Claimed    Idle      1.000   1835   0+00:02:26
slot63@pt3wrk4.nicadd.niu.edu  LINUX  EL8x64  Unclaimed  Idle      1.000   1835   0+00:00:39
slot64@pt3wrk4.nicadd.niu.edu  LINUX  EL8x64  Unclaimed  Idle      1.000   1835   0+00:02:49

               Machines Owner Claimed Unclaimed Matched Preempting

  EL8x64/LINUX      248     0     168        80       0          0
  EL8x64/LINUX      450     0      11       439       0          0

         Total      698     0     179       519       0          0
Note1:: Specify (Arch == "EL8x64") to use only Alma Linux 8.X nodes.
Note2:: Pay attention to "Mem" (the available memory in MB for each slot). A job that requests 1024MB of memory can only run in slots with available memory greater than or equal to 1024MB. Example - to request slots with available memory greater than or equal to 1024MB, add the following to the condor command file:
- request_memory = 1024MB
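Putting the two notes together, a minimal sketch of the relevant command file fragment (the executable name is a placeholder, not a site-provided file):

# Hypothetical fragment: run only on Alma Linux 8.X nodes, in slots with >= 1024MB
requirements   = (Arch == "EL8x64")
request_memory = 1024MB
executable     = myJob.sh
queue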
- Check the status of running jobs on the cluster
- condor_q -run -all -global -nobatch
-- Schedd: pt3int0.nicadd.niu.edu : <192.168.100.10:9618?... @ 07/30/19 14:17:31
 ID        OWNER      SUBMITTED    RUN_TIME    HOST(S)
 115540.0  prudhvib   7/30 10:11   0+04:05:22  slot1@pt3wrk0.nicadd.niu.edu
 115541.0  prudhvib   7/30 10:11   0+04:05:22  slot2@pt3wrk0.nicadd.niu.edu
 115542.0  prudhvib   7/30 10:11   0+04:05:22  slot3@pt3wrk0.nicadd.niu.edu
----------------------------------------------------------------------------------------------------------------
-- Schedd: phpcm.nicadd.niu.edu : <192.168.132.199:9618?... @ 07/30/19 14:17:31
 ID        OWNER      SUBMITTED    RUN_TIME    HOST(S)
 392097.0  dnrtwh     7/28 19:33   1+18:35:19  slot1@phpc2.nicadd.niu.edu
 392106.0  dnrtwh     7/30 01:22   0+12:49:17  slot1@pnc2.nicadd.niu.edu
- copy condorJob1.cmd and condorJob1.sh into /xdata/$USER/batch
- inspect the condor command file. It requests that condorJob1.sh run on any of the EL8x64 nodes in the CONDOR "vanilla" universe (a single-core job); defines where to write the job's error, log, and out files (produced by the CONDOR system during the run); requests the transfer of all files produced by condorJob1.sh back to the current directory; and asks for an e-mail notification at the end of the job. Do not forget to uncomment the "notify_user" line and provide a working e-mail address.
- cat condorJob1.cmd
##################################################
## Condor command file example
##################################################
#
# The requirements line specifies which machines we want to
# run this job on. Any arbitrary classad expression can be used.
requirements = (Arch == "EL8x64")
#
# Request the amount of memory needed for the application
# (use condor_status to check if there are enough slots to satisfy the request).
# Example:: the condor_status output for slot10@pt3wrk0 shows 1508MB is reserved for this slot
# slot10@pt3wrk0.nic LINUX EL8x64 Unclaimed Idle 0.000 1508 0+19:17:29
request_memory = 1024MB
#
# The executable or script to run.
executable = $ENV(PWD)/condorJob1.sh
# Where to store the job's stdout, stderr, and the condor log file for this job.
output = $ENV(PWD)/condorJob1.out
error  = $ENV(PWD)/condorJob1.err
log    = $ENV(PWD)/condorJob1.log
#
# Type of the submitted job (we use vanilla for single CPU jobs and parallel for MPI jobs).
universe = vanilla
# Copy all of the user's current shell environment variables
# at the time of job submission.
GetEnv = True
# The two flags below will transfer all the necessary input files and the executable
# to the worker nodes. When the job is completed, they will transfer the output files back.
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
#
# We want an e-mail if the job has completed successfully. This can
# be set to Always, Error, or Never.
notification = always
#
# Uncomment and provide a working e-mail address.
#notify_user = user@gmail.com
#
# This should be the last command; it tells Condor to queue the
# job. If a number is placed after the "queue" command (like "queue 15"),
# then the job will be submitted that many times. Use the $(Process)
# macro to make your input/output and log files unique.
#
queue
#
#############################################################################
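As a sketch of the "queue 15" / $(Process) remark above, a fragment like the following would submit 15 numbered copies of the job with unique log files (the arguments line and the numbered file names are assumptions for illustration, not part of condorJob1.cmd):

# Hypothetical fragment: $(Process) expands to 0,1,...,14, one value per queued copy
executable = $ENV(PWD)/condorJob1.sh
arguments  = $(Process)
output     = $ENV(PWD)/condorJob1_$(Process).out
error      = $ENV(PWD)/condorJob1_$(Process).err
log        = $ENV(PWD)/condorJob1_$(Process).log
queue 15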
- inspect the condorJob1.sh script (it just says "Hello", prints the environment variables defined on the remote node at the time of the run, sleeps for 3 minutes, and creates two toy output files):
- cat condorJob1.sh
#!/bin/bash
#A condor job script example
UNAME="$(id -nu)"
SCRIPTNAME="condorJob1"
JOBID="$(echo $_CONDOR_SCRATCH_DIR | sed 's/\// /g' | sed 's/\_/ /g' | awk '{print $NF}')"
DATE="$(date | sed 's/:/ /g' | awk '{print $2$3"_"$4$5$6}')"
JOBNAME="${SCRIPTNAME}_${JOBID}_${DATE}"
###
# Here is the placeholder - to have some output
##
echo "Greetings $UNAME! (from CONDOR on node $HOSTNAME at $DATE)"
echo "Job environment:"
env | sort
#Pretend to be doing something for 3 mins
sleep 180
#
#Create some files
echo "Here can be a job output data" >> ${JOBNAME}.root
echo "Here can be a job logging info" >> ${JOBNAME}.info
#
- submit condor job
- condor_submit condorJob1.cmd
- check job status
- condor_q -all -nobatch
-- Schedd: hpcm.nicadd.niu.edu : <192.168.100.2:9248?...
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 user 9/8 23:57 0+00:00:19 R 0 0.0 condorJob1.sh
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
- when the job ends (wait 3 min or check for the e-mail from the CONDOR system), inspect the job output files
- ls -l condorJob1*
- Read the CONDOR manual for more info about Submitting a Job
Job status inspection
If Condor cannot find a node that satisfies the command file requirements, it keeps the job in a waiting state, marked as I (IDLE) in the output of the
"condor_q -all" command. Users can either wait until the requested resources become available or change the requirements to pick up available nodes/cores.
To check the status of an idle job (the same command can also be applied to a running job to get details of job attributes in use):
- condor_q -better-analyze jobid
If the job is running, the user can ssh into the remote node and inspect its progress.
- condor_ssh_to_job jobid
All-in-one job
The example below creates the CONDOR command file and the shell script for the run, and then submits the job.
This method is convenient for configuring complex wrappers that run hundreds of jobs.
##################################################
#!/bin/bash
# A script to submit Nicadd condor jobs (serguei@nicadd.niu.edu)
# !!! t3int0.nicadd.niu.edu should be used for job submission; create /bdata/$USER/testproject for tests
# Arguments: jobname [absolute-path-to-the-project_code_dir] [short_nodename]
# Note: the script will use absolute-path-to-the-project_code_dir/project_archive.tgz if it exists,
#       (tar -czf /absolute-path-to-the-project_code_dir/project_archive.tgz --directory /absolute-path-to-the-project_code_dir ./ )
#       or a new archive will be created for each job
#
# Customize (defaults in parentheses):
#  NOTIFY_USER ($USER@nicadd.niu.edu), e-mail address for notification.
#  BATCH_ROOT_DIR (/bdata/${USER}/batch), the default directory for batch job output.
#  Script to be executed on a remote node - edit the lines of this file between
#    "cat > ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT} << EOF_USER_SCRIPT"   # user script here
#    "EOF_USER_SCRIPT"
#
# Attention - for better understanding, please read the Condor manual
# http://research.cs.wisc.edu/htcondor/manual/v7.6.10/2_Users_Manual.html
# particularly the condor_status, condor_q, condor_submit, condor_hold, and condor_rm commands
#=================================================================================
MYUSERNAME="$USER"
NOTIFY_USER="$MYUSERNAME@nicadd.niu.edu"
#Project root directory (somewhere on a large NFS disk)
BATCH_ROOT_DIR="/bdata/${MYUSERNAME}/batch"
#
JOBNAME="$1"
#
if [ "a$2" != "a" ]; then
  PROJECT_CODE_DIR="$2"
  # echo "PROJECT_CODE_DIR = $PROJECT_CODE_DIR"
else
  # echo "MYUSERNAME = $USER = $MYUSERNAME"
  PROJECT_CODE_DIR=""
fi
#
if [ "a$3" != "a" ]; then
  SHORT_NODE_NAME="$3"
  NODE_NAME="$3.nicadd.niu.edu"
else
  NODE_NAME=""
fi
#
if [ "0$JOBNAME" == "0" ] || [ "0${PROJECT_CODE_DIR}" == "0" ]; then
  echo "Usage ( parameters in square brackets are optional ): $0 jobname project_code_dir [short_nodename]"
  echo "Example: $0 testJob /bdata/$USER/testproject [pt3wrk0]"
  echo "Note: Here we assume that the source package is in the /bdata/$USER/batch/testproject folder."
  exit 0
else
  if [ ! -d $PROJECT_CODE_DIR ]; then
    echo "Error!! Project code directory $PROJECT_CODE_DIR is not found, exiting"
    echo "Usage ( parameters in square brackets are optional ): $0 jobname project_code_dir [short_nodename]"
    echo "Example: $0 testJob /bdata/$USER/testproject [pt3wrk0]"
    echo "Note: Here we assume that the source package is in the /bdata/$USER/batch/testproject folder."
    exit 0
  fi
fi
##
#Batch job scratch areas
#PROJECT_SCRATCH_DIR="/disk/${MYUSERNAME}"
#Condor cmd, job shell scripts, and input archives are here
PROJECT_CONDOR_DIR="$BATCH_ROOT_DIR/condor"
#Store output from all jobs here
PROJECT_OUT_DIR="$BATCH_ROOT_DIR/results"
if [ ! -d $PROJECT_CONDOR_DIR ]; then mkdir -p $PROJECT_CONDOR_DIR; fi
if [ ! -d $PROJECT_OUT_DIR ]; then mkdir -p $PROJECT_OUT_DIR; fi
#
PROJECT_NAME="$(echo $PROJECT_CODE_DIR | sed 's/\// /g' | awk '{print $NF}')"
DATE="$(date | sed 's/:/ /g' | awk '{print $2$3"_"$4$5$6}')"
#
JOBOUTDIR="$PROJECT_OUT_DIR/${JOBNAME}_${DATE}"
JOBLOGDIR="$JOBOUTDIR/log"
#To be submitted with condor_submit
JOBCMDCONDOR="${JOBNAME}_${DATE}.cmd"
#To be launched by $JOBCMDCONDOR
JOBCMDSCRIPT="${JOBNAME}_${DATE}.sh"
#
if [ ! -d $JOBLOGDIR ]; then mkdir -p $JOBLOGDIR; fi
echo "Preparing job $JOBNAME for user $MYUSERNAME, job information will be sent to $NOTIFY_USER"
echo "Input code and data will be taken from $PROJECT_CODE_DIR"
echo "Creating job shell script : ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT}"
echo "Creating condor description script: ${PROJECT_CONDOR_DIR}/${JOBCMDCONDOR}"
#
#SCRATCH area on a remote node
#JOBSCRATCH="${PROJECT_SCRATCH_DIR}/${JOBNAME}_${DATE}"
#Consumables (anything that will be used by the job shell script)
JOB_INPUT_ARCHIVE="${JOBNAME}_${DATE}.tgz"
#
#Create or use an input archive
if [ ! -f ${PROJECT_CODE_DIR}/project_archive.tgz ]; then
  echo "Project archive (${PROJECT_CODE_DIR}/project_archive.tgz) is not found, creating"
  TAR_COMMAND="tar -czf $PROJECT_CONDOR_DIR/$JOB_INPUT_ARCHIVE --directory $PROJECT_CODE_DIR ./ > /dev/null 2>&1"
  echo "Running $TAR_COMMAND"
  tar -czf $PROJECT_CONDOR_DIR/$JOB_INPUT_ARCHIVE --directory $PROJECT_CODE_DIR ./ > /dev/null 2>&1
  ls -ltrh $PROJECT_CONDOR_DIR/$JOB_INPUT_ARCHIVE
else
  echo "Will use ${PROJECT_CODE_DIR}/project_archive.tgz for the batch job"
  cp -rfvp ${PROJECT_CODE_DIR}/project_archive.tgz $PROJECT_CONDOR_DIR/$JOB_INPUT_ARCHIVE
fi
#
#Create the shell script (it will run on the remote node)
if [ -f ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT} ]; then rm -f ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT}; fi
#==============================================================================================
cat > ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT} << EOF_USER_SCRIPT
#!/bin/bash
#===========Command script to be run on a remote node =========================================
#${JOBCMDSCRIPT}
#The code below will be executed on a remote node - to be customized by the user
#We are on the remote node
HOSTNAME_SHORT="\$(hostname -s)"
PROJECT_SCRATCH_DIR="/nfs/work/\${HOSTNAME_SHORT}/${MYUSERNAME}"
JOBSCRATCH="\$PROJECT_SCRATCH_DIR/${JOBNAME}_${DATE}_\$_CONDOR_SLOT"
#
#Create the scratch directory on the remote node
mkdir -p \$JOBSCRATCH
cd \$JOBSCRATCH
#
#Copy/unpack the input archive
rsync -av $PROJECT_CONDOR_DIR/$JOB_INPUT_ARCHIVE ./
tar -xzf $JOB_INPUT_ARCHIVE
#
#Set job environment variables here (e.g. $LD_LIBRARY_PATH, $ROOTSYS, etc.)
#
echo "LD_LIBRARY_PATH = $LD_LIBRARY_PATH"
echo "=================================="
echo "PATH = $PATH"
# Here is the placeholder - to have some output
##
echo "Hello Condor from node "\${HOSTNAME_SHORT}
echo "MY ENVIRONMENT"
env | sort
# Create dummy output.root and out.log
echo "Here should be my output data" >> output.root
echo "Here could be my log " >> out.log
cp -rfvp output.root $JOBOUTDIR/
cp -rfvp out.log $JOBOUTDIR/
#
ls -ltrh
#
cd \${PROJECT_SCRATCH_DIR}
ls -l \${PROJECT_SCRATCH_DIR}
#Clean-up (please do not comment this out, except for debugging)
if [ "0\$JOBSCRATCH" != "0" ]; then
  if [ -d \$JOBSCRATCH ]; then
    echo "All done; cleaning up job scratch \$JOBSCRATCH"
    rm -rf \$JOBSCRATCH
  fi
fi
#
EOF_USER_SCRIPT
#
#Prepare the condor command file
if [ -f ${PROJECT_CONDOR_DIR}/${JOBCMDCONDOR} ]; then rm -f ${PROJECT_CONDOR_DIR}/${JOBCMDCONDOR}; fi
#==============================================================================================
if [ "0$NODE_NAME" == "0" ]; then
cat > ${PROJECT_CONDOR_DIR}/${JOBCMDCONDOR} <<EOF
#==============================================================================================
## Condor job description file to launch ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT} on any free node
#==============================================================================================
requirements = (Arch == "EL8x64")
#
executable = ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT}
output = ${JOBLOGDIR}/${JOBNAME}.out
error = ${JOBLOGDIR}/${JOBNAME}.err
log = ${JOBLOGDIR}/${JOBNAME}.log
universe = vanilla
getenv = true
#
RequestMemory = 1024
#
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
notify_user = ${NOTIFY_USER}
notification = always
queue
EOF
#
else
echo "Remote working directory /nfs/work/${SHORT_NODE_NAME}/$MYUSERNAME/${JOBNAME}_${DATE}"
cat > ${PROJECT_CONDOR_DIR}/${JOBCMDCONDOR} <<EOF
#==============================================================================================
## Condor job description file to launch ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT} on $NODE_NAME
#==============================================================================================
#
requirements = (Arch == "EL8x64") && (Machine == "$NODE_NAME")
#
executable = ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT}
output = ${JOBLOGDIR}/${JOBNAME}.out
error = ${JOBLOGDIR}/${JOBNAME}.err
log = ${JOBLOGDIR}/${JOBNAME}.log
universe = vanilla
getenv = true
#RequestMemory = 1001
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
notify_user = ${NOTIFY_USER}
notification = always
queue
EOF
fi
echo "Submitting job ${PROJECT_CONDOR_DIR}/${JOBCMDCONDOR} to run ${PROJECT_CONDOR_DIR}/${JOBCMDSCRIPT}"
sleep 10
condor_submit ${PROJECT_CONDOR_DIR}/${JOBCMDCONDOR}
echo "Done, job output will be copied to $JOBOUTDIR"
echo "Job stat will be written to $JOBLOGDIR"
echo "To check job status, run :"
echo "condor_q -global -all -nobatch"
echo "#------------#"
#
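For illustration, assuming the script above is saved as submit_nicadd_job.sh, a submission could look like the sketch below (the job and project names are placeholders taken from the script's own usage message):

./submit_nicadd_job.sh testJob /bdata/$USER/testproject           # run on any free EL8x64 node
./submit_nicadd_job.sh testJob /bdata/$USER/testproject pt3wrk0   # pin the job to node pt3wrk0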
Parallel MPI jobs
Parallel jobs have to be submitted with the "universe = parallel" directive, from hpcm.nicadd.niu.edu or t3int0.nicadd.niu.edu.
The "machine_count = $NPROC" directive defines the number of cores to use (a minimal description file is sketched after the job script below).
- Nodes (cores) available for parallel use:
- from hpcm:: pcms0-pcms6 (12 cores each), phpc0-phpc2 (48 cores each), pnc0-pnc3 (64 cores each).
- from t3int0:: pt3wrk0-pt3wrk2 (16 cores each), pt3wrk3-pt3wrk4 (64 cores each).
Cross-node jobs are not supported.
Additionally, it is very important that scripts running MPI jobs include the following interface
(please use the COSY MPI job tutorial for a quick start).
##################################################
#!/bin/bash
#
DATE="$(date | sed 's/:/ /g' | awk '{print $2$3"_"$4$5$6}')"
#
# Project and job name (customize)
mpiProjectName="mpiTest"
mpiJobName="${mpiProjectName}_$DATE"
#
#Path and options of the MPI app (customize)
MPIAPP_LONG="/path/to/my/mpi/app"
MPIAPP_OPTS=""
#
#We recommend using a special folder on a shared disk for mpi jobs (create it if it does not yet exist)
mpiAppsDir="/xdata/$USER/mpiAPPs"
#
#We assume that any input files are located in a shared folder (create it if needed)
MPIAPP_INPUTDIR="$mpiAppsDir/$mpiProjectName/INPUT"
#
#Create an output directory for this job
MPIAPP_OUTPUTDIR="$mpiAppsDir/$mpiProjectName/$mpiJobName"
mkdir -p $MPIAPP_OUTPUTDIR
#
#This is necessary to define the module environment
. /usr/share/Modules/init/bash
#
#Load the mpi module
module load openmpi/openmpi-1.8.1
#
#Service environment variables for the condor job
_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS
CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh
SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh
. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS
#Important!! Without the condition below, the script would launch $_CONDOR_NPROCS identical MPI jobs
#=============================================================================
# If this is not the master mpi core process, just sleep forever to let Condor keep the remaining cores reserved
#
if [ $_CONDOR_PROCNO -ne 0 ]; then
  wait
  sshd_cleanup
  exit 0
fi
#OK, we are $_CONDOR_PROCNO = 0, continue to start the MPI app
#=============================================================================
#We are on the remote parent node
HOSTNAME_SHORT="$(hostname -s)"
#Create a local scratch area ( Note!! - it has to be located under the /nfs/work/${HOSTNAME_SHORT} folder)
RUN_DIR="/nfs/work/${HOSTNAME_SHORT}/$USER/$mpiJobName"
mkdir -p $RUN_DIR
cd $RUN_DIR    # work from the run directory (so that 'machines' and the outputs are created here)
#
echo "User $USER at NODE=$HOSTNAME_SHORT"
echo "Run_Dir = $RUN_DIR"
#
#Copy the required input files
rsync -av ${MPIAPP_INPUTDIR}/* $RUN_DIR/
#
#CONDOR_CONTACT_FILE is a special file that keeps information about the nodes and cores reserved for this job
rsync -av $_CONDOR_SCRATCH_DIR/contact $RUN_DIR/
#
CONDOR_CONTACT_FILE=$RUN_DIR/contact
export CONDOR_CONTACT_FILE
# Convert CONDOR_CONTACT_FILE into the format suitable for the MPI job
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
#Check memlock limits (mpi likes unlimited)
echo "Condor ulimits (as set in /usr/lib/systemd/system/condor.service)"
ulimit -a
#
#Run the mpi application defined above
echo mpirun -v --prefix $MPI_HOME -n $_CONDOR_NPROCS -hostfile machines ${MPIAPP_LONG} ${MPIAPP_OPTS}
mpirun -v --prefix $MPI_HOME -n $_CONDOR_NPROCS -hostfile machines ${MPIAPP_LONG} ${MPIAPP_OPTS}
echo "Run finished"
rsync -av * ${MPIAPP_OUTPUTDIR}/
sshd_cleanup
/bin/ls -l $RUN_DIR
#Be careful with this (the scratch area should be on the local /disk)
cd /disk
/bin/rm -rf $USER/$mpiJobName
rm -f machines
#
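For reference, a minimal sketch of a parallel-universe description file that could launch a wrapper script like the one above (the script path, core count, and log file names are placeholders; since cross-node jobs are not supported, machine_count must not exceed the core count of a single node):

## Hypothetical parallel job description file (paths and names are placeholders)
universe      = parallel
requirements  = (Arch == "EL8x64")
machine_count = 16
executable    = /xdata/$USER/batch/myMpiWrapper.sh
output        = myMpiJob.out
error         = myMpiJob.err
log           = myMpiJob.log
getenv        = true
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue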
Parallel SCOOP jobs
SCOOP (Scalable COncurrent Operations in Python) is a distributed task module allowing concurrent parallel programming on various environments, from heterogeneous grids to supercomputers. Documentation is available at http://scoop.readthedocs.org/. SCOOP does not support the HTCondor scheduler (as of Oct 2019). Correspondingly, SCOOP jobs cannot be distributed over several nodes and should reserve an entire node to avoid conflicts with other HTCondor jobs.
The tutorial below explains how to run SCOOP jobs via HTCONDOR on multiprocessor nodes (pncX and phpcX) of the NICADD cluster.
1) login to hpcm.nicadd.niu.edu; create a project area on the /xdata or /bdata disks; and copy /opt/nicadd/contrib/beam_examples/impact2018_scoop:
[@phpcm ~]$ mkdir -p /xdata/$USER/myscoop
[@phpcm ~]$ cd /xdata/$USER/myscoop
[@phpcm myscoop]$ rsync -av /opt/nicadd/contrib/beam_examples/impact2018_scoop /xdata/$USER/myscoop/
[@phpcm myscoop]$ cd /xdata/$USER/myscoop/impact2018_scoop
[@phpcm impact2018_scoop]$ ls
Input  Input_ecool48hr_nicadd  Input_ecool48hr_pbs  submit_condor_scoop.sh
Above "Input" is the link to the directory that holds all files required to run the Input/ga.py program.
(Input_ecool48hr_nicadd in this example). A user has to create his input directory, say MyJobInput
and replace the Input link.
[@phpcm impact2018_scoop]$ mkdir MyJobInput
(copy everything needed for the job into MyJobInput; name the main driver python script ./MyJobInput/ga.py)
[@phpcm impact2018_scoop]$ rm -f Input; ln -s MyJobInput Input
For practice, one can try to submit a test job in the default configuration:
[@phpcm impact2018_scoop]$ ./submit_condor_scoop.sh
Usage:: ./submit_condor_scoop.sh short_node_name [#of_cores_to_use] ... please resubmit
Example:: ./submit_condor_scoop.sh pnc1 64
If [#of_cores_to_use] is not provided, the job will use all available cores on the requested node.
One can try (use any free node to test; pnc1 is just an example):
[@phpcm impact2018_scoop]$ ./submit_condor_scoop.sh pnc1
We will use 64 cores on node pnc1
‘run_ga_pnc1_n64_Oct14_135027.cmd’ -> ‘/xdata/$USER/myscoop/impact2018_scoop/condor_out//ga_pnc1_n64_Oct14_135027/run_ga_pnc1_n64_Oct14_135027.cmd’
‘run_ga_pnc1_n64_Oct14_135027.sh’ -> ‘/xdata/$USER/myscoop/impact2018_scoop/condor_out//ga_pnc1_n64_Oct14_135027/run_ga_pnc1_n64_Oct14_135027.sh’
Submitting job:: condor_submit run_ga_pnc1_n64_Oct14_135027.cmd
Submitting job(s).
1 job(s) submitted to cluster 396835.
Job stat will be written to /xdata/$USER/myscoop/impact2018_scoop/condor_out//ga_pnc1_n64_Oct14_135027/log
To check the job status, run :
condor_q -all -global -nobatch
#------------#
As one can see, the above command creates the script run_ga_pnc1_n64_Oct14_135027.sh, which is submitted to the pnc1 node via the run_ga_pnc1_n64_Oct14_135027.cmd job description file.
The job will start after a delay of up to 10 minutes (depending on the cluster load):
[@phpcm impact2018_scoop]$ condor_q -all -global -nobatch
 396835.0   $USER      10/14 13:50   0+00:00:25  slot1@pnc1.nicadd.niu.edu
Inspect /nfs/work/pnc1/$USER/ga_pnc1_n64_Oct14_135027 to check the job progress.
It is also accessible via the "rundir" link in the job working directory on the remote node (use 'condor_ssh_to_job jobid').
[@phpcm impact2018_scoop]$ ls -l /nfs/work/pnc1/$USER/ga_pnc1_n64_Oct14_135027
or
[@phpcm impact2018_scoop]$ condor_ssh_to_job 396835.0
[@pnc1 dir_4091373]$ ls
condor_exec.exe  _condor_stderr  _condor_stdout  contact  rundir  tmp
[@pnc1 dir_4091373]$ ls -ltrh rundir/
... All project files and folders ...
If the job is killed with condor_rm (please do so for this tutorial), the job working directory will still be accessible.
It is recommended to copy it to the ./condor_out/jobname/ directory:
condor_rm 396835.0
rsync -av /nfs/work/pnc1/$USER/ga_pnc1_n64_Oct14_135027/ /xdata/$USER/myscoop/impact2018_scoop/condor_out/ga_pnc1_n64_Oct14_135027/
If the job is allowed to run until its expected completion, the remote working directory will be stored under ./condor_out/ automatically.
How to reserve extra memory for a job
For a job that requires more memory than is available in a single condor slot:
- Use the Parallel SCOOP jobs instructions and modify the submit_condor_scoop.sh script to run your application.
- Note1 - even if the application is not parallel, it has to be configured this way to reserve the extra memory.
- Set RequestMemory = 2000M. Use the machine_count = X parameter to reserve the required amount of memory; run "condor_status | grep nodeName" to get the required info. For example, from the output below, pnc1 has 64 slots with 2011MB each.
- condor_status | grep pnc1
slot64@pnc1.nicadd.niu.edu LINUX EL8x64 Unclaimed Idle 0.000 2011 5+01:46:27
- Correspondingly, to reserve 12GB of memory for a job, use "machine_count = 6" (6 slots x 2011MB > 12GB); see the sketch after these notes.
- Note2 - In this mode, the reserved memory amount is not enforced: the job will not be put on hold if it consumes more. This means that if several such jobs are submitted to one node, they may run out of memory.
- Note3 - Do not try to increase RequestMemory for jobs configured this way. On pnc1, condor will idle any job with RequestMemory > 2011. On the other nodes, it will idle jobs that request more memory than the maximum available in the configured slots.
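A minimal sketch of a description file following these notes (the executable path and log names are placeholders; remember that the executable must implement the $_CONDOR_PROCNO == 0 guard shown in the MPI section, otherwise 6 identical copies will run):

## Hypothetical description file: reserve ~12GB by claiming 6 slots (6 x 2011MB) on pnc1
universe      = parallel
requirements  = (Arch == "EL8x64") && (Machine == "pnc1.nicadd.niu.edu")
machine_count = 6
RequestMemory = 2000
executable    = /path/to/my/wrapper.sh
output        = myBigMemJob.out
error         = myBigMemJob.err
log           = myBigMemJob.log
queue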
Why is my job not running?
To check the status of submitted jobs, use
- condor_q -all -global -nobatch
Jobs in "IDLE" and "HOLD" states can be inspected using
- condor_q -analyze jobid
and/or
- condor_q -better-analyze jobid
For example, the output below shows that job 1234.0 is in the HOLD state:
[pt3int0 ~]$ condor_q -all -global -nobatch
-- Schedd: pt3int0.nicadd.niu.edu : <192.168.100.10:9618?... @ 10/10/19 23:50:15
OWNER     BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL  JOB_IDS
userName  samplejob_xxx   01/01 21:02    _      _      _      1      1  1234.0

1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Now "condor_q -analyze" explains the reason for the hold
[pt3int0 ~]$ condor_q -analyze 1234.0
-- Schedd: pt3int0.nicadd.niu.edu : <192.168.100.10:9618?...
1234.000: Job is held.
Hold reason: Error from slot64@pnc1.nicadd.niu.edu: Job used more resident memory than specified by request_memory.
And "condor_q -better-analyze 1234.0" provides a detailed analysis of job requirements. The job was put on hold
as the RequestMemory parameter was not defined in the condor description file. In this case, it must be
defined as "RequestMemory = 1600", where "1600" is the requested memory size in MB, which should be greater than the "ResidentSetSize = 1500" shown below.
[pt3int0 ~]$ condor_q -better-analyze 1234.0
-- Schedd: pt3int0.nicadd.niu.edu : <192.168.100.10:9618?...
The Requirements expression for job 1234.000 is

    ( ( Arch == "EL8x64" ) ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( TARGET.HasFileTransfer )

Job 1234.000 defines the following attributes:

    DiskUsage = 15
    ImageSize = 1500
    MemoryUsage = ( ( ResidentSetSize + 1023 ) / 1024 )
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,( ImageSize + 1023 ) / 1024)
    ResidentSetSize = 1500

The Requirements expression for job 1234.000 reduces to these conditions:

           Slots
    Step    Matched  Condition
    -----  --------  ---------
    [0]         686  Arch == "EL8x64"
    [1]         686  TARGET.OpSys == "LINUX"
    [3]         686  TARGET.Disk >= RequestDisk
    [5]         686  TARGET.Memory >= RequestMemory
    [7]         686  TARGET.HasFileTransfer

1234.000: Job is held.
Hold reason: Error from slot64@pnc1.nicadd.niu.edu: The job used more resident memory than specified by request_memory.
Last successful match: Thu Oct 10 21:02:05 2019

1234.000: Run analysis summary ignoring user priority. Of 686 machines,
      0 are rejected by your job's requirements
     10 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
    676 are available to run your job
Changing the job Requirements expression (condor_qedit)
In the previous example, job "1234.0" can be released (restarted) by using condor_qedit
to edit the job's requirements expression:
[pt3int0 ~]$ condor_qedit -jobids "1234.0" RequestMemory 1600
Set attribute "RequestMemory" for 1 matching jobs.
[pt3int0 ~]$ condor_release 1234.0
Job 1234.0 released
or for all "userName" 's jobs
[pt3int0 ~]$ condor_qedit -owner "userName" RequestMemory 1600
[pt3int0 ~]$ condor_release userName
Inspecting job progress (condor_ssh_to_job)
This command allows the user to ssh to the compute node where the job is running. Once the command is run, the user lands in the job's working directory and can examine the job's environment and run commands. In the previous example, a user can ssh to job 1234.0 (the job must be in the running state):
[pt3int0 ~]$ condor_ssh_to_job 1234.0
Welcome to slot41@pnc1.nicadd.niu.edu!
Your condor job is running with pid(s) 3926302.
************************************************************************
[pnc1 dir_3926296]$ ls
condor_exec.exe  _condor_stderr  _condor_stdout