Using Condor

What is Condor

When many users want to run processes or jobs on a limited number of machines, time and resources must be allocated fairly among them; otherwise all jobs slow down. Condor is a job-scheduling system that allows only one job to run on each processor, without interruptions. It can schedule jobs on a variety of platforms across a network. It also provides checkpointing and migration, input/output redirection, remote system calls, etc. For a more detailed overview and usage instructions, see the online Condor manual.

Condor at Lehigh

HPC-Lehigh provides two different condor pools, or groups of machines that are controlled by condor, on campus. These groups are mutually exclusive: jobs that are intended for one cannot run on the other.

  1. HPC Cluster: The HPC Cluster is controlled by condor only. All jobs and processes, except some special ones, must be run through condor; no other jobs are allowed on these machines. This pool is available under Service Level Enhanced-II.
    See instructions for using condor on HPC Cluster.

  2. University Wide Condor Pool: When the public site PCs located all around the campus are not being used by anyone, condor takes over and executes any jobs that have been submitted to this pool. If a user returns to use any machine where a condor job is running, the job is stopped and started on some other machine. So jobs on this pool are prone to interruptions and may take a long time to finish. This pool is available under Service Level Basic.
    See instructions for using condor on public sites.

General Condor Usage

These instructions apply to any pool. A detailed explanation of each command is available in the online Condor manual. All commands should be entered at the command prompt.

To see what machines are available in the cluster and their status:

condor_status

This will display a list of all machines in the pool along with their architecture, operating system, available memory and current state of the machine. The last few lines of the output will look like:

...
vm1@PS-XS036A WINNT51     INTEL  Claimed    Busy       1.000   507  0+00:35:31
vm2@PS-XS036A WINNT51     INTEL  Claimed    Busy       1.000   507  0+00:36:19
vm1@PS-XS201. WINNT51     INTEL  Claimed    Busy       0.980   251  0+00:05:02
vm2@PS-XS201. WINNT51     INTEL  Claimed    Busy       1.030   251  0+00:11:03
vm1@PS-XS303. WINNT51     INTEL  Claimed    Busy       0.670   251  0+00:00:39
vm2@PS-XS303. WINNT51     INTEL  Claimed    Busy       0.990   251  0+01:17:43

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

          IA64/LINUX    32    23       0         9       0          0        0
         INTEL/LINUX    31    17       0        14       0          0        0
       INTEL/WINNT51  1029   310     702        17       0          0        0
             PPC/OSX   101     8       0        93       0          0        0

               Total  1193   358     702       133       0          0        0
The above output shows that there are 1193 CPUs in total, of which 1029 are running Windows and 31 are running Linux on Intel. 358 CPUs are being used interactively by some user, and condor is running jobs on 702 CPUs.
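
The per-platform rows of this summary can also be added up by a script. The sketch below is only illustrative: it embeds the four summary rows shown above as sample text and sums the Unclaimed column with awk. The column positions are an assumption based on this sample output; on a live pool the same awk filter would presumably be fed from condor_status instead.

```shell
# Summary rows copied from the sample condor_status output above.
# Columns: platform, Total, Owner, Claimed, Unclaimed, Matched, Preempting, Backfill
summary='IA64/LINUX    32    23       0         9       0          0        0
INTEL/LINUX    31    17       0        14       0          0        0
INTEL/WINNT51  1029   310     702        17       0          0        0
PPC/OSX   101     8       0        93       0          0        0'

# Sum column 5 (Unclaimed) over every row that has exactly 8 fields.
unclaimed=$(printf '%s\n' "$summary" | awk 'NF == 8 { sum += $5 } END { print sum }')
echo "Unclaimed CPUs: $unclaimed"
```

For this sample the sum is 133, which matches the Unclaimed entry in the Total row.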

To see what jobs a user (say with username asm4) has submitted, do:

condor_q -global asm4

The output will look like:

...
72575.171 asm4            5/2  19:32   0+01:24:05 R  0   29.3 symphony -f option
72575.172 asm4            5/2  19:32   0+01:28:12 R  0   29.3 symphony -f option
72575.173 asm4            5/2  19:32   0+01:23:20 R  0   39.1 symphony -f option
72575.174 asm4            5/2  19:32   0+01:27:30 R  0   146.5 symphony -f option
129 jobs; 0 idle, 129 running, 0 held

This shows that the user asm4 has 129 jobs in condor's queue and that condor is currently running all of them. The first column of each line gives the job number (e.g. 72575.174 in the last line of the above output).
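
Job numbers can be picked out of this listing mechanically. The following sketch uses the four sample lines above as input and prints the numbers of the running jobs. The field layout (job number first, state in the sixth field) is an assumption read off this sample, and the variable names are illustrative.

```shell
# Four job lines copied from the sample condor_q output above.
# Assumed fields: job number, owner, submit date, submit time, run time, state, ...
queue='72575.171 asm4            5/2  19:32   0+01:24:05 R  0   29.3 symphony -f option
72575.172 asm4            5/2  19:32   0+01:28:12 R  0   29.3 symphony -f option
72575.173 asm4            5/2  19:32   0+01:23:20 R  0   39.1 symphony -f option
72575.174 asm4            5/2  19:32   0+01:27:30 R  0   146.5 symphony -f option'

# Print the job number (field 1) of every job whose state (field 6) is R.
running_ids=$(printf '%s\n' "$queue" | awk '$6 == "R" { print $1 }')
# Count the same jobs; "n + 0" prints 0 rather than an empty string if none match.
running_count=$(printf '%s\n' "$queue" | awk '$6 == "R" { n++ } END { print n + 0 }')

printf '%s\n' "$running_ids"
echo "running jobs in sample: $running_count"
```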

To see more details about a particular job with number, say 72575.174, do:

condor_q -global -l 72575.174

This will give detailed information of where the job is running, where it was submitted from, how long it has stayed in queue and much more.

To remove a particular job, say number 72575.174, do:

condor_rm 72575.174

To remove all jobs that belong to a user (say asm4), do:

condor_rm asm4

A job may be removed only from the same machine where it was submitted.

Using Condor on HPC Cluster

Condor is used to run all jobs on the HPC-Cluster. Users are allowed to run jobs interactively only for setting up, compiling and/or debugging their executables. Some instructions for compiling and running are available on the simple jobs page. Once a user has logged in to the cluster and wants to execute a file (say my_file), it has to be submitted to condor. Once the job has been submitted, the user may log off; condor will send an email when the job is finished.

Submitting a simple job

To submit a job to condor, the user should create a condor submit file. The file lists the name of the executable (my_file) and other details, such as which machines should run the job, how many times it should run, and what its inputs are. In its simplest form, the submit file (named, say, f.submit) may look like this:

########################
# This is a comment.
# Submit file for a simple program located at 
# /home/username/some/path/my_file
########################

universe = vanilla
notify_user = username@lehigh.edu
notification = Always
getenv = TRUE
initialdir = /home/username/dir1

Executable = /home/username/some/path/my_file
Output = foo.out.$(PROCESS)
log = foo.log.$(CLUSTER)
error = foo.err.$(PROCESS)
arguments = arg01 arg02

should_transfer_files = YES
transfer_input_files = 
WhenToTransferOutput = ON_EXIT_OR_EVICT

queue 10
This file can then be submitted to condor by doing: condor_submit f.submit. The above file tells condor to run the program my_file 10 times. The output of the first run is stored in /home/username/dir1/foo.out.0, that of the second run in /home/username/dir1/foo.out.1, and so on. Arguments arg01 and arg02 are used for each run. The log file, which records when each run starts and other related information, is created at /home/username/dir1/foo.log.N, where N is the cluster number that condor assigns to the submission (the part of the job number before the dot, e.g. 72575).
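
To make the $(PROCESS) numbering concrete, the sketch below lists the output file names that the submit file above would produce. It only mirrors the naming rule in plain shell and does not talk to condor; the variable names are illustrative.

```shell
# Values mirrored from the submit file above.
initialdir=/home/username/dir1
runs=10            # from "queue 10"

# $(PROCESS) counts from 0, so the runs are numbered 0 .. runs-1.
process=0
outputs=""
while [ "$process" -lt "$runs" ]; do
  outputs="$outputs$initialdir/foo.out.$process
"
  process=$((process + 1))
done
printf '%s' "$outputs"
```

The error files follow the same pattern with foo.err in place of foo.out.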

Compiling for checkpointing long jobs

If a user wants to run a program for a long time (say many days) then she may want to checkpoint the job occasionally so that, in case of a disruption, the job can restart from the last checkpoint. This can only be done for programs that are compiled by the user. Application software and pre-compiled binaries cannot be checkpointed using this mechanism; for those, the user will have to implement her own checkpoint mechanism if she desires.

Submitting a job with checkpointing is a two-step process:

  1. Compiling the job with checkpointing
  2. Submitting the job in standard universe

Compiling: The binary has to be compiled using condor_compile. Using condor_compile is straightforward: just precede the compilation command with condor_compile. For example, to compile a program foo.cpp using g++ and link it with condor, do:
condor_compile g++ foo.cpp -o foo.bin
If you are using Makefiles then change the Makefiles to prepend condor_compile to every call of g++, gcc, f77, gfortran etc.
More details about condor_compile can be found using man condor_compile or on the online condor-manual.
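
For a Makefile that calls the compilers directly in its recipe lines, the change described above can be made mechanically. The sketch below is one possible approach, not part of the original instructions: it runs sed over two sample recipe lines (hypothetical file names) and inserts condor_compile before g++, gcc, f77 or gfortran at the start of a recipe line. Makefiles that invoke compilers through variables would need a different change.

```shell
# Hypothetical recipe lines as they might appear in a Makefile (tab-indented).
recipes=$(printf '\tg++ -c foo.cpp -o foo.o\n\tg++ foo.o -o foo.bin\n')

# Prepend condor_compile to recipe lines that start with a known compiler.
patched=$(printf '%s\n' "$recipes" |
  sed -E 's/^([[:space:]]+)(g\+\+|gcc|f77|gfortran)[[:space:]]/\1condor_compile \2 /')
printf '%s\n' "$patched"
```

It is worth reviewing the result by hand before building, since compiler calls that do not start a recipe line are left untouched.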

Submitting to standard universe:
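
A minimal sketch of what this step might look like, assuming the binary foo.bin built with condor_compile above. The path, the file names and the function name write_standard_submit are illustrative and not taken from the cluster's configuration. The key difference from the vanilla submit file shown earlier is the universe line.

```shell
# Hypothetical helper: print a standard-universe submit description
# for a binary that was linked with condor_compile.
write_standard_submit() {
  binary=$1
  printf '%s\n' \
    'universe = standard' \
    "Executable = $binary" \
    'Output = foo.out.$(PROCESS)' \
    'log = foo.log.$(CLUSTER)' \
    'error = foo.err.$(PROCESS)' \
    'queue 1'
}

# Assumed location of the binary; adjust to where foo.bin actually lives.
write_standard_submit /home/username/some/path/foo.bin
```

Redirecting the output to a file (say foo_std.submit) gives a submit file that can be passed to condor_submit in the same way as before.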

Running MPI jobs on HPC-Clusters

MPI jobs can only be run through condor on these clusters. In order to build and run MPI jobs on these clusters, the user first has to set up ssh keys so that MPI works correctly. This only needs to be done once by a user. It can be done by executing the command:

setup_mpi_ssh

The output should show ssh setup successful. This command sets up ssh keys so that other nodes in the cluster can be accessed without a password. It creates the appropriate files in the user's /home/user-name/.ssh folder. The setup normally needs to be done only once per user; the command has to be run again only if the user changes or deletes files in this folder.

The parallel universe provided by condor is useful for executing the same binaries on multiple machines simultaneously. This makes the parallel universe very useful for running MPI programs. If the binary needs to be compiled before execution, please follow the instructions for compiling MPI programs above.

MPI jobs can only be submitted from blaze1. If you submit your parallel jobs from any other machine, they will never run. If you happen to be logged into inferno or crane, just do rsh blaze1 to log into blaze1. Then, submit your jobs from there.

In order to run an MPI job (say mpi_foo with arguments arg01 arg02) on 4 processors, the condor submit file (foo.condor) should look like:

########################
# This is a comment.
# Submit description file for program mpi_foo.
########################

universe = parallel
notify_user = user-name@lehigh.edu
notification = Error
getenv = TRUE

# this is where condor output and log appear
initialdir = /home/user-name/some/path     

Executable = /usr/local/bin/mp1script
machine_count = 4
Output = foo.out.$(NODE)
log = foo.log.$(CLUSTER)
error = foo.err.$(NODE)

# name of the MPI-binary
arguments = /home/user-name/path/to/mpi_foo arg01 arg02

should_transfer_files = YES
transfer_input_files = 
WhenToTransferOutput = ON_EXIT_OR_EVICT

queue 1

Note that the executable should always be /usr/local/bin/mp1script and NOT mpi_foo. mp1script is a wrapper script which correctly sets up the environment for running mpi_foo. The output of each node is stored in foo.out.$(NODE). The condor job can then be submitted by doing:

condor_submit foo.condor

The job will be queued and executed as and when the resources become available.
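
Because putting mpi_foo on the Executable line is an easy mistake to make, a submit file can be checked before it is submitted. The function below is a hypothetical helper, not a condor tool: it reads a submit description on standard input and reports whether a parallel-universe job names the mp1script wrapper as its executable.

```shell
# Hypothetical check: read a submit description on stdin and confirm that a
# parallel-universe job uses the mp1script wrapper as its executable.
check_mpi_submit() {
  awk -F'[ \t]*=[ \t]*' '
    { sub(/[ \t]+$/, "") }                      # drop trailing blanks
    tolower($1) == "universe"   { universe = $2 }
    tolower($1) == "executable" { executable = $2 }
    END {
      if (universe == "parallel" && executable != "/usr/local/bin/mp1script") {
        print "Executable should be /usr/local/bin/mp1script"
        exit 1
      }
      print "ok"
    }'
}

# Example: the two relevant lines from the submit file shown above.
printf '%s\n' 'universe = parallel' 'Executable = /usr/local/bin/mp1script' |
  check_mpi_submit
```

With a real file the call would be check_mpi_submit < foo.condor, run just before condor_submit foo.condor.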

Using Condor on University wide condor pool

Adding condor to your PATH

In order to run condor commands, the user has to add the location of condor binaries in her path by doing:
export PATH=$PATH:/usr/local/condor/bin
To avoid typing this command every time you log in, append the above shell command to your .bashrc file (/home/username/.bashrc). To check if your path is set properly, do:
condor_version
This should show the version of condor installed on the machine.
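
Since .bashrc may be read more than once in a session, the plain export line can add the same directory to PATH repeatedly. A slightly more careful variant, offered here as an optional sketch rather than a requirement, appends the directory only when it is not already present:

```shell
condor_bin=/usr/local/condor/bin

# Append condor_bin to PATH only if it is not already one of its entries.
case ":$PATH:" in
  *":$condor_bin:"*) ;;
  *) PATH="$PATH:$condor_bin" ;;
esac
export PATH
```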

Selecting a condor universe

The condor universe determines the kind of job (or program) to be executed. Simple binaries or compiled programs are run in the vanilla universe. If a user compiles her program and links it using condor_compile, then she should use the standard universe. Parallel programs, like those compiled using MPI, should be run in the parallel universe.

Using Condor for simple jobs

In order to submit her programs, the user has to write a submit file. Suppose the user wants to run a program foo, with arguments arg01, arg02. Here is a sample submit file:

########################
# This is a comment.
# Submit description file for program foo.
########################

universe = vanilla
notify_user = username@lehigh.edu
notification = Error
getenv = TRUE
initialdir = /home/user/some/path

Executable = /path/to/foo
Output = foo.out.$(PROCESS)
log = foo.log.$(CLUSTER)
error = foo.err.$(PROCESS)
arguments = arg01 arg02
queue 1
After saving the above submit file (say, as foo.condor), the user can submit this job by doing:
condor_submit foo.condor

Note: By default, condor runs jobs only on machines whose architecture is similar to that of the machine from which the jobs were submitted. Thus jobs submitted on Blaze will not run on Egenera and vice versa, but jobs submitted from Blaze will run on Inferno and vice versa. To run your jobs on all available architectures, include this line in your submit file:
requirements = (Arch == "X86_64") || (Arch == "INTEL")
In order to run only on Inferno, put:
requirements = (Subnet == "192.168.3")
Use 192.168.1, 192.168.2 for Blaze and Egenera respectively.
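
The mapping between clusters and subnets can be written down once and reused. The function below is a hypothetical helper (the name requirements_for and the lowercase cluster names are assumptions) that prints the requirements line for one cluster using the subnets given above.

```shell
# Hypothetical helper: print the requirements line for one cluster,
# using the subnet-to-cluster mapping given in the text.
requirements_for() {
  case $1 in
    blaze)   subnet=192.168.1 ;;
    egenera) subnet=192.168.2 ;;
    inferno) subnet=192.168.3 ;;
    *) echo "unknown cluster: $1" >&2; return 1 ;;
  esac
  printf 'requirements = (Subnet == "%s")\n' "$subnet"
}

requirements_for egenera
```

The printed line can be pasted into the submit file in place of the Inferno example.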

To use standard universe, to submit multiple instances of a program, to use input files, to transfer files, and many other options, read the man page of condor_submit:
man condor_submit

Using Condor for MPI jobs

The parallel universe is useful for executing the same binaries on multiple machines simultaneously. This makes the parallel universe very useful for running MPI programs. See instructions for compiling MPI programs on this page.

Note: Parallel jobs can only be submitted from blaze1. If you submit your parallel jobs from any other machine, they will never run. If you happen to be logged into inferno or crane, just do rsh blaze1 to log into blaze1. Then, submit your jobs from there. See note above, for changing requirements of jobs when submitting from blaze1.

In order to run an MPI job (say mpi_foo with arguments arg01 arg02) on 4 processors, the condor submit file should look like:

########################
# This is a comment.
# Submit description file for program mpi_foo.
########################

universe = parallel
notify_user = username@lehigh.edu
notification = Error
getenv = TRUE
initialdir = /home/user/some/path

Executable = /usr/local/bin/mp1script
machine_count = 4
Output = foo.out.$(NODE)
log = foo.log.$(CLUSTER)
error = foo.err.$(NODE)
arguments = /full/path/to/mpi_foo arg01 arg02

should_transfer_files = YES
transfer_input_files = 
WhenToTransferOutput = ON_EXIT_OR_EVICT

queue 1
Note that the executable should always be /usr/local/bin/mp1script and NOT mpi_foo. mp1script is a wrapper script which correctly sets up the environment for running mpi_foo. The output of each node is stored in foo.out.$(NODE).

Useful condor commands

Here are some useful condor commands, to look at your condor queue and manage it:

To see all your jobs in queue and their status:
condor_q username
This will show the ids of jobs submitted by that user.

To see why your job, with id xx.yy, is not running:
condor_q -analyze xx.yy

To see a long description of your job (with id xx.yy):
condor_q -long xx.yy | less

To remove all your jobs:
condor_rm username

To see what machines are available in the cluster and their status:
condor_status

To see your own and other users' priorities:
condor_userprio -allusers -all

More details are available in the man pages of condor_q, condor_status, condor_rm, condor_hold, condor_release and condor_userprio.

Compiling for Standard Universe

Condor has good support for checkpointing and migration of programs if the machine where a job is running becomes unavailable. To use these features, standard universe should be used. Details are available in the condor-manual.

Using condor_compile is straightforward: just precede the compilation command with condor_compile. For example, to compile a program foo.cpp using g++ and link it with condor, do:
condor_compile g++ foo.cpp -o foo.bin
This will create an executable foo.bin, which can then be submitted as a condor job. Note: The -m32 flag is no longer necessary with condor_compile. Earlier, a program compiled on a 64-bit machine (like Blaze) had to be built with -m32 in gcc or g++, because condor has 32-bit libraries. The command was then:
condor_compile g++ -m32 foo.cpp -o foo.bin

More details about condor_compile can be found using man condor_compile or on the online condor-manual.