When many users want to run processes or jobs on a limited number of machines, time and resources must be allocated fairly among the users; otherwise all jobs slow down. Condor is a job-scheduling system that allows only one job to run on each processor, without interruption. It can schedule jobs on a variety of platforms across a network. It also provides checkpointing and migration, input/output redirection, remote system calls and more. For a more detailed overview and usage instructions, see the online Condor manual.
HPC-Lehigh provides 2 different condor pools, that is, groups of machines on campus that are controlled by condor. These pools are mutually exclusive: jobs that are intended for one cannot run on the other.
These instructions apply to any pool. A detailed explanation of each command is available in the online Condor manual. All commands should be entered at the command prompt.
To see what machines are available in the cluster and their status:
condor_status
This will display a list of all machines in the pool along with their architecture, operating system, available memory and current state of the machine. The last few lines of the output will look like:
...
vm1@PS-XS036A WINNT51 INTEL Claimed Busy 1.000 507 0+00:35:31
vm2@PS-XS036A WINNT51 INTEL Claimed Busy 1.000 507 0+00:36:19
vm1@PS-XS201. WINNT51 INTEL Claimed Busy 0.980 251 0+00:05:02
vm2@PS-XS201. WINNT51 INTEL Claimed Busy 1.030 251 0+00:11:03
vm1@PS-XS303. WINNT51 INTEL Claimed Busy 0.670 251 0+00:00:39
vm2@PS-XS303. WINNT51 INTEL Claimed Busy 0.990 251 0+01:17:43
              Total Owner Claimed Unclaimed Matched Preempting Backfill
IA64/LINUX       32    23       0         9       0          0        0
INTEL/LINUX      31    17       0        14       0          0        0
INTEL/WINNT51  1029   310     702        17       0          0        0
PPC/OSX         101     8       0        93       0          0        0
Total          1193   358     702       133       0          0        0
The above output shows that there are 1193 CPUs in the pool, of which 1029 are
running Windows and 31 are running INTEL/LINUX. 358 CPUs are being used
interactively by their owners, and condor is running jobs on 702 CPUs.
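The totals table can also be read by a script. The following is a minimal sketch, assuming the summary rows keep the layout shown above (a platform name containing a slash, followed by seven numeric columns); the function name summarize_pool is hypothetical and not part of condor. The sample input is copied from the output above so that the sketch runs without a condor installation.

```shell
#!/bin/sh
# Print one line per platform from the totals table of condor_status output.
# Reads the output on stdin. summarize_pool is a hypothetical helper name.
summarize_pool() {
    awk '
        # Platform rows look like "INTEL/LINUX 31 17 0 14 0 0 0":
        # the first field contains a slash and seven numeric columns follow.
        NF == 8 && $1 ~ /\// && $2 ~ /^[0-9]+$/ {
            printf "%s total=%d owner=%d claimed=%d unclaimed=%d\n", $1, $2, $3, $4, $5
        }
    '
}

# Sample input taken from the output shown above.
summary=$(summarize_pool <<'EOF'
vm2@PS-XS303. WINNT51 INTEL Claimed Busy 0.990 251 0+01:17:43
Total Owner Claimed Unclaimed Matched Preempting Backfill
IA64/LINUX 32 23 0 9 0 0 0
INTEL/LINUX 31 17 0 14 0 0 0
INTEL/WINNT51 1029 310 702 17 0 0 0
PPC/OSX 101 8 0 93 0 0 0
Total 1193 358 702 133 0 0 0
EOF
)
echo "$summary"
```

On the cluster the same function would be fed live output, for example: condor_status | summarize_pool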
To see what jobs a user (say with username asm4) has submitted, do:
condor_q -global asm4
The output will look like:
...
72575.171 asm4 5/2 19:32 0+01:24:05 R 0 29.3 symphony -f option
72575.172 asm4 5/2 19:32 0+01:28:12 R 0 29.3 symphony -f option
72575.173 asm4 5/2 19:32 0+01:23:20 R 0 39.1 symphony -f option
72575.174 asm4 5/2 19:32 0+01:27:30 R 0 146.5 symphony -f option
129 jobs; 0 idle, 129 running, 0 held
This shows that there are 129 jobs submitted by the user asm4 that are currently in condor's queue and condor is running all of them. The first column of each row gives the job number (e.g. 72575.174 in the above output).
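When the queue is long, the job rows can be tallied by state with a short filter. This is a sketch under the assumption that the rows have the column order shown above (job number, user, date, time, run time, state); count_states is a hypothetical name, and the sample rows are the four from the output above.

```shell
#!/bin/sh
# Count jobs per state (R = running, I = idle, H = held) in condor_q output
# read from stdin. count_states is a hypothetical helper name.
count_states() {
    awk '
        # Job rows start with a job number of the form cluster.process;
        # the state letter is the sixth column.
        $1 ~ /^[0-9]+\.[0-9]+$/ { n[$6]++ }
        END { printf "running=%d idle=%d held=%d\n", n["R"], n["I"], n["H"] }
    '
}

# Sample rows taken from the output shown above.
states=$(count_states <<'EOF'
72575.171 asm4 5/2 19:32 0+01:24:05 R 0 29.3 symphony -f option
72575.172 asm4 5/2 19:32 0+01:28:12 R 0 29.3 symphony -f option
72575.173 asm4 5/2 19:32 0+01:23:20 R 0 39.1 symphony -f option
72575.174 asm4 5/2 19:32 0+01:27:30 R 0 146.5 symphony -f option
129 jobs; 0 idle, 129 running, 0 held
EOF
)
echo "$states"
```

On the cluster this would be used as: condor_q -global asm4 | count_states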
To see more details about a particular job with number, say 72575.174, do:
condor_q -global -l 72575.174
This will give detailed information of where the job is running, where it was submitted from, how long it has stayed in queue and much more.
To remove a particular job do: condor_rm 72575.174. This will remove job number 72575.174. To remove all jobs do: condor_rm asm4. This will remove all jobs that belong to the user asm4. A job may be removed only from the same machine where it was submitted.
Condor is used to run all jobs on the HPC-Cluster. Users are allowed to run jobs interactively only for setting up, compiling and/or debugging their executables. Some instructions for compiling and running are available on the simple jobs page. Once a user has logged in to the cluster and wants to execute a file (say my_file) then it has to be submitted to condor. Once the job has been submitted the user may log off and condor will send an email when the job is finished.
To submit a job to condor, the user should create a condor-submit file. The file lists the name of the executable (my_file) and other details about what machines should run this job and how many times, what are the inputs etc. In the simplest form, the submit file (with name say f.submit) may look like this:
########################
# This is a comment.
# Submit file for a simple program located at
# /home/username/some/path/my_file
########################
universe = vanilla
notify_user = username@lehigh.edu
notification = Always
getenv = TRUE
initialdir = /home/username/dir1
Executable = /home/username/some/path/my_file
Output = foo.out.$(PROCESS)
log = foo.log.$(CLUSTER)
error = foo.err.$(PROCESS)
arguments = arg01 arg02
should_transfer_files = YES
transfer_input_files =
WhenToTransferOutput = ON_EXIT_OR_EVICT
queue 10
This file can then be submitted to condor by doing:
condor_submit f.submit
The above file tells condor to run the program my_file 10 different times. The output of the first run is stored in the location /home/username/dir1/foo.out.0, of the second run in /home/username/dir1/foo.out.1 and so on. Arguments arg01 and arg02 are used for each run. The log file showing when the job starts running and other related information is created at the location /home/username/dir1/foo.log.N, where N is the cluster number that condor assigns to the submission.
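Because each run writes its own foo.out.$(PROCESS) file, it is easy to check afterwards which runs produced no output. The sketch below assumes the file naming from the submit file above; missing_outputs is a hypothetical helper, and the demo uses a temporary directory with a few empty files in place of real job output.

```shell
#!/bin/sh
# List the process numbers (0 .. count-1) whose foo.out.N file is missing.
# missing_outputs is a hypothetical helper; usage: missing_outputs DIR COUNT
missing_outputs() {
    dir=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        [ -e "$dir/foo.out.$i" ] || echo "$i"
        i=$((i + 1))
    done
}

# Demo: pretend that runs 0, 1 and 3 of a "queue 5" submission finished.
demo=$(mktemp -d)
touch "$demo/foo.out.0" "$demo/foo.out.1" "$demo/foo.out.3"
missing=$(missing_outputs "$demo" 5 | tr '\n' ' ')
echo "runs without output: $missing"
rm -rf "$demo"
```

For the submit file above, the call would be: missing_outputs /home/username/dir1 10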
If a user wants to run a program for a long time (say many days) then she may want to checkpoint the job occasionally so that, in case of a disruption, the job can restart from the last checkpoint. This can only be done for programs that are compiled by the user. Application software and pre-compiled binaries cannot be checkpointed using this mechanism, and the user will have to implement her own checkpoint mechanism if she desires.
Submitting a job with checkpointing is a two-step process:
Compiling: The binary has to be compiled using condor_compile.
Using condor_compile is straightforward: just precede the compilation
command with condor_compile. For example, to compile a program foo.cpp using
g++ and link it with condor, do:
condor_compile g++ foo.cpp -o foo.bin
If you are using Makefiles, change them to prepend condor_compile
to every call of g++, gcc, f77, gfortran etc.
More details about condor_compile can be found using man condor_compile or in the online condor-manual.
Submitting to standard universe:
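No submit file for this step is given here, so the following is only a sketch. It assumes that a standard-universe submit file differs from the vanilla example mainly in the universe line and in pointing Executable at the binary built with condor_compile; the paths and the name foo.bin are placeholders. The file is written to a temporary location so that the sketch runs anywhere.

```shell
#!/bin/sh
# Write a hypothetical standard-universe submit file. All paths are
# placeholders, and foo.bin is assumed to have been built with condor_compile.
submit_file=$(mktemp)
cat > "$submit_file" <<'EOF'
########################
# Sketch of a submit file for a checkpointable program foo.bin
########################
universe = standard
initialdir = /home/username/dir1
Executable = /home/username/some/path/foo.bin
Output = foo.out.$(PROCESS)
log = foo.log.$(CLUSTER)
error = foo.err.$(PROCESS)
arguments = arg01 arg02
queue 1
EOF
contents=$(cat "$submit_file")
echo "$contents"
rm -f "$submit_file"
```

Saved under a name of your choice, such a file would be submitted with condor_submit like any other submit file. Check the online Condor manual for the options that the standard universe actually supports on your pool.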
MPI jobs can only be run through condor on these clusters. In order to build and run MPI jobs on these clusters, the user first has to set up ssh keys so that MPI works correctly. This can be done by executing the command:
setup_mpi_ssh
The output should show ssh setup successful. This command sets up ssh keys to access other nodes in the cluster without using a password. It creates the appropriate files in the user's /home/user-name/.ssh folder. The setup needs to be done only once by each user; it has to be repeated only if the user changes or deletes files in this folder.
The parallel universe provided by condor is useful for executing the same binaries on multiple machines simultaneously. This makes the parallel universe very useful for running MPI programs. If the binary needs to be compiled before execution, please follow the instructions for compiling MPI programs above.
MPI jobs can only be submitted from blaze1. If you submit your parallel jobs from any other machine, they will never run. If you happen to be logged into inferno or crane, just do rsh blaze1 to log into blaze1. Then, submit your jobs from there.
In order to run an MPI job (say mpi_foo with arguments arg01 arg02) on 4 processors, the condor submit file (foo.condor) should look like:
########################
# This is a comment.
# Submit description file for program mpi_foo.
########################
universe = parallel
notify_user = user-name@lehigh.edu
notification = Error
getenv = TRUE
# this is where condor output and log appear
initialdir = /home/user-name/some/path
Executable = /usr/local/bin/mp1script
machine_count = 4
Output = foo.out.$(NODE)
log = foo.log.$(CLUSTER)
error = foo.err.$(NODE)
# name of the MPI-binary
arguments = /home/user-name/path/to/mpi_foo arg01 arg02
should_transfer_files = YES
transfer_input_files =
WhenToTransferOutput = ON_EXIT_OR_EVICT
queue 1
Note that the executable should always be /usr/local/bin/mp1script and NOT mpi_foo. mp1script is a wrapper script which correctly sets up the environment for running mpi_foo. The output of each node is stored in foo.out.$(NODE). The condor job can then be submitted by doing:
condor_submit foo.condor
The job will be queued and executed as and when the resources become available.
In order to run condor commands, the user has to add the location of condor binaries in her path by doing:
export PATH=$PATH:/usr/local/condor/bin
In order to avoid typing this command every time you log in, append the above shell command to your .bashrc file (/home/username/.bashrc). To check if your path is set properly, do:
condor_version
This should show the version of condor installed on the machine.
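Appending the export line by hand more than once leaves duplicate entries in .bashrc. A minimal sketch of an append that is safe to repeat follows; add_condor_to_path is a hypothetical helper name, and the demo runs against a temporary file rather than a real .bashrc.

```shell
#!/bin/sh
# Append the condor PATH line to a shell startup file unless it is already
# there. add_condor_to_path is a hypothetical helper name.
add_condor_to_path() {
    rcfile=$1
    line='export PATH=$PATH:/usr/local/condor/bin'
    # grep -q: quiet, -x: match the whole line, -F: treat the pattern as text
    grep -qxF "$line" "$rcfile" 2>/dev/null || printf '%s\n' "$line" >> "$rcfile"
}

# Demo on a temporary file: calling the helper twice adds the line only once.
rc=$(mktemp)
add_condor_to_path "$rc"
add_condor_to_path "$rc"
count=$(grep -c 'condor/bin' "$rc")
echo "matching lines in file: $count"
rm -f "$rc"
```

For a real login environment the call would be: add_condor_to_path "$HOME/.bashrc"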
In order to submit her programs, the user has to write a submit file. Suppose the user wants to run a program foo, with arguments arg01, arg02. Here is a sample submit file:
########################
# This is a comment.
# Submit description file for program foo.
########################
universe = vanilla
notify_user = username@lehigh.edu
notification = Error
getenv = TRUE
initialdir = /home/user/some/path
Executable = /path/to/foo
Output = foo.out.$(PROCESS)
log = foo.log.$(CLUSTER)
error = foo.err.$(PROCESS)
arguments = arg01 arg02
queue 1
After saving the above submit file (say, as foo.condor), the user can submit this job by doing:
condor_submit foo.condor
When submitting from blaze1, the requirements of a job can be changed so that it may run on either X86_64 or INTEL machines by adding this line to the submit file:
requirements = (Arch == "X86_64") || (Arch == "INTEL")
In order to run only on Inferno, put:
Requirements = SubnetMask == "255.255.252.0" && Memory == 2006
Replace the memory value with Memory == 1005 if you want to use Blaze. Other ClassAd attributes per machine can be seen with this command:
condor_status -l machinename | sort | less
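The long listing contains many attributes, so it can help to filter it down to the ones used in requirements expressions. The sketch below assumes the listing prints one "Name = value" pair per line; pick_attrs is a hypothetical helper, and the sample input is illustrative only (it reuses the values quoted above and a made-up machine name), not real output from the cluster.

```shell
#!/bin/sh
# Keep only the Arch, Memory and SubnetMask attributes from a long ClassAd
# listing read on stdin. pick_attrs is a hypothetical helper name.
pick_attrs() {
    awk -F' = ' '
        $1 == "Arch" || $1 == "Memory" || $1 == "SubnetMask" { print $1 "=" $2 }
    '
}

# Illustrative input, not real cluster output.
attrs=$(pick_attrs <<'EOF'
Machine = "example-node"
Arch = "X86_64"
Memory = 2006
SubnetMask = "255.255.252.0"
EOF
)
echo "$attrs"
```

On the cluster this would be used as: condor_status -l machinename | pick_attrs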
To use standard universe, to submit multiple instances of a program, to use input files, to transfer files, and many other options, read the man page of condor_submit:
man condor_submit
The parallel universe is useful for executing the same binaries on multiple machines simultaneously. This makes the parallel universe very useful for running MPI programs. See instructions for compiling MPI programs on this page.
Note: Parallel jobs can only be submitted from blaze1. If you submit
your parallel jobs from any other machine, they will never run. If you happen
to be logged into inferno or crane, just do rsh blaze1 to log into
blaze1. Then, submit your jobs from there. See the note above on changing the
requirements of jobs when submitting from blaze1.
In order to run an MPI job (say mpi_foo with arguments arg01 arg02) on 4 processors, the condor submit file should look like:
########################
# This is a comment.
# Submit description file for program mpi_foo.
########################
universe = parallel
notify_user = username@lehigh.edu
notification = Error
getenv = TRUE
initialdir = /home/user/some/path
Executable = /usr/local/bin/mp1script
machine_count = 4
Output = foo.out.$(NODE)
log = foo.log.$(CLUSTER)
error = foo.err.$(NODE)
arguments = /full/path/to/mpi_foo arg01 arg02
should_transfer_files = YES
transfer_input_files =
WhenToTransferOutput = ON_EXIT_OR_EVICT
queue 1
Note that the executable should always be /usr/local/bin/mp1script and NOT mpi_foo. mp1script is a wrapper script which correctly sets up the environment for running mpi_foo. The output of each node is stored in foo.out.$(NODE).
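Putting the MPI binary itself on the Executable line is an easy mistake to make, so a quick check before submitting can help. The following sketch treats a submit file as acceptable if it is not a parallel-universe file, or if its Executable line names the wrapper script; check_parallel_submit is a hypothetical helper, and the two demo files are written to temporary locations.

```shell
#!/bin/sh
# Succeed unless FILE is a parallel-universe submit file whose Executable is
# something other than the mp1script wrapper.
# check_parallel_submit is a hypothetical helper name.
check_parallel_submit() {
    file=$1
    # Files for other universes are not subject to the wrapper rule.
    grep -Eiq '^[[:space:]]*universe[[:space:]]*=[[:space:]]*parallel[[:space:]]*$' "$file" || return 0
    grep -Eiq '^[[:space:]]*executable[[:space:]]*=[[:space:]]*/usr/local/bin/mp1script[[:space:]]*$' "$file"
}

# Demo: one file that follows the rule and one that does not.
good=$(mktemp)
bad=$(mktemp)
printf 'universe = parallel\nExecutable = /usr/local/bin/mp1script\nqueue 1\n' > "$good"
printf 'universe = parallel\nExecutable = /full/path/to/mpi_foo\nqueue 1\n' > "$bad"
check_parallel_submit "$good" && good_result=ok || good_result=bad
check_parallel_submit "$bad" && bad_result=ok || bad_result=bad
echo "correct file: $good_result, incorrect file: $bad_result"
rm -f "$good" "$bad"
```

A possible use is to chain it with submission: check_parallel_submit foo.condor && condor_submit foo.condor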
Here are some useful condor commands, to look at your condor queue and manage it:
To see all your jobs in queue and their status:
condor_q username
This will show the ids of jobs submitted by that user.
To see why your job, with id xx.yy, is not running:
condor_q -analyze xx.yy
To see a long description of your job (with id xx.yy):
condor_q -long xx.yy | less
To remove all your jobs:
condor_rm username
To see what machines are available in the cluster and their status:
condor_status
To see your own and other users' priorities:
condor_userprio -allusers -all
Condor has good support for checkpointing and migration of programs if the machine where a job is running becomes unavailable. To use these features, standard universe should be used. Details are available in the condor-manual.
Using condor_compile is straightforward: just precede the compilation command with condor_compile. For example, to compile a program foo.cpp using g++ and link it with condor, do:
condor_compile g++ foo.cpp -o foo.bin
This will create an executable foo.bin. This executable can then be submitted as a condor job.
Note: The -m32 flag is no longer necessary with condor_compile. Earlier, if the program was compiled on a 64-bit machine (like Blaze), -m32 had to be used with gcc or g++ because condor has 32-bit libraries. The above command would then be:
condor_compile g++ -m32 foo.cpp -o foo.bin
More details about condor_compile can be found using man condor_compile or in the online condor-manual.