For the purposes of the examples shown here, the HTCondor install location is /opt/glidecondor, the working directory is /opt/glidecondor/condor_local, and the machine name is mymachine.fnal.gov. If you want to use a different setup, make the necessary changes.
If you installed HTCondor via RPMs, the configuration files are in a different location: see this OSG guide or the OSG pages about the Frontend and Factory.
Multiple Schedds in the Factory
Note: If you specified any of these options using the GlideinWMS configuration based installer, these files and initialization steps will already have been performed. These instructions are relevant to any post-installation changes you desire to make.
Unless explicitly mentioned, all operations are to be done by the user that you installed HTCondor as.
Increase the number of available file descriptors
When using multiple schedds, you may want to consider increasing the number of available file descriptors. This can be done by issuing a "ulimit -n" command as well as by changing the values in the /etc/security/limits.conf file.
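As a rough illustration of the check described above, the following sketch compares the current open-file limits against a target. The target value of 16384 and the "condor" user name in the comments are assumptions for illustration, not values prescribed by GlideinWMS.

```shell
#!/bin/bash
# Hypothetical target; choose a value suited to your site.
TARGET_NOFILE=16384

# Succeeds if a limit value meets the target ("unlimited" always does).
nofile_ok() {
    local limit="$1" target="$2"
    [ "$limit" = "unlimited" ] && return 0
    [ "$limit" -ge "$target" ]
}

soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "open files: soft=$soft hard=$hard"

if nofile_ok "$soft" "$TARGET_NOFILE"; then
    echo "soft limit already meets $TARGET_NOFILE"
else
    echo "soft limit is below $TARGET_NOFILE"
    # For the current shell only (cannot exceed the hard limit):
    #   ulimit -n 16384
    # Persistent, in /etc/security/limits.conf ("condor" user is an assumption):
    #   condor  soft  nofile  16384
    #   condor  hard  nofile  16384
fi
```

A limit raised with "ulimit -n" applies only to the current shell and the processes it starts, which is why the limits.conf entries are also needed for a persistent change.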
Using the condor_shared_port feature
GlideinWMS V3+
Additional information on this daemon can be found here:
Your /opt/glidecondor/condor_config.d/02_gwms_schedds.config will need to contain the following attributes. Port 9618 is the default port for the schedds.
#-- Enable shared_port_daemon
SHADOW.USE_SHARED_PORT = True
SCHEDD.USE_SHARED_PORT = True
SHARED_PORT_MAX_WORKERS = 1000
SCHEDD.SHARED_PORT_ARGS = -p 9618
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
Note: Both the SCHEDD and SHADOW processes need to specify that the shared port option is in effect.
Very important: As explained below in this documentation, all HTCondor daemons on the Frontend (including the User Collector and Schedd) use the shared port daemon on port 9618, which must be open. For the secondary collectors, you may need to open the port range 9620 to 9660 depending on your configuration (i.e. if Glideins call back on those ports). Standalone submit hosts may have only port 9615 open, as indicated in the examples. In this case, please review the firewalls to make sure 9618 is open. For GlideinWMS versions prior to 3.4.1, the same range must be open, as well as port 9615. Please note that if you install the user schedd on a separate host, incoming TCP port 9618 must remain open (it was 9615 for GlideinWMS 3.4.0 and earlier).
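A quick way to review whether the ports mentioned here are reachable through the firewall is a plain TCP connection test. The sketch below is one possible approach using bash's /dev/tcp pseudo-device; the CHECK_HOST variable and the port_open helper are hypothetical names introduced for this example, and a successful connection only shows that something is listening, not that it is an HTCondor daemon.

```shell
#!/bin/bash
# Succeeds if a TCP connection to host:port can be opened within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device, so bash is invoked explicitly.
port_open() {
    local host="$1" port="$2"
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# CHECK_HOST is a hypothetical variable; set it to your schedd host,
# for example mymachine.fnal.gov. Defaults to localhost.
HOST="${CHECK_HOST:-localhost}"
for port in 9618 9615; do
    if port_open "$HOST" "$port"; then
        echo "$HOST:$port reachable"
    else
        echo "$HOST:$port NOT reachable"
    fi
done
```

Run it from a machine outside the firewall (for example a worker node or another submit host) to test the path that matters.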
Multiple Schedds in GlideinWMS
The following needs to be added to your HTCondor config file for each additional schedd desired. Note the numeric suffix used to distinguish each schedd. If the multiple schedds are being used on your WMS Collector, HTCondor-G is used to submit the glidein pilot jobs and the SCHEDD(GLIDEINS/JOBS)2_ENVIRONMENT attribute shown below is required. Otherwise, it should be omitted.
The JOB_QUEUE_LOG attribute is required.
For the WMS Collector:
SCHEDDGLIDEINS2 = $(SCHEDD)
SCHEDDGLIDEINS2_ARGS = -local-name scheddglideins2
SCHEDDGLIDEINS2.SCHEDD_NAME = schedd_glideins2
SCHEDDGLIDEINS2.SCHEDD_LOG = $(LOG)/SchedLog.$(SCHEDDGLIDEINS2.SCHEDD_NAME)
SCHEDDGLIDEINS2.LOCAL_DIR_ALT = $(LOCAL_DIR)/$(SCHEDDGLIDEINS2.SCHEDD_NAME)
SCHEDDGLIDEINS2.EXECUTE = $(SCHEDDGLIDEINS2.LOCAL_DIR_ALT)/execute
SCHEDDGLIDEINS2.LOCK = $(SCHEDDGLIDEINS2.LOCAL_DIR_ALT)/lock
SCHEDDGLIDEINS2.PROCD_ADDRESS = $(SCHEDDGLIDEINS2.LOCAL_DIR_ALT)/procd_pipe
SCHEDDGLIDEINS2.SPOOL = $(SCHEDDGLIDEINS2.LOCAL_DIR_ALT)/spool
SCHEDDGLIDEINS2.JOB_QUEUE_LOG = $(SCHEDDGLIDEINS2.SPOOL)/job_queue.log ## Note: Required with HTCondor 7.7.5+
SCHEDDGLIDEINS2.SCHEDD_ADDRESS_FILE = $(SCHEDDGLIDEINS2.SPOOL)/.schedd_address
SCHEDDGLIDEINS2.SCHEDD_DAEMON_AD_FILE = $(SCHEDDGLIDEINS2.SPOOL)/.schedd_classad
SCHEDDGLIDEINS2_SPOOL_DIR_STRING = "$(SCHEDDGLIDEINS2.SPOOL)"
SCHEDDGLIDEINS2.SCHEDD_EXPRS = SPOOL_DIR_STRING
SCHEDDGLIDEINS2_ENVIRONMENT = "_CONDOR_GRIDMANAGER_LOG=$(LOG)/GridManagerLog.$(SCHEDDGLIDEINS2.SCHEDD_NAME).$(USERNAME)"
DAEMON_LIST = $(DAEMON_LIST), SCHEDDGLIDEINS2
DC_DAEMON_LIST = + SCHEDDGLIDEINS2
For the User Submit host:
SCHEDDJOBS2 = $(SCHEDD)
SCHEDDJOBS2_ARGS = -local-name scheddjobs2
SCHEDDJOBS2.SCHEDD_NAME = schedd_jobs2
SCHEDDJOBS2.SCHEDD_LOG = $(LOG)/SchedLog.$(SCHEDDJOBS2.SCHEDD_NAME)
SCHEDDJOBS2.LOCAL_DIR_ALT = $(LOCAL_DIR)/$(SCHEDDJOBS2.SCHEDD_NAME)
SCHEDDJOBS2.EXECUTE = $(SCHEDDJOBS2.LOCAL_DIR_ALT)/execute
SCHEDDJOBS2.LOCK = $(SCHEDDJOBS2.LOCAL_DIR_ALT)/lock
SCHEDDJOBS2.PROCD_ADDRESS = $(SCHEDDJOBS2.LOCAL_DIR_ALT)/procd_pipe
SCHEDDJOBS2.SPOOL = $(SCHEDDJOBS2.LOCAL_DIR_ALT)/spool
SCHEDDJOBS2.JOB_QUEUE_LOG = $(SCHEDDJOBS2.SPOOL)/job_queue.log
SCHEDDJOBS2.SCHEDD_ADDRESS_FILE = $(SCHEDDJOBS2.SPOOL)/.schedd_address
SCHEDDJOBS2.SCHEDD_DAEMON_AD_FILE = $(SCHEDDJOBS2.SPOOL)/.schedd_classad
SCHEDDJOBS2_SPOOL_DIR_STRING = "$(SCHEDDJOBS2.SPOOL)"
SCHEDDJOBS2.SCHEDD_EXPRS = SPOOL_DIR_STRING
DAEMON_LIST = $(DAEMON_LIST), SCHEDDJOBS2
DC_DAEMON_LIST = + SCHEDDJOBS2
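Because each additional schedd repeats the same stanza with a different numeric suffix, the block can be generated rather than copied by hand. The following is a minimal sketch for the User Submit host case; the gen_schedd_stanza function and the @N@ placeholder are hypothetical helpers for this example, and the schedd_jobsN naming follows the convention seen in the sample output of init_schedd.sh.

```shell
#!/bin/bash
# Prints the User Submit host stanza for schedd number N (for example 3).
gen_schedd_stanza() {
    local n="$1"
    # The quoted 'EOF' keeps $(...) literal; sed fills in the numeric suffix.
    sed "s/@N@/${n}/g" <<'EOF'
SCHEDDJOBS@N@ = $(SCHEDD)
SCHEDDJOBS@N@_ARGS = -local-name scheddjobs@N@
SCHEDDJOBS@N@.SCHEDD_NAME = schedd_jobs@N@
SCHEDDJOBS@N@.SCHEDD_LOG = $(LOG)/SchedLog.$(SCHEDDJOBS@N@.SCHEDD_NAME)
SCHEDDJOBS@N@.LOCAL_DIR_ALT = $(LOCAL_DIR)/$(SCHEDDJOBS@N@.SCHEDD_NAME)
SCHEDDJOBS@N@.EXECUTE = $(SCHEDDJOBS@N@.LOCAL_DIR_ALT)/execute
SCHEDDJOBS@N@.LOCK = $(SCHEDDJOBS@N@.LOCAL_DIR_ALT)/lock
SCHEDDJOBS@N@.PROCD_ADDRESS = $(SCHEDDJOBS@N@.LOCAL_DIR_ALT)/procd_pipe
SCHEDDJOBS@N@.SPOOL = $(SCHEDDJOBS@N@.LOCAL_DIR_ALT)/spool
SCHEDDJOBS@N@.JOB_QUEUE_LOG = $(SCHEDDJOBS@N@.SPOOL)/job_queue.log
SCHEDDJOBS@N@.SCHEDD_ADDRESS_FILE = $(SCHEDDJOBS@N@.SPOOL)/.schedd_address
SCHEDDJOBS@N@.SCHEDD_DAEMON_AD_FILE = $(SCHEDDJOBS@N@.SPOOL)/.schedd_classad
SCHEDDJOBS@N@_SPOOL_DIR_STRING = "$(SCHEDDJOBS@N@.SPOOL)"
SCHEDDJOBS@N@.SCHEDD_EXPRS = SPOOL_DIR_STRING
DAEMON_LIST = $(DAEMON_LIST), SCHEDDJOBS@N@
DC_DAEMON_LIST = + SCHEDDJOBS@N@
EOF
}

# Example: print the stanza for a third schedd.
gen_schedd_stanza 3
```

The output can be reviewed and then appended to the HTCondor config file in use at your site.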
Directories will need to be created for the following attributes defined above:
LOCAL_DIR_ALT
EXECUTE
SPOOL
LOCK
A script is available to do this for you, provided the attributes are defined with the naming convention shown. If the directories already exist, it will verify their existence and ownership; otherwise they will be created.
source /opt/glidecondor/condor.sh
GLIDEINWMS_LOCATION/install/services/init_schedd.sh
(sample output)
Validating schedd: SCHEDDJOBS2
Processing schedd: SCHEDDJOBS2
SCHEDDJOBS2.LOCAL_DIR_ALT: /opt/glidecondor/condor_local/schedd_jobs2
... created
SCHEDDJOBS2.EXECUTE: /opt/glidecondor/condor_local/schedd_jobs2/execute
... created
SCHEDDJOBS2.SPOOL: /opt/glidecondor/condor_local/schedd_jobs2/spool
... created
SCHEDDJOBS2.LOCK: /opt/glidecondor/condor_local/schedd_jobs2/lock
... created
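For reference, the directory layout shown in the sample output can also be created by hand. The sketch below is not the implementation of init_schedd.sh, only an illustration of the resulting tree; make_schedd_dirs is a hypothetical helper, and it runs against a scratch directory so nothing real is touched.

```shell
#!/bin/bash
# Creates the per-schedd directory tree under a base directory.
# $1: base directory (the value of LOCAL_DIR), $2: schedd name.
make_schedd_dirs() {
    local base="$1" name="$2" sub
    for sub in execute spool lock; do
        mkdir -p "${base}/${name}/${sub}" || return 1
    done
}

# Demonstrated against a scratch directory instead of
# /opt/glidecondor/condor_local.
scratch=$(mktemp -d)
make_schedd_dirs "$scratch" schedd_jobs2
ls -ld "$scratch"/schedd_jobs2/*
```

When doing this on a real installation, run it as the user that HTCondor was installed as, since the script described above also checks ownership.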
Multiple Collectors for Scalability / Shared Port
For scalability purposes, this section describes the configuration steps necessary to add secondary (additional) HTCondor collectors for the WMS and/or User Collectors, with or without the shared_port option.
Note: If you specified any of these options using the GlideinWMS configuration based installer, these files and initialization steps will already have been performed. These instructions are relevant to any post-installation changes you desire to make.
Important: When secondary (additional) collectors are added to either the WMS Collector or User Collector, changes must also be made to the Frontend configurations of all Frontends, so they are made aware of them.
HTCondor configuration changes
Individual Ports
For each secondary collector, the following Condor attributes are required:
COLLECTORnn = $(COLLECTOR)
COLLECTORnn_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/CollectornnLog"
COLLECTORnn_ARGS = -f -p port_number
DAEMON_LIST = $(DAEMON_LIST), COLLECTORnn
In the above example, nn is an arbitrary value to uniquely identify each secondary collector. Each secondary collector must also have a unique port_number.
After these changes have been made in your Condor configuration file, restart HTCondor to effect the change. You will see these collector processes running (example has 5 secondary collectors).
user 17732     1 0 13:34 ? 00:00:00 condor_master
user 17735 17732 0 13:34 ? 00:00:00 condor_collector -f            primary
user 17736 17732 0 13:34 ? 00:00:00 condor_negotiator -f
user 17737 17732 0 13:34 ? 00:00:00 condor_collector -f -p 9619    secondary
user 17738 17732 0 13:34 ? 00:00:00 condor_collector -f -p 9620    secondary
user 17739 17732 0 13:34 ? 00:00:00 condor_collector -f -p 9621    secondary
user 17740 17732 0 13:34 ? 00:00:00 condor_collector -f -p 9622    secondary
user 17741 17732 0 13:34 ? 00:00:00 condor_collector -f -p 9623    secondary
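With several secondary collectors, the four attributes per collector can be generated in a loop. The sketch below assumes consecutive port numbers starting at a chosen first port, which matches the process listing above (9619 to 9623) but is not a requirement; gen_collectors is a hypothetical helper name.

```shell
#!/bin/bash
# Prints the config attributes for <count> secondary collectors,
# using consecutive ports starting at <first_port>.
gen_collectors() {
    local count="$1" first_port="$2" i=1 port
    while [ "$i" -le "$count" ]; do
        port=$((first_port + i - 1))
        # \$ keeps the HTCondor macros literal in the generated config.
        cat <<EOF
COLLECTOR${i} = \$(COLLECTOR)
COLLECTOR${i}_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=\$(LOG)/Collector${i}Log"
COLLECTOR${i}_ARGS = -f -p ${port}
DAEMON_LIST = \$(DAEMON_LIST), COLLECTOR${i}
EOF
        i=$((i + 1))
    done
}

# Example: five secondary collectors on ports 9619 to 9623.
gen_collectors 5 9619
```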
Shared Port
Since GlideinWMS v3.4.1, shared_port is enabled by default for secondary collectors and CCBs, putting all collector communication behind a single TCP port (by default, port 9618). This gives each daemon its own queue instead of a single global queue. To set this up, the following HTCondor attributes are required:
COLLECTOR_HOST = $(CONDOR_HOST):port_number
USE_SHARED_PORT = True
SHARED_PORT_MAX_WORKERS = 1000
SHARED_PORT_ARGS = -p port_number
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
For the secondary collectors configuration, this example uses nn as an arbitrary value to uniquely identify each secondary collector, but all of the collectors are behind a single TCP port.
use Experimental:CollectorNode(nn)
COLLECTORnn_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/CollectornnLog"
After these changes have been made in your HTCondor configuration file, restart HTCondor to effect the change. You will see these collector processes running (example has 5 secondary collectors and the main one listening on the same port: 9618).
├─1675222 condor_shared_port -f -p 9618                                 TCP single port
├─1675223 condor_collector -f                                           primary
├─1675227 condor_negotiator -f
├─1675229 condor_schedd -f
├─1675230 condor_collector -f -f -local-name COLLECTOR1 -sock collector1    secondary
├─1675232 condor_collector -f -f -local-name COLLECTOR2 -sock collector2    secondary
├─1675234 condor_collector -f -f -local-name COLLECTOR3 -sock collector3    secondary
├─1675237 condor_collector -f -f -local-name COLLECTOR4 -sock collector4    secondary
├─1675240 condor_collector -f -f -local-name COLLECTOR5 -sock collector5    secondary
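Since every secondary collector now sits behind the same port, each one is distinguished by its socket name (the -sock value in the listing above). The sketch below prints one address per collector, assuming the host:port?sock=name form that HTCondor uses for daemons reached through the shared port daemon; collector_addresses is a hypothetical helper, and the host name is the example machine used in this document.

```shell
#!/bin/bash
# Prints one address per secondary collector behind the shared port.
# $1: host, $2: shared port, $3: number of secondary collectors.
collector_addresses() {
    local host="$1" port="$2" count="$3" i=1
    while [ "$i" -le "$count" ]; do
        echo "${host}:${port}?sock=collector${i}"
        i=$((i + 1))
    done
}

# Example: the five secondary collectors from the listing above.
collector_addresses mymachine.fnal.gov 9618 5
```

Check the exact form expected by your Frontend configuration before using these strings there.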
Transition to Shared Port
This is a temporary configuration for switching from separate ports to shared_port. In GlideinWMS v3.4.1, a shared-port-only configuration is incompatible with older Factories (v3.4 or older). It also requires the Frontend admin to drain the Frontend, change the configuration, and restart it. To stay compatible and allow a smoother transition, the following configuration supports both separate ports and the shared port, and avoids the pitfalls mentioned. A secondary collector can listen on a separate port and through the shared port daemon at the same time.
COLLECTORnn = $(COLLECTOR)
COLLECTORnn_ARGS = -f -local-name COLLECTORnn -p port_number -sock collectornn
COLLECTORnn_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/CollectornnLog"
DAEMON_LIST = $(DAEMON_LIST), COLLECTORnn
After these changes have been made in your Condor configuration file, restart HTCondor to effect the change. You will see as many collector processes running as you defined.
Multiple Collectors for High Availability (HA)
For reliability purposes, you may want to utilize HTCondor's High Availability (HA) feature for collectors.
Important: When the HTCondor High Availability feature is used in the User Collector, changes must also be made to the Frontend configurations so it is made aware of them.
Fine Tuning User Schedd for Large Scale Installations
Increase the number of available file descriptors
The number of ports used by the condor_schedd process increases as the number of jobs running or queued in the schedd increases. The default number of file descriptors per process is 1024 on most systems. Increase this limit to ~16k, or to a value higher than the number of jobs that might be in the queue at any given time. This is particularly necessary for large scale installations.
In most default installations, the user schedd is configured to start as root and is started through the script in /etc/init.d/condor. This is a good place to set a higher file descriptor limit for the schedd process.
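After raising the limit and restarting HTCondor, it is worth confirming that the running schedd actually picked it up. The following sketch reads the limit from /proc on Linux; max_open_files is a hypothetical helper, and the loop prints nothing if no condor_schedd process is running.

```shell
#!/bin/bash
# Prints the soft "Max open files" limit of a running process (Linux /proc).
max_open_files() {
    awk '/^Max open files/ {print $4}' "/proc/$1/limits"
}

# Report the limit for every running condor_schedd process.
for pid in $(pgrep -x condor_schedd); do
    echo "condor_schedd pid $pid: max open files = $(max_open_files "$pid")"
done
```

If the reported value is still 1024, the limit was most likely set in a shell other than the one that starts the daemons.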