WMS Factory
Custom Condor Variables
Description
This document describes what configuration variables are used by the glideins. Most administrators never need to touch most of them, but a sophisticated Glidein Factory administrator may need to tweak some of them to implement the desired policies (for example: require encryption over the wire) or to address the needs of a particular site (for example: max allowed wallclock time).
Configuration variable location
The glideinWMS ships with a set of pre-defined configuration variables that are stored in two files, known as condor vars files:
glideinWMS/creation/web_base/condor_vars.lst
glideinWMS/creation/web_base/condor_vars.lst.entry
The two files are equivalent, but were split for historical
reasons, and the second one is meant to contain site specific
configuration variables.
These files should never be modified,
and represent just the default shipped by the software!
A glideinWMS administrator can change the values of the predefined variables (with some exceptions, see below), and define new ones using the Glidein Factory configuration file.
Condor vars files
The condor vars files contain the glideinWMS pre-defined
configuration variables, and should never be modified.
However,
a glideinWMS administrator should nevertheless be able to read them.
Each of them is an ASCII file, with one entry per row.
Lines
starting with # are comments and are ignored.
Each non comment line must have 7 columns. Each column has a specific meaning:
- Attribute name
- Attribute type
  - I (int) – integer
  - S (string) – quoted string
  - C (expr) – unquoted string (i.e. Condor keyword or expression)
- Default value; use - if there is no default
- Condor name, i.e. the name under which this attribute is known in the configuration used by the Condor daemons (+ means the same as the attribute name)
- Is a value required for this attribute? Must be Y or N. If Y and the attribute is not defined, the glidein will fail.
- Will condor_startd publish this attribute to the collector? Must be Y or N.
- Will the attribute be exported to the user job environment?
  - - : Do not export (for glidein/condor internal use)
  - + : Export to the user job environment using the original attribute name
  - @ : Export to the user job environment using the Condor name
Here below, you can see a short extract; the semantics of the variables is defined below.
# VarName               Type     Default         CondorName                                Req.  Export  UserJobEnvName
#                       S=Quote  - = No Default  + = VarName                                     Condor  - = Do not export
#                                                                                                        + = Use VarName
#                                                                                                        @ = Use CondorName
#######################################################################################################################
X509_USER_PROXY         C        -               GSI_DAEMON_PROXY                          Y     N       -
USE_MATCH_AUTH          C        -               SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION  N     N       -
GLIDEIN_Factory         S        -               +                                         Y     Y       @
GLIDEIN_Name            S        -               +                                         Y     Y       @
GLIDEIN_Collector       C        -               HEAD_NODE                                 Y     N       -
GLIDEIN_Expose_Grid_Env C        False           JOB_INHERITS_STARTER_ENVIRONMENT          N     Y       +
TMP_DIR                 S        -               GLIDEIN_Tmp_Dir                           Y     Y       @
CONDORG_CLUSTER         I        -               GLIDEIN_ClusterId                         Y     Y       @
CONDORG_SUBCLUSTER      I        -               GLIDEIN_ProcId                            Y     Y       @
CONDORG_SCHEDD          S        -               GLIDEIN_Schedd                            Y     Y       @
SEC_DEFAULT_ENCRYPTION  C        OPTIONAL        +                                         N     N       -
SEC_DEFAULT_INTEGRITY   C        REQUIRED        +                                         N     N       -
MAX_MASTER_LOG          I        1000000         +                                         N     N       -
MAX_STARTD_LOG          I        10000000        +                                         N     N       -
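As a rough illustration of the seven-column format, the following Python sketch parses condor vars lines into records. The names CondorVar and parse_condor_vars are hypothetical and are not part of glideinWMS. The sketch assumes that columns are separated by whitespace and that no value contains embedded spaces, which the description above does not guarantee.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class CondorVar:
    """One non-comment row of a condor vars file (hypothetical helper type)."""
    name: str
    type: str                # "I", "S" or "C"
    default: Optional[str]   # None when the file says "-"
    condor_name: str         # "+" is resolved to the attribute name
    required: bool
    publish: bool            # published by condor_startd to the collector
    export: str              # "-", "+" or "@"

    def job_env_name(self) -> Optional[str]:
        """Name seen in the user job environment, or None if not exported."""
        if self.export == "+":
            return self.name
        if self.export == "@":
            return self.condor_name
        return None


def _yes_no(value: str, line: str) -> bool:
    if value not in ("Y", "N"):
        raise ValueError(f"expected Y or N, got {value!r} in: {line!r}")
    return value == "Y"


def parse_condor_vars(lines: Iterable[str]) -> List[CondorVar]:
    """Parse condor vars lines, skipping comments and blank lines.

    Assumption: whitespace-separated columns without embedded spaces.
    """
    result = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split()
        if len(cols) != 7:
            raise ValueError(f"expected 7 columns, got {len(cols)}: {line!r}")
        name, vtype, default, condor_name, req, pub, export = cols
        if vtype not in ("I", "S", "C"):
            raise ValueError(f"unknown type {vtype!r} in: {line!r}")
        if export not in ("-", "+", "@"):
            raise ValueError(f"unknown export flag {export!r} in: {line!r}")
        result.append(CondorVar(
            name=name,
            type=vtype,
            default=None if default == "-" else default,
            condor_name=name if condor_name == "+" else condor_name,
            required=_yes_no(req, line),
            publish=_yes_no(pub, line),
            export=export,
        ))
    return result
```

Feeding the extract above to parse_condor_vars gives, for example, a TMP_DIR record whose job_env_name() is GLIDEIN_Tmp_Dir, because its export flag is @.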
Glidein Variables
This section defines all the variables that the glideins explicitly use. Please be aware that, apart from the variables mentioned below, many other variables will be used by the Condor daemons, since glideins are Condor based; see the Condor manual for more details.
The variables can be divided based on their source:
Factory config xml - attr tags
This section presents all the variables that a Glidein Factory administrator can change directly, using attr tags in the factory configuration XML:
<attr name="name" value="val" type="type" .../>
Attr tags can appear in both the Factory and the Frontend configuration. The available types are Int (I in the HTCondor vars file), String (S) and Expr (C). String and Expr values are treated literally (no need to escape them), and Strings are quoted when passed to HTCondor.
More information on the XML format can be found in
Glidein Factory configuration section.
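To make the effect of the type attribute concrete, here is a minimal Python sketch that reads a single attr tag and shows how its value might look once handed to HTCondor. The function render_attr and the "NAME = value" output form are assumptions made for this illustration; the sketch only looks at the name, value and type attributes shown above, accepts the type in any letter case, and does not try to handle quotes embedded in String values.

```python
import xml.etree.ElementTree as ET

# Mapping from the XML type to the one-letter code used in the condor vars files.
TYPE_TO_VARS_CODE = {"int": "I", "string": "S", "expr": "C"}


def render_attr(xml_text: str) -> str:
    """Render one <attr .../> tag as a 'NAME = value' line (illustrative only)."""
    elem = ET.fromstring(xml_text)
    if elem.tag != "attr":
        raise ValueError(f"expected an attr tag, got {elem.tag!r}")
    name = elem.attrib["name"]
    value = elem.attrib["value"]
    code = TYPE_TO_VARS_CODE[elem.attrib["type"].lower()]
    if code == "I":
        # Int values must parse as integers.
        value = str(int(value))
    elif code == "S":
        # Strings are quoted when passed to HTCondor.
        value = f'"{value}"'
    # Expr values (code "C") are passed through unquoted.
    return f"{name} = {value}"


print(render_attr('<attr name="GLIDEIN_Max_Idle" value="1800" type="int"/>'))
print(render_attr('<attr name="GLIDEIN_Site" value="UCSD" type="string"/>'))
print(render_attr('<attr name="SEC_DEFAULT_ENCRYPTION" value="REQUIRED" type="expr"/>'))
```

The values 1800 and UCSD are made up for the example; only the quoting difference between String and Expr is the point.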
If not specified in the XML, most of these variables have
defaults set in condor vars files,
which are used if the Glidein Factory administrator does not override
them. These defaults are listed below.
Please also note that some of these variables may also be provided by the VO factory clients.
Name | Type | Default Value | Description |
GLIDEIN_Site | String | Entry name | Logical name of the Grid site where the glidein is running. This information is published both in the startd ClassAd and in the user job environment. |
GLIDEIN_Hold | Expr (Bool) | True |
Condor expression to use to specify when a user job in the glideins should be held. If any expression is true, the job is held. This is usually done to specify "bad" jobs, such as those that claim too much memory. |
GLIDEIN_Entry_PREEMPT | Expr (Bool) | True |
Condor expression to use to specify when a user job in the glideins should be preempted. If any expression is true, the job is preempted. This is usually done to specify custom preemption policies for user jobs. |
GLIDEIN_PREEMPT | Expr (Bool) | True | |
GLIDEIN_Rank | Expr (Int) | 1 |
Used in calculating the Condor RANK. They are summed together, and the user job with the largest rank will run first. |
GLIDEIN_Entry_Rank | Expr (Int) | 1 | |
GLIDEIN_Max_Idle | Int | 1200 (20 mins) |
Max amount of time a condor_startd will wait to be matched before giving up and terminating. |
GLIDEIN_Max_Tail | Int | 400 (6 mins 40 secs) |
Max amount of time a condor_startd will wait after having already completed a job to be matched again. (i.e. the tail of a job). |
GLIDEIN_Retire_Time | Int | 21600 (6 hours) |
How long the condor_startd will run before it stops accepting jobs.
|
GLIDEIN_Retire_Time_Spread | Int | 7200 (2 hours) |
|
GLIDEIN_Max_Walltime | Int | N/A |
Max allowed time for the glidein. |
GLIDEIN_Graceful_Shutdown | Int | 120 |
Once DAEMON_SHUTDOWN is reached and the glidein pilot enters the Retiring state, this amount of time is allowed for the startd and the job to shut down gracefully before the glidein is forcefully terminated. See Lifetime of a glidein for details. |
PREEMPT | Expr (Bool) | False | Specifies whether preemption is allowed to occur on this glidein. |
PREEMPT_GRACE_TIME | Int | 10000000 | This value affects the condor value "MaxJobRetirementTime" and is an integer value representing the number of seconds a preempted job will be allowed to run before being evicted. This only affects behaviour if PREEMPT=True. |
HOLD_GRACE_TIME | Int | 0 | This value affects the condor value "MaxJobRetirementTime" and is an integer value representing the number of seconds a job that triggers WANT_HOLD will be allowed to run before being evicted. This only affects behaviour if GLIDEIN_HOLD, GLIDEIN_Entry_HOLD, GLIDECLIENT_HOLD, or GLIDECLIENT_Group_HOLD are specified and become true. By default, these "bad" jobs are immediately evicted. |
GLIDEIN_Monitoring_Enabled | Expr (Bool) | True | Controls whether the pseudo-interactive monitoring slot is started on the worker node. Set to False if you do not want the monitoring slot started. |
GLIDEIN_Resource_Slots | String | None | Special purpose resources added in separate slots or in the main slot of the glidein. The separate slots will never be available for regular jobs. This string is a semicolon-separated list of comma-separated resource descriptions. Each resource description contains the name (case insensitive) and, optionally: the number of resource instances (default is 1 (*)); the total memory reserved (default is 128MB times the number of resources); a type option saying whether the resource is added to the main slot (main), gets a new dedicated slot (partitionable), or gets many dedicated slots, one per instance (static); and the disk reserved (default is auto, where HTCondor splits evenly). The default type is partitionable, unless there is only one resource instance, in which case it is static. When adding resources to the main slot: the memory and disk parameters are ignored; HTCondor automatically splits the resources depending on the number of CPUs and on whether the main slot is partitionable; if the number of resources is not equal to (or an exact multiple of) the number of CPUs, then you must select partitionable slots (slots_layout="partitionable" in the config/submit section of the entry configuration or SLOTS_LAYOUT partitionable in the frontend configuration), otherwise the startd may fail due to an impossible configuration. The parameters in a resource description can be listed in order or specified using their name: name, number, memory, type, disk (see the last example below). Characters may be appended to a numerical memory value to indicate a percentage (%) or units: K or KB indicates KiB ($2^{10}$ bytes), M or MB indicates MiB ($2^{20}$ bytes), G or GB indicates GiB ($2^{30}$ bytes), and T or TB indicates TiB ($2^{40}$ bytes). Disk space must be a fraction or percentage (for example 1/4 or 25%); auto lets HTCondor do the splitting.
Check the HTCondor manual for more about resource splitting. Job submissions must use request_RESOURCE=N with N>0 to use these slots or to use the resource in the main slot, e.g. request_ioslot=1. Examples:
(*) GPUs is a special resource name. If no number is specified, the glidein will invoke HTCondor's GPU discovery mechanism and get the number from there, which could be 0 if there are no GPUs. Your job will also find all the special attributes about the GPUs in the HTCondor ClassAd. If no number is specified and the auto-discovery fails, 0 GPUs are assumed. Since you don't know the number of GPUs that will be added to the main slot, the slots_layout must be partitionable. |
GLEXEC_BIN | String | None |
If set, Condor will launch all user jobs via glexec,
thus running the job under the
appropriate local account. This is important both for glideinWMS
security and for accounting
purposes. This variable is renamed to GLEXEC in the condor config. |
GLEXEC_JOB | Expr (Bool) | False |
If set to False, the condor_starter is run sharing the same UID as the user job. This has security implications. If running Condor 7.1.3 or later, it is recommended to turn this on and have the condor_starter be protected from the user jobs. |
GLIDEIN_Use_PGroups | Expr (Bool) | False |
Should process group monitoring be enabled? This is a Condor optimization parameter. Unfortunately, it negatively interferes with the batch systems used by the Grid sites, so it should not be turned on unless you have a very good reason to do so. |
UPDATE_COLLECTOR_WITH_TCP | Expr (Bool) | True |
If True, forces the glidein to use TCP updates. Also see the Condor documentation for implications and side effects. |
WANT_UDP_COMMAND_SOCKET | Expr (Bool) | False |
If True, enable the startd UDP command socket (Condor default). Using the UDP command socket is a Condor optimization that makes working over firewalls and NATs very difficult. It is thus recommended you leave it disabled in the glideins. Please note if you leave it disabled, that you must configure
the schedd with |
STARTD_SENDS_ALIVES | Expr (Bool) | True |
If set to False, the schedd will be sending keepalives to the startd. Setting this to True instructs the startd to send keepalives to the schedd. This improves the glidein behavior when running behind a firewall or a NAT. Please note that the schedd must be configured in the same way for this to work. |
SEC_DEFAULT_INTEGRITY | Expr | REQUIRED |
Security related settings. Please notice that the glideins always require GSI authentication. For more details see the configuration page or the Condor manual. |
SEC_DEFAULT_ENCRYPTION | Expr | OPTIONAL | |
USE_MATCH_AUTH | Expr (Bool) | False |
Another security setting. If set to True, the schedd and the startd will use a low overhead protocol. See the configuration page or the Condor manual. |
MAX_MASTER_LOG | Int | 1M |
Maximum size to which each log may grow. Setting these too low will make debugging difficult. |
MAX_STARTD_LOG | Int | 10M | |
MAX_STARTER_LOG | Int | 10M | |
USE_CCB | Expr (Bool) | False |
If set to True, it will enable HTCondor Connection Brokering (CCB), which is needed if the glideins are behind a firewall or a NAT. |
USE_SHARED_PORT | Expr (Bool) | False |
If set to True, it will enable the shared port daemon, which will reduce the number of connections between the glidein and the collectors. |
GLIDEIN_CPUS | Int | 1 |
Number of CPUs glidein should use. GLIDEIN_CPUS is used to set NUM_CPUS for the HTCondor started by the glidein. Use "slot" (or "auto") to let glidein determine this from the job slot assigned to the glidein, use "node" to let glidein determine this from the hardware of the worker node (e.g. if you are sure that only your job is running on the node), set a value to force a different number. In case of static partitioning, glidein will create GLIDEIN_CPUS number of slots. In case of dynamic partitioning, the slots will be created automatically based on the CPUs required by the job and GLIDEIN_CPUS is the sum of the slots made available. Refer to HTCondor manual for info on NUM_CPUS |
GLIDEIN_MaxMemMBs | Int | None |
Amount of memory the glidein should use. If set, GLIDEIN_MaxMemMBs is used to set the total MEMORY available to the HTCondor startd and the jobs started by the glidein. If GLIDEIN_MaxMemMBs is not set and GLIDEIN_MaxMemMBs_Estimate is TRUE, GLIDEIN_MaxMemMBs is calculated based on the memory available on the worker node. If GLIDEIN_MaxMemMBs is not set and GLIDEIN_MaxMemMBs_Estimate is not TRUE, the glidein lets HTCondor decide the amount of memory. Refer to the HTCondor manual for info on MEMORY. |
GLIDEIN_MaxMemMBs_Estimate | Expr (Bool) | False |
Used in conjunction with GLIDEIN_MaxMemMBs. See GLIDEIN_MaxMemMBs for the description. |
GLIDEIN_Factory_Report_Failed | String | ALIVEONLY |
This attribute regulates advertising of validation failures to the Factory collector.
When advertised, the classad is flagged GLIDEIN_Failed=True, the error is recorded in the GLIDEIN_FAILURE_REASON and GLIDEIN_EXIT_CODE attributes, and the failing script is recorded in the GLIDEIN_LAST_SCRIPT attribute. |
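The GLIDEIN_Resource_Slots entry in the table above describes a semicolon-separated list of comma-separated resource descriptions and a set of memory suffixes. The Python sketch below illustrates only those two rules. The helper names split_resource_slots and parse_memory are hypothetical, the sample string is made up, and two points are assumptions of this sketch rather than documented behaviour: suffixes are matched case-insensitively, and a bare number is returned without a unit because the text above does not say which unit applies. Named parameters are not handled.

```python
import re
from typing import List, Tuple

# Suffix multipliers as described for GLIDEIN_Resource_Slots memory values.
_UNIT_BYTES = {
    "K": 2**10, "KB": 2**10,
    "M": 2**20, "MB": 2**20,
    "G": 2**30, "GB": 2**30,
    "T": 2**40, "TB": 2**40,
}

_MEMORY_RE = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(%|[KMGT]B?)?\s*", re.IGNORECASE)


def split_resource_slots(text: str) -> List[List[str]]:
    """Split the string into resource descriptions and positional fields."""
    descriptions = []
    for desc in text.split(";"):
        if desc.strip():
            descriptions.append([field.strip() for field in desc.split(",")])
    return descriptions


def parse_memory(text: str) -> Tuple[str, float]:
    """Return ("percent", n), ("bytes", n) or ("number", n) for a memory value.

    ("number", n) means no suffix was given; the unit is left to the caller.
    """
    match = _MEMORY_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a memory value: {text!r}")
    number = float(match.group(1))
    suffix = (match.group(2) or "").upper()
    if suffix == "%":
        return ("percent", number)
    if suffix:
        return ("bytes", number * _UNIT_BYTES[suffix])
    return ("number", number)


# Made-up input: one GPUs resource with no number, one ioslot resource
# with two instances and 1 GiB of memory.
for fields in split_resource_slots("GPUs; ioslot,2,1G"):
    print(fields)
print(parse_memory("1G"))
```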
Factory config xml - configuration
The second set of variables comes from values the Glidein Factory administrator defined to make the factory work. They are generated based on XML tags in the factory configuration (most of them in the entry tag). They cannot be changed by an administrator in any other way.
Name | Type | Source | Description |
GLIDEIN_Factory | String | <glidein factory_name="value"> |
Logical name of the Glidein Factory machine (like “osg1”). |
GLIDEIN_Name | String | <glidein glidein_name="value"> |
Identification name of the Glidein Factory instance (like “v1_0”). |
GLIDEIN_Entry_Name | String | ...<entries><entry name="value"> |
Identification name of the entry point (like “ucsd5”). |
GLIDEIN_GridType | String | ...<entries><entry gridtype="value"> |
Type of Grid resource (like “gt2”). |
GLIDEIN_Gatekeeper | String | ...<entries><entry gatekeeper="value"> |
URI of the Grid gatekeeper (like “osg1.ucsd.edu/jobmanager-pbs”). |
GLIDEIN_GlobusRSL | String | ...<entries><entry rsl="value"> |
Optional RSL string (like "(condor_submit=('+ProdSlot' 'TRUE'))"). |
PROXY_URL | String | ...<entries><entry proxy_url="value"> |
Optional URL of the site Web proxy. A special value “OSG” can be used to automatically discover the local Web proxy on OSG worker nodes. This variable is exported as GLIDEIN_Proxy_URL to the user job environment. |
DEBUG_MODE | String | ...<entries><entry verbosity="value"> |
This setting can be either:
|
Frontend Client Variables
The third set of values comes from the Glidein Frontend clients. While a client can set any number of variables, the ones described below are the most often used.
Name | Type | Description |
GLIDEIN_Client | String |
Identification name of the VO frontend request (like “ucsd5@v1_0@osg1@cms4”). |
GLIDEIN_Collector | Expr (List) |
List of Collector URIs used by the VO Condor pool (like “cc.cms.edu:9620,cc.cms.edu:9621”). One of the URIs in the list will be selected and used as HEAD_NODE in the condor_config. |
GLIDECLIENT_Hold | Expr (Bool) |
Condor expression to use to specify when a user job in the glideins should be held. If any expression is true, the job is held. This is usually done to specify "bad" jobs, such as those that claim too much memory. |
GLIDECLIENT_Group_Hold | Expr (Bool) | |
GLIDECLIENT_PREEMPT | Expr (Bool) |
Condor expression to use to specify when a user job in the glideins should be preempted. If any expression is true, the job is preempted. This is usually done to specify custom preemption policies. |
GLIDECLIENT_Group_PREEMPT | Expr (Bool) | |
GLIDECLIENT_Rank | Expr (Int) |
Used in calculating the Condor RANK. They are summed together, and the user job with the largest rank will run first. |
GLIDECLIENT_Group_Rank | Expr (Int) | |
GLIDEIN_Job_Max_Time | Int |
Max allowed time for the job to end. |
GLIDEIN_Expose_Grid_Env | Expr (Bool) |
If False, the user job environment will contain only glidein factory provided variables. If True, the user job environment will also contain the environment variables defined at glidein startup. See JOB_INHERITS_STARTER_ENVIRONMENT documentation for more details. |
GLIDEIN_Expose_X509 | Expr (Bool) |
By default, the glidein will unset the variable X509_USER_PROXY for security reasons to prevent the user jobs from accessing the pilot proxy. Setting this to true will override this behavior and keep the X509_USER_PROXY in the environment. |
SLOTS_LAYOUT | String |
Defines how multi-core glideins should split their resources. There are only two legal values:
Note: This variable MUST NOT be passed as a parameter, or the glideins will fail! |
FORCE_PARTITIONABLE | String (Bool) |
By default, single core glideins will never be configured as partitionable,
independently of the value of SLOTS_LAYOUT. If partitionable slots are desired also for single-core glideins, set this variable to "True". |
GLIDEIN_Report_Failed | String |
This attribute regulates advertising of validation failures to the user collector.
When advertised, the classad is flagged GLIDEIN_Failed=True, the error is recorded in the GLIDEIN_FAILURE_REASON and GLIDEIN_EXIT_CODE attributes, and the failing script is recorded in the GLIDEIN_LAST_SCRIPT attribute. |
GLIDEIN_CLAIM_WORKLIFE | Int |
CLAIM_WORKLIFE for non-dynamic slots. Defaults to -1 i.e. HTCondor will treat this as an infinite claim worklife and schedd will hold claim to the slot until jobs are preempted or user runs out of jobs. |
GLIDEIN_CLAIM_WORKLIFE_DYNAMIC | Int |
CLAIM_WORKLIFE for dynamically partitionable slots. Defaults to 3600. This controls how frequently the dynamically partitionable slots will coalesce. |
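The Rank, Hold and PREEMPT variables set by the Factory and by the Frontend clients are combined as described in the tables: rank terms are summed, and a hold or preempt policy triggers when any of its expressions is true. The Python sketch below models this with values that are assumed to have already been evaluated for a given job; in a running glidein these are ClassAd expressions evaluated by HTCondor, not Python values. The function names and the sample numbers are hypothetical.

```python
from typing import Iterable, Mapping

# Rank terms listed in this document (Factory side and Frontend side).
RANK_VARS = (
    "GLIDEIN_Rank",
    "GLIDEIN_Entry_Rank",
    "GLIDECLIENT_Rank",
    "GLIDECLIENT_Group_Rank",
)


def total_rank(evaluated: Mapping[str, int]) -> int:
    """Sum the rank terms that were evaluated for one job."""
    return sum(evaluated[name] for name in RANK_VARS if name in evaluated)


def is_triggered(evaluated_expressions: Iterable[bool]) -> bool:
    """A hold (or preempt) policy triggers when any expression is true."""
    return any(evaluated_expressions)


def first_to_run(jobs: Mapping[str, Mapping[str, int]]) -> str:
    """Return the job whose summed rank is largest."""
    return max(jobs, key=lambda job_id: total_rank(jobs[job_id]))


# Made-up evaluation results for two candidate jobs.
jobs = {
    "job_a": {"GLIDEIN_Rank": 1, "GLIDEIN_Entry_Rank": 1,
              "GLIDECLIENT_Rank": 5, "GLIDECLIENT_Group_Rank": 0},
    "job_b": {"GLIDEIN_Rank": 1, "GLIDEIN_Entry_Rank": 1,
              "GLIDECLIENT_Rank": 1, "GLIDECLIENT_Group_Rank": 0},
}
print(first_to_run(jobs))
print(is_triggered([False, True, False]))
```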
Cloud VM specific Variables
The following variables are only applicable to Cloud VMs.
These variables can be configured by either the factory or the frontend.
Name | Type | Default Value | Description |
VM_MAX_LIFETIME | Int | 172800 (48 hours) |
Max lifetime of the VM. When this timer is reached, the glideinwms-pilot service will terminate the glidein process, shut down the glideinwms-pilot service, and issue a VM shutdown. |
VM_DISABLE_SHUTDOWN | Expr (Bool) | False |
Disables VM from automatically shutting down after
the glideinwms-pilot service has exited. |
Dynamically generated variables
The following variables are being dynamically generated and/or modified by glideinWMS processes. The glideinWMS administrators cannot directly change them.
The first set of variables comes from the Glidein Factory.
Name | Type | Description |
GLIDEIN_Signature | String |
These variables contain the SHA1 signature of the signature files. These signatures are used as a base to ensure the integrity of all the data downloaded in the glidein startup scripts, but they also provide a fingerprint of the configuration used by the glidein. These variables are published both in the glidein ClassAd and in the user job environment. |
GLIDEIN_Entry_Signature | String | |
CONDORG_SCHEDD | String |
The schedd used by the Glidein Factory to submit the glidein. This variable is exported as GLIDEIN_Schedd both in the glidein ClassAd and to the user job environment. |
CONDORG_CLUSTER | Int |
The cluster and process id assigned by the Glidein Factory schedd to this glidein. These variables are exported as GLIDEIN_ClusterId and GLIDEIN_ProcId both in the glidein ClassAd and to the user job environment. |
CONDORG_SUBCLUSTER | Int | |
Directory Path Variables
The next set contains the location of files and/or directories downloaded or created by the glidein. Most of them are located under the working directory specified by
<entry work_dir="value">
Name | Description |
TMP_DIR |
Path to the directory that admin-provided scripts and user jobs can use for storing temporary data. This variable is exported as GLIDEIN_Tmp_Dir both to the glidein ClassAd and to the user job environment. |
GLIDEIN_LOCAL_TMP_DIR |
Path to the directory on the local file system
for storing temporary data. This variable is exported to the user job environment. |
CONDOR_VARS_FILE |
File path to the condor vars files. Admin-provided scripts may want to add entries to these files. |
CONDOR_VARS_ENTRY_FILE | |
ADD_CONFIG_LINE_SOURCE |
File path to the script containing the add_config_line and add_condor_vars_line functions. |
X509_USER_PROXY |
File path to the glidein proxy file. |
X509_CONDORMAP |
File path to the Condor mapfile used by the glidein. |
X509_CERT_DIR |
Path to the directory containing the trusted CAs' public keys and CRLs. |
CONDOR_DIR |
Directory where the glidein Condor binary distribution has been installed. |
WRAPPER_LIST |
File path to the list of wrapper scripts used by the glidein. |
GLIDEIN_WRAPPER_EXEC |
This is the executable that glideins will run (i.e. what to put after "exec" in the condor job wrapper). By default, glideins will perform exec "\$@" to run the pilot. For other modes of execution, you may need different arguments. For example, to run a program under parrot, you may need exec "$GLIDEIN_PARROT/parrot_run" -t "$parrot_tmp" "$@".
|
Machine Job Features variables (dynamic)
This set contains variables generated by the glidein startup scripts that are related to the Machine Job Features. For convenience, the descriptions of those variables are replicated on this page.
Name | Type | Description |
MJF_MACHINE_TOTAL_CPU | Int |
Number of processors which may be allocated to jobs. Typically the number of processors seen by the operating system on one worker node (that is, the number of "processor :" lines in /proc/cpuinfo on Linux), but potentially set to more or less than this for performance reasons. |
MJF_MACHINE_HS06 | Int |
Total HS06 rating of the full machine in its current setup. HS06 is measured following the HEPiX recommendations, with HS06 benchmarks run in parallel, one for each processor which may be allocated to jobs. |
MJF_MACHINE_SHUTDOWNTIME | Int |
Shutdown time for the machine as a UNIX time stamp in seconds. The value is dynamic and optional. If the file is missing, no shutdown is foreseen. |
MJF_MACHINE_GRACE_SECS | Int |
If the resource provider announces a shutdown time to the jobs on this host, that time will not be less than grace_secs seconds after the moment the shutdown time is set. This allows jobs to begin packages of work knowing that there will be sufficient time for them to be completed even if a shutdown time is announced. This value is required if a shutdown time will be set or changed which will affect any jobs which have already started on this host. |
MJF_JOB_ALLOCATED_CPU | Int |
Number of processors allocated to the current job. |
MJF_JOB_HS06_JOB | Int |
Total HS06 rating for the processors allocated to this job. The job's share is calculated by the resource provider from per-processor HS06 measurements made for the machine. |
MJF_JOB_SHUTDOWNTIME_JOB | Int |
Dynamic value. Shutdown time as a UNIX time stamp in seconds. If the file is missing no job shutdown is foreseen. The job needs to have finished all of its processing when the shutdown time has arrived. |
MJF_JOB_GRACE_SECS_JOB | Int |
If the resource provider announces a shutdowntime_job to the job, it will not be less than grace_secs_job seconds after the moment the shutdown time is set. This allows jobs to begin packages of work knowing that there will be sufficient time for them to be completed even if a shutdown time is announced. This value is static and required if a shutdown time will be set or changed after the job has started. |
MJF_JOB_JOBSTART_SECS | Int |
UNIX time stamp in seconds of the time when the job started on the worker node. For a pilot job scenario, this is when the batch system started the pilot job, not when the user payload started to run. |
MJF_JOB_JOB_ID | String |
A string of printable non-whitespace ASCII characters used by the resource provider to identify the job at the site. In batch environments, this should simply be the job ID. In virtualized environments, job_id will typically contain the UUID of the VM. |
MJF_JOB_WALL_LIMIT_SECS | Int |
Elapsed time limit in seconds, starting from jobstart_secs. This is not scaled up for multiprocessor jobs. |
MJF_JOB_CPU_LIMIT_SECS | Int |
CPU time limit in seconds. For multiprocessor jobs this is the total for all processes started by the job. |
MJF_JOB_MAX_RSS_BYTES | Int |
Resident memory usage limit, if any, in bytes for all processes started by this job. |
MJF_JOB_MAX_SWAP_BYTES | Int |
Swap limit, if any, in bytes for all processes started by this job. |
MJF_JOB_SCRATCH_LIMIT_BYTES | Int |
Scratch space limit, if any. If no quotas are used on a shared system, this corresponds to the full scratch space available to all jobs which run on the host. User jobs from EGI-registered VOs expect the "max size of scratch space used by jobs" value on their VO ID Card to be available to each job in the worst case. If there is a recognised procedure for informing the job of the location of the scratch space (e.g. EGI's $TMPDIR policy), then this value refers to that space. |
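Several of the variables above are Unix time stamps or limits in seconds, so a payload can combine them to estimate how much time it has left. The Python sketch below does this for a plain mapping of variable names to values. How a job obtains these values (environment, ClassAd, or otherwise) is not specified here, so the mapping is an assumption, as are the helper names earliest_deadline and remaining_secs.

```python
import time
from typing import Mapping, Optional


def _as_int(features: Mapping[str, object], key: str) -> Optional[int]:
    """Return the value as an int, or None when missing or empty."""
    value = features.get(key)
    if value is None or str(value).strip() == "":
        return None
    return int(value)


def earliest_deadline(features: Mapping[str, object]) -> Optional[int]:
    """Earliest known end time (Unix seconds), or None if nothing is set.

    Considers the wall limit counted from the job start, the job shutdown
    time and the machine shutdown time, whichever are available.
    """
    candidates = []
    start = _as_int(features, "MJF_JOB_JOBSTART_SECS")
    limit = _as_int(features, "MJF_JOB_WALL_LIMIT_SECS")
    if start is not None and limit is not None:
        candidates.append(start + limit)
    for key in ("MJF_JOB_SHUTDOWNTIME_JOB", "MJF_MACHINE_SHUTDOWNTIME"):
        stamp = _as_int(features, key)
        if stamp is not None:
            candidates.append(stamp)
    return min(candidates) if candidates else None


def remaining_secs(features: Mapping[str, object],
                   now: Optional[float] = None) -> Optional[int]:
    """Seconds until the earliest deadline (negative if already passed)."""
    deadline = earliest_deadline(features)
    if deadline is None:
        return None
    if now is None:
        now = time.time()
    return deadline - int(now)
```

For instance, with a job start of 1000, a wall limit of 3600 and a job shutdown time of 4000, the earliest deadline is 4000, since the announced shutdown comes before the wall limit expires at 4600.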
Glidein Script Variables (dynamic)
The last set contains various variables generated by the glidein startup scripts.
Name | Type | Description |
X509_GRIDMAP_DNS | String |
List of DNs trusted by the glidein. |
X509_EXPIRE | Expr (time_t) |
When is the proxy expected to expire. |
GLEXEC_STARTER | Expr (Bool) |
If gLExec is used and this is set to True, the condor_starter will be run sharing the same UID as the user job. |
ALTERNATIVE_SHELL | String |
If gLExec is used, this variable points to a trusted copy of a shell. |
GLEXEC_USER_DIR | String |
If gLExec is used, this variable points to the working directory under which all user jobs will be started. |
SiteWMS_WN_Draining | Expr (Bool) |
The variable controls whether or not the glidein should accept new jobs. As part of the WLCG Machine / Job Features Task Force, site admins can put worker nodes into "drain mode". In particular, if the JOBFEATURES (or MACHINEFEATURES) environment variable is set and points to a local directory containing a file named shutdowntime_job (shutdowntime), or $JOBFEATURES/shutdowntime_job ($MACHINEFEATURES/shutdowntime) is a valid URL (it is possible to wget it), then the glidein will stop accepting jobs. The SiteWMS_WN_Draining variable is periodically updated by means of the STARTD_CRON condor feature. |
SiteWMS_WN_Preempt | Expr (Bool) |
The variable controls whether or not the glidein should preempt jobs. This is still part of the WLCG Machine / Job Features Task Force (see above). The variable will become true if the shutdown time file found through the JOBFEATURES (or MACHINEFEATURES) variable contains a timestamp (Unix epoch time) that is less than 30 minutes in the future. If so, PREEMPT_GRACE_TIME will be set to 20 minutes and the job will be preempted after that time if it does not exit. The SiteWMS_WN_Preempt variable is periodically updated by means of the STARTD_CRON condor feature. |
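The draining and preemption rules described for SiteWMS_WN_Draining and SiteWMS_WN_Preempt can be sketched in Python as follows. This is not the glidein's own script: the function site_wms_flags is hypothetical, only the local-directory case is covered (the URL case is left out), and treating a shutdown time that is already in the past as "less than 30 minutes in the future" is an assumption of the sketch.

```python
import os
import time
from typing import Mapping, Optional, Tuple

PREEMPT_WINDOW_SECS = 30 * 60  # "less than 30 minutes in the future"

# Environment variable and the shutdown-time file expected in its directory.
FEATURE_FILES = (
    ("JOBFEATURES", "shutdowntime_job"),
    ("MACHINEFEATURES", "shutdowntime"),
)


def _read_shutdown_time(directory: Optional[str],
                        file_name: str) -> Tuple[bool, Optional[int]]:
    """Return (file_found, timestamp); timestamp is None if unreadable."""
    if not directory:
        return False, None
    path = os.path.join(directory, file_name)
    if not os.path.isfile(path):
        return False, None
    try:
        with open(path) as handle:
            return True, int(handle.read().strip())
    except (OSError, ValueError):
        return True, None


def site_wms_flags(environ: Optional[Mapping[str, str]] = None,
                   now: Optional[float] = None) -> Tuple[bool, bool]:
    """Return (draining, preempt) for the local-directory case only."""
    if environ is None:
        environ = os.environ
    if now is None:
        now = time.time()
    draining = False
    preempt = False
    for env_var, file_name in FEATURE_FILES:
        found, stamp = _read_shutdown_time(environ.get(env_var), file_name)
        if found:
            draining = True  # a shutdown-time file exists: stop accepting jobs
            if stamp is not None and stamp - now < PREEMPT_WINDOW_SECS:
                preempt = True
    return draining, preempt
```

Calling site_wms_flags() with no arguments reads the real environment; passing a mapping and a fixed now makes the behaviour easy to try out in isolation.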
Lifetime of a Glidein
All the various variables in the glidein configuration can be confusing. The above diagram illustrates the lifetime of a glidein pilot that has a long-running job.
- Green: For the first GLIDEIN_Retire_Time seconds (modified by a random spread, GLIDEIN_Retire_Time_Spread, so that glideins do not all end simultaneously), jobs can start on the glidein pilot.
- Yellow: During the yellow period, START will evaluate to FALSE, so no new jobs will start. However, already running jobs will continue to run for up to GLIDEIN_Job_Max_Time. (Note that the glidein will end during this period if the job ends. It will not idle afterwards, since no new jobs can start anyway.)
- Once this period is done, the DAEMON_SHUTDOWN variable will be true.
- Orange: There can be a delay of up to UPDATE_INTERVAL (usually about 5 minutes) between when DAEMON_SHUTDOWN becomes true and when it is actually updated in the collector. This is because the collector only reevaluates this expression periodically.
- Red: Once DAEMON_SHUTDOWN is true, condor gives a short grace period of GLIDEIN_Graceful_Shutdown before forcefully terminating everything and shutting down. All of these periods are totalled and calculated to fit within GLIDEIN_Max_Walltime if specified.
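The phases listed above can be added up to see whether a given set of values fits within a walltime limit. The Python sketch below is only a back-of-the-envelope check with hypothetical function names. It takes the length of the green phase as an input, because the exact way the random spread modifies GLIDEIN_Retire_Time is not described here, and it only reports whether the total fits rather than adjusting any period the way the glidein does.

```python
from typing import List, Optional, Tuple

Phase = Tuple[str, int]


def glidein_phases(green_secs: int, job_max_time: int,
                   update_interval: int, graceful_shutdown: int) -> List[Phase]:
    """Phases of a glidein running one long job, in order, with durations."""
    return [
        ("green", green_secs),        # jobs may start (retire time after spread)
        ("yellow", job_max_time),     # running job may continue, no new starts
        ("orange", update_interval),  # delay before DAEMON_SHUTDOWN is seen
        ("red", graceful_shutdown),   # grace period before forced termination
    ]


def total_lifetime(phases: List[Phase]) -> int:
    """Sum of all phase durations in seconds."""
    return sum(secs for _, secs in phases)


def fits_walltime(phases: List[Phase], max_walltime: Optional[int]) -> bool:
    """True if no limit is given or the summed phases stay within it."""
    if max_walltime is None:
        return True
    return total_lifetime(phases) <= max_walltime


# 21600 and 120 are the defaults listed in the table above; 36000 for
# GLIDEIN_Job_Max_Time and 300 for UPDATE_INTERVAL are made-up sample values.
phases = glidein_phases(21600, 36000, 300, 120)
print(total_lifetime(phases))         # 58020
print(fits_walltime(phases, 60000))   # True
print(fits_walltime(phases, 50000))   # False
```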
Note that if a job ends early in the green period, a new job will start. If a job ends after this period, then the glidein will shut down and end early. This can be seen in the example below with two jobs:
- Orange: A glidein first starts up.
- Yellow: After starting up, the glidein will wait for GLIDEIN_Max_Idle for its initial matching. If it is idle for longer than this, it will assume no jobs are available and will shut down.
- Green: A job runs (subject to the maximum limits in the previous diagram).
- Yellow: Once a job runs, the glidein will wait for GLIDEIN_Max_Tail for another job.
- Green: Another matching job runs.
- Yellow: Once the job finishes, the glidein will wait for GLIDEIN_Max_Tail for another job. If it cannot find one, it will shut down.
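The idle and tail rules in this example reduce to one comparison: before the first job the limit is GLIDEIN_Max_Idle, and after at least one job it is GLIDEIN_Max_Tail. A minimal Python sketch, using the defaults from the table above and a hypothetical function name, looks like this:

```python
def should_shut_down(idle_secs: int, jobs_completed: int,
                     max_idle: int = 1200, max_tail: int = 400) -> bool:
    """Decide whether an unmatched glidein has waited long enough to give up.

    idle_secs is the time spent waiting since startup (no job yet) or
    since the last job finished.
    """
    limit = max_idle if jobs_completed == 0 else max_tail
    return idle_secs > limit


print(should_shut_down(500, jobs_completed=0))   # False: still within Max_Idle
print(should_shut_down(500, jobs_completed=1))   # True: exceeds Max_Tail
```

The same 500 seconds of waiting leads to different outcomes depending on whether a job has already run, which is the asymmetry the two yellow periods illustrate.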