GlideinWMS The Glidein-based Workflow Management System

WMS Factory

Configuration

Example Configuration

Below is an example Factory configuration XML file.
<glidein advertise_delay="5" factory_name="factory-dstrain" glidein_name="v2_4" loop_delay="60" restart_attempts="3" restart_interval="1800" schedd_name="schedd_glideins1@submit.fnal.gov,schedd_glideins2@submit.fnal.gov" factory_collector="submit.fnal.gov:9618" entry_parallel_workers="0">
<log_retention >
<condor_logs max_days="14.0" max_mbytes="100.0" min_days="3.0" />
<job_logs max_days="7.0" max_mbytes="100.0" min_days="3.0" />
<process_logs >
<process_log extension="info" max_days="7.0" max_mbytes="100.0" min_days="3.0" msg_types="INFO" backup_count="5" compression="gz" />
<process_log extension="debug" max_days="7.0" max_mbytes="100.0" min_days="3.0" msg_types="DEBUG,ERR,WARN" backup_count="5" />
</process_logs >
<summary_logs max_days="31.0" max_mbytes="100.0" min_days="3.0" />
</log_retention >
<monitor base_dir="/var/www/html/glidefactory/monitor" flot_dir="/opt/javascriptrrd-0.6.3/flot" javascriptRRD_dir="/opt/javascriptrrd-0.6.3/src/lib" jquery_dir="/opt/javascriptrrd-0.6.3/flot" />
<monitor_footer display_txt="Legal Disclaimer" href_link="/site/disclaimer.html" />
<security key_length="2048" pub_key="RSA" reuse_oldkey_onstartup_gracetime="900" remove_old_cred_freq="24" remove_old_cred_age="30">
<frontends >
<frontend name="vofrontend" identity="vofrontend@vofrontend.fnal.gov" >
<security_classes >
<security_class name="frontend" username="frontend1" />
</security_classes >
</frontend >
</frontends >
</security >
<stage base_dir="/var/www/html/glidefactory/stage" use_symlink="True" web_base_url="http://factory.fnal.gov:9000/glidefactory/stage"/>
<submit base_client_log_dir="/opt/clientlogs/clients/logs" base_client_proxies_dir="/opt/clientlogs/clients/proxies" base_dir="/opt/wmsfactory/" base_log_dir="/opt/wmsfactory/logs" />
<attrs>
<attr name="CONDOR_VERSION" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="default" />
<attr name="GCB_ORDER" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="string" value="NONE" />
<attr name="GLEXEC_JOB" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="True" />
<attr name="USE_CCB" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="string" value="False" />
<attr name="USE_MATCH_AUTH" const="False" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="True" />
</attrs>
<entries>
<entry name="EXAMPLE_ENTRY" enabled="True" auth_method="grid_proxy" trust_domain="OSG" gatekeeper="gatewayname.fnal.gov/jobmanager-condor" gridtype="gt2" rsl="(queue=default)(jobtype=single)" schedd_name="wmscollector.fnal.gov" verbosity="std" work_dir="OSG">
<config>
<max_jobs>
<per_entry held="1000" idle="2000" glideins="10000"/>
<default_per_frontend held="50" idle="100" glideins="5000"/>
<per_frontends>
<per_frontend name="FRONTEND:SECURITY_CLASS" held="50" idle="100" glideins="5000"/>
</per_frontends>
</max_jobs>
<release max_per_cycle="20" sleep="0.2"/>
<remove max_per_cycle="5" sleep="0.2"/>
<submit cluster_size="10" max_per_cycle="100" sleep="0.2" slots_layout="partitionable">
<submit_attrs>
<submit_attr name="RequestMemory" value="2000"/>
</submit_attrs>
</submit>
</config>
<allow_frontends />
<attrs>
<attr name="CONDOR_ARCH" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="default"/>
<attr name="CONDOR_OS" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="default" />
<attr name="GLEXEC_BIN" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="NONE"/>
<attr name="GLIDEIN_Site" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="FNAL_EXAMPLE_SITE"/>
<attr name="GLIDEIN_CPUS" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="1"/>
<attr />
</attrs>
<monitorgroups/>
<files/>
<infosys_refs />
</entry>
</entries>
<files>
<file absfname="/usr/conf/sethome.source" after_entry="False" const="True" executable="False" untar="False" wrapper="True" />
</files>
<condor_tarballs>
<condor_tarball arch="default" base_dir="/opt/wmscollector/" os="default" tar_file="/var/www/html/glidefactory/stage/glidein_v2_4/condor_bin_default-default-default.a83ePm.tgz" version="default"/>
</condor_tarballs>
<monitoring_collectors>
<monitoring_collector DN="/DC=org/DC=doegrids/OU=Services/CN=factmoncollector.fnal.gov" node="factmoncollector.fnal.gov" secondary="False" group="default" />
<monitoring_collector DN="/DC=org/DC=doegrids/OU=Services/CN=factmoncollector.fnal.gov" node="factmoncollector.fnal.gov:9620-9819" secondary="True" group="default" />
<monitoring_collector DN="/DC=org/DC=doegrids/OU=Services/CN=factmoncollector2.fnal.gov" node="factmoncollector2.fnal.gov" secondary="False" group="ha" />
<monitoring_collector DN="/DC=org/DC=doegrids/OU=Services/CN=factmoncollector2.fnal.gov" node="factmoncollector2.fnal.gov:9620-9919" secondary="True" group="ha" />
</monitoring_collectors>
</glidein>

The configuration file

The configuration file is an XML document. It contains both global arguments and configuration specific to each entry point. At least one entry point must be specified in the configuration file.
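The "at least one entry point" rule can be checked before a reconfiguration with a few lines of Python. This is only a minimal sketch using the standard library: the helper names `entry_names` and `check_has_entry` and the sample document are hypothetical. It assumes entries live under `<glidein><entries><entry>` as in the example above, and it does not look at entries defined inside `<entry_sets>`.

```python
import xml.etree.ElementTree as ET


def entry_names(config_xml):
    """Return the names of all <entry> elements directly under <entries>."""
    root = ET.fromstring(config_xml)  # root is expected to be <glidein>
    return [entry.get("name") for entry in root.findall("./entries/entry")]


def check_has_entry(config_xml):
    """Raise ValueError if the configuration defines no entry point."""
    names = entry_names(config_xml)
    if not names:
        raise ValueError("the factory configuration defines no entry point")
    return names


# Hypothetical, heavily trimmed configuration used only for illustration.
sample = """
<glidein factory_name="factory-example" glidein_name="v1">
  <entries>
    <entry name="EXAMPLE_ENTRY" enabled="True"/>
  </entries>
</glidein>
"""
print(check_has_entry(sample))
```

The Factory's own reconfiguration step performs its own validation; a check like this is only useful as an early sanity test in scripts that generate the file.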

The tags of the XML configuration file are described below. Each is given a designation:

  • Required: You must change or examine this in order for the factory to function correctly.
  • Recommended: The installer provides a good default, but you should examine this attribute to make sure it is correct for your installation.
  • Optional: The installer-provided default is likely correct for your installation. Change this only if your particular configuration requires special treatment or fine-tuning.

Global arguments

Global arguments are common to all entry points but can be overridden by individual entry point configuration.


The other arguments are for advanced admins only, and are explained in a dedicated section.

Entry point arguments

The following are arguments that are specific to each entry point. They override the global arguments if present.

Grouping entries into metasites

Starting in v3.2.20, entries with similar configuration can be grouped in <entry_sets> to form what we call a metasite. <entry_sets> starts right after the <entries> tag is closed. A metasite is defined by starting an <entry_set>, which contains the common configuration for the different entries, followed by an <entries> tag containing the different entries for the metasite.

The following is an example of two different entry sets, each containing two separate entries.

The attributes auth_method, gridtype, and trust_domain must be the same in all entry elements (they are used for credential generation in the frontend).

Only the metasite is advertised as a glideresource classad, so the Frontend sees only one element (entry). The Frontend applies limits transparently as before, and so does the Factory. Right before glidein submission, the Factory detects that there are multiple CEs where the work can be sent (multiple submission files), counts the running+idle glideins for each CE, and sends jobs to the one with the fewest. The submission file is added to the job's condor submit file for accounting purposes (GlideinEntrySubmitFile classad).
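The balancing rule described above can be pictured with a short Python sketch. It illustrates only the "fewest running+idle glideins" choice and is not the Factory's actual code; the function name `pick_submit_entry`, the shape of the `counts` dictionary, and the load figures are assumptions made for the example.

```python
def pick_submit_entry(counts):
    """Return the entry name with the fewest running+idle glideins.

    `counts` maps a (hypothetical) entry name to a dict with the keys
    "running" and "idle". Ties are broken by entry name so the result is
    deterministic; the real Factory may break ties differently.
    """
    if not counts:
        raise ValueError("an entry set needs at least one entry")
    return min(
        counts,
        key=lambda name: (counts[name]["running"] + counts[name]["idle"], name),
    )


# Load figures invented for illustration.
load = {
    "ITB_FC_CE3_121": {"running": 40, "idle": 10},
    "ITB_FC_CE3_025": {"running": 30, "idle": 5},
}
print(pick_submit_entry(load))
```

With these numbers the second CE carries 35 glideins against 50 on the first, so it would receive the next submission.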

...
</entries>
<entry_sets>
<entry_set alias="ITB_FC_CE3" enabled="True">
<config>
<max_jobs>
<default_per_frontend glideins="5000" held="100" idle="400"/>
<per_entry glideins="10000" held="1000" idle="4000"/>
<per_frontends>
</per_frontends>
</max_jobs>
<release max_per_cycle="20" sleep="0.2"/>
<remove max_per_cycle="5" sleep="0.2"/>
<restrictions require_glidein_glexec_use="False" require_voms_proxy="False"/>
<submit cluster_size="10" max_per_cycle="100" sleep="0.2" slots_layout="partitionable">
<submit_attrs>
</submit_attrs>
</submit>
</config>
<allow_frontends>
</allow_frontends>
<attrs>
<attr name="CONDOR_ARCH" const="False" glidein_publish="False" job_publish="False" parameter="True"
publish="True" type="string" value="default"/>
<attr name="CONDOR_OS" const="False" glidein_publish="False" job_publish="False" parameter="True"
publish="True" type="string" value="default"/>
<attr name="GLEXEC_BIN" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="NONE"/>
<attr name="GLIDEIN_Site" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="ITB_GRATIA_TEST"/>
<attr name="GLIDEIN_Supported_VOs" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="CMS"/>
<attr name="GLIDEIN_CPUS" const="True" glidein_publish="False" job_publish="True" parameter="True" publish="True" type="string" value="8"/>
</attrs>
<files>
</files>
<infosys_refs>
</infosys_refs>
<monitorgroups>
</monitorgroups>
<entries>
<entry name="ITB_FC_CE3_121" auth_method="grid_proxy" enabled="True" gatekeeper="fermicloud121.fnal.gov/jobmanager-condor" gridtype="gt2" rsl="(queue=default)(jobtype=single)" schedd_name="schedd_glideins4@fermicloud092.fnal.gov" trust_domain="grid" verbosity="std" work_dir="OSG">
</entry>
<entry name="ITB_FC_CE3_025" auth_method="grid_proxy" enabled="True" gatekeeper="fermicloud025.fnal.gov/jobmanager-condor" gridtype="gt2"
rsl="(queue=default)(jobtype=single)"
schedd_name="schedd_glideins4@fermicloud092.fnal.gov"
trust_domain="grid" verbosity="std" work_dir="OSG">
</entry>
</entries>
</entry_set>
<entry_set alias="T2_CH_CERN" enabled="True">
<config>
<max_jobs>
<default_per_frontend glideins="5000" held="50" idle="100"/>
<per_entry glideins="10000" held="1000" idle="4000"/>
<per_frontends>
</per_frontends>
</max_jobs>
<release max_per_cycle="20" sleep="0.2"/>
<remove max_per_cycle="5" sleep="0.2"/>
<restrictions require_glidein_glexec_use="False"
require_voms_proxy="False"/>
<submit cluster_size="10" max_per_cycle="10" sleep="2"
slots_layout="fixed">
<submit_attrs>
</submit_attrs>
</submit>
</config>
<allow_frontends>
</allow_frontends>
<attrs>
<attr name="GLEXEC_BIN" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="glite"/>
<attr name="GLIDEIN_CMSSite" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="T2_CH_CERN"/>
<attr name="GLIDEIN_CPUS" const="True" glidein_publish="False" job_publish="True" parameter="True" publish="True" type="string" value="8"/>
<attr name="GLIDEIN_Country" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="CH"/>
<attr name="GLIDEIN_MaxMemMBs" const="True" glidein_publish="False" job_publish="True" parameter="True" publish="True" type="int" value="22000"/>
<attr name="GLIDEIN_Max_Walltime" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="int" value="257400"/>
<attr name="GLIDEIN_ResourceName" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="CERN-PROD"/>
<attr name="GLIDEIN_Retire_Time" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="int" value="108000"/>
<attr name="GLIDEIN_Site" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="CERN"/>
<attr name="GLIDEIN_Supported_VOs" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="CMS"/>
</attrs>
<files>
</files>
<infosys_refs>
</infosys_refs>
<monitorgroups>
</monitorgroups>
<entries>
<entry name="CMSHTPC_T2_CH_CERN_ce301" auth_method="grid_proxy" comment="Converted to multicore 2016-03-29-Vassil" enabled="True" gatekeeper="ce301.cern.ch:8443/cream-lsf-grid_cms" gridtype="cream" rsl="WholeNodes = False; HostNumber = 1; CPUNumber = 8" trust_domain="grid" verbosity="std" work_dir=".">
</entry>
<entry name="CMSHTPC_T2_CH_CERN_ce302" auth_method="grid_proxy" comment="Converted to multicore 2016-03-31-Vassil" enabled="True" gatekeeper="ce302.cern.ch:8443/cream-lsf-grid_cms" gridtype="cream" rsl="WholeNodes = False; HostNumber = 1; CPUNumber = 8" trust_domain="grid" verbosity="std" work_dir=".">
</entry>
</entries>
</entry_set>
</entry_sets>
...
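Because auth_method, gridtype, and trust_domain must be identical across the entries of an entry set, it can be handy to check this before reconfiguring. The following is a minimal sketch using Python's standard library; the helper name `inconsistent_entry_sets` and the sample aliases and entry names are hypothetical, and it assumes the `<entry_sets>` fragment has been extracted as a well-formed XML document.

```python
import xml.etree.ElementTree as ET

# Attributes that the text above requires to match within one entry set.
SHARED_ATTRS = ("auth_method", "gridtype", "trust_domain")


def inconsistent_entry_sets(entry_sets_xml):
    """Return {alias: [attribute names that differ between its entries]}."""
    root = ET.fromstring(entry_sets_xml)
    problems = {}
    for entry_set in root.iter("entry_set"):
        entries = entry_set.findall("./entries/entry")
        differing = [
            attr for attr in SHARED_ATTRS
            if len({entry.get(attr) for entry in entries}) > 1
        ]
        if differing:
            problems[entry_set.get("alias")] = differing
    return problems


# Hypothetical fragment: the second set mixes two grid types.
sample = """
<entry_sets>
  <entry_set alias="GOOD_SET" enabled="True">
    <entries>
      <entry name="ce1" auth_method="grid_proxy" gridtype="gt2" trust_domain="grid"/>
      <entry name="ce2" auth_method="grid_proxy" gridtype="gt2" trust_domain="grid"/>
    </entries>
  </entry_set>
  <entry_set alias="BAD_SET" enabled="True">
    <entries>
      <entry name="ce3" auth_method="grid_proxy" gridtype="gt2" trust_domain="grid"/>
      <entry name="ce4" auth_method="grid_proxy" gridtype="cream" trust_domain="grid"/>
    </entries>
  </entry_set>
</entry_sets>
"""
print(inconsistent_entry_sets(sample))
```

An empty dictionary means every entry set passed the check.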

Advanced topics

While the above is enough for setting up a personal glidein pool on the local area network, you will need to do more fine tuning when deploying a larger one. In this section, the various advanced aspects of glidein pools will be presented.

Integration with Singularity

  • If an entry is capable of Singularity AND enforces the use of Singularity, you must have <attr name="GLIDEIN_SINGULARITY_REQUIRE" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="True"/> in <attrs>, and SINGULARITY_BIN must be given the full path name of the singularity binary, for example, <attr name="SINGULARITY_BIN" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="/usr/bin/singularity"/> in <attrs>.

  • If an entry is not capable of Singularity at all, make sure you don't have SINGULARITY_BIN in <attrs>. SINGULARITY_BIN should NEVER be set in the factory configuration if your site does not support Singularity (alternatively, set it to the keyword NONE, uppercase).

  • If an entry is capable of Singularity BUT does not necessarily enforce the use of Singularity, you don't need to have GLIDEIN_SINGULARITY_REQUIRE in <attrs>, but SINGULARITY_BIN must be given the path of the singularity binary (not including the filename), for example, <attr name="SINGULARITY_BIN" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="/usr/bin"/>.

  • If the value of SINGULARITY_BIN is not a valid Singularity path (and is different from NONE), then GlideinWMS will search the node for the singularity binary, first using $PATH, then by trying to invoke module, for example, <attr name="SINGULARITY_BIN" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="yes"/>.

Integration with gLExec

As you may have noticed, all of the glideins are submitted with the same service proxy. While this has the advantage of simplifying the architecture and improving both efficiency and VO control, it does have a few problems:

  • All glidein scripts, HTCondor daemons, AND user jobs run under the same Unix UID. So users can interfere with the glidein tasks, possibly hacking the system.
    Plus, when several glideins start on the same node (on multi processor/core machines), one user job can interfere with another user job.

  • The real user is never authenticated against the Grid site authorization infrastructure. This makes it impossible for the sites to enforce their policies or to analyze the usage of their resources; they see only glideins. This makes them very unfriendly toward the glidein-based WMS.

To solve these problems, some Grid sites are deploying gLExec on the worker nodes. gLExec is a service that will take the following:

  • the user proxy, and
  • the desired command

It will contact the local authorization and mapping system, switch to the UID of the user (as opposed to the glidein UID), and execute the provided command as that UID.

By using gLExec, a Glidein Factory can get rid of both of the above problems, and still keep all the advantages.

To enable gLExec support, you need to specify:

<glidein>
 [<entries><entry>]
 <files>
  <file absfname="web_base/glexec_setup.sh" executable="True"/>
 </files>
 <attrs>
 <attr name="GLEXEC_BIN" value="path to glexec" publish="False" parameter="True"/>

For most current gLExec installations this comes down to:

<attr name="GLEXEC_BIN" value="OSG" publish="False" parameter="True"/>

More details about scripts in general can be found in the "custom code" section.

You will also need to properly configure the shadow config files on the submit machine, by adding the following to the condor_config:

GLEXEC_STARTER = True
GLEXEC = /bin/false

As of version 7.1.3 of HTCondor, a new, better glexec operation mode is supported; in the old operation mode, condor_startd invoked condor_starter through glexec. The result was that condor_starter was running under the same UID as the user job, leaving it vulnerable to attack from a malicious user. The new operating mode solves this by having condor_starter run the user jobs via glexec; this adds a little more overhead to handle the user jobs, but makes the system much more secure.

To enable the new operation mode, add the following line to your configuration file:

<attr name="GLEXEC_JOB" value="True" publish="False" parameter="True"/>

Note that you still need to set GLEXEC_BIN, too.

Warning: Use it only if you use HTCondor 7.1.3 or later, as it will not work on any older HTCondor version!

Troubleshooting

gLExec installations on at least one site had problems with delegated proxies. If in doubt, try to disable the delegation.

To disable delegation, add the following to the shadow configuration file:

DELEGATE_JOB_GSI_CREDENTIALS=False
SEC_DEFAULT_ENCRYPTION=PREFERRED

Then, set the following tags in the glidein creation file:

<glidein>
 [<entries><entry>]
  <attrs>
    <attr name="SEC_DEFAULT_ENCRYPTION" value="REQUIRED" publish="False" parameter="True"/>

Another thing to consider is the startup directory; it must be accessible by both the starting user and the target user(s). The directory you usually start in the Grid is most often not readable by any other user, so you must select something else. Both HTCondor and OSG should be fine, or you can specify any other fixed, WN-local location.

Private networks and firewalls

HTCondor daemons need two-way communication in order to work properly. This clashes with the network policies of most Grid sites, which have worker nodes in private networks or implement a restrictive firewall.

HTCondor provides the CCB mechanism to address this. It previously also provided a second mechanism, GCB, but GCB is no longer supported in recent versions, so remove it if you still have it in your configuration.

CCB - Condor Connection Broker

CCB was introduced in Condor v7.3.0 to replace GCB in most circumstances. It is much more reliable than GCB and also easier to set up.
The detailed description of CCB is beyond the scope of this manual and you should refer to the HTCondor documentation available at http://research.cs.wisc.edu/htcondor/manual/v8.0/3_7Networking_includes.html#sec:CCB. Here you will find only the parameters needed to enable it in the glideins.

To use HTCondor with CCB, you need to specify:

<glidein>
 [<entries><entry>]
 <attrs>
  <attr name="USE_CCB" value="True" publish="False" parameter="True"/>

and you are done. Just make sure you follow the suggested scalability guidelines described in the HTCondor manual.

Security handles

As mentioned in the startup page, the glidein pool must be properly configured to protect it from hackers and malicious users. The same page also describes what needs to be done on the collector machine.
The glidein itself can also be configured. The default configuration works fine for most users, but you may need to change it.

The values are set using the <attr /> option, and the default values are:

  • SEC_DEFAULT_ENCRYPTION=OPTIONAL
  • SEC_DEFAULT_INTEGRITY=REQUIRED
  • DELEGATE_JOB_GSI_CREDENTIALS=False

As of HTCondor version 7.1.3, HTCondor also supports a more efficient authentication mechanism between the condor_schedd/condor_shadow and condor_startd/condor_starter. This method uses the match ClaimId as a shared password for authentication between these daemons. Since using a shared secret is much cheaper than using GSI authentication, this should be used whenever it is feasible.

This option is enabled by default. It is controlled with the <attr /> option:

<attr name="USE_MATCH_AUTH" ... value="True" ... /> ... enabled
<attr name="USE_MATCH_AUTH" ... value="False" ... /> ... disabled

When enabled, the following HTCondor attribute must be set in the condor_config of the submit machine. It is not used by the HTCondor negotiator or collector, so it is not needed on those machines if they are installed separately.

SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True

Using TCP to send updates to the Collector

By default, HTCondor uses UDP packets to communicate between the glideins and the Collector. While more efficient than TCP, UDP packets are often blocked at the firewall or lost on the WAN.

For this reason, GlideinWMS configures the glideins to update the user collector using TCP by default. To disable TCP updates, specify the following with the <attr /> option:

UPDATE_COLLECTOR_WITH_TCP=False

Please be aware that this setting configures the glideins only; you still need to properly configure the Collector machine. See the HTCondor documentation for more details.

Multiple Collectors

By default, HTCondor uses only one Collector for the glidein (user) pool. However, if the load becomes too high on the collector, you can configure multiple collectors in a chain.

You will need a master and a set of slave collectors. Each slave collector will service a portion of the pool and will forward communication between the startd daemons to the master collector. Machine classads from these startd's will be sent to the master collector. The negotiator and the schedds will talk to the master collector, and the startds will talk to one of the slave ones. This will reduce load on the central manager.

To set up a slave collector in the glidein (user) pool, one way is to set the following environment variables before starting up the condor_master:

COLH=`condor_config_val COLLECTOR_HOST`
LD=`condor_config_val LOCAL_DIR`
export _CONDOR_COLLECTOR_HOST=$COLH:
export _CONDOR_MASTER_NAME=collector_
export _CONDOR_DAEMON_LIST="MASTER, COLLECTOR"
export _CONDOR_LOCAL_DIR=$LD/$_CONDOR_MASTER_NAME
export _CONDOR_LOCK=$_CONDOR_LOCAL_DIR/lock
# Forward all the traffic to the main collector
export _CONDOR_CONDOR_VIEW_HOST=localhost:9618

Using localhost to communicate with the master collector is more efficient, but it will not work if the collectors are on different hosts or if you are using an HTCondor version older than 8.2. In those cases you will need: export _CONDOR_CONDOR_VIEW_HOST=$COLH:9618

Once you have the slave collectors set up, you will want to use them.

The VO Frontend will have to point the factory to a list of collectors.

The configuration internally will add a line in the factory configuration file that will set up the glideins to handle the multiple collectors. (You should now see a line like: "<file absfname="web_base/collector_setup.sh" executable="True"/>" after reconfiguring).

Setting the glidein start and rank condition

As with any HTCondor pool, you may need to set the startd start and rank conditions.
For a glidein, you can set this with the <attr/> options:

GLIDEIN_Start=expression
GLIDEIN_Rank=expression

For example:

<glidein>
 [<entries><entry>]
  <attrs>
    <attr name="GLIDEIN_Start" value="Owner==&quot;sfiligoi&quot;" publish="False" parameter="True"/>
    <attr name="GLIDEIN_Rank" value="ImageSize" publish="False" parameter="True"/>
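Note that the double quotes inside the GLIDEIN_Start expression above are written as &quot; because the expression sits inside an XML attribute. If you generate such lines from a script, the escaping can be delegated to Python's standard library. This is a minimal sketch; the helper name `attr_value` is hypothetical.

```python
from xml.sax.saxutils import escape


def attr_value(expression):
    """Escape a ClassAd expression for use inside a double-quoted XML attribute."""
    # escape() handles &, < and > itself; the extra mapping covers double quotes.
    return escape(expression, {'"': "&quot;"})


start_expr = 'Owner=="sfiligoi"'
print('<attr name="GLIDEIN_Start" value="%s" publish="False" parameter="True"/>'
      % attr_value(start_expr))
```

The same helper also takes care of comparison operators such as < and >, and of &&, which would otherwise make the configuration file ill-formed.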

Internal Configuration

The configuration is parsed during the reconfiguration of the Factory and split into a number of files:
  • job.descript: read by the daemon to decide how to work
  • attributes.cfg: fixed values, published in the Factory classad
  • params.cfg: values the Frontend can change, also published in the Factory classad
For more information, see the Entry Internals page.

Multiple Condor Tarballs

One frequent problem is that one particular condor binary will not run on all compute nodes. Entry points may require different architectures or have different versions of glibc (e.g., SL3 does not have glibc 2.4).

The solution is to have multiple condor binaries. The way to do this is to specify a tarball tag in the factory configuration file.

  1. Download the HTCondor binary from the University of Wisconsin site. (Alternatively, you can build it from scratch on the architecture. Refer to the HTCondor instructions for this.)

    The glideinwms pilot uses a subset of the condor binaries/libraries. The create_condor_tarball script can be used to reduce space needed on your factory node. Details on this script can be found in the Components - Tools section of the documentation.

  2. Add a new condor_tarball tag to the factory configuration file:
    There are two ways you can do this:
    • put the tarball in a directory owned by the wmsfactory and enter the condor_tarball tag as:
      <glidein ... >
      ...
        <condor_tarballs >
          <condor_tarball os="OS" arch="Arch" tar_file="ZIPPED_TARFILE" version="Condor_Version" />
    • or put the tarball in a directory owned by the wmsfactory and unzip/untar it. Then, enter the condor_tarball tag as:
      <glidein ... >
      ...
        <condor_tarballs >
          <condor_tarball os="OS" arch="Arch" base_dir="DIR_OF_UNTARRED_BINARY" version="Condor_Version" />

    Starting with v2.6.2, to simplify the configuration, os, arch and version accept comma-separated values. This can drastically reduce the number of condor_tarball entries needed in the configuration file.

    Consider the example below, where the default os is rhel5 and the default arch is x86_64. If the factory admin also wants the os and arch information explicitly available, the configuration needs the following entries to cover the possible combinations.

    <condor_tarball os="default" arch="default" base_dir="dir" version="default"/>
    <condor_tarball os="rhel5" arch="x86_64" base_dir="dir" version="default"/>
    <condor_tarball os="rhel5" arch="default" base_dir="dir" version="default"/>
    <condor_tarball os="default" arch="x86_64" base_dir="dir" version="default"/>
    The above example can be consolidated into a single condor_tarball entry as shown below; the factory reconfiguration process will internally consider all the combinations. This also applies to the version.
    <condor_tarball os="default,rhel5" arch="default,x86_64" base_dir="dir" version="default"/>
  3. Verify your entry point attributes. Each entry point will have the following attr set up. Make sure that this matches the above condor_tarball parameters:
    <entry>
       <attrs>
         <attr name="CONDOR_ARCH" const="True" parameter="True" glidein_publish="False" job_publish="False" publish="False" type="string" value="Arch"/>
         <attr name="CONDOR_OS" const="True" parameter="True" glidein_publish="False" job_publish="False" publish="False" type="string" value="OS"/>
         <attr name="GLEXEC_JOB" const="True" parameter="True" glidein_publish="False" job_publish="False" publish="False" type="string" value="True"/>
       </attrs>
    The CONDOR_OS and the CONDOR_ARCH should match the os and arch defined in the tarball tag. If set to "auto", the glidein will decide the appropriate tarball to use for that worker node. By default, the CONDOR_VERSION will be defined globally in <glidein><attrs> and should match the version in the condor_tarball tag. You can overwrite this global version and define one locally in the entry if needed.
  4. Reconfigure the factory using the command:
    ./factory_startup reconfig ../CONFIG_DIR/glideinWMS.xml
  5. After reconfig, you can see the tar_file created from the condor distribution in the condor_tarball element in the configuration as (if you populated just tar_file in step 2)
    <condor_tarball arch="default" os="default" tar_file="FACTORY_DIR/condor-7.8.tgz" version="default"/>
    or (if you populated just base_dir in step 2)
    <condor_tarball arch="default" base_dir="/opt/glideins/git-xen21-master-ps.ini/condor-wms" os="default" tar_file="/var/www/html/glideinwms/factory_service/stage/glidein_master/condor_bin_0.d7bbk9.tgz" version="default"/>
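The effect of the comma-separated values from step 2 can be illustrated with a short Python sketch. It only shows which (os, arch, version) combinations a single condor_tarball entry stands for; the function names are hypothetical and this is not the code the factory reconfiguration actually runs.

```python
from itertools import product


def split_values(text):
    """Split a comma-separated attribute value into a list of names."""
    return [item.strip() for item in text.split(",")]


def expand_tarball(os_value, arch_value, version_value):
    """Return every (os, arch, version) combination covered by one entry."""
    return list(product(split_values(os_value),
                        split_values(arch_value),
                        split_values(version_value)))


for combo in expand_tarball("default,rhel5", "default,x86_64", "default"):
    print(combo)
```

For the consolidated entry in step 2 this yields four combinations, matching the four separate condor_tarball lines it replaces.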

Limiting time spent on a Grid resource

The whole concept of gliding into Grid resources is based on the idea that you are getting those resources on a temporary basis. This implies that you need to leave the slot as soon as possible, else your jobs will simply be killed by the annoyed Grid administrators.
On the other hand, submitting new glideins is not cost free, so you want to keep the resource for at least some period of time.

The glideins have two mechanisms to regulate this:

  1. After a specified amount of time, the glidein will enter the RETIRING state. This means, it will wait for the current job to finish (or kill it if it does not end within a configurable timeout) and exit immediately afterwards. This obviously implies that no new jobs will start after it entered that state.
    The two timeouts can be set with the <attr /> options:

    GLIDEIN_Retire_Time=nr_of_seconds
    GLIDEIN_Job_Max_Time=nr_of_seconds

    The two default to 2 and 100 hours.

  2. If a glidein is not claimed within a configurable timeout, the glidein will exit.
    The timeout can be set with:

    GLIDEIN_Max_Idle=nr_of_seconds
    GLIDEIN_Max_Tail=nr_of_seconds

    There are two configurable parameters for this timeout behavior. The first, GLIDEIN_Max_Idle, controls how long a glidein will wait for its first job. The second, GLIDEIN_Max_Tail, controls how long a glidein will wait for a subsequent job once it has finished a job. The defaults for these are 1200 and 400 seconds, respectively.

An example:

<glidein>
 [<entries><entry>]
  <attrs>
    <attr name="GLIDEIN_Max_Idle" value="300" type="int" publish="False" parameter="True"/>
    <attr name="GLIDEIN_Retire_Time" value="14400" type="int" publish="False" parameter="True"/>
    <attr name="GLIDEIN_Job_Max_Time" value="180000" type="int" publish="False" parameter="True"/>
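All of these timeouts are expressed in seconds, which is easy to get wrong when thinking in hours. The sketch below reproduces the values used in the example above (5 minutes, 4 hours, and 50 hours); the helper `seconds_attr` is hypothetical and simply renders an <attr/> line in the same form as the example.

```python
def seconds_attr(name, hours=0, minutes=0):
    """Render a timeout <attr/> line; the value is converted to seconds."""
    seconds = int(hours * 3600 + minutes * 60)
    return ('<attr name="%s" value="%d" type="int" publish="False" parameter="True"/>'
            % (name, seconds))


print(seconds_attr("GLIDEIN_Max_Idle", minutes=5))      # 5 min  -> 300 s
print(seconds_attr("GLIDEIN_Retire_Time", hours=4))     # 4 h    -> 14400 s
print(seconds_attr("GLIDEIN_Job_Max_Time", hours=50))   # 50 h   -> 180000 s
```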

Old-style pseudo-interactive monitoring

Since v1_4_1, the pseudo-interactive monitoring uses a dedicated startd in the glideins for monitoring purposes. This allows for monitoring even when the job starter enters the “Retiring” activity.

The side effect is that the cross-VM statistics are no longer available, and the names of the slots are also different.

To enable the old mode, use:

<attr name="MONITOR_MODE" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="MULTI"/>

Adding custom files/scripts to the glideins

While the provided code should cover most of the general purpose use cases, some administrators may have additional needs. For these cases, the Glidein framework provides the possibility to download and process additional files. Both the Factory and the Frontend configuration allow you to specify lists of files either for all their glideins, or for a specific entry (Factory) or job group (Frontend). These files can be scripts (executables), regular files, or tar-balls, and depending on the options they can be downloaded at different times (see the custom scripts document for more details).

Note: Files and subsystems will be downloaded before the scripts. User provided scripts will be executed in the specified order, and before the HTCondor daemons are started up.

Here a list of the attributes of the files. Some examples follow below:

Attribute Name

Attribute Description

absfname

Path of the file on the server (Factory or Frontend)

executable

True if the file is a script (executable, see example below), default is False

wrapper

True if the file is a user wrapper (see example below), default is False

untar

True if the file is a tar-ball that needs to be expanded (see example below), default is False

const

If False the file is not constant (i.e. changes may happen without a reconfiguration of the factory and the file cannot be checksummed), default is True

relfname

Path to save the file, relative to the glidein main directory

period

If period>0, it is the period (in seconds) at which the executable is run. Default is 0 (non-periodic script). This is ignored for non-executable files. (see the custom scripts document for more)

prefix

STARTD_CRON prefix, prepended to all HTCondor variables generated by the script (see documentation). This is ignored for anything other than a periodic executable script. The default value is GLIDEIN_PS_; use NOPREFIX to have no prefix. Be aware that without a prefix your HTCondor variables may collide with those of other scripts or with other variables. Variables used by the wrapper, like GLIDEIN_PS_LAST, will keep using the prefix.

comment

Arbitrary comment string

after_entry

If True, the script is executed after the entry scripts. Default is False for the Factory, default is True for the Frontend. (see the custom scripts document for more)

after_group

If True, the Frontend script is executed after the group scripts. Default is False. Not considered in the Factory. (see the custom scripts document for more)

  • <glidein>
     [<entries><entry>]
     <files>
      <file absfname="script name" executable="True" prefix="cron_prefix" comment="comment"/>

    Path to the custom script. The script will be copied to the Web-accessible area, and when a glidein starts, the glidein startup script will pull it and execute it. If any parameters are needed, they can be specified using <attr />, or stored in a file (see below).
    For more detailed information, see the page dedicated to writing custom scripts.

  • <glidein>
     [<entries><entry>]
     <files>
       <file absfname="script name" wrapper="True" comment="comment"/>

    Path to the wrapper custom script. The script will be copied to the Web-accessible area, and will be sourced just before a user job starts; i.e. it will become part of the user job wrapper.

  • <glidein>
     [<entries><entry>]
     <files>
       <file absfname="local file name" relfname="target file name" const="Bool" executable="False" comment="comment"/>

    Path to the config file. The file will be copied to the Web-accessible area, and pulled by the glidein startup script when a glidein starts. It can then be used by any script (see above).
    Note: Please be cautious in using the const flag; if set to False, the content of the file will not be verified by the glidein startup script and could be tampered with in transit by a malicious user. So never put sensitive data (like the switch to disable security checks) in a changeable file.

  • <glidein>
     [<entries><entry>]
     <files>
      <file absfname="local file name" untar="True" comment="comment">
        <untar_options cond_attr="conf_sw" dir="dir name" absdir_outattr="attr name"/>

    Sometimes it is useful to transfer a whole set of files, or even directories, and that is much easier to accomplish by means of a tar-ball. A subsystem is the glidein way to describe a compressed tarball that is delivered to the worker nodes, untarred in a separate directory and advertised to the other scripts.

    • absfname: Path to the custom tar-ball. (like "/tmp/mytar_v12.5.tgz")
    • conf_sw: Name of a configuration switch. (like "ENABLE_KRB5")
      The tarball will be unpacked only if that parameter is set to 1. Use the <attr /> switch to define the default value. The special name TRUE can be used to always untar it.
    • dir: Name of the subdirectory to untar it in. (like "krb5")
    • absdir_outattr: Name of an output variable. (like "KRB5_SUBSYS_DIR")
      The variable will be set to the absolute path of the directory where the tarball was unpacked, if and only if the unpacking actually happened (otherwise it will not be defined). ENTRY_ will be prepended if the <file> directive occurs in an entry.
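
As a rough illustration, the four kinds of <file> entries described above could be combined in a single <files> block. This is only a sketch: all paths except the tar-ball example are hypothetical, the MYSITE_ prefix and the period value are made up, and the type and publishing flags chosen for the ENABLE_KRB5 switch are assumptions rather than documented requirements.

```xml
<glidein>
  <!-- Fragment only: the other <glidein> attributes and elements are omitted -->
  <attrs>
    <!-- Assumed default for the switch tested by cond_attr below;
         the type and publish flags are illustrative guesses -->
    <attr name="ENABLE_KRB5" const="False" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="int" value="1"/>
  </attrs>
  <files>
    <!-- Periodic script (hypothetical path), run every 600 seconds;
         the variables it generates get the MYSITE_ prefix -->
    <file absfname="/opt/wmsfactory/custom/check_node.sh" executable="True" period="600" prefix="MYSITE_" comment="Hypothetical periodic check"/>
    <!-- Wrapper script (hypothetical path), sourced before each user job -->
    <file absfname="/opt/wmsfactory/custom/job_wrapper.sh" wrapper="True" comment="Hypothetical job wrapper"/>
    <!-- Constant config file (hypothetical path), checksummed because const is True -->
    <file absfname="/opt/wmsfactory/custom/site.conf" relfname="site.conf" const="True" executable="False" comment="Hypothetical config file"/>
    <!-- Tar-ball, unpacked into krb5/ only when ENABLE_KRB5 is 1 -->
    <file absfname="/tmp/mytar_v12.5.tgz" untar="True" comment="Kerberos subsystem">
      <untar_options cond_attr="ENABLE_KRB5" dir="krb5" absdir_outattr="KRB5_SUBSYS_DIR"/>
    </file>
  </files>
</glidein>
```

Placing the same <files> block inside an <entry> instead would restrict the files to that entry, and the output variable would then be named ENTRY_KRB5_SUBSYS_DIR, as noted above.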

Monitoring collectors

By default, the glideins talk to the VO Pool Collector only. This makes monitoring them from the Factory side extremely difficult.

To solve this, you can set up a Monitoring Collector tree that mirrors that of the VO Pool Collector, and tell the glideins to report there, too.

The configuration syntax is very similar to that of the VO Pool Collector, but uses the monitoring_collector keyword instead of collector.
For example:

<monitoring_collectors>
<monitoring_collector DN="/DC=org/DC=doegrids/OU=Services/CN=factmoncollector.fnal.gov" node="factmoncollector.fnal.gov" secondary="False" group="default" />
<monitoring_collector DN="/DC=org/DC=doegrids/OU=Services/CN=factmoncollector.fnal.gov" node="factmoncollector.fnal.gov:9620-9819" secondary="True" group="default" />
<monitoring_collector DN="/DC=org/DC=doegrids/OU=Services/CN=factmoncollector2.fnal.gov" node="factmoncollector2.fnal.gov" secondary="False" group="ha" />
<monitoring_collector DN="/DC=org/DC=doegrids/OU=Services/CN=factmoncollector2.fnal.gov" node="factmoncollector2.fnal.gov:9620-9919" secondary="True" group="ha" />
</monitoring_collectors>

For more details, see the Frontend documentation.

XSLT Plugins to extend configuration

Starting in v3.2.1, you can use XSL transformations (XSLT) to manage complex configuration files. During the reconfig process, the glidein Factory applies the XSL transformations it finds in the directory configured by the environment variable GWMS_XSLT_PLUGIN_DIR. (The directory can also be given with the -xslt_plugin_dir option of the reconfig_glidein and reconfig_frontend commands, as shown below.)

Setting the variable via sysconfig files is supported: use /etc/sysconfig/gwms-factory for the Factory service and /etc/sysconfig/gwms-frontend for the Frontend service, with contents like the following:

prompt$ cat /etc/sysconfig/gwms-factory 
# Configuration file for the Glideinwms services.

#
# Plugin directory to get xslt transformations from.
#

export GWMS_XSLT_PLUGIN_DIR=/etc/gwms-factory/plugin.d
        
The following sample XSLT adds a custom attribute to the configuration file used by the factory.
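<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
   <!-- Copy <attrs>, inserting the new <attr> ahead of its existing children -->
   <xsl:template match="glidein/attrs">
       <xsl:copy>
         <!-- Attributes must be copied before any child element is added -->
         <xsl:apply-templates select="@*"/>
         <attr name="SOME_GWMS_ATTRIBUTE" const="False" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="int" value="7000"/>
         <xsl:apply-templates select="node()"/>
       </xsl:copy>
   </xsl:template>

   <!-- Identity template: copy everything else unchanged -->
   <xsl:template match="/ | @* | node()">
       <xsl:copy>
           <xsl:apply-templates select="@* | node()"/>
       </xsl:copy>
   </xsl:template>
</xsl:stylesheet>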

The default plugin directory can be overridden by passing the -xslt_plugin_dir option to the reconfig_glidein and reconfig_frontend tools.
prompt$ reconfig_glidein -xslt_plugin_dir <xslt directory> [...]
prompt$ reconfig_frontend -xslt_plugin_dir <xslt directory> [...]
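
A plugin can also change an existing value instead of adding a new element. The following is a minimal sketch, assuming the configuration contains an <attr> named GLIDEIN_Max_Idle; the replacement value 200 is purely illustrative.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <!-- Rewrite only the value attribute of the matching <attr> element;
        200 is an illustrative number, not a recommended setting -->
   <xsl:template match="attr[@name='GLIDEIN_Max_Idle']/@value">
       <xsl:attribute name="value">200</xsl:attribute>
   </xsl:template>

   <!-- Identity template: copy everything else unchanged -->
   <xsl:template match="/ | @* | node()">
       <xsl:copy>
           <xsl:apply-templates select="@* | node()"/>
       </xsl:copy>
   </xsl:template>
</xsl:stylesheet>
```

The more specific match pattern takes precedence over the identity template, so only that one attribute is rewritten and the rest of the configuration passes through untouched.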

Glidein's Startd Advertising to Site HTCondor-CE Collector

Starting in v3.2.11, you can make the glidein's HTCondor daemons advertise to the site's local collector. No GlideinWMS configuration is needed to enable this; all the changes are on the site side.

  1. Give the glidein write access to the site collector
    This is done by adding the glidein's DN to the gridmapfile of the site collector
  2. Set the site collector info in the glidein's environment
    Set CONDORCE_COLLECTOR_HOST in the glidein's environment
    CONDORCE_COLLECTOR_HOST=<site-local HTCondorCE collector address>