Glidein Factory Configuration

Description

To configure the Glidein Factory you need to create a directory containing a set of files. This is done with the command line tools described below. A set of example configurations is also provided.

Command line tools

To create a new directory, use the command:
create_glidein

To update an existing directory, use:
reconfig_glidein

Both are located in:
glideinWMS/creation

They create all the directories and files needed by a Glidein Factory Daemon.

Both commands take a single argument, the configuration file to load:
./create_glidein <xml file>
./reconfig_glidein <xml file>

The configuration file

The configuration file is an XML document.

It is composed of two parts: the global arguments and the entry points.

The global arguments are common to all the entry points, but each entry point may have its own specific settings.
At least one entry point must be specified in the configuration file.

For complete examples, look in
glideinWMS/creation/config_examples/
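
Since at least one entry point must be present, it can be handy to check a configuration file before passing it to create_glidein. The following is a minimal sketch, not part of glideinWMS: it assumes the <glidein>/<entries>/<entry> nesting used in the snippets later on this page, and the helper name count_entries and the entry name "example_site" are hypothetical.

```python
import xml.etree.ElementTree as ET


def count_entries(xml_text):
    """Return the number of <entry> elements found under <glidein><entries>."""
    root = ET.fromstring(xml_text)
    if root.tag != "glidein":
        raise ValueError("root element is <%s>, expected <glidein>" % root.tag)
    return len(root.findall("./entries/entry"))


# Hypothetical, stripped-down configuration used only for illustration.
sample = """<glidein>
  <entries>
    <entry name="example_site"/>
  </entries>
</glidein>"""

if count_entries(sample) < 1:
    raise SystemExit("configuration has no entry points")
print("entry points found:", count_entries(sample))
```

The check only looks at the document structure; it does not validate any of the arguments themselves.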

Suggested global arguments

You definitely need to set the following arguments:

You most probably want to set the following arguments, too:

Some other arguments you might want to set are:


The other arguments are for advanced admins only, and are explained in a dedicated section.

Suggested entry point arguments

Each entry point will have its own root tag:


For each entry point, you definitely need to set the following arguments:

You most probably want to set the following arguments, too:

Some other arguments you might want to set are:

The other arguments are for advanced admins only, and are explained below.

Advanced topics

While the above is enough for setting up a personal glidein pool on the local area network, you will need to do more fine tuning when deploying a larger one. In this section, the various advanced aspects of glidein pools will be presented.

Integration with gLExec

As you may have noticed, all of the glideins are submitted with the same service proxy. While this has the advantage of simplifying the architecture and improving both efficiency and VO control, it does have a few problems:

To solve this problem, some Grid sites are deploying gLExec on the worker nodes. gLExec is a service that, taken

contacts the local authorization and mapping system, switches to the UID of the user (as opposed to the glidein UID), and executes the provided command as that UID.

By using gLExec, a Glidein Factory can get rid of both of the above problems, and still keep all the advantages.

To enable gLExec support, you need to specify:
<glidein>
 [<entries><entry>]
  <files>
    <file absfname="web_base/glexec_setup.sh" executable="True"/>
  </files>
  <attrs>
    <attr name="GLEXEC_BIN" value="path to glexec" publish="False" parameter="True"/>


For most current gLExec installations this comes down to:
<attr name="GLEXEC_BIN" value="OSG" publish="False" parameter="True"/>

More details about scripts in general can be found in the "custom code" section.

You will also need to properly configure the shadow config files on the submit machine, by adding:
GLEXEC_STARTER = True
GLEXEC = /bin/false

to the condor_config.

As of version 7.1.3 of Condor, a new, better glexec operation mode is supported; in the old operation mode, condor_startd invoked condor_starter through glexec. The result was that condor_starter was running under the same UID as the user job, leaving it vulnerable to attack from a malicious user. The new operating mode solves this by having condor_starter run the user jobs via glexec; this adds a little more overhead to handle the user jobs, but makes the system much more secure.

To enable the new operation mode, add the following line to your configuration file:
<attr name="GLEXEC_JOB" value="True" publish="False" parameter="True"/>

Note that you still need to set GLEXEC_BIN, too.
Warning: Use it only if you use Condor 7.1.3 or later, as it will not work on any older Condor version!

Troubleshooting

gLExec installations on at least one site had problems with delegated proxies. If in doubt, try to disable the delegation.
To disable delegation, add
DELEGATE_JOB_GSI_CREDENTIALS=False
SEC_DEFAULT_ENCRYPTION=PREFERRED
to the shadow configuration file, and set the following in the glidein creation file:
<glidein>
 [<entries><entry>]
  <attrs>
    <attr name="SEC_DEFAULT_ENCRYPTION" value="REQUIRED" publish="False" parameter="True"/>

Another thing to consider is the startup directory; it must be accessible by both the starting user and the target user(s). The directory in which a Grid job usually starts is most often not readable by any other user, so you must select something else. Both Condor and OSG should be fine, or you can specify any other fixed, WN-local location.

Private networks and firewalls

Condor daemons need two-way communication in order to work properly. This clashes with the network policies of most Grid sites, which have worker nodes in private networks or implement a restrictive firewall.

Condor provides two mechanisms to address this:

GCB - Generic Connection Broker

GCB was the first Condor implementation that allowed it to work in restrictive network environments.
The detailed description of GCB is beyond the scope of this manual and you should refer to the Condor documentation available at http://www.cs.wisc.edu/condor/manual/v7.2/3_7Networking_includes.html#sec:GCB. Here you will find only the parameters needed to enable it in the glideins.

To use Condor with GCB, you need to specify:
<glidein>
 [<entries><entry>]
  <attrs>
    <attr name="GCB_LIST" value="IP[:PORT],IP[:PORT],..." publish="False" parameter="True"/>
    <attr name="GCB_ORDER" value="NONE|RANDOM|GCBLOAD|ROUNDROBIN|SEQUENTIAL" publish="False" parameter="True"/>

where:

If your GCBs support freesockets queries (v7.0 and above), you most probably want to protect your glideins from failing due to an overloaded GCB. To do that, the gcb_broker_query binary needs to be part of the Condor distribution you are using. You also need to decide on the minimum number of free sockets you are comfortable with:
<glidein><attrs><attr name="GCB_MIN_FREE" value="number" publish="False" parameter="True"/>
I would suggest you set it to at least 100, possibly more. Most Condor versions use around 5 sockets per VM (depending on configuration).
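
To get a feeling for what a given GCB_MIN_FREE value means, the figure of about 5 sockets per VM quoted above can be turned into a rough estimate of how many more VMs fit in the free-socket margin. This is only a back-of-the-envelope sketch; the function name is hypothetical and the real per-VM socket count depends on your Condor version and configuration.

```python
def vms_within_margin(min_free_sockets, sockets_per_vm=5):
    """Estimate how many additional VMs fit in the free-socket margin.

    sockets_per_vm defaults to the approximate figure quoted in the text;
    it is an assumption and should be adjusted for your setup.
    """
    if sockets_per_vm <= 0:
        raise ValueError("sockets_per_vm must be positive")
    return min_free_sockets // sockets_per_vm


for min_free in (100, 200, 500):
    print("GCB_MIN_FREE=%d leaves room for about %d more VMs"
          % (min_free, vms_within_margin(min_free)))
```

With the suggested minimum of 100 and 5 sockets per VM, the margin covers roughly 20 more VMs.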

You can also specify a default GCB port (defaults to 65432):
<glidein><attrs><attr name="GCB_PORT" value="port" publish="False" parameter="True"/>
although Condor GCB right now does not support any other port number.

Also, for more flexibility, you can let the frontends provide their own GCB servers, by setting publish="True" const="False".

If you have more sophisticated needs and want to use GCB routing tables as well, add:
<glidein>
 [<entries><entry>]
  <files>
    <file absfname="path to routing file" relfname="gcb_route.cfg"/>
  </files>
  <attrs>
    <attr name="GCB_REMAP_ROUTE" value="gcb_route.cfg" publish="False" parameter="True"/>


Please be aware that the above will configure the glideins only; you still need to properly configure the Collector and the submit machines.

CCB - Condor Connection Broker

CCB was introduced in Condor v7.3.0 to replace GCB in most circumstances. It is much more reliable than GCB and also easier to set up.
The detailed description of CCB is beyond the scope of this manual and you should refer to the Condor documentation available at http://www.cs.wisc.edu/condor/manual/v7.3/3_7Networking_includes.html#sec:CCB. Here you will find only the parameters needed to enable it in the glideins.

To use Condor with CCB, you need to specify:
<glidein>
 [<entries><entry>]
  <attrs>
    <attr name="USE_CCB" value="True" publish="False" parameter="True"/>
 

and you are done. Just make sure you follow the suggested scalability guidelines described in the Condor manual.

Security handles

As mentioned in the startup page, the glidein pool must be properly configured to protect it from hackers and malicious users. The same page also describes what needs to be done on the collector machine.
The glidein itself can also be configured. The default configuration works fine for most users, but you may need to change it.

The values are set using the <attr /> option, and the default values are:

As of version 7.1.3, Condor also supports a more efficient authentication mechanism between the condor_schedd/condor_shadow and condor_startd/condor_starter. This method uses the match ClaimId as a shared password for authentication between these daemons. Since using a shared secret is much cheaper than using GSI authentication, this should be used whenever it is feasible.

To enable this option, you need to set an attribute using the <attr /> option:


You will also need to enable this on the submit machine, by adding
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True
to the schedd condor_config. Do not add this option to either the negotiator or collector as it will not work.
Obviously, this will only work with Condor versions 7.1.3 and above.

Using TCP to send updates to the Collector

By default, Condor uses UDP packets to communicate between the glideins and the Collector. While more efficient than TCP, UDP packets are often blocked at the firewall, or lost on the WAN.

To switch on TCP updates, please specify, with the <attr /> option:
UPDATE_COLLECTOR_WITH_TCP=True
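
As a sketch of what the corresponding <attr /> element could look like, the snippet below renders one with Python's standard library. The publish="False" and parameter="True" values are an assumption, copied from the other <attr /> examples on this page rather than stated for this attribute, and the helper name make_attr is hypothetical.

```python
import xml.etree.ElementTree as ET


def make_attr(name, value, publish="False", parameter="True"):
    """Render an <attr/> element as a string.

    The publish/parameter defaults mirror the other examples on this page;
    treat them as an assumption for any attribute not shown there.
    """
    attr = ET.Element("attr", {
        "name": name,
        "value": value,
        "publish": publish,
        "parameter": parameter,
    })
    return ET.tostring(attr, encoding="unicode")


print(make_attr("UPDATE_COLLECTOR_WITH_TCP", "True"))
```

The printed element goes inside the <attrs> section, in the same way as the other attributes described on this page.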


Please be aware that this will configure the glideins only; you still need to properly configure the Collector machine. See Condor documentation for more details.

Multiple Collectors

By default, Condor uses only one Collector for the glidein (user) pool. However, if the load becomes too high, you can configure multiple collectors in a chain.

You will need a master and a set of slave collectors. The slave collectors forward the startd ads to the master collector.
The negotiator and the schedds will talk to the master collector, while the startds will talk to one of the slave ones.

To set up a slave collector in the glidein (user) pool, one way is to set the following environment variables before starting up the condor_master:

COLH=`condor_config_val COLLECTOR_HOST`
LD=`condor_config_val LOCAL_DIR`
export _CONDOR_COLLECTOR_HOST=$COLH:
export _CONDOR_MASTER_NAME=collector_
export _CONDOR_DAEMON_LIST="MASTER, COLLECTOR"
export _CONDOR_LOCAL_DIR=$LD/$_CONDOR_MASTER_NAME
export _CONDOR_LOCK=$_CONDOR_LOCAL_DIR/lock
# Forward all the traffic to the main collector
export _CONDOR_CONDOR_VIEW_HOST=$COLH:9618
Once you have the slave collectors set up, you will want to use them.

The VO frontend will have to point the factory to a list of collectors.

However, by default the glideins expect a specific collector, so you need to add an additional script to handle this:
<glidein>
 [<entries><entry>]
  <files>
    <file absfname="web_base/collector_setup.sh" executable="True"/>
  </files>

Setting the glidein start and rank condition

As with any Condor pool, you may need to set the startd start and rank conditions.
For a glidein, you can set this with the <attr /> options:


For example:
<glidein>
 [<entries><entry>]
  <attrs>
    <attr name="GLIDEIN_Start" value="Owner==&quot;sfiligoi&quot;" publish="False" parameter="True"/>
    <attr name="GLIDEIN_Rank" value="ImageSize" publish="False" parameter="True"/>
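
In the example above, the quotes inside the start expression are written as &quot; because the expression itself sits inside a double-quoted XML attribute. A small sketch of how such a value can be escaped with Python's standard library follows; the helper name escape_attr_value is hypothetical.

```python
from xml.sax.saxutils import escape


def escape_attr_value(expression):
    """Escape a Condor expression for use inside a double-quoted XML attribute.

    escape() handles &, < and > by default; the extra mapping also
    replaces double quotes with &quot;.
    """
    return escape(expression, {'"': "&quot;"})


start_expr = 'Owner=="sfiligoi"'
print('value="%s"' % escape_attr_value(start_expr))
```

Expressions using comparison operators such as < or && need the same treatment, since those characters are not allowed unescaped in attribute values.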

Limiting time spent on a Grid resource

The whole concept of gliding into Grid resources is based on the idea that you are getting those resources on a temporary basis. This implies that you need to leave the slot as soon as possible, else your jobs will simply be killed by the annoyed Grid administrators.
On the other hand, submitting new glideins is not cost free, so you want to keep the resource for at least some period of time.

The glideins have two mechanisms to regulate this:

  1. After a specified amount of time, the glidein will enter the RETIRING state. This means it will wait for the current job to finish (or kill it if it does not end within a configurable timeout) and exit immediately afterwards. This obviously implies that no new jobs will start after it has entered that state.
    The two timeouts can be set with the <attr /> options:

    The two default to 2 and 100 hours.

  2. If a glidein is not claimed within a configurable timeout, the glidein will exit.
    The timeout can be set with:

    The default is 20 minutes.

An example:
<glidein>
 [<entries><entry>]
  <attrs>
    <attr name="GLIDEIN_Max_Idle" value="300" type="int" publish="False" parameter="True"/>
    <attr name="GLIDEIN_Retire_Time" value="14400" type="int" publish="False" parameter="True"/>
    <attr name="GLIDEIN_Job_Max_Time" value="180000" type="int" publish="False" parameter="True"/>
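
The values in the example are easier to compare with the defaults quoted above once converted to minutes and hours. The sketch below assumes the attribute values are expressed in seconds; that unit is an assumption, not something stated on this page, so check it against your glideinWMS version.

```python
# Values copied from the example above; assumed to be in seconds.
EXAMPLE_TIMEOUTS = {
    "GLIDEIN_Max_Idle": 300,
    "GLIDEIN_Retire_Time": 14400,
    "GLIDEIN_Job_Max_Time": 180000,
}


def seconds_to_minutes(seconds):
    """Convert a number of seconds to minutes."""
    return seconds / 60.0


def seconds_to_hours(seconds):
    """Convert a number of seconds to hours."""
    return seconds / 3600.0


for name, seconds in EXAMPLE_TIMEOUTS.items():
    print("%s = %d s = %.1f min = %.2f h"
          % (name, seconds, seconds_to_minutes(seconds), seconds_to_hours(seconds)))
```

Under that assumption the example sets 5 minutes, 4 hours and 50 hours respectively.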

Old-style pseudo-interactive monitoring

Since v1_4_1, the pseudo-interactive monitoring uses a dedicated startd in the glideins for monitoring purposes. This allows for monitoring even when the job starter enters the “Retiring” activity.

The side effect is that the cross-VM statistics are no longer available, and the names of the slots are also different.

To enable the old mode, use:

<attr name="MONITOR_MODE" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="MULTI"/>

Adding custom code/files to the glideins

While provided code should cover most of the general purpose use cases, some administrators may have additional needs. For these cases, the glidein creation command adds the following options:

Please note that files and subsystems will be downloaded before the scripts, and that the user-provided scripts will be executed in the specified order, before the Condor daemons are started up.

Repository

CVSROOT

cvsuser@cdcvs.fnal.gov:/cvs/cd

Package(s)

glideinWMS/creation

Author(s)

Since Aug. 14th - Igor Sfiligoi (Fermilab Computing Division)