Example Configuration
Below is an example factory configuration XML file.
<glidein ... >
  <log_retention>
    <condor_logs max_days="14.0" max_mbytes="100.0" min_days="3.0" />
    <job_logs max_days="7.0" max_mbytes="100.0" min_days="3.0" />
    <process_logs>
      <process_log extension="info" max_days="7.0" max_mbytes="100.0" min_days="3.0" msg_types="INFO" />
      <process_log extension="debug" max_days="7.0" max_mbytes="100.0" min_days="3.0" msg_types="DEBUG,ERR,WARN" />
    </process_logs>
    <summary_logs max_days="31.0" max_mbytes="100.0" min_days="3.0" />
  </log_retention>
  <monitor base_dir="/var/www/html/glidefactory/monitor" flot_dir="/opt/javascriptrrd-0.6.1/flot" javascriptRRD_dir="/opt/javascriptrrd-0.6.1/src/lib" jquery_dir="/opt/javascriptrrd-0.6.1/flot" />
  <monitor_footer display_txt="Legal Disclaimer" href_link="/site/disclaimer.html" />
  <security key_length="2048" pub_key="RSA" reuse_oldkey_onstartup_gracetime="900" remove_old_cred_freq="24" remove_old_cred_age="30">
    <frontends>
      <frontend name="vofrontend" identity="vofrontend@vofrontend.fnal.gov">
        <security_classes>
          <security_class name="frontend" username="frontend1" />
        </security_classes>
      </frontend>
    </frontends>
  </security>
  <stage base_dir="/var/www/html/glidefactory/stage" use_symlink="True" web_base_url="http://factory.fnal.gov:9000/glidefactory/stage"/>
  <submit base_client_log_dir="/opt/clientlogs/clients/logs" base_client_proxies_dir="/opt/clientlogs/clients/proxies" base_dir="/opt/wmsfactory/" base_log_dir="/opt/wmsfactory//logs" />
  <attrs>
    <attr name="CONDOR_VERSION" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="default" />
    <attr name="GCB_ORDER" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="string" value="NONE" />
    <attr name="GLEXEC_JOB" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="True" />
    <attr name="USE_CCB" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="string" value="False" />
    <attr name="USE_MATCH_AUTH" const="False" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="True" />
  </attrs>
  <entries>
    <entry name="EXAMPLE_ENTRY" enabled="True" auth_method="grid_proxy" trust_domain="OSG" gatekeeper="gatewayname.fnal.gov/jobmanager-condor" gridtype="gt2" rsl="(queue=default)(jobtype=single)" schedd_name="wmscollector.fnal.gov" verbosity="std" work_dir="OSG">
      <config>
        <max_jobs>
          <per_entry held="1000" idle="2000" glideins="10000"/>
          <default_per_frontend held="50" idle="100" glideins="5000"/>
          <per_frontends/>
        </max_jobs>
        <release max_per_cycle="20" sleep="0.2"/>
        <remove max_per_cycle="5" sleep="0.2"/>
        <submit cluster_size="10" max_per_cycle="100" sleep="0.2" slots_layout="single_slot"/>
      </config>
      <allow_frontends />
      <attrs>
        <attr name="CONDOR_ARCH" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="default"/>
        <attr name="CONDOR_OS" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="default" />
        <attr name="GLEXEC_BIN" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="NONE"/>
        <attr name="GLIDEIN_Site" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="FNAL_EXAMPLE_SITE"/>
        <attr />
      </attrs>
      <monitorgroups/>
      <files/>
      <infosys_refs />
    </entry>
  </entries>
  <condor_tarballs>
    <condor_tarball arch="default" base_dir="/opt/wmscollector/" os="default" tar_file="/var/www/html/glidefactory/stage/glidein_v2_4/condor_bin_default-default-default.a83ePm.tgz" version="default"/>
  </condor_tarballs>
  <files />
</glidein>
The configuration file
The configuration file is an XML document. It contains both global arguments and configuration specific to each entry point. At least one entry point must be specified in the configuration file.
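As a rough illustration of this structure, the sketch below parses a configuration and checks that at least one entry point is defined. It is not part of the factory tools; the function name list_entry_names and the trimmed-down sample configuration are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down configuration used only for this illustration.
SAMPLE_CONFIG = """
<glidein glidein_name="example_1">
  <entries>
    <entry name="EXAMPLE_ENTRY" enabled="True" gridtype="gt2"/>
  </entries>
</glidein>
"""


def list_entry_names(xml_text):
    """Return the names of all <entry> tags found under <glidein><entries>."""
    root = ET.fromstring(xml_text)
    if root.tag != "glidein":
        raise ValueError("root tag must be <glidein>, found <%s>" % root.tag)
    names = [entry.get("name") for entry in root.findall("./entries/entry")]
    if not names:
        # The documentation requires at least one entry point.
        raise ValueError("at least one <entry> must be defined")
    return names


entry_names = list_entry_names(SAMPLE_CONFIG)
print(entry_names)
```

A check like this can be run before a reconfig to catch an empty entries section early.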
The tags of the XML configuration file are described below. Each is given a designation:
- Required You must change or examine this in order for the factory to function correctly.
- Recommended The installer provides a good default, but you should examine this attribute to make sure it is correct for your installation.
- Optional The installer-provided default is likely correct for your installation. Change this only if your particular configuration requires special treatment or fine-tuning.
Global arguments
Global arguments are common to all entry points but can be overridden by individual entry point configuration.
-
<glidein ... >
Required The main tag of the factory configuration. See below for the parameters for this tag:
-
<glidein glidein_name="your name">Required The name of the configuration. It will be used to advertise the entry points, will be defined as Condor glidein attribute GLIDEIN_Name, and is also used to create the directory names. Choose a short name that describes the set of Grid resources it represents and append a version number (like "fnalcms_1"). Starting with v2.0 of glideinWMS, you can use the factory reconfig tool to make changes to the factory configuration. You will only need a new configuration for the factory during a major upgrade. For more details, refer to the Glidein Factory management section.
-
<glidein factory_name="your name">Recommended Changing this value from the name of the machine allows you to move the factory without disrupting the system.
-
<glidein schedd_name="schedd name[,schedd name]*">Recommended If you want to use multiple Condor schedds or you don't like the default name, you definitely need to set this. If you specify more than a single schedd, the various entries will be equally spread among all the listed schedds. Possible values include (but are not limited to):
"myschedd@mymachine.mydomain"
"myschedd_g1@mymachine.mydomain,myschedd_g2@mymachine.mydomain,myschedd_g3@mymachine.mydomain"
-
<glidein loop_delay="seconds" advertise_delay="nr" >Optional Defines how active the glidein factory should be. The glidein factory works in polling mode. loop_delay defines how much time should pass between each polling loop, with the collector being updated every advertise_delay loops.
-
<glidein restart_attempts="nr" restart_interval="seconds" >Optional Defines how many times (restart_attempts) the factory should attempt to restart an entry within restart_interval seconds if the entry crashes.
-
<glidein advertise_with_tcp="True|False" >Optional Defines if the factory should use TCP to advertise its classads.
-
<glidein advertise_with_multiple="True|False" >Optional Defines if the factory should use -multiple to advertise its classads (requires Condor 7.5.4+).
-
-
<glidein><log_retention><process_logs><process_log max_days="max days" min_days="min days" max_bytes="max bytes" type="INFO"/>
The admin can configure one or more logs with any combination of following log message types:
- INFO: Informational messages about the state of the system.
- DEBUG: Debug messages. These are additional informational messages that describe code execution in detail.
- ERR: Error messages. These may include tracebacks.
- WARN: Warning messages. These warn of conditions that were found that don't necessarily cause abnormal execution.
-
<glidein><condor_tarballs><condor_tarball os="os" arch="arch" base_dir="directory" version="condor version"/>
Required Where to find the Condor binaries. You can list as many as you need, but at least one is required. This lets you configure glideins for different sites to use different versions of the Condor binaries, based on the architecture and OS of the worker nodes found at each site.
It is recommended to have one default entry with os="default" arch="default" version="default".
See multiple tarballs for more detailed instructions on supporting tarballs for multiple architectures.
-
<glidein><submit base_dir="directory" base_log_dir="log directory" base_client_log_dir="client log directory" base_client_proxies_dir="directory where proxies will be stored"/>
Recommended Where to create the glidein submit directory. The default is the user home directory. Log directories can be configured independent of the base directory using options mentioned above.
-
<glidein><stage base_dir="web dir" web_base_url="URL"/>
Recommended These two define where the Web server directories are located.
The defaults are reasonable, but you may have different needs.
-
<glidein><monitor base_dir="web dir" javascriptRRD_dir="web dir" flot_dir="web dir" jquery_dir="web dir"/>
Recommended The base_dir defines where the monitoring web area is.
The other entries point to where the javascriptRRD, Flot and JQuery libraries are.
-
<glidein><monitor_footer display_txt="Legal Disclaimer" href_link="/site/disclaimer.html"/>
Optional If the display text and link are configured, the monitoring pages will display the text/link at the bottom of the page.
-
<glidein><security reuse_oldkey_onstartup_gracetime="900" remove_old_cred_age="30" remove_old_cred_freq="24" pub_key="RSA"/>
The factory can remove old credential files saved on disk. You can set the min age (in days) the credential needs to be before it can be deleted and the frequency (in hours) to perform the removal. If you set the frequency to less than zero, the clean up functionality is disabled.
-
<glidein><security><frontends><frontend name="FrontendName" identity="username_usedby_factory@factory_hostname" />
Recommended (Required for privilege separation) This configures a frontend. The frontend on the frontend hosting machine should have the same name as the one given in 'name'. The identity tells the factory which identity the given frontend authenticates as. If this does not match the frontend configuration, the factory will drop all requested glideins.
-
<glidein><security><frontends><frontend><security_classes><security_class name="frontend" username="username_usedby_factory"/>
Required Tells the class and user name the factory uses for this frontend.
-
<glidein><attrs><attr name="attr name" value="value" const="True" parameter="True" publish="True" glidein_publish="True" comment="comment" />
Attributes you want to publish that affect all the factory entries.
To set attributes specific to an entry point, set them in the /glidein/entries/entry/attrs section.
The attributes of the <attr ... > tag are described below.
- name: Name of the attribute.
- value: Value of the attribute.
- const: Whether this attribute is a constant, so that the Glidein Frontend cannot change it. If set to True, the attribute will be available in the constants file created in the staging area.
- parameter: Set to True if the attribute should be passed as a parameter. Always set this to True unless you know what you are doing.
- publish: If set to True, the attribute will be published in the Factory's classad.
- glidein_publish: If set to True, the attribute will be available in the condor_startd's classad. Used only if parameter is True.
- job_publish: If set to True, the attribute will be available in the user job's environment. Used only if parameter is True.
- comment: You can specify a description of the attribute here.
- type: Type of the attribute. Supported types are 'int', 'string' and 'expr'. Type expr is equivalent to condor constant/expression in condor_vars.lst.
These are used by the VO frontend matchmaking and job matchmaking.
Example attributes are:
<attrs>
<attr name="VOpilot" value="CMS" publish="True" parameter="True" const="True" glidein_publish="True" comment="A test attribute"/>
<attr name="CondorVersion" value="v6.9.1" publish="True" parameter="True" const="True" glidein_publish="True"/>
</attrs>
A list of all the attributes can be found on the dedicated configuration variables page .
The other arguments are for advanced admins only, and are
explained in a dedicated section.
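The credential clean-up rule described above for the <security> tag (a minimum age in days, a frequency in hours, and a negative frequency to disable it) can be sketched as follows. This is only an illustration of the rule as stated, not the factory's actual code; the function credentials_to_remove and the sample file names are hypothetical.

```python
import time


def credentials_to_remove(cred_mtimes, remove_old_cred_age, remove_old_cred_freq, now=None):
    """Return the credential file names that are at least remove_old_cred_age days old.

    cred_mtimes maps a file name to its modification time (seconds since epoch).
    A frequency below zero disables the clean-up, as described above.
    """
    if remove_old_cred_freq < 0:
        return []
    if now is None:
        now = time.time()
    max_age_seconds = remove_old_cred_age * 24 * 3600
    return sorted(name for name, mtime in cred_mtimes.items()
                  if now - mtime >= max_age_seconds)


now = 100 * 24 * 3600  # arbitrary reference time for the illustration
creds = {
    "old_proxy": now - 45 * 24 * 3600,  # 45 days old
    "new_proxy": now - 2 * 24 * 3600,   # 2 days old
}
stale = credentials_to_remove(creds, remove_old_cred_age=30, remove_old_cred_freq=24, now=now)
disabled = credentials_to_remove(creds, remove_old_cred_age=30, remove_old_cred_freq=-1, now=now)
print(stale, disabled)
```

In this sketch the frequency only acts as an on/off switch; a real clean-up would also use it to schedule how often the check runs.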
Entry point arguments
The following are arguments that are specific to each entry point. They override the global arguments if present.
-
<glidein><entries><entry name="entry name">
Required Each entry point will have its own root tag with parameters:
-
<glidein><entries><entry name="entry name">Required Specify an easy to remember name that will identify this entry for display purposes and for specifying using command line tools.
-
<glidein><entries><entry name="entry name" auth_method="grid_proxy">Required The authentication method this entry supports. It is advertised to the Frontends to show which credentials are required for submission.
If the authentication method is a credential pair (key_pair, cert_pair, username_password), then the VM id and type must be specified.
These can be given as additional entry values: <entry vm_type="vmtype" vm_id="vmid">
OR
added to the authentication method so the Frontend is required to pass them: <entry auth_method="key_pair+vm_id+vm_type">
If the entry is for a TeraGrid site, it requires a project id from the Corral Frontend (not supported by VO Frontends). The authentication method then must be <entry auth_method="grid_proxy+project_id">
-
<glidein><entries><entry name="entry name" trust_domain="OSG">Required The trust domain for the entry. This is not interpreted by the Factory code; it is only used to show which credentials are valid. For example, there may be two ec2-type entries for two different clouds with the authentication method of "key_pair". This allows the Frontend to map a key pair to a particular cloud.
-
<glidein><entries><entry name="entry name" gatekeeper="gatekeeper">Required The identifier of your Grid resource (like "cmsitbsrv01.fnal.gov/jobmanager-condor").
-
<glidein><entries><entry name="entry name" rsl="rsl">Please check the Grid site documentation and/or ask the Grid site administrator for the proper rsl and queue name for the site.
(example: '(condorsubmit=(universe vanilla)(requirements \"(ISMINOSAFS=?=True)\"))').
NOTE: If the auth_method contains "+project_id" for a TeraGrid entry, the string "(project=TG_PROJECT_ID)" will be added by the Factory and populated with the project id passed in the Corral request. Only Corral Frontends can use these entries since we currently do not support a VO Frontend passing the project id.
-
<glidein><entries><entry name="entry name" gridtype="grid type [default: gt2]">Optional The current implementation has been tested with Globus v2 Gatekeepers only, but this tag can specify additional Condor Grid types.
-
<glidein><entries><entry name="entry name" work_dir="WN dir">Recommended This argument defines where the glidein should run once it is on the worker node. Note: Most OSG sites are known to crash if you use your starting directory to run. For those sites, it is good practice to specify "Condor" if they are running Condor as the underlying batch system, and "OSG" otherwise. On EGEE sites, "." is usually fine.
-
<glidein><entries><entry name="entry name" proxy_url="Proxy URL">Recommended If you have a Web cache you can use, you set it here (like "cmsitbsquid002.fnal.gov:3128"). On OSG resources, you can set it to "OSG", and the default OSG squid will be used. If you cannot use any Web cache server, you can skip this argument (the default is not to use caching). If defined, the user jobs will be able to use it as "GLIDEIN_Proxy_URL" environment variable.
-
<glidein><entries><entry name="entry name" schedd_name="schedd name">Optional If you have an entry that needs a dedicated schedd, you can set it here (to something like "myveryspecialschedd@mymachine.mydomain")
-
<glidein><entries><entry name="entry name" enabled="True/False">Optional You can define an entry point even if you do not plan to use it right away. The entry point directory will be created independently of the enabled flag, but will only be used by the glidein factory if it is set to True. (Defaults to True).
-
<glidein><entries><entry name="entry name" verbosity="std/fast/nodebug">Optional Specify the verbosity level and termination time in case of validation errors:
- std (default) – reasonable verbosity (including the condor log files) and 20min sleep in case of error (to reduce the damage resulting from broken nodes)
- fast – same verbosity as std, but will only wait 2 mins before terminating in case of error (good for debugging)
- nodebug – very low verbosity, if you want to save on disk space
-
-
<glidein><entries><entry name="entry name">
<config>
<max_jobs/>
<release max_per_cycle="20" sleep="0.2"/>
<remove max_per_cycle="5" sleep="0.2"/>
<submit cluster_size="10" max_per_cycle="100" sleep="0.2" slots_layout="single_slot"/>
</config>
The values in the config section are limits for the entry.
- The max_jobs section limits the number of held, idle, and total (including running) glideins. You can specify the limits per entry, a default per frontend-security class, or override a specific frontend-security class.
- Release regulates how many glideins are released per cycle.
- Remove regulates how many glideins are removed per cycle.
- Submit values limit the number of glideins submitted, how long to wait to submit, and what kinds of glideins the entry supports - single slot or whole node glideins.
-
<glidein><entries><entry name="entry name"><attrs><attr name="attr name" value="value" const="True" parameter="True" publish="True" glidein_publish="True" comment="comment"/>
Attributes you want to publish into the Condor classad. These are used by the VO frontend matchmaking and job matchmaking.
Example attributes are:
<attr name="HasMySoftware" value="True" publish="True" parameter="True" const="True" glidein_publish="True" comment="My users cannot live without"/>
<attr name="OS" value="Linux" publish="True" parameter="True" const="True" glidein_publish="True"/>
Other pre-defined attributes are listed below:
-
<glidein><entries><entry name="entry name"><attrs><attr name="GLIDEIN_Site" value="value" const="True" parameter="True" publish="True"/>Recommended This defines the glidein attribute GLIDEIN_Site, both for use by the Frontend and for use in the job negotiation. Logically defining a site is useful, so that you can change entry points while the user jobs still know where they are running. If not specified, it defaults to the entry point name in the startd ClassAd.
-
<glidein><entries><entry name="entry name"><attrs>Recommended Select a non-default condor binary.
<attr name="CONDOR_VERSION" value="version" type="string" const="True" parameter="True" publish="False"/>
<attr name="CONDOR_ARCH" value="arch" type="string" const="True" parameter="True" publish="False"/>
<attr name="CONDOR_OS" value="os" type="string" const="True" parameter="True" publish="False"/>
The entry will default to CONDOR_OS="default" CONDOR_ARCH="default" CONDOR_VERSION="default", if not otherwise defined.
-
-
<glidein><entries><entry name="entry name"> <allow_frontends> <allow_frontend name="vofrontend_name" security_class="security_class_name">
This argument allows you to create a whitelist of vo frontends that can access this entry point. If this tag is blank or missing, it is assumed that all vo frontends can submit glideins to this entry point. However, if any allow_frontend tags exist, the entry point will only allow those frontends to submit glideins. The name of the frontend must match the name given in the security class above in the configuration.
For each frontend, you must tell it which security classes (e.g. proxies) can use the frontend. The factory will only submit glideins on behalf of these security classes. If you want all security classes to be allowed, you can put "All" in this field. Otherwise, it must match the security class configuration higher up in the xml.
-
<glidein><entries><entry name="entry name"> <infosys_refs><infosys_ref ref="filename" server="SERVER" type="RESS/BDII">
This argument is placed here by the installers based on information from BDII/RESS. This gives you information on where the server's information was retrieved from. It can also be used to retrieve downtime information from RESS/BDII.
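The allow_frontends whitelist rule described above (no allow_frontend tags means every frontend may submit; otherwise only the listed frontends and security classes may, with "All" matching any class) can be sketched as follows. This is only an illustration of the rule as documented, not the factory's actual implementation; the helper is_allowed and the second frontend name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry fragment used only for this illustration.
ENTRY_XML = """
<entry name="EXAMPLE_ENTRY">
  <allow_frontends>
    <allow_frontend name="vofrontend" security_class="frontend"/>
    <allow_frontend name="otherfrontend" security_class="All"/>
  </allow_frontends>
</entry>
"""


def is_allowed(entry, frontend_name, security_class):
    """Apply the whitelist rule: no allow_frontend tags means everyone is allowed."""
    rules = entry.findall("./allow_frontends/allow_frontend")
    if not rules:
        return True
    for rule in rules:
        if rule.get("name") != frontend_name:
            continue
        allowed_class = rule.get("security_class")
        if allowed_class == "All" or allowed_class == security_class:
            return True
    return False


entry = ET.fromstring(ENTRY_XML)
open_entry = ET.fromstring('<entry name="OPEN_ENTRY"><allow_frontends/></entry>')
print(is_allowed(entry, "vofrontend", "frontend"))
print(is_allowed(open_entry, "anyfrontend", "anyclass"))
```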
Advanced topics
While the above is enough for setting up a personal glidein pool on the local area network, you will need to do more fine tuning when deploying a larger one. In this section, the various advanced aspects of glidein pools will be presented.
Integration with gLExec
As you may have noticed, all of the glideins are submitted with the same service proxy. While this has the advantage of simplifying the architecture and improving both efficiency and VO control, it does have a few problems:
- All glidein scripts, Condor daemons, and user jobs run under the same Unix UID, so users can interfere with the glidein tasks, possibly hacking the system. In addition, when several glideins start on the same node (on multi-processor/core machines), one user job can interfere with another user job.
- The real user is never authenticated against the Grid site authorization infrastructure. This makes it impossible for the sites to enforce their policies or to analyze the usage of their resources; they see only glideins. This makes them very unfriendly toward the glidein-based WMS.
To solve this problem, some Grid sites are deploying gLExec on the worker nodes. gLExec is a service that will take the following:
- the user proxy, and
- the desired command
It will contact the local authorization and mapping system, switch to the UID of the user (as opposed to the glidein UID), and execute the provided command as that UID.
By using gLExec, a Glidein Factory can get rid of both of the above problems, and still keep all the advantages.
To enable gLExec support, you need to specify:
[<entries><entry>]
<files>
<file absfname="web_base/glexec_setup.sh" executable="True"/>
</files>
<attrs>
<attr name="GLEXEC_BIN" value="path to glexec" publish="False" parameter="True"/>
</attrs>
For most current gLExec installation this comes down to:
More details about scripts in general can be found in the "custom code" section.
You will also need to properly configure the shadow config files on the submit machine, by adding the following to the condor_config:
GLEXEC_STARTER = True
GLEXEC = /bin/false
As of version 7.1.3 of Condor, a new, better glexec operation mode is supported; in the old operation mode, condor_startd invoked condor_starter through glexec. The result was that condor_starter was running under the same UID as the user job, leaving it vulnerable to attack from a malicious user. The new operating mode solves this by having condor_starter run the user jobs via glexec; this adds a little more overhead to handle the user jobs, but makes the system much more secure.
To enable the new operation mode, add the following line to your configuration file:
Note that you still need to set GLEXEC_BIN, too.
Warning: Use it only if you use Condor 7.1.3 or later, as it will not work on any older Condor version!
Troubleshooting
gLExec installations on at least one site had problems with delegated proxies. If in doubt, try to disable the delegation.
To disable delegation, add the following to the shadow configuration file:
DELEGATE_JOB_GSI_CREDENTIALS=False
SEC_DEFAULT_ENCRYPTION=PREFERRED
Then, set the following tags in the glidein creation file:
[<entries><entry>]
<attrs> <attr name="SEC_DEFAULT_ENCRYPTION" value="REQUIRED" publish="False" parameter="True"/> </attrs>
Another thing to consider is the startup directory; it must be accessible by both the starting user and the target user(s). The directory you usually start in the Grid is most often not readable by any other user, so you must select something else. Both Condor and OSG should be fine, or you can specify any other fixed, WN-local location.
Private networks and firewalls
Condor daemons need two-way communication in order to work properly. This clashes with the network policies of most Grid sites, which have worker nodes in private networks or implement a restrictive firewall.
Condor provides two mechanisms to address this:
GCB - Generic Connection Broker
GCB was the first Condor
implementation that allowed it to work in restrictive network
environments.
The detailed description of GCB is beyond the
scope of this manual and you should refer to the Condor
documentation available at
http://www.cs.wisc.edu/condor/manual/v7.2/3_7Networking_includes.html#sec:GCB.
Here you will find only the parameters needed to enable it in the
glideins.
To use Condor with GCB, you need to specify:
[<entries><entry>]
<attrs>
<attr name="GCB_LIST" value="IP[:PORT],IP[:PORT],..." publish="False" parameter="True"/>
<attr name="GCB_ORDER" value="NONE|RANDOM|GCBLOAD|ROUNDROBIN|SEQUENTIAL" publish="False" parameter="True"/>
</attrs>
where:
NONE: Do not use GCB (a good way to selectively disable it)
RANDOM: Randomly distributes between the listed GCBs
ROUNDROBIN (or RR): Round robin between them, based on the job submission number.
SEQUENTIAL (or SEQ): Keep the order. Essentially always tries the first one first (the others will be used only if the first one fails)
GCBLOAD: Order by GCB load. All GCBs must support the freesockets query and you must upload the gcb_broker_query binary, too. See below.
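To make the ordering options above more concrete, here is a minimal sketch of how a GCB_LIST value could be ordered for each GCB_ORDER setting. It is not the glidein's actual implementation: the function order_gcbs is hypothetical, the round-robin start index (submission number modulo the list length) is an assumption, and GCBLOAD is left out because it requires querying the brokers.

```python
import random


def order_gcbs(gcb_list, gcb_order, submission_number=0, rng=random):
    """Return the GCBs in the order they would be tried (illustrative only)."""
    gcbs = [g for g in gcb_list.split(",") if g]
    if gcb_order == "NONE" or not gcbs:
        return []                      # GCB is not used at all
    if gcb_order in ("SEQUENTIAL", "SEQ"):
        return gcbs                    # keep the configured order
    if gcb_order in ("ROUNDROBIN", "RR"):
        # Assumption: rotate the list by the job submission number.
        start = submission_number % len(gcbs)
        return gcbs[start:] + gcbs[:start]
    if gcb_order == "RANDOM":
        shuffled = list(gcbs)
        rng.shuffle(shuffled)
        return shuffled
    raise ValueError("unsupported GCB_ORDER for this sketch: %s" % gcb_order)


gcb_list = "10.0.0.1:65432,10.0.0.2,10.0.0.3"  # hypothetical addresses
print(order_gcbs(gcb_list, "SEQ"))
print(order_gcbs(gcb_list, "RR", submission_number=4))
```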
If your GCBs support freesockets queries (v7.0 and above), you most probably want to protect your glideins from failing due to an overloaded GCB. To do that, gcb_broker_query binary needs to be part of the Condor distribution you are using. You also need to decide what is the minimum number of free sockets you are comfortable with:
I would suggest you set it to at least 100, possibly more. Most Condor versions use around 5 sockets per VM (depending on configuration).
You can also specify a default GCB port (defaults to 65432):
(Note that Condor GCB right now does not support any other port number).
Also, for more flexibility, you can let the frontends provide their own GCB servers, by setting publish="True" const="False".
If you are more sophisticated, and want to use GCB routing tables, too, add:
[<entries><entry>]
<files>
<file absfname="path to routing file" relfname="gcb_route.cfg"/>
</files>
<attrs>
<attr name="GCB_REMAP_ROUTE" value="gcb_route.cfg" publish="False" parameter="True"/>
</attrs>
Please be aware that the above will configure the glideins only; you still need to properly configure the Collector and the submit machines.
CCB - Condor Connection Broker
CCB was introduced in Condor v7.3.0
to replace GCB in most circumstances. It is much more reliable than
GCB and also easier to setup.
The detailed description of CCB is
beyond the scope of this manual and you should refer to the Condor
documentation available at
http://www.cs.wisc.edu/condor/manual/v7.3/3_7Networking_includes.html#sec:CCB.
Here you will find only the parameters needed to enable it in the
glideins.
To use Condor with CCB, you need to specify:
[<entries><entry>]
<attrs>
<attr name="USE_CCB" value="True" publish="False" parameter="True"/>
</attrs>
and you are done. Just make sure you follow the suggested scalability guidelines described in the Condor manual.
Security handles
As mentioned in the startup
page, the glidein pool must be properly configured to protect it
from hackers and malicious users. The same page also describes what
needs to be done on the collector machine.
The glidein itself can also be configured. The default configuration
works fine for most users, but you may need to change some of the values.
The values are set using the <attr /> option,
and the default values are:
- SEC_DEFAULT_ENCRYPTION=OPTIONAL
- SEC_DEFAULT_INTEGRITY=REQUIRED
- DELEGATE_JOB_GSI_CREDENTIALS=False
As of version 7.1.3, Condor also supports a more efficient authentication mechanism between the condor_schedd/condor_shadow and condor_startd/condor_starter. This method uses the match ClaimId as a shared password for authentication between these daemons. Since using a shared secret is much cheaper than using GSI authentication, this should be used whenever it is feasible.
This option is enabled by default and is controlled with the <attr /> option:
<attr name="USE_MATCH_AUTH" ... value="True" ... /> ... enabled
<attr name="USE_MATCH_AUTH" ... value="False" ... /> ... disabled
When enabled, the following Condor attribute must be set in the
condor_config of the submit machine:
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True
This option is not used by the Condor negotiator or collector and is therefore
not needed on them if they are installed separately.
Using TCP to send updates to the Collector
By default, Condor uses UDP packets to communicate between the glideins and the Collector. While more efficient than TCP, UDP packets are often blocked at the firewall, or lost on the WAN.
To disable TCP updates, specify, with the <attr/> option:
UPDATE_COLLECTOR_WITH_TCP=False
In glideinWMS, we enable the glideins to update the user collector using TCP by default.
Please be aware that this will configure the glideins only; you
still need to properly configure the Collector machine. See
Condor documentation
for more details.
Multiple Collectors
By default, Condor uses only one Collector for the glidein (user) pool. However, if the load becomes too high on the collector, you can configure multiple collectors in a chain.
You will need a master and a set of slave collectors. Each slave collector will service a portion of the pool and will forward the machine classads it receives from the startds to the master collector. The negotiator and the schedds will talk to the master collector, while the startds will talk to one of the slave collectors. This reduces the load on the central manager.
To set up slave collector in the glidein (user) pool, one way is to set the following env variables before starting up the condor_master:
COLH=`condor_config_val COLLECTOR_HOST`
LD=`condor_config_val LOCAL_DIR`
export _CONDOR_COLLECTOR_HOST=$COLH:
export _CONDOR_MASTER_NAME=collector_
export _CONDOR_DAEMON_LIST="MASTER, COLLECTOR"
export _CONDOR_LOCAL_DIR=$LD/$_CONDOR_MASTER_NAME
export _CONDOR_LOCK=$_CONDOR_LOCAL_DIR/lock
# Forward all the traffic to the main collector
export _CONDOR_CONDOR_VIEW_HOST=$COLH:9618
Once you have the slave collectors set up, you will want to use them.
The VO frontend will have to point the factory to a list of collectors.
The configuration internally will add a line in the factory configuration file that will set up the glideins to handle the multiple collectors. (You should now see a line like: "<file absfname="web_base/collector_setup.sh" executable="True"/>" after reconfiguring).
Setting the glidein start and rank condition
As with any Condor pool, you may need to set the startd start and rank conditions.
For a glidein, you can set this with the <attr/> options:
GLIDEIN_Start=expression
GLIDEIN_Rank=expression
For example:
[<entries><entry>]
<attrs>
<attr name="GLIDEIN_Start" value="Owner==&quot;sfiligoi&quot;" publish="False" parameter="True"/>
<attr name="GLIDEIN_Rank" value="ImageSize" publish="False" parameter="True"/>
</attrs>
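A start expression that compares against a string contains double quotes, which must be escaped when the expression is stored in an XML attribute value. The sketch below shows one way to generate such a tag with Python's standard library so the escaping is handled automatically. The helper make_attr is hypothetical and only sets the attributes used in the example above.

```python
import xml.etree.ElementTree as ET


def make_attr(name, value):
    """Build an <attr> element; ElementTree escapes the value when serializing."""
    return ET.Element("attr", {
        "name": name,
        "value": value,
        "publish": "False",
        "parameter": "True",
    })


start_attr = make_attr("GLIDEIN_Start", 'Owner=="sfiligoi"')
serialized = ET.tostring(start_attr, encoding="unicode")
print(serialized)

# Parsing the serialized tag gives back the original expression.
round_trip = ET.fromstring(serialized).get("value")
print(round_trip)
```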
Internal Configuration
The configuration is parsed during the reconfiguration of the factory, and split into a number of files:
- job.descript = is read by the daemon to decide how to work
- attributes.cfg = are fixed values, these are published in the factory classad
- params.cfg = are for values the frontend will change, also published in the factory classad
Multiple Condor Tarballs
One frequent problem is that one particular condor binary will not run on all compute nodes. Entry points may require different architectures, or have different versions of glibc (e.g. SL3 does not have glibc 2.4).
The solution (only available on glideinWMS v2+) is to have multiple condor binaries. The way to do this is to specify a tarball tag in the factory configuration file.
- Download the Condor binary from the University of Wisconsin site. (Alternatively, you can build it from scratch on the architecture; refer to the Condor instructions for this).
- Put it in a directory owned by the wmsfactory and unzip/untar it.
- Add a new tarball tag to the factory tag:
<glidein ... >
...
<condor_tarballs >
<condor_tarball os="OS" arch="Arch" base_dir="DIR_OF_UNTARRED_BINARY" version="Condor_Version" />
</condor_tarballs>
- Verify your entry point attributes. Each entry point will have the following attrs set up. Make sure that they match the above tarball parameters:
<attrs>
<attr name="CONDOR_ARCH" const="True" glidein_publish="False" parameter="True" publish="False" type="string" value="Arch"/>
<attr name="CONDOR_OS" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="OS"/>
<attr name="GLEXEC_JOB" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="True"/>
</attrs>
The CONDOR_OS and the CONDOR_ARCH should match the os and arch defined in the tarball tag. If set to "auto", the glidein will decide the appropriate tarball to use for that worker node. By default, the CONDOR_VERSION will be defined globally in <glidein><attrs> and should match the version in the tarball tag. You can override this global version and define one locally in the entry if needed.
- Reconfigure the factory using the command:
./factory_startup reconfig ../CONFIG_DIR/glideinWMS.xml
- After reconfig, you can see the tar_file created from the condor distribution in the tarball line in the configuration.
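The steps above can be illustrated with a minimal sketch of a factory configuration that offers two tarballs and an entry that selects one of them. All concrete values here (the os, arch and version strings, the base_dir paths, and the entry name) are hypothetical placeholders, and every unrelated attribute and element is omitted; treat it as an illustration of how the pieces line up, not as a complete configuration.

```xml
<glidein>
   <!-- other attributes and child elements omitted -->
   <condor_tarballs>
      <!-- hypothetical values: one untarred Condor distribution per os/arch pair -->
      <condor_tarball os="rhel5" arch="x86_64" base_dir="/opt/condor/7.4.4-rhel5-x86_64" version="7.4.4"/>
      <condor_tarball os="rhel5" arch="x86" base_dir="/opt/condor/7.4.4-rhel5-x86" version="7.4.4"/>
   </condor_tarballs>
   <attrs>
      <!-- global version; should match the version of the tarball tags -->
      <attr name="CONDOR_VERSION" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="7.4.4"/>
   </attrs>
   <entries>
      <entry name="example_entry">
         <attrs>
            <!-- these two should match the os and arch of exactly one tarball tag above -->
            <attr name="CONDOR_OS" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="rhel5"/>
            <attr name="CONDOR_ARCH" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="x86"/>
         </attrs>
      </entry>
   </entries>
</glidein>
```

In this sketch the entry would pick the second tarball. Setting both CONDOR_OS and CONDOR_ARCH to "auto" instead would leave the choice to the glidein on each worker node.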
Limiting time spent on a Grid resource
The whole concept of gliding into Grid resources is based on the
idea that you are getting those resources on a temporary basis.
This implies that you need to leave the slot as soon as possible,
else your jobs will simply be killed by the annoyed Grid
administrators.
On the other hand, submitting new glideins is not
cost free, so you want to keep the resource for at least some period
of time.
The glideins have two mechanisms to regulate this:
- After a specified amount of time, the glidein will enter the RETIRING state. This means it will wait for the current job to finish (or kill it if it does not end within a configurable timeout) and exit immediately afterwards. No new jobs will start once it has entered that state.
The two timeouts can be set with the <attr /> options:
GLIDEIN_Retire_Time=nr_of_seconds
GLIDEIN_Job_Max_Time=nr_of_seconds
The two default to 2 and 100 hours.
- If a glidein is not claimed within a configurable timeout, the glidein will exit.
The timeout can be set with:
GLIDEIN_Max_Idle=nr_of_seconds
The default is 20 minutes.
An example:
[<entries><entry>]
<attrs>
<attr name="GLIDEIN_Max_Idle" value="300" type="int" publish="False" parameter="True"/>
<attr name="GLIDEIN_Retire_Time" value="14400" type="int" publish="False" parameter="True"/>
<attr name="GLIDEIN_Job_Max_Time" value="180000" type="int" publish="False" parameter="True"/>
</attrs>
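Since all three options take seconds, it can help to see the quoted defaults converted explicitly. The sketch below writes them out as attrs; it assumes that the 2 hour default belongs to GLIDEIN_Retire_Time and the 100 hour default to GLIDEIN_Job_Max_Time (the order in which the options are listed), and it reuses the type, publish and parameter flags from the example above.

```xml
<attrs>
   <!-- 2 hours = 2 * 3600 = 7200 seconds -->
   <attr name="GLIDEIN_Retire_Time" value="7200" type="int" publish="False" parameter="True"/>
   <!-- 100 hours = 100 * 3600 = 360000 seconds -->
   <attr name="GLIDEIN_Job_Max_Time" value="360000" type="int" publish="False" parameter="True"/>
   <!-- 20 minutes = 20 * 60 = 1200 seconds -->
   <attr name="GLIDEIN_Max_Idle" value="1200" type="int" publish="False" parameter="True"/>
</attrs>
```

Read the same way, the earlier example shortens the idle timeout to 5 minutes, sets the retire time to 4 hours, and sets the job timeout to 50 hours.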
Old-style pseudo-interactive monitoring
Since v1_4_1, the pseudo-interactive monitoring uses a dedicated startd in the glideins for monitoring purposes. This allows for monitoring even when the job starter enters the “Retiring” activity.
The side effect is that the cross-VM statistics are no longer available, and the slot names are also different.
To enable the old mode, use:
Adding custom code/files to the glideins
While provided code should cover most of the general purpose use cases, some administrators may have additional needs. For these cases, the glidein creation command adds the options listed below.
Note: Files and subsystems will be downloaded before the scripts. User provided scripts will be executed in the specified order, and before the Condor daemons are started up.
-
<glidein>
[<entries><entry>]
<files>
<file absfname="script name" executable="True" comment="comment"/>
Path to the custom script. The script will be copied to the Web-accessible area, and when a glidein starts, the glidein startup script will pull it and execute it. If any parameters are needed, they can be specified using <attr />, or stored in a file (see below).
For more detailed information, see the page dedicated to writing custom scripts.
-
<glidein>
[<entries><entry>]
<files>
<file absfname="script name" wrapper="True" comment="comment"/>
Path to the custom wrapper script. The script will be copied to the Web-accessible area, and will be sourced just before a user job starts; i.e. it will become part of the user job wrapper.
-
<glidein>
[<entries><entry>]
<files>
<file absfname="local file name" relfname="target file name" const="Bool" executable="False" comment="comment"/>
Path to the config file. The file will be copied to the Web-accessible area, and pulled by the glidein startup script when a glidein starts. It can then be used by any script (see above).
Note: Please be cautious in using the const flag; if set to False, the content of the file will not be verified by the glidein startup script and could be tampered with in transit by a malicious user. So never put sensitive data (like the switch to disable security checks) in a changeable file.
-
<glidein>
[<entries><entry>]
<files>
<file absfname="local file name" untar="True" comment="comment">
<untar_options cond_attr="conf_sw" dir="dir name" absdir_outattr="attr name"/>
</file>
Sometimes it is useful to transfer a whole set of files, or even directories, and that is much easier to accomplish by means of a tarball. A subsystem is the glidein way to describe a compressed tarball that is delivered to the worker nodes, untarred in a separate directory and advertised to the other scripts.
- absfname: Path to the custom tarball. (like "/tmp/mytar_v12.5.tgz")
- conf_sw: Name of a configuration switch. (like "ENABLE_KRB5") The tarball will be unpacked only if that parameter is set to 1. Use the <attr /> switch to define the default value. The special name TRUE can be used to always untar it.
- dir: Name of the subdirectory to untar it in. (like "krb5")
- absdir_outattr: Name of a variable. (like "KRB5_SUBSYS_DIR") The variable will be set to the absolute path of the directory where the tarball was unpacked, if and only if the unpacking actually happened; otherwise it will not be defined.
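Putting the four kinds of <file/> tags together, a files section might look like the sketch below. The tarball path, switch name, subdirectory and output variable are the sample values quoted in the list above; the script and config file paths under /opt/factory_custom are hypothetical, and the flags on the ENABLE_KRB5 attr are copied from the other attr examples on this page rather than taken from a reference.

```xml
<glidein>
   <!-- other attributes and child elements omitted -->
   <attrs>
      <!-- default value of the switch named in cond_attr; 1 means "unpack the tarball" -->
      <attr name="ENABLE_KRB5" value="1" type="int" publish="False" parameter="True"/>
   </attrs>
   <files>
      <!-- custom script (hypothetical path), executed before the Condor daemons start -->
      <file absfname="/opt/factory_custom/check_node.sh" executable="True" comment="hypothetical node check"/>
      <!-- wrapper script (hypothetical path), sourced just before each user job starts -->
      <file absfname="/opt/factory_custom/job_env.sh" wrapper="True" comment="hypothetical job environment setup"/>
      <!-- config file (hypothetical path); const="True" keeps the content verified by the glidein startup script -->
      <file absfname="/opt/factory_custom/check_node.cfg" relfname="check_node.cfg" const="True" executable="False" comment="settings for check_node.sh"/>
      <!-- subsystem tarball, unpacked into krb5 only when ENABLE_KRB5 is 1 -->
      <file absfname="/tmp/mytar_v12.5.tgz" untar="True" comment="Kerberos subsystem">
         <untar_options cond_attr="ENABLE_KRB5" dir="krb5" absdir_outattr="KRB5_SUBSYS_DIR"/>
      </file>
   </files>
</glidein>
```

With this layout, a custom script that needs the unpacked files would look up KRB5_SUBSYS_DIR and should be prepared for the variable to be undefined when the switch is off.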
Grouping glidein Entries for monitoring purposes
Certain monitoring graphs are useful when grouped together. You can use the “monitorgroups” tag as follows to group the entries together.
In the example below, entry1 and entry2 both belong to Group1, so they will be grouped together and the group information can be plotted against individual entry information.
<entries>
<entry name="entry1" ...>
<monitorgroups>
<monitorgroup group_name="Group1"/>
<monitorgroup group_name="Group2"/>
</monitorgroups>
</entry>
<entry name="entry2" ...>
<monitorgroups>
<monitorgroup group_name="Group1"/>
<monitorgroup group_name="Group4"/>
</monitorgroups>
</entry>
</entries>
</glidein>