Glidein based WMS

Description

The glideinWMS package holds the Condor glidein based Workload Management System (WMS).

Index

Introduction

This document describes the glidein based WMS and it's architecture in details. Most of the components of glideinWMS described in this manual can be installed by the installation tool provided with the glideinWMS. Please refer to the installation guide for more details.

The glidein mechanism

As the title says, this WMS is based on glideins. But what are the glideins?

A glidein is, simply put, a properly configured Condor Startd submitted as a Grid job.
Once the glidein starts on a worker node, it will join the the specified Condor pool, making the obtained Grid-batch slot a slot in the Condor pool. At this point, a regular Condor job can start there as if was a dedicated resource. See the picture below for a schematic overview.

Condor glideins are a convenient way to expand a Condor pool. Apart from the task of submitting new glideins as needed, everything else stays the same as in a dedicated pool.

WMS Overview

The glidein based WMS is made of two pieces:

The two pieces perform each a part of the job:

The two pieces communicate via a common Condor Collector; see the image below for a schematic view.

A single WMS (identified by the Condor Collector) can serve multiple frontends via multiple factories. It is a truly many-to-many system. However, the current implementation does not support prioritization between different frontends, so it is best to have only one frontend; multiple factories will instead work as desired.

More details can be found

Data exchange overview

In the previous section we described the general architecture of the WMS. Let now have a look at the type of information that the two pieces publish.

The Glidein Factory Class-Ad

A factory publishes one class-ad per glidein entry. The class-ad contains

Look at the picture below for a generic description.



Please notice that the glidein attributes can be completely arbitrary; the only predefined attributes are the glidein and factory name, plus the convention that anything that starts with GlideinParam is a parameter and anything that starts with GlideinMonitor is a monitoring attribute.

Once the factory starts serving frontends, it will publish also another ClassAd for every frontend served. This ClassAd contains only monitoring information, and is not used by the glideinWMS itself.
Find below a graphical representation of these ClassAds.

The VO Frontend Class-Ad

A VO frontend will obtain the list all available glideins and select the ones that fit its needs, based on the published attributes. For each fitting glidein, a frontend class-ad will be published. Such a class-ad will contain

If encryption is used, the the encrypted identity must must match the AuthenticatedIdentity attribute inserted by the condor collector (needs Condor version 7.3.1 or better)

Have a look at the picture below for a generic description.

The most important parameters that the VO fronted sends to the factory are:

The picture below shows this in a graphics format.

In the current implementation, the only glidein rate setting parameters supported are ReqIdleGlideins, that tells the factory how many idle glideins to keep in the queue at any given time, and ReqMaxRunningGlideins, that tells the factory to stop submitting new glideins when the number of running glidiens reaces that level. Future versions may contain more sophisticated controls, like the maximum number of glideins to keep in the queue or the maximum rate at which the glideins should be submitted.

Implementation details

Due to the amount of material available, the documentation is being kept

The Condor pool

To make a working Glidein-based WMS, you will need to install one or more Condor pools:

If you want, you can merge several of them together, but at least logically, they are separate, and I would strongly suggest to keep them separate unless you have just a small system.

Installing and properly configuring a Condor pool is (unfortunately) a complex process, and this manual will not attempt to describe all the glory details. In this manual, it is assumed you have installed a Condor pool already (see http://www.cs.wisc.edu/condor/ for more information) and only the WMS specific options are discussed.

Most configuration changes needed by the Glidein-based WMS regard security; you definitely don't want just anybody to mess with you pool!

The WMS pool

The WMS pool can use any authentication mechanism you prefer; GSI (i.e. x509), Kerberos, etc., all work fine for this purpose. There is also nothing to hide, so encryption can be turned off. However, you do want to make sure that the information is not tampered with, so you should enable the integrity checks!

If you happen to know the IP addresses of all the factories and all the frontends, it is also good practice to limit the ability to submit new requests to only those nodes ( + obviously the Collector machine).

Here is a prototype of the security config:

SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = <Your choice here>
SEC_DEFAULT_INTEGRITY = REQUIRED
HOSTALLOW_WRITE = machine1,machine2, ... ,machineN

For more information about GSI, look at the configuration of the glidein (user) pool. For all other kinds, please follow the instructions in the Condor manual.

To ease monitoring and debugging, it is also useful to allow unauthenticated read of the information; as said before, there is nothing to hide. However, be sure you enable this only for the Collector, and require the factory and the frontend to authenticate at all times, since this is needed for the integrity checks to work.

The lines to add, on the Collector machine, are:

SEC_READ_AUTHENTICATION = OPTIONAL
SEC_CLIENT_AUTHENTICATION = OPTIONAL
SEC_READ_INTEGRITY = OPTIONAL
SEC_CLIENT_INTEGRITY = OPTIONAL

Last, please notice that there is no need for a Negotiator in this pool. All the matching is done by the factory and frontend software itself. (An no need for Startds either.)

The glidein (user) pool

The glidein pool relies heavily on GSI, also known as x509, authentication. The reason for this is due to the nature of glideins; the Condor Startd is a Grid job. And thus have a x509 proxy available at startup.
So the glidein Condor pool must be set up to support at least GSI; other methods can be used too, to support your local user job submission policies.

Regarding other security features, integrity checks should be enabled and encryption should be left as an option for the transfer of sensitive information in user jobs.

Here is a prototype of the security config:

SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = <local methods>, GSI
SEC_DEFAULT_INTEGRITY = REQUIRED
SEC_DEFAULT_ENCRYPTION = OPTIONAL

The GSI authentication needs some other parameters to be enabled, too. See the Condor manual for the full details.

First of all, you must tell Condor what kind of proxies you will trust; i.e. you must provide it with the public keys of all the trusted Certification Authorities (CAs).

Next, you have to tell condor how it will authenticate with the other parties. You can use either a host certificate/key pair or a regular proxy. The glidein will always use a proxy, but all the other Condor daemons (Collector, Negotiator and Schedd) can use a host certificate if they are running as root and the machine has the host certificates installed.  Please note that the two methods are equivalent, as Condor will extract, as needed, a proxy from the certificate using the provided key.

Last but not least, you must specify who do you trust. A sensitive sysadmin will want only the known DNs (Distinguished Names, i.e. the text representation of the proxy identity) to be allowed to impersonate Condor daemons. There are two places where this has to be set:

A prototype of the above configuration is:

GSI_DAEMON_DIRECTORY = /opt/condor/certs
# Use a Proxy
GSI_DAEMON_PROXY = $(GSI_DAEMON_DIRECTORY)/x509_service_proxy
# Use a Certificate/Key pair
#GSI_DAEMON_CERT = $(GSI_DAEMON_DIRECTORY)/hostcert.pem
#GSI_DAEMON_KEY = $(GSI_DAEMON_DIRECTORY)/hostkey.pem
GRIDMAP=$(GSI_DAEMON_DIRECTORY)/grid-mapfile
GSI_DAEMON_NAME= /DC=org/DC=doegrids/OU=Robots/CN=FNAL CMSWMS Head,\
 /DC=org/DC=doegrids/OU=Robots/CN=FNAL CMSWMS Factory1,\
 /DC=org/DC=doegrids/OU=Robots/CN=FNAL CMSWMS Frontend1


Last, to ease monitoring and debugging, it is also useful to allow unauthenticated read of the information, unless you have strict privacy requirements in your environment.

To enable it, add the following lines:

SEC_READ_AUTHENTICATION = OPTIONAL
SEC_CLIENT_AUTHENTICATION = OPTIONAL
SEC_READ_INTEGRITY = OPTIONAL
SEC_CLIENT_INTEGRITY = OPTIONAL

As of Condor version 7.1.3 condor also supports a more efficient authentication mechanism between the condor_schedd/condor_shadow and condor_startd/condor_starter. This method uses the match ClaimId as a shared password for authentication between these daemons. Since using a shared secret is much cheaper that using GSI authentication, this should be used every time it is feasible.

To enable it, add the following lines:
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True

Use this option for the schedd only; do not add this option to either the negotiator or collector as it will not work.

Obviously, this will only work with Condor versions 7.1.3 and above.


Please note that you do not need to worry about glidein security configuration at a global level; the glidein factory will make sure the settings are correct in the glideins.


Monitoring

The glideinWMS comes with 4 types of monitoring facilities:

Standard condor monitoring

The glideinWMS creates a standard condor pool, so users can monitor their jobs using the standard Condor tools, like condor_q and condor_status.

The administrator can also install graphical tools like the Condor View.

Pseudo interactive monitoring

The glideinWMS also provides pseudo interactive monitoring functionalities; a user can run short lived commands alongide any already running job in the queue.

To run a pseudo interactive command, move into

glideinWMS/tools

and run

./glidein_interactive.py jobid cmdline

There are also a set of useful often used commands that you may want to use:

Glidein Factory Web monitoring

The glidein factory also produces a large amount of web accessible information.

Look into

http://gfactory_node:http_port/glidefactory/monitor/glidein_factory_name/

For most users, start at

http://gfactory_node:http_port/glidefactory/monitor/glidein_factory_name/factoryStatus.html

and go on from there.

VO frontend Web monitoring

The Vo frontend also produces a moderate amount of web accessible information.

Look into

http://frontend_node:http_port/vofrontend//monitor/frontend_frontend_name/

For most begginner users, start at

http://frontend_node:http_port/vofrontend/monitor/frontend_frontend_name/frontendStatus.html

and go on from there.

glideinWMS text monitoring

A system administrator can also monitor the glideinWMS status using text tools.

Move into

glideinWMS/tools

and run either

./wmsXMLView.py

or

./wmsTxtView.py [Entries|Sites|Gatekeepers]



Repository

CVSROOT

cvsuser@cdcvs.fnal.gov:/cvs/cd

Package(s)

glideinWMS

Author(s)

Since Aug. 14th - Igor Sfiligoi (Fermilab Computing Division)

glideinWMS Support:

glideinwms-support@fnal.gov