Installation of the WMS Collector and collocated glidein Factory - By hand

1. Description

The glidein Factory node will be the Condor Central Manager for the WMS, i.e. it will run the Condor Collector and Negotiator daemons, but it will also act as a Condor Submit node for the glidein factory, running Condor schedds used for Grid submission.
On top of that, this node also hosts the glidein factory daemons.

2. Hardware requirements

This machine needs one or two CPUs (one for the Condor daemons, and one for the glidein factory daemons) and a moderate amount of memory (512MB should be enough).
It must be on the public Internet, with at least one port open to the world; all worker nodes will load data from this node through HTTP.
Disk space is needed only for binaries, config files and log files (10GB should be enough).

3. Needed software

A reasonably recent Linux OS (SL4 was used at the time of writing).
The OSG client software.
An HTTP server, like Apache or TUX.
A PostgreSQL server.
The Condor distribution.
The RRDTool package.
The glideinWMS software.

4. Base tools installation instructions

These instructions assume you install Condor v6.9.2 from tarballs, as root.
The install directory is /opt/glidecondor, the working directory is /opt/glidecondor/condor_local, and the machine name is wmsmachine.fnal.gov.
The HTTP documents are assumed to be located in /var/www/html.

If you want to use a different setup, make the necessary changes.

4.1 Install an HTTP server

Most Linux distributions come with Apache RPMs.
On SL4, issue:
yum install httpd

If you are behind a firewall, you will need to set Listen in /etc/httpd/conf/httpd.conf to an open port, for example:
Listen 8256
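
Once the server has been started (see section 5), you can check that it answers on the chosen port; a minimal check, using the port 8256 from the example above:
curl -I http://wmsmachine.fnal.gov:8256/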

4.2 Install RRDTool

Unfortunately, most Linux distributions don't come with an RRDTool RPM available.

There are RPMs for various Linux distributions on http://dag.wieers.com/rpm/packages/rrdtool/.

I have installed them on my SL4 machine with:
wget http://dag.wieers.com/rpm/packages/rrdtool/rrdtool-1.2.18-1.el4.rf.i386.rpm
wget http://dag.wieers.com/rpm/packages/rrdtool/perl-rrdtool-1.2.18-1.el4.rf.i386.rpm
wget http://dag.wieers.com/rpm/packages/rrdtool/python-rrdtool-1.2.18-1.el4.rf.i386.rpm
rpm -i rrdtool-1.2.18-1.el4.rf.i386.rpm perl-rrdtool-1.2.18-1.el4.rf.i386.rpm python-rrdtool-1.2.18-1.el4.rf.i386.rpm

4.3 Install Condor

Follow instructions in the Condor v6.9.2 installation document.

4.4 Configure Condor GSI security

Follow instructions in the Configuring GSI security in Condor document.

The /opt/glidecondor/certs/grid-mapfile must be populated with the WMS pool service proxy DN, as well as the DNs of the glidein factory and the VO frontend.
Assuming the local service proxy DN is "/DC=org/DC=doegrids/OU=Service/CN=wms2", the glidein factory has DN "/DC=org/DC=doegrids/OU=Service/CN=factory3", and you have a VO frontend that uses DN "/DC=org/DC=doegrids/OU=Service/CN=frontend45", the grid-mapfile would contain something like:
"/DC=org/DC=doegrids/OU=Service/CN=wms2" wcondor
"/DC=org/DC=doegrids/OU=Service/CN=factory3" fcondor
"/DC=org/DC=doegrids/OU=Service/CN=frontend45" vcondor
The names you specify after the DNs are not important, but they must differ from one another.
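
If you do not know a DN offhand, you can read it from the corresponding certificate with openssl; for example, assuming the factory certificate is available locally as factorycert.pem (a placeholder path):
openssl x509 -in factorycert.pem -noout -subject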

4.5 Configure Quill

Follow the instructions in the Condor Quill setup document.

4.6 Configure Condor-G settings

The schedd on this machine will be used for Grid submission.

Add the following lines to /opt/glidecondor/etc/condor_config:
###############################
# Condor-G settings
###############################
GRIDMANAGER_LOG = /tmp/GridmanagerLog.$(SCHEDD_NAME).$(USERNAME)
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE=2000
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE=20
MAX_JOBS_RUNNING        = 2000

4.7 Disable startd on this node

Unless you plan to use this node for other purposes too, change DAEMON_LIST in /opt/glidecondor/condor_local/condor_config.local to:
DAEMON_LIST   = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, QUILL

4.8 Setup multiple schedds

You will most probably need multiple schedds if you want to scale.
Configure the schedds schedd_glideins1, schedd_glideins2, schedd_glideins3 and schedd_glideins4 by following the instructions in the Configuring multiple Schedds in Condor document.

4.9 Example config files

If you do not feel confident after following the above instructions, you can find the complete Condor config files in
example-config/glide-factory/mymachine/condor_config
and
example-config/glide-factory/mymachine/condor_config.local.

5. Start base services

If not already running, start the HTTP server with:
/etc/init.d/httpd start

If not already running, start PostgreSQL with:
/etc/init.d/postgresql start
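
If you want these services to come back automatically after a reboot, you can also enable them at boot time (this assumes the standard SysV init scripts installed by the RPMs):
/sbin/chkconfig httpd on
/sbin/chkconfig postgresql on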

To start all the schedulers, you need to run several scripts. To simplify your life, create /opt/glidecondor/start_condor.sh, containing:
#!/bin/bash
/opt/glidecondor/sbin/condor_master
sleep 1
/opt/glidecondor/start_master_schedd.sh glideins1
/opt/glidecondor/start_master_schedd.sh glideins2
/opt/glidecondor/start_master_schedd.sh glideins3
/opt/glidecondor/start_master_schedd.sh glideins4
The same file can be downloaded from example-config/glide-factory/start_condor.sh.

After you make it executable:
chmod a+x /opt/glidecondor/start_condor.sh
just run the new script:
/opt/glidecondor/start_condor.sh
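
The start_master_schedd.sh helper invoked above comes from the Configuring multiple Schedds in Condor document; use the version described there. The sketch below only illustrates the idea, namely overriding the schedd name and local directories through _CONDOR_ environment variables before launching a dedicated condor_master, and it assumes a per-schedd directory layout under /opt/glidecondor/condor_local:
#!/bin/bash
# Illustrative sketch only; assumes the per-schedd local directory
# /opt/glidecondor/condor_local/schedd_$1 already exists.
NAME=$1                                       # e.g. glideins1
LD=/opt/glidecondor/condor_local/schedd_${NAME}
export _CONDOR_SCHEDD_NAME=schedd_${NAME}     # name this schedd advertises
export _CONDOR_MASTER_NAME=${NAME}            # keep this master distinct from the main one
export _CONDOR_DAEMON_LIST="MASTER, SCHEDD"   # this master manages only a schedd
export _CONDOR_LOCAL_DIR=${LD}
export _CONDOR_LOCK=${LD}/lock
/opt/glidecondor/sbin/condor_master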

6. Glidein Factory installation guide

This document describes how to install the glidein factory found in glideinWMS v1_0.
The glidein factory will be installed as user gfactory, and the home directory will be /home/gfactory.
It also assumes your GCB servers are located at IP addresses 132.225.204.208 and 132.225.207.114.
If you want to use a different setup, make the necessary changes.

Unless otherwise stated, all the installation is done as user gfactory.

6.1 Create the necessary base directories

Create the configuration base directory, ~/glidein_submit:
mkdir /home/gfactory/glidein_submit

As root, create the Web base directory /var/www/html/glidefactory hierarchy:
mkdir /var/www/html/glidefactory
mkdir /var/www/html/glidefactory/stage
mkdir /var/www/html/glidefactory/monitor
chown -R gfactory /var/www/html/glidefactory

6.2 Setup the GSI security

Create and maintain a valid x509 proxy in ~/.globus/x509_service_proxy. This proxy must, at any point in time, remain valid for at least as long as the longest job expected to run under the glideinWMS (and never less than 12 hours).
How you keep this proxy valid (via MyProxy, kx509, voms-proxy-init from a local certificate, scp from other nodes, or other methods) is beyond the scope of this document.
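
As one possible approach, if a local service certificate and key are available (the paths below are placeholders), the proxy could be created with grid-proxy-init and refreshed periodically, e.g. from cron:
grid-proxy-init -valid 24:00 -cert /path/to/servicecert.pem -key /path/to/servicekey.pem -out /home/gfactory/.globus/x509_service_proxy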

Set up the environment by adding the following lines to ~/.bashrc:
VDT_BASE=/opt/vdt
export X509_CERT_DIR=$VDT_BASE/globus/TRUSTED_CA
export X509_USER_PROXY=~/.globus/x509_service_proxy
(Remember to exit and re-enter the shell to pick up the new environment.)

Make sure the DN of this proxy is added to the local /opt/glidecondor/certs/grid-mapfile
and to the grid-mapfiles of the glidein pool Collector and the glidein schedulers.
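
To see the exact DN carried by the proxy, so it can be copied into the grid-mapfiles, you can use grid-proxy-info (this assumes X509_USER_PROXY is set as above):
grid-proxy-info -identity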

6.3 Get glideinWMS binaries

In your home directory, check out glideinWMS from CVS:
export CVSROOT=:pserver:anonymous@cdcvs.fnal.gov:/cvs/cd_read_only
cvs co -r v1_2 glideinWMS

6.4 Create the glidein factory configuration directory

To configure the glidein factory, you need to create a configuration directory.
The complete guide can be found in the glideinWMS documentation, but below you can find a workable example.

Create ~/glideinWMS/creation/config_examples/config_v1.xml:
<glidein factory_name="factory1"
         glidein_name="v1_0"
         schedd_name="schedd_glideins1@wmsmachine.fnal.gov,schedd_glideins2@wmsmachine.fnal.gov,schedd_glideins3@wmsmachine.fnal.gov,schedd_glideins4@wmsmachine.fnal.gov">
   <monitor base_dir="/var/www/html/glidefactory/monitor"/>
   <condor base_dir="/opt/glidecondor"/>
   <submit base_dir="/home/gfactory/glidein_submit"/>
   <stage web_base_url="http://wmsmachine.fnal.gov:8256/glidefactory/stage" base_dir="/var/www/html/glidefactory/stage" use_symlink="True"/>
   <attrs>
      <attr name="GLIDEIN_UseGCB" const="True" parameter="False" value="1" publish="True" type="string"/>
      <attr name="GCB_ORDER" const="False" type="string" value="RANDOM" publish="True" parameter="True"/>
      <attr name="GCB_LIST" const="True" type="string" value="132.225.204.208,132.225.207.114" publish="False" parameter="True"/>
      <attr name="SEC_DEFAULT_ENCRYPTION" const="False" type="string" value="REQUIRED" publish="True" parameter="True"/>
      <attr name="Have_CREOS_Software_Version" const="True" type="int" value="12" publish="True" parameter="True" glidein_publish="True"/>
    </attrs>
   <files>
      <file executable="False" untar="False" absfname="/opt/glidecondor/certs/grid-mapfile" const="False"/>
      <file executable="True" untar="False" absfname="web_base/gcb_setup.sh" const="True"/>
   </files>
   <entries>
      <entry name="Ultralight" gridtype="gt2" gatekeeper="cit-gatekeeper.ultralight.org/jobmanager-condor" work_dir="OSG">
         <attrs>
            <attr name="GLIDEIN_Site" const="True" type="string" value="ultralight" publish="True" parameter="True"/>
         </attrs>
      </entry>
      <entry name="Wisconsin" gridtype="gt2" gatekeeper="cmsgrid02.hep.wisc.edu/jobmanager-condor" work_dir="Condor">
         <attrs>
            <attr name="GLIDEIN_Site" const="True" type="string" value="wisc" publish="True" parameter="True"/>
          </attrs>
      </entry>
      <entry name="SanDiego" gridtype="gt2" gatekeeper="osg-gw-2.t2.ucsd.edu/jobmanager-condor" work_dir="Condor">
         <attrs>
            <attr name="GLIDEIN_Site" const="True" type="string" value="sdsc" publish="True" parameter="True"/>
         </attrs>
      </entry>
      <entry name="Nebraska" gridtype="gt2" gatekeeper="red.unl.edu/jobmanager-pbs" work_dir="OSG">
         <attrs>
            <attr name="GLIDEIN_Site" const="True" type="string" value="nebraska" publish="True" parameter="True"/>
         </attrs>
      </entry>
      <entry name="CERN" gridtype="gt2" gatekeeper="ce101.cern.ch/jobmanager-lcglsf" work_dir=".">
         <attrs>
            <attr name="GLIDEIN_Site" const="True" type="string" value="cern" publish="True" parameter="True"/>
         </attrs>
      </entry>
   </entries>
</glidein>
The same file can be downloaded from example-config/glide-factory/config_v1.xml.

Now run create_glidein:
cd ~/glideinWMS/creation
./create_glidein config_examples/config_v1.xml
# Output
#Created glidein 'v1_0'
#Submit files can be found in /home/gfactory/glidein_submit/glidein_v1_0
#Support files are in /var/www/html/glidefactory/stage/glidein_v1_0
#Monitoring files are in /var/www/html/glidefactory/monitor/glidein_v1_0

6.5 Start the glidein factory

Decide which configuration directory you want to use, how often the glidein factory should query the WMS collector, and how often it should advertise itself, then execute ~/glideinWMS/factory/glideFactory.py.

For example, if you want to use the above configuration, query the collector once a minute and advertise every 5 minutes, you would run:
cd ~/glideinWMS/factory/
./glideFactory.py /home/gfactory/glidein_submit/glidein_v1_0 >/dev/null 2>&1 </dev/null&
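
To verify that the factory came up, you can look for the process and follow the per-entry logs described in section 7.3, for example:
ps -ef | grep glideFactory
tail /home/gfactory/glidein_submit/glidein_v1_0/entry_SanDiego/log/factory_info.*.log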

7 Glidein Factory monitoring

There are several ways to monitor the entry points of the glidein factory:

7.1 Glidein factory entry Web monitoring

You can either monitor the factory as a whole, or just a single entry point.

In the above example, the factory monitoring is located at
http://wmsmachine.fnal.gov:8256/glidefactory/monitor/glidein_v1_0/
Moreover, each entry point has its own history on the Web.

For example, the SanDiego entry can be monitored at
http://wmsmachine.fnal.gov:8256/glidefactory/monitor/glidein_v1_0/entry_SanDiego/

7.2 Glidein factory monitoring via WMS tools

You can get the equivalent of the Web page snapshot by using ~/glideinWMS/tools/wmsXMLView.py:
cd ~/glideinWMS/tools/
python wmsXMLView.py

7.3 Glidein factory entry log files

The glidein factory writes two log files per entry point: factory_info.YYYYMMDD.log and factory_err.YYYYMMDD.log.

In the above example, the SanDiego log files are in
/home/gfactory/glidein_submit/glidein_v1_0/entry_SanDiego/log

All errors are reported in the factory_err.YYYYMMDD.log file, while factory_info.YYYYMMDD.log contains entries about what the factory is doing.
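
A quick way to look for recent problems across all entry points is to grep the error logs, for example:
grep -i error /home/gfactory/glidein_submit/glidein_v1_0/entry_*/log/factory_err.*.log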

7.4 Glidein output

Each glidein creates two files on exit: job.ID.out and job.ID.err.

In the above example, the SanDiego glidein output files are in
/home/gfactory/glidein_submit/glidein_v1_0/entry_SanDiego/log

Problems are usually reasonably easy to spot.
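
For example, to list the most recent glidein error outputs from the SanDiego entry:
ls -lt /home/gfactory/glidein_submit/glidein_v1_0/entry_SanDiego/log/job.*.err | head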

7.5 Glidein factory ClassAds in the WMS Collector

The glidein factory also advertises summary information in the WMS collector.

Use condor_status:
condor_status -any
and look for glidefactory and glidefactoryclient ads.
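
If you want to list only these ads, condor_status accepts a constraint; for example (this assumes the ads carry MyType values matching the names above):
condor_status -any -constraint 'MyType=="glidefactory"'
condor_status -any -constraint 'MyType=="glidefactoryclient"'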

