GlideinWMS - The Glidein-based Workflow Management System

Overview

Introduction

Corral is a lightweight frontend for GlideinWMS. It is useful for individual users, for running across national cyberinfrastructures such as the OSG and TeraGrid, and as a companion tool for running Pegasus workflows.

Although much work has been done in developing the national cyberinfrastructure in support of science, there is still a gap between the needs of the scientific applications and the capabilities provided by the resources. Leadership-class systems are optimized for highly parallel, tightly coupled applications. Many scientific applications, however, are composed of a large number of loosely coupled individual components, many with data and control dependencies. Running these complex, many-step workflows robustly and easily still poses difficulties on today's cyberinfrastructure. One effective solution that allows applications to efficiently use the current cyberinfrastructure is resource provisioning using Condor glideins.

GlideinWMS was initially developed to meet the needs of the CMS (Compact Muon Solenoid) experiment at the Large Hadron Collider (LHC) at CERN. It generalizes a Condor glidein system developed for CDF (the Collider Detector at Fermilab) and first deployed in production in 2003. It has been in production across the Worldwide LHC Computing Grid (WLCG), with major contributions from the Open Science Grid (OSG), in support of CMS for the past two years, and has recently been adopted for user analysis. GlideinWMS is also currently used by the CDF, DZero, and MINOS experiments, and services the NEBioGrid and Holland Computing Center communities. GlideinWMS has been used in production with more than 12,000 concurrently running jobs; the CMS use alone totals over 45 million hours.

Corral, a tool developed to complement the Pegasus Workflow Management System, was built to meet the needs of workflow-based applications running on the TeraGrid. It is used today by the Southern California Earthquake Center (SCEC) CyberShake application. In a period of 10 days in May 2009, SCEC used Corral to provision a total of 33,600 cores and used them to execute 50 workflows, each containing approximately 800,000 application tasks, which corresponded to 852,120 individual jobs executed on the TeraGrid Ranger system. The roughly 50-fold reduction from the number of workflow tasks to the number of jobs is due to job-clustering features within Pegasus designed to improve overall performance for workflows with short-duration tasks.

The integrated system provides a robust and scalable resource provisioning service that supports a broad set of domain application workflow and workload execution environments. The aim is to integrate and enable these services across local and distributed computing resources, the major national cyberinfrastructure providers (Open Science Grid and TeraGrid), and emerging commercial and community cloud environments.

Funded by the National Science Foundation under the OCI SDCI program, grant #0943725.

Installation

There are two parts to a fully working system: a frontend and a factory. The frontend in this case is Corral, whereas the factory can either be installed locally, or you can use the hosted version at UCSD.

Installing the frontend is done by downloading the Corral frontend tarball, untarring it, and setting the CORRAL_HOME environment variable to point to the new install. Also add $CORRAL_HOME/bin to your PATH. The host you are deploying on should have GSI set up (host certificate, grid-mapfile, ...).
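
A minimal sketch of these steps, assuming a hypothetical tarball name corral-frontend.tar.gz (the actual file name depends on the release you download):

$ tar xzf corral-frontend.tar.gz
$ export CORRAL_HOME=$PWD/corral-frontend
$ export PATH=$CORRAL_HOME/bin:$PATH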

Configuration of submit host

The submit host has to be set up to accept the glideins registering with it. This requires enabling GSI as an authentication method, which is fairly straightforward if the machine is not already part of a bigger pool. Replace the certificate DNs with the ones from your host and user certificates. Example condor_config.local file:

############################################################
## GSI Security config
############################################################

############################
# Authentication settings
############################
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_READ_AUTHENTICATION = OPTIONAL
SEC_WRITE_AUTHENTICATION = REQUIRED
SEC_CLIENT_AUTHENTICATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION_METHODS = FS,GSI
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True
DENY_WRITE = anonymous@*
DENY_ADMINISTRATOR = anonymous@*
DENY_DAEMON = anonymous@*
DENY_NEGOTIATOR = anonymous@*
DENY_CLIENT = anonymous@*

############################
# Privacy settings
############################
SEC_DEFAULT_ENCRYPTION = OPTIONAL
SEC_DEFAULT_INTEGRITY = REQUIRED
SEC_READ_INTEGRITY = OPTIONAL
SEC_CLIENT_INTEGRITY = OPTIONAL
SEC_READ_ENCRYPTION = OPTIONAL
SEC_CLIENT_ENCRYPTION = OPTIONAL

# Grid Certificate directory
GSI_DAEMON_TRUSTED_CA_DIR=/etc/grid-security/certificates

############################
# Set daemon cert location
############################
GSI_DAEMON_DIRECTORY = /etc/condor

############################
# Credentials
############################
GSI_DAEMON_CERT = /etc/grid-security/hostcert.pem
GSI_DAEMON_KEY  = /etc/grid-security/hostkey.pem

#################################
# Where to find ID->uid mappings
#################################
CERTIFICATE_MAPFILE=/etc/condor/condor_mapfile

#####################################
# Add whitelist of condor daemon DNs
#####################################
GSI_DAEMON_NAME=/DC=org/DC=doegrids/OU=Services/CN=yggdrasil.isi.edu,/DC=org/DC=doegrids/OU=People/CN=Mats Rynge 274484

#####################################################
# With strong security, do not use IP based controls
#####################################################
HOSTALLOW_WRITE = *
ALLOW_WRITE = $(HOSTALLOW_WRITE)

#####################################
# Limit session caching to ~14h
#####################################
SEC_DAEMON_SESSION_DURATION = 50000

You also need a Condor mapfile (/etc/condor/condor_mapfile in this case). It maps the trusted daemon and user DNs to local identities, maps any other GSI subject to anonymous, and maps file-system authentication to the local username:


GSI "/DC=org/DC=doegrids/OU=Services/CN=yggdrasil.isi.edu" condor001
GSI "/DC=org/DC=doegrids/OU=People/CN=Mats Rynge 274484" condor003
GSI (.*) anonymous
FS (.*) \1
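
After the configuration and mapfile are in place, have the running daemons re-read their configuration and double-check the result; these are standard Condor commands:

$ condor_reconfig
$ condor_config_val SEC_DEFAULT_AUTHENTICATION_METHODS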

If the glideins do not show up in the pool, check the CollectorLog for problems.
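
For example, you can tail the log while the glideins start up; the log location can be queried from the Condor configuration:

$ tail -f $(condor_config_val COLLECTOR_LOG)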

Usage Examples

Provisioner Description

To describe your glidein needs to Corral, an XML configuration file is needed. Example:

<corral-request>

    <local-resource-manager type="condor">
        <main-collector>cwms-corral.isi.edu:9620</main-collector>
        <job-owner>testuser</job-owner>

        <!-- alias for the site - make this match your Pegasus site catalog -->
        <pegasus-site-name>Firefly</pegasus-site-name>
    </local-resource-manager>

    <remote-resource type="glideinwms">

        <!-- get these values from the factory admin -->
        <factory-host>cwms-factory.isi.edu</factory-host>
        <entry-name>UNL</entry-name>
        <security-name>corral_frontend</security-name>
        <security-class>corral003</security-class>

        <!-- project is required when running on TeraGrid -->
        <project-id>TG-...</project-id>

        <min-slots>0</min-slots>
        <max-slots>1000</max-slots>

        <!-- number of glideins to submit as one GRAM job -->
        <chunk-size>1</chunk-size>

        <max-job-walltime>600</max-job-walltime>

        <!--  List of entries for the grid-mapfile for the glideins. Include the daemon
              certificate of the collector, and the certificate of the user submitting the glideins.  -->
        <grid-map>
            <entry>"/DC=org/DC=doegrids/OU=People/CN=TestUser 001" condor001</entry>
            <entry>"/DC=org/DC=doegrids/OU=Services/CN=cwms-corral.isi.edu" condor002</entry>
        </grid-map>
    </remote-resource>

</corral-request>

Submitting the Request

Once you have created the request XML file, you can submit it to Corral:

$ corral create-provisioner -h cwms-corral.isi.edu -f firefly.xml

Listing Provisioners

$ corral list-provisioners -h cwms-corral.isi.edu

Removing Provisioners

$ corral remove-provisioner -h cwms-corral.isi.edu
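
Note that, depending on the Corral release, remove-provisioner may also expect the id of the provisioner to remove, as reported by list-provisioners. The example below is an assumption about that syntax, not confirmed behavior; consult the documentation for your version:

$ corral remove-provisioner -h cwms-corral.isi.edu 42   # "42" is a hypothetical provisioner id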