GlideinWMS - The Glidein-based Workflow Management System

WMS Factory

Monitoring

Glidein Factory monitoring

Monitoring is an essential part of any service: it is needed both to maintain the health of the system and to tune it for anticipated growth. The various ways you can monitor a Glidein Factory are described below.

Log files

The Factory Daemon, the Entry Daemons, HTCondor-G and the Glideins all write extensive log files. The logs are kept for a week and then deleted.

Log file locations in GlideinWMS v2.3.x and earlier:

The Glidein Factory Daemon log files are located in

<glidein directory>/log/factory_info.<date>.log
<glidein directory>/log/factory_err.<date>.log

Each Entry Daemon has its log files in

<glidein directory>/entry_<entry name>/log/factory_info.<date>.log
<glidein directory>/entry_<entry name>/log/factory_err.<date>.log

For each client an Entry Daemon is serving, one HTCondor-G job log is used

<glidein directory>/entry_<entry name>/log/condor_activity_<date>_<client-name>.log

Each Glidein also writes a couple of log files that are transferred back to the factory node after the glidein terminates. The log files are named:

<glidein directory>/entry_<entry name>/log/job.<condor-g job nr>.out
<glidein directory>/entry_<entry name>/log/job.<condor-g job nr>.err

The Glidein .out files are readable with any text editor, while the .err files contain the compressed logs of the HTCondor daemons.
Use the following commands to extract that information in plain text format:

glideinWMS/factory/tools/cat_MasterLog.py <err_fname>
glideinWMS/factory/tools/cat_StartdLog.py <err_fname>
glideinWMS/factory/tools/cat_StarterLog.py <err_fname>
glideinWMS/factory/tools/cat_StartdHistoryLog.py <err_fname>
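
For example, assuming a SanDiego entry and a glidein whose HTCondor-G job number was 1234.0 (both values are just placeholders), the startd log could be extracted with

glideinWMS/factory/tools/cat_StartdLog.py <glidein directory>/entry_SanDiego/log/job.1234.0.err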

Note: If you need HTCondor log files from a still-running glidein, use the following HTCondor command:

<condor dir>/sbin/condor_fetchlog -pool <pool collector> <glidein slot name> -startd MASTER|STARTD|STARTER
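
As a hypothetical illustration (the collector host and slot name below are placeholders, condor_fetchlog is assumed to be in the PATH, and the exact argument order may vary between HTCondor versions, so check the condor_fetchlog manual for your installation), fetching the STARTER log of a running glidein could look like

condor_fetchlog -pool wms-collector.my.org -startd glidein_12345@worker01.site.edu STARTER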

The Entry Daemons also summarize the information about completed glideins into

<glidein directory>/entry_<entry name>/log/completed_jobs_<date>.log
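
Assuming, as is usually the case, that this summary file keeps one entry per line per completed glidein (worth verifying on your installation), a quick count of the glideins completed on a given day is simply

wc -l <glidein directory>/entry_<entry name>/log/completed_jobs_<date>.log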

Log file locations in GlideinWMS v2.4 and later:

With the introduction of privilege separation in GlideinWMS, the location of the log files has changed, although a link to the log directory is still maintained from the <glidein directory>. The location of the log files is controlled through the configuration. GlideinWMS uses the condor_switchboard to control access to the log directories, which makes the deployment more secure.

Glidein Factory entry Web monitoring

You can either monitor the factory as a whole, or just a single entry point.

The factory monitoring is located at a URL like the one below (starting with the hostname and ending with the instance name):

http://gfactory1.my.org/glidefactory/monitor/glidein_v1_0/

Moreover, each entry point has its own monitoring page on the Web.

Assuming you have a SanDiego entry, it can be monitored at

http://gfactory1.my.org/glidefactory/monitor/glidein_v1_0/entry_SanDiego/

Historical Web monitoring

The Entry Point Daemons also create RRD databases and associated graphs covering a period of up to one year. This way, one can easily monitor the evolution of the system.
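
The RRD files themselves are stored under the monitoring web area. Assuming one of them sits at a path like the one below (the directory layout and file name are only an illustration and may differ on your installation), the standard rrdtool utility can be used to inspect or export the stored data

rrdtool info <web base dir>/glidefactory/monitor/glidein_v1_0/entry_SanDiego/total/Status_Attributes.rrd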

Glidein factory monitoring via WMS tools

You can get the equivalent of a Web page snapshot by using

cd glideinWMS/tools/
python wmsXMLView.py

Glidein factory entry log files

By default the glidein factory writes two log files per entry point: factory_info.YYYYMMDD.log and factory_err.YYYYMMDD.log.

Assuming you have a SanDiego entry, the log files are in the directory

/home/gfactory/glidein_submit/glidein_v1_0/entry_SanDiego/log

All errors are reported in the factory_err.YYYYMMDD.log file, while factory_info.YYYYMMDD.log records what the factory is doing.
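
For day-to-day checks the standard text tools are sufficient; for example, to follow the factory's current activity for the SanDiego entry (the date in the file name is a placeholder)

tail -f /home/gfactory/glidein_submit/glidein_v1_0/entry_SanDiego/log/factory_info.20110111.log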

Glidein output

Each glidein creates two files on exit: job.ID.out and job.ID.err.

Assuming you have a SanDiego entry, the log files are in

/home/gfactory/glidein_submit/glidein_v1_0/entry_SanDiego/log

Problems are usually reasonably easy to spot.

Glidein factory ClassAds in the WMS Pool

The glidein factory also advertises summary information in the WMS Pool collector.

Use condor_status:

condor_status -any

and look for glidefactory and glidefactoryclient ads.
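
To narrow the output down to the factory ads only, a constraint on the ad type can be added; the query below assumes the ads are published with MyType set to "glidefactory", which is the case in recent GlideinWMS versions but is worth verifying on your pool

condor_status -any -constraint 'MyType=="glidefactory"'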

Looking at ClassAds

As explained in the Data exchange overview, the Entry Point Daemons expose a lot of monitoring information in the ClassAds sent to the WMS collector. While this may not be the most user-friendly interface, most of the monitoring information you'll ever need is present there.

On top of the HTCondor-provided tools (e.g. condor_status), the factory provides two tools to look at the ClassAds; the first returns a human-readable but limited text view, while the other provides complete XML-formatted output

glideinWMS/tools/wmsTxtView.py [Entries|Sites|Gatekeepers]
glideinWMS/tools/wmsXMLView.py

analyze_entries

This tool, found in factory/tools/analyze_entries, lists many statistics useful for factory administrators: the total time used by glideins, the number of glideins submitted, the number of jobs run, and the efficiency of the glideins.

The utility can sort by frontend, by entry, or by attributes such as most time used.

The command line arguments are shown below:
--source ABSPATH    Required. Unless you are running this in the factory directory, you must specify the absolute path to the factory directory.
-x                  Interval to look at (in hours). Valid options are 2, 24, or 168.
-p                  Look at all the above intervals.
-f FRONTEND         Filter by frontend. You can omit "frontend_" at the beginning of the name.
-m                  Emphasize frontend data, no entries shown. By default, slots are shown. You can use -ms to show seconds or -mh to show hours.
-msort ATTRIBUTE    Frontend mode, sort by attribute (strt, fval, 0job, val, idle, wst, badp, waste, time, total).
-h                  Show usage message.

Below is an example of usage:

./analyze_entries --source /opt/wmsfactory/glidein_wmsfactory_instance -x 168

Glidein log analysis for All Entries - 01-11-2011_13:42:28

----------------------------------------
Past 168.0 hours:

Total Glideins: 78
Total Jobs: 57 (Average jobs/glidein: 0.73)

Total time:             66.6K (  18.5 hours -   0.1 slots)
Total time used:        29.5K (   8.2 hours -   0.0 slots - 44%)
Total time validating:   4.4K (   1.2 hours -   0.0 slots -  6%)
Total time idle:        32.4K (   9.0 hours -   0.1 slots - 48%)
Total time wasted:      37.1K (  10.3 hours -   0.1 slots - 55%)
Time used/time wasted: 0.8
Time efficiency: 0.44


---------------------------------------
---------------------------------------
Per Entry (all frontends) stats for the past 168 hours.

                                         strt fval 0job |  val idle  wst badp |  waste    time   total
ss_ITB_GRATIA_TEST_1                       2%   4%  37% |   6%  48%  55%  92% |     10      18  |   78

---------------------------------------
---------------------------------------
Per Entry (per frontend) stats for the past 168 hours.

----------------------------------------
frontend_vofrontend_service_frontend:

Glideins: 78 - 100.0% of total
Jobs: 57 (Average jobs/glidein: 0.73)

time:             66.6K (  18.5 hours -   0.1 slots)
time used:        29.5K (   8.2 hours -   0.0 slots - 44%)
time validating:   4.4K (   1.2 hours -   0.0 slots -  6%)
time idle:        32.4K (   9.0 hours -   0.1 slots - 48%)
time wasted:      37.1K (  10.3 hours -   0.1 slots - 55%)
Time used/time wasted: 0.8
Time efficiency: 0.44

                                         strt fval 0job |  val idle  wst badp |  waste    time   total
ss_ITB_GRATIA_TEST_1                       2%   4%  37% |   6%  48%  55%  92% |     10      18  |   78

-----------------------------------
LEGEND:

K - kiloseconds (*1,000 seconds)
M - megaseconds (*1,000,000 seconds)

strt - % of jobs where condor failed to start
fval - % of glideins that failed to validate (hit 1000s limit)
0job - % of glideins that ran 0 jobs
----------
val - % of time used for validation
idle - % of time spend idle
wst - % of time wasted (Lasted - JobsLasted)
badp - % of badput (Lasted - JobsGoodput)
----------
waste - wallclock time wasted (hours) (Lasted - JobsLasted)
time - total wallclock time (hours) (Lasted)
total - total number of glideins
-------------------------------------
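
The options documented above can be combined; for example, to restrict the report to a single frontend over the past 24 hours, emphasizing frontend data sorted by wasted time (the factory path and frontend name are placeholders taken from the example above), one could run

./analyze_entries --source /opt/wmsfactory/glidein_wmsfactory_instance -x 24 -f vofrontend_service_frontend -m -msort waste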

analyze_queues

This tool, found in factory/tools/analyze_queues, lists many statistics useful for factory administrators. It shows the number of slots in running, held, and idle status, and can display this information per entry or per frontend.

--source ABSPATH    Required. Unless you are running this in the factory directory, you must specify the absolute path to the factory directory.
-x                  Interval to look at (in hours). Valid options are 2, 24, or 168.
-p                  Look at all the above intervals.
-f FRONTEND         Filter by frontend. You can omit "frontend_" at the beginning of the name.
-z                  Zero suppression (don't show entries with 0s across all attributes).
-m                  Emphasize frontend data, no entries shown. By default, slots are shown. You can use -ms to show seconds or -mh to show hours.
-msort ATTRIBUTE    Frontend mode, sort by attribute (strt, fval, 0job, val, idle, wst, badp, waste, time, total).
-h                  Show usage message.

Example usage:

./analyze_queues --source /opt/wmsfactory/glidein_wmsfactory_instance -x 168

Status Attributes analysis for All Entries - 01-11-2011_15:10:22

----------------------------------------
Past 168.0 hours:

Status Running:   71.0K (  19.7 hours -   0.1 slots - 62%)
Status Held:        0.0 (   0.0 hours -   0.0 slots -  0%)
Status Idle:      19.4K (   5.4 hours -   0.0 slots - 17%)
Status Unknown:     0.0 (   0.0 hours -   0.0 slots -  0%)

Status Pending:   19.4K (   5.4 hours -   0.0 slots - 17%)
Status Wait:        0.0 (   0.0 hours -   0.0 slots -  0%)
Status Stage In:    0.0 (   0.0 hours -   0.0 slots -  0%)
Status Stage Out:  3.9K (   1.1 hours -   0.0 slots -  3%)

RunDiff (Running-ClientRegistered):  23.1K (   6.4 hours -   0.0 slots - 32%)
IdleDiff (Idle - RequestedIdle):     -6.1K (  -1.7 hours -  -0.0 slots)


---------------------------------------
---------------------------------------
Per Entry (all frontends) stats for the past 168 hours.

                                            Run    Held    Idle  Unknwn | Pending    Wait   StgIn  StgOut |  RunDiff IdleDiff   %RD

ss_ITB_GRATIA_TEST_1                      71.0K     0.0   19.4K     0.0 |   19.4K     0.0     0.0    3.9K |    23.1K    -6.1K   32%


---------------------------------------
---------------------------------------
Per Entry (per frontend) stats for the past 168 hours.

frontend_vofrontend_service_frontend        Run    Held    Idle  Unknwn | Pending    Wait   StgIn  StgOut |  RunDiff IdleDiff   %RD

ss_ITB_GRATIA_TEST_1                      71.0K     0.0   19.4K     0.0 |   19.4K     0.0     0.0    3.9K |    23.1K    -6.1K   32%

-----------------------------------
LEGEND:

K = x   1,000
M = x 1,000,000

Run : Status Running
Held : Status Held
Idle : Status Idle
Unknwn : Status Unknown (StatusIdleOther)
Pending : Status Pending
Wait : Status Wait
StgIn : Status Stage In
StgOut : Status Stage Out
RunDiff : StatusRunning - ClientRegistered (ClientGlideTotal)
IdleDiff : StatusIdle - ReqIdle (Requested Idle)
%RD : Percent RunDiff over Running (RunDiff/StatusRunning)
-------------------------------------

analyze_frontends

This tool, found in factory/tools/analyze_frontends, lists many statistics useful for factory administrators. It shows the number of registered, claimed, and unmatched slots, and can list these per frontend or per entry.

The command-line arguments are listed below:
--source ABSPATH    Required. Unless you are running this in the factory directory, you must specify the absolute path to the factory directory.
-x                  Interval to look at (in hours). Valid options are 2, 24, or 168.
-p                  Look at all the above intervals.
-f FRONTEND         Filter by frontend. You can omit "frontend_" at the beginning of the name.
-z                  Zero suppression (don't show entries with 0s across all attributes).
-m                  Emphasize frontend data, no entries shown. By default, slots are shown. You can use -ms to show seconds or -mh to show hours.
-msort ATTRIBUTE    Frontend mode, sort by attribute (strt, fval, 0job, val, idle, wst, badp, waste, time, total).
-h                  Show usage message.

Example usage:

./analyze_frontends --source /opt/wmsfactory/glidein_wmsfactory_instance -x 168

Status Attributes (Clients) analysis for All Entries - 01-11-2011_15:28:06

----------------------------------------
Past 168.0 hours:

Registered:      47.8K (  13.3 hours -   0.1 slots)
Claimed:         31.8K (   8.8 hours -   0.1 slots - 66%)
Unmatched :      14.6K (   4.1 hours -   0.0 slots - 30%)

Requested Idle:  25.5K (   7.1 hours -   0.0 slots)

Jobs Running:    31.7K (   8.8 hours -   0.1 slots)
Jobs Run Here:     0.0 (   0.0 hours -   0.0 slots)
Jobs Idle:       40.0K (  11.1 hours -   0.1 slots)

RunDiff (Running-ClientRegistered):  23.1K (   6.4 hours -   0.0 slots - 32%)


---------------------------------------
---------------------------------------
Per Entry (all frontends) stats for the past 168 hours.

                                            Regd  Claimd Unmtchd | ReqIdle |  JobRun JobHere JobIdle | RunDiff   %UM   %RD

ss_ITB_GRATIA_TEST_1                       47.8K   31.8K   14.6K |   25.5K |   31.7K     0.0   40.0K |   23.1K   30%   32%


---------------------------------------
---------------------------------------
Per Entry (per frontend) stats for the past 168 hours.

frontend_vofrontend_service_frontend        Regd  Claimd Unmtchd | ReqIdle |  JobRun JobHere JobIdle | RunDiff   %UM   %RD

ss_ITB_GRATIA_TEST_1                       47.8K   31.8K   14.6K |   25.5K |   31.7K     0.0   40.0K |   23.1K   30%   32%

-----------------------------------
LEGEND:

K = x   1,000
M = x 1,000,000

Regd - Registered (ClientGlideTotal)
Claimd - Claimed (ClientGlideRunning)
Unmtchd - Unmatched (ClientGlideIdle)
ReqIdle - Requested Idle
JobRun - Client Jobs Running
JobHere - Client Jobs Run Here
JobIdle - Client Jobs Idle
RunDiff - StatusRunning - ClientRegistered (ClientGlideTotal)
%UM - Percent Unmatched (Unmatched/Registered)
%RD - Percent RunDiff over Running (RunDiff/StatusRunning)
-------------------------------------
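
As with the other tools, the documented options can be combined; for example, to report all three intervals at once while suppressing all-zero entries (the factory path is a placeholder), one could run

./analyze_frontends --source /opt/wmsfactory/glidein_wmsfactory_instance -p -z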