Glidein Factory monitoring
Monitoring is an essential part of any service. It is needed both to maintain the health of the system and to tune the system for anticipated growth. The various ways you can monitor a Glidein Factory are described below.
Log files
The Factory Daemon, the Entry Daemons, Condor-G and the Glideins all write extensive log files. The logs are kept for a week and then deleted.
Log file locations in glideinWMS v2.3.x and earlier:
The Glidein Factory Daemon log files are located in
<glidein directory>/log/factory_info.<date>.log
<glidein directory>/log/factory_err.<date>.log
Each Entry Daemon has its log files in
<glidein directory>/entry_<entry name>/log/factory_info.<date>.log
<glidein directory>/entry_<entry name>/log/factory_err.<date>.log
For each client an Entry Daemon is serving, one Condor-G job log is used
<glidein directory>/entry_<entry name>/log/condor_activity_<date>_<client-name>.log
Each Glidein also writes a couple of log files, which are transferred back to the factory node after the glidein terminates. The log files are named:
<glidein directory>/entry_<entry name>/log/job.<condor-g job nr>.out
<glidein directory>/entry_<entry name>/log/job.<condor-g job nr>.err
The Glidein .out files are readable using any text editor, while the .err files contain the compressed logs of the condor daemons.
Use the following commands to extract that information in simple text format:
glideinWMS/factory/tools/cat_MasterLog.py <err_fname>
glideinWMS/factory/tools/cat_StartdLog.py <err_fname>
glideinWMS/factory/tools/cat_StarterLog.py <err_fname>
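The three extraction commands above can be scripted when many .err files need to be inspected. The following is a minimal sketch, assuming the tools are invoked exactly as shown (one .err file name as the only argument) and write the decoded log to standard output; the function names and the default tools_dir are illustrative, not part of glideinWMS.

```python
import subprocess

# Tool names taken from the commands listed above.
TOOLS = ("cat_MasterLog.py", "cat_StartdLog.py", "cat_StarterLog.py")


def extract_commands(err_fname, tools_dir="glideinWMS/factory/tools"):
    """Build one argv list per extraction tool for the given .err file."""
    return [[f"{tools_dir}/{tool}", str(err_fname)] for tool in TOOLS]


def extract_all(err_fname, tools_dir="glideinWMS/factory/tools"):
    """Run every tool and return {tool path: extracted text}.

    Assumption: each tool prints the decoded log to stdout and exits
    with status 0 on success.
    """
    logs = {}
    for cmd in extract_commands(err_fname, tools_dir):
        done = subprocess.run(cmd, capture_output=True, text=True, check=True)
        logs[cmd[0]] = done.stdout
    return logs
```

Calling extract_commands("job.123.err") only builds the command lines, which is useful for a dry run before executing anything on the factory node.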
Note: If you need Condor log files from a still running glidein, use the following Condor command
<condor dir>/sbin/condor_fetchlog -pool <pool collector> <glidein slot name> -startd MASTER|STARTD|STARTER
The Entry Daemons also summarize the information about completed glideins into
<glidein directory>/entry_<entry name>/log/completed_jobs_<date>.log
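The v2.3.x naming scheme described above is regular enough to compute. The sketch below builds the per-entry log paths for one day; it assumes the date stamp uses the YYYYMMDD form in every file name, and the function name entry_log_paths is hypothetical.

```python
def entry_log_paths(glidein_dir, entry_name, date):
    """Return the per-entry log file paths for one date stamp.

    Assumption: `date` is a YYYYMMDD string and the same stamp format
    is used by all three files.
    """
    log_dir = f"{glidein_dir}/entry_{entry_name}/log"
    return {
        "info": f"{log_dir}/factory_info.{date}.log",
        "err": f"{log_dir}/factory_err.{date}.log",
        "completed": f"{log_dir}/completed_jobs_{date}.log",
    }


# Example values only; adjust to the local installation.
paths = entry_log_paths("/home/gfactory/glidein_submit/glidein_v1_0",
                        "SanDiego", "20110111")
for kind, path in paths.items():
    print(kind, path)
```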
Log file locations in glideinWMS v2.4 and later:
With the introduction of privilege separation in glideinWMS, the location of the log files has changed, although a link to the log directory is still maintained from the <glidein directory>. The location of the log files is controlled through the configuration. glideinWMS uses condor_switchboard to control access to the log directories, which makes the deployment more secure.
Glidein factory entry Web monitoring
You can either monitor the factory as a whole, or just a single entry point.
The factory monitoring is located at a URL like the one below
http://gfactory1.my.org/glidefactory/monitor/glidein_v1_0/
Moreover, each entry point has its own history on the Web.
Assuming you have a SanDiego entry, it can be monitored at
http://gfactory1.my.org/glidefactory/monitor/glidein_v1_0/entry_SanDiego/
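The two URLs above differ only by the entry_<entry name>/ suffix, so a monitoring script can derive one from the other. This is a small sketch under that assumption; the helper name monitor_url is hypothetical and the host is the example host used above.

```python
def monitor_url(factory_url, entry_name=None):
    """Return the factory monitoring URL, or the entry URL if a name is given."""
    base = factory_url.rstrip("/") + "/"
    if entry_name is None:
        return base
    return f"{base}entry_{entry_name}/"


FACTORY_URL = "http://gfactory1.my.org/glidefactory/monitor/glidein_v1_0/"
print(monitor_url(FACTORY_URL))
print(monitor_url(FACTORY_URL, "SanDiego"))
```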
Historical Web monitoring
The Entry Point Daemons will also create RRD databases and associated graphs for a period of up to one year. This way, one can easily monitor the evolution of the system.
Glidein factory monitoring via WMS tools
You can get the equivalent of the Web page snapshot by using
cd glideinWMS/tools/
python wmsXMLView.py
Glidein factory entry log files
The glidein factory writes two log files per entry point: factory_info.YYYYMMDD.log and factory_err.YYYYMMDD.log.
Assuming you have a SanDiego entry, the log files are in
/home/gfactory/glidein_submit/glidein_v1_0/entry_SanDiego/log
All errors are reported in the factory_err.YYYYMMDD.log file, while factory_info.YYYYMMDD.log contains entries about what the factory is doing.
Glidein output
Each glidein creates two files on exit: job.ID.out and job.ID.err.
Assuming you have a SanDiego entry, the log files are in
/home/gfactory/glidein_submit/glidein_v1_0/entry_SanDiego/log
Problems are usually reasonably easy to spot.
Glidein factory ClassAds in the WMS Collector
The glidein factory also advertises summary information in the WMS collector.
Use condor_status:
condor_status -any
and look for glidefactory and glidefactoryclient ads.
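Picking the factory ads out of a long condor_status -any listing can be automated. The sketch below filters the text output for lines that carry one of the two ad types as a standalone token; the sample listing is invented for illustration, and the exact column layout of the real output is an assumption.

```python
import subprocess

FACTORY_AD_TYPES = {"glidefactory", "glidefactoryclient"}


def factory_ad_lines(status_text):
    """Keep only the lines containing a factory ad type as its own token."""
    return [line for line in status_text.splitlines()
            if FACTORY_AD_TYPES & set(line.split())]


def query_collector():
    """Run `condor_status -any` and return its raw text output."""
    done = subprocess.run(["condor_status", "-any"],
                          capture_output=True, text=True, check=True)
    return done.stdout


# Hypothetical listing; real ad names will differ.
SAMPLE = """\
MyType             TargetType Name
glidefactory       None       SanDiego@v1_0@factory1
glidefactoryclient None       SanDiego@v1_0@factory1@frontend1
Scheduler          None       schedd1.my.org
"""

for line in factory_ad_lines(SAMPLE):
    print(line)
```

On a factory node, factory_ad_lines(query_collector()) applies the same filter to the live collector.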
Looking at ClassAds
As explained in the Data exchange overview, the Entry Point Daemons expose a lot of monitoring information in the ClassAds sent to the WMS collector. While this may not be the most user friendly interface, most of the monitoring information you'll ever need is present there.
On top of the Condor-provided tools, the factory provides two tools to look at the ClassAds: the first returns human-readable but limited text, while the other provides complete XML-formatted output.
glideinWMS/tools/wmsTxtView.py [Entries|Sites|Gatekeepers]
glideinWMS/tools/wmsXMLView.py
analyze_entries
This tool, found in factory/tools/analyze_entries can list many statistics useful for factory administrators. It can list the total time used by glideins, the number of glideins submitted, the number of jobs run, as well as the efficiency of glideins.
This utility can sort by frontend, entry or by attributes such as the most time used.
The command-line arguments are shown below:
--source ABSPATH | Required. Unless you are running this in the factory directory, you must either specify the absolute path to the factory directory, or |
-x | Interval to look at (in hours). Valid options are 2, 24, or 168. |
-p | Look at all the above intervals. |
-f FRONTEND | Filter by frontend. You can omit "frontend_" at the beginning of the name. |
-m | Emphasize frontend data, no entries shown. By default, slots are shown. You can use -ms to show seconds or -mh to show hours. |
-msort ATTRIBUTE | Frontend mode, sort by attribute (strt, fval, 0job, val, idle, wst, badp, waste, time, total) |
-h | Show usage message |
Below is an example of usage:
./analyze_entries --source /opt/wmsfactory/glidein_wmsfactory_instance -x 168

Glidein log analysis for All Entries - 01-11-2011_13:42:28
----------------------------------------
Past 168.0 hours:
Total Glideins: 78
Total Jobs: 57 (Average jobs/glidein: 0.73)
Total time: 66.6K ( 18.5 hours - 0.1 slots)
Total time used: 29.5K ( 8.2 hours - 0.0 slots - 44%)
Total time validating: 4.4K ( 1.2 hours - 0.0 slots - 6%)
Total time idle: 32.4K ( 9.0 hours - 0.1 slots - 48%)
Total time wasted: 37.1K ( 10.3 hours - 0.1 slots - 55%)
Time used/time wasted: 0.8
Time efficiency: 0.44
---------------------------------------
---------------------------------------
Per Entry (all frontends) stats for the past 168 hours.
                      strt fval 0job |  val idle  wst badp | waste time   total
ss_ITB_GRATIA_TEST_1    2%   4%  37% |   6%  48%  55%  92% |    10   18 |    78
---------------------------------------
---------------------------------------
Per Entry (per frontend) stats for the past 168 hours.
----------------------------------------
frontend_vofrontend_service_frontend:
Glideins: 78 - 100.0% of total
Jobs: 57 (Average jobs/glidein: 0.73)
time: 66.6K ( 18.5 hours - 0.1 slots)
time used: 29.5K ( 8.2 hours - 0.0 slots - 44%)
time validating: 4.4K ( 1.2 hours - 0.0 slots - 6%)
time idle: 32.4K ( 9.0 hours - 0.1 slots - 48%)
time wasted: 37.1K ( 10.3 hours - 0.1 slots - 55%)
Time used/time wasted: 0.8
Time efficiency: 0.44
                      strt fval 0job |  val idle  wst badp | waste time   total
ss_ITB_GRATIA_TEST_1    2%   4%  37% |   6%  48%  55%  92% |    10   18 |    78
-----------------------------------
LEGEND:
K - kilseconds (*1,000 seconds)
M - megaseconds (*1,000,000 seconds)
strt - % of jobs where condor failed to start
fval - % of glideins that failed to validate (hit 1000s limit)
0job - % 0 jobs/glidein
----------
val - % of time used for validation
idle - % of time spend idle
wst - % of time wasted (Lasted - JobsLasted)
badp - % of badput (Lasted - JobsGoodput)
----------
waste - wallclock time wasted (hours) (Lasted - JobsLasted)
time - total wallclock time (hours) (Lasted)
total - total number of glideins
-------------------------------------
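The summary ratios in the example output can be checked by hand. The arithmetic below reproduces them from the printed totals; the formulas are inferred from the sample numbers and the legend (wasted = Lasted - JobsLasted), not taken from the tool's source, so treat them as an assumption.

```python
# Totals from the example output above, in kiloseconds.
total_time = 66.6
time_used = 29.5

# Wasted time as total minus used, matching the legend's
# "Lasted - JobsLasted" under the assumption that used == JobsLasted.
time_wasted = round(total_time - time_used, 1)

# Ratios that appear to correspond to the last two summary lines.
used_over_wasted = round(time_used / time_wasted, 1)
efficiency = round(time_used / total_time, 2)

print("time wasted (K):", time_wasted)
print("used/wasted:", used_over_wasted)
print("efficiency:", efficiency)
```

With these inputs the three values come out as 37.1, 0.8 and 0.44, which agree with the "Total time wasted", "Time used/time wasted" and "Time efficiency" lines in the sample.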
analyze_queues
This tool, found in factory/tools/analyze_queues can list many statistics useful for factory administrators. It will show the number of slots in running, held, and idle status. It can display this information per entry or per frontend.
--source ABSPATH | Required. Unless you are running this in the factory directory, you must either specify the absolute path to the factory directory, or |
-x | Interval to look at (in hours). Valid options are 2, 24, or 168. |
-p | Look at all the above intervals. |
-f FRONTEND | Filter by frontend. You can omit "frontend_" at the beginning of the name. |
-z | Zero suppression (don't show entries with 0s across all attributes) |
-m | Emphasize frontend data, no entries shown. By default, slots are shown. You can use -ms to show seconds or -mh to show hours. |
-msort ATTRIBUTE | Frontend mode, sort by attribute (strt, fval, 0job, val, idle, wst, badp, waste, time, total) |
-h | Show usage message |
Example usage:
./analyze_queues --source /opt/wmsfactory/glidein_wmsfactory_instance -x 168

Status Attributes analysis for All Entries - 01-11-2011_15:10:22
----------------------------------------
Past 168.0 hours:
Status Running: 71.0K ( 19.7 hours - 0.1 slots - 62%)
Status Held: 0.0 ( 0.0 hours - 0.0 slots - 0%)
Status Idle: 19.4K ( 5.4 hours - 0.0 slots - 17%)
Status Unknown: 0.0 ( 0.0 hours - 0.0 slots - 0%)
Status Pending: 19.4K ( 5.4 hours - 0.0 slots - 17%)
Status Wait: 0.0 ( 0.0 hours - 0.0 slots - 0%)
Status Stage In: 0.0 ( 0.0 hours - 0.0 slots - 0%)
Status Stage Out: 3.9K ( 1.1 hours - 0.0 slots - 3%)
RunDiff (Running-ClientRegistered): 23.1K ( 6.4 hours - 0.0 slots - 32%)
IdleDiff (Idle - RequestedIdle): -6.1K ( -1.7 hours - -0.0 slots)
---------------------------------------
---------------------------------------
Per Entry (all frontends) stats for the past 168 hours.
                        Run  Held  Idle Unknwn | Pending  Wait StgIn StgOut | RunDiff IdleDiff  %RD
ss_ITB_GRATIA_TEST_1  71.0K   0.0 19.4K    0.0 |   19.4K   0.0   0.0   3.9K |   23.1K    -6.1K  32%
---------------------------------------
---------------------------------------
Per Entry (per frontend) stats for the past 168 hours.
frontend_vofrontend_service_frontend
                        Run  Held  Idle Unknwn | Pending  Wait StgIn StgOut | RunDiff IdleDiff  %RD
ss_ITB_GRATIA_TEST_1  71.0K   0.0 19.4K    0.0 |   19.4K   0.0   0.0   3.9K |   23.1K    -6.1K  32%
-----------------------------------
LEGEND:
K = x 1,000
M = x 100,000
Run : Status Running
Held : Status Held
Idle : Status Idle
Unknwn : Status Unknown (StatusIdleOther)
Pending : Status Pending
Wait : Status Wait
StgIn : Status Stage In
StgOut : Status Stage Out
RunDiff : StatusRunning - ClientRegistered (ClientGlideTotal)
IdleDiff : StatusIdle - ReqIdle (Requested Idle)
%RD - Percent RunDiff over Running (RunDiff/StatusRunning)
-------------------------------------
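Each summary line in the example output reports the same quantity in three units. The sketch below shows how they appear to relate: the K-suffixed value is in seconds, hours are seconds divided by 3600, and "slots" is hours divided by the length of the interval, i.e. the average number of slots occupied. This relationship is inferred from the sample numbers only; the helper names are hypothetical and only the K suffix is handled.

```python
def parse_k(value):
    """Convert a value such as '71.0K' to a plain number (K = x 1,000)."""
    if value.endswith("K"):
        return float(value[:-1]) * 1000
    return float(value)


def to_hours_and_slots(value, interval_hours):
    """Return (hours, average slots) for a seconds value over an interval."""
    hours = parse_k(value) / 3600
    return round(hours, 1), round(hours / interval_hours, 1)


# "Status Running: 71.0K" over the 168-hour interval of the example.
print(to_hours_and_slots("71.0K", 168))
# "Status Idle: 19.4K" over the same interval.
print(to_hours_and_slots("19.4K", 168))
```

The results, 19.7 hours at 0.1 slots and 5.4 hours at 0.0 slots, agree with the corresponding lines of the sample output.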
analyze_frontends
This tool, found in factory/tools/analyze_frontends can list many statistics useful for factory administrators. This tool will show the number of registered, claimed, and unmatched slots. It can list these per frontend or per entry.
The command-line arguments are listed below:
--source ABSPATH | Required. Unless you are running this in the factory directory, you must either specify the absolute path to the factory directory, or |
-x | Interval to look at (in hours). Valid options are 2, 24, or 168. |
-p | Look at all the above intervals. |
-f FRONTEND | Filter by frontend. You can omit "frontend_" at the beginning of the name. |
-z | Zero suppression (don't show entries with 0s across all attributes) |
-m | Emphasize frontend data, no entries shown. By default, slots are shown. You can use -ms to show seconds or -mh to show hours. |
-msort ATTRIBUTE | Frontend mode, sort by attribute (strt, fval, 0job, val, idle, wst, badp, waste, time, total) |
-h | Show usage message |
Example usage:
./analyze_frontends --source /opt/wmsfactory/glidein_wmsfactory_instance -x 168

Status Attributes (Clients) analysis for All Entries - 01-11-2011_15:28:06
----------------------------------------
Past 168.0 hours:
Registered: 47.8K ( 13.3 hours - 0.1 slots)
Claimed: 31.8K ( 8.8 hours - 0.1 slots - 66%)
Unmatched : 14.6K ( 4.1 hours - 0.0 slots - 30%)
Requested Idle: 25.5K ( 7.1 hours - 0.0 slots)
Jobs Running: 31.7K ( 8.8 hours - 0.1 slots)
Jobs Run Here: 0.0 ( 0.0 hours - 0.0 slots)
Jobs Idle: 40.0K ( 11.1 hours - 0.1 slots)
RunDiff (Running-ClientRegistered): 23.1K ( 6.4 hours - 0.0 slots - 32%)
---------------------------------------
---------------------------------------
Per Entry (all frontends) stats for the past 168 hours.
                       Regd Claimd Unmtchd | ReqIdle | JobRun JobHere JobIdle | RunDiff  %UM  %RD
ss_ITB_GRATIA_TEST_1  47.8K  31.8K   14.6K |   25.5K |  31.7K     0.0   40.0K |   23.1K  30%  32%
---------------------------------------
---------------------------------------
Per Entry (per frontend) stats for the past 168 hours.
frontend_vofrontend_service_frontend
                       Regd Claimd Unmtchd | ReqIdle | JobRun JobHere JobIdle | RunDiff  %UM  %RD
ss_ITB_GRATIA_TEST_1  47.8K  31.8K   14.6K |   25.5K |  31.7K     0.0   40.0K |   23.1K  30%  32%
-----------------------------------
LEGEND:
K = x 1,000
M = x 100,000
Regd - Registered (ClientGlideTotal)
Claimd - Claimed (ClientGlideRunning)
Unmtchd - Unmatched (ClientGlideIdle)
ReqIdle - Requested Idle
JobRun - Client Jobs Running
JobHere - Client Jobs Run Here
JobIdle - Client Jobs Idle
RunDiff - StatusRunning - ClientRegistered (ClientGlideTotal)
%UM - Percent Unmatched (Unmatched/Registered)
%RD - Percent RunDiff over Running (RunDiff/StatusRunning)
-------------------------------------
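The percentage columns in the example output follow from the legend's ratios. The sketch below recomputes the claimed and unmatched percentages from the printed totals. Truncating to a whole percent (rather than rounding) is what reproduces the printed figures for this sample; that is an observation about these numbers, not a confirmed detail of the tool.

```python
def percent(part, whole):
    """Whole-number percentage of part over whole, truncated (assumption)."""
    if whole == 0:
        return 0
    return int(100 * part / whole)


# Totals from the example output above, in kiloseconds.
registered = 47.8
claimed = 31.8
unmatched = 14.6

print("claimed %:", percent(claimed, registered))
print("%UM:", percent(unmatched, registered))
```

For the sample values this gives 66 and 30, matching the "Claimed" line and the %UM column.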