GlideinWMS The Glidein-based Workflow Management System

Glidein Frontend

Troubleshooting

Verifying all glideinWMS services are communicating.

One way to verify that all the glideinWMS services are communicating correctly is to run the following on the User Collector node:

> source CONDOR_LOCATION/condor.sh
> condor_status -any -pool NODE:PORT   (NODE:PORT of the User Collector)

MyType          TargetType   Name
Scheduler       None         cms-xen21.fnal.gov
DaemonMaster    None         cms-xen21.fnal.gov
Negotiator      None         cms-xen21.fnal.gov
Collector       None         frontend_service@cms-xen21.fnal
glideresource   None         ress_ITB_INSTALL_TEST@same_node
Scheduler       None         schedd_jobs2@cms-xen21.fnal.gov

1. The DaemonMaster, Negotiator and Collector types indicate the User Collector services are running.
2. The number of Scheduler types should equal the number of schedds you specified for the Submit service.
3. The glideresource type indicates the VO Frontend is talking to the WMS Collector/Factory and the User Collector. NOTE: At least one entry point must be configured in the Factory for this to show.
One exception is the communication between the Frontend and the Submit/Schedd services: this is one link that this check does not cover.
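One way to exercise that remaining link is to query a schedd directly with condor_q. A minimal sketch, guarded so it degrades gracefully; the schedd name is taken from the example output above and is an assumption (substitute your own):

```shell
# Probe the Submit/Schedd service directly; the schedd name below is the one
# from the example condor_status output and is only a placeholder.
SCHEDD_NAME=schedd_jobs2@cms-xen21.fnal.gov
if command -v condor_q >/dev/null 2>&1; then
    condor_q -name "$SCHEDD_NAME" -totals
else
    echo "condor_q not on PATH; source condor.sh first"
fi
```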

General Issues

This section contains tips and troubles relevant to all phases of a job's execution. Also see user tutorials with example job submissions for VO Frontends.

Authentication Issues

Many glideinWMS issues are caused by authentication. Make sure that your proxy and certificate are correct. Each process needs a proxy/cert that is owned by that user.
Also, make sure that this cert has authorization to run a job by running a command such as (all on one line):

X509_USER_CERT=/tmp/x509up_u<UID> globus-job-run -a -r <gatekeeper in factory config>
Note that /tmp/x509up_u<UID> is the typical location for grid proxy certificates, but use the proper location if your server certificate is stored elsewhere.
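A quick way to confirm the proxy file exists before testing authorization; the default path below follows the conventional per-user proxy location and is an assumption if your setup differs:

```shell
# Locate the proxy: honor X509_USER_PROXY if set, otherwise fall back to the
# conventional /tmp/x509up_u<UID> path (assumption).
PROXY="${X509_USER_PROXY:-/tmp/x509up_u$(id -u)}"
if [ -r "$PROXY" ]; then
    echo "proxy found at $PROXY"
else
    echo "no proxy at $PROXY; run voms-proxy-init (or grid-proxy-init) first"
fi
```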

Wrong condor.sh sourced

Always source the correct condor.sh before running any commands. Many problems are caused by using the wrong path/environment (for instance, sourcing the user pool condor.sh and then running WMS Collector commands). Run "which condor_q" to check that your path is correct.

Note: If you are using VDT and source its setup.sh (e.g. for voms-proxy-init), this may change your path/environment, and you may need to source condor.sh again.
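A sketch of that path check, assuming CONDOR_LOCATION points at your Condor install root (the /opt/condor default below is only a placeholder):

```shell
# Verify that condor_q on PATH actually comes from the intended install.
CONDOR_LOCATION=${CONDOR_LOCATION:-/opt/condor}   # placeholder install root
CONDOR_Q_PATH=$(command -v condor_q || true)
case "$CONDOR_Q_PATH" in
    "")                   echo "condor_q not on PATH; source $CONDOR_LOCATION/condor.sh first" ;;
    "$CONDOR_LOCATION"/*) echo "OK: condor_q comes from $CONDOR_LOCATION" ;;
    *)                    echo "WARNING: condor_q is $CONDOR_Q_PATH, not under $CONDOR_LOCATION" ;;
esac
```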

Using the correct CONDOR_CONFIG for the Frontend

The Frontend is the only Glidein service that uses a modified version of the CONDOR_CONFIG file(s). If you are troubleshooting problems using Condor commands (e.g. condor_status), you may not see the same results that the Frontend service does.

The Frontend uses GLIDEINWMS_VOFRONTEND_HOME/frontend.condor_config

So, in order to simulate the Frontend environment, you should set the environment as follows before executing the Condor commands (bash syntax shown):

> source CONDOR_LOCATION/condor.sh
> export CONDOR_CONFIG=GLIDEINWMS_VOFRONTEND_HOME/frontend.condor_config
> export _CONDOR_CERTIFICATE_MAPFILE=GLIDEINWMS_VOFRONTEND_HOME/GROUP_NAME/group.mapfile
Note: You have to be aware of the Frontend group (GROUP_NAME) you are troubleshooting.

Note: The frontend.condor_config is created every time you do a frontend reconfig. If you change the Condor condor_config, you need to reconfig the Frontend for it to see the changes and for them to take effect.

Failed to talk to factory_pool (WMS Collector)

This is a failure to communicate with the WMS Collector and Factory services.

Symptoms: Many
Useful files:

In the GLIDEINWMS_VOFRONTEND_HOME/log/*/group_main/frontend.*.info.log:
    [2011-09-21T13:54:04-05:00 3859] WARNING: Failed to talk to factory_pool cms-xen21.fnal.gov:9618. See debug log for more details.
In the GLIDEINWMS_VOFRONTEND_HOME/log/*/group_main/frontend.*.debug.log:
    [2011-09-21T13:54:04-05:00 3859] Failed to talk to factory_pool cms-xen21.fnal.gov:9618:
    Error running '/usr/local2/glideins/git-same-condor-v2plus.ini/condor-frontend/bin/condor_status
    ... (more traceback)
    code 1:Error: communication error
    CEDAR:6001:Failed to connect to <131.225.206.78:9618> Error: Couldn't contact the condor_collector on cms-xen21.fnal.gov (<131.225.206.78:9618>).

Debugging Steps:

  1. Verify the WMS Collector is running.
  2. Verify the IP/NODE is correct for the WMS Collector.
    If it is incorrect, the frontend.xml should be corrected and a frontend reconfig executed.
    <frontend ....>
       <match ....>
          <factory ....>
             <collectors>
                <collector node="cms-xen21.fnal.gov:9618"/>  <!-- This one: the WMS Collector -->
             </collectors>
          </factory>
       </match>
       <collectors>
          <collector ... node="cms-xen21.fnal.gov:9640"/>  <!-- Not this one: the User Collector -->
       </collectors>
    </frontend>
  3. If you have access to the WMS Collector, check the ALLOW/DENY configuration in its condor_config.
  4. Another reason for failure is a GSI authentication (permission denied) error occurring on the WMS Collector.
    If you have access to the Condor log files for that service, check the MasterLog and CollectorLog for authentication errors. The VO Frontend's proxy (proxy_DN) must be in the CONDOR_LOCATION/certs/condor_mapfile of the WMS Collector to allow classads to be published.
    <frontend .... >
    <security classad_proxy="VOFRONTEND PROXY" proxy_DN="VOFRONTEND PROXY ISSUER"... />
    </frontend>
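Before chasing GSI errors, it can also help to confirm the collector port is reachable at all. A minimal sketch; the host and port below are the example values from the log above and are assumptions (substitute your own):

```shell
# Probe TCP reachability of the WMS Collector; host/port are placeholders
# taken from the example log messages above.
COLLECTOR_HOST=cms-xen21.fnal.gov
COLLECTOR_PORT=9618
if timeout 5 bash -c "exec 3<>/dev/tcp/$COLLECTOR_HOST/$COLLECTOR_PORT" 2>/dev/null; then
    echo "TCP connection to $COLLECTOR_HOST:$COLLECTOR_PORT succeeded"
else
    echo "cannot reach $COLLECTOR_HOST:$COLLECTOR_PORT (daemon down? firewall?)"
fi
```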

Failure to talk to collector (User Collector)

This is a failure to communicate with the User Collector service. This does not affect the ability to submit and run jobs.

Symptoms: Many
Useful files:

In the GLIDEINWMS_VOFRONTEND_HOME/log/*/group_main/frontend.*.info.log:
    [2011-09-20T13:51:25-05:00 4619] WARNING: Failed to talk to collector. See debug log for more details.
    [2011-09-20T13:51:25-05:00 2994] WARNING: Exception in jobs. See debug log for more details.
In the GLIDEINWMS_VOFRONTEND_HOME/log/*/group_main/frontend.*.debug.log:
    [2011-09-20T13:51:25-05:00 4619] Failed to talk to collector:
    Error running '/usr/local2/glideins/git-same-condor-v2plus.ini/condor-frontend/bin/condor_status
    ... (more traceback)
    code 1:Error: communication error
    CEDAR:6001:Failed to connect to <131.225.206.78:9640>

Debugging Steps:

  1. Verify the User Collector is running.
  2. Verify the IP/NODE is correct for the User Collector.
    If it is incorrect, the frontend.xml should be corrected and a frontend reconfig executed.
    <frontend ....>
       <match ....>
          <factory ....>
             <collectors>
                <collector ... node="cms-xen21.fnal.gov:9618"/>  <!-- Not this one: the WMS Collector -->
             </collectors>
          </factory>
       </match>
       <collectors>
          <collector node="cms-xen21.fnal.gov:9640" ../>  <!-- This one: the User Collector -->
       </collectors>
    </frontend>
  3. If you have access to the User Collector, check the ALLOW/DENY configuration in its condor_config.
  4. Another reason for failure is a GSI authentication (permission denied) error occurring on the User Collector. If you have access to the Condor log files for that service, check the MasterLog and CollectorLog for authentication errors. The VO Frontend's proxy (proxy_DN) must be in the CONDOR_LOCATION/certs/condor_mapfile of both collectors to allow classads to be published.
    <frontend .... >
    <security classad_proxy="VOFRONTEND PROXY" proxy_DN="VOFRONTEND PROXY ISSUER"... />
    </frontend>

Problems submitting your job

Symptoms: Error submitting user job
Useful files: GLIDEINWMS_USERSCHEDD_HOME/condor_local/logs/SchedLog
Debugging Steps:

If you encounter errors submitting your job using condor_submit, the error messages printed on the screen will be useful in identifying potential problems. Occasionally, you can find additional information in the Condor schedd logs.

Always make sure that you have sourced the condor.sh and that the path and environment are correct.

source $GLIDEINWMS_USERSCHEDD_HOME/condor.sh

Depending on the actual Condor scheduler, you can find the scheduler logfile, SchedLog, in one of the subdirectories of the directory listed by "condor_config_val LOCAL_DIR".
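That lookup can be sketched as follows; the fallback message covers the case where condor.sh has not been sourced yet:

```shell
# Find the SchedLog under the directory Condor reports as LOCAL_DIR.
LOCAL_DIR=$(condor_config_val LOCAL_DIR 2>/dev/null || true)
if [ -n "$LOCAL_DIR" ]; then
    find "$LOCAL_DIR" -name SchedLog 2>/dev/null
else
    echo "condor_config_val not available; source condor.sh first"
fi
```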

If you are installing all services on one machine (not recommended, but sometimes useful for testing), make sure that the User Collector and WMS Collector are on two different ports (such as 9618 and 8618). Run "ps -ef" to see if the processes have started (there should be multiple condor_master, condor_schedd and condor_procd processes on each machine). Make sure they are running as the proper users (the user schedd should probably be run as root; the WMS Collector should be run as root if you want privilege separation).
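The process check can be sketched as:

```shell
# Count the condor daemons currently running, then show who owns them.
NUM_DAEMONS=$(ps -eo comm= | grep -c '^condor_' || true)
echo "condor daemons running: $NUM_DAEMONS"
ps -eo user,pid,comm | grep 'condor_' || true
```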

Also refer to the Collector install for verification steps.

User Jobs Stay Idle

Symptoms: User job stays idle and there are no glideins submitted that correspond to your job.

This step involves the interaction of the VO Frontend and WMS Factory. Hence, there are two separate places to look to see why no glideins are being created. See the Factory Troubleshooting page if none of these suggestions help.

Frontend unable to map your job to any entry point

Symptoms: User job stays idle and there is no information in the frontend logs about glideins required to run your job.
Useful files: GLIDEINWMS_VOFRONTEND_HOME/log/*
GLIDEINWMS_VOFRONTEND_HOME/group_<GROUP_NAME>/log/*
Debugging Steps:

Check if the VO frontend is running. If not start it.

Glidein Frontend processes periodically query the user schedd for user jobs. Once you have submitted a job, the VO Frontend should notice it during its next query cycle. Once the Frontend identifies potential entry points that can run your job, it reflects this information in the glideclient classad in the WMS Collector for the corresponding entry point. You can find this information by running "condor_status -any -pool <wms collector fqdn>".
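A sketch of that query, restricted to the glideclient ads; the collector address below is a placeholder (assumption), and the command is guarded so it degrades gracefully if condor.sh has not been sourced:

```shell
# Look for the glideclient ads the Frontend publishes in the WMS Collector.
WMS_COLLECTOR=wms-collector.example.com:9618   # placeholder address
if command -v condor_status >/dev/null 2>&1; then
    condor_status -any -pool "$WMS_COLLECTOR" -constraint 'MyType == "glideclient"'
else
    echo "condor_status not on PATH; source condor.sh first"
fi
```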

Check for error messages in the logs located in GLIDEINWMS_VOFRONTEND_HOME/log. Assuming that you have named the Frontend's main group "main", also check the log files in GLIDEINWMS_VOFRONTEND_HOME/group_main/log.

[2009-12-07T15:16:25-05:00 12398] For ress_GRATIA_TEST_31@v1_0@mySites-cmssrv97@cmssrv97.fnal.gov Idle 19 (effective 19 old 19) Running 0 (max 10000)
[2009-12-07T15:16:25-05:00 12398] Glideins for ress_GRATIA_TEST_31@v1_0@mySites-cmssrv97@cmssrv97.fnal.gov Total 0 Idle 0 Running 0
[2009-12-07T15:16:25-05:00 12398] Advertize ress_GRATIA_TEST_31@v1_0@mySites-cmssrv97@cmssrv97.fnal.gov Request idle 11 max_run 22
You should see something like the above in the logs corresponding to your job. If the Frontend does not identify any entry that can run your job, then either the desired entry is not configured in the Glidein Factory or the requirements you have expressed in your job are not correct.

Also, check the security classad to make sure the proxy/cert for the frontend is correct. It should be chmod 600 and owned by the frontend user.
If using voms, try to query the information to verify:

X509_USER_CERT=<vofrontend_proxy_location> voms-proxy-info

The symptoms here indicate a break in communication between the VO Frontend and the Factory; in this case, the problem may also lie on the Factory side. See the Factory Troubleshooting guide for more details.

Found an untrusted factory

Symptoms: You will receive an error similar to:
info log:
[2010-09-29T09:07:24-05:00 26824] WARNING: Found an untrusted factory ress_ITB_GRATIA_TEST_2@v2_4_3@factory_service at cms-xen21.fnal.gov; ignoring.
debug log:
[2010-09-29T09:07:24-05:00 26824] Found an untrusted factory ress_ITB_GRATIA_TEST_2@v2_4_3@factory_service at cms-xen21.fnal.gov; identity mismatch ' weigand@cms-xen21.fnal.gov'!='factory@cms-xen21.fnal.gov '
Debugging Steps:

This happens if a factory is found that does not match the expected identity in the frontend config. The expected identity is set here:

<frontend ...> <collector ... factory_identity="..."/> </frontend>

The frontend config's security element security_name attribute must match what Condor is mapping the factory to. Verify that this value is correct. Also, see the color coded guide for a full list of settings that must match.
Alternatively, the wms factory itself may be using the wrong identity. For instance if you start the factory as the wrong user or if your condor map file is not set up correctly, then the factory may be running with the wrong identity.
You can find the authenticated identity by:

condor_status [-any] -pool <WMSCollector_node:port> -long | grep -i AuthenticatedIdentity | sort -u