glideinwms.factory package
Subpackages
- glideinwms.factory.tools package
- Subpackages
- Submodules
- glideinwms.factory.tools.OSG_autoconf module
- glideinwms.factory.tools.cat_MasterLog module
- glideinwms.factory.tools.cat_StartdHistoryLog module
- glideinwms.factory.tools.cat_StartdLog module
- glideinwms.factory.tools.cat_StarterLog module
- glideinwms.factory.tools.cat_XMLResult module
- glideinwms.factory.tools.cat_logs module
- glideinwms.factory.tools.cat_named_log module
- glideinwms.factory.tools.find_StartdLogs module
- glideinwms.factory.tools.find_logs module
- glideinwms.factory.tools.get_tarballs module
- glideinwms.factory.tools.gfdiff module
- glideinwms.factory.tools.manual_glidein_submit module
- Module contents
Submodules
glideinwms.factory.checkFactory module
glideinwms.factory.glideFactory module
This is the main module of the glideinFactory.
- param $1 = submit_dir:
The glidein submit directory
- exception glideinwms.factory.glideFactory.HUPException
Bases:
Exception
Used to catch SIGHUP and trigger a reconfig
- glideinwms.factory.glideFactory.aggregate_stats(in_downtime)
Aggregate all the monitoring stats
- Parameters:
in_downtime (bool) – Entry downtime information
- Returns:
stats dictionary
- glideinwms.factory.glideFactory.entry_grouper(size, entries)
Group the entries into n smaller groups. KNOWN ISSUE: Needs improvement to do better grouping in certain cases. TODO: Migrate to itertools when only supporting python 2.6 and higher
- Parameters:
size (long) – Size of each subgroup
entries (list) – List of entries
- Returns:
List of grouped entries. Each group is a list
- Return type:
list
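A minimal sketch of the grouping, assuming plain fixed-size chunking in list order (the real function may distribute entries differently, as its KNOWN ISSUE note suggests). The name `entry_grouper_sketch` and the entry names are hypothetical.

```python
def entry_grouper_sketch(size, entries):
    """Split entries into consecutive groups of at most `size` items (hypothetical sketch)."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    # Step through the list in strides of `size`; the last group may be shorter
    return [entries[i:i + size] for i in range(0, len(entries), size)]


groups = entry_grouper_sketch(2, ["el8_osg", "el9_osg", "gpu_site", "cloud_a", "cloud_b"])
print(groups)  # [['el8_osg', 'el9_osg'], ['gpu_site', 'cloud_a'], ['cloud_b']]
```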
- glideinwms.factory.glideFactory.generate_log_tokens(startup_dir, glideinDescript)
Generate the JSON Web Tokens used to authenticate with the remote HTTP log server. Note: tokens are generated for disabled entries too.
- Parameters:
startup_dir (str|Path) – Path to the glideinsubmit directory
glideinDescript – Factory config’s glidein description object
- Returns:
None
- Raises:
IOError – If a file (key/token) cannot be opened, read, or written
- glideinwms.factory.glideFactory.hupsignal(signr, frame)
Signal handler. Raise HUPException when receiving SIGHUP. Used to trigger a reconfig and restart.
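A minimal sketch of the SIGHUP-to-exception pattern described above, assuming a single-threaded daemon. The demo sends SIGHUP to its own process instead of waiting for an operator; it is POSIX-only, and `signal.signal()` must be called from the main thread.

```python
import os
import signal
import time


class HUPException(Exception):
    """Raised from the SIGHUP handler so the main loop can reconfigure."""


def hupsignal(signr, frame):
    # Turn the asynchronous signal into an ordinary exception in the main thread
    raise HUPException(f"Received signal {signr}")


reconfig_requested = False
previous_handler = signal.signal(signal.SIGHUP, hupsignal)
try:
    os.kill(os.getpid(), signal.SIGHUP)  # simulate `kill -HUP <factory pid>`
    time.sleep(1)  # the handler runs here at the latest and interrupts the sleep
except HUPException:
    reconfig_requested = True  # a real daemon would reload its configuration here
finally:
    # Restore whatever handler was installed before the demo
    signal.signal(signal.SIGHUP, previous_handler if previous_handler is not None else signal.SIG_DFL)

print(reconfig_requested)  # True
```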
- glideinwms.factory.glideFactory.increase_process_limit(new_limit=10000)
Raise RLIMIT_NPROC to new_limit
- glideinwms.factory.glideFactory.is_crashing_often(startup_time, restart_interval, restart_attempts)
Check if the entry is crashing/dying often
- Parameters:
startup_time (long) – Startup time of the entry process in seconds
restart_interval (long) – Allowed restart interval in seconds
restart_attempts (long) – Number of allowed restart attempts in the interval
- Returns:
True if the entry process is crashing/dying often
- Return type:
bool
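A single startup time cannot by itself reveal repeated crashes, so the sketch below assumes the caller keeps a list of recent startup timestamps, oldest first. That list and the name `is_crashing_often_sketch` are assumptions, not the documented signature.

```python
def is_crashing_often_sketch(startup_times, restart_interval, restart_attempts):
    """Return True if `restart_attempts` startups happened within `restart_interval` seconds."""
    # Too few recorded startups to exceed the allowed number of attempts
    if restart_attempts <= 0 or len(startup_times) < restart_attempts:
        return False
    # Only the most recent `restart_attempts` startups matter
    window = startup_times[-restart_attempts:]
    return (window[-1] - window[0]) <= restart_interval


print(is_crashing_often_sketch([1000, 1010, 1020], restart_interval=60, restart_attempts=3))  # True
print(is_crashing_often_sketch([1000, 1500, 2000], restart_interval=60, restart_attempts=3))  # False
```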
- glideinwms.factory.glideFactory.is_file_old(filename, allowed_time)
Check if the file is older than the given time
- Parameters:
filename (str) – Full path to the file
allowed_time (long) – Time in seconds
- Returns:
True if the file is older than the given time, else False
- Return type:
bool
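A sketch of the age check, assuming "older" refers to the file's last modification time (mtime); the description above does not say which timestamp is compared.

```python
import os
import tempfile
import time


def is_file_old(filename, allowed_time):
    """Return True if `filename` was last modified more than `allowed_time` seconds ago."""
    age = time.time() - os.path.getmtime(filename)
    return age > allowed_time


# Demo: create a file and backdate its modification time by one hour
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name
an_hour_ago = time.time() - 3600
os.utime(path, (an_hour_ago, an_hour_ago))
print(is_file_old(path, 60))    # True: older than one minute
print(is_file_old(path, 7200))  # False: younger than two hours
os.remove(path)
```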
- glideinwms.factory.glideFactory.main(startup_dir)
Reads in the configuration file and starts up the factory
- Parameters:
startup_dir (str) – Path to glideinsubmit directory
- glideinwms.factory.glideFactory.save_stats(stats, fname)
Serialize and save aggregated statistics so that each component (Factory and Entries) can retrieve and use them to log and advertise.
stats is a dictionary pickled in binary format; stats[‘LogSummary’] holds the aggregated log summary info.
- Parameters:
stats – aggregated Factory statistics
fname – name of the file with the serialized data
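A sketch of the save/load round trip, assuming plain `pickle` with the highest protocol. The `load_stats` helper, the file name, and the write-then-rename step (so readers never see a half-written file) are assumptions of this sketch, not details stated above.

```python
import os
import pickle
import tempfile


def save_stats(stats, fname):
    """Pickle `stats` (a dict) to `fname` so other processes can load it."""
    directory = os.path.dirname(os.path.abspath(fname))
    # Write to a temporary file in the same directory, then rename it over fname
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(stats, fh, protocol=pickle.HIGHEST_PROTOCOL)
        os.replace(tmp_path, fname)  # atomic on POSIX when on the same filesystem
    except Exception:
        os.remove(tmp_path)  # do not leave partial files behind
        raise


def load_stats(fname):
    """Read back the dictionary written by save_stats."""
    with open(fname, "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as workdir:
    stats_file = os.path.join(workdir, "stats.pickle")
    save_stats({"LogSummary": {"entry_A": {"Completed": 3}}}, stats_file)
    print(load_stats(stats_file)["LogSummary"])  # {'entry_A': {'Completed': 3}}
```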
- glideinwms.factory.glideFactory.spawn(sleep_time, advertize_rate, startup_dir, glideinDescript, frontendDescript, entries, restart_attempts, restart_interval)
Spawn and keep track of the entry processes, restarting them if required. Advertise the glidefactoryglobal classad every iteration.
- Parameters:
sleep_time (long) – Delay between every iteration
advertize_rate (long) – Rate at which entries advertise their classads
startup_dir (str) – Path to glideinsubmit directory
glideinDescript (glideFactoryConfig.GlideinDescript) – Factory config’s glidein description object
frontendDescript (glideFactoryConfig.FrontendDescript) – Factory config’s frontend description object
entries (list) – Sorted list of entry names
restart_attempts (long) – Number of allowed restart attempts in the interval
restart_interval (long) – Allowed restart interval in seconds
- glideinwms.factory.glideFactory.termsignal(signr, frame)
Signal handler. Raise KeyboardInterrupt when receiving SIGTERM or SIGQUIT.
- glideinwms.factory.glideFactory.update_classads()
Loads the aggregate job summary pickle files, and then qedits the finished jobs, adding a new classad attribute called MONITOR_INFO with the monitor information.
- glideinwms.factory.glideFactory.write_descript(glideinDescript, frontendDescript, monitor_dir)
Write the descript.xml to the monitoring directory
- Parameters:
glideinDescript (glideFactoryConfig.GlideinDescript) – Factory config’s glidein description object
frontendDescript (glideFactoryConfig.FrontendDescript) – Factory config’s frontend description object
monitor_dir (str) – Path to monitoring directory
glideinwms.factory.glideFactoryConfig module
- class glideinwms.factory.glideFactoryConfig.ConfigFile(config_file, convert_function=<built-in function repr>)
Bases:
object
In-memory, dictionary-like representation of key-value config files. Loads a file composed of lines
NAME VAL
- and creates
self.data[NAME]=convert_function(VAL) # repr is the default conversion
- It also defines:
self.config_file="name of file"
This is used only to load into memory and access the dictionary, not to update the on-disk persistent values
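A sketch of the `NAME VAL` parsing described above, written as a free function over a list of lines. The helper name, the skipping of blank and `#` comment lines, and the sample keys are assumptions; the real class reads from `config_file`.

```python
def parse_config_lines(lines, convert_function=repr):
    """Build {NAME: convert_function(VAL)} from 'NAME VAL' lines (hypothetical helper)."""
    data = {}
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines and '#' comments are skipped
        if not line or line.startswith("#"):
            continue
        # NAME is the first token; VAL is the rest of the line (it may contain spaces)
        parts = line.split(None, 1)
        name = parts[0]
        val = parts[1] if len(parts) > 1 else ""
        data[name] = convert_function(val)
    return data


print(parse_config_lines(["# factory settings", "LoopDelay 60", "AdvertiseDelay 5"], int))
# {'LoopDelay': 60, 'AdvertiseDelay': 5}
print(parse_config_lines(["FactoryName my_factory"]))
# {'FactoryName': "'my_factory'"}  (the default repr adds quotes)
```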
- class glideinwms.factory.glideFactoryConfig.EntryConfigFile(entry_name, config_file, convert_function=<built-in function repr>)
Bases:
ConfigFile
Load from the entry subdir
- It also defines:
self.config_file="name of file with entry directory" (from parent ConfigFile)
self.entry_name="Entry name"
self.config_file_short="name of file" (just the file name, since the other has the directory)
- class glideinwms.factory.glideFactoryConfig.FrontendDescript
Bases:
ConfigFile
Contains the security identity and username mappings for the Frontends that are authorized to use this factory.
Contains a dictionary of dictionaries:
obj.data[frontend][‘ident’]=identity
obj.data[frontend][‘usermap’][sec_class]=username
- get_all_frontend_sec_classes()
Get a list of all frontend:sec_class
- Returns:
Frontend security classes
- Return type:
list
- get_all_usernames()
Get all the usernames assigned to all the frontends.
- Returns:
list of usernames
- Return type:
list
- get_frontend_name(identity)
Get the frontend:sec_class mapping for the given identity
- Parameters:
identity (str) – identity
- Returns:
Frontend name
- Return type:
str
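A sketch of how the three lookups could walk the dictionary-of-dictionaries layout described for FrontendDescript. They are written as free functions over a plain dict (hypothetical stand-ins for the methods); the sorted, de-duplicated username list, the `None` result for an unknown identity, and the sample data are assumptions.

```python
def get_all_frontend_sec_classes(data):
    """List 'frontend:sec_class' strings for every mapped security class."""
    return [f"{frontend}:{sec_class}"
            for frontend, info in data.items()
            for sec_class in info["usermap"]]


def get_all_usernames(data):
    """Collect the distinct usernames used by all frontends (sorted for stable output)."""
    return sorted({user for info in data.values() for user in info["usermap"].values()})


def get_frontend_name(data, identity):
    """Return the frontend whose 'ident' matches identity, or None."""
    for frontend, info in data.items():
        if info["ident"] == identity:
            return frontend
    return None


descript_data = {
    "fe_cms": {"ident": "cms@fe.example.org", "usermap": {"production": "fecms", "test": "fecmstest"}},
    "fe_osg": {"ident": "osg@fe.example.org", "usermap": {"main": "feosg"}},
}
print(get_all_frontend_sec_classes(descript_data))  # ['fe_cms:production', 'fe_cms:test', 'fe_osg:main']
print(get_all_usernames(descript_data))             # ['fecms', 'fecmstest', 'feosg']
print(get_frontend_name(descript_data, "osg@fe.example.org"))  # fe_osg
```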
- class glideinwms.factory.glideFactoryConfig.GlideinDescript
Bases:
ConfigFile
- class glideinwms.factory.glideFactoryConfig.GlideinKey(pub_key_type, key_fname=None, recreate=False)
Bases:
object
- class glideinwms.factory.glideFactoryConfig.JobAttributes(entry_name)
Bases:
JoinConfigFile
- class glideinwms.factory.glideFactoryConfig.JobDescript(entry_name)
Bases:
EntryConfigFile
- class glideinwms.factory.glideFactoryConfig.JobParams(entry_name)
Bases:
JoinConfigFile
- class glideinwms.factory.glideFactoryConfig.JobSubmitAttrs(entry_name)
Bases:
JoinConfigFile
- class glideinwms.factory.glideFactoryConfig.JoinConfigFile(entry_name, config_file, convert_function=<built-in function repr>)
Bases:
ConfigFile
Load both the main and the entry subdir config files and join the results.
Data is only read, not saved. self.data will contain the joined items (initially the common ones, then updated using the content of entry_obj.data).
- It also defines:
- self.config_file="name of both files, with and without entry directory"
It is not an actual file
self.entry_name="Entry name"
self.config_file_short="name of file" (just the file name without the directory)
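The join rule above amounts to a dictionary update in which entry values override the common ones. A minimal sketch on plain dicts; the helper name and the sample attribute values are illustrative only.

```python
def join_config_data(common_data, entry_data):
    """Start from the common values, then let entry-specific values override them."""
    joined = dict(common_data)  # copy so the inputs are left untouched
    joined.update(entry_data)   # entry values win on key collisions
    return joined


common = {"GLIDEIN_Site": "Default", "GLIDEIN_MaxMemMBs": "2500"}
entry = {"GLIDEIN_MaxMemMBs": "4000"}
print(join_config_data(common, entry))
# {'GLIDEIN_Site': 'Default', 'GLIDEIN_MaxMemMBs': '4000'}
```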
- class glideinwms.factory.glideFactoryConfig.SignatureFile
Bases:
ConfigFile
Signatures File dictionary
- load(fname, convert_function)
Load the signatures.sha1 file into the class as a dictionary. The convert_function is completely ignored here. The line format differs from all the other classes in that there are three values, with the key being the last value. The internal dictionary has the following structure:
- where:
line[0] is the signature for the line
line[1] is the descript file for the line
line[2] is the key for the line
- for each line:
line[2]_sign = line[0]
line[2]_descript = line[1]
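A sketch of the three-column mapping above, written as a free function over a list of lines. The helper name, skipping blank and `#` lines, and the sample signatures and file names are made up for illustration.

```python
def parse_signature_lines(lines):
    """Turn 'signature descript key' lines into {key_sign: ..., key_descript: ...}."""
    data = {}
    for raw in lines:
        line = raw.strip()
        # Assumption: skip blank lines and '#' comments
        if not line or line.startswith("#"):
            continue
        signature, descript, key = line.split(None, 2)
        data[f"{key}_sign"] = signature
        data[f"{key}_descript"] = descript
    return data


sig = parse_signature_lines(["3f2a9c description.abc.cfg main", "7be410 description.def.cfg entry_el8"])
print(sig["main_sign"], sig["entry_el8_descript"])  # 3f2a9c description.def.cfg
```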
glideinwms.factory.glideFactoryCredentials module
- exception glideinwms.factory.glideFactoryCredentials.CredentialError
Bases:
Exception
Defines a new exception so that we can catch only the credential errors here and let the “real” errors propagate up
- class glideinwms.factory.glideFactoryCredentials.SubmitCredentials(username, security_class)
Bases:
object
Data class containing all information needed to submit a glidein.
- glideinwms.factory.glideFactoryCredentials.check_security_credentials(auth_method, params, client_int_name, entry_name, scitoken_passthru=False)
Verify that only credentials for the given auth method are in the params
- Parameters:
auth_method (str) – authentication method of an entry, defined in the config
params (dict) – decrypted params passed in a frontend (client) request
client_int_name (str) – internal client name
entry_name (str) – name of the entry
scitoken_passthru (bool) – if True, a scitoken is present in the credential; override the checks for ‘auth_method’ and proceed with the glidein request
- Raises:
CredentialError – if the credentials in params don’t match what is defined for the auth method
- glideinwms.factory.glideFactoryCredentials.get_globals_classads(factory_collector='default')
- glideinwms.factory.glideFactoryCredentials.get_key_obj(pub_key_obj, classad)
Gets the symmetric key object from the request classad
- Parameters:
pub_key_obj (object) – The factory public key object. This contains all the encryption and decryption methods
classad (dict) – a dictionary representation of the classad
- glideinwms.factory.glideFactoryCredentials.process_global(classad, glidein_descript, frontend_descript)
- glideinwms.factory.glideFactoryCredentials.update_credential_file(username, client_id, credential_data, request_clientname)
Updates the credential file
- Parameters:
username – credentials’ username
client_id – id used for tracking the submit credentials
credential_data – the credentials to be advertised
request_clientname – client name passed by frontend
- Returns:
the updated credential file
- glideinwms.factory.glideFactoryCredentials.validate_frontend(classad, frontend_descript, pub_key_obj)
Validates that the frontend advertising the classad is allowed and that it claims to have the same identity that Condor thinks it has.
- Parameters:
classad (dict) – a dictionary representation of the classad
frontend_descript (class object) – class object containing all the frontend information
pub_key_obj (object) – The factory public key object. This contains all the encryption and decryption methods
- Returns:
sym_key_obj – the object containing the symmetric key used for decryption
frontend_sec_name – the frontend security name, used for determining the username to use
glideinwms.factory.glideFactoryDowntimeLib module
- class glideinwms.factory.glideFactoryDowntimeLib.DowntimeFile(fname)
Bases:
object
Handle a downtime file
A space-separated file with downtime information. Each line has space-separated values. The first line is a comment (starts with #) and header line:
"#%-29s %-30s %-20s %-30s %-20s # %s\n" % ("Start", "End", "Entry", "Frontend", "Sec_Class", "Comment")
- Each non-comment line in the file has at least two entries
start_time end_time expressed in utime
If end_time is None, the downtime does not have a set expiration (i.e. it runs forever). Additional entries are used to limit the scope (Entry, Frontend, Sec_Class) and to add a comment.
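A sketch of parsing one record of the format above. The helper name is hypothetical, and two details are assumptions: an open-ended downtime is stored as the literal word `None`, and missing scope columns mean "All". Only non-comment lines should be passed in.

```python
HEADER_FORMAT = "#%-29s %-30s %-20s %-30s %-20s # %s\n"


def parse_downtime_line(line):
    """Parse one non-comment downtime line (hypothetical helper)."""
    # Split off the optional trailing '# comment'
    body, _, comment = line.partition("#")
    fields = body.split()
    start_time = int(fields[0])
    # Assumption: an open-ended downtime stores the literal word None
    end_time = None if fields[1] == "None" else int(fields[1])
    # Missing scope columns default to "All"
    entry, frontend, sec_class = (fields[2:5] + ["All", "All", "All"])[:3]
    return {"start": start_time, "end": end_time, "entry": entry,
            "frontend": frontend, "security_class": sec_class,
            "comment": comment.strip()}


print(HEADER_FORMAT % ("Start", "End", "Entry", "Frontend", "Sec_Class", "Comment"), end="")
print(parse_downtime_line("1215339200 1215439170 el8_osg All All # kernel update"))
print(parse_downtime_line("1215439271 None"))
```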
- addPeriod(start_time, end_time, entry='All', frontend='All', security_class='All', comment='', create_if_empty=True)
Add a scheduled downtime. Maintain a lock (fcntl.LOCK_EX) on the downtime file while writing. entry, frontend, and security_class default to “All”.
- Parameters:
start_time (int) – start time in seconds from Epoch
end_time (int) – end time in seconds from Epoch
entry (str) – entry name or “All”
frontend (str) – frontend name or “All”
security_class (str) – security class name or “All”
comment (str) – comment to add
create_if_empty (bool) – if False, raise FileNotFoundError if there is not already a downtime file
- Returns:
0
- Return type:
int
- endDowntime(end_time=None, entry='All', frontend='All', security_class='All', comment='')
End a downtime (not a scheduled one). If end_time==None, use the current time. entry, frontend, and security_class default to “All”.
- Parameters:
end_time (int|None) – end time in seconds from Epoch. If end_time==None, default, use current time
entry (str) – entry name or “All”
frontend (str) – frontend name or “All”
security_class (str) – security class name or “All”
comment (str) – comment to add
- Returns:
number of records closed
- Return type:
int
- purgeOldPeriods(cut_time=None, raise_on_error=False)
Purge old downtime periods. If cut_time==None or 0, use the current time; if cut_time<0, use current_time-abs(cut_time).
- Parameters:
cut_time (int) – cut time in seconds from epoch, if cut_time==None or 0, use current time, if cut time<0, use current_time-abs(cut_time)
raise_on_error (bool) – if not True, mask all exceptions
- Returns:
number of records purged
- Return type:
int
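The cut-time rule above can be sketched as follows. `effective_cut_time` follows the description directly; which periods `purge_old_periods` drops (those that ended before the cut, never the open-ended ones) is an assumption, and both helper names are hypothetical.

```python
import time


def effective_cut_time(cut_time=None, now=None):
    """Resolve cut_time: None/0 -> now, negative -> now - abs(cut_time), else as given."""
    if now is None:
        now = time.time()
    if not cut_time:
        return now
    if cut_time < 0:
        return now - abs(cut_time)
    return cut_time


def purge_old_periods(periods, cut_time=None, now=None):
    """Drop (start, end) periods that ended before the cut time; return (kept, purged_count)."""
    cut = effective_cut_time(cut_time, now)
    # Assumption: open-ended periods (end is None) are never purged
    kept = [(start, end) for start, end in periods if end is None or end >= cut]
    return kept, len(periods) - len(kept)


periods = [(0, 100), (200, None), (300, 1500)]
print(purge_old_periods(periods, cut_time=-100, now=1000))  # ([(200, None), (300, 1500)], 1)
```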
- read(raise_on_error=False)
Return a list of downtime periods (utimes). A value of None indicates “forever”, for example: [(1215339200,1215439170),(1215439271,None)]
- Parameters:
raise_on_error (bool) – if not True mask all the exceptions
- Returns:
- list of downtime periods [(start, end), …]
A value of None indicates “forever” (no start time or no end time). Timestamps are in seconds from epoch (utime). [] is returned when raise_on_error is False (default) and there is no downtime file
- Return type:
list
- startDowntime(start_time=None, end_time=None, entry='All', frontend='All', security_class='All', comment='', create_if_empty=True)
Start a downtime whose end is not known. If start_time==None, use the current time. entry, frontend, and security_class default to “All”.
- Parameters:
start_time (int|None) – start time in seconds from Epoch
end_time (int|None) – end time in seconds from Epoch
entry (str) – entry name or “All”
frontend (str) – frontend name or “All”
security_class (str) – security class name or “All”
comment (str) – comment to add
create_if_empty (bool) – if False, raise FileNotFoundError if there is not already a downtime file
- glideinwms.factory.glideFactoryDowntimeLib.addPeriod(fname, start_time, end_time, entry='All', frontend='All', security_class='All', comment='', create_if_empty=True)
Add a downtime period. Maintain a lock (fcntl.LOCK_EX) on the downtime file while writing.
- Parameters:
fname (str|Path) – downtime file
start_time (int) – start time in seconds from Epoch
end_time (int) – end time in seconds from Epoch
entry (str) – entry name or “All”
frontend (str) – frontend name or “All”
security_class (str) – security class name or “All”
comment (str) – comment to add
create_if_empty (bool) – if False, raise FileNotFoundError if there is not already a downtime file
- Returns:
0
- Return type:
int
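A sketch of the locked append described for addPeriod. It assumes `fcntl.flock()` (POSIX only) as the locking call, a simplified single-space record layout instead of the padded header columns, and the literal word `None` for an open end time; `add_period` is a hypothetical name.

```python
import fcntl
import os
import tempfile

HEADER = "#%-29s %-30s %-20s %-30s %-20s # %s\n" % ("Start", "End", "Entry", "Frontend", "Sec_Class", "Comment")


def add_period(fname, start_time, end_time, entry="All", frontend="All",
               security_class="All", comment="", create_if_empty=True):
    """Append one downtime record while holding an exclusive lock (sketch)."""
    if not create_if_empty and not os.path.exists(fname):
        raise FileNotFoundError(fname)
    end_str = "None" if end_time is None else str(int(end_time))
    record = f"{int(start_time)} {end_str} {entry} {frontend} {security_class} # {comment}\n"
    with open(fname, "a") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # block until no other writer holds the lock
        try:
            if os.fstat(fh.fileno()).st_size == 0:  # new/empty file: write the header first
                fh.write(HEADER)
            fh.write(record)
            fh.flush()
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
    return 0


with tempfile.TemporaryDirectory() as workdir:
    downtime_file = os.path.join(workdir, "downtimes")
    add_period(downtime_file, 1700000000, None, entry="el8_osg", comment="site maintenance")
    with open(downtime_file) as fh:
        print(fh.read(), end="")
```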
- glideinwms.factory.glideFactoryDowntimeLib.checkDowntime(fname, entry='Any', frontend='Any', security_class='Any', check_time=None)
Check if there is a downtime at check_time. If check_time==None, use the current time. “All” (default) is a wildcard for entry, frontend and security_class.
- Parameters:
fname (str|Path) – Downtime file
entry (str) – entry name or “All”
frontend (str) – frontend name or “All”
security_class (str) – security class name or “All”
check_time – time to check in seconds from epoch, if check_time==None, use current time
- Returns:
- tuple with the comment string and True if in downtime,
or (“”, False) if not in downtime
- Return type:
(str, bool)
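A sketch of the time check behind the returned tuple, on an in-memory list of `(start, end, comment)` periods. It ignores the entry/frontend/security_class scope filtering, and treating the start as inclusive and the end as exclusive is an assumption; `check_downtime` is a hypothetical name.

```python
import time


def check_downtime(periods, check_time=None):
    """Return (comment, True) if check_time is inside a period, else ("", False)."""
    if check_time is None:
        check_time = time.time()
    for start, end, comment in periods:
        # None on either side means the period is unbounded on that side
        started = start is None or start <= check_time
        not_ended = end is None or check_time < end
        if started and not_ended:
            return comment, True
    return "", False


periods = [(100, 200, "kernel update"), (300, None, "site decommissioned")]
print(check_downtime(periods, 150))   # ('kernel update', True)
print(check_downtime(periods, 250))   # ('', False)
print(check_downtime(periods, 5000))  # ('site decommissioned', True)
```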
- glideinwms.factory.glideFactoryDowntimeLib.endDowntime(fname, end_time=None, entry='All', frontend='All', security_class='All', comment='')
End a downtime (not a scheduled one). If end_time==None, use the current time. “All” (default) is a wildcard for entry, frontend and security_class.
- Parameters:
fname (str|Path) – Downtime file
end_time (int) – end time in seconds from epoch, if end_time==None, use current time
entry (str) – entry name or “All”
frontend (str) – frontend name or “All”
security_class (str) – security class name or “All”
comment (str) – comment to add
- Returns:
Number of downtime records closed
- Return type:
int
- glideinwms.factory.glideFactoryDowntimeLib.printDowntime(fname, entry='Any', check_time=None)
- glideinwms.factory.glideFactoryDowntimeLib.purgeOldPeriods(fname, cut_time=None, raise_on_error=False)
Purge old rules using cut_time. If cut_time==None or 0, use the current time; if cut_time<0, use current_time-abs(cut_time).
- Parameters:
fname (str|Path) – downtime file
cut_time (int) – cut time in seconds from epoch, if cut_time==None or 0, use current time, if cut time<0, use current_time-abs(cut_time)
raise_on_error (bool) – if not True, mask all exceptions
- Returns:
number of records purged
- Return type:
int
- glideinwms.factory.glideFactoryDowntimeLib.read(fname, raise_on_error=False)
Return a list of downtime periods (utimes). A value of None indicates “forever”, for example: [(1215339200,1215439170),(1215439271,None)]
- Parameters:
fname (str|Path) – downtimes file
raise_on_error (bool) – if not True mask all the exceptions
- Returns:
- list of downtime periods [(start, end), …]
A value of None indicates “forever” (no start time or no end time). Timestamps are in seconds from epoch (utime). [] is returned when raise_on_error is False (default) and there is no file
- Return type:
list
glideinwms.factory.glideFactoryEntry module
Entry class: model and behavior of a Factory Entry (an element describing a resource)
- class glideinwms.factory.glideFactoryEntry.Entry(name, startup_dir, glidein_descript, frontend_descript)
Bases:
object
- advertise(downtime_flag)
Advertises the glidefactory and the glidefactoryclient classads.
- Parameters:
downtime_flag (bool) – Downtime flag
- getGlideinExpectedCores()
Return the number of cores expected for each glidein.
This is the GLIDEIN_CPU attribute when > 0, GLIDEIN_ESTIMATED_CPUS when GLIDEIN_CPU <= 0 or auto/node/slot, or 1 if not set. The actual cores received will depend on the RSL or HTCondor attributes and the Entry, and could also vary over time.
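The selection rule above can be sketched as a small function. The parameter names are hypothetical stand-ins for the GLIDEIN_CPU and GLIDEIN_ESTIMATED_CPUS values, and falling back to 1 when the estimate is also missing is an assumption about how "1 if not set" applies.

```python
def expected_cores(glidein_cpus=None, estimated_cpus=None):
    """Cores expected per glidein, following the rule described above (sketch)."""
    # Attribute not set at all: assume one core
    if glidein_cpus is None:
        return 1
    value = str(glidein_cpus).strip().lower()
    if value not in ("auto", "node", "slot") and int(value) > 0:
        return int(value)  # an explicit positive core count wins
    # auto/node/slot or <= 0: fall back to the estimate (assumption: 1 if it is also unset)
    return int(estimated_cpus) if estimated_cpus else 1


print(expected_cores(8))          # 8
print(expected_cores("auto", 16)) # 16
print(expected_cores(0, 4))       # 4
print(expected_cores())           # 1
```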
- getLogStatsCurrentStatsData()
Returns the gflFactoryConfig.log_stats.current_stats_data that can be pickled
- Returns:
condorLogSummary from current iteration
- Return type:
glideFactoryMonitoring.condorLogSummary
- getLogStatsData(stats_data)
Returns the stats_data (stats_data[frontend][user].data) that can be pickled
- Returns:
Relevant stats data to pickle
- Return type:
dict
- getLogStatsOldStatsData()
Returns the gflFactoryConfig.log_stats.old_stats_data that can be pickled
- Returns:
condorLogSummary from previous iteration
- Return type:
glideFactoryMonitoring.condorLogSummary
- getState()
Compile a dictionary containing useful state information
- Returns:
Useful state information that can be pickled and restored
- Return type:
dict
- glideinsWithinLimits(condorQ)
Check the condorQ info to see whether we are within limits, and initialize the entry limits
- Returns:
True if glideins are within limits and we can submit more
- Return type:
bool
- initIteration(factory_in_downtime)
Perform the resetting of stats as required before every iteration
- Parameters:
factory_in_downtime (bool) – Downtime flag for the factory
- isClientBlacklisted(client_sec_name)
Check if the frontend whitelist is enabled and the client is not in the whitelist
- Returns:
True if the client’s security name is blacklisted
- Return type:
bool
- isClientInWhitelist(client_sec_name)
Check if the client’s security name is in the whitelist of this entry
- Returns:
True if the client’s security name is in the whitelist
- Return type:
bool
- isClientWhitelisted(client_sec_name)
Check if the client’s security name is in the whitelist of this entry and the frontend whitelist is enabled
- Returns:
True if the client’s security name is whitelisted
- Return type:
bool
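The three whitelist checks above combine as in the following sketch. The whitelist and its enabled flag are passed explicitly here (the methods read them from the Entry configuration), so the free-function signatures and the sample security names are hypothetical.

```python
def is_client_in_whitelist(client_sec_name, whitelist):
    return client_sec_name in whitelist


def is_client_whitelisted(client_sec_name, whitelist_enabled, whitelist):
    # Being on the list only counts when the whitelist is actually enforced
    return bool(whitelist_enabled) and is_client_in_whitelist(client_sec_name, whitelist)


def is_client_blacklisted(client_sec_name, whitelist_enabled, whitelist):
    # "Blacklisted" here means: whitelist enforced and the client is not on it
    return bool(whitelist_enabled) and not is_client_in_whitelist(client_sec_name, whitelist)


allowed = {"fe_cms:production"}
print(is_client_whitelisted("fe_cms:production", True, allowed))  # True
print(is_client_blacklisted("fe_osg:main", True, allowed))        # True
print(is_client_blacklisted("fe_osg:main", False, allowed))       # False
```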
- isInDowntime()
Check the downtime file to find out if the entry is in downtime
- Returns:
True if the entry is in downtime
- Return type:
bool
- isSecurityClassAllowed(client_sec_name, proxy_sec_class)
Check if the security class is allowed
- Returns:
True if the security class is allowed
- Return type:
bool
- isSecurityClassInDowntime(client_security_name, security_class)
Check if the security class is in downtime in the Factory or in this Entry
- Returns:
True if the security class is in downtime
- Return type:
bool
- loadContext()
Load the context for this entry object so monitoring and logs are written correctly. This should be called in every method for now.
- queryQueuedGlideins()
Query the WMS schedd (on the Factory) and get glideins info. Re-raise in case of failures. Return a loaded condorMonitor.CondorQ object using the entry attributes (name, schedd, …). It consists of a fetched dictionary with jobs (keyed by job cluster, ID) in .stored_data, some query attributes, and the ability to reload (load/fetch).
- Returns:
Information about the jobs in condor_schedd
- Return type:
condorMonitor.CondorQ (already loaded)
- setDowntime(downtime_flag)
Check if we are in downtime and set info accordingly
- Parameters:
downtime_flag (bool) – Downtime flag
- setLogStatsCurrentStatsData(new_data)
Set gflFactoryConfig.log_stats.current_stats_data from pickled info
- Parameters:
new_data (glideFactoryMonitoring.condorLogSummary) – Data from pickled object to load
- setLogStatsData(stats_data, new_data)
Sets the stats_data (stats_data[frontend][user].data) from pickled info
- Parameters:
stats_data (dict) – Stats data
new_data (dict) – Stats data from pickled info
- setLogStatsOldStatsData(new_data)
Set old_stats_data or current_stats_data from pickled info
- Parameters:
new_data (glideFactoryMonitoring.condorLogSummary) – Data from pickled object to load
- setState(state)
Load the post work state from the pickled info
- Parameters:
state (dict) – Pickled state after doing work
- setState_old(state)
Load the post work state from the pickled info
- Parameters:
state (dict) – Pickled state after doing work
- writeClassadsToFile(downtime_flag, gf_filename, gfc_filename, append=True)
Create the glidefactory and glidefactoryclient classads to advertise, but do not advertise them
- Parameters:
downtime_flag (bool) – downtime flag
gf_filename (str) – Filename to write glidefactory classads
gfc_filename (str) – Filename to write glidefactoryclient classads
append (bool) – True to append new classads, i.e. a multi-classad file
- writeStats()
Calls the statistics functions to record and write stats for this iteration.
There are several main types of statistics:
log stats: Come from parsing the condor_activity and job logs. These are computed every iteration (in perform_work()) and diff-ed to see any newly changed job statuses (i.e. newly completed jobs)
qc stats: From condor_q data.
rrd stats: Used in monitoring statistics for javascript rrd graphs.
- glideinwms.factory.glideFactoryEntry.check_and_perform_work(factory_in_downtime, entry, work)
Check if we need to do the work and then do the work. Called by the child process for each entry
- Parameters:
factory_in_downtime – Flag if factory is in downtime
entry (glideFactoryEntry.Entry) – Entry object
work – all the work requests for the Entry
- glideinwms.factory.glideFactoryEntry.perform_work_v3(entry, condorQ, client_name, client_int_name, client_security_name, submit_credentials, remove_excess, idle_glideins, max_glideins, idle_lifetime, credential_username, glidein_totals, frontend_name, client_web, params)
Perform the work (submit or remove glideins)
- Parameters:
entry (glideFactoryEntry.Entry) – Entry object
condorQ (condorMonitor.CondorQ) – Information about the jobs in condor_schedd (entry values sub-query from glideFactoryLib.getQCredentials())
client_int_name (str) – Internal name of the client
client_security_name (str) – Security name of the client
submit_credentials – credentials used
remove_excess (tuple) – remove_excess_str, remove_excess_margin; if the frontend wants us to remove excess glideins
idle_glideins (int) – Number of idle glideins
max_glideins (int) – Maximum number of running glideins
idle_lifetime
credential_username (str) – Credential username
glidein_totals (object) – glidein_totals object
frontend_name (str) – Name of the frontend
client_web (str) – Client’s web location
params (object) – Params object
- Returns:
1 if something was submitted, 0 otherwise
- glideinwms.factory.glideFactoryEntry.unit_work_v3(entry, work, client_name, client_int_name, client_int_req, client_expected_identity, decrypted_params, params, in_downtime, condorQ)
Perform a single work unit using the v3 protocol.
- Parameters:
entry – Entry
work – work requests
client_name – work_key (key used in the work request)
client_int_name – client name declared in the request
client_int_req – name of the request (declared in the request)
client_expected_identity
decrypted_params
params
in_downtime
condorQ – list of HTCondor jobs for this entry as returned by entry.queryQueuedGlideins()
- Returns:
Return dictionary w/ success, security_names and work_done
- glideinwms.factory.glideFactoryEntry.update_entries_stats(factory_in_downtime, entry_list)
Update client_stats for the entries in the list. Used for entries with no job requests.
TODO: #22163, skip update when in downtime?
NOTE: qc_stats cannot be updated because the frontend certificate information is missing
- Parameters:
factory_in_downtime – True if the Factory is in downtime, here for future needs (not used now)
entry_list – list of entry names for the entries to update
- Returns:
list of names of the entries that have been updated (subset of entry_list)
glideinwms.factory.glideFactoryEntryGroup module
- This is the glideinFactoryEntryGroup. Common tasks like querying the collector
and advertising the work done by the group are done here
- param $1 = parent_pid:
The pid for the Factory daemon
- type $1 = parent_pid:
int
- param $2 = sleep_time:
The number of seconds to sleep between iterations
- type $2 = sleep_time:
int
- param $3 = advertize_rate:
The rate at which advertising should occur (every $3 loops)
- type $3 = advertize_rate:
int
- param $4 = startup_dir:
The “home” directory for the entry.
- type $4 = startup_dir:
str|Path
- param $5 = entry_names:
Colon separated list with the names of the entries this process should work on
- type $5 = entry_names:
str
- param $6 = group_id:
Group id, normally a number (with the “group_” prefix it forms the group name). It can change between Factory reconfigurations
- type $6 = group_id:
str
- glideinwms.factory.glideFactoryEntryGroup.check_parent(parent_pid, glideinDescript, my_entries)
Check to make sure that we aren’t an orphaned process. If the Factory daemon has died, then clean up after ourselves and kill ourselves off.
- Parameters:
parent_pid (int) – pid for the Factory daemon process
glideinDescript (glideFactoryConfig.GlideinDescript) – Object that encapsulates glidein.descript in the Factory root directory
my_entries (dict) – Dictionary of entry objects keyed on entry name
- Raises:
KeyboardInterrupt – Raised when the Factory daemon cannot be found
- glideinwms.factory.glideFactoryEntryGroup.compile_pickle_data(entry, work_done)
Extract the state of the entry after doing work
- Parameters:
entry (Entry) – Entry object
work_done (int) – Work done info
- Returns:
pickle-friendly version of the Entry (state of the Entry)
- Return type:
dict
- glideinwms.factory.glideFactoryEntryGroup.find_and_perform_work(do_advertize, factory_in_downtime, glideinDescript, frontendDescript, group_name, my_entries)
For all entries in this group, find work requests from the WMS collector, validate credentials, and request Glideins. If an entry is in downtime, the number of requested Glideins is zero.
- Parameters:
do_advertize (bool) – Advertise (publish the gfc ClassAd) even if no work is performed
factory_in_downtime (bool) – True if factory is in downtime
glideinDescript (dict) – Factory glidein config values
frontendDescript (dict) – Security mappings for frontend identities, security classes, and usernames
group_name (str) – Name of the group
my_entries (dict) – Dictionary of entry objects (glideFactoryEntry.Entry) keyed on entry name
- Returns:
Dictionary of work to do keyed using entry name
- Return type:
dict
- glideinwms.factory.glideFactoryEntryGroup.find_work(factory_in_downtime, glideinDescript, frontendDescript, group_name, my_entries)
Find work for all the entries in the group
- Parameters:
factory_in_downtime (bool) – True if factory is in downtime
glideinDescript (dict) – Factory glidein config values
frontendDescript (dict) – Security mappings for frontend identities, security classes, and usernames
group_name (str) – Name of the group
my_entries (dict) – Dictionary of entry objects keyed on entry name
- Returns:
Dictionary of work to do keyed on entry name
- Return type:
dict
- glideinwms.factory.glideFactoryEntryGroup.forked_check_and_perform_work(factory_in_downtime, entry, work)
Do the work assigned to an entry (glidein requests)
- Parameters:
factory_in_downtime – flag, True if the Factory is in downtime
entry – entry object (glideFactoryEntry.Entry)
work – work requests for the entry
- Returns:
dictionary with entry state + work_done
- glideinwms.factory.glideFactoryEntryGroup.forked_update_entries_stats(factory_in_downtime, entries_list)
Update statistics for entries that have no work to do
- Parameters:
factory_in_downtime
entries_list
- glideinwms.factory.glideFactoryEntryGroup.get_work_count(work)[source]¶
Get total work to do i.e. sum of work to do for every entry
@type work: dict @param work: Dictionary of work to do keyed on entry name
@rtype: int @return: Total work to do.
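A minimal sketch of this count, assuming that each value in `work` is itself a dictionary of per-Frontend requests (the `work[entry_name][frontend]` layout documented for findGroupWork) and that every Frontend request counts as one unit of work. The entry and Frontend names are hypothetical.

```python
def get_work_count(work):
    """Count work units: one per Frontend request, summed over all entries.

    Sketch only: assumes work maps entry name -> {frontend_name: request_info}.
    """
    return sum(len(frontend_requests) for frontend_requests in work.values())


# Hypothetical work dictionary: two entries, three Frontend requests in total
work = {
    "entry_A": {"frontend1": {"params": {}, "requests": {}},
                "frontend2": {"params": {}, "requests": {}}},
    "entry_B": {"frontend1": {"params": {}, "requests": {}}},
}
print(get_work_count(work))  # 3
```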
- glideinwms.factory.glideFactoryEntryGroup.iterate(parent_pid, sleep_time, advertize_rate, glideinDescript, frontendDescript, group_name, my_entries)[source]¶
Iterate over set of tasks until it is time to quit or die. The main “worker” function for the Factory Entry Group.
- Parameters:
parent_pid (int) – The pid for the Factory daemon
sleep_time (int) – The number of seconds to sleep between iterations
advertize_rate (int) – The rate at which advertising should occur
glideinDescript (glideFactoryConfig.GlideinDescript) – glidein.descript object in the Factory root dir
frontendDescript (glideFactoryConfig.FrontendDescript) – frontend.descript object in the Factory root dir
group_name (str) – Name of the group
my_entries (dict) – Dictionary of entry objects keyed on entry name
- glideinwms.factory.glideFactoryEntryGroup.iterate_one(do_advertize, factory_in_downtime, glideinDescript, frontendDescript, group_name, my_entries)[source]¶
One iteration of the entry group
- Parameters:
do_advertize (bool) – True if glidefactory classads should be advertised
factory_in_downtime (bool) – True if factory is in downtime
glideinDescript (dict) – Factory glidein config values
frontendDescript (dict) – Security mappings for frontend identities, security classes, and usernames
group_name (str) – Name of the group
my_entries (dict) – Dictionary of entry objects (glideFactoryEntry.Entry) keyed on entry name
- Returns:
Units of work performed (0 if no Glidein was submitted)
- Return type:
int
- glideinwms.factory.glideFactoryEntryGroup.main(parent_pid, sleep_time, advertize_rate, startup_dir, entry_names, group_id)[source]¶
GlideinFactoryEntryGroup main function
Set up logging, monitoring, and configuration information. Starts the Entry group main loop and handles cleanup at shutdown.
- Parameters:
parent_pid (int) – The pid for the Factory daemon
sleep_time (int) – The number of seconds to sleep between iterations
advertize_rate (int) – The rate at which advertising should occur
startup_dir (str|Path) – The “home” directory for the entry.
entry_names (str) – Colon separated list with the names of the entries this process should work on
group_id (str) – Group id, normally a number (with the “group_” prefix it forms the group name). It can change between Factory reconfigurations
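The two string arguments of main have simple formats: entry_names is a colon-separated list and the group name is group_id with the "group_" prefix. A minimal sketch of turning them into usable values; parse_group_args and the entry names are hypothetical, not part of the Factory code.

```python
def parse_group_args(entry_names, group_id):
    """Split the colon-separated entry list and build the group name.

    Hypothetical helper: illustrates the documented argument formats only.
    """
    # Drop empty fields, e.g. those produced by a trailing colon
    entries = [name for name in entry_names.split(":") if name]
    group_name = f"group_{group_id}"
    return entries, group_name


entries, group_name = parse_group_args("el8_site:el9_site:gpu_site", "0")
print(entries)     # ['el8_site', 'el9_site', 'gpu_site']
print(group_name)  # group_0
```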
glideinwms.factory.glideFactoryInterface module¶
This module implements the functions needed to advertize and get commands from the Collector
- class glideinwms.factory.glideFactoryInterface.EntryClassad(factory_name, glidein_name, entry_name, trust_domain, auth_method, supported_signtypes, pub_key_obj=None, glidein_submit={}, glidein_attrs={}, glidein_params={}, glidein_monitors={}, glidein_stats={}, glidein_web_attrs={}, glidein_config_limits={})[source]¶
Bases:
Classad
This class describes the glidefactory classad. Factory advertises the glidefactory classad to the user pool as an UPDATE_AD_GENERIC type classad
- class glideinwms.factory.glideFactoryInterface.FactoryGlobalClassad(factory_name, glidein_name, supported_signtypes, pub_key_obj)[source]¶
Bases:
Classad
This class describes the glidefactoryglobal classad. Factory advertises the glidefactoryglobal classad to the user pool as an UPDATE_AD_GENERIC type classad
glidefactory and glidefactoryglobal classads must be of the same type because they may be invalidated together (with a single command)
- class glideinwms.factory.glideFactoryInterface.MultiAdvertizeGlideinClientMonitoring(factory_name, glidein_name, entry_name, glidein_attrs, factory_collector='default')[source]¶
Bases:
object
- glideinwms.factory.glideFactoryInterface._remove_if_there(fname)[source]¶
Remove the file and ignore errors (e.g. file not there)
- glideinwms.factory.glideFactoryInterface.advertizeGlideinClientMonitoring(factory_name, glidein_name, entry_name, client_name, client_int_name, client_int_req, glidein_attrs={}, client_params={}, client_monitors={}, factory_collector='default')[source]¶
- glideinwms.factory.glideFactoryInterface.advertizeGlideinClientMonitoringFromFile(fname, remove_file=True, is_multi=False, factory_collector='default')[source]¶
- glideinwms.factory.glideFactoryInterface.advertizeGlideinFromFile(fname, remove_file=True, is_multi=False, factory_collector='default')[source]¶
- glideinwms.factory.glideFactoryInterface.advertizeGlobal(factory_name, glidein_name, supported_signtypes, pub_key_obj, stats_dict={}, factory_collector='default')[source]¶
Creates the glidefactoryglobal classad and advertises it.
@type factory_name: string @param factory_name: the name of the factory
@type glidein_name: string @param glidein_name: name of the glidein
@type supported_signtypes: string @param supported_signtypes: supported sign types, i.e. sha1
@type pub_key_obj: GlideinKey @param pub_key_obj: for the frontend to use in encryption
@type stats_dict: dict @param stats_dict: completed jobs statistics
@type factory_collector: string or None @param factory_collector: the collector to query, special value ‘default’ will get it from the global config
@todo add factory downtime?
- glideinwms.factory.glideFactoryInterface.createGlideinClientMonitoringFile(fname, factory_name, glidein_name, entry_name, client_name, client_int_name, client_int_req, glidein_attrs={}, client_params={}, client_monitors={}, limits_triggered={}, do_append=False)[source]¶
- glideinwms.factory.glideFactoryInterface.deadvertizeAllGlideinClientMonitoring(factory_name, glidein_name, entry_name, factory_collector='default')[source]¶
Deadvertize monitoring classads for the given entry.
- glideinwms.factory.glideFactoryInterface.deadvertizeFactory(factory_name, glidein_name, factory_collector='default')[source]¶
Deadvertize all entry and global classads for this factory.
- glideinwms.factory.glideFactoryInterface.deadvertizeFactoryClientMonitoring(factory_name, glidein_name, factory_collector='default')[source]¶
Deadvertize all monitoring classads for this factory.
- glideinwms.factory.glideFactoryInterface.deadvertizeGlidein(factory_name, glidein_name, entry_name, factory_collector='default')[source]¶
Removes the glidefactory classad advertising the entry from the WMS Collector.
- glideinwms.factory.glideFactoryInterface.deadvertizeGlobal(factory_name, glidein_name, factory_collector='default')[source]¶
Removes the glidefactoryglobal classad advertising the factory globals from the WMS Collector.
- glideinwms.factory.glideFactoryInterface.exe_condor_advertise(fname, command, is_multi=False, factory_collector=None)[source]¶
- glideinwms.factory.glideFactoryInterface.findGroupWork(factory_name, glidein_name, entry_names, supported_signtypes, pub_key_obj=None, additional_constraints=None, factory_collector='default')[source]¶
Find request classAds that have my (factory, glidein name, entries) and create the dictionary of dictionaries of work request information. Example: work[entry_name][frontend] = {'params': 'value', 'requests': 'value'}
@type factory_name: string @param factory_name: name of the factory
@type glidein_name: string @param glidein_name: name of the glidein instance
@type entry_names: list @param entry_names: list of factory entry names
@type supported_signtypes: list @param supported_signtypes: only one kind of signtype is supported, ‘sha1’; default is None
@type pub_key_obj: string @param pub_key_obj: only ‘RSA’ is supported; defaults to None
@type additional_constraints: string @param additional_constraints: any additional constraints to include for querying the WMS collector, default is None
@type factory_collector: string or None @param factory_collector: the collector to query, special value ‘default’ will get it from the global config
@rtype: dict @return: Dictionary of work to perform. Return format is work[entry_name][frontend] = {'params': 'value', 'requests': 'value'}
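Because the returned value is a dictionary of dictionaries, a caller walks it two levels deep: entry, then Frontend. A minimal sketch; the entry and Frontend names are hypothetical and the contents of 'params' and 'requests' are placeholders, not real request attributes.

```python
def iter_requests(work):
    """Yield (entry_name, frontend, params, requests) from a findGroupWork-style dict.

    Sketch only: the nested layout follows the docstring above.
    """
    for entry_name, frontends in work.items():
        for frontend, info in frontends.items():
            yield entry_name, frontend, info["params"], info["requests"]


# Hypothetical result of a group query with placeholder contents
work = {
    "entry_A": {
        "fe_instance.group1": {"params": {"param1": "value1"}, "requests": {"min_idle": 10}},
    },
    "entry_B": {
        "fe_instance.group1": {"params": {}, "requests": {"min_idle": 0}},
    },
}
for entry_name, frontend, params, requests in iter_requests(work):
    print(entry_name, frontend, requests)
```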
- glideinwms.factory.glideFactoryInterface.findWork(factory_name, glidein_name, entry_name, supported_signtypes, pub_key_obj=None, additional_constraints=None, factory_collector='default')[source]¶
Find request classAds that have my (factory, glidein name, entry name) and create the dictionary of work request information.
@type factory_name: string @param factory_name: name of the factory
@type glidein_name: string @param glidein_name: name of the glidein instance
@type entry_name: string @param entry_name: name of the factory entry
@type supported_signtypes: list @param supported_signtypes: only one kind of signtype is supported, ‘sha1’; default is None
@type pub_key_obj: string @param pub_key_obj: only ‘RSA’ is supported
@type additional_constraints: string @param additional_constraints: any additional constraints to include for querying the WMS collector, default is None
@type factory_collector: string or None @param factory_collector: the collector to query, special value ‘default’ will get it from the global config
@return: dictionary, each key is the name of a frontend. Each value has a ‘requests’ and a ‘params’ key. Both refer to classAd dictionaries.
glideinwms.factory.glideFactoryLib module¶
This module implements the functions needed to keep the required number of idle glideins. It also has support for glidein sanitizing
- class glideinwms.factory.glideFactoryLib.ClientWeb(client_web_url, client_signtype, client_descript, client_sign, client_group, client_group_web_url, client_group_descript, client_group_sign, factoryConfig=None)[source]¶
Bases:
object
- class glideinwms.factory.glideFactoryLib.GlideinTotals(entry_name, frontendDescript, jobDescript, entry_condorQ, log=None)[source]¶
Bases:
object
Keeps track of all glidein totals.
- add_idle_glideins(nr_glideins, frontend_name)[source]¶
Updates the totals with the additional glideins.
- glideinwms.factory.glideFactoryLib.clean_glidein_queue(remove_excess_tp, glidein_totals, condorQ, req_min_idle, req_max_glideins, frontend_name, log=None, factoryConfig=None)[source]¶
Cleans up the glideins queue (removes any excesses) per the frontend request.
We are not adjusting the glidein totals with what has been removed from the queue. It may take a cycle (or more) for these removals to show up in the queue, so it would be difficult to reflect the true state of the system.
- TODO: req_min_idle=0 while remove_excess_tp.frontend_req_min_idle is not 0 means that a limit was reached in the Factory
or some component (Factory/Entry) is in downtime. Check if the removal behavior should change
- Parameters:
remove_excess_tp (tuple) – (remove_excess_str, remove_excess_margin, frontend_req_min_idle). remove_excess_str (NO, WAIT, IDLE, ALL) and remove_excess_margin are the removal request from the Frontend. The frontend_req_min_idle item of the tuple indicates the original frontend pressure. We use it instead of req_min_idle for the IDLE pilot removal because the factory could set req_min_idle to 0 if an entry is in downtime or the factory limits are reached. We do not want to remove idle pilots in these cases!
glidein_totals (dict) – Number of Glideins in different states for each Frontend
condorQ (dict) – Results of condor_q, classified
req_min_idle – min_idle requested by the Frontend (NOT USED; frontend_req_min_idle in remove_excess_tp is used instead to avoid the effects of Factory limits)
req_max_glideins – max_glideins requested by the Frontend
frontend_name (str) – Name of the Frontend, to use as key
log (logging.Logger) – logger
factoryConfig (FactoryConfig) – configuration object
- Returns:
1 if some glideins were removed, 0 otherwise
- Return type:
int
TODO:V could return the number of glideins removed
- glideinwms.factory.glideFactoryLib.executeSubmit(log, factoryConfig, username, schedd, exe_env, submitFile)[source]¶
Submit Glideins using the condor_submit command in a custom environment
- Parameters:
log – logger to use
factoryConfig
username
schedd (str) – HTCSS schedd name
exe_env (list) – environment list
submitFile (str) – path of the submit file
- Returns:
list of submitted Glideins
- Return type:
list
- glideinwms.factory.glideFactoryLib.extractHeldSimple(q, factoryConfig=None)[source]¶
All Held Glideins: JobStatus == 5
- Parameters:
q – dictionary of Glideins from condor_q
factoryConfig (FactoryConfig) – Factory configuration (NOT USED, for interface)
- Returns:
dictionary of Held Glideins from condor_q
- Return type:
dict
- glideinwms.factory.glideFactoryLib.extractIdleQueued(q, factoryConfig=None)[source]¶
All Idle Glideins already submitted: with hash_status 1xxx except 1001
hash_status 1xxx implies JobStatus 1
- Parameters:
q – dictionary of Glideins from condor_q
factoryConfig (FactoryConfig) – Factory configuration (NOT USED, for interface)
- Returns:
dictionary of Idle and Submitted Glideins from condor_q
- Return type:
dict
- glideinwms.factory.glideFactoryLib.extractIdleSimple(q, factoryConfig=None)[source]¶
All Idle Glideins: JobStatus == 1
- Parameters:
q – dictionary of Glideins from condor_q
factoryConfig (FactoryConfig) – Factory configuration (NOT USED, for interface)
- Returns:
dictionary of Idle Glideins from condor_q
- Return type:
dict
- glideinwms.factory.glideFactoryLib.extractIdleUnsubmitted(q, factoryConfig=None)[source]¶
All Idle Glideins not yet submitted (Unsubmitted, aka Waiting): with hash_status 1001
hash_status 1xxx implies JobStatus 1
- Parameters:
q – dictionary of Glideins from condor_q
factoryConfig (FactoryConfig) – Factory configuration (NOT USED, for interface)
- Returns:
dictionary of Idle not Submitted Glideins from condor_q
- Return type:
dict
- glideinwms.factory.glideFactoryLib.extractJobId(submit_out)[source]¶
Extracts the number of jobs and cluster id from a condor output.
- Parameters:
submit_out (list) – Condor output. Expects a list of str.
- Raises:
condorExe.ExeError – When it fails to apply a regular expression to a line of the output.
- Returns:
Number of jobs and cluster id.
- Return type:
tuple
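condor_submit reports its result with a line such as "5 job(s) submitted to cluster 1234.", so the number of jobs and the cluster id can be pulled out with a regular expression. A minimal sketch under that assumption about the output format; extract_job_id is a hypothetical stand-in, it scans every line rather than a specific one, and it raises ValueError where the documented function raises condorExe.ExeError.

```python
import re

# Assumed format of the relevant condor_submit output line,
# e.g. "5 job(s) submitted to cluster 1234."
_SUBMIT_RE = re.compile(r"(\d+) job\(s\) submitted to cluster (\d+)\.")


def extract_job_id(submit_out):
    """Return (number_of_jobs, cluster_id) parsed from condor_submit output lines."""
    for line in submit_out:
        match = _SUBMIT_RE.search(line)
        if match:
            return int(match.group(1)), int(match.group(2))
    # Stand-in for condorExe.ExeError in the documented function
    raise ValueError("job count and cluster id not found in condor_submit output")


print(extract_job_id(["Submitting job(s).....", "5 job(s) submitted to cluster 1234."]))
```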
- glideinwms.factory.glideFactoryLib.extractNonRunSimple(q, factoryConfig=None)[source]¶
All NOT Running Glideins: JobStatus != 2
- Parameters:
q – dictionary of Glideins from condor_q
factoryConfig (FactoryConfig) – Factory configuration (NOT USED, for interface)
- Returns:
dictionary of Not Running Glideins from condor_q
- Return type:
dict
- glideinwms.factory.glideFactoryLib.extractRecoverableHeldSimpleWithinLimits(q, factoryConfig=None)[source]¶
- glideinwms.factory.glideFactoryLib.extractRunSimple(q, factoryConfig=None)[source]¶
All Running Glideins: JobStatus == 2
- Parameters:
q – dictionary of Glideins from condor_q
factoryConfig (FactoryConfig) – Factory configuration (NOT USED, for interface)
- Returns:
dictionary of Running Glideins from condor_q
- Return type:
dict
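The *Simple extractors above all select Glideins by their HTCondor JobStatus (1 = Idle, 2 = Running, 5 = Held). A minimal sketch, assuming q is a plain dictionary mapping a (cluster, process) job id to a classad dictionary; the snake_case function names are hypothetical stand-ins and do not reproduce how the documented functions access condor_q data.

```python
# HTCondor JobStatus codes used by the filters above
IDLE, RUNNING, HELD = 1, 2, 5


def select_by_status(q, predicate):
    """Return the subset of q (job id -> classad dict) whose JobStatus satisfies predicate."""
    return {jid: ad for jid, ad in q.items() if predicate(ad["JobStatus"])}


def extract_idle_simple(q):
    return select_by_status(q, lambda status: status == IDLE)


def extract_run_simple(q):
    return select_by_status(q, lambda status: status == RUNNING)


def extract_non_run_simple(q):
    return select_by_status(q, lambda status: status != RUNNING)


def extract_held_simple(q):
    return select_by_status(q, lambda status: status == HELD)


# Hypothetical queue content: one idle, one running, one held Glidein
q = {(100, 0): {"JobStatus": 1}, (100, 1): {"JobStatus": 2}, (101, 0): {"JobStatus": 5}}
print(sorted(extract_non_run_simple(q)))
```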
- glideinwms.factory.glideFactoryLib.getCondorQCredentialList(factoryConfig=None)[source]¶
Returns a list of all currently used proxies based on the glideins in the queue.
- glideinwms.factory.glideFactoryLib.getCondorQData(entry_name, client_name, schedd_name, factoryConfig=None)[source]¶
Get Condor data, given the glidein name, to be passed to the main functions. If client_name=None, return all clients
- glideinwms.factory.glideFactoryLib.getCondorStatusData(entry_name, client_name, pool_name=None, factory_startd_attribute=None, glidein_startd_attribute=None, entry_startd_attribute=None, client_startd_attribute=None, factoryConfig=None)[source]¶
- glideinwms.factory.glideFactoryLib.getQClientNames(condorq, factoryConfig=None)[source]¶
Return a dictionary grouping the condorQ by client_names (factoryConfig.client_schedd_attribute)
- Parameters:
condorq – condor_q query object from condorMonitor.CondorQ
- Returns:
list of client names (factoryConfig.client_schedd_attribute)
- glideinwms.factory.glideFactoryLib.getQCredentials(condorq, client_name, creds, client_sa, cred_secclass_sa, cred_id_sa)[source]¶
Get the current queue status for a given client and credential (condorQ sub-query). v3 equivalent for getQProxySecClass
- Parameters:
condorq – condor_q query to select within
client_name (param) – client name (e.g. the Frontend requesting the jobs)
creds – credential object (to extract sec class and id)
client_sa – schedd attribute to compare the client name
cred_secclass_sa – schedd attribute to compare the security class
cred_id_sa – schedd attribute to compare the credential ID
- Returns:
sub-query with only the desired jobs
- glideinwms.factory.glideFactoryLib.getQProxSecClass(condorq, client_name, proxy_security_class, client_schedd_attribute=None, credential_secclass_schedd_attribute=None, factoryConfig=None)[source]¶
Get the current queue status for client and security class.
- glideinwms.factory.glideFactoryLib.getQStatus(condorq)[source]¶
Return a dictionary with detailed_jobStatus/numJobs, where detailed_jobStatus is returned by hash_status. Idle jobs may be 1001, 1002, or 1010 depending on the GridJobStatus. Running jobs may be 2, or 4010 if in stage-out. 1100 is used for unknown GridJobStatus
- glideinwms.factory.glideFactoryLib.getQStatusSF(condorq)[source]¶
Return a dictionary where keys are GlideinEntrySubmitFile(s) and values are jobStatus/numJobs dicts. NOTE: this does not have the same level of detail as getQStatus, e.g. Idle jobs are not split depending on GridJobStatus
- glideinwms.factory.glideFactoryLib.getQStatusStale(condorq)[source]¶
Return a dictionary with jobStatus, stale_info/numJobs, where stale_info is 1 if the status information is old
- glideinwms.factory.glideFactoryLib.get_submit_environment(entry_name, client_name, submit_credentials, client_web, params, idle_lifetime, log=None, factoryConfig=None)[source]¶
- glideinwms.factory.glideFactoryLib.isGlideinHeldNTimes(jobInfo, factoryConfig=None, n=20)[source]¶
This function looks at the glidein job’s information and returns whether the CondorG job has been held for more than N (default 20) iterations
This is useful to remove unrecoverable glideins (CondorG jobs) with the forcex option.
- Parameters:
jobInfo (dict) – Dictionary containing glidein job’s classad information
- Returns:
True if job is held more than N (default 20) iterations, False otherwise.
- Return type:
bool
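A minimal sketch of the held-count check, assuming the count comes from the HTCondor job attribute NumSystemHolds (the number of times the system put the job on hold); that attribute choice is an assumption, not confirmed by this documentation, and is_held_n_times is a hypothetical name.

```python
def is_held_n_times(job_info, n=20):
    """Return True if the job has been put on hold more than n times.

    Assumption: the hold count is read from NumSystemHolds;
    a missing attribute is treated as 0.
    """
    return job_info.get("NumSystemHolds", 0) > n


print(is_held_n_times({"NumSystemHolds": 21}))  # True
```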
- glideinwms.factory.glideFactoryLib.isGlideinUnrecoverable(jobInfo, factoryConfig=None, glideinDescript=None)[source]¶
This function looks at the glidein job’s information and returns whether the CondorG job is unrecoverable. Condor hold codes are available at: https://htcondor.readthedocs.io/en/v8_9_4/classad-attributes/job-classad-attributes.html
This is useful to change the status of a glidein (CondorG job) from held to idle.
In 3.6.2 the behavior of the function changed. Instead of having a list of unrecoverable codes in the function (which became outdated once gt was deprecated), we consider every code unrecoverable and give operators the possibility to specify a list of recoverable codes in the config.
- Parameters:
jobInfo (dict) – Dictionary containing glidein job’s classad information
- Returns:
True if job is unrecoverable, False if recoverable
- Return type:
bool
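The post-3.6.2 policy is "unrecoverable unless listed as recoverable". A minimal sketch of that rule, keyed on the HTCondor job attributes HoldReasonCode and HoldReasonSubCode; the shape of the operator list (a set of (code, subcode) pairs, with None matching any subcode) and the sample codes are hypothetical, not the Factory's configuration format.

```python
def is_unrecoverable(job_info, recoverable_codes):
    """Return True unless the job's hold reason is listed as recoverable.

    recoverable_codes: hypothetical set of (HoldReasonCode, HoldReasonSubCode)
    pairs; a subcode of None matches any subcode.
    """
    code = job_info.get("HoldReasonCode")
    subcode = job_info.get("HoldReasonSubCode")
    if (code, subcode) in recoverable_codes or (code, None) in recoverable_codes:
        return False
    # Default: every hold code not listed by the operators is unrecoverable
    return True


recoverable = {(12, None), (30, 2)}  # hypothetical operator configuration
print(is_unrecoverable({"HoldReasonCode": 30, "HoldReasonSubCode": 1}, recoverable))  # True
```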
- glideinwms.factory.glideFactoryLib.isGlideinWithinHeldLimits(jobInfo, factoryConfig=None)[source]¶
This function looks at the glidein job’s information and returns whether the CondorG job can be released.
This is useful to limit how often held jobs are released.
- Parameters:
jobInfo (dict) – Dictionary containing glidein job’s classad information
- Returns:
True if job is within limits, False if it is not
- Return type:
bool
- glideinwms.factory.glideFactoryLib.keepIdleGlideins(client_condorq, client_int_name, req_min_idle, req_max_glideins, idle_lifetime, remove_excess, submit_credentials, glidein_totals, frontend_name, client_web, params, log=None, factoryConfig=None)[source]¶
Looks at the status of the queue and determines how many glideins to submit. Returns the number of newly submitted glideins.
If the system is unable to submit glideins because it has reached one of the limits (request, entry, frontend:security_class), and the frontend asks for removal (RemoveExcess) in the request, it will try to remove excess glideins.
- Parameters:
client_condorq (CondorQ) – Condor queue filtered by security class
client_int_name (str) – internal representation of the client name
req_min_idle (int) – min number of idle glideins needed from the frontend request
req_max_glideins (int) – max number of running glideins allowed in the frontend request
idle_lifetime (int) – how much to wait before removing glideins that are idle
remove_excess (tuple) – remove_excess_str (NO, WAIT, IDLE, ALL), remove_excess_margin
submit_credentials (SubmitCredentials) – all the information needed to submit the glideins
glidein_totals (GlideinTotals) – entry and frontend glidein counts
frontend_name (str) – frontend name, used to map frontend totals in glidein_totals (“frontend:sec_class”)
client_web (glideFactoryLib.ClientWeb) – client web values
params (dict) – params from the entry configuration or frontend to be passed to the glideins
log (logger) – factory logger
factoryConfig – factory configuration
- Raises:
condorExe.ExeError – in case of issues executing condor commands
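One way to picture the sizing decision: the shortfall of idle Glideins with respect to req_min_idle, capped by the room left under req_max_glideins and under the entry and Frontend limits. This is a hypothetical sketch of that arithmetic only; the function, its parameters, and the precomputed free_slots headroom are assumptions, and it omits the removal, credential, and submission logic of the documented function.

```python
def glideins_to_submit(req_min_idle, req_max_glideins, idle, total, free_slots):
    """Hypothetical sizing rule: how many new Glideins to request.

    idle: Glideins currently idle for this client
    total: all Glideins currently in the queue for this client
    free_slots: remaining headroom under entry/Frontend limits (assumed precomputed)
    """
    wanted_idle = max(req_min_idle - idle, 0)       # shortfall of idle Glideins
    max_headroom = max(req_max_glideins - total, 0)  # room left in the request
    return min(wanted_idle, max_headroom, max(free_slots, 0))


print(glideins_to_submit(10, 100, 4, 50, 20))  # 6
```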
- glideinwms.factory.glideFactoryLib.logStats(condorq, client_int_name, client_security_name, proxy_security_class, log=None, factoryConfig=None)[source]¶
Add the current schedd statistics of this entry (from condor_q on the Factory) to the values already stored in factoryConfig.client_stats and factoryConfig.qc_stats
- Parameters:
condorq – condorQ object, containing a list of all jobs of the schedd (.data) for the entry invoking this
client_int_name – client name (from the requestor/Frontend)
client_security_name – security name used by the client
proxy_security_class – credential security class used by the client
log – to log errors/info/…
factoryConfig – common data block for the entry to get schedd statistics (client_stats, qc_stats)
- glideinwms.factory.glideFactoryLib.logStatsAll(condorq, log=None, factoryConfig=None)[source]¶
Add the current schedd statistics of this entry (from condor_q on the Factory) to the values already stored in factoryConfig.client_stats and factoryConfig.qc_stats. This is done for all the clients that have jobs on this entry
- Parameters:
condorq – condorQ object, containing a list of all jobs of the schedd (.data) for the entry invoking this
log – to log errors/info/…
factoryConfig – common data block for the entry to get schedd statistics (client_stats, qc_stats)
- glideinwms.factory.glideFactoryLib.logWorkRequest(client_int_name, client_security_name, proxy_security_class, req_idle, req_max_run, remove_excess, work_el, fraction=1.0, log=None, factoryConfig=None)[source]¶
Logs work requests
- Parameters:
client_int_name – client internal name
client_security_name – client security name
proxy_security_class – security ID
req_idle – requested idle glideins
req_max_run – max running glideins
remove_excess – tuple, remove_excess_str, remove_excess_margin
work_el – Work requests, temporary workaround; the requests should always be processed at the caller level
fraction – fraction for this entry
log
factoryConfig
- glideinwms.factory.glideFactoryLib.releaseGlideins(schedd_name, jid_list, log=None, factoryConfig=None)[source]¶
Release the glideins in the list
We assume that the gfactory is a condor superuser or the only user owning jobs (Glideins), and thus it does not need identity switching to release jobs
- Parameters:
schedd_name
jid_list
log
factoryConfig
Returns:
- glideinwms.factory.glideFactoryLib.removeGlideins(schedd_name, jid_list, force=False, log=None, factoryConfig=None)[source]¶
Remove the Glideins in the list
We assume that the gfactory is a condor superuser or the only user owning jobs (Glideins), and thus it does not need identity switching to remove jobs
- Parameters:
schedd_name (str) – HTCSS schedd name
jid_list
force
log
factoryConfig
- Returns:
None
- glideinwms.factory.glideFactoryLib.secClass2Name(client_security_name, proxy_security_class)[source]¶
- glideinwms.factory.glideFactoryLib.submitGlideins(entry_name, client_name, nr_glideins, idle_lifetime, frontend_name, submit_credentials, client_web, params, status_sf, log=None, factoryConfig=None)[source]¶
Submit the Glideins. Calls executeSubmit to run the HTCSS command
- Parameters:
entry_name (str)
client_name (str) – client (e.g. Frontend group) name
nr_glideins (int)
idle_lifetime (int)
frontend_name (str)
submit_credentials (dict)
client_web (str) – None means client did not pass one, backwards compatibility
params
status_sf (dict) – keys are GlideinEntrySubmitFile(s) and values are jobStatus/numJobs dicts
log
factoryConfig
- glideinwms.factory.glideFactoryLib.sum_idle_count(qc_status)[source]¶
Add the summary of idle jobs to the statistics passed as input
- Parameters:
qc_status – Query count summary with job_status/number_of_jobs
- Returns:
Adds qc_status[1] to qc_status
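Following the hash_status convention documented above (a detailed status of the form 1xxx implies JobStatus 1, Idle), the idle summary can be built by adding up every 1xxx key. A minimal sketch under that assumption; the sample counts are made up.

```python
def sum_idle_count(qc_status):
    """Store in qc_status[1] the total of all idle sub-states (keys 1000-1999).

    Sketch only: relies on the 1xxx -> Idle convention; other keys are untouched.
    """
    qc_status[1] = sum(count for status, count in qc_status.items() if 1000 <= status < 2000)
    return qc_status


# Hypothetical counts: three idle sub-states and some running jobs
qc = {1001: 3, 1002: 2, 1100: 1, 2: 5}
print(sum_idle_count(qc)[1])  # 6
```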
- glideinwms.factory.glideFactoryLib.update_x509_proxy_file(entry_name, username, client_id, proxy_data, factoryConfig=None)[source]¶
Create/update the proxy file
It is simply a safe update of the file w/ the new proxy data if different from the current data. The data is written in a binary proxy file. The path of the proxy file is: f”{factoryConfig.client_proxies_base_dir}/user_{username}/glidein_{factoryConfig.glidein_name}/entry_{entry_name}” with name “x509_%s.proxy” % escapeParam(client_id)
Proxy DN and VOMS extension are extracted but never used (?!)
- Parameters:
entry_name (str) – entry name
username (str) – user name (identity of the Frontend)
client_id (str) – client ID
proxy_data (bytes) – Proxy data in PEM format
factoryConfig – Factory configuration
- Returns:
Proxy file full path
- Return type:
str
glideinwms.factory.glideFactoryLogParser module¶
This module implements classes to track changes in glidein status logs
- class glideinwms.factory.glideFactoryLogParser.dirSummarySimple(obj)[source]¶
Bases:
object
dirSummary Simple
For now it is just a constructor wrapper. Further on, it will need to implement glidein exit code checks
- class glideinwms.factory.glideFactoryLogParser.dirSummaryTimingsOut(dirname, cache_dir, client_name, user_name, inactive_files=None, inactive_timeout=86400)[source]¶
Bases:
cacheDirClass
This class uses a lambda function to initialize an instance of cacheDirClass. The function chooses all condor_activity files in a directory that correspond to a particular client.
- class glideinwms.factory.glideFactoryLogParser.dirSummaryTimingsOutFull(dirname, cache_dir, inactive_files=None, inactive_timeout=86400)[source]¶
Bases:
cacheDirClass
This class uses a lambda function to initialize an instance of cacheDirClass. The function chooses all condor_activity files in a directory regardless of client name.
- glideinwms.factory.glideFactoryLogParser.extractLogData(fname)[source]¶
Given a filename of a job file “path/job.NUMBER.out” extract the statistics of the job duration, etc.
@param fname: Filename to extract @return: a dictionary with keys:
glidein_duration - integer, how long did the glidein run
validation_duration - integer, how long before starting condor
condor_started - Boolean, did condor even start (if false, no further entries)
condor_duration - integer, how long did Condor run
stats - dictionary of stats (as in KNOWN_SLOT_STATS), each having
jobsnr - integer, number of jobs started
secs - integer, total number of seconds used
For example {'glidein_duration': 20305, 'validation_duration': 6, 'condor_started': 1, 'condor_duration': 20298, 'stats': {'badSignal': {'secs': 0, 'jobsnr': 0}, 'goodZ': {'secs': 19481, 'jobsnr': 1}, 'Total': {'secs': 19481, 'jobsnr': 1}, 'goodNZ': {'secs': 0, 'jobsnr': 0}, 'badOther': {'secs': 0, 'jobsnr': 0}}}
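In the example output, the 'Total' entry of 'stats' equals the sum of the other categories. A minimal sketch that checks this relationship, assuming 'Total' is always such a sum (true for the example, but not stated by the documentation); total_from_categories is a hypothetical helper.

```python
def total_from_categories(stats):
    """Sum secs and jobsnr over every category except 'Total'.

    Assumption: 'Total' aggregates the other categories.
    """
    total = {"secs": 0, "jobsnr": 0}
    for name, values in stats.items():
        if name == "Total":
            continue
        total["secs"] += values["secs"]
        total["jobsnr"] += values["jobsnr"]
    return total


# The example returned value documented above
example = {
    "glidein_duration": 20305, "validation_duration": 6,
    "condor_started": 1, "condor_duration": 20298,
    "stats": {
        "badSignal": {"secs": 0, "jobsnr": 0},
        "goodZ": {"secs": 19481, "jobsnr": 1},
        "Total": {"secs": 19481, "jobsnr": 1},
        "goodNZ": {"secs": 0, "jobsnr": 0},
        "badOther": {"secs": 0, "jobsnr": 0},
    },
}
print(total_from_categories(example["stats"]) == example["stats"]["Total"])  # True
```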
- class glideinwms.factory.glideFactoryLogParser.logSummaryTimingsOut(logname, cache_dir, username)[source]¶
Bases:
logSummaryTimings
Class logSummaryTimingsOut logs timing and status of a job. It declares a job complete only after the output file has been received. The format is slightly different from the one of logSummaryTimings: we add the dirname in the job id. When an output file is found, it adds a 4th parameter to the completed jobs. See extractLogData for more details
- diff(other)[source]¶
Diff self.data with other for use in comparing current iteration data with previous iteration.
Uses diff_raw to perform a symmetric difference of self.data and other and puts it into data[status]['Entered'|'Exited']. Completed jobs are augmented with data from the log
@return: data[status]['Entered'|'Exited'] - list of jobs
- diff_raw(other)[source]¶
Diff self.data with other info, add glidein log data to Entered/Exited. Used to compare current data with previous iteration.
Uses symmetric difference of sets to compare the two dictionaries.
@type other: dictionary of statuses -> jobs @return: data[status]['Entered'|'Exited'] - list of jobs
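The symmetric difference of the two job sets for a status splits naturally into jobs that entered (present only in the current data) and jobs that exited (present only in the previous data). A minimal sketch of that comparison on plain {status: [job ids]} dictionaries; diff_snapshots is a hypothetical name and it does not add the glidein log data that the documented method attaches.

```python
def diff_snapshots(current, previous):
    """Compare two {status: [job ids]} snapshots.

    For each status, 'Entered' lists jobs present now but not before and
    'Exited' lists jobs present before but not now (sorted for reproducibility).
    """
    result = {}
    for status in set(current) | set(previous):
        now = set(current.get(status, []))
        before = set(previous.get(status, []))
        result[status] = {
            "Entered": sorted(now - before),
            "Exited": sorted(before - now),
        }
    return result


# Hypothetical snapshots from two consecutive iterations
current = {"Running": ["1.0", "2.0"], "Completed": ["3.0"]}
previous = {"Running": ["1.0", "3.0"], "Idle": ["2.0"]}
print(diff_snapshots(current, previous)["Running"])  # {'Entered': ['2.0'], 'Exited': ['3.0']}
```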
- loadFromLog()[source]¶
This class inherits from cachedLogClass. So, load() will first check the cached files. If changed, it will call this function. This uses the condorLogParser to load the log, then does some post-processing to check the job.NUMBER.out files to see if the job has finished and to extract some data.
glideinwms.factory.glideFactoryMonitorAggregator module¶
This module implements the functions needed to aggregate the monitoring of the glidein factory
- class glideinwms.factory.glideFactoryMonitorAggregator.MonitorAggregatorConfig[source]¶
Bases:
object
- glideinwms.factory.glideFactoryMonitorAggregator.aggregateJobsSummary()[source]¶
Loads the job summary pickle files for each entry, aggregates them per schedd/collector pair, and returns them.
:return: A dictionary containing the needed information that looks like:
{
    ('schedd_name', 'collector_name'): {
        '2994.000': {'condor_duration': 1328, 'glidein_duration': 1334, 'condor_started': 1, 'numjobs': 0},
        '2997.000': {'condor_duration': 1328, 'glidein_duration': 1334, 'condor_started': 1, 'numjobs': 0},
        …
    },
    ('schedd_name', 'collector_name'): {
        '2003.000': {'condor_duration': 1328, 'glidein_duration': 1334, 'condor_started': 1, 'numjobs': 0},
        '206.000': {'condor_duration': 1328, 'glidein_duration': 1334, 'condor_started': 1, 'numjobs': 0},
        …
    }
}
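The aggregation step amounts to merging the per-entry job dictionaries under their (schedd_name, collector_name) key. A minimal sketch, assuming each entry's summary has already been loaded into a dictionary with the layout of the return value above; loading the pickle files is omitted and merge_job_summaries is a hypothetical name.

```python
def merge_job_summaries(per_entry_summaries):
    """Merge per-entry job summaries into one dict keyed by (schedd, collector).

    per_entry_summaries: iterable of {(schedd_name, collector_name): {job_id: info}}.
    The input dictionaries are not modified.
    """
    merged = {}
    for summary in per_entry_summaries:
        for key, jobs in summary.items():
            merged.setdefault(key, {}).update(jobs)
    return merged


# Hypothetical summaries from two entries sharing one schedd/collector pair
entry_a = {("schedd1", "coll1"): {"2994.000": {"numjobs": 0}}}
entry_b = {("schedd1", "coll1"): {"2997.000": {"numjobs": 1}},
           ("schedd2", "coll1"): {"206.000": {"numjobs": 0}}}
merged = merge_job_summaries([entry_a, entry_b])
print(sorted(merged[("schedd1", "coll1")]))  # ['2994.000', '2997.000']
```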
- glideinwms.factory.glideFactoryMonitorAggregator.aggregateLogSummary()[source]¶
Create an aggregate of log summary files, write it in an aggregate log summary file and in the end return the values
- glideinwms.factory.glideFactoryMonitorAggregator.aggregateRRDStats(log=None)[source]¶
Create an aggregate of RRD stats and write it to files
- Parameters:
log (logging.Logger) – logger to use
- glideinwms.factory.glideFactoryMonitorAggregator.aggregateStatus(in_downtime)[source]¶
Create an aggregate of status files, write it in an aggregate status file and in the end return the values
@type in_downtime: boolean @param in_downtime: Entry downtime information
@rtype: dict @return: Dictionary of status information
- glideinwms.factory.glideFactoryMonitorAggregator.verifyRRD(fix_rrd=False, backup=True)[source]¶
Go through all known monitoring rrds and verify that they match existing schema (could be different if an upgrade happened) If fix_rrd is true, then also attempt to add any missing attributes.
- Parameters:
fix_rrd (bool) – if True, will attempt to add missing attrs
backup (bool) – if True, backup the old RRD before fixing
- Returns:
True if all OK, False if there is a problem w/ RRD files
- Return type:
bool
glideinwms.factory.glideFactoryMonitoring module¶
This module implements the functions needed to monitor the glidein factory
- class glideinwms.factory.glideFactoryMonitoring.Descript2XML(log=None)[source]¶
Bases:
object
Create an XML file out of glidein.descript, frontend.descript, entry.descript, attributes.cfg, and params.cfg. TODO: The XML is used by … “the monitoring page”? The file created is descript.xml, with glideFactoryDescript and glideFactoryEntryDescript elements
- class glideinwms.factory.glideFactoryMonitoring.FactoryStatusData(log=None, base_dir=None)[source]¶
Bases:
object
This class handles the data obtained from the rrd files
- fetchData(rrd_file, pathway, res, start, end)[source]¶
Uses rrdtool to fetch data from the clients. Returns a dictionary of lists of data, with one list for each element.
rrdtool fetch returns three items: a[0], a[1], and a[2]. a[0] holds the start time, end time, and resolution (step), which can be specified as arguments of fetchData. a[1] holds the names of the datasets; these names are used as the keys. a[2] is a list of tuples; each tuple contains the data from every dataset, and there is one tuple for each time data was collected.
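The tuple layout described above can be turned into the "dictionary of lists" form with a few lines of Python. This is a minimal sketch, not the factory's fetchData: the helper name fetch_result_to_dict is hypothetical, and a hand-written tuple with the shape returned by the rrdtool Python binding's fetch() stands in for a real RRD file, so the example runs without rrdtool installed.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Shape returned by the rrdtool Python binding's fetch():
# ((start, end, step), (ds_name, ...), [(value, ...), ...]); missing values are None
FetchResult = Tuple[Tuple[int, int, int], Sequence[str], List[Sequence[Optional[float]]]]


def fetch_result_to_dict(result: FetchResult) -> Dict[str, List[Optional[float]]]:
    """Hypothetical helper: map each dataset name to its list of values over time."""
    (_start, _end, _step), ds_names, rows = result
    data: Dict[str, List[Optional[float]]] = {name: [] for name in ds_names}
    for row in rows:
        # Each row holds one value per dataset, in the same order as ds_names
        for name, value in zip(ds_names, row):
            data[name].append(value)
    return data


# Hand-made stand-in for rrdtool.fetch(...) output: two datasets, two time steps
sample = ((1000, 1600, 300), ("Idle", "Running"), [(4.0, 10.0), (None, 12.0)])
print(fetch_result_to_dict(sample))  # {'Idle': [4.0, None], 'Running': [10.0, 12.0]}
```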
- getData(input_val, monitoringConfig=None)[source]¶
Return the data fetched by rrdtool as a dictionary
This also modifies the RRD data dictionary for the client (input_val) in all RRD files and appends the client to the list of frontends.
This side effect is used in two places: Entry.writeStats, where the totals are updated (writing the XML), and check_and_perform_work, for the frontend data.
- getXMLData(rrd)[source]¶
Return an XML-formatted string for the specific RRD file, with the data fetched from a given site (all clients + total).
This also has side effects through the getData(self.total) invocation: it modifies the RRD data dictionary (all RRDs) for the total of this entry, and it appends the total (self.total, aka ‘total/’) to the list of clients (frontends).
@param rrd: @return: XML formatted string with stats data
- class glideinwms.factory.glideFactoryMonitoring.MonitoringConfig(log=None)[source]¶
Bases:
object
- logCompleted(client_name, entered_dict)[source]¶
This function takes all newly completed glideins and logs them in logs/entry_Name/completed_jobs_date.log in an XML-like format.
It counts the jobs completed on a glidein but does not keep track of the cores received or used by the jobs.
@type client_name: String @param client_name: the name of the frontend client @type entered_dict: Dictionary of dictionaries @param entered_dict: This is the dictionary of all jobs that have “Entered” the “Completed” state. It is indexed by job_id. Each value is an info dictionary containing the keys: username, jobs_duration (subkeys: total, goodput, terminated), wastemill (subkeys: validation, idle, nosuccess, badput), duration, condor_started, condor_duration, jobsnr
- rrd_obj¶
The name of the attribute that identifies the glidein
- write_completed_json(relative_fname, time, val_dict)[source]¶
Write val_dict to a JSON file, creating the file if needed
- Parameters:
relative_fname – location of the JSON file relative to self.monitor_dir
time – typically self.updated
val_dict – dictionary object to be dumped to the file
- write_file(relative_fname, output_str)[source]¶
Write out a string or bytes to a file
- Parameters:
relative_fname (AnyStr) – The relative path name to write out
output_str (AnyStr) – the string (unicode str or bytes) to write to the file
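Because output_str may be either str or bytes (AnyStr), the file has to be opened in the matching mode. A minimal sketch of that dispatch, assuming a plain write: the standalone function and its monitor_dir argument are hypothetical stand-ins for the method and self.monitor_dir, and the real implementation may handle temporary or existing files differently.

```python
import os
import tempfile
from typing import AnyStr


def write_file(monitor_dir: str, relative_fname: str, output_str: AnyStr) -> str:
    """Hypothetical stand-in: write str or bytes under monitor_dir and return the full path."""
    fname = os.path.join(monitor_dir, relative_fname)
    if isinstance(output_str, bytes):
        # bytes must be written in binary mode
        with open(fname, "wb") as fd:
            fd.write(output_str)
    else:
        # unicode text is encoded as UTF-8 (an assumption of this sketch)
        with open(fname, "w", encoding="utf-8") as fd:
            fd.write(output_str)
    return fname


with tempfile.TemporaryDirectory() as tmp_dir:
    text_path = write_file(tmp_dir, "status.xml", "<status/>")
    with open(text_path, encoding="utf-8") as fd:
        print(fd.read())  # <status/>
```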
- write_rrd_multi(relative_fname, ds_type, time, val_dict, min_val=None, max_val=None)[source]¶
Create an RRD file using rrdtool.
- write_rrd_multi_hetero(relative_fname, ds_desc_dict, time, val_dict)[source]¶
Create an RRD file using rrdtool. Like write_rrd_multi, but each data source (ds) has its own specified type. Each element of ds_desc_dict is a dictionary with any of ds_type, min, and max; if ds_desc_dict[name] is not present, the defaults are {‘ds_type’:’GAUGE’, ‘min’:’U’, ‘max’:’U’}.
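rrdtool declares each data source with a `DS:name:type:heartbeat:min:max` string, so the per-ds defaults above amount to a dictionary merge. A minimal sketch, assuming the ds names are passed in explicitly; build_ds_definitions and the heartbeat value of 600 are illustrative assumptions, not the factory's code.

```python
from typing import Dict, List

# Defaults quoted in the write_rrd_multi_hetero description
DEFAULT_DS_DESC = {"ds_type": "GAUGE", "min": "U", "max": "U"}


def build_ds_definitions(ds_names: List[str], ds_desc_dict: Dict[str, dict], heartbeat: int) -> List[str]:
    """Hypothetical helper: one rrdtool 'DS:name:type:heartbeat:min:max' string per data source."""
    definitions = []
    for name in ds_names:
        # Start from a copy of the defaults, then override with the keys given for this ds
        desc = dict(DEFAULT_DS_DESC)
        desc.update(ds_desc_dict.get(name, {}))
        definitions.append(f"DS:{name}:{desc['ds_type']}:{heartbeat}:{desc['min']}:{desc['max']}")
    return definitions


print(build_ds_definitions(["Idle", "Lasted"], {"Lasted": {"ds_type": "ABSOLUTE", "min": 0}}, heartbeat=600))
# ['DS:Idle:GAUGE:600:U:U', 'DS:Lasted:ABSOLUTE:600:0:U']
```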
- class glideinwms.factory.glideFactoryMonitoring.condorLogSummary(log=None)[source]¶
Bases:
object
This class handles the data obtained from parsing the glidein log files
- aggregate_frontend_data(updated, diff_summary)[source]¶
This goes into each frontend in the current entry and aggregates the completed/stats/wastetime data into completed_data.json at the entry level
- computeDiff()[source]¶
This function takes the current_stats_data from the current iteration and the old_stats_data from the last iteration (see reset() function) to create a diff of the data in the stats_diff dictionary.
This stats_diff will be a dictionary with two entries for each status, “Entered” and “Exited”, denoting which job ids have recently changed status, e.g. stats_diff[frontend][username:client_int_name][“Completed”][“Entered”]
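The Entered/Exited bookkeeping can be expressed as set differences between two iterations. A minimal sketch for a single stats_diff[frontend][username:client_int_name] slice, assuming each status maps to a set of job ids (the real data carries per-job info dictionaries); compute_status_diff is a hypothetical name.

```python
from typing import Dict, Set


def compute_status_diff(old: Dict[str, Set[str]], new: Dict[str, Set[str]]) -> Dict[str, Dict[str, Set[str]]]:
    """Hypothetical helper: per status, the job ids that Entered or Exited it since the last iteration."""
    diff: Dict[str, Dict[str, Set[str]]] = {}
    for status in set(old) | set(new):
        before = old.get(status, set())
        after = new.get(status, set())
        # Entered: present now but not before; Exited: present before but not now
        diff[status] = {"Entered": after - before, "Exited": before - after}
    return diff


old_stats = {"Running": {"2994.000", "2997.000"}}
current_stats = {"Running": {"2997.000"}, "Completed": {"2994.000"}}
diff = compute_status_diff(old_stats, current_stats)
print(diff["Completed"]["Entered"])  # {'2994.000'}
print(diff["Running"]["Exited"])  # {'2994.000'}
```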
- get_data_summary()[source]¶
Summarizes stats_diff data (computeDiff should have already been called). Sums over username in the dictionary stats_diff[frontend][username][entered/exited][status] to make stats_data[client_name][entered/exited][status]=count
@return: dictionary[client_name][entered/exited][status]=count
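Summing away the username level can be sketched as follows. This assumes the nesting given for computeDiff (status before Entered/Exited) and collections of job ids as the leaves; summarize_diff is a hypothetical name and the sample data is made up.

```python
from typing import Collection, Dict

# stats_diff[client_name][username:client_int_name][status][Entered|Exited] -> job ids
StatsDiff = Dict[str, Dict[str, Dict[str, Dict[str, Collection[str]]]]]


def summarize_diff(stats_diff: StatsDiff) -> Dict[str, Dict[str, Dict[str, int]]]:
    """Hypothetical helper: result[client_name][Entered|Exited][status] = job count."""
    summary: Dict[str, Dict[str, Dict[str, int]]] = {}
    for client_name, per_user in stats_diff.items():
        counts: Dict[str, Dict[str, int]] = {"Entered": {}, "Exited": {}}
        for statuses in per_user.values():  # the username level is summed away here
            for status, changes in statuses.items():
                for direction in ("Entered", "Exited"):
                    n = len(changes.get(direction, ()))
                    counts[direction][status] = counts[direction].get(status, 0) + n
        summary[client_name] = counts
    return summary


stats_diff = {
    "fe1": {
        "alice:main": {"Completed": {"Entered": ["1.0", "2.0"], "Exited": []}},
        "bob:main": {
            "Completed": {"Entered": ["3.0"], "Exited": []},
            "Running": {"Entered": [], "Exited": ["3.0"]},
        },
    }
}
print(summarize_diff(stats_diff))
# {'fe1': {'Entered': {'Completed': 3, 'Running': 0}, 'Exited': {'Completed': 0, 'Running': 1}}}
```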
- get_diff_summary()[source]¶
Flattens stats_diff differential data.
@return: Dictionary of client_name with sub_keys Wait,Idle,Running,Held,Completed,Removed
- get_stats_data_summary()[source]¶
Summarizes current_stats_data: Adds up current_stats_data[frontend][user:client][status] across all username keys.
@return: dictionary stats_data[frontend][status]=count
- logSummary(client_name, stats)[source]¶
log_stats taken during an iteration of perform_work are added/merged into the condorLogSummary class here.
@type stats: dictionary of glideFactoryLogParser.dirSummaryTimingsOut @param stats: Dictionary keyed by “username:client_int_name”; client_int_name is needed for frontends with multiple groups
- reset()[source]¶
Replaces old_stats_data with current_stats_data and sets current_stats_data to empty. This is called every iteration in order to later compare the previous iteration with the current one and find any newly changed jobs (i.e. newly completed jobs)
- write_job_info(scheddName, collectorName)[source]¶
The method iterates over the stats_diff dictionary looking for completed jobs and then fills out a dictionary that contains the monitoring information needed for each job. The information looks like:
{
    ‘schedd_name’: ‘name’,
    ‘collector_name’: ‘name’,
    ‘joblist’: {
        ‘2994.000’: {‘condor_duration’: 1328, ‘glidein_duration’: 1334, ‘condor_started’: 1, ‘numjobs’: 0},
        ‘2997.000’: {‘condor_duration’: 1328, ‘glidein_duration’: 1334, ‘condor_started’: 1, ‘numjobs’: 0},
        …
    }
}
- Parameters:
scheddName – The schedd name to update the job
collectorName – The collector name to update the job
- class glideinwms.factory.glideFactoryMonitoring.condorQStats(log=None, cores=1)[source]¶
Bases:
object
- aggregateStates(qc_status, el)[source]¶
For each status in the condor_q status count dictionary (qc_status), add the count to the el dictionary, whose keys are state names like ‘Idle’ instead of their numeric codes (e.g. 1)
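The code-to-name translation can be sketched with a lookup table. This sketch assumes the standard HTCondor JobStatus codes (1 = Idle, 2 = Running, and so on); the factory's own table may use different names or additional pseudo-states, and aggregate_states is a hypothetical standalone version of the method.

```python
from typing import Dict

# Standard HTCondor JobStatus codes (assumption: the factory's table may differ)
JOB_STATUS_NAMES = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "TransferringOutput",
    7: "Suspended",
}


def aggregate_states(qc_status: Dict[int, int], el: Dict[str, int]) -> None:
    """Hypothetical helper: add the per-code counts in qc_status to el, keyed by state name (in place)."""
    for code, count in qc_status.items():
        name = JOB_STATUS_NAMES.get(code, f"Unknown({code})")
        el[name] = el.get(name, 0) + count


el = {"Idle": 2}
aggregate_states({1: 3, 2: 5}, el)
print(el)  # {'Idle': 5, 'Running': 5}
```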
- static getEntryFromSubmitFile(submitFile)[source]¶
Extract the entry name from submit files that look like: ‘entry_T2_CH_CERN/job.CMSHTPC_T2_CH_CERN_ce301.condor’
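A regular expression is one way to pull the name out of such a path. This is a hedged sketch: it assumes the wanted value is the sub-entry name between `job.` and `.condor` (here CMSHTPC_T2_CH_CERN_ce301), not the entry-set directory, and it returns an empty string for paths that do not match; entry_from_submit_file is a hypothetical name.

```python
import re

# Assumed layout: "entry_<entry set>/job.<entry name>.condor"
_SUBMIT_FILE_RE = re.compile(r"^entry_[^/]+/job\.(?P<entry>[^/]+)\.condor$")


def entry_from_submit_file(submit_file: str) -> str:
    """Hypothetical helper: entry name embedded in a submit file path, or '' if it does not match."""
    match = _SUBMIT_FILE_RE.match(submit_file)
    return match.group("entry") if match else ""


print(entry_from_submit_file("entry_T2_CH_CERN/job.CMSHTPC_T2_CH_CERN_ce301.condor"))
# CMSHTPC_T2_CH_CERN_ce301
```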
- static get_xml_data(data, indent_tab=' ', leading_tab='')[source]¶
Return a string with the XML formatted statistic data @param data: self.get_data() @param indent_tab: indentation space @param leading_tab: leading space @return: XML string
- static get_xml_total(total, indent_tab=' ', leading_tab='')[source]¶
Return formatted XML for the total statistics @param total: self.get_total() @param indent_tab: indentation space @param leading_tab: leading space @return: XML string
- get_zero_data_element()[source]¶
Return a dictionary with the keys defined in self.attributes and all values set to 0
- Returns:
data element w/ all 0 values
- logClientMonitor(client_name, client_monitor, client_internals, fraction=1.0)[source]¶
client_monitor is a dictionary of monitoring info (GlideinMonitor… from the glideclient ClassAd); client_internals is a dictionary of internals (from the glideclient ClassAd). If fraction is specified, it will be used to extract partial info.
- At the moment, it looks only for
‘Idle’, ‘Running’, ‘RunningHere’, ‘GlideinsIdle’, ‘GlideinsIdleCores’, ‘GlideinsRunning’, ‘GlideinsRunningCores’, ‘GlideinsTotal’, ‘GlideinsTotalCores’, ‘LastHeardFrom’
updates go in self.data (self.data[client_name][‘ClientMonitor’])
- logRequest(client_name, requests)[source]¶
requests is a dictionary of requests; params is a dictionary of parameters
- At the moment, it looks only for
‘IdleGlideins’, ‘MaxGlideins’
The request contains only these values (no real core information); the cores are evaluated using GLIDEIN_CPUS
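Since the request carries glidein counts only, a core figure has to be derived by multiplying by GLIDEIN_CPUS. A minimal sketch, assuming a fixed number of cores per glidein (possibly what the class's cores constructor argument holds); the function name and the Cores-suffixed output keys are hypothetical.

```python
from typing import Dict


def requests_with_cores(requests: Dict[str, int], glidein_cpus: int) -> Dict[str, int]:
    """Hypothetical helper: keep the two request keys and add core estimates."""
    out: Dict[str, int] = {}
    for key in ("IdleGlideins", "MaxGlideins"):  # the only keys that are looked at
        if key in requests:
            out[key] = requests[key]
            # No real core info in the request: assume glidein_cpus cores per glidein
            out[key + "Cores"] = requests[key] * glidein_cpus
    return out


print(requests_with_cores({"IdleGlideins": 10, "MaxGlideins": 50, "Other": 1}, glidein_cpus=8))
# {'IdleGlideins': 10, 'IdleGlideinsCores': 80, 'MaxGlideins': 50, 'MaxGlideinsCores': 400}
```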
- logSchedd(client_name, qc_status, qc_status_sf)[source]¶
Create or update a dictionary with aggregated HTCondor stats
client_name is the client requesting the glideins; qc_status is a dictionary of condor_status:nr_jobs; qc_status_sf is a dictionary of submit_file:qc_status. OUTPUT: self.data[client_name][‘Status’] is the status for all glideins
self.data[client_name][‘StatusEntries’] is the glidein status by entry
glideinwms.factory.glideFactoryPidLib module¶
- class glideinwms.factory.glideFactoryPidLib.EntryGroupPidSupport(startup_dir, group_name)[source]¶
Bases:
PidWParentSupport
- class glideinwms.factory.glideFactoryPidLib.EntryPidSupport(startup_dir, entry_name)[source]¶
Bases:
PidWParentSupport
- class glideinwms.factory.glideFactoryPidLib.FactoryPidSupport(startup_dir)[source]¶
Bases:
PidSupport
glideinwms.factory.glideFactorySelectionAlgorithms module¶
- glideinwms.factory.glideFactorySelectionAlgorithms.selectionAlgoDefault(submit_files, status_sf, jobDescript, nr_glideins, log)[source]¶
Given the list of sub entries (aka submit files) and the status of each sub entry (how many are idle, running, etc.), figure out how many glideins to submit for each sub entry: 1) shuffle the submit_files list; 2) try to “depth-wise” fill all the sub entries until limits are reached
@type submit_files: list @param submit_files: list of strings containing the names of the submit files for this entry set @type status_sf: dict @param status_sf: dictionary where the keys are the submit files and the values are condor states dicts @type jobDescript: object @param jobDescript: the maximum number of idle/total glideins for each sub entry is read from here @type nr_glideins: int @param nr_glideins: total number of glideins to submit to all the entries @type log: object @param log: logging object
Return a dictionary where the keys are the submit files and the values are ints indicating how many glideins to submit
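The two steps can be sketched as below. This is an illustration under stated assumptions, not the factory's algorithm: the per-sub-entry limits are passed as plain max_idle/max_total integers instead of being read from jobDescript, status_sf values are assumed to be keyed by HTCondor JobStatus code (1 = Idle, 2 = Running), the total counts idle plus running glideins only, and select_depth_wise is a hypothetical name.

```python
import random
from typing import Dict, List, Optional

IDLE, RUNNING = 1, 2  # HTCondor JobStatus codes (assumed keys of the status_sf values)


def select_depth_wise(submit_files: List[str], status_sf: Dict[str, Dict[int, int]],
                      max_idle: int, max_total: int, nr_glideins: int,
                      rng: Optional[random.Random] = None) -> Dict[str, int]:
    """Hypothetical sketch: fill each sub entry up to its limits before moving to the next."""
    rng = rng or random.Random()
    order = list(submit_files)
    rng.shuffle(order)  # step 1: random order, so no sub entry is always served first
    to_submit = {sf: 0 for sf in submit_files}
    remaining = nr_glideins
    for sf in order:  # step 2: depth-wise fill
        if remaining <= 0:
            break
        states = status_sf.get(sf, {})
        idle = states.get(IDLE, 0)
        total = idle + states.get(RUNNING, 0)
        # Room left under both the idle limit and the total limit
        room = max(0, min(max_idle - idle, max_total - total))
        to_submit[sf] = min(room, remaining)
        remaining -= to_submit[sf]
    return to_submit


status = {"job.A.condor": {IDLE: 2, RUNNING: 3}, "job.B.condor": {}}
print(select_depth_wise(list(status), status, max_idle=5, max_total=10, nr_glideins=8))
# {'job.A.condor': 3, 'job.B.condor': 5}
```

Depth-wise filling concentrates the new glideins in whichever sub entry comes first after the shuffle; the shuffle keeps that choice from always favoring the same sub entry.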
glideinwms.factory.manageFactoryDowntimes module¶
- glideinwms.factory.manageFactoryDowntimes.get_downtime_fd_dict(entry_or_id, cmdname, opt_dict)[source]¶
- glideinwms.factory.manageFactoryDowntimes.get_production_ress_entries(server, ref_dict_list)[source]¶