create_job_table

Table of Job Parent Child Relationships

Prototype

# at_cascade.create_job_table
def create_job_table(
   all_node_database                 ,
   node_table                        ,
   start_node_id                     ,
   fit_goal_set                      ,
   start_split_reference_id   = None ,
) :
   assert type(all_node_database) == str
   assert type(node_table) == list
   if len(node_table) > 0 :
      assert type( node_table[0] ) == dict
   assert type(start_node_id) == int
   assert type(start_split_reference_id) == int or \
      start_split_reference_id == None
   assert type(fit_goal_set) == set
   # ...
   assert type(job_table)      == list
   assert type( job_table[0] ) == dict
   assert job_table[0]['fit_node_id'] == start_node_id
   assert job_table[0]['split_reference_id'] == start_split_reference_id
   assert job_table[0]['prior_only'] == False
   for job_id in range(1, len(job_table) ) :
      parent_job_id = job_table[job_id]['parent_job_id']
      assert job_table[parent_job_id]['prior_only'] == False
   return job_table

Summary

This routine returns a list where each element corresponds to a job:

  1. A job is a combination of a node and split reference value. For example, if the node is n0 and we are splitting on sex some possible jobs are n0.female, n0.male.

  2. All jobs that have prior_only false must be fit in order to fit every job for the nodes in the fit_goal_set.

  3. Each job has a parent_job_id for the job that needs to be fit before it, except for the start job which corresponds to the start node and start split reference id. The prior_only field is false for any job that is a parent; i.e., all the parent jobs are fit.

  4. Each job also has a list of the jobs that need to be run after it (to fit the fit_goal_set).

  5. If a job has prior_only true, it does not need to be fit for this fit_goal_set, but its priors should be created (when the corresponding parent job is fit) so it could be the start job for a different fit_goal_set .
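To make the summary concrete, here is a minimal hand-written sketch of what a job table could look like. It is not output from at_cascade: the node names n0, n1, n2, the choice of an empty split_reference_table, the assumption that n2 ends up prior_only, and the helper check_job_table are all hypothetical. The checks only restate the relationships described on this page.

```python
# Hand-written illustration only; not produced by at_cascade.
# Assumed scenario: n0 is the start node with children n1 and n2,
# the split_reference_table is empty, and the fit goal is n1 only.
job_table = [
    {   # job 0: the start job
        'job_name': 'n0', 'prior_only': False,
        'fit_node_id': 0, 'split_reference_id': None,
        'parent_job_id': None,
        'start_child_job_id': 1, 'end_child_job_id': 3,
    },
    {   # job 1: must be fit to reach the fit goal
        'job_name': 'n1', 'prior_only': False,
        'fit_node_id': 1, 'split_reference_id': None,
        'parent_job_id': 0,
        'start_child_job_id': 3, 'end_child_job_id': 3,
    },
    {   # job 2: priors are created, but the job itself is not fit
        'job_name': 'n2', 'prior_only': True,
        'fit_node_id': 2, 'split_reference_id': None,
        'parent_job_id': 0,
        'start_child_job_id': 3, 'end_child_job_id': 3,
    },
]

def check_job_table(job_table):
    """Hypothetical helper: check the parent and child relationships."""
    # The start job has no parent and is always fit.
    assert job_table[0]['parent_job_id'] is None
    assert job_table[0]['prior_only'] is False
    for job_id in range(1, len(job_table)):
        parent = job_table[job_table[job_id]['parent_job_id']]
        # Every parent job is fit.
        assert parent['prior_only'] is False
        # Each job lies inside its parent's child range.
        assert parent['start_child_job_id'] <= job_id < parent['end_child_job_id']
    for job_id, row in enumerate(job_table):
        # Each job in the child range names this job as its parent.
        for child_id in range(row['start_child_job_id'], row['end_child_job_id']):
            assert job_table[child_id]['parent_job_id'] == job_id
    return True

check_job_table(job_table)
```

In this sketch, jobs 1 and 2 have equal start and end child ids, so neither has children.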

all_node_database

is a python string specifying the location of the all_node_db relative to the current working directory. This argument can’t be None.

node_table

is a list of dict containing the node table for this cascade. This argument can’t be None.

start_node_id

This, together with start_split_reference_id corresponds to a completed fit that we are starting from. We assume that the priors for this fit have been created; see prior_only below. The start node must be a descendant of the root_node .

start_split_reference_id

This, together with start_node_id corresponds to a completed fit that we are starting from. Only jobs that depend on the start job's completion will be included in the job table. This is None if and only if split_reference_table is empty.

fit_goal_set

This is a fit_goal_set. In addition, each node in this set must be the start node or a descendant of the start node.
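The descendant requirement can be checked before calling create_job_table. The sketch below is only an illustration under stated assumptions: it assumes each node_table row has a 'parent' key holding the parent node_id (None for the root), that the fit_goal_set holds node ids rather than node names, and the helper name is_start_or_descendant is hypothetical.

```python
def is_start_or_descendant(node_table, node_id, start_node_id):
    """Follow parent links from node_id; True if start_node_id is reached."""
    while node_id is not None:
        if node_id == start_node_id:
            return True
        # Assumption: 'parent' is the parent node_id, None for the root.
        node_id = node_table[node_id]['parent']
    return False

# Hypothetical node table: n0 is the root, n3 is a child of n1.
node_table = [
    {'node_name': 'n0', 'parent': None},
    {'node_name': 'n1', 'parent': 0},
    {'node_name': 'n2', 'parent': 0},
    {'node_name': 'n3', 'parent': 1},
]
start_node_id = 1
fit_goal_set = {1, 3}

goal_ok = all(
    is_start_or_descendant(node_table, node_id, start_node_id)
    for node_id in fit_goal_set
)
```

With start node n1, the goals n1 and n3 are acceptable, while n2 would not be because it is a sibling of the start node.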

job_table

The return value job_table is a list of dict :

job_id

We use this_job_id to denote the index of a row in the job_table list. The value job_table[this_job_id] is a dict with the following keys:

job_name

This is a str containing the job name. If the split_reference_table is empty, job_name is equal to node_name where node_name is the node name corresponding to node_id. Otherwise, job_name is equal to node_name.split_reference_name where split_reference_name is the split reference name corresponding to split_reference_id.
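The naming rule can be written as a small function. This is a sketch of the rule as described here, not the code at_cascade uses; the helper name make_job_name is hypothetical, and passing None stands in for the case where the split_reference_table is empty.

```python
def make_job_name(node_name, split_reference_name=None):
    """Compose a job name from a node name and an optional split reference name."""
    if split_reference_name is None:
        # Empty split_reference_table: the job name is just the node name.
        return node_name
    return f'{node_name}.{split_reference_name}'

name_without_split = make_job_name('n0')
name_with_split = make_job_name('n0', 'female')
```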

prior_only

If this bool is false, this job must be run to fit all the nodes in fit_goal_set. It will be false if this is the start job; i.e., the start job must be fit to fit the nodes in fit_goal_set.

If prior_only is true, prior_only cannot be true for the corresponding parent job. The priors for this job will be created if the parent job succeeds, but this job will not be run and it will not have any children. These priors are intended to be used by a subsequent call to continue_cascade where this job is the start job (and prior_only is false because we have a different fit_goal_set).

fit_node_id

This is an int containing the node_id for the fit_node for this_job_id.

split_reference_id

If the split_reference_table is empty, this is None. Otherwise it is an int containing the split_reference_id for this_job_id; i.e., the splitting covariate has this reference value.

parent_job_id

This is an int containing the job_id corresponding to the parent job, which must be less than the job_id for this row of the job table (each child job_id is greater than its parent's job_id). The parent job (and only the parent job) must have completed before this job can be run. The first row of the job table has parent_job_id equal to None; i.e., there is no parent for the first job.
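Because every job except the first has a parent, the parent links can be followed from any job back to the start job. The sketch below assumes only the parent_job_id key described above; the helper name ancestor_job_ids and the sample rows are hypothetical.

```python
def ancestor_job_ids(job_table, job_id):
    """Return the job ids from the parent of job_id back to the start job."""
    result = []
    parent_job_id = job_table[job_id]['parent_job_id']
    while parent_job_id is not None:
        result.append(parent_job_id)
        parent_job_id = job_table[parent_job_id]['parent_job_id']
    return result

# Hypothetical rows showing only the key used here.
sample_table = [
    {'parent_job_id': None},  # job 0, the start job
    {'parent_job_id': 0},     # job 1
    {'parent_job_id': 0},     # job 2
    {'parent_job_id': 1},     # job 3
]
chain = ancestor_job_ids(sample_table, 3)
```

Here job 3 depends on job 1, which in turn depends on the start job.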

start_child_job_id

This is the job_id for the first job that can run as soon as this job is completed. The start_child_job_id is always greater than the job_id for the current row. The simplest way to run the jobs is in job table order (not in parallel).
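Running in job table order can be sketched as a single loop. This is an illustration, not the at_cascade driver: fit_one_job is a hypothetical callback standing in for whatever performs a fit, prior_only jobs are skipped because they are not fit, and failure handling is left out.

```python
def run_in_table_order(job_table, fit_one_job):
    """Call fit_one_job(job_id) for each job that is not prior_only, in table order."""
    fitted = []
    for job_id, row in enumerate(job_table):
        if row['prior_only']:
            # Priors are created when the parent is fit; the job itself is not run.
            continue
        parent_job_id = row['parent_job_id']
        # In table order the parent has already been fit.
        assert parent_job_id is None or parent_job_id in fitted
        fit_one_job(job_id)
        fitted.append(job_id)
    return fitted

# Hypothetical rows showing only the keys used here.
sample_table = [
    {'prior_only': False, 'parent_job_id': None},
    {'prior_only': False, 'parent_job_id': 0},
    {'prior_only': True,  'parent_job_id': 0},
]
log = []
fitted = run_in_table_order(sample_table, log.append)
```

The assertion inside the loop relies on child job ids being greater than the job id of their parent, so a parent always appears earlier in the table.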

end_child_job_id

This is the job_id plus one for the last job that can run as soon as this job is completed. If end_child_job_id is equal to start_child_job_id, there are no jobs that require the results of this job. Note that this job is the parent of each job between the start and end,