create_job_table¶

Table of Job Parent Child Relationships¶

Prototype¶

# at_cascade.create_job_table
def create_job_table(
    all_node_database                 ,
    node_table                        ,
    start_node_id                     ,
    fit_goal_set                      ,
    start_split_reference_id   = None ,
) :
    assert type(all_node_database) == str
    assert type(node_table) == list
    if len(node_table) > 0 :
        assert type( node_table[0] ) == dict
    assert type(start_node_id) == int
    assert type(start_split_reference_id) == int or \
        start_split_reference_id == None
    assert type(fit_goal_set) == set
    # ...
    assert type(job_table)      == list
    assert type( job_table[0] ) == dict
    assert job_table[0]['fit_node_id'] == start_node_id
    assert job_table[0]['split_reference_id'] == start_split_reference_id
    assert job_table[0]['prior_only'] == False
    for job_id in range(1, len(job_table) ) :
        parent_job_id = job_table[job_id]['parent_job_id']
        assert job_table[parent_job_id]['prior_only'] == False
    return job_table

Summary¶

This routine returns a list where each element corresponds to a job:

A job is a combination of a node and split reference value. For example, if the node is n0 and we are splitting on sex some possible jobs are n0.female, n0.male.
Every job that has prior_only false must be fit in order to fit all the jobs for the nodes in the fit_goal_set.
Each job has a parent_job_id for the job that needs to be fit before it, except for the start job which corresponds to the start node and start split reference id. The prior_only field is false for any job that is a parent; i.e., all the parent jobs are fit.
Each job also has a list of the jobs that need to be run after it (to fit the fit_goal_set).
If a job has prior_only true, it does not need to be fit for this fit_goal_set, but its priors should be created (when the corresponding parent job is fit) so that it can be the start job for a different fit_goal_set.
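The structure described above can be illustrated with a small hand-written table. This is only a sketch: the table below is not output of create_job_table, the nodes n0, n1, n2 and the helper check_job_table are hypothetical, and the example assumes no splitting covariate and a fit_goal_set containing only node n1. The checks mirror the assertions at the end of the prototype.

```python
# Hand-written toy job table (NOT produced by at_cascade.create_job_table).
# Assumptions: n0 is the start node with children n1 and n2, there is no
# splitting covariate, and fit_goal_set == {1} (node n1 only).
job_table = [
    {'job_name': 'n0', 'fit_node_id': 0, 'split_reference_id': None,
     'prior_only': False, 'parent_job_id': None,
     'start_child_job_id': 1, 'end_child_job_id': 3},
    {'job_name': 'n1', 'fit_node_id': 1, 'split_reference_id': None,
     'prior_only': False, 'parent_job_id': 0,
     'start_child_job_id': 3, 'end_child_job_id': 3},
    # n2 is not needed for the fit goal, so only its priors are created.
    {'job_name': 'n2', 'fit_node_id': 2, 'split_reference_id': None,
     'prior_only': True, 'parent_job_id': 0,
     'start_child_job_id': 3, 'end_child_job_id': 3},
]

def check_job_table(job_table, start_node_id, start_split_reference_id):
    """Hypothetical helper: repeat the prototype's post-conditions."""
    first = job_table[0]
    assert first['fit_node_id'] == start_node_id
    assert first['split_reference_id'] == start_split_reference_id
    assert not first['prior_only']
    for job_id in range(1, len(job_table)):
        parent_job_id = job_table[job_id]['parent_job_id']
        # A parent job is always fit, so it cannot be prior_only.
        assert not job_table[parent_job_id]['prior_only']
    return True

check_job_table(job_table, start_node_id=0, start_split_reference_id=None)
```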

all_node_database¶

is a python string specifying the location of the all_node_db relative to the current working directory. This argument can’t be None.

node_table¶

is a list of dict containing the node table for this cascade. This argument can’t be None.

start_node_id¶

This, together with start_split_reference_id, corresponds to a completed fit that we are starting from. We assume that the priors for this fit have been created; see prior_only below. The start node must be a descendant of the root_node.

start_split_reference_id¶

This, together with start_node_id, corresponds to a completed fit that we are starting from. Only jobs that depend on the start job's completion will be included in the job table. This is None if and only if split_reference_table is empty.

fit_goal_set¶

This is a fit_goal_set. In addition, each node in this set must be the start node, or a descendant of the start node.

job_table¶

The return value job_table is a list of dict :

job_id¶

We use this_job_id to denote the index of a row in the job_table list. The value job_table[this_job_id] is a dict with the following keys:

job_name¶

This is a str containing the job name. If the split_reference_table is empty, job_name is equal to node_name where node_name is the node name corresponding to node_id. Otherwise, job_name is equal to node_name.split_reference_name where split_reference_name is the split reference name corresponding to split_reference_id.
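The naming rule above can be sketched as follows. The function make_job_name is a hypothetical helper written for illustration, not part of at_cascade; passing None for the split reference name stands in for the case where split_reference_table is empty.

```python
def make_job_name(node_name, split_reference_name=None):
    """Hypothetical helper: build a job name from its two parts."""
    if split_reference_name is None:
        # Empty split_reference_table: the job name is just the node name.
        return node_name
    # Otherwise join the node name and split reference name with a dot.
    return f'{node_name}.{split_reference_name}'

print(make_job_name('n0'))
print(make_job_name('n0', 'female'))
```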

prior_only¶

If this bool is false, this job must be run to fit all the nodes in fit_goal_set. It will be false if this is the start job; i.e., the start job must be fit to fit the nodes in fit_goal_set.

If prior_only is true for this job, it cannot be true for the corresponding parent job. The priors for this job will be created if the parent job succeeds, but this job will not be run and it will not have any children. These priors are intended to be used by a subsequent call to continue_cascade where this job is the start job (and prior_only is false because we have a different fit_goal_set).

fit_node_id¶

This is an int containing the node_id of the fit node for job this_job_id.

split_reference_id¶

If the split_reference_table is empty, this is None. Otherwise it is an int containing the split_reference_id for job this_job_id; i.e., the splitting covariate has this reference value.

parent_job_id¶

This is an int containing the job_id corresponding to the parent job, which must be less than the job_id for this row of the job table. The parent job (and only the parent job) must have completed before this job can be run. The first row of the job table has parent_job_id equal to None; i.e., there is no parent for the first job.
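As an illustration of how parent_job_id links the rows, the sketch below follows the parent chain from a job back to the first row, whose parent_job_id is None. The helper ancestor_job_ids and the toy table are hypothetical; the rows only carry the one key this example needs.

```python
def ancestor_job_ids(job_table, job_id):
    """Hypothetical helper: job ids from the parent of job_id up to the start job."""
    ancestors = []
    parent_job_id = job_table[job_id]['parent_job_id']
    while parent_job_id is not None:
        ancestors.append(parent_job_id)
        parent_job_id = job_table[parent_job_id]['parent_job_id']
    return ancestors

# Toy table: job 0 is the start job, jobs 1 and 2 are its children,
# and job 3 is a child of job 1.
toy_job_table = [
    {'parent_job_id': None},
    {'parent_job_id': 0},
    {'parent_job_id': 0},
    {'parent_job_id': 1},
]

print(ancestor_job_ids(toy_job_table, 3))
```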

start_child_job_id¶

This is the job_id for the first job that can run as soon as this job is completed. The start_child_job_id is always greater than the job_id for the current row. The simplest way to run the jobs is in job table order (not in parallel).
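Running in job table order can be sketched as below. This is an illustration under stated assumptions, not the at_cascade implementation: run_in_table_order and the fit_one_job callback are hypothetical, jobs with prior_only true are skipped as described under prior_only, and the toy rows only carry the keys used here. Because each child job_id is greater than its parent's, a single pass over the table always reaches a parent before its children.

```python
def run_in_table_order(job_table, fit_one_job):
    """Hypothetical helper: fit the jobs sequentially, in job table order."""
    order = []      # job ids in the order they were fit
    done = set()    # job ids that have completed
    for job_id, row in enumerate(job_table):
        if row['prior_only']:
            # Only the priors are needed for this job; it is not fit.
            continue
        parent_job_id = row['parent_job_id']
        # The parent (if any) must have completed before this job runs.
        assert parent_job_id is None or parent_job_id in done
        fit_one_job(job_id)
        done.add(job_id)
        order.append(job_id)
    return order

toy_job_table = [
    {'job_name': 'n0', 'prior_only': False, 'parent_job_id': None},
    {'job_name': 'n1', 'prior_only': False, 'parent_job_id': 0},
    {'job_name': 'n2', 'prior_only': True,  'parent_job_id': 0},
]

fitted = []  # stands in for actually fitting a job
order = run_in_table_order(toy_job_table, fitted.append)
print(order)
```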

end_child_job_id¶

This is the job_id plus one for the last job that can run as soon as this job is completed. If end_child_job_id is equal to start_child_job_id, there are no jobs that require the results of this job. Note that this job is the parent of each job between the start and end,