csv.predict

Prediction for a CSV Fit

Prototype

# at_cascade.csv.predict
def predict(fit_dir, sim_dir=None, start_job_name=None, max_job_depth=None) :
   assert type(fit_dir)  == str
   assert sim_dir        == None or type(sim_dir) == str
   assert start_job_name == None or type(start_job_name) == str
   assert max_job_depth  == None or type(max_job_depth) == int

Example

See csv.sim_fit_pred.

fit_dir

Same as the csv fit fit_dir.

sim_dir

If this is None, the file tru_predict.csv is not created. Otherwise, sim_dir is the directory used to simulate the data for this fit and the file tru_predict.csv is created.

start_job_name

This is the name of the job (fit) at which the predictions start. It is a node name, followed by a period, followed by a sex. Only this fit and its descendants are included in the predictions. If this argument is None, all of the jobs (fits) are included.
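As a minimal sketch of the naming convention described above, the following helper splits a job name into its node name and sex. The function split_job_name is hypothetical and is not part of at_cascade; it assumes the sex is the text after the last period.

```python
def split_job_name(job_name):
    """Split 'node_name.sex' into (node_name, sex); hypothetical helper."""
    # rpartition splits at the last period, so a node name that itself
    # contains a period is kept intact (an assumption about the naming rule)
    node_name, period, sex = job_name.rpartition('.')
    valid_sex = ('female', 'both', 'male')
    if period != '.' or node_name == '' or sex not in valid_sex:
        raise ValueError(f'{job_name!r} is not of the form node_name.sex')
    return node_name, sex

# n0 is a hypothetical node name
node_name, sex = split_job_name('n0.both')
print(node_name, sex)
```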

max_job_depth

This is the number of generations below start_job_name that are included; see Node Depth Versus Job Depth and note that sex is the split_covariate_name. If max_job_depth is zero, only the start job is included. If max_job_depth is None, the start job and all its descendants are included.
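The effect of start_job_name and max_job_depth can be sketched as a walk down a job tree, one generation at a time. The dictionary of child jobs and the function jobs_to_predict below are hypothetical illustrations, not at_cascade data structures; the actual job tree depends on the node table and on whether the fit splits by sex.

```python
def jobs_to_predict(children, start_job_name, max_job_depth=None):
    """Return the start job and its descendants down to max_job_depth."""
    selected = []
    frontier = [start_job_name]
    depth = 0
    while frontier and (max_job_depth is None or depth <= max_job_depth):
        selected.extend(frontier)
        # move one generation down the job tree
        frontier = [c for job in frontier for c in children.get(job, [])]
        depth += 1
    return selected

# hypothetical job tree: the root job splits by sex, then moves to node n1
children = {
    'n0.both':   ['n0.female', 'n0.male'],
    'n0.female': ['n1.female'],
    'n0.male':   ['n1.male'],
}
print(jobs_to_predict(children, 'n0.both', max_job_depth=0))
print(jobs_to_predict(children, 'n0.both', max_job_depth=1))
print(jobs_to_predict(children, 'n0.female'))
```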

Input Files

option_predict.csv

This csv file has two columns, one called name and the other called value. The rows of this table are documented below by the name column. If an option name does not appear, or the corresponding value is empty, the default value is used for the option. The final value for each of the options is reported in the file option_predict_out.csv. Because each option has a default value, new options are added in such a way that previous option_predict.csv files are still valid.
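The following sketch writes a small option_predict.csv and reads it back, applying the rule that a missing or empty value means the default. The reader function read_option is hypothetical and only mimics the behavior described above; it is not the at_cascade implementation. The defaults are the ones stated in this section, with max_number_cpu and number_sample_predict left out because their defaults are computed.

```python
import csv
import os
import tempfile

# defaults stated in this section
default = {
    'db2csv': 'false',
    'descendant_std_factor': '1',
    'float_precision': '5',
    'plot': 'false',
    'zero_meas_value': 'false',
}

def read_option(file_name):
    """Return option values, using the default for missing or empty values."""
    option = dict(default)
    with open(file_name, newline='') as f:
        for row in csv.DictReader(f):
            if row['value'] != '':
                option[row['name']] = row['value']
    return option

fit_dir = tempfile.mkdtemp()   # stand-in for a real fit_dir
file_name = os.path.join(fit_dir, 'option_predict.csv')
with open(file_name, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'value'])
    writer.writerow(['plot', 'true'])
    writer.writerow(['float_precision', ''])   # empty, so the default is used

option = read_option(file_name)
print(option)
```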

db2csv

If this boolean option is true, the dismod_at db2csv_command is used to generate the csv files corresponding to each dismod.db. This is only done for (node, sex) pairs that have samples; i.e., a successful fit and posterior samples. If this option is true, the csv files will make it more difficult to see the tree structure corresponding to the dismod.db files. The default value for this option is false.

descendant_std_factor

This factor scales the posterior samples of an ancestor fit before predicting for a descendant job; i.e., a (node, sex) pair. It must be greater than zero and its default value is 1. It is only used when predicting for a job that does not have samples. In this case, the closest ancestor that does have samples is used to predict for the (node, sex) pair; see csv.ancestor_fit. For an example, see csv.predict_descend.

float_precision

This integer is the number of decimal digits of precision to include for float values in the output csv files. The default value for this option is 5.

max_number_cpu

This integer is the maximum number of cpus (processes) to use. This must be greater than zero. If it is one, the jobs are run sequentially, more output is printed to the screen, and the program can be cleanly stopped with a control-C. The default value for this option is

   max_number_cpu = max(1, multiprocessing.cpu_count() - 1)

plot

The default value for this option is false. If this boolean option is true, data_plot.pdf and rate_plot.pdf files are created for each dismod.db database. This is only done for (node, sex) pairs that have samples; i.e., a successful fit and posterior samples. The data plot includes a maximum of 1,000 randomly chosen points for each integrand in the predict_integrand.csv file. The rate plot includes all the non-zero rates. These are no-effect rates; i.e., they are the estimated rates for this node and sex without any covariate effects. Predictions with covariate effects can be found in the csv Output Files.

zero_meas_value

If this boolean option is true, the meas_value covariate multipliers are set to zero during the predictions (instead of their simulation, fit, or sample values). The default value for this option is false.

number_sample_predict

This integer option specifies the number of samples generated for each prediction. Its default value is the value of number_sample in option_fit.csv. If number_sample_predict does not appear in option_predict.csv, and number_sample does not appear in option_fit.csv, the default value for number_sample is the value used for number_sample_predict.
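The fallback order described above can be written out as a short sketch. The function resolve_number_sample_predict and its dictionary arguments are hypothetical; the default for number_sample is passed in as an argument because this section does not state its value.

```python
def resolve_number_sample_predict(option_predict, option_fit, default_number_sample):
    """Apply the fallback order described for number_sample_predict."""
    # 1. an explicit, non-empty value in option_predict.csv
    value = option_predict.get('number_sample_predict', '')
    if value != '':
        return int(value)
    # 2. otherwise the value of number_sample in option_fit.csv
    value = option_fit.get('number_sample', '')
    if value != '':
        return int(value)
    # 3. otherwise the default value for number_sample
    return default_number_sample

# the value 10 below is a placeholder, not the documented default
print(resolve_number_sample_predict({}, {'number_sample': '50'}, 10))
```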

covariate.csv

Same as the csv fit covariate.csv.

fit_goal.csv

Same as the csv fit fit_goal.csv.

option_fit.csv

The refit_split value in option_fit.csv is used.

predict_integrand.csv

This is the list of integrands at which predictions are made and stored in fit_predict.csv.

Output Files

option_predict_out.csv

This is a copy of option_predict.csv with the default values filled in for missing values.

fit_predict.csv

  1. If start_job_name is None, fit_predict.csv contains the predictions for all the fits. These predictions are for all of the nodes at the age, time, and covariate values specified in covariate.csv. The predictions are done using the optimal variable values.

  2. If start_job_name is not None, the predictions are only for jobs at or below the starting job. In addition, the predictions are stored below fit_dir in the file

    predict/fit_start_job_name.csv

    and not in fit_predict.csv .
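The choice of output file in the two cases above can be sketched as follows. The function fit_predict_path is hypothetical. It assumes that fit_predict.csv is written directly in fit_dir and that start_job_name in the file name above stands for the value of that argument; neither assumption is confirmed by this section.

```python
import os

def fit_predict_path(fit_dir, start_job_name=None):
    """Return the assumed location of the fit predictions."""
    if start_job_name is None:
        # assumption: the full prediction file is directly in fit_dir
        return os.path.join(fit_dir, 'fit_predict.csv')
    # assumption: the job name is substituted into the file name
    return os.path.join(fit_dir, 'predict', f'fit_{start_job_name}.csv')

# 'build' and 'n1.female' are hypothetical values
print(fit_predict_path('build'))
print(fit_predict_path('build', 'n1.female'))
```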

avgint_id

Each avgint_id corresponds to a different value for age, time, or integrand in the fit_predict file. The age and time values come from the covariate.csv file. The integrand values come from the predict_integrand.csv file and the covariate multiplier list.

sample_index

Each sample_index corresponds to an independent random sample of the model variables.

  1. If sample_method is asymptotic, model variables for each sample are Gaussian correlated with mean equal to the optimal value and variance equal to the asymptotic approximation.

  2. If sample_method is censor_asymptotic, model variables are the same as for asymptotic except that values above (below) their upper bound (lower bound) are converted to the corresponding bound.

  3. If sample_method is simulate, the model variables for each sample are the optimal values corresponding to an independent data set.
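The difference between asymptotic and censor_asymptotic can be illustrated for a single model variable. This is only a sketch: it draws independent Gaussian values for one variable, whereas the samples described above are correlated across all the model variables, and the optimal value, standard deviation, and bounds below are made-up numbers.

```python
import random

def censor(value, lower, upper):
    """Move a value outside [lower, upper] to the nearest bound."""
    return min(max(value, lower), upper)

random.seed(0)
optimal, std = 0.01, 0.02    # hypothetical optimal value and asymptotic std
lower, upper = 0.0, 1.0      # hypothetical bounds for this variable

# asymptotic: Gaussian about the optimal value, may fall outside the bounds
asymptotic = [random.gauss(optimal, std) for _ in range(5)]

# censor_asymptotic: same draws, with out-of-bound values moved to the bound
censored = [censor(value, lower, upper) for value in asymptotic]
print(asymptotic)
print(censored)
```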

integrand_name

is the integrand for this sample. It is one of the integrand names in predict_integrand.csv or one of the covariate multiplier names mulcov_0, mulcov_1, …, which correspond to the first, second, … covariate multipliers in the csv fit mulcov.csv file.

avg_integrand

This float is the mode value for the average of the integrand, with covariate and other effects but without measurement noise.

node_name

is the name of the node that this sample is predicting for. This cycles through all the nodes in covariate.csv.

sex

is the sex, female, both, or male, that the predictions are for.

fit_node_name

is the node name corresponding to the fit, and samples, that were used to do these predictions. This identifies the nearest ancestor that had a successful fit and samples.

fit_sex

is the sex corresponding to the fit, and samples, that were used to do these predictions.

posterior

If fit_node_name and fit_sex are the same as node_name and sex, the fit and samples succeeded for this node_name and sex, and this row contains a posterior prediction for this node_name and sex.

prior

If fit_node_name is not the same as node_name, or fit_sex is not the same as sex, this row contains a prior prediction for this node_name and sex. The pair (fit_node_name, fit_sex) corresponds to the closest ancestor fit that was successful.
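The posterior and prior rules above can be applied row by row when reading a prediction file. The sketch below uses a made-up excerpt with only the columns needed for the test, and the function prediction_type is hypothetical.

```python
import csv
import io

def prediction_type(row):
    """Classify a prediction row as 'posterior' or 'prior'."""
    same_node = row['fit_node_name'] == row['node_name']
    same_sex = row['fit_sex'] == row['sex']
    return 'posterior' if same_node and same_sex else 'prior'

# hypothetical excerpt; a real file has more columns
text = """node_name,sex,fit_node_name,fit_sex,avg_integrand
n1,female,n1,female,0.012
n2,female,n0,both,0.015
"""
rows = list(csv.DictReader(io.StringIO(text)))
types = [prediction_type(row) for row in rows]
print(types)
```

For a real file, replace the io.StringIO object with an open file; for example, open(file_name, newline='').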

age

is the age for this prediction and is one of the ages in covariate.csv.

time

is the time for this prediction and is one of the times in covariate.csv.

covariate_names

The rest of the columns are covariate names and contain the value of the corresponding covariate in covariate.csv.

tru_predict.csv

If sim_dir is None, this file is not created. Otherwise, this file contains the predictions for the model variables corresponding to the simulation. It is similar to fit_predict.csv with the following differences:

  1. The first line (header line) is the same in this file and fit_predict.csv.

  2. If the other lines, in both files, are sorted by (node_name, avgint_id), the other lines are the same except for the value in the avg_integrand column.

  3. The model variables and true values are for the fit_node_name and fit_sex. Hence this does not really represent truth unless these are the same as node_name and sex.
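Because the two files differ only in the avg_integrand column, the remaining columns can serve as a key for comparing a fit prediction with the corresponding simulation value. The sketch below uses made-up excerpts with a subset of the columns; the function index_rows and the integrand values are illustrative assumptions.

```python
import csv
import io

def index_rows(text):
    """Map every column except avg_integrand to the avg_integrand value."""
    result = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = tuple(sorted(
            (name, value) for name, value in row.items()
            if name != 'avg_integrand'
        ))
        result[key] = float(row['avg_integrand'])
    return result

# hypothetical excerpts of fit_predict.csv and tru_predict.csv
fit_text = """avgint_id,integrand_name,avg_integrand,node_name,sex
0,Sincidence,0.011,n1,female
1,Sincidence,0.021,n1,female
"""
tru_text = """avgint_id,integrand_name,avg_integrand,node_name,sex
0,Sincidence,0.010,n1,female
1,Sincidence,0.020,n1,female
"""
fit = index_rows(fit_text)
tru = index_rows(tru_text)

# difference between the fit prediction and the simulation value
error = {key: fit[key] - tru[key] for key in fit}
max_abs_error = max(abs(value) for value in error.values())
print(max_abs_error)
```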

sam_predict.csv

This is a sampling of the predictions, using the posterior distribution of the model variables. It is similar to fit_predict.csv with the following differences:

  1. The first line (header line) is the same in this file and fit_predict.csv except that sam_predict.csv has an extra column named sample_index.

  2. Suppose that the other lines in sam_predict.csv and fit_predict.csv are sorted by (node_name, avgint_id).

  3. Let n_sample be the number of other lines in sam_predict.csv divided by the number of other lines in fit_predict.csv.

  4. For each line in fit_predict.csv (not counting the header line), there are n_sample lines in sam_predict.csv, that are the same as the line in fit_predict.csv except for the value in the avg_integrand column and the extra sample_index column in sam_predict.csv.
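The relation between the two files can be used to summarize the samples for each fit prediction. The sketch below computes n_sample as in item 3 and then groups the sample values by (node_name, avgint_id). The excerpts are made up, contain only the columns needed here, and assume a single sex so that this key is unique.

```python
import csv
import io
import statistics
from collections import defaultdict

# hypothetical excerpts of sam_predict.csv and fit_predict.csv
sam_text = """avgint_id,sample_index,avg_integrand,node_name
0,0,0.010,n1
0,1,0.012,n1
1,0,0.020,n1
1,1,0.024,n1
"""
fit_text = """avgint_id,avg_integrand,node_name
0,0.011,n1
1,0.022,n1
"""
sam_rows = list(csv.DictReader(io.StringIO(sam_text)))
fit_rows = list(csv.DictReader(io.StringIO(fit_text)))

# number of sample lines per fit_predict.csv line (header lines excluded)
n_sample = len(sam_rows) // len(fit_rows)

# collect the sampled avg_integrand values for each prediction
samples = defaultdict(list)
for row in sam_rows:
    key = (row['node_name'], int(row['avgint_id']))
    samples[key].append(float(row['avg_integrand']))

for key in sorted(samples):
    values = samples[key]
    print(key, statistics.mean(values), statistics.stdev(values))
```

The sample standard deviation for each key gives a rough measure of the posterior uncertainty in the corresponding prediction.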

start_job_name

If start_job_name is not None, the predictions are only for jobs at or below the starting job. In addition, the predictions are stored below fit_dir in the file

predict/sam_start_job_name.csv

and not in sam_predict.csv.

sample_index

For each sample_index value, there is a complete set of all the values in the fit_predict.csv table. A different (independent) sample of the model variables from their posterior distribution is used to do the predictions for each sample index.