csv.fit

Fit a CSV Specified Cascade

Prototype

# at_cascade.csv.fit
def fit(fit_dir, max_node_depth = None) :
    assert type(fit_dir) == str
    assert max_node_depth == None or type(max_node_depth) == int

Example

csv.break_fit_pred .

fit_dir

This string is the directory name where the csv files are located.

max_node_depth

This is the number of generations below root node that are included; see Node Depth Versus Job Depth and note that sex is the split_covariate_name . If max_node_depth is zero, only the root node will be included. If max_node_depth is None, the root node and all its descendants are included.
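The effect of max_node_depth on which nodes are kept can be sketched as follows. This is only an illustration: the node names, the parent mapping, and both helper functions are hypothetical, and it counts node depth only (it does not model the extra job depth introduced by the sex split).

```python
# Hypothetical node tree: node name -> parent name (None marks the root).
parent = {
    "n0": None,
    "n1": "n0",
    "n2": "n0",
    "n3": "n1",
}

def node_depth(node, parent):
    """Number of generations between node and the root."""
    depth = 0
    while parent[node] is not None:
        node = parent[node]
        depth += 1
    return depth

def included_nodes(parent, max_node_depth=None):
    """Nodes that are at most max_node_depth generations below the root."""
    if max_node_depth is None:
        return sorted(parent)
    return sorted(n for n in parent if node_depth(n, parent) <= max_node_depth)

print(included_nodes(parent, 0))     # root only
print(included_nodes(parent, 1))     # root and its children
print(included_nodes(parent, None))  # every node
```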

Input Files

option_fit.csv

This csv file has two columns, one called name and the other called value. The rows of this table are documented below by the name column. If an option name does not appear, or the corresponding value is empty, the default value is used for the option. The final value for each of the options is reported in the file option_fit_out.csv . Because each option has a default value, new options are added in such a way that previous option_fit.csv files are still valid.
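As a minimal sketch of the two-column layout, the following builds the text of an option_fit.csv file with the standard csv module. The particular options and values are hypothetical choices for illustration, and the spelling true for a boolean value is an assumption.

```python
import csv
import io

# Hypothetical option choices; any option left out keeps its default.
options = {
    "max_fit": "250",
    "ode_step_size": "10.0",
    "refit_split": "true",
}

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["name", "value"])       # the two required columns
for name, value in options.items():
    writer.writerow([name, value])

option_fit_text = buffer.getvalue()      # write this text to fit_dir/option_fit.csv
print(option_fit_text)
```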

absolute_covariates

This is a space separated list of the names of the absolute covariates. The reference value for an absolute covariate is always zero. (The reference value for a relative covariate is its average for the location that is being fit.) The default value for absolute_covariates is the empty string; i.e., there are no absolute covariates. The covariate named one is automatically created and is always absolute and should not be in this list.

age_avg_split

This string contains a space separated list of float values (there is one or more spaces between each float value). Each float value is an age at which to split the integration of both the ODE and the average of an integrand over an interval. The default for this value is the empty string; i.e., no extra age splitting over the uniformly spaced grid specified by ode_step_size.

asymptotic_rcond_lower

This float is a lower bound for an approximate reciprocal condition number of the Hessian of the fixed effects objective. This Hessian is used as an approximation for the information matrix when using the asymptotic or censor_asymptotic sample_method . This option must be between zero and one and its default value is zero. If the approximate reciprocal condition number is less than asymptotic_rcond_lower, the asymptotic sample method will fail.

balance_sex

This is a boolean option. The subsample of the data with size max_fit always attempts to balance the child nodes; i.e., get an equal number of data values for each child of the node currently being fit. If balance_sex is true, the selection will also try to balance the sex covariate values; i.e., get an equal amount of male and female data for each child node.

bound_random

This float option specifies a bound on the random effects. Sometimes the initial fixed effects are very far from truth and the random effects try to compensate with large values. This bound can stabilize the optimization in this case. It is the intention that this bound not be active at the final value for the fixed effects. The default value for this option is infinity; i.e., no bound.

child_prior_dage

This option is true or false. If it is false, no dage priors are created for the child jobs. The default value for this option is true. See the Problem for a discussion of why you may want to use this option.

child_prior_dtime

This option is true or false. If it is false, no dtime priors are created for the child jobs. The default value for this option is true. See the Problem for a discussion of why you may want to use this option.

child_prior_std_factor

This factor multiplies the parent fit posterior standard deviation for the value priors during a child fit (except for the covariate multipliers). If it is greater (less) than one, the child prior standard deviations are larger (smaller) than indicated by the posterior corresponding to the parent fit. The default value for this option is 2.0.

child_prior_std_factor_mulcov

This factor multiplies the parent fit posterior standard deviation for the value priors for the covariate multipliers. The default value for this option is child_prior_std_factor .

compress_interval

This string contains two float values separated by one or more spaces. The first (second) float value is called age_size ( time_size ). By default, both age_size and time_size are 100.

  1. If for a data_in.csv row, age_upper - age_lower <= age_size , the age average for that data is approximated by its value at age ( age_upper + age_lower ) / 2.

  2. If for a data_in.csv row, time_upper - time_lower <= time_size , the time average for that data is approximated by its value at time ( time_upper + time_lower ) / 2.
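The two rules above can be sketched with one helper, assuming the approximation point is the midpoint of the interval. The function name and the size values are hypothetical.

```python
def compress(lower, upper, size):
    """Return the single point that replaces [lower, upper], or None when
    the interval is wider than size and must still be averaged."""
    if upper - lower <= size:
        return (lower + upper) / 2.0
    return None

# Hypothetical compress_interval value "10 5"
age_size, time_size = 10.0, 5.0

print(compress(40.0, 45.0, age_size))       # narrow age interval is compressed
print(compress(1990.0, 2000.0, time_size))  # wide time interval is not
```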

covariate_reference

This string is either data_in.csv or covariate.csv . If it is data_in.csv the reference value for each (sex, node, covariate) is the average of the covariate corresponding to the data that is fit for that (sex, node) . If it is covariate.csv the reference value for each (sex, node, covariate) is the average of the values in covariate.csv that are for that sex, node, and covariate. The default value for this option is data_in.csv . See covariate_reference in the csv.shock for an example use of this option.

freeze_type

This option specifies the type of covariate multiplier freeze that is done. It is either mean or posterior and its default is mean . If refit_split is false, the freeze fit is the only fit at the root level. If refit_split is true, the freeze fit is the second fit at the root level; i.e., the fit directly after the sex split. Note that in general the cascade can freeze the covariate multipliers at any level; see freeze_type in the option_all table.

mean

If the freeze_type is mean , the mean (optimal value) for the covariate multipliers, determined by the freeze fit, is used as the lower and upper limit for fits that are descendant of the freeze fit. Note that if the lower and upper limits are equal, the corresponding model variable is treated as if it has no uncertainty.

posterior

If the freeze_type is posterior , the posterior distribution for the covariate multipliers, determined by the freeze fit, is used as the prior for all the descendants of the freeze fit. This enables one to account for the uncertainty of covariate multiplier values.

hold_out_integrand

This string contains a space separated list of integrand names. These integrands are held out from all the fits except for the no_ode_fit . The no_ode_fit is used to initialize the rates. You can use this option to hold out direct measurements of the rates that are only intended to help with the initialization (are not real data). The following is a list of the rates and the corresponding integrand that is a direct measurement of the rate:

    Rate    Integrand
    iota    Sincidence
    rho     remission
    chi     mtexcess

The default value for hold_out_integrand is the empty string; i.e., all of the data is real data and is included in the fits.

max_abs_effect

This float option specifies an extra bound on the absolute value of the covariate multipliers, except for the measurement noise multipliers. To be specific, the bound on the covariate multiplier is as large as possible under the condition

| mul_bnd * ( cov_value - cov_ref ) | <= max_abs_effect

where mul_bnd is the non-negative covariate multiplier bound, cov_value is a data table value of the covariate, and cov_ref is the reference value for the covariate. It is an extra bound because it is in addition to the priors for a covariate multiplier. The default value for this option is 2.
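As a rough sketch, the bound can be computed as follows under two assumptions: the condition is read as | mul_bnd * ( cov_value - cov_ref ) | <= max_abs_effect, and it is required to hold for every covariate value in the data table. The function name is hypothetical and this is not taken from the at_cascade source.

```python
import math

def multiplier_bound(max_abs_effect, cov_values, cov_ref):
    """Largest non-negative mul_bnd such that
    |mul_bnd * (cov_value - cov_ref)| <= max_abs_effect for every cov_value."""
    max_diff = max(abs(value - cov_ref) for value in cov_values)
    if max_diff == 0.0:
        return math.inf   # the covariate never differs from its reference
    return max_abs_effect / max_diff

# Largest difference from the reference is 4.0, so the bound is 2.0 / 4.0
print(multiplier_bound(2.0, [0.0, 1.0, 5.0], cov_ref=1.0))
```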

max_fit

This integer is the maximum number of data values to fit per integrand. If for a particular fit an integrand has more than this number of data values, a subsample of this size is randomly selected. There is one exception to this rule: the three fits for the root node (corresponding to sex equal to female, both and male) use twice this number of values per integrand. This is because the sex covariate multiplier is frozen after the both fit and the other covariate multipliers are frozen after the female and male fits. The default value for max_fit is 250.
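The per-integrand subsampling can be pictured with the sketch below. It only shows the size limit; it does not model the balancing of child nodes or sexes, nor the doubling at the root node, and the data ids and function name are hypothetical.

```python
import random

def subsample(data_ids, max_fit, rng):
    """Keep every data_id when there are at most max_fit of them, otherwise
    draw max_fit of them at random without replacement."""
    if len(data_ids) <= max_fit:
        return list(data_ids)
    return sorted(rng.sample(data_ids, max_fit))

rng = random.Random(0)   # fixed seed so the selection is reproducible
by_integrand = {
    "Sincidence": list(range(600)),
    "prevalence": list(range(100)),
}
kept = {name: subsample(ids, 250, rng) for name, ids in by_integrand.items()}
print({name: len(ids) for name, ids in kept.items()})
```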

max_fit_parent

If this integer is greater than or equal to zero, max_fit only applies to the child data for a fit, and max_fit_parent is the maximum number of data values for the parent. The default value for max_fit_parent is minus one, in which case max_fit applies to all the data for a fit. Note that data corresponding to the parent node will not be used when fitting any of its descendants.

max_num_iter_fixed

This integer is the maximum number of Ipopt iterations to try before giving up on fitting the fixed effects. The default value for max_num_iter_fixed is 100.

max_number_cpu

This integer is the maximum number of cpus (processes) to use. It must be greater than zero. If it is one, the jobs are run sequentially, more output is printed to the screen, and the program can be cleanly stopped with a control-C. The default value for this option is

    max_number_cpu = max(1, multiprocessing.cpu_count() - 1)

minimum_meas_cv

This float must be non-negative (greater than or equal to zero). It specifies a lower bound on the standard deviation for each measured data value as a fraction of the measurement value. The default value for minimum_meas_cv is zero.
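One way to read this lower bound is sketched below, assuming the standard deviation actually used is the larger of meas_std and minimum_meas_cv times the absolute measurement value. The function name is hypothetical and the exact formula inside dismod_at may differ.

```python
def effective_std(meas_value, meas_std, minimum_meas_cv):
    """Standard deviation after applying the coefficient of variation floor
    (assumed form: the larger of meas_std and cv * |meas_value|)."""
    return max(meas_std, minimum_meas_cv * abs(meas_value))

print(effective_std(0.5, 0.01, 0.1))  # the floor is active
print(effective_std(0.5, 0.20, 0.1))  # the reported meas_std is kept
```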

no_ode_ignore

This is a space separated list of rate and integrand names. It specifies which integrands are ignored during a no_ode_fit . The priors for the following variables will not be changed by no_ode_fit:

  1. The rate names in no_ode_ignore .

  2. The covariate multipliers that affect the rates in no_ode_ignore.

  3. The covariate multipliers that affect measurement values for the integrands in no_ode_ignore .

all

In the special case where no_ode_ignore is all , the no_ode fit is not run and none of the priors are changed before the root_node fit.

no_ode_fit

If this is true (false) a no_ode_fit is (is not) used to get better values for the fixed effects prior means. The default value for no_ode_fit is true.

number_sample

This is the number of independent samples of the posterior distribution for the fitted variables to generate (for each fit).

  1. This sampled posterior is used to create priors for the children of the node being fit.

  2. When splitting, the samples are used to create priors for the same node at the new split covariate values.

  3. These samples are also used by csv.predict to create posterior predictions for any function of the fitted variables.

The default value for this option is 20. (You can get 1000 MCMC samples by just repeating each of the 20 independent samples 50 times.)

ode_method

The default for ode_method is iota_pos_rho_zero (see below).

no_ode

The ode_method value does not matter for the following integrands: Sincidence , remission , mtexcess , mtother , mtwith , relrisk , mulcov_ mulcov_id . If all of your integrands are in the set above, you can use no_ode as the ode_method and avoid having to worry about constraining certain rates to be positive or zero.

2DO

This ode_method does not currently work in the context of csv.fit because csv.fit automatically requests the prevalence integrand for predicting values of pini. This should either be fixed or no_ode should be removed from the possible ode_method values.

trapezoidal

If ode_method is trapezoidal , a trapezoidal method is used to approximate the ODE solution. Like no_ode, you do not have to worry about constraining certain rates to be positive or zero when using the trapezoidal method.

iota_zero_rho_zero

If ode_method is iota_zero_rho_zero , the smoothing for iota and rho must always have lower and upper limit zero. In this case an eigen vector method is used to approximate the ODE solution.

iota_pos_rho_zero

If ode_method is iota_pos_rho_zero , the smoothing for iota must always have lower limit greater than zero and for rho lower and upper limit zero. In this case an eigen vector method is used to approximate the ODE solution.

iota_zero_rho_pos

If ode_method is iota_zero_rho_pos , the smoothing for rho must always have lower limit greater than zero and for iota lower and upper limit zero. In this case an eigen vector method is used to approximate the ODE solution.

iota_pos_rho_pos

If ode_method is iota_pos_rho_pos , the smoothing for iota and rho must always have lower limit greater than zero. In this case an eigen vector method is used to approximate the ODE solution.

ode_step_size

This float must be positive (greater than zero). It specifies the step size in age and time to use when solving the ODE. It is also used as the step size for approximating average integrands over age-time intervals. The smaller ode_step_size, the more computation is required to approximate the ODE solution and the average integrands. Finer resolution for specific ages can be achieved using the age_avg_split option. The default value for this option is 10.0.
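The interaction between ode_step_size and age_avg_split can be pictured with the sketch below. It is an illustration only: the function name and age limits are hypothetical, and the way dismod_at actually builds its age grid may differ in detail.

```python
def age_grid(age_min, age_max, ode_step_size, age_avg_split=""):
    """Uniform grid from age_min to age_max plus the extra split ages."""
    grid = set()
    i = 0
    while age_min + i * ode_step_size < age_max:
        grid.add(age_min + i * ode_step_size)
        i += 1
    grid.add(age_max)
    # age_avg_split is a space separated list of floats
    for text in age_avg_split.split():
        split = float(text)
        if age_min < split < age_max:
            grid.add(split)
    return sorted(grid)

# Finer resolution near age zero without shrinking ode_step_size
print(age_grid(0.0, 30.0, 10.0, "1 5"))
```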

perturb_optimization_scale

This is the standard deviation of the log of a random multiplier that perturbs the scaling point; see perturb_optimization_scale . The default value for this option is 0.3.

perturb_optimization_start

This is the standard deviation of the log of a random multiplier that perturbs the starting point; see perturb_optimization_start . The default value for this option is 0.1.

quasi_fixed

If this boolean option is true, a quasi-Newton method is used to optimize the fixed effects. Otherwise a Newton method is used. The Newton method uses second derivatives of the objective and hence requires more work per iteration, but it can often attain much more accuracy in the final solution. The default value for quasi_fixed is true.

random_seed

This integer is used to seed the random number generator. The default value for this option is

    random_seed = int( time.time() )

refit_split

  1. If this boolean is true, there is a female, male, and both fit at the root level. The both fit is used for the female and male priors. The female and male fits are used for the priors below the root level.

  2. If refit_split is false, there is no female or male fit at the root level and the both fit is used for the priors below the root level.

  3. The default value for this option is true.

Multiplier Freeze

If refit_split is true, the covariate multipliers are frozen after the sex split; i.e., after the separate female, male fits at the root level. If refit_split is false, the covariate multipliers are frozen after the both fit at the root level.

root_node_name

This string is the name of the root node. The default for root_node_name is the top root of the entire node tree. Only the root node and its descendants will be fit. Sometimes it is useful to set max_node_depth to zero and change root_node_name to a particular node that the cascade is having trouble fitting. This can greatly speed up model building.

root_node_sex

This is either female , male , or both. If it is both, then the female and male directories occur directly below the directory for the root node; i.e., the sexes are split just after fitting the root node. If it is not both, there is no female or male directory directly below the directory for the root node and all of the fits are for the root_node_sex .

sample_method

This string specifies the sample_method . It must be asymptotic , censor_asymptotic or simulate , and its default value is asymptotic .

shared_memory_prefix

This string is added to the front of the name of the shared memory objects used to run the cascade in parallel. No two cascades can run at the same time with the same shared memory prefix. If a cascade does not terminate cleanly, you may have to clear the shared memory before you can run it again; see clear_shared . The default value for this option is your user name ($USER) with spaces replaced by underbars. If the USER environment variable is not defined, the value none is used for this default.
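The default described above can be sketched as follows. The function name is hypothetical and this is not copied from the at_cascade source.

```python
import os

def default_shared_memory_prefix(environ=os.environ):
    """User name with spaces replaced by underbars, or 'none' when the
    USER environment variable is not defined."""
    user = environ.get("USER")
    if user is None:
        return "none"
    return user.replace(" ", "_")

print(default_shared_memory_prefix({"USER": "jane doe"}))
print(default_shared_memory_prefix({}))
```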

tolerance_fixed

is the tolerance for convergence of the fixed effects optimization problem. This is relative to one and its default value is 1e-4.

node.csv

This file has the same description as the simulate node.csv file.

covariate.csv

This csv file has the same description as the simulate covariate.csv file.

Compression

The csv.covariate_same routine is used to detect when two (node, sex) pairs have the same values for a covariate. In addition, csv.fit detects when a covariate is constant with respect to age or time or both. If many (node_name, sex) pairs have the same values for a covariate, or do not depend on age or time, this can result in a large savings in the size of the root node database and the amount of memory required by dismod_at. This depends on the values you choose in covariate.csv. The following summary of this savings is printed when csv.fit is run:

csv.fit: create_root_database: covariate counts
number (node, sex, covariate) combinations = ...
number of corresponding weights            = ...
number that are constant w.r.t. age        = ...
number that are constant w.r.t. time       = ...
number that are constant w.r.t. both       = ...

population

If this table has a covariate called population , it is also used to weight the data as a function of age and time; e.g., see csv.population . This function is different for each sex and location.

  1. The csv.simulate routine does not yet do this data weighting.

  2. No population weighting is used during the predictions in fit_predict.csv because these predictions are for a single (age, time) point and not a rectangular (age, time) region.

Both Sexes

The population weighting, and covariate value, for data with sex equal to both is the average of the female and male populations. One might think the both population would be the sum of the female and male populations but this would make the population covariate different than all the other covariates (which use the average of the female and male values for both).

fit_goal.csv

If a node_name is in this table, and the node is a descendant of the root node, it will be included in the fit. All the ancestors of goal nodes, up to the root node, are also fit.

  1. This is different from the fit_goal_set which only contains nodes that are descendants of the root node.

  2. A fit_goal.csv file that only has its header line is the same as one that contains all the nodes in the node table.

  3. If you only have one node in this file, at_cascade will do a drill from the root node to the goal node.
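The set of nodes that end up being fit can be sketched as below: every goal node that descends from the root, plus all of its ancestors up to the root. The node tree and function name are hypothetical.

```python
def nodes_to_fit(parent, root_node, goal_nodes):
    """Root node, goal nodes below the root, and the nodes in between."""
    fit = {root_node}
    for goal in goal_nodes:
        path = []
        node = goal
        # walk up the tree until the root node or the top of the tree
        while node is not None and node != root_node:
            path.append(node)
            node = parent[node]
        if node == root_node:   # goals outside the root subtree are skipped
            fit.update(path)
    return sorted(fit)

# Hypothetical node tree: node name -> parent name (None marks the top).
parent = {"n0": None, "n1": "n0", "n2": "n0", "n3": "n1", "n4": "n1"}

print(nodes_to_fit(parent, "n0", ["n3"]))        # drill from n0 to n3
print(nodes_to_fit(parent, "n1", ["n2", "n4"]))  # n2 is not below n1
```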

node_name

Is the name of a node in the fit goal set. Each such node must be a descendant of the root node.

predict_integrand.csv

This is the list of integrands at which predictions are made and stored in fit_predict.csv .

integrand_name

This string is the name of one of the prediction integrands. You can use the integrand name mulcov_0 , mulcov_1 , … which corresponds to the first , second , … covariate multiplier in the mulcov.csv file.

prior.csv

This csv file has the following columns:

name

is a string containing the name of this prior. No two priors can have the same name.

density

is one of the following strings: uniform, gaussian, cen_gaussian, log_gaussian, laplace, cen_laplace, log_laplace. (Only these densities are included, so far, so that we do not have to worry about the degrees of freedom.)

mean

is a float containing the mean for the density for this prior (before truncation). If density is uniform, this value is only used for starting and scaling the optimization. This column must appear and its value cannot be empty.

std

is a float containing the standard deviation for the density for this prior (before truncation). If density is uniform, this value is not used and can be empty. If all the densities are uniform, this column is optional.

eta

is a float specifying the offset for the log_gaussian, and log_laplace densities. If the density is not log_gaussian or log_laplace, this value is not used and can be empty. If none of the densities are log_gaussian or log_laplace, this column is optional.

lower

is a float containing the lower limit for the truncated density for this prior. This column is optional, if it does not appear or its value is empty, there is no lower bound.

upper

is a float containing the upper limit for the truncated density for this prior. This column is optional, if it does not appear or its value is empty, there is no upper bound.
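A small prior.csv table using the columns above might be built as in the sketch below. The prior names and numeric values are hypothetical, and empty fields stand for values that are not used or not bounded.

```python
import csv
import io

columns = ["name", "density", "mean", "std", "eta", "lower", "upper"]

# Hypothetical priors; keys that are missing are written as empty fields.
rows = [
    {"name": "uniform_eps_1", "density": "uniform",
     "mean": "0.01", "lower": "1e-6", "upper": "1.0"},
    {"name": "gauss_01", "density": "gaussian",
     "mean": "0.0", "std": "1.0"},
    {"name": "log_gauss_01", "density": "log_gaussian",
     "mean": "0.0", "std": "0.1", "eta": "1e-5"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="",
                        lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
prior_text = buffer.getvalue()   # write this text to fit_dir/prior.csv

# No two priors can have the same name
names = [row["name"] for row in csv.DictReader(io.StringIO(prior_text))]
assert len(names) == len(set(names))
print(prior_text)
```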

parent_rate.csv

This file specifies the prior for the root node parent rates. These are no effect rates; i.e., no random or covariate effects are included in these rates. For each value of rate_name, this file must have a rectangular grid in age and time .

rate_name

is a string containing the name for the non-zero rates (except for omega which is specified by covariate.csv).

age

is a float containing the age for this grid point.

time

is a float containing the time for this grid point.

value_prior

is a string containing the name of the value prior for this grid point. Either value_prior or const_value must be non-empty but not both. The standard deviation for a value prior is always in the same units as the mean for the prior, even when the prior is log-scaled.

dage_prior

is a string containing the name of the dage prior for this grid point. If dage_prior is empty, there is no prior for the forward age difference of this rate at this grid point. This prior cannot be censored. If a dage prior is log-scaled, the standard deviation is for the difference w.r.t age of the offset log transform of the corresponding model variable. Otherwise, the standard deviation is for the difference w.r.t age of the corresponding model variable.

dtime_prior

is a string containing the name of the dtime prior for this grid point. If dtime_prior is empty, there is no prior for the forward time difference of this rate at this grid point. This prior cannot be censored. If a dtime prior is log-scaled, the standard deviation is for the difference w.r.t time of the offset log transform of the corresponding model variable. Otherwise, the standard deviation is for the difference w.r.t time of the corresponding model variable.

const_value

is a float specifying a constant value for this grid point or the empty string. This is equivalent to the upper and lower limits being equal to this value. Either const_value or value_prior must be non-empty but not both.

child_rate.csv

This csv file specifies the prior for the child rate effects pini, iota, rho and chi. These are random effects. (The parent and child priors for omega are created automatically using the omega column in the covariate.csv file. )

rate_name

this string is the name of this rate and is one of the following: pini, iota, rho, chi . If one of these rates does not appear in child_rate.csv , that rate has no random effects.

value_prior

is a string containing the name of the value prior for this child rate effects. The child rate effects are constant in age and time (this is a limitation of the csv.fit).

Note that the child rate effects are in log of rate space. In other words, if \(u\) is a child rate effect and \(p(a, t)\) is the corresponding parent rate as a function of age and time, the corresponding child rate as a function of age and time \(c(a, t)\) is

\[c(a,t) = \exp(u) p(a,t)\]
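As a numerical illustration of this formula (the function name and values are hypothetical), an effect of zero leaves the parent rate unchanged and an effect of log(2) doubles it:

```python
import math

def child_rate(u, parent_rate):
    """Child rate c(a,t) = exp(u) * p(a,t) for a child rate effect u."""
    return math.exp(u) * parent_rate

print(child_rate(0.0, 0.02))            # same as the parent rate
print(child_rate(math.log(2.0), 0.01))  # twice the parent rate
```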

mulcov.csv

This csv file specifies the covariate multipliers.

covariate

this string is the name of the covariate for this multiplier. The covariate one is an absolute covariate that is always equal to one and sex is the splitting covariate ( sex is sex name in sex_name2value ). All the other covariates are specified by covariate.csv. If one of these covariates appears in the absolute_covariates list it is an absolute covariate. The other covariates in covariate.csv are relative covariates . For relative covariates, the average of the covariate (for the current node and sex being fit) is subtracted before it is multiplied by a multiplier.

type

This string is rate_value, meas_value, or meas_noise.

rate_value

The multiplier times the covariate affects the rate in the effected column; i.e. the exponential of the product multiplies the rate.

meas_value

The multiplier times the covariate affects the model for the integrand in the effected column; i.e. the exponential of the product multiplies the model for the integrand.

meas_noise

The multiplier times the covariate affects the model for the measurement noise for the integrand in the effected column. To be more specific, the product is added to the standard deviation for measurements for the integrand.
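For the rate_value and meas_value types, the effect of a multiplier can be sketched as below. The function name and values are hypothetical. For a relative covariate the reference is the covariate average for the node and sex being fit, and for an absolute covariate the reference is zero.

```python
import math

def apply_multiplier(base, multiplier, cov_value, cov_ref):
    """Multiply a rate (or integrand model) by the exponential of
    multiplier * (covariate - reference)."""
    return math.exp(multiplier * (cov_value - cov_ref)) * base

# Covariate equal to its reference: no effect
print(apply_multiplier(0.01, 0.5, cov_value=3.0, cov_ref=3.0))
# Covariate one unit above its reference with a positive multiplier
print(apply_multiplier(0.01, 0.5, cov_value=4.0, cov_ref=3.0))
```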

effected

is the name of the integrand or rate affected by this multiplier; see type above.

value_prior

is a string containing the name of the value prior for this covariate multiplier. Note that the covariate multipliers are constant in age and time (this is a limitation of the csv.fit). Either value_prior or const_value must be non-empty but not both.

const_value

is a float specifying a constant value for this grid point or the empty string. This is equivalent to the upper and lower limits being equal to this value. Either value_prior or const_value must be non-empty but not both.

data_in.csv

This csv file specifies the data set with each row corresponding to one data point.

Optional Columns

The following columns are optional and the empty string is used for all the rows of a column that does not appear: meas_std, eta, nu, sample_size.

data_id

is an Index Column for data_in.csv. This is necessary so that the dismod_at data table data_id values correspond to the data_in.csv data_id values.

integrand_name

This string is a dismod_at integrand name; e.g. Sincidence.

density_name

This string is one of the following dismod_at density names:

gaussian

cen_gaussian

log_gaussian

cen_log_gaussian

laplace

cen_laplace

log_laplace

cen_log_laplace

students

log_students

binomial

node_name

This string identifies the node corresponding to this data point.

sex

This string is the sex name for this data point.

age_lower

This float is the lower age limit for this data row.

age_upper

This float is the upper age limit for this data row.

time_lower

This float is the lower time limit for this data row.

time_upper

This float is the upper time limit for this data row.

meas_value

This float is the measured value for this data point.

meas_std

This float is the standard deviation of the measurement noise for this data point. This standard deviation is always in the same units as the data, even when the density is log-scaled.

binomial

The meas_std must be empty when the density is binomial. In this case the standard deviation corresponding to a measurement is a function of the sample size and the model for the mean of the data. This requires that the model for the mean of the data is positive; i.e., greater than zero.

eta

This float is the offset in the log transformation for the log densities (it can be empty if this is not a log density).

nu

This float is the degrees of freedom for the students densities (it can be empty if this is not a students density).

sample_size

This float should be empty if the density is not binomial. Otherwise, it is the sample size for a binomial distribution (see csv.binomial for an example):

y

is the meas_value for this data

n

is the sample size

k

is the counts in the binomial distribution; k = y * n .

p

is the success rate; p is the mean of y

The log of the binomial density function is:

\[\log {n \choose k} + k \log(p) + (n-k) \log(1 - p)\]

We suggest using a gaussian approximation of the binomial when p * n is greater than 5. This approximation will be faster and less likely to have evaluation issues during the optimization. If you do not have a good idea as to the value of p, use a gaussian when k = y * n is greater than 5.
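The log density and the suggested rule of thumb can be sketched as follows. The function names are hypothetical. Because sample_size is a float, the binomial coefficient is evaluated with the log gamma function, which is an assumption of this sketch rather than a statement about how dismod_at computes it.

```python
import math

def log_binomial(y, n, p):
    """Log of the binomial density with k = y * n successes in n trials."""
    k = y * n
    log_choose = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    )
    return log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)

def suggest_gaussian(y, n, p=None):
    """True when p * n (or k = y * n if p is unknown) is greater than 5."""
    expected_count = p * n if p is not None else y * n
    return expected_count > 5

print(log_binomial(0.5, 2, 0.5))      # log of (2 choose 1) * 0.5 * 0.5
print(suggest_gaussian(0.1, 100))     # k = 10
print(suggest_gaussian(0.1, 20))      # k = 2
```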

hold_out

This integer is one (zero) if this data point is held out (not held out) from the fit.

Output Files

root.db

This is the dismod_at sqlite database corresponding to the root node for the cascade.

all_node.db

This is the at_cascade sqlite all node database for the cascade.

dismod.db

  1. There is a subdirectory of fit_dir with the name of the root node. The dismod.db file in this directory is the dismod_at database corresponding to the fit and predictions for the root node fit for both sexes.

  2. The root node directory has a female and male subdirectory. These directories contain the dismod.db database for the root node fit of the corresponding sex.

  3. For each node between the root node and the fit_goal nodes , and for the female and male sex, there is a directory. This is directly below the directory for its parent node and same sex. It contains the dismod.db database for the corresponding fit.

option_fit_out.csv

This is a copy of option_fit.csv with the defaults filled in for missing values.

fit_predict.csv

These are the predictions for all of the nodes at the age, time and covariate values specified in covariate.csv. The predictions are done using the optimal variable values.

avgint_id

Each avgint_id corresponds to a different value for age, time, or integrand in the sam_predict file. The age and time values come from the covariate.csv file. The integrands come from the predict_integrand.csv file.

integrand_name

is the integrand for this sample; it is one of the integrand names in predict_integrand.csv.

avg_integrand

This float is the mode value for the average of the integrand, with covariate and other effects but without measurement noise.

node_name

is the node name for this sample and cycles through the nodes in covariate.csv.

age

is the age for this prediction and is one of the ages in covariate.csv.

time

is the time for this prediction and is one of the times in covariate.csv.

sex

is the sex name for this data point; i.e., female, both, or male.

covariate_names

The rest of the columns are covariate names and contain the value of the corresponding covariate in covariate.csv.

sam_predict.csv

This is a sampling of the predictions for all of the nodes at the age, time and covariate values specified in covariate.csv. It has the same columns as fit_predict.csv (see above) plus an extra column named sample_index.

sample_index

For each sample_index value, there is a complete set of all the values in the fit_predict.csv table. A different (independent) sample of the model variables from their posterior distribution is used to do the predictions for each sample index.
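One possible way to summarize these samples is sketched below: group the avg_integrand values by avgint_id and compute a mean and standard deviation across sample_index. The tiny table is made-up illustration data containing only three of the columns, not output from a real fit.

```python
import csv
import io
import statistics
from collections import defaultdict

# Made-up excerpt with only the columns needed for this summary
sam_predict_text = """avgint_id,sample_index,avg_integrand
0,0,0.010
0,1,0.012
1,0,0.020
1,1,0.024
"""

# Collect the sampled predictions for each avgint_id
samples = defaultdict(list)
for row in csv.DictReader(io.StringIO(sam_predict_text)):
    samples[int(row["avgint_id"])].append(float(row["avg_integrand"]))

# Mean and sample standard deviation across the sample_index values
summary = {
    avgint_id: (statistics.mean(values), statistics.stdev(values))
    for avgint_id, values in samples.items()
}
for avgint_id, (mean, std) in sorted(summary.items()):
    print(avgint_id, mean, std)
```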