csv.simulate

Simulate A Cascade Data Set

Prototype

# at_cascade.csv.simulate
def simulate(sim_dir) :
    assert type(sim_dir) == str

sim_dir

This string is the directory name where the csv files are located.

Example

csv.simulate_xam

Input Files

option_sim.csv

This csv file has two columns, one called name and the other called value. The rows of this table are documented below by the name column. If an option name does not appear, the corresponding default value is used for the option. The final value for each of the options is reported in the file option_sim_out.csv. Because each option has a default value, new options are added in such a way that previous option_sim.csv files are still valid.

absolute_covariates

This is a space separated list of the names of the absolute covariates. The reference value for an absolute covariate is always zero. (The reference value for a relative covariate is its average for the location that is being fit.) The default value for absolute_covariates is the empty string; i.e., there are no absolute covariates.

absolute_tolerance

This float is the absolute error tolerance for the integrator. It determines the accuracy of meas_mean for integrands that require the ODE; e.g., prevalence requires the ODE and Sincidence does not. The default value for this option is 1e-5.

float_precision

This integer is the number of decimal digits of precision to include for float values in the output csv files. The default value for this option is 4.

integrand_step_size

This float is the step size in age and time used to approximate integrand averages from age_lower to age_upper and time_lower to time_upper (in data_sim.csv). It must be greater than zero. The default value for this option is 5.0.
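The role of the step size can be illustrated with a small sketch. The text above does not say which quadrature rule the simulator uses; the midpoint rule below, and the helper name average_integrand, are assumptions made only for illustration.

```python
import math

def average_integrand(f, age_lower, age_upper, time_lower, time_upper, step):
    """Hypothetical helper: average f(age, time) over a rectangle using
    sub-intervals that are no longer than step (midpoint rule, an assumption)."""
    if step <= 0.0:
        raise ValueError("integrand_step_size must be greater than zero")
    # Number of sub-intervals in each direction; at least one so that a
    # degenerate interval (lower == upper) is evaluated at a single point.
    n_age = max(1, math.ceil((age_upper - age_lower) / step))
    n_time = max(1, math.ceil((time_upper - time_lower) / step))
    d_age = (age_upper - age_lower) / n_age
    d_time = (time_upper - time_lower) / n_time
    total = 0.0
    for i in range(n_age):
        for j in range(n_time):
            age = age_lower + (i + 0.5) * d_age
            time = time_lower + (j + 0.5) * d_time
            total += f(age, time)
    return total / (n_age * n_time)

# A linear integrand is averaged exactly by the midpoint rule.
avg = average_integrand(lambda age, time: age + time, 0.0, 10.0, 2000.0, 2010.0, 5.0)
```

A smaller step gives more evaluation points per data row, which is more accurate for integrands that vary quickly in age or time but takes longer to compute.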

random_depend_sex

If new_random_effects is false, this option is not used. Otherwise, if this boolean is true, the random effects depend on sex. If it is false, for each node_name and rate, the random effects for female and male will be equal; see random_effect.csv. The default value for this option is false.

new_random_effects

If this boolean is true, a new set of random effects is generated and random_effect.csv is an output file. Otherwise random_effect.csv is an input file. The default value for this boolean is true.

random_seed

This integer is used to seed the random number generator. The default value for this option is

    random_seed = int( time.time() )

std_random_effects_rate

If new_random_effects is false, this option is not used. Otherwise, this float is the standard deviation of the random effects for the corresponding rate where rate is pini, iota, rho, or chi. The effects are in log of rate space, so this standard deviation is also in log of rate space. Hence only the rates that appear in no_effect_rate.csv have an effect (the other random effects multiply zero). The default value for this option is 0.0; i.e., there are no random effects for the corresponding rate.

trace

If this boolean is true, a trace will be printed during the simulation. This will show that the simulation is making progress and is useful for cases where there is a lot of data to simulate. The default value for this boolean is true.
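As an illustration of how the defaults combine with an option_sim.csv file, here is a minimal sketch that uses only the Python standard library. The helper read_options is hypothetical and is not part of at_cascade. The std_random_effects_rate options are left out because their exact names are not spelled out above, and values are kept as strings.

```python
import csv
import io
import time

# Defaults as documented above (values kept as the strings a csv file would hold).
DEFAULTS = {
    "absolute_covariates": "",
    "absolute_tolerance": "1e-5",
    "float_precision": "4",
    "integrand_step_size": "5.0",
    "random_depend_sex": "false",
    "new_random_effects": "true",
    "trace": "true",
}

def read_options(csv_text):
    """Hypothetical helper: merge name/value rows with the documented defaults."""
    options = dict(DEFAULTS)
    # The default seed depends on the current time.
    options["random_seed"] = str(int(time.time()))
    for row in csv.DictReader(io.StringIO(csv_text)):
        options[row["name"]] = row["value"]
    return options

# Only two options are set; the others fall back to their defaults.
example = "name,value\nfloat_precision,6\nrandom_seed,123\n"
options = read_options(example)
```

Setting random_seed explicitly, as in this example, is what makes a simulation reproducible; with the time-based default each run uses a different seed.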


node.csv

This csv file defines the node tree. It has the columns documented below.

node_name

This string is a name describing the node in a way that is easy for a human to remember. It must be unique for each row.

parent_name

This string is the node name corresponding to the parent of this node. The root node of the tree has an empty entry for this column. If a node is a parent, it must have at least two children. This avoids fitting the same location twice as one goes from parent to child nodes.
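The rules for node.csv can be checked before running a simulation. The following is a minimal sketch; check_node_table is a hypothetical helper, and requiring exactly one root is this sketch's reading of the text above.

```python
import csv
import io
from collections import Counter

def check_node_table(csv_text):
    """Hypothetical checker for the node.csv rules; returns the root node name."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = [row["node_name"] for row in rows]
    if len(set(names)) != len(names):
        raise ValueError("node_name must be unique for each row")
    # The root is the node with an empty parent_name.
    roots = [row["node_name"] for row in rows if row["parent_name"] == ""]
    if len(roots) != 1:
        raise ValueError("exactly one node must have an empty parent_name")
    # Count the children of each parent.
    child_count = Counter(row["parent_name"] for row in rows if row["parent_name"])
    for parent, count in child_count.items():
        if parent not in names:
            raise ValueError(f"parent_name {parent} is not a node_name")
        if count < 2:
            raise ValueError(f"parent {parent} must have at least two children")
    return roots[0]

node_csv = "node_name,parent_name\nn0,\nn1,n0\nn2,n0\n"
root = check_node_table(node_csv)
```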


covariate.csv

This csv file specifies the value of omega and the covariates. For each node_name it has a rectangular grid in age and time. In addition, the rectangular grid is the same for all nodes.

node_name

This string identifies the node, in node.csv, corresponding to this row.

sex

This identifies which sex this row corresponds to. The sex values female and male must appear in this table. The sex value both does not appear.

age

This float is the age, in years, corresponding to this row.

time

This float is the time, in years, corresponding to this row.

omega

This float is the value of omega (other cause mortality) for this row. Often other cause mortality is approximated by all cause mortality. Omega is a rate that is assumed to be known ahead of time and hence it is specified together with the covariates.

covariate_name

Except for node_name, sex, age, time, and omega, the columns of this file are covariates. The header row specifies the covariate_name for a column and the other rows are floats containing the corresponding covariate value. The option_sim.csv absolute_covariates specifies which covariates are absolute. All the others are relative covariates. Note that omega and sex are not referred to as covariates for this simulation.
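The grid requirement for covariate.csv can be sketched as a check. The helper check_covariate_grid and the covariate name income are hypothetical, and applying the check separately to each (node_name, sex) pair is an assumption of this sketch.

```python
import csv
import io
from itertools import product

def check_covariate_grid(csv_text):
    """Hypothetical check: every (node_name, sex) pair has the same
    rectangular grid in age and time. Returns (ages, times)."""
    points = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["node_name"], row["sex"])
        points.setdefault(key, set()).add((float(row["age"]), float(row["time"])))
    reference = None
    for key, grid in points.items():
        ages = sorted({age for age, _ in grid})
        times = sorted({time for _, time in grid})
        # Rectangular means every age is paired with every time.
        if grid != set(product(ages, times)):
            raise ValueError(f"grid for {key} is not rectangular")
        if reference is None:
            reference = (ages, times)
        elif (ages, times) != reference:
            raise ValueError(f"grid for {key} differs from the other nodes")
    return reference

# Build a small table: two nodes, two sexes, a 2 by 2 grid in age and time.
lines = ["node_name,sex,age,time,omega,income"]
for node, sex, age, time in product(["n0", "n1"], ["female", "male"], [0, 100], [2000, 2020]):
    lines.append(f"{node},{sex},{age},{time},0.01,1.0")
ages, times = check_covariate_grid("\n".join(lines))
```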


no_effect_rate.csv

This csv file specifies the grid points at which each rate is modeled during a simulation. For each rate_name it has a Rectangular Grid in age and time. These are no-effect rates; i.e., the rates without the random and covariate effects. Covariate multipliers that are constrained to zero during the fitting can be used to get variation between nodes in the no-effect rates corresponding to the fit.

rate_name

This string is iota, rho, chi, or pini and specifies the rate. If one of these rates does not appear, it is modeled as always zero. Other cause mortality omega is specified in covariate.csv .

age

This float is the age, in years, corresponding to this row.

time

This float is the time, in years, corresponding to this row.

rate_truth

This float is the no-effect rate value for all the nodes. It is used to simulate the data. As mentioned above, knocking out covariate multipliers can be used to get variation in the no-effect rates that correspond to the fit. If rate_name is pini, rate_truth should be constant with respect to age (because it is prevalence at age zero).


multiplier_sim.csv

This csv file provides information about the covariate multipliers. Each row of this file, except the header row, corresponds to a different multiplier. The multipliers are constant in age and time.

multiplier_id

This integer is an Index Column for multiplier_sim.csv.

rate_name

This string is iota, rho, chi, or pini and specifies which rate this covariate multiplier is affecting.

covariate_or_sex

If this is sex, it specifies that this multiplier multiplies the sex values, where

sex_covariate_value = { 'female' : -0.5,  'both' : 0.0, 'male' : +0.5 }

Otherwise this is one of the covariate names in the covariate.csv file and specifies which covariate value is being multiplied.

multiplier_truth

This is the value of the covariate multiplier used to simulate the data.
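To make the role of a multiplier concrete, the sketch below assumes the usual dismod_at form in which effects act in log-rate space: rate = no_effect_rate * exp(random_effect + sum of multiplier * covariate difference). That formula is not stated on this page, so treat it, the helper adjusted_rate, and the covariate name income as assumptions.

```python
import math

# Sex values from the documentation above.
sex_covariate_value = {"female": -0.5, "both": 0.0, "male": +0.5}

def adjusted_rate(no_effect_rate, random_effect, multipliers, covariate_diff):
    """Assumed form: no_effect_rate * exp(random_effect + sum_j m_j * d_j),
    where d_j is the covariate value minus its reference value."""
    effect = random_effect
    for name, multiplier in multipliers.items():
        effect += multiplier * covariate_diff[name]
    return no_effect_rate * math.exp(effect)

# Hypothetical multipliers for one rate.
multipliers = {"sex": 0.2, "income": -0.1}
# Difference from reference: the sex value for male, and income 1.5 with reference 1.0.
diff = {"sex": sex_covariate_value["male"], "income": 1.5 - 1.0}
rate = adjusted_rate(0.01, 0.0, multipliers, diff)
```

With all multipliers and the random effect equal to zero, this form returns the no-effect rate unchanged, which matches the meaning of no-effect rates described for no_effect_rate.csv.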


simulate.csv

This csv file specifies the simulated data set with each row corresponding to one data point.

simulate_id

This integer is an Index Column for simulate.csv.

integrand_name

This string is a dismod_at integrand; e.g. Sincidence.

node_name

This string identifies the node corresponding to this data point.

sex

This string is the sex for this data point.

age_lower

This float is the lower age limit for this data row.

age_upper

This float is the upper age limit for this data row.

time_lower

This float is the lower time limit for this data row.

time_upper

This float is the upper time limit for this data row.

meas_std_cv

This float is the coefficient of variation for the measurement noise for this data row; see meas_std .

meas_std_min

This float is the minimum value for the standard deviation of the measurement noise for this data row; see meas_std .


random_effect.csv

This file reports the random effect for each node, rate, and sex. If new_random_effects is true (false), this is an output (input) file. Only the rate names that appear in the rate_name column of no_effect_rate.csv are included in random_effect.csv. (Random effects for rates not in no_effect_rate.csv have no effect.)

node_name

This string identifies the row in node.csv that this row corresponds to. All of the nodes in the node table are present in this file.

rate_name

This string is one of the rate names in no_effect_rate.csv. All of the rates in no_effect_rate.csv are present in this file.

sex

This identifies which sex the random effect corresponds to. The sex values female and male will appear and both will not appear.

random_effect

This float value is the random effect for the specified node, rate, and sex. If new_random_effects is true and random_depend_sex is false, the value in this column will not depend on the value in the sex column.

Discussion

  1. For a given parent node, rate, and sex, the sum of the random effects with respect to the child nodes is zero.

  2. All the random effects for the root node are set to zero (the root node does not have a parent node).
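The two properties above can be illustrated with a small sketch for a single rate and sex. The text states only the constraints, not how they are enforced; subtracting the sibling mean from normal draws is an assumption of this sketch, and simulate_random_effects is a hypothetical helper.

```python
import random

def simulate_random_effects(parent_of, std, rng):
    """Sketch: draw normal effects for each group of siblings, then subtract
    the group mean so the effects sum to zero. The root effect is zero."""
    children = {}
    for node, parent in parent_of.items():
        if parent is not None:
            children.setdefault(parent, []).append(node)
    # Nodes without a parent (the root) get a zero effect.
    effect = {node: 0.0 for node, parent in parent_of.items() if parent is None}
    for parent, siblings in children.items():
        draws = [rng.gauss(0.0, std) for _ in siblings]
        mean = sum(draws) / len(draws)
        for node, draw in zip(siblings, draws):
            effect[node] = draw - mean
    return effect

# Hypothetical node tree: n0 is the root; n1 and n2 are its children;
# n3 and n4 are children of n1.
parent_of = {"n0": None, "n1": "n0", "n2": "n0", "n3": "n1", "n4": "n1"}
effect = simulate_random_effects(parent_of, std=0.2, rng=random.Random(0))
```

Note that centering the draws slightly reduces their spread relative to std, so this sketch is not a statement about the exact distribution used by the simulator.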


Output Files

option_sim_out.csv

This is a copy of option_sim.csv with the defaults filled in for missing values.

data_sim.csv

This contains the simulated data. It is created during a simulate command and has the following columns:

simulate_id

This integer identifies the row in the simulate.csv corresponding to this row in data_sim.csv. This is an Index Column for simulate.csv and data_sim.csv.

meas_mean

This float is the mean value for the measurement. This is the model value without any measurement noise. It corresponds to the simulation value for all the model variables and covariates. We refer to this as the true value for the average integrand even when we have model misspecification; i.e., when the set of model variables or covariates in csv.simulate is different from the set in csv.fit.

meas_std

This float is the measurement standard deviation for the simulated data point. This standard deviation applies before censoring and is given by

meas_std = max ( meas_std_min , meas_std_cv * meas_mean )

where meas_std_min is the minimum measurement standard deviation, and meas_std_cv is the coefficient of variation for the measurement noise.

meas_value

This float is the simulated measured value. The data will be generated with a normal distribution that has mean meas_mean and standard deviation meas_std . If the resulting measurement value would be less than zero, the value zero is used; i.e., a censored normal is used to simulate the data.
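The definitions of meas_std and meas_value can be sketched in a few lines. The helper simulate_measurement is hypothetical, and Python's random.gauss stands in for whatever random number generator the simulator actually uses.

```python
import random

def simulate_measurement(meas_mean, meas_std_cv, meas_std_min, rng):
    """Return (meas_std, meas_value) for one data point."""
    # Standard deviation before censoring.
    meas_std = max(meas_std_min, meas_std_cv * meas_mean)
    # Censored normal: negative draws are replaced by zero.
    meas_value = max(0.0, rng.gauss(meas_mean, meas_std))
    return meas_std, meas_value

rng = random.Random(123)
# Here meas_std_cv * meas_mean = 0.002 exceeds meas_std_min = 0.001.
meas_std, meas_value = simulate_measurement(0.02, 0.1, 0.001, rng)
# Here meas_mean is zero, so meas_std_min determines the standard deviation.
floor_std, floor_value = simulate_measurement(0.0, 0.1, 0.5, rng)
```

The meas_std_min floor keeps the noise from vanishing when meas_mean is at or near zero.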

covariate_name

For each covariate_name there is a column with this name in data_sim.csv. The values in these columns are floats corresponding to the covariate value at the midpoint of the age and time intervals for this data point. This value is obtained using bilinear interpolation of the covariate values in covariate.csv. The interpolant is extended as constant in age (time) for points outside the age range (time range) in the covariate.csv file.
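Bilinear interpolation with constant extension can be sketched as follows. The function interpolate and the grid layout values[i][j] (covariate at ages[i], times[j]) are choices made for this illustration, not the simulator's internal representation.

```python
from bisect import bisect_right

def interpolate(ages, times, values, age, time):
    """Bilinear interpolation on a rectangular grid with constant extension.
    ages and times are sorted; values[i][j] is the value at ages[i], times[j]."""
    def bracket(grid, x):
        # Clamp x to the grid range; this gives the constant extension.
        x = min(max(x, grid[0]), grid[-1])
        if len(grid) == 1:
            return 0, 0, 0.0
        upper = min(max(bisect_right(grid, x), 1), len(grid) - 1)
        lower = upper - 1
        weight = (x - grid[lower]) / (grid[upper] - grid[lower])
        return lower, upper, weight
    i0, i1, w_age = bracket(ages, age)
    j0, j1, w_time = bracket(times, time)
    # Interpolate in time at the two bracketing ages, then in age.
    at_age0 = (1 - w_time) * values[i0][j0] + w_time * values[i0][j1]
    at_age1 = (1 - w_time) * values[i1][j0] + w_time * values[i1][j1]
    return (1 - w_age) * at_age0 + w_age * at_age1

ages = [0.0, 100.0]
times = [2000.0, 2020.0]
values = [[1.0, 2.0], [3.0, 4.0]]
# Data point with age interval [40, 60] and time interval [2005, 2015].
mid_value = interpolate(ages, times, values, (40.0 + 60.0) / 2, (2005.0 + 2015.0) / 2)
# A point outside the grid uses the nearest edge values.
outside_value = interpolate(ages, times, values, 150.0, 1990.0)
```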