csv.break_fit_pred

Breakup Fitting and Prediction and Run in Parallel

csv_file

This dictionary is used to hold the data corresponding to the csv files for this example:

csv_file = dict()

node.csv

For this example the root node, n0, has two children, n1 and n2.

csv_file['node.csv'] = \
'''node_name,parent_name
n0,
n1,n0
n2,n0
'''
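As a quick check of the tree described above, the following sketch parses the same node table with the standard csv module and builds a parent-to-children map. The names node_csv, children, and root are only for illustration and are not part of at_cascade.

```python
import csv
import io

# Same node table as in this example.
node_csv = '''node_name,parent_name
n0,
n1,n0
n2,n0
'''

# children[name] is the list of child node names for that node.
# The root node is the one whose parent_name is empty.
children = dict()
root = None
for row in csv.DictReader(io.StringIO(node_csv)):
    name, parent = row['node_name'], row['parent_name']
    children.setdefault(name, [])
    if parent == '':
        root = name
    else:
        children.setdefault(parent, []).append(name)

print(root, children)
```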

option_fit.csv

This example uses the default value for all the options in option_fit.csv except for:

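  1. random_seed is chosen using the Python time package

  2. refit_split is set to false

  3. tolerance_fixed is set to 1e-8

  4. max_number_cpu is only written to the file when the Python variable max_number_cpu below is 1; that variable should be either 1 or None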

max_number_cpu = None
random_seed    = str( int( time.time() ) )
csv_file['option_fit.csv']  = 'name,value\n'
csv_file['option_fit.csv'] += f'random_seed,{random_seed}\n'
csv_file['option_fit.csv'] += 'refit_split,false\n'
csv_file['option_fit.csv'] += 'tolerance_fixed,1e-8\n'
if max_number_cpu == 1 :
   csv_file['option_fit.csv'] += 'max_number_cpu,1\n'

option_predict.csv

This example uses the default value for all the options in option_predict.csv except for db2csv, which is set to true.

csv_file['option_predict.csv']  = 'name,value\n'
csv_file['option_predict.csv'] += 'db2csv,true\n'

covariate.csv

This example has one covariate called haqi. Other cause mortality, omega, is constant and equal to 0.02. The covariate only depends on the node and has values 1.0, 0.5, 1.5 for nodes n0, n1, n2 respectively.

csv_file['covariate.csv'] = \
'''node_name,sex,age,time,omega,haqi
n0,female,50,2000,0.02,1.0
n1,female,50,2000,0.02,0.5
n2,female,50,2000,0.02,1.5
n0,male,50,2000,0.02,1.0
n1,male,50,2000,0.02,0.5
n2,male,50,2000,0.02,1.5
'''
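The main routine in the source code for this example averages the haqi column over all the rows of covariate.csv and calls the result haqi_avg. The sketch below repeats that calculation using only the standard library (the example itself reads the file with at_cascade.csv.read_table); the name covariate_csv is only for illustration.

```python
import csv
import io

# Same covariate table as in this example.
covariate_csv = '''node_name,sex,age,time,omega,haqi
n0,female,50,2000,0.02,1.0
n1,female,50,2000,0.02,0.5
n2,female,50,2000,0.02,1.5
n0,male,50,2000,0.02,1.0
n1,male,50,2000,0.02,0.5
n2,male,50,2000,0.02,1.5
'''

# Average haqi over every row of the table (both sexes, all nodes).
table = list(csv.DictReader(io.StringIO(covariate_csv)))
haqi_sum = sum(float(row['haqi']) for row in table)
haqi_avg = haqi_sum / len(table)

print(haqi_avg)
```

For this table the average is 1.0, which is also the haqi value for the root node n0.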

fit_goal.csv

The goal is to fit the model for nodes n1 and n2.

csv_file['fit_goal.csv'] = \
'''node_name
n1
n2
'''

predict_integrand.csv

For this example we want to know the values of Sincidence and prevalence for each of the goal nodes. (Note that Sincidence is a direct measurement of iota.)

csv_file['predict_integrand.csv'] = \
'''integrand_name
Sincidence
prevalence
'''

prior.csv

We define three priors:

uniform_-1_1

a uniform distribution on [ -1, 1 ]

uniform_eps_1

a uniform distribution on [ 1e-6, 1 ]

gauss_01

a mean 0 standard deviation 1 Gaussian distribution

csv_file['prior.csv'] = \
'''name,lower,upper,mean,std,density
uniform_-1_1,-1.0,1.0,0.5,1.0,uniform
uniform_eps_1,1e-6,1.0,0.5,1.0,uniform
gauss_01,,,0.0,1.0,gaussian
'''

parent_rate.csv

The only non-zero rates are omega and iota (omega is known and specified by the covariate.csv file). The model for iota is constant (with respect to age and time). Its value prior is uniform_eps_1. It does not have any dage or dtime priors because it is constant (so there are no age or time differences between grid values).

csv_file['parent_rate.csv'] = \
'''rate_name,age,time,value_prior,dage_prior,dtime_prior,const_value
iota,0.0,0.0,uniform_eps_1,,,
'''

child_rate.csv

The child rates are random effects that represent the difference between the rate for a node being fit and the rate for one of its child nodes. These random effects are different for each child node. They are constant in age and time, so age and time do not appear in child_rate.csv. In this example, when fitting n0, the child nodes are n1 and n2. When fitting n1 and n2, there are no child nodes (no random effects). Our prior for the random effects is gauss_01.

csv_file['child_rate.csv'] = \
'''rate_name,value_prior
iota,gauss_01
'''

mulcov.csv

There is one covariate multiplier; it multiplies haqi and affects iota. The prior distribution for this multiplier is uniform_-1_1.

csv_file['mulcov.csv'] = \
'''covariate,type,effected,value_prior,const_value
haqi,rate_value,iota,uniform_-1_1,
'''

data_in.csv

The data_in.csv file has one point for each combination of node and sex. The integrand is Sincidence (a direct measurement of iota). The age and time intervals depend on the node: [0, 10] and [1990, 2000] for n0, [10, 20] and [2000, 2010] for n1, and [20, 30] and [2010, 2020] for n2. (These do not really matter because the true iota for this example is constant.) The measurement standard deviation is 1e-4 (during the fitting) and none of the data is held out. The small standard deviation during the fitting makes checking the posterior samples easier.

header  = 'data_id, integrand_name, node_name, sex, age_lower, age_upper, '
header += 'time_lower, time_upper, meas_value, meas_std, hold_out, '
header += 'density_name, eta, nu'
csv_file['data_in.csv'] = header + \
'''
0, Sincidence, n0, female, 0,  10, 1990, 2000, 0.0000,  1e-4, 0, gaussian, ,
1, Sincidence, n0, male,   0,  10, 1990, 2000, 0.0000,  1e-4, 0, gaussian, ,
2, Sincidence, n1, female, 10, 20, 2000, 2010, 0.0000,  1e-4, 0, gaussian, ,
3, Sincidence, n1, male,   10, 20, 2000, 2010, 0.0000,  1e-4, 0, gaussian, ,
4, Sincidence, n2, female, 20, 30, 2010, 2020, 0.0000,  1e-4, 0, gaussian, ,
5, Sincidence, n2, male,   20, 30, 2010, 2020, 0.0000,  1e-4, 0, gaussian, ,
'''
csv_file['data_in.csv'] = csv_file['data_in.csv'].replace(' ', '')

The measurement value meas_value is 0.0000 above and gets replaced by the following code:

      haqi              = node2haqi[node_name]
      effect            = true_mulcov_haqi * (haqi - haqi_avg)
      iota              = math.exp(effect) * no_effect_iota
      row['meas_value'] = float_format.format( iota )
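To make the replacement concrete, here is a small stand-alone sketch that evaluates the same formula for each node. The constants are copied from the source code for this example (true_mulcov_haqi = 0.5, no_effect_iota = 0.1, float_format = '{0:.5g}'), and haqi_avg = 1.0 is the row average of the haqi column in covariate.csv. The dictionary meas_value is only for illustration.

```python
import math

# Values taken from this example.
node2haqi = {'n0': 1.0, 'n1': 0.5, 'n2': 1.5}
haqi_avg = 1.0            # average of haqi over the rows of covariate.csv
true_mulcov_haqi = 0.5    # true covariate multiplier used to simulate data
no_effect_iota = 0.1      # iota when haqi equals haqi_avg
float_format = '{0:.5g}'

# Simulated measurement value for each node, formatted as in data_in.csv.
meas_value = dict()
for node_name, haqi in node2haqi.items():
    effect = true_mulcov_haqi * (haqi - haqi_avg)
    iota = math.exp(effect) * no_effect_iota
    meas_value[node_name] = float_format.format(iota)

print(meas_value)
```

The effect is zero for n0, so its value is 0.1. For n1 the effect is -0.25 and the value is about 0.0779; for n2 the effect is +0.25 and the value is about 0.128.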

breakup_computation

Sometimes it is useful to fit some nodes, look at the results, and, if they are good, continue the computation to the entire fit goal set. This is done if breakup_computation is true (see the source code below):

breakup_computation = True
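When the computation is broken up, the source code runs the pieces either one after another (max_number_cpu is 1) or as separate processes that are started and then joined. The sketch below shows only that start-and-join pattern. The function stand_in_job is a hypothetical placeholder that writes a marker file; it is not an at_cascade routine, and run_jobs is likewise a name made up for this illustration.

```python
import multiprocessing
import os
import tempfile

def stand_in_job(out_dir, job_name):
    # Hypothetical stand-in for a fit or predict job:
    # it only writes a marker file named after the job.
    with open(os.path.join(out_dir, f'{job_name}.txt'), 'w') as file_out:
        file_out.write('done\n')

def run_jobs(out_dir, job_names, max_number_cpu):
    # Run serially when max_number_cpu is 1, otherwise one process per job.
    process = dict()
    for job_name in job_names:
        args = (out_dir, job_name)
        if max_number_cpu == 1:
            stand_in_job(*args)
        else:
            process[job_name] = multiprocessing.Process(
                target=stand_in_job, args=args
            )
            process[job_name].start()
    # Wait for the parallel jobs to finish (nothing to do in the serial case).
    for job_name in process:
        process[job_name].join()

if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as out_dir:
        run_jobs(out_dir, ['n1', 'n2'], max_number_cpu=None)
        print(sorted(os.listdir(out_dir)))
```

Joining every started process before reading its output matters here because, in the example, the later predict step reads the databases that the earlier continue step writes.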

Source Code

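# modules used by the code on this page
import csv
import math
import multiprocessing
import time
import at_cascade
#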
# computation
def computation(fit_dir) :
   #
   # csv.fit, csv.predict
   if not breakup_computation:
      at_cascade.csv.fit(fit_dir)
      at_cascade.csv.predict(fit_dir)
   else :
      # csv.fit: Just fit the root node
      # Since refit_split is false, this will only fit n0.both.
      at_cascade.csv.fit(fit_dir, max_node_depth = 0)
      #
      # all_node_database
      all_node_database = f'{fit_dir}/all_node.db'
      #
      # Run two continues starting at n0.both.
      # If max_number_cpu != 1, run them in parallel.
      # p_fit
      p_fit = dict()
      fit_database      = f'{fit_dir}/n0/dismod.db'
      fit_type          = [ 'both', 'fixed']
      for node_name in [ 'n1' , 'n2' ] :
         fit_goal_set  = { node_name }
         shared_unique = '_' + node_name
         args          = (
            all_node_database,
            fit_database,
            fit_goal_set,
            fit_type,
            shared_unique,
         )
         if max_number_cpu == 1 :
            at_cascade.continue_cascade( *args )
         else :
            p_fit[node_name] = multiprocessing.Process(
               target = at_cascade.continue_cascade , args = args ,
            )
            p_fit[node_name].start()
      #
      # Run one predict for n0.both using this process
      # If max_number_cpu != 1, this is in parallel with the continues above
      p_predict      = dict()
      sim_dir        = None
      start_job_name = 'n0.both'
      max_job_depth  = 0
      args            = (fit_dir, sim_dir, start_job_name, max_job_depth)
      at_cascade.csv.predict( *args )
      #
      # If max_number_cpu != 1, wait for continue jobs to finish
      for key in p_fit :
         p_fit[key].join()
      #
      #
      # Run predict starting at
      # n1.female, n1.male, n2.female, n2.male.
      # If max_number_cpu != 1, run them in parallel.
      sim_dir       = None
      max_job_depth = 0
      for node_name in [ 'n1', 'n2' ] :
         for sex in [ 'female', 'male' ] :
            start_job_name = f'{node_name}.{sex}'
            args           = (fit_dir, sim_dir, start_job_name, max_job_depth)
            if max_number_cpu == 1 :
               at_cascade.csv.predict(*args)
            else :
               key            = (node_name, sex)
               p_predict[key] = multiprocessing.Process(
                  target = at_cascade.csv.predict, args = args,
               )
               p_predict[key].start()
      #
      # If max_number_cpu != 1, wait for predict jobs to finish
      for key in p_predict :
         p_predict[key].join()
      #
      # predict
      # fit_predict.csv, sam_predict.csv
      for prefix in [ 'fit' , 'sam' ] :
         file_name = f'{fit_dir}/{prefix}_predict.csv'
         file_out  = open(file_name, 'w')
         writer    = None
         for start_job_name in [
            'n0.both', 'n1.female', 'n1.male', 'n2.female', 'n2.male'
         ] :
            file_name = f'{fit_dir}/predict/{prefix}_{start_job_name}.csv'
            file_in   = open(file_name, 'r')
            reader    = csv.DictReader(file_in)
            for row in reader :
               if writer is None :
                  # use the header of the first input file for the output file
                  writer = csv.DictWriter(file_out, fieldnames = row.keys() )
                  writer.writeheader()
               writer.writerow(row)
            file_in.close()
         file_out.close()
   return
#
# main
def main() :
   #
   # fit_dir
   fit_dir = 'build/example/csv'
   at_cascade.empty_directory(fit_dir)
   #
   # write csv files
   for name in csv_file :
      file_name = f'{fit_dir}/{name}'
      file_ptr  = open(file_name, 'w')
      file_ptr.write( csv_file[name] )
      file_ptr.close()
   #
   # node2haqi, haqi_avg
   node2haqi  = { 'n0' : 1.0, 'n1' : 0.5, 'n2' : 1.5 }
   file_name  = f'{fit_dir}/covariate.csv'
   table      = at_cascade.csv.read_table( file_name )
   haqi_sum   = 0.0
   for row in table :
      node_name = row['node_name']
      haqi      = float( row['haqi'] )
      haqi_sum += haqi
      assert haqi == node2haqi[node_name]
   haqi_avg = haqi_sum / len(table)
   #
   # data_in.csv
   float_format      = '{0:.5g}'
   true_mulcov_haqi  = 0.5
   no_effect_iota    = 0.1
   file_name         = f'{fit_dir}/data_in.csv'
   table             = at_cascade.csv.read_table( file_name )
   for row in table :
      node_name      = row['node_name']
      integrand_name = row['integrand_name']
      assert integrand_name == 'Sincidence'
      #
      # BEGIN_MEAS_VALUE
      haqi              = node2haqi[node_name]
      effect            = true_mulcov_haqi * (haqi - haqi_avg)
      iota              = math.exp(effect) * no_effect_iota
      row['meas_value'] = float_format.format( iota )
      # END_MEAS_VALUE
   at_cascade.csv.write_table(file_name, table)
   #
   # computation
   computation(fit_dir)
   #
   # prefix
   for prefix in [ 'fit' , 'sam' ] :
      #
      # predict_table
      file_name = f'{fit_dir}/{prefix}_predict.csv'
      predict_table = at_cascade.csv.read_table(file_name)
      #
      # node
      for node in [ 'n0', 'n1', 'n2' ] :
         # sex
         for sex in [ 'female', 'both', 'male' ] :
            #
            # sample_list
            sample_list = list()
            for row in predict_table :
               if row['integrand_name'] == 'Sincidence' and \
                     row['node_name'] == node and row['fit_node_name'] == node \
                        and row['sex'] == sex :
                  #
                  sample_list.append(row)
            if node == 'n0' and sex == 'both' :
               assert len(sample_list) != 0
            elif node != 'n0' and sex != 'both' :
               assert len(sample_list) != 0
            else :
               assert len(sample_list) == 0
            #
            if len(sample_list) > 0 :
               sum_avgint = 0.0
               for row in sample_list :
                  sum_avgint   += float( row['avg_integrand'] )
               avgint    = sum_avgint / len(sample_list)
               # haqi only depends on the node, so any row in sample_list works
               haqi      = float( sample_list[-1]['haqi'] )
               effect    = true_mulcov_haqi * (haqi - haqi_avg)
               iota      = math.exp(effect) * no_effect_iota
               rel_error = (avgint - iota) / iota
               if abs(rel_error) > 1e-3 :
                  msg = f'node = {node}, sex = {sex}, rel_error = {rel_error}'
                  assert False, msg
#
if __name__ == '__main__' :
   main()
   print('break_fit_pred.py: OK')
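The last step of computation above concatenates the per-job prediction files into a single fit_predict.csv or sam_predict.csv. The sketch below isolates that merge so it can be tried without at_cascade. The function name merge_csv, the column names job_name and avg_integrand, and the file contents are made up for this illustration; real prediction files have more columns.

```python
import csv
import os
import tempfile

def merge_csv(file_in_list, file_out_name):
    # Concatenate CSV files that share the same header into one file.
    # The header is taken from the first row that is read.
    writer = None
    with open(file_out_name, 'w', newline='') as file_out:
        for file_in_name in file_in_list:
            with open(file_in_name, 'r', newline='') as file_in:
                for row in csv.DictReader(file_in):
                    if writer is None:
                        writer = csv.DictWriter(
                            file_out, fieldnames=list(row.keys())
                        )
                        writer.writeheader()
                    writer.writerow(row)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write two small made-up input files, one per job.
    file_in_list = []
    for job_name in ['n0.both', 'n1.female']:
        name = os.path.join(tmp_dir, f'fit_{job_name}.csv')
        with open(name, 'w', newline='') as file_ptr:
            file_ptr.write(f'job_name,avg_integrand\n{job_name},0.1\n')
        file_in_list.append(name)
    #
    # Merge them and read the result back.
    merged_name = os.path.join(tmp_dir, 'fit_predict.csv')
    merge_csv(file_in_list, merged_name)
    with open(merged_name, 'r', newline='') as file_ptr:
        merged_rows = list(csv.DictReader(file_ptr))

print(merged_rows)
```

Using with blocks closes each input file as soon as it has been copied, which is a small design choice that avoids leaving file handles open when many jobs are merged.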