Computing Health Expectancies using IMaCh

(a Maximum Likelihood Computer Program using Interpolation of Markov Chains)

 

INED and EUROREVES

March 2000


Authors of the program: Nicolas Brouard, senior researcher at the Institut National d'Etudes Démographiques (INED, Paris) in the "Mortality, Health and Epidemiology" Research Unit

and Agnès Lièvre

Contribution to the mathematics: C. R. Heathcote (Australian National University, Canberra).

Contact: Agnès Lièvre (lievre@ined.fr)



Introduction

This program computes Healthy Life Expectancies from cross-longitudinal data. Within the family of Health Expectancies (HE), Disability-free life expectancy (DFLE) is probably the most important index to monitor. In low mortality countries, there is a fear that when mortality declines, the increase in DFLE is not proportionate to the increase in total Life expectancy. This case is called the Expansion of morbidity. Most of the data collected today, in particular by the international REVES network on Health expectancy, and most HE indices based on these data, are cross-sectional. This means that the information collected comes from a single cross-sectional survey: people of various ages (but mostly old people) are surveyed on their health status at a single date. The proportion of people disabled at each age can then be measured at that date. This age-specific prevalence curve is then used to distinguish, within the stationary population (which, by definition, is the life table estimated from the vital statistics on mortality at the same date), the disabled population from the disability-free population. Life expectancy (LE) (or total population divided by the yearly number of births or deaths of this stationary population) is then decomposed into DFLE and DLE (life expectancy with disability). This method of computing HE is usually called the Sullivan method (from the name of the author who first described it).

Age-specific proportions of disabled people are very difficult to forecast because each proportion corresponds to the historical conditions of the cohort: it is the result of the past flows into disability and back to recovery, up to today. The age-specific intensities (or incidence rates) of entering disability or recovering good health reflect current conditions and can therefore be used at each age to forecast the future of this cohort. For example, if a country is improving its prosthesis technology, the incidence of recovering the ability to walk will be higher at each (old) age, but the prevalence of disability will only slightly reflect the improvement because the prevalence is mostly affected by the history of the cohort and not by recent period effects. To measure the period improvement we have to simulate the future of a cohort of new-borns entering or leaving the disability state at each age, or dying, according to the incidence rates measured today on different cohorts. The proportion of people disabled at each age in this simulated cohort will be much lower (using the example of an improvement) than the proportions observed at each age in a cross-sectional survey. This new prevalence curve introduced in a life table will give a much more current and realistic HE level than the Sullivan method, which mostly measures the history of health conditions in this country.

Therefore, the main question is how to measure incidence rates from cross-longitudinal surveys. This is the goal of the IMaCh program. From your data and using IMaCh you can estimate period HE and not only Sullivan's HE. The standard errors of the HE are also computed.

A cross-longitudinal survey consists of a first survey ("cross") where individuals of different ages are interviewed on their health status or degree of disability. At least a second wave of interviews ("longitudinal") should measure each individual's new health status. Health expectancies are computed from the transitions observed between waves and are computed for each degree of severity of disability (number of life states). The more degrees you consider, the more time is necessary to reach the Maximum Likelihood of the parameters involved in the model. Considering only two states of disability (disabled and healthy) is generally enough, but the computer program also works with more health statuses.

The simplest model is the multinomial logistic model where pij is the probability of being observed in state j at the second wave conditional on being observed in state i at the first wave. Therefore a simple model is: log(pij/pii)= aij + bij*age+ cij*sex, where 'age' is age and 'sex' is a covariate. The advantage claimed by this computer program is that if the delay between waves is not identical for each individual, or if some individuals missed an interview, the information is not rounded or lost, but taken into account using an interpolation or extrapolation. hPijx is the probability of being observed in state j at age x+h conditional on the observed state i at age x. The delay 'h' can be split into an exact number (nh*stepm) of unobserved intermediate states. This elementary transition (by month, quarter, semester or year) is modeled as a multinomial logistic. The hPx matrix is simply the matrix product of nh*stepm elementary matrices and the contribution of each individual to the likelihood is simply hPijx.
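
The construction of hPx can be illustrated with a minimal sketch. It is not the IMaCh source code: the parameter values in PARAMS are hypothetical, the function names are invented for this illustration, and it assumes that age is expressed in years and advances by one step between elementary matrices.

```python
import math

# Hypothetical parameters (a_ij, b_ij) for log(p_ij / p_ii) = a_ij + b_ij * age.
# States: 1 = healthy, 2 = disabled, 3 = dead (absorbing).
PARAMS = {
    (1, 2): (-12.0, 0.09),
    (1, 3): (-8.0, 0.03),
    (2, 1): (-2.0, -0.03),
    (2, 3): (-8.0, 0.04),
}

def elementary_matrix(params, age):
    """One-step transition matrix from the multinomial logistic model."""
    P = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    for i in (1, 2):
        # odds[j] = p_ij / p_ii for every destination j different from i
        odds = {j: math.exp(a + b * age)
                for (start, j), (a, b) in params.items() if start == i}
        denom = 1.0 + sum(odds.values())
        P[i - 1][i - 1] = 1.0 / denom
        for j, value in odds.items():
            P[i - 1][j - 1] = value / denom
    return P

def matmul(A, B):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def h_step_matrix(params, age, n_steps, step_years):
    """Product of n_steps elementary matrices, ageing by step_years each step."""
    result = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    for k in range(n_steps):
        result = matmul(result, elementary_matrix(params, age + k * step_years))
    return result

# An individual seen healthy (1) at age 70 and disabled (2) two years later,
# with monthly elementary steps: 24 steps of 1/12 year.
hP = h_step_matrix(PARAMS, 70.0, 24, 1.0 / 12.0)
contribution = hP[0][1]  # hP12 at x = 70, this individual's likelihood term
print("likelihood contribution:", contribution)
```

Because the delay is expressed as a number of elementary steps, two individuals re-interviewed after 23 and 26 months simply use 23 and 26 factors in the product; no rounding to a common interval is needed.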

The program presented in this manual is a quite general program named IMaCh (for Interpolated MArkov CHain), designed to analyse transition data from longitudinal surveys. The first step is the estimation of the parameters of a model of transition probabilities between an initial status and a final status. From there, the computer program produces indicators such as observed and stationary prevalence, life expectancies and their variances, and graphs. Our transition model consists of absorbing and non-absorbing states with the possibility of return across the non-absorbing states. The main advantage of this package, compared to other programs for the analysis of transition data (for example Proc Catmod of SAS®), is that all of the individual information is used even if an interview is missing, a status or a date is unknown, or the delay between waves is not identical for each individual. The program can be executed according to parameters: selection of a sub-sample, number of absorbing and non-absorbing states, number of waves taken into account (the user inputs the first and the last interview), a tolerance level for the maximization function, the periodicity of the transitions (we can compute annual, quarterly or monthly transitions), and covariates in the model. It works on Windows or on Unix.


On what kind of data can it be used?

The minimum data required for a transition model is the recording of a set of individuals interviewed at a first date and interviewed again at least one more time. From the observations of an individual, we obtain a follow-up over time of the occurrence of a specific event. In this documentation, the event is related to health status at older ages, but the program can be applied to many longitudinal studies in different contexts. To build the data file explained in the next section, you must have the month and year of each interview and the corresponding health status. In order to get age, the date of birth (month and year) is also required (a missing value is allowed for the month). The date of death (month and year) is an important piece of information, also required if the individual is dead. Shorter steps (e.g. a month) will more closely take into account the survival time after the last interview.


The data file

In this example, 8,000 people have been interviewed in a cross-longitudinal survey of 4 waves (1984, 1986, 1988, 1990). Some people missed 1, 2 or 3 interviews. Health statuses are healthy (1) and disabled (2). The survey is not a real one. It is a simulation of the American Longitudinal Survey on Aging. An individual is in the disability state if he or she failed one of four ADLs (Activities of daily living, like bathing, eating, walking). Therefore, even if the individuals interviewed in the sample are virtual, the information brought by this sample is close to the situation of the United States. Sex is not recorded in this sample.

Each line of the data set (named data1.txt in this first example) is an individual record whose fields are:

 

If your longitudinal survey does not include information about weights or covariates, you must fill the column with a number (e.g. 1) because a missing field is not allowed.


Your first example parameter file

#Imach version 0.63, February 2000, INED-EUROREVES

This is a comment. Comments start with a '#'.

First uncommented line

title=1st_example datafile=data1.txt lastobs=8600 firstpass=1 lastpass=4

 

Second uncommented line

ftol=1.e-08 stepm=1 ncov=2 nlstate=2 ndeath=1 maxwav=4 mle=1 weight=0

Guess values for optimization

You must write the initial guess values of the parameters for optimization. The number of parameters, N, depends on the number of absorbing states and non-absorbing states and on the number of covariates.
N is given by the formula N=(nlstate + ndeath-1)*nlstate*ncov .

Thus in the simple case with 2 covariates (the model is log (pij/pii) = aij + bij * age where intercept and age are the two covariates), 2 health degrees (1 for disability-free and 2 for disability) and 1 absorbing state (3), you must enter 8 initial values: a12, b12, a13, b13, a21, b21, a23, b23. You can start with zeros as in this example, but if you have a more precise set (for example from an earlier run) you can enter it and it will speed up the convergence.
Each of the four lines starts with indices "ij":

ij aij bij

# Guess values of aij and bij in log (pij/pii) = aij + bij * age
12 -14.155633  0.110794 
13  -7.925360  0.032091 
21  -1.890135 -0.029473 
23  -6.234642  0.022315 

or, to simplify:

12 0.0 0.0
13 0.0 0.0
21 0.0 0.0
23 0.0 0.0
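
As a cross-check of the formula for N, the following sketch counts the parameters and prints a zero-filled guess block in the layout shown above. The function names are hypothetical helpers written for this manual, not part of IMaCh.

```python
def n_parameters(nlstate, ndeath, ncov):
    """N = (nlstate + ndeath - 1) * nlstate * ncov, as given in the text."""
    return (nlstate + ndeath - 1) * nlstate * ncov

def guess_lines(nlstate, ndeath, ncov, value=0.0):
    """One line per transition i -> j (j != i), from live states only."""
    lines = []
    for i in range(1, nlstate + 1):
        for j in range(1, nlstate + ndeath + 1):
            if j == i:
                continue  # p_ii is the reference category, it has no parameters
            values = " ".join(f"{value:.1f}" for _ in range(ncov))
            lines.append(f"{i}{j} {values}")
    return lines

# Example of this manual: 2 live states, 1 death state, intercept + age.
print(n_parameters(nlstate=2, ndeath=1, ncov=2))  # 8
for line in guess_lines(nlstate=2, ndeath=1, ncov=2):
    print(line)
```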

Guess values for computing variances

This is an output if mle=1. But it can be used as an input to get the various output data files (Health expectancies, stationary prevalence etc.) and figures without rerunning the rather long maximisation phase (mle=0).

The scales are small values used for the evaluation of numerical derivatives. These derivatives are used to compute the hessian matrix of the parameters, that is the inverse of the covariance matrix, and the variances of health expectancies. Each line consists of indices "ij" followed by the initial scales (zero to simplify) associated with aij and bij.

# Scales (for hessian or gradient estimation)
12 0. 0. 
13 0. 0. 
21 0. 0. 
23 0. 0. 

Covariance matrix of parameters

This is an output if mle=1. But it can be used as an input to get the various output data files (Health expectancies, stationary prevalence etc.) and figures without rerunning the rather long maximisation phase (mle=0).

Each line starts with indices "ijk" followed by the covariances between aij and bij:

   121 Var(a12) 
   122 Cov(b12,a12)  Var(b12) 
          ...
   232 Cov(b23,a12)  Cov(b23,b12) ... Var (b23) 
# Covariance matrix
121 0.
122 0. 0.
131 0. 0. 0. 
132 0. 0. 0. 0. 
211 0. 0. 0. 0. 0. 
212 0. 0. 0. 0. 0. 0. 
231 0. 0. 0. 0. 0. 0. 0. 
232 0. 0. 0. 0. 0. 0. 0. 0.
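
The triangular layout of this block (row n carries n values, one per parameter up to and including its own variance) can be generated with a short sketch. This is an illustration with a hypothetical helper name; whether IMaCh is sensitive to the exact spacing is not stated in this manual.

```python
def covariance_lines(nlstate, ndeath, ncov, value="0."):
    """Lower-triangular covariance block with 'ijk' row labels."""
    labels = []
    for i in range(1, nlstate + 1):
        for j in range(1, nlstate + ndeath + 1):
            if j == i:
                continue
            for k in range(1, ncov + 1):  # k = 1 for a_ij, k = 2 for b_ij
                labels.append(f"{i}{j}{k}")
    # Row number n (1-based) holds n values: covariances with earlier
    # parameters followed by the variance of the parameter itself.
    return [label + " " + " ".join([value] * (n + 1))
            for n, label in enumerate(labels)]

for line in covariance_lines(nlstate=2, ndeath=1, ncov=2):
    print(line)
```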

Last uncommented line

agemin=70 agemax=100 bage=50 fage=100

Once we have obtained the estimated parameters, the program is able to calculate stationary prevalence, transition probabilities and life expectancies at any age. The choice of age ranges is useful for extrapolation. In our data file, ages vary from 70 to 102. Setting bage=50 and fage=100 makes the program compute life expectancy from age bage to age fage. As we use a model, we can compute life expectancy on a wider age range than the age range of the data. But the model can be rather wrong over large intervals.

Similarly, it is possible to get extrapolated stationary prevalence by age ranging from agemin to agemax.


Running Imach with this example

We assume that you entered your 1st_example parameter file as explained above. To run the program you should click on the imach.exe icon and enter the name of the parameter file, which is for example C:\usr\imach\mle\biaspar.txt (you can also click on the biaspar.txt icon located in C:\usr\imach\mle and drag it with the mouse onto the imach window).

The time to converge depends on the step unit that you used (a 1-month step is CPU-intensive), on the number of cases, and on the number of variables.

The program outputs many files. Most of them are files which will be plotted for better understanding.


Output of the program and graphs

Once the optimization is finished, some graphics can be made with a grapher. We use Gnuplot, an interactive plotting program that is copyrighted but freely distributed. Imach outputs the source of a gnuplot file, named 'graph.gp', which can be directly input into gnuplot.
When the run is finished, the user should enter a character for plotting and output editing.

These characters are: g for plotting (available if mle=1), e to edit output files, c to start again, and q for exiting.

Results files

- Observed prevalence in each state (and at first pass): prbiaspar.txt

The first line is the title and displays each field of the file. The first column is age. Fields 2 and 6 are the proportions of individuals in states 1 and 2 respectively, as observed during the first exam. The other fields are the number of people in each state and the total number of people at that age. The number of columns increases if the number of states is higher than 2.
The header of the file is

# Age Prev(1) N(1) N Age Prev(2) N(2) N
70 1.00000 631 631 70 0.00000 0 631
71 0.99681 625 627 71 0.00319 2 627 
72 0.97125 1115 1148 72 0.02875 33 1148 
# Age Prev(1) N(1) N Age Prev(2) N(2) N
    70 0.95721 604 631 70 0.04279 27 631

It means that at age 70, the prevalence in state 1 is 1.000 and in state 2 is 0.000. At age 71 the number of individuals in state 1 is 625 and in state 2 is 2, hence the total number of people aged 71 is 625+2=627.
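
The reading of one line of this file can be sketched as follows. The parser is a hypothetical helper, not part of IMaCh, and it assumes that every state contributes a group of four fields (Age, Prev, N(state), N) as in the header above.

```python
def parse_prevalence_row(line):
    """Split one row of the observed prevalence file into per-state groups."""
    fields = line.split()
    age = int(fields[0])
    states = []
    for k in range(0, len(fields), 4):  # one group of 4 fields per state
        states.append({
            "prev": float(fields[k + 1]),
            "n_state": int(fields[k + 2]),
            "n_total": int(fields[k + 3]),
        })
    return age, states

age, states = parse_prevalence_row("71 0.99681 625 627 71 0.00319 2 627")
total = sum(s["n_state"] for s in states)
print(age, total)  # 71 627
for s in states:
    # The printed prevalence is the state count divided by the total count.
    print(s["prev"], round(s["n_state"] / s["n_total"], 5))
```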

- Estimated parameters and covariance matrix: rbiaspar.txt

This file contains all the maximisation results:

 Number of iterations=47
 -2 log likelihood=46553.005854373667  
 Estimated parameters: a12 = -12.691743 b12 = 0.095819 
                       a13 = -7.815392   b13 = 0.031851 
                       a21 = -1.809895 b21 = -0.030470 
                       a23 = -7.838248  b23 = 0.039490  
 Covariance matrix: Var(a12) = 1.03611e-001
                    Var(b12) = 1.51173e-005
                    Var(a13) = 1.08952e-001
                    Var(b13) = 1.68520e-005  
                    Var(a21) = 4.82801e-001
                    Var(b21) = 6.86392e-005
                    Var(a23) = 2.27587e-001
                    Var(b23) = 3.04465e-005 
 
- Transition probabilities: pijrbiaspar.txt

Here are the transition probabilities Pij(x, x+nh) where nh is a multiple of 2 years. The first column is the starting age x (from age 50 to 100), the second is age (x+nh) and the others are the transition probabilities p11, p12, p13, p21, p22, p23. For example, line 5 of the file is:

 100 106 0.03286 0.23512 0.73202 0.02330 0.19210 0.78460 

and this means:

p11(100,106)=0.03286
p12(100,106)=0.23512
p13(100,106)=0.73202
p21(100,106)=0.02330
p22(100,106)=0.19210 
p23(100,106)=0.78460 

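
A quick consistency check on such a line is that the probabilities leaving each starting state sum to one. The sketch below uses a hypothetical helper and assumes three destination states per starting state, as in this example.

```python
def parse_transition_row(line, n_destinations=3):
    """Return (start age, end age, one list of probabilities per start state)."""
    fields = line.split()
    start_age, end_age = int(fields[0]), int(fields[1])
    probs = [float(v) for v in fields[2:]]
    rows = [probs[k:k + n_destinations]
            for k in range(0, len(probs), n_destinations)]
    return start_age, end_age, rows

start_age, end_age, rows = parse_transition_row(
    "100 106 0.03286 0.23512 0.73202 0.02330 0.19210 0.78460")

for i, row in enumerate(rows, start=1):
    row_sum = sum(row)          # should be 1 up to rounding of the file
    survival = 1.0 - row[-1]    # probability of still being alive at age 106
    print(f"from state {i}: sum={row_sum:.5f} survival={survival:.5f}")
```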
- Stationary prevalence in each state: plrbiaspar.txt
#Age 1-1 2-2 
70 0.92274 0.07726 
71 0.91420 0.08580 
72 0.90481 0.09519 
73 0.89453 0.10547

At age 70 the stationary prevalence is 0.92274 in state 1 and 0.07726 in state 2. This stationary prevalence differs from the observed prevalence, and this is the key point. The observed prevalence at age 70 results from the incidence of disability, incidence of recovery and mortality which occurred in the past of the cohort. Stationary prevalence results from a simulation with current incidences and mortality (estimated from this cross-longitudinal survey). It is the best predictive value of the prevalence in the future if "nothing changes in the future". This is exactly what demographers do with a Life table. Life expectancy is the expected mean time to survive if observed mortality rates (incidence of mortality) "remain constant" in the future.

- Standard deviation of stationary prevalence: vplrbiaspar.txt

The stationary prevalence has to be compared with the observed prevalence by age. But both are statistical estimates and subject to stochastic errors due to the size of the sample, the design of the survey and, for the stationary prevalence, the model used and fitted. It is possible to compute the standard deviation of the stationary prevalence at each age.

Observed and stationary prevalence in state (2=disabled) with the confidence interval: vbiaspar2.gif


This graph shows the stationary prevalence in state (2) with the confidence interval in red. The green curve is the observed prevalence (or proportion of individuals in state (2)). Without discussing the results (it is not the purpose here), we observe that the green curve is somewhat below the stationary prevalence. This suggests an increase of the disability prevalence in the future.

Convergence to the stationary prevalence of disability: pbiaspar1.gif

This graph plots the conditional transition probabilities from an initial state (1=healthy in red at the bottom, or 2=disabled in green on top) at age x to the final state 2=disabled at age x+h. Conditional means conditional on being alive at age x+h, an event of probability hP11x + hP12x when starting from state 1 and hP21x + hP22x when starting from state 2. The curves hP12x/(hP11x + hP12x) and hP22x/(hP21x + hP22x) converge with h to the stationary prevalence of disability. In order to get the stationary prevalence at age 70 we should start the process at an earlier age, e.g. 50. If the disability state is defined by severe disability criteria with only a small chance of recovery, then the incidence of recovery is low and the time to convergence is probably longer. But we don't have experience with this yet.
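
The convergence can be illustrated with a toy sketch. It is a deliberate simplification: the matrix P below is hypothetical and does not depend on age, whereas in IMaCh each elementary matrix depends on age. The point shown is only that the proportion disabled among survivors forgets the starting state as h grows.

```python
# Hypothetical one-step matrix; rows and columns are healthy, disabled, dead.
P = [[0.90, 0.07, 0.03],
     [0.20, 0.70, 0.10],
     [0.00, 0.00, 1.00]]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def conditional_disability(P, max_h):
    """For h = 1..max_h, proportion disabled among survivors, by start state."""
    hP = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    curves = []
    for h in range(1, max_h + 1):
        hP = matmul(hP, P)
        from_healthy = hP[0][1] / (hP[0][0] + hP[0][1])
        from_disabled = hP[1][1] / (hP[1][0] + hP[1][1])
        curves.append((h, from_healthy, from_disabled))
    return curves

for h, low, high in conditional_disability(P, 50):
    if h in (1, 5, 10, 25, 50):
        print(f"h={h:2d}  from healthy={low:.5f}  from disabled={high:.5f}")
```

With a lower probability of recovery (second row, first column) the two curves take more steps to meet, which matches the remark above about severe disability criteria.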

- Life expectancies by age and initial health status: erbiaspar.txt
# Health expectancies 
# Age 1-1 1-2 2-1 2-2 
70 10.7297 2.7809 6.3440 5.9813 
71 10.3078 2.8233 5.9295 5.9959 
72 9.8927 2.8643 5.5305 6.0033 
73 9.4848 2.9036 5.1474 6.0035 
For example 70 10.7297 2.7809 6.3440 5.9813 means:
e11=10.7297 e12=2.7809 e21=6.3440 e22=5.9813

For example, the life expectancy of a healthy individual at age 70 is 10.73 years in the healthy state and 2.78 in the disability state (=13.51 years). If he is disabled at age 70, his life expectancy is shorter: 6.34 in the healthy state and 5.98 in the disability state (=12.32 years). The total life expectancy is a weighted mean of both, 13.51 and 12.32; the weight is the proportion of people disabled at age 70. In order to get a pure period index (i.e. based only on incidences) we use the computed or stationary prevalence at age 70 (i.e. computed from incidences at earlier ages) instead of the observed prevalence (for example at first exam) (see below).

- Variances of life expectancies by age and initial health status: vrbiaspar.txt

For example, the covariances of life expectancies Cov(ei,ej) at age 50 are (line 3)

   Cov(e1,e1)=0.4667  Cov(e1,e2)=0.0605=Cov(e2,e1)  Cov(e2,e2)=0.0183
- Health expectancies with standard errors in parentheses: trbiaspar.txt
#Total LEs with variances: e.. (std) e.1 (std) e.2 (std) 
70 13.42 (0.18) 10.39 (0.15) 3.03 (0.10)
70 13.81 (0.18) 11.28 (0.14) 2.53 (0.09) 

Thus, at age 70 the total life expectancy, e..=13.42 years is the weighted mean of e1.=13.51 and e2.=12.32 by the stationary prevalence at age 70 which are 0.92274 in state 1 and 0.07726 in state 2, respectively (the sum is equal to one). e.1=10.39 is the Disability-free life expectancy at age 70 (it is again a weighted mean of e11 and e21). e.2=3.03 is also the life expectancy at age 70 to be spent in the disability state.
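
The weighted means described above can be reproduced from the numbers already shown in this manual: the eij at age 70 from erbiaspar.txt and the stationary prevalence at age 70 from plrbiaspar.txt. The variable names are chosen for this illustration only.

```python
# Health expectancies at age 70 by initial state i and state occupied j.
e = {(1, 1): 10.7297, (1, 2): 2.7809,
     (2, 1): 6.3440, (2, 2): 5.9813}

# Stationary prevalence at age 70, used as weights for the initial state.
w = {1: 0.92274, 2: 0.07726}

# e.j = sum over initial states i of w_i * e_ij
e_dot_1 = w[1] * e[1, 1] + w[2] * e[2, 1]   # disability-free life expectancy
e_dot_2 = w[1] * e[1, 2] + w[2] * e[2, 2]   # life expectancy with disability
e_total = e_dot_1 + e_dot_2                  # total life expectancy e..

print(f"e.1={e_dot_1:.2f}  e.2={e_dot_2:.2f}  e..={e_total:.2f}")
```

The rounded results agree with the first line of trbiaspar.txt quoted above (13.42, 10.39 and 3.03).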

Total life expectancy by age and health expectancies in states (1=healthy) and (2=disabled): ebiaspar.gif

This figure represents the health expectancies and the total life expectancy, with the confidence interval shown as dashed curves.

        

Standard deviations (obtained from the information matrix of the model) of these quantities are very useful. Cross-longitudinal surveys are costly and do not involve huge samples, generally a few thousand people; therefore it is very important to have an idea of the standard deviation of our estimates. It has been a big challenge to compute the Health Expectancy standard deviations. Don't be confused: life expectancy is, like any expected value, the mean of a distribution; but here we are not computing the standard deviation of the distribution, but the standard deviation of the estimate of the mean.

Our health expectancy estimates vary according to the sample size (and the standard deviations give confidence intervals of the estimate) but also according to the model fitted. Let us explain this in more detail.

Choosing a model means at least two kinds of choices. First we have to decide the number of disability states. Second we have to design, within the logit model family, the model: variables, covariables, confounding factors etc. to be included.

The more disability states we have, the better our demographic approach to the disability process, but the smaller the number of transitions between each pair of states and the higher the noise in the measurement. We do not have enough experience with the various models to summarize the advantages and disadvantages, but it is important to say that even if we had huge and unbiased samples, the total life expectancy computed from a cross-longitudinal survey varies with the number of states. If we define only two states, alive or dead, we find the usual life expectancy where it is assumed that at each age, people are at the same risk of dying. If we differentiate the alive state into healthy and disabled, and as the mortality from the disability state is higher than the mortality from the healthy state, we are introducing heterogeneity in the risk of dying. The total mortality at each age is the mean of the mortality in each state weighted by the prevalence in each state. Therefore if the proportion of people at each age and in each state is different from the stationary equilibrium, there is no reason to find the same total mortality at a particular age. Life expectancy, even if it is a very useful tool, relies on a very strong hypothesis of homogeneity of the population. Our main purpose is not to measure differential mortality but to measure the expected time in a healthy or disability state in order to maximise the former and minimize the latter. But the differential in mortality complicates the measurement.
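
The dependence of total mortality on the prevalence used as weights can be made concrete with a small sketch. All numbers below are hypothetical and chosen only to make the arithmetic visible; they are not estimates from the example data.

```python
def total_death_probability(prev_healthy, q_healthy, q_disabled):
    """Mean of the state-specific death probabilities weighted by prevalence."""
    return prev_healthy * q_healthy + (1.0 - prev_healthy) * q_disabled

# Hypothetical state-specific death probabilities at some age.
q_healthy, q_disabled = 0.02, 0.08

# Same incidences, two different prevalences of the healthy state.
with_90_percent_healthy = total_death_probability(0.90, q_healthy, q_disabled)
with_80_percent_healthy = total_death_probability(0.80, q_healthy, q_disabled)

print(f"{with_90_percent_healthy:.3f}")  # 0.026
print(f"{with_80_percent_healthy:.3f}")  # 0.032
```

The state-specific mortality is identical in both cases; only the mix of healthy and disabled people changes, and with it the total mortality at that age.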

Incidences of disability or recovery are not affected by the number of states if these states are independent. But incidence estimates are dependent on the specification of the model. The more covariates we add to the logit model, the better the model, but some covariates are not well measured and some are confounding factors, as in any statistical model. The procedure to "fit the best model" is similar to logistic regression, which itself is similar to regression analysis. We have not gone that far yet because we also have a severe limitation, which is the speed of the convergence. On a Pentium III, 500 MHz, even the simplest model, estimated by month on 8,000 people, may take 4 hours to converge. Also, the program is not yet a statistical package which permits a simple writing of the variables and the model to take into account in the maximisation. The current program only allows adding simple variables without interactions, like age+sex but not age+sex+age*sex. This can be done from the source code (you have to change three lines in the source code) but will never be general enough. What must be remembered is that the incidences, or probabilities of change from one state to another, are affected by the variables specified in the model.

Also, the age range of the people interviewed is linked to the age range of the life expectancy which can be estimated by extrapolation. If your sample ranges from age 70 to 95, you can clearly estimate a life expectancy at age 70 and trust your confidence interval, which is mostly based on your sample size. If you want to estimate the life expectancy at age 50, you have to rely on your model, but fitting a logistic model on an age range of 70-95 and estimating probabilities of transition outside this age range, say at age 50, is very dangerous. At least you should remember that the confidence intervals given by the standard deviations of the health expectancies hold under the strong assumption that your model is the 'true model', which is probably not the case.

- Copy of the parameter file: orbiaspar.txt

This copy of the parameter file can be useful to re-run the program while saving the old output files.


Trying an example

Now that you know how to run the program, it is time to test it on your own computer. Try for example a parameter file named imachpar.txt, which is a copy of mypar.txt included in the subdirectory of imach, mytry. Edit it to change the name of the data file to ..\data\mydata.txt if you don't want to copy it into the same directory. The file mydata.txt is a smaller file of 3,000 people but still with 4 waves.

Click on the imach.exe icon to open a window. Answer the question: 'Enter the parameter file name:'

IMACH, Version 0.63

Enter the parameter file name: ..\mytry\imachpar.txt

Most of the data files or image files generated will use the 'imachpar' string in their name. The running time is about 2-3 minutes on a Pentium III. If the execution worked correctly, the output files are created in the current directory, and should be the same as the mypar files initially included in the directory mytry.

 

Once the run is finished, the program asks for a character:

Type g for plotting (available if mle=1), e to edit output files, c to start again,

and q for exiting:

First you should enter g to make the figures and then you can edit all the results by typing e.

This software has been partly funded by Euro-REVES, a concerted action from the European Union. It will be copyrighted identically to a GNU software product, i.e. program and software can be distributed freely for non commercial use. Sources are not widely distributed today. You can get them by asking us with a simple justification (name, email, institute) mailto:brouard@ined.fr and mailto:lievre@ined.fr .

Latest version (0.63 of 16 march 2000) can be accessed at http://euroreves.ined.fr/imach