The imputation of income


1. Overview

A strong case has been made by a wide constituency of user groups for the inclusion of an income question in the 2001 Census. Nonetheless, the Census Office retains a number of legitimate concerns over the wisdom of such a course of action. In the light of these arguments, both for and against the inclusion of an income question in the census, the current government white paper on the 2001 Census (Cm 4253) has proposed that an income question should be trialled in the 1999 Census Rehearsal. A firm decision regarding the inclusion of income in the 2001 Census will be made in the light of results from this trial.

It is proposed here to take advantage of the unique nature of the data collected by this 1999 Census Rehearsal. For the first time, a UK dataset will exist that captures the income of individuals located within spatially contiguous households for a large number of enumeration districts. Analysis of these data will allow extensive exploration of the role of place in ‘determining’ income. In particular, analysis of these data will permit three key objectives to be met:

  1. An evaluation of extant and currently proposed methods of small area income imputation.
  2. An evaluation of the use of other Census based measures as proxies for income.
  3. An assessment of the effectiveness of incorporating non-Census information in the small area income imputation process, looking in particular at the role of house prices as measured by council tax bands.

Irrespective of the decision eventually reached by the government, the fruits of this research will be of use, both to the Census Office and to the wider user community.

The main outcome of the proposed research will be the identification of the best possible strategy, dependent upon data availability, for imputing small area income. If an income question is included in the 2001 Census, the research proposed here will offer the Census Office the opportunity to be involved in the evaluation of a wider range of item imputation strategies for income than would otherwise have been possible. If an income question is not included, then the Census Office will be able to demonstrate that it has actively collaborated in meeting the government’s stated preferred aim of ‘identify[ing] possible alternative means of securing relevant information’ (Cm 4253, para. 103), by helping to improve small area income imputation methodology. This would also contribute directly to ongoing efforts elsewhere within the Office for National Statistics to develop small area income estimates. In either case the benefit to the wider user community will be an improvement over the current situation with regard to the availability of accurate estimates of small area income distributions.


2. The need for income data

Consultations before the 1991 Census and again in the run up to the 2001 Census have shown almost overwhelming support for the inclusion of an income question in the Census (Rees, 1998; Cm 4253). As Dorling (1999) has argued elsewhere, it has become increasingly recognised that inequalities in income distribution appear to underpin a wide range of social phenomena, from voting and leisure habits, through to long term health prospects and, ultimately, life expectancy.

That this movement in opinion has not been confined solely to academic circles was perhaps demonstrated most graphically in the recent white paper ‘Our Healthier Nation’, which set out the government’s intent "to improve the health of the worst off in society and to narrow the health gap" (Cm 3852, p5, my emphasis).

A number of government sponsored surveys already collect information on a range of socio-economic and demographic characteristics including income (for example, the Labour Force Survey, the General Household Survey, the Family Expenditure Survey and the New Earnings Survey; reviewed in Green, 1998). However, each of these surveys fails to satisfy (secondary) user needs in one or more ways: partial coverage of the population (e.g. earners only), insufficient sample size and/or lack of spatial coding. Alternative commercial sources of spatially detailed income data suffer from problems of access and often severe sampling bias, due to the way in which data are collected (Birkin and Clarke, 1995). An income question in the Census, addressed to all members of the population, including pensioners and those on benefits as well as earners, would address all of these perceived shortcomings in currently available data.

Set against these strong arguments for the inclusion of an income question in the forthcoming 2001 Census, it needs to be recognised that there remain legitimate concerns that have yet to be fully addressed. These include the likely accuracy of any data collected due to mis-reporting, the possibility of substantial non-response to an income question, in tandem with differential non-response bias and, worst of all, the possibility of reduced overall enumeration rates. Even if the potential reduction in overall response rates remains small, the danger is that non-response would rise most amongst those groups of people who are already the hardest to enumerate. Only a careful assessment of the outcome of the 1999 Census Rehearsal can inform the final government decision on this question.


3. Current practice

Over time a wide range of methods have been adopted to impute income for small areas. These methods can be divided into two main camps: those that seek to impute aggregate measures of income for small geographic areas, and those that seek in addition to impute measures of income for individual people and households within those same areas.

3.1 Aggregate measures of income

i) Estimated aggregate income measures

Using such partial information as does exist it is possible to estimate aggregate measures of income for small areas. Such sources of information are almost all commercial in origin, and include building society data (relevant only to house buyers), credit score data and consumer surveys. All these sources of information are known to be biased, due to differential response rates and limited coverage (for example, information only on house buyers). As a result the people of most policy relevance, those on low income, tend to be precisely those missing from these large datasets. Standard statistical methods do exist to compensate for these known biases (Birkin and Clarke, 1995), but a further hurdle remains. Commercial sensitivity and/or cost means that these data are often not readily available for use by government (national or local), academics or others.

ii) Proxy variables

More generally, the single most common way in which the lack of information on income within small areas is dealt with is to refer to census variables seen as plausible substitutes for income. By common consensus these include car ownership, the occupation based measure of ‘social class’, employment status and household tenure. These indicators, either singly or in combination, have been used to underpin many studies into the impacts of (income) inequalities. Multiple ‘deprivation’ indices in common use, both within government and academia, include the Townsend index and the Department of Environment Index of Local Conditions. The problem with such studies is that, being based upon indirect measures of wealth, interpretation of results is not always straightforward. For example, having controlled for age and education, does the observed relationship between Social Class and health reflect income (or the lack of it) or social standing/status within the community? Statistically more sophisticated geodemographic area classifications are essentially used to perform the same function. All of these types of study suffer from the problem that all households/individuals within an area are necessarily assumed to be the same as the area average, playing down the real level of diversity in incomes that exists within even the smallest geographical areas. More damagingly, recent work has emphasised the highly diverse nature of such supposedly ‘similar’ areas (Voas and Williamson, 1999). In one commonly used geodemographic profile only 10% of the between-area variation in the proportion of women working full-time was ‘explained’ by the geodemographic classification. This is not to criticise the role and use of geodemographic profiles and other proxy variables, but rather to underline the shortcomings that are inevitable in any attempt to measure indirectly variations in income between individuals and areas.

3.2 Imputing income for small area population microdata

A third alternative is to impute income data for individual and household records in publicly available population microdata. Williamson (forthcoming), GSS (1997) and David et al. (1986) review a range of imputation strategies that could be adopted in this case, ranging from sub-group means through regression to neural networks and donor imputation. At present the Census Office is proposing to impute all missing variables in the 2001 Census using a revised version of the ‘Hot Deck’ imputation system adopted during the 1991 Census (Vickers and Yar, 1998). As a result, research is currently underway to fine-tune the performance of this methodology for each proposed variable in the 2001 Census, including income. Outside of government, Dale et al. (1995) have imputed income for individuals in the SAR, albeit for relatively large geographical areas, using mean earnings for population sub-groups derived from the New Earnings Survey. This deterministic type of approach ensures that the overall sub-group mean income will be correct, but also leads to reduced within-group income variance and distorts relationships with income for variables not considered in defining the imputation sub-groups. Bramley and Lancaster (1998) report the imputation of income for individuals and households contained in geographically detailed population microdata released from the Scottish House Condition Survey. Again income was imputed separately for a range of population sub-groups, but in this case a stochastic element was retained, ensuring that not only the mean but also the overall sub-group income distributions were retained. Geographically detailed microdata are not available for England and Wales. Therefore, Williamson (1995) first estimated the requisite small-area population microdata before also stochastically imputing individual and household income.
Crucially, in all three of the examples above, the lack of a direct measure of income for small areas made a thorough evaluation of the estimates produced impossible.
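
The contrast drawn above between deterministic and stochastic imputation can be illustrated with a minimal sketch. It is not the method of any of the cited studies, nor the Census Office ‘Hot Deck’ system: the sub-group labels, the income figures and the function names are all hypothetical, and the stochastic variant shown is simply a random donor drawn from the same sub-group.

```python
import random
import statistics
from collections import defaultdict

def group_donors(records):
    """Collect the observed incomes available in each imputation sub-group."""
    donors = defaultdict(list)
    for group, income in records:
        if income is not None:
            donors[group].append(income)
    return donors

def impute_mean(records):
    """Deterministic: each missing income is set to its sub-group mean."""
    means = {g: statistics.mean(v) for g, v in group_donors(records).items()}
    return [(g, means[g] if inc is None else inc) for g, inc in records]

def impute_donor(records, rng):
    """Stochastic: each missing income is copied from a random donor in the same sub-group."""
    donors = group_donors(records)
    return [(g, rng.choice(donors[g]) if inc is None else inc) for g, inc in records]

# Hypothetical (sub-group, income) records; None marks a missing income.
records = [
    ("manual", 9000), ("manual", 11000), ("manual", 16000),
    ("manual", None), ("manual", None),
    ("professional", 22000), ("professional", 30000), ("professional", None),
]

mean_filled = impute_mean(records)
donor_filled = impute_donor(records, random.Random(0))

observed = [inc for g, inc in records if g == "manual" and inc is not None]
after_mean = [inc for g, inc in mean_filled if g == "manual"]
# The sub-group mean is preserved, but the spread of incomes shrinks.
print(statistics.pvariance(observed), statistics.pvariance(after_mean))
```

In this toy example the mean-filled ‘manual’ incomes keep the sub-group mean but show a smaller variance than the observed incomes, which is the loss of within-group variance described above. The donor-filled values are always drawn from incomes actually observed in the sub-group.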


4. A new dataset

The need for a direct measure of income within small areas is clear, as are the shortcomings of all current methods of attempting, directly or indirectly, to produce synthetic estimates of small area income distributions. This serves only to underline the potential and uniqueness of the 1999 Census Rehearsal. The data collected in this rehearsal have three key features:

a) large sample size (approx. 150 000)

b) a question on income

c) an enumeration district based sampling framework, ensuring spatial contiguity between households within small areas

None of these features is by itself unique amongst government surveys, but in combination they represent an entirely new dataset. These data offer the opportunity to answer a number of key questions, the single most important of which is identifying the spatial scale at which variations in income operate. In other words, they offer the opportunity to quantify the importance of place in determining income. Do individuals with similar key characteristics (such as age and occupation) have similar incomes, irrespective of location (having taken account of known regional pay differentials), or does place matter? The same data allow a second key question to be answered: how the income of individuals/households in small areas should best be estimated.
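
One simple way of quantifying the importance of place is to ask what share of the total variation in income lies between areas rather than within them. The sketch below is an illustration only and not the analysis proposed here: it ignores individual characteristics and regional pay differentials, and the enumeration district labels and income values are invented.

```python
from statistics import mean

def between_area_share(incomes_by_area):
    """Share (0 to 1) of the total sum of squares in income lying between areas.

    Assumes at least two distinct income values, so the total is non-zero.
    """
    all_incomes = [x for incomes in incomes_by_area.values() for x in incomes]
    grand_mean = mean(all_incomes)
    total_ss = sum((x - grand_mean) ** 2 for x in all_incomes)
    between_ss = sum(
        len(incomes) * (mean(incomes) - grand_mean) ** 2
        for incomes in incomes_by_area.values()
    )
    return between_ss / total_ss

# Hypothetical incomes (in thousands of pounds) for two enumeration districts.
example = {"ED1": [10, 20], "ED2": [30, 40]}
share = between_area_share(example)
print(share)
```

A share close to 1 would indicate that incomes differ mainly between areas, whereas a share close to 0 would indicate that most of the diversity in income is found within areas.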

The quality of the data collected in the 1999 Census Rehearsal is, at the time of writing, an unknown. However, it can be anticipated that it will have a number of key characteristics:

  • under-enumeration (due both to non-response and missed households)
  • some degree of mis-reporting of income, both deliberate and accidental, particularly for sub-groups of the population such as the self-employed and pensioners
  • increased levels of item non-response for certain questions, but especially for income

These shortcomings should not be denied, but nor should they be over-played. The collected data can be expected to have a fairly high response rate (well in excess of 50%, on average), and the vast majority of responding households (well in excess of 80%) can be expected to have completed the full test Census forms, including the question on income. These data are entirely adequate for evaluating the importance of place in determining income, and for evaluating alternative income imputation schemes. However, they are unlikely to be of sufficient quality to provide accurate estimates of income for smaller geographic areas, especially for those ‘hard to count’ areas with the lowest response rates.


5. Proposed research

The research proposed here has three main aims, set within the broader framework of exploring the importance of place in determining individual and household incomes:

To evaluate currently existing and proposed methods of income imputation

To evaluate the use of other Census based measures as proxies for income

To evaluate the incorporation of other, non-Census, information in income imputation

5.1 Evaluating methods of income imputation

Section 3 above reviewed the main methods of small-area income imputation that have been used in the recent past, or are proposed for use in the near future. Using each of these methodologies in turn, it is proposed to evaluate the effectiveness of imputation on the Census Rehearsal dataset, by attempting to impute the income for individuals with known income and, crucially, spatial location.
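
The evaluation described above amounts to hiding incomes that are in fact known, imputing them, and comparing the imputed values with the truth, both for individuals and for small areas. A minimal sketch of that logic follows. The record layout, the field names (`ed`, `group`, `income`), the figures and the simple sub-group mean imputer are all hypothetical stand-ins for whichever methodology is under test.

```python
import statistics
from collections import defaultdict

def subgroup_mean_imputer(records):
    """Hypothetical method under test: fill missing incomes with sub-group means."""
    observed = defaultdict(list)
    for rec in records:
        if rec["income"] is not None:
            observed[rec["group"]].append(rec["income"])
    means = {g: statistics.mean(v) for g, v in observed.items()}
    return [
        dict(rec, income=means[rec["group"]]) if rec["income"] is None else rec
        for rec in records
    ]

def evaluate(records, imputer, hidden):
    """Hide the incomes at the given indices, impute them, and score against the truth."""
    masked = [
        dict(rec, income=None) if i in hidden else rec
        for i, rec in enumerate(records)
    ]
    imputed = imputer(masked)
    # Individual-level error: mean absolute difference for the hidden records.
    mae = statistics.mean(
        abs(imputed[i]["income"] - records[i]["income"]) for i in hidden
    )
    # Area-level error: difference in mean income for each enumeration district.
    true_by_ed, imputed_by_ed = defaultdict(list), defaultdict(list)
    for true_rec, imp_rec in zip(records, imputed):
        true_by_ed[true_rec["ed"]].append(true_rec["income"])
        imputed_by_ed[imp_rec["ed"]].append(imp_rec["income"])
    area_error = {
        ed: abs(statistics.mean(imputed_by_ed[ed]) - statistics.mean(true_by_ed[ed]))
        for ed in true_by_ed
    }
    return mae, area_error

# Hypothetical records with known income and known enumeration district (ED).
records = [
    {"ed": "ED1", "group": "manual", "income": 10000},
    {"ed": "ED1", "group": "manual", "income": 12000},
    {"ed": "ED2", "group": "manual", "income": 14000},
    {"ed": "ED2", "group": "professional", "income": 30000},
    {"ed": "ED1", "group": "professional", "income": 26000},
    {"ed": "ED2", "group": "professional", "income": 34000},
]

mae, area_error = evaluate(records, subgroup_mean_imputer, hidden={2, 5})
print(mae, area_error)
```

In this invented example incomes in ED2 are higher than the sub-group means imply, so an imputer that ignores location understates the mean income of ED2. Errors of this kind can only be detected because the spatial location of each record is known.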

5.2 Evaluating proxies for income

In so far as it is possible, given the data quality issues discussed above, it is proposed to compare small area based estimates of income with proxy measures, such as % car ownership, tenure, social class and so on. Multivariate measures of deprivation commonly used as proxies for poverty, geodemographic profiles and the Townsend and DoE indices will also be compared to actual observed levels of income. It is recognised that this goal, although in many ways the simplest to perform, is the one most likely to prove problematic, due to the likely nature of data quality issues with the Census Rehearsal discussed above. Nonetheless, the aim will be to obtain at least some firm indications of the value of each income proxy.

5.3 Incorporating non-census information in income imputation

It is possible that, having controlled for regional pay differentials and relevant individual and household census characteristics, spatial variations in income still persist. This would suggest that certain variables not captured in the Census contribute to ‘determining’ income. Such variables might include house prices, rents and local environment (proximity to major roads or open spaces). It is proposed to augment the most successful income imputation strategy identified in Section 5.1 above by incorporating some of these additional variables. House prices can be estimated by reference to council tax bands for individual households, which are generally readily available, although not necessarily in machine-readable form. Other non-census variables will be incorporated as proves practicable, depending upon the type of information that can most readily be elicited from local sources (particularly relevant local authorities). Time and resource limits necessarily preclude the use of specially commissioned surveys.


6. Collaboration with ONS

The research outlined above would not be possible without the active collaboration of staff from the Office for National Statistics. Agreement has been obtained in principle from Alex Clark, the official Census Data Custodian, to consider providing access to Census Rehearsal data within a secure computing environment at Titchfield, although a final decision will not be made before early summer. If data access is permitted, the earliest time anticipated at which access could be provided is late February, after Lockheed-Martin, the contracted census data processor, has delivered a set of fully coded and edited rehearsal data to the Census Office. Subject to agreement, it is also proposed to set up a project steering group, comprising Patrick Heady and Paul Vickers, both of whom are currently working on different aspects of small area income imputation within ONS. This steering group would meet with the principal investigator and appointed researcher three times (at the start, middle and end of the project), both to offer advice and to encourage a two-way dialogue over results between the project team and ONS.


7. Timetable for research

The proposed timetable for research is set out below. The intended start date is 1st February 2000, on the understanding that the current intention is for the Census Rehearsal data to have been processed into machine readable form by the end of February 1999. Before recruiting for the proposed post, ONS will be consulted for a current best estimate of when data processing will be completed. A slight slippage in this deadline, of up to a month, would not seriously affect the research timetable proposed, as the first couple of months of the project could usefully be spent collecting non-census data to be used in the later stages of the project. Subject to approval from ESRC, it is proposed that any slippage beyond this will, if possible, be incorporated into a suitably delayed start date.

Month   Task(s)

1       Familiarisation with computing environment
        Collection of further information on methods of income imputation used by the Census Office and others
2       Collection of non-census data to augment imputation routines (e.g. house prices, rent etc.)
3       Census Rehearsal data cleaning, familiarisation and basic analysis
3-5     Evaluation of existing imputation methodologies
6       Evaluation of proxy variables for income
7-10    Incorporation of house price (council tax band)
10-12   Write-up and dissemination


8. Dissemination strategy

A five-pronged approach to dissemination is envisaged, all subject to prior ONS clearance:

Direct feedback to the Census Office and other relevant personnel in the Office for National Statistics will occur as the result of on-going collaboration and via the proposed project steering group. This feedback will focus particularly on those aspects of the research concerning the evaluation of alternative imputation methodologies.

Publication of papers in peer reviewed academic journals. Three main papers are envisaged, dealing with each of the main research questions in turn. A fourth, summary paper, is proposed for the Journal of the Market Research Society, targeted specifically at commercial practitioners.

Publication in specialist newsletters, including the Data Archive bulletin, the SAR and MIDAS newsletters (aimed primarily at academic audiences, but involving also a number of commercial practitioners), and the BURISA/LARIA newsletter (aimed primarily at local authority practitioners).

Working papers and other information arising from the project will be posted on a project web-site, building upon the maintained web-site that has already been created for the currently on-going ESRC funded project to create a national validated set of small area population microdata.

Final project results will be presented at two conferences: the annual Association for Survey Computing conference (aimed at commercial practitioners) and the annual conference of the Royal Geographical Society (with the Institute of British Geographers).

Results will also, of course, be disseminated via the biannual Census programme day workshops.


References

Birkin M and Clarke G (1995) ‘Using microsimulation methods to synthesise census microdata’, in S Openshaw [ed.] Census Users’ Handbook, GeoInformation International, Cambridge, 363-387.

Bramley G and Lancaster S (1998) ‘Modelling local and small-area income distributions in Scotland’, Environment and Planning C, 16, 681-706

Cm 4253 (1999) The 2001 Census of Population, TSO, London

Dale A Middleton E and Schofield T (1995) ‘New Earnings Survey variables added to the SARs’, SAR Newsletter 6, Census Microdata Unit, University of Manchester

David M Little R Samuhel M E and Triest R K (1986) ‘Alternative methods for CPS income imputation’, Journal of the American Statistical Association, 81, 29-41

Dorling D (1999) ‘Commentary: Who’s afraid of income inequality?’, Environment and Planning A, 31(4), 571-374

Green A (1998) ‘The geography of earnings and incomes in the 1990s: an overview’, Environment and Planning C, 633-647

GSS (1997) ‘Report of the task force on imputation’, Government Statistical Service Methodology Series no. 3, Office for National Statistics, London

Office for National Statistics (1996) ‘Report of the task force on imputation’, Government Statistical Service Methodologies Series, 3

Rees P H (1998) ‘What do you want from the 2001 Census? Results of an ESRC/JISC survey of user views’, Environment and Planning A, 30, 1775-1796

Vickers P and Yar M (1998) ‘The evaluation of the donor imputation system (DIS) for the 2001 UK Census of population and housing’, Proceedings of the Joint IASS/IAOS Conference, Statistics for Economic and Social Development, September 1998.

Voas D and Williamson P (1999) ‘The diversity of diversity: measuring socio-demographic difference in England and Wales’, Working Paper 99/1, Population Microdata Unit, Department of Geography, University of Liverpool

Williamson P (1995) ‘Adding value to the SAR: income estimation for small areas’, in R Banks [ed.] Vital statistics: the benefits of public surveys, Proceedings of a conference organised by the Association for Survey Computing, London, 29 June 1994.
