The ecological inference problem arises when one makes inferences about
individual behavior from aggregate data. This situation is
frequently encountered in the social sciences and epidemiology. In
this article, we propose a Bayesian approach based on data
augmentation. We formulate ecological inference in $2 \times 2$ tables
as a missing data problem where only the weighted average of two
unknown variables is observed. This framework directly incorporates
the deterministic bounds, which contain all information available from
the data, and allows researchers to incorporate individual-level
data whenever available. Within this general framework, we first
develop a parametric model. We show that through the use of an EM
algorithm, the model can formally quantify the effect of missing
information on parameter estimation. This is an important diagnostic
for evaluating the degree of aggregation effects. Next, we introduce a
nonparametric Bayesian model using a Dirichlet process prior to relax
the distributional assumption of the parametric model. Through
simulations and an empirical application, we evaluate the relative
performance of our models and other existing methods. We show that in
many realistic scenarios, aggregation effects are so severe that more
than half of the information is lost, yielding estimates with little
precision. We also find that our nonparametric model generally
outperforms parametric models.
C code, along with an R interface, is publicly available for
implementing our Markov chain Monte Carlo algorithms to fit the
proposed models.
(Last Revised November 30, 2004)
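
As an illustration of the deterministic bounds mentioned above, the following minimal sketch computes them for a single $2 \times 2$ table. The notation is an assumption made for this example and is not taken from the text: $X$ is the known fraction of the first group, $Y$ is the observed aggregate rate, and $W_1$, $W_2$ are the two unknown group-level rates, so that $Y = X W_1 + (1 - X) W_2$ with $W_1, W_2 \in [0, 1]$. The function name is hypothetical and is not part of the authors' C code or R interface.

```python
def deterministic_bounds(x, y):
    """Bounds on (w1, w2) implied by y = x*w1 + (1 - x)*w2.

    x : known fraction of group 1, strictly between 0 and 1 (assumed notation)
    y : observed aggregate rate, between 0 and 1 (assumed notation)
    Returns ((w1_lo, w1_hi), (w2_lo, w2_hi)).
    """
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    if not 0.0 <= y <= 1.0:
        raise ValueError("y must lie between 0 and 1")

    # Solve the identity for w1 and let w2 range over [0, 1]:
    # w1 = (y - (1 - x) * w2) / x
    w1_lo = max(0.0, (y - (1.0 - x)) / x)
    w1_hi = min(1.0, y / x)

    # Solve the identity for w2 and let w1 range over [0, 1]:
    # w2 = (y - x * w1) / (1 - x)
    w2_lo = max(0.0, (y - x) / (1.0 - x))
    w2_hi = min(1.0, y / (1.0 - x))

    return (w1_lo, w1_hi), (w2_lo, w2_hi)


if __name__ == "__main__":
    # Half of the units belong to group 1 and the aggregate rate is 0.75.
    print(deterministic_bounds(0.5, 0.75))
```

In this example each unknown rate is confined to the interval from 0.5 to 1, which is narrower than the unit interval. When both pairs of bounds come back as 0 and 1, the aggregate data alone place no restriction on the unknown rates.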