A statistical model is a formalization of relationships between variables in the form of mathematical equations. It describes how one or more random variables are related to one or more other variables. The model is statistical because the variables are related stochastically rather than deterministically. In mathematical terms, a statistical model is frequently thought of as a pair (Y,P), where Y is the set of possible observations and P is a set of possible probability distributions on Y. It is assumed that there is a distinct element of P which generates the observed data. Statistical inference enables us to make statements about which element(s) of this set are likely to be the true one.
Most statistical tests can be described in the form of a statistical model. For example, Student's t-test for comparing the means of two groups can be formulated as testing whether an estimated parameter in the model is different from 0. Another similarity between tests and models is that both involve assumptions. In many models, for instance, the error is assumed to be normally distributed.^{[1]}
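The correspondence between the t-test and a model parameter can be checked numerically. The following is a minimal sketch, not a reference implementation: it assumes the pooled-variance form of Student's t-test and a regression of the outcome on a 0/1 group indicator, and the function names and the two small samples are made up purely for illustration.

```python
import math
import statistics

def pooled_t(group_a, group_b):
    """Student's two-sample t statistic with pooled variance."""
    na, nb = len(group_a), len(group_b)
    ma, mb = statistics.fmean(group_a), statistics.fmean(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (mb - ma) / math.sqrt(pooled * (1 / na + 1 / nb))

def slope_t(x, y):
    """t statistic for b1 in y = b0 + b1*x + error, fitted by least squares."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return b1 / standard_error

# Illustrative values only, not real measurements.
group_a = [5.1, 4.8, 6.0, 5.5, 5.2]
group_b = [6.3, 5.9, 6.8, 7.1, 6.0, 6.5]

# Group membership coded as a 0/1 indicator variable.
x = [0] * len(group_a) + [1] * len(group_b)
y = group_a + group_b

t_test = pooled_t(group_a, group_b)
t_model = slope_t(x, y)
print(t_test, t_model)  # the two statistics agree up to rounding error
```

With a 0/1 indicator, the fitted slope b1 equals the difference between the two group means, so testing whether b1 differs from 0 is the same as testing whether the means differ.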
Model comparison
Models can be compared to each other. This can be done in either an exploratory or a confirmatory data analysis. In an exploratory analysis, you formulate all the models you can think of and see which describes your data best. In a confirmatory analysis, you test which of the models you specified before the data were collected fits the data best, or, if you have only one model, whether it fits the data. In linear regression analysis you can compare the amount of variance explained by the independent variables, R^{2}, across the different models. In general, you can compare nested models by using a likelihood-ratio test. Nested models are models that can be obtained by restricting a parameter in a more complex model to be zero.
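As a rough sketch of how a likelihood-ratio test might be computed, the function below assumes two nested linear models with normally distributed errors that differ by exactly one parameter. Under that assumption the statistic reduces to n·ln(RSS_restricted / RSS_full), which is compared against a chi-square distribution with one degree of freedom. The function name and the residual sums of squares passed to it are hypothetical.

```python
import math

def likelihood_ratio_test(rss_restricted, rss_full, n):
    """Likelihood-ratio test for two nested normal-error linear models.

    Assumes the restricted model fixes exactly one parameter of the full
    model, so the reference distribution is chi-square with 1 degree of
    freedom. rss_* are residual sums of squares; n is the sample size.
    """
    statistic = n * math.log(rss_restricted / rss_full)
    # Survival function of a chi-square variable with 1 degree of freedom:
    # P(Z**2 > s) = erfc(sqrt(s / 2)) for a standard normal Z.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical residual sums of squares, chosen only to show the call.
stat, p = likelihood_ratio_test(rss_restricted=120.0, rss_full=100.0, n=50)
print(f"LR statistic = {stat:.3f}, p = {p:.4f}")
```

A small p-value suggests that the restriction makes the fit noticeably worse, so the more complex model would be preferred. The chi-square reference distribution is an approximation that holds for large samples.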
An example
Length and age are probabilistically distributed over humans. They are stochastically related: knowing that a person is 7 years old changes the probability that this person is 6 feet tall. You could formalize this relationship in a linear regression model of the following form: length_{i} = b_{0} + b_{1}age_{i} + ε_{i}, where b_{0} is the intercept, b_{1} is the parameter by which age is multiplied to obtain a prediction of length, ε_{i} is the error term, and i indexes the subject. This means that length starts at some value (a person has some minimum length at birth) and is predicted by age to some extent. The prediction is not perfect, which is why an error term is included in the model. This error contains variance that stems from sex and other variables. When sex is included in the model, the error term will become smaller, as you will have a better idea of the chance that a particular 16-year-old is 6 feet tall when you know that this 16-year-old is a girl. The model would become length_{i} = b_{0} + b_{1}age_{i} + b_{2}sex_{i} + ε_{i}, where the variable sex is dichotomous. This model would presumably have a higher R^{2}. The first model is nested in the second model: the first model is obtained from the second when b_{2} is restricted to zero.
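The two models can be tried out on simulated data. The sketch below is illustrative only: the sample is generated from made-up coefficients (not real growth data), sex is coded arbitrarily as 0 or 1, and fit_ols is a small hypothetical helper that solves the least-squares normal equations using only the Python standard library.

```python
import random

def fit_ols(columns, y):
    """Least-squares fit of y on an intercept plus the given predictor columns.

    Returns (coefficients, r_squared); coefficients[0] is the intercept b0.
    """
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    aug = [[sum(row[r] * row[c] for row in design) for c in range(k)]
           + [sum(row[r] * yi for row, yi in zip(design, y))]
           for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    coefs = [aug[r][k] / aug[r][r] for r in range(k)]
    mean_y = sum(y) / n
    rss = sum((yi - sum(b * v for b, v in zip(coefs, row))) ** 2
              for row, yi in zip(design, y))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return coefs, 1 - rss / tss

rng = random.Random(0)  # fixed seed so the run is repeatable
age = [rng.uniform(2, 18) for _ in range(200)]
sex = [rng.randint(0, 1) for _ in range(200)]  # dichotomous; coding is arbitrary
# Made-up data-generating values (in cm): b0 = 80, b1 = 5, b2 = 6, error sd = 4.
length = [80 + 5 * a + 6 * s + rng.gauss(0, 4) for a, s in zip(age, sex)]

coefs_1, r2_1 = fit_ols([age], length)       # length ~ age
coefs_2, r2_2 = fit_ols([age, sex], length)  # length ~ age + sex
print("model 1:", coefs_1, r2_1)
print("model 2:", coefs_2, r2_2)
```

Because the first model is nested in the second, the R^{2} of the second model on the same data can never be lower than that of the first. Whether the improvement is more than chance would still need a test such as the likelihood-ratio test.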
