The Residual Method
Recently I stumbled across an interesting method to leverage lots of partial data in order to accurately find a relationship in a small amount of data. I doubt I'm the first to come up with such a method, but it's news to me and my statistically inclined friends, so I figured I'd better write it up.
First some motivation. Imagine you are working in the promotions department for a new health insurance company. You want to send out a mailing to potential customers, but you'd like to reach customers that are going to be profitable, so you want to estimate their potential health costs and send mail to those whose costs will be low.
You don’t know too much about the people you are mailing to, just their age. So you want to find the relationship between age and health insurance costs as accurately as possible. Fortunately your research department has conducted a smallish study of 1000 people where they asked about their real medical costs, ages, and other things.
You can use this real data to figure out the effect of age on medical costs using a simple regression.
Cost ~ Age
Sounds lovely, but clearly age isn’t the only thing that is affecting medical costs. There are many other factors, for example weight, which will contribute. You’d be tempted to set up a model that’s a bit more complicated to account for these other factors.
Cost ~ Age + Weight + ...
This model is nice because these other factors account for some of the noise in your data and may produce a cleaner relationship between age and cost to use for your mailing. However, there is a problem here, because age and weight are related to one another. Older people tend to be a little heftier than younger people, so putting weight in the model changes the relationship you estimate between cost and age. This might be fine in some circumstances, but in this case your mailing will have no information on weight, so the age coefficient from this model will be misleading for your purposes.
This point might not be obvious at first glance, so let's invent some sample data to help us make the case. Imagine that your survey data is composed of Cost, Age, and Weight. Age is a normally distributed random variable. Weight is the sum of Age and another normally distributed random variable, and Cost is the sum of Age, Weight, and a third error term. Obviously this data and its relationships are contrived, but no bother, they will serve the point nonetheless.
In R we can build this kind of dataset with the following code.
# Simulate a small survey of 1000 people
n = 1000
mean = 0
sd = 1
Age = rnorm(n, mean, sd)                  # Age is a standard normal variable
Weight = Age + rnorm(n, mean, sd)         # Weight is Age plus noise
Cost = Age + Weight + rnorm(n, mean, sd)  # Cost is Age plus Weight plus noise
data = data.frame(Age, Weight, Cost)
Now if we build our basic relationship Cost ~ Age we might have:
summary(glm(Cost ~ Age, data=data))
             Estimate  Std. Error
(Intercept) -0.000305    0.046821
Age          2.055623    0.045261
Basically Cost is about 2 times Age, give or take a little error. That makes sense: since Weight is Age plus noise, Cost = Age + Weight + noise works out to roughly 2 × Age plus noise. What happens when we throw Weight into the mix?
summary(glm(Cost ~ Age + Weight, data=data))
            Estimate  Std. Error
(Intercept)  0.01624     0.03260
Age          1.02703     0.04459
Weight       1.02202     0.03135
You see we've mangled the relationship between age and cost by throwing weight into the mix. Since we know squat about weight for the people in our mailing, this is quite useless to us and we'd rather stick with the previous result.
If only we could account for the component of weight that does not depend on age. Simple, you think! I will find the relationship between weight and age, then subtract the portion of weight which is explained by age, leaving me with a weight residual. I'll add this to the model of cost and get an even better estimate of the relationship between cost and age. In R this might look like:
data$Predicted = predict(glm(Weight ~ Age, data=data))  # the part of Weight explained by Age
data$Residual = data$Weight - data$Predicted            # the part of Weight not explained by Age
summary(glm(Cost ~ Age + Residual, data=data))
             Estimate  Std. Error
(Intercept) -0.000305    0.032593
Age          2.055623    0.031508
Residual     1.022017    0.031355
Marvelous! The standard error of our estimate of the effect of age on cost has gone down! We must have improved the situation. But lo, why is the estimate of the age coefficient exactly the same as it was in our first regression? Surely if the estimate is a better one it must also be a different one?
It seems we have stumbled upon an old problem in statistics: the problem of the generated regressor (Pagan 1984). The standard error is now leading us astray and is not to be trusted. The coefficient cannot change because an OLS residual is, by construction, uncorrelated with the regressor it was computed from, so adding it to the model leaves the age estimate exactly where it was; it merely soaks up leftover variance and makes the reported standard error look smaller than the estimate's real precision warrants. The basic problem is that we cannot be certain of the true relationship between weight and age.
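To make the orthogonality concrete, here is a quick sanity check on the data frame built above (just a sketch, not part of the original recipe):

# The generated residual is numerically uncorrelated with Age ...
cor(data$Age, data$Residual)    # essentially zero, up to floating point error

# ... so the Age coefficient is identical with and without it
coef(glm(Cost ~ Age, data=data))["Age"]
coef(glm(Cost ~ Age + Residual, data=data))["Age"]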
But what if we could? What if we somehow knew what the true relationship between weight and age actually was? Well, in this case we do, because we made up the data: Weight is simply Age plus some error term. So all we have to do is subtract Age from Weight and we have the 'true' residual. How does this kind of residual fare?
data$Residual = data$Weight - data$Age
summary(glm(Cost ~ Age + Residual, data=data))
            Estimate  Std. Error
(Intercept)  0.01624     0.03260
Age          2.04905     0.03151
Residual     1.02202     0.03135
Now this really is something! Once again the standard error of the estimate of age has shrunk, but more notably the estimate itself has gotten closer to 2, which is the value we actually know it to be!
If only we knew the real relationship between weight and age in the population, we would be in business. Of course we can never know this real relationship without unlimited data, but what if we simply had a very good estimate? Imagine that we had access to census records which reported age and weight for the entire country. Then we could probably get an estimate of the relationship that was quite close to the truth, and we could use this to calculate the residual. In R we might imagine:
# Simulate a much larger 'census' that records Age and Weight (but no Cost)
n = 100000
mean = 0
sd = 1
Age = rnorm(n, mean, sd)
Weight = Age + rnorm(n, mean, sd)
census = data.frame(Age, Weight)

# Estimate Weight ~ Age on the census, then residualize Weight in the survey data
data$Predicted = predict(glm(Weight ~ Age, data=census), newdata=data)
data$Residual = data$Weight - data$Predicted
summary(glm(Cost ~ Age + Residual, data=data))
            Estimate  Std. Error
(Intercept)  0.01210     0.03260
Age          2.05207     0.03151
Residual     1.02202     0.03135
How cool is this? Once again we have a different, better estimate of the relationship between age and cost. Now when we go to make our mailing we can make a more informed decision about who to mail to.
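As a rough sketch of how the mailing might use this, suppose prospects is a data frame of potential customers with an Age column (a hypothetical example, not part of the analysis above). Since age is all we will know about them, we set the unknown residual to its mean of zero, predict each prospect's expected cost, and mail the ones below some chosen threshold:

fit = glm(Cost ~ Age + Residual, data=data)

# Hypothetical mailing list: for prospects we only know Age,
# so set the unknown Residual to its mean (zero)
prospects = data.frame(Age = rnorm(500))
prospects$Residual = 0
prospects$ExpectedCost = predict(fit, newdata=prospects)

# Mail the prospects whose expected cost falls below a chosen threshold
threshold = quantile(prospects$ExpectedCost, 0.25)
mailing_list = prospects[prospects$ExpectedCost < threshold, ]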
And that's the residual method. We leverage a large body of incomplete data to estimate a more precise relationship in a small amount of complete data. In this example the result is rather simple and of limited utility, but I think it points in a bright new direction for statistics.
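For what it's worth, the recipe generalizes beyond this toy example. Here is a minimal sketch of a helper (the function name residual_adjust and its arguments are my own invention, not an established API): residualize the covariate using the large auxiliary dataset, then refit on the small one.

# small: data frame with the outcome, the predictor of interest, and the covariate
# large: data frame with the predictor and the covariate (no outcome needed)
residual_adjust = function(small, large, outcome, predictor, covariate) {
  aux_formula = reformulate(predictor, response = covariate)   # e.g. Weight ~ Age
  aux_fit = glm(aux_formula, data = large)                     # fit on the big dataset
  small$Residual = small[[covariate]] - predict(aux_fit, newdata = small)
  main_formula = reformulate(c(predictor, "Residual"), response = outcome)
  glm(main_formula, data = small)                              # e.g. Cost ~ Age + Residual
}

summary(residual_adjust(data, census, "Cost", "Age", "Weight"))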
Humans are amazing at leveraging all the knowledge we have of the world to solve small new puzzles. When a child learns the alphabet, they don't need to see ten thousand examples before they have it mastered; several dozen will do. This is because the child has seen millions of images in their lifetime, and the ones that present the alphabet are quite distinct. They don't need to learn how to account for lighting and orientation, because they have solved those problems using the images that came before. Instead they can focus on the crux of the matter.
The residual method is a step in this direction. It works by exploiting knowledge of the world to get at the crux of the matter.