Monday, January 14, 2013

The Residual Method

Recently I stumbled across an interesting method to leverage lots of partial data in order to accurately find a relationship among a little data.  I doubt I’m the first to come up with such a method, but it’s news to me and my statistically inclined friends, so I figured I’d better write it up.

First some motivation.  Imagine you are working in the promotions department for a new health insurance company.  You want to send out a mailing to potential customers, but you’d like to find customers that are going to be profitable, so you want to try to estimate their potential health costs and send mail to those that will be low.

You don’t know too much about the people you are mailing to, just their age.  So you want to find the relationship between age and health insurance costs as accurately as possible.  Fortunately your research department has conducted a smallish study of 1000 people where they asked about their real medical costs, ages, and other things.

You can use this real data to figure out the effect of age on medical costs using a simple regression.

Cost ~ Age

Sounds lovely, but clearly age isn’t the only thing that is affecting medical costs.  There are many other factors, for example weight, which will contribute.  You’d be tempted to set up a model that’s a bit more complicated to account for these other factors.

Cost ~ Age + Weight + ...

This model is nice because these other factors account for the noise in your data and may produce a cleaner relationship between age and cost to use for your mailing.  However, there is a problem here because age and weight are related to one another.  Older people tend to be a little heftier than younger people and so by putting weight in the model you will affect the relationship between cost and age.  This might be fine in some circumstances, but in this case your mailing will have no information on weight and so its effect on age will be misleading.

This point might not be obvious at first glance, so let’s invent some sample data to help us make the case.  Imagine that your survey data is composed of Cost, Age, and Weight.  Age is a normally distributed random variable.  Weight is the sum of Age and another normally distributed random variable, and Cost is the sum of Age, Weight, and a third error term.  Obviously this data and its relationships are contrived, but no bother, they will serve the point nonetheless.

In R we can build this kind of dataset with the following code.

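# Survey size and the parameters of the normal noise
n    = 1000
mean = 0
sd   = 1

# Weight depends on Age; Cost depends on both Age and Weight
Age = rnorm(n, mean, sd)
Weight = Age + rnorm(n, mean, sd)
Cost = Age + Weight + rnorm(n, mean, sd)

data = data.frame(Age, Weight, Cost)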

Now if we build our basic relationship Cost ~ Age we might have:

summary(glm(Cost ~ Age, data=data))

            Estimate Std. Error    
(Intercept) -0.000305   0.046821    
Age          2.055623   0.045261

Basically Cost is about 2 times Age, give or take a little error.  What happens when we throw Weight in the mix?

summary(glm(Cost ~ Age + Weight, data=data))

           Estimate Std. Error    
(Intercept)  0.01624    0.03260   
Age          1.02703    0.04459
Weight       1.02202    0.03135

You see we’ve mangled the relationship between age and cost by throwing weight in the mix.  Since we know squat about weight in our mailing, this is quite useless to us and we’d rather stick with the previous result.

If only we could account for the component of weight that does not depend on age.  Simple you think!  I will find the relationship between weight and age then subtract that portion of weight which is explained by age leaving me with a weight Residual.  I’ll add this to the model of cost and get an even better estimate of the relationship between cost and age.  In R this might look like:


data$Predicted = predict(glm(Weight ~ Age, data=data))
data$Residual = data$Weight - data$Predicted

summary(glm(Cost ~ Age + Residual, data=data))

            Estimate Std. Error    
(Intercept) -0.000305   0.032593   
Age          2.055623   0.031508
Residual     1.022017   0.031355

Marvelous!  Our estimate of the standard error of age to cost has gone down!  We must have improved the condition.  But LO, why is the estimate of the age coefficient the same as it was in our first regression?  Surely if the estimate is a better one it must also be a different one?

It seems we have stumbled upon an old problem in statistics.  The problem of the generated regressor (Pagan 1984).  The standard error is now leading us astray.  It is not to be trusted.  The basic problem is that we cannot be certain of the true relationship between weight and age.
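A quick way to see why the age coefficient did not budge: residuals from a least-squares fit made on the same sample are uncorrelated with the regressor by construction, so adding them to the model cannot move the Age estimate.  They only soak up noise, which shrinks the reported standard error.  Here is a minimal sketch of that check.  It regenerates the contrived data rather than reusing the numbers above; the seed and the lowercase variable names are assumptions made for illustration, and lm is used in place of glm since both give the same least-squares coefficients with the default gaussian family.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n = 1000
age    = rnorm(n)
weight = age + rnorm(n)
cost   = age + weight + rnorm(n)

# Residual of weight after a fit on age using the SAME sample
in_sample_resid = residuals(lm(weight ~ age))

# Least-squares residuals are orthogonal to the regressor
cor(age, in_sample_resid)   # zero up to floating-point error

# So the age coefficient is the same with or without the residual
coef_plain = unname(coef(lm(cost ~ age))["age"])
coef_resid = unname(coef(lm(cost ~ age + in_sample_resid))["age"])
all.equal(coef_plain, coef_resid)
```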

But what if we could?  What if we somehow knew what the true relationship between weight and age actually was?  Well, in this case we do, because we made up the data.  Weight is simply age plus some error term.  So all we have to do is subtract age from weight and we have the ‘true’ Residual.  How does this kind of residual fare?

data$Residual = data$Weight - data$Age
summary(glm(Cost ~ Age + Residual, data=data))


           Estimate Std. Error    
(Intercept)  0.01624    0.03260    
Age          2.04905    0.03151
Residual     1.02202    0.03135

Now this really is something!  Once again the standard error of the estimate of age has shrunk, but more notably the estimate itself has gotten closer to 2 which is the value we actually know it to be!
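A single draw cannot tell us whether an estimate is really better or just lucky, so one hedged way to check is to repeat the whole experiment many times and compare how much the Age estimate wobbles with and without the true residual.  The sketch below does that on freshly simulated data.  The seed, the number of repetitions, and the smaller survey size are assumptions chosen to keep it fast; none of them come from the study described above.

```r
set.seed(42)  # arbitrary seed for reproducibility
reps = 200    # number of simulated surveys (assumption)
n    = 200    # people per simulated survey (assumption)

one_survey = function() {
  age    = rnorm(n)
  weight = age + rnorm(n)
  cost   = age + weight + rnorm(n)
  true_resid = weight - age   # uses the known, made-up relationship
  c(plain = unname(coef(lm(cost ~ age))["age"]),
    resid = unname(coef(lm(cost ~ age + true_resid))["age"]))
}

estimates = replicate(reps, one_survey())  # 2 x reps matrix

# Spread of the Age estimate across surveys, without and with the true residual
sd_plain = sd(estimates["plain", ])
sd_resid = sd(estimates["resid", ])
c(plain = sd_plain, resid = sd_resid)
```

If the reasoning above holds, the second spread should come out smaller than the first, with both sets of estimates centered near 2.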

If only we knew the real relationship between weight and age we would be in business.  Of course we can never know this real relationship without unlimited data, but what if we simply had a very good estimate?  Imagine that we had access to census records which reported age and weight for the entire country.  Then we could probably get an estimate of the relationship that was quite close to the truth and we could use this to calculate the residual.  In R we might imagine:


n    = 100000
mean = 0
sd   = 1

Age = rnorm(n, mean, sd)
Weight = Age + rnorm(n, mean, sd)

census <- data.frame(Age, Weight)

data$Predicted = predict(glm(Weight ~ Age, data=census), newdata=data)
data$Residual = data$Weight - data$Predicted

summary(glm(Cost ~ Age + Residual, data=data))

           Estimate Std. Error
(Intercept)  0.01210    0.03260   
Age          2.05207    0.03151
Residual     1.02202    0.03135


How cool is this?  Once again we have a different, better estimate of the relationship between age and cost.  Now when we go to make our mailing we can make a more informed mailing.

And that’s the residual method.  We leverage a large body of incomplete data to estimate a more precise relationship in a small amount of full data.  In this example the result is rather simple and of limited utility.  But I think it points in a bright new direction for statistics.

Humans are amazing at leveraging all the knowledge we have of the world to solve small new puzzles.  When a child learns the alphabet, they don’t need to see ten thousand examples before they have it mastered.  Several dozen will do.  This is because this child has seen millions of images in its lifetime and the ones that present the alphabet are quite distinct.  It doesn’t need to learn how to account for lighting and orientation because it has solved those problems using the images that came before.  Instead it can focus on the crux of the matter.

The residual method is a step in this direction.  It works by exploiting knowledge of the world to get at the crux of the matter.

Tuesday, September 11, 2012

Simple English

At PERTS, a student motivation project that I work on, we need to communicate with students.  This means using language that 5th to 12th graders can understand, especially ESL students and those with poor reading skills, who are the main target of our psychological interventions.

So we read through all the material that we give students, and try carefully to compose our instructions in the most universal language we can.  Still, every year we make mistakes.  Teachers give us feedback on words that were confusing for their students.  For example, one classroom didn't know what a 'detention' was because their school did not have detentions.  Detecting these mistakes is hard; it is hard to know what words might trip someone up.

Thinking about Google's Ngram engine, I realized there is a simple solution.  We just needed a tool that would highlight the words in a paragraph that are rare or uncommon.  If we could eliminate the rare words, we would end up with text that more students could understand.

So I did what any self-respecting programmer would do: I googled around to find the tool I needed.  To my complete surprise, I found no such tool on the Internet.  But this kind of thing is easy; Peter Norvig wrote a 20 line spelling corrector, much more profound than my goal.  So I resolved to create my own.

I needed a corpus of text to find which words were common or rare.  I didn't want to borrow books from Project Gutenberg, as Norvig had done, because these older texts were well over the reading level of my target students.  Instead I pulled some pages from the Simple English Wikipedia, stripped out all that annoying markup and used this to generate a set of word counts.
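To make the idea concrete, here is a minimal sketch of counting words in a corpus and flagging the rare ones in a passage.  It is not the code behind the tool (which was written in Python).  It is an R illustration in which the tiny corpus string, the function names tokenize and rare_words, and the min_count threshold are all hypothetical.

```r
# Hypothetical stand-in for the Simple English Wikipedia pages
corpus = "the cat sat on the mat. the dog sat on the log. the cat and the dog like the school."

# Lowercase the text and split it on anything that is not a letter or apostrophe
tokenize = function(text) {
  words = unlist(strsplit(tolower(text), "[^a-z']+"))
  words[words != ""]
}

# Named vector of how often each word appears in the corpus
tab = table(tokenize(corpus))
word_counts = setNames(as.integer(tab), names(tab))

# Return the words in `text` seen fewer than `min_count` times in the corpus
rare_words = function(text, counts, min_count = 1) {
  words = unique(tokenize(text))
  seen  = counts[words]       # NA when a word never appears in the corpus
  seen[is.na(seen)] = 0
  words[seen < min_count]
}

flagged = rare_words("The dog got detention at school", word_counts)
flagged
```

With a corpus this small even an everyday word like 'at' gets flagged, which is why a real tool needs a much larger body of simple text and a sensibly chosen threshold.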

A few lines of Python later and I had a working prototype.  I made it pretty using a little Bootstrap and then pushed it online using Google's App Engine, which Steve Huffman had de-convoluted for me.  As you can see from this screenshot, this tool would have identified detention as a word to be careful of.  Now we know.

For those that would like to try it out, you can find the tool at simpleenglishchecker.appspot.com and the code at github.

Saturday, January 14, 2012

What comes to mind

Sometimes I wonder where ideas come from. I have good ideas from time to time, but I don't tell my brain to think them, they generally just come out. Tonight I caught an idea in the act, I saw it happen. Here is the story.

It was late, and I was coming back to bed after going to the bathroom. I was trying to be quiet for Anna's sake, but I've always been bad at being quiet. I tried my hardest to shut the door quietly as I came in the room. I turned the handle to one side, so the latch wouldn't catch. I closed it slowly without slamming and then carefully released the handle so the latch would fall in place with none the wiser. But then the door made a sound despite my careful actions.

That sound has come before, and I never knew why. But in that instant I saw a vision. My mind's eye imagined the latch, too small for the hole in the door jamb. When I turned the handle and the latch fell back in place, it wasn't up against the side of the strike plate. There was a gap, and the door is on an angle (we live in an old apartment), so it swings under the weight of gravity. That means the door 'fell', so to speak, until the edge of the latch met the edge of the strike plate. The falling made a noise. Eureka!

OK, admittedly, this was not the most amazing discovery. But it was a discovery, and I wasn't the one who made it; my brain told me about it. I know that sounds weird, but bear with me here.

There was some part of my unconscious that must have been thinking about this problem idly in the background for the last few years. Eventually, it had figured it out. It didn't bother me with the solution right after it solved the problem. Instead it waited, waited for the sound to happen again. When, eventually, my auditory brain sent the signal of the door slam noise, this part of my brain spoke up. It said, 'Hey, visual cortex, try running this simulation.' And the visual cortex filled my brain with the image of a latch falling against a slot and then I knew what had happened.

And there you have it, a thought caught in the act. I wonder where they come from? What are they up to before they jump into our consciousness for their big eureka moment? Can we explore our brain to find these forming thoughts? Maybe, but it's too much to think about now. I have to get back to bed.



Tuesday, March 8, 2011

A few good papers

Finding high quality articles from PLOS

Today I had to find a great research article for journal club. I wanted something influential and topical. PLOS made me happy by posting article level statistics as an excel file. I analyzed the data and produced a list of articles that were more cited than expected. They look interesting, so I am sharing them here. Enjoy!


PLOS papers with higher than expected citations before 2009

Methods
I did not know whether to use page views or citations as measure of influence. When I plotted them against each other it became clear that lots of citations implied lots of page views, but lots of page views did not imply lots of citations. I wanted articles that had both, so I used citations as my measure.

Articles are cited more over time, but I did not want to bias toward older articles. So I determined the best fit line for number of citations vs date and then used that to calculate the expected number of citations for a given date.

Last, I divided the observed number of citations by the expected number of citations for a given date. Unfortunately, the newest papers in the data set all appeared at the top of the list. This is because there is a low expected number of citations for recent papers and therefore noise is amplified. To correct for this, I eliminated all of the papers that appeared after 2008.
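The steps above can be sketched in a few lines of R.  This is only an illustration under stated assumptions: the data frame, its column names (title, published, citations), and the numbers in it are made up, and the real PLOS spreadsheet would need to be read in and cleaned first.

```r
# Hypothetical article-level data; names and values are invented for illustration
articles = data.frame(
  title     = c("A", "B", "C", "D", "E", "F"),
  published = as.Date(c("2004-06-01", "2005-03-15", "2006-01-10",
                        "2007-08-20", "2008-05-05", "2009-11-30")),
  citations = c(60, 20, 45, 12, 15, 8)
)

# Best-fit line of citations against publication date
fit = lm(citations ~ as.numeric(published), data = articles)

# Expected citations for each article's date, then observed / expected
articles$expected = predict(fit, newdata = articles)
articles$ratio    = articles$citations / articles$expected

# Drop the papers that appeared after 2008, then rank by the ratio
older  = articles[articles$published < as.Date("2009-01-01"), ]
ranked = older[order(-older$ratio), ]
ranked[, c("title", "citations", "expected", "ratio")]
```

Because the expected count for the newest papers is small, small changes in their citations swing the ratio a lot, which is the noise problem that motivates the date cutoff.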

Saturday, January 1, 2011

Vampire Phones


Do we have the energy to charge a cell phone using our blood sugar?

Recently I found myself wondering if it would be possible to charge a cell phone using only the blood sugar in a person's body, given we had some cool technology to convert glucose into electricity. This led me to a critical question: how much energy does our body use compared to cell phones?

To answer that question, we need to determine the power consumption of people and cell phones using a standard unit. Since people use about 2000 food Calories per day and a watt is about a Calorie an hour, we can estimate that a person runs at about 100 Watts. Mother Nature is pretty impressive, no? With only 100 Watts she powers a talking, walking, thinking machine, whereas Thomas Edison could only light up a room.

Next we need to determine the energy use of a cell phone, like an iPhone for instance. The calculation is not straight-forward as a phone uses different amounts of power depending on its usage. In the worst case scenario, browsing the web on 3g, Apple reports that the iPhone can last 6 hours. At this rate the iPhone is running at 3 watts. So Steve Jobs manages to produce a streaming, calculating, communication device for much less energy than Mother Nature uses on us. Pretty impressive too.

Taken together, we realize that a 100 W human could afford to power a 3 Watt cell phone given we could tap into our blood sugar and drink a little Kool-Aid when we want to watch Netflix.
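For anyone who wants to check the arithmetic, here is a short sketch of the unit conversion.  The 2000 Calories per day and the 3 watt phone are the figures used above; the only outside number is the standard conversion of one food Calorie to 4184 joules.

```r
# Standard unit conversions
joules_per_food_calorie = 4184           # 1 food Calorie (kcal) in joules
seconds_per_day         = 24 * 60 * 60

# Figures taken from the text
calories_per_day = 2000
phone_watts      = 3

# A watt is a joule per second
human_watts = calories_per_day * joules_per_food_calorie / seconds_per_day
human_watts    # about 96.9 W, close to the 100 W estimate

# Share of the body's power budget the phone would need
phone_share = phone_watts / human_watts
phone_share    # roughly 3 percent
```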

...

Sunday, May 17, 2009

Babbling with Biology


Scientists are in the business of making predictions and keep faith that these predictive powers will help to change the future for our betterment. Some phenomena, like the flight path of a projectile, lend themselves to being predicted. Others, like the weather, are fickle. Some systems, like human behavior, are different altogether.

If you remember your high school physics, you will remember that if one throws a ball in the air and charts its course, the chart will form the pleasant shape of a parabola. If one knows the launch speed, angle, and wind conditions, one can do a very nice job of predicting where and when the ball will land. Mr. Galileo Galilei was the first to find this elegant relationship and it has been paying predictive dividends ever since in phenomena like atomic bombs, water balloons, and rocket ships.

Weather is another matter. Even if one were to measure the temperature, pressure, wind speed, and so on at every point on earth it would be difficult to predict the exact weather conditions far outside of the familiar 10 day forecast. The reason prediction is so difficult (impossible, really) is explained by an angry branch of mathematics you may not have encountered known as Chaos Theory.

Like other maths, Chaos is full of mind-numbing equations which serve to confuse the unacquainted and obscure its relatively simple premises. If you break through the obfuscations, the essence of Chaos is the observation that, in some systems, very small differences amplify to very large differences with time. We are all familiar with this kind of phenomenon. Consider the chance encounter of two strangers on the street that leads to coffee, that leads to love, that leads to marriage, that culminates in a brand new person, like you perhaps. In this scenario, if the small chance meeting hadn't occurred, neither would you. So, you can see how a rather small event might be amplified to a big event with time (not that I'm calling you big). In the case of weather, they say that the beating of a butterfly's wings, with a little time, might cause a typhoon in Tokyo. So, while we can do a decent job of predicting the 10 day forecast, predicting the 11th day requires an absurd level of research including a beautiful, but beguiling, butterfly census.
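The amplification of small differences is easy to watch in a toy system. The sketch below uses the logistic map, a standard textbook example of chaotic behavior, and follows two starting points that differ by one part in ten billion. It is only an illustration and has nothing to do with real weather models; the starting value, the size of the nudge, and the number of steps are arbitrary assumptions.

```r
# Logistic map x -> r * x * (1 - x); r = 4 is a standard chaotic setting
logistic_path = function(x0, steps, r = 4) {
  x = numeric(steps)
  x[1] = x0
  for (i in 2:steps) x[i] = r * x[i - 1] * (1 - x[i - 1])
  x
}

steps = 60
a = logistic_path(0.2, steps)
b = logistic_path(0.2 + 1e-10, steps)   # a butterfly-sized nudge

# Distance between the two paths at a few points along the way
gap = abs(a - b)
gap[c(1, 10, 20, 30, 40, 50, 60)]
```

The gap starts out far too small to notice and grows by many orders of magnitude, after which the two paths have nothing to do with each other.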

The effect of Chaos for us scientists is that we are limited in the degree of predictability we can expect to develop for complex systems. However, some chaotic systems have a hidden layer of organization. Human behavior, for instance, is a rather complicated thing, like the weather, as evidenced by the marriage example above. So, you might be tempted to think that human behavior cannot be well predicted outside of some multi-day forecast. But this naive hypothesis is quickly refuted if you bother asking a human what behavior they intend to perform in the coming while.

Humans, like you and me, have a theory of self which we are capable of talking about and modifying that allows us to say things like "I'll meet you for tea next week" with a great degree of accuracy. This rather remarkable phenomenon would be the equivalent of being able to ask the weather if it plans to snow next Thursday. So, in developing a model of human behavior, we would be foolish to ignore our capacity to communicate, predict, and affect our own behavior. If I want to predict when you will eat next, I might study your psychology or metabolism, but I would probably be best served just to ask you.

Humans are not unique in having a predictive and communicable theory of self. Bees can communicate their flight path with a dance, bonobos can communicate hunger with sign language, and computers can communicate a scheduled virus scan with a pop up. Any scientist attempting to model these systems would do well to learn the system's languages and add the system's self-assessments to any externally informed predictions.

Because of its computational complexity, I suspect that cellular biology shares with humans, bees, and computers the ability to communicate a theory of self. My quest going forward is to promote technologies that facilitate that conversation. Communicating with people and computers can be hard enough, but biology will be a bit trickier. Fortunately, modern science has given us access to a number of tools which we might engage towards the purpose. If we are clever enough and lucky enough to conduct meaningful dialogues with cells we might start working to each other's benefit. We could provide cells access to nutrients, chemicals, and global networks and cells might help us by quelling that pesky little malignancy we call cancer.

...