Learning the rules of image transformation


How a computer can learn the rules of rotation, reflection, scaling, translation, and many other transformations that images can undergo.

We recognize images despite transformations.

As your eyes move across this sentence, the image hitting your retina is constantly changing. Yet you hardly notice. One reason you do not is that your brain recognizes letters regardless of their position in your field of view.

Consider the following image. Look first at the blue dot and then at the red. Notice that the number '2' between them is recognizable regardless of your focus. This, despite the fact that the image is falling on a completely different set of neurons.



Images go through many such transformations. They reverse, rotate, scale, translate and distort in many ways we have no words for. That's not to mention all the changes in lighting that can occur. Through all of this they remain recognizable.



The number of transformations that can happen to an image is infinite, but that does not mean that all transformations are possible or probable. Many never occur in the real world, and our brains cannot recognize images that have undergone these improbable transformations.



But computers are bad at learning the rules of image transformation.

The rules of transformation are important for anyone who wants to teach a computer how to process images. The algorithms that are best at image recognition learn a representation of the world that accounts for many shifts of focus. These shifts are called translations, as illustrated above.

However, these algorithms do not learn that images can be translated, the way they learn to recognize digits. Instead the laws of translation are programmed into the algorithm by the researcher.

Would it be possible to have computers learn about translation without telling them explicitly? What about the myriad other transformations that are possible?

I propose that these transformations can be learned.

I propose they can, and submit the following experiment as evidence. First, I show that we can learn that flipping an image upside down is a valid transformation, but randomly rearranging the pixels is not. Then I show that examples of each of the aforementioned transformations can be discovered. Finally, I wax poetic about the future of this kind of work.

note: I can't claim that this work is unique. I just hope that it is interesting.

What data are we using?

I will use MNIST data, a handy collection of handwritten digits.

[Figure: a sample of MNIST handwritten digits]

Where to focus?

We will focus on a small region of data. Specifically, the three vertical pixels highlighted in each digit below.

[Figure: sample digits with the three vertical pixels highlighted]

What patterns are common?

Next we look at the pixel patterns across many images.
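
The post's own analysis code is not shown, but the tabulation might look something like this rough sketch in R. The dslabs package, the binarization threshold, and the exact pixel positions are my assumptions, chosen only for illustration.

library(dslabs)

mnist  = read_mnist()
images = mnist$train$images > 128                 # binarize: TRUE where the pixel is inked
# Each row of `images` is one 28 x 28 digit, flattened row by row into 784 values.

# Three vertically adjacent pixels near the middle of the digit
# (column 14, rows 13 through 15 of the 28 x 28 grid -- an arbitrary illustrative choice).
idx      = (13:15 - 1) * 28 + 14
patterns = apply(images[, idx], 1, function(p) paste(as.integer(p), collapse = ""))

sort(table(patterns), decreasing = TRUE)          # some patterns are far more common than others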

[Figure: frequency of each three-pixel pattern across many digits]

We see that certain patterns are more common than others. For example, there are many cases where one of the three pixels is blank, but only one case where the blank pixel is the middle one.

Clearly the patterns observed are not random ones.

Upside down, the patterns have similar frequencies.

Then we flip the pixels upside down and look at the patterns.



[Figure: pattern frequencies after flipping the pixels upside down]

Notice that the patterns have roughly the same frequency as before. Flipping upside down does not substantially change the image.

But switching the first two pixels produces very different frequencies.

Finally, we make an improbable change, switching the first two pixels, while keeping the third in place.
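
In code, both changes amount to reordering the characters of each pattern from the earlier sketch (again, this is illustrative rather than the post's original code).

flip_updown  = function(p) sapply(strsplit(p, ""), function(x) paste(x[3:1], collapse = ""))      # top and bottom swap
swap_top_two = function(p) sapply(strsplit(p, ""), function(x) paste(x[c(2, 1, 3)], collapse = ""))

sort(table(flip_updown(patterns)), decreasing = TRUE)    # frequencies look much like the originals
sort(table(swap_top_two(patterns)), decreasing = TRUE)   # a noticeably different distribution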



[Figure: pattern frequencies after switching the first two pixels]

The patterns now have very different frequencies than in the prior cases. For example, the pattern where two filled-in pixels surround a blank one is common here, whereas it was observed only once in the previous two examples.

What's going on?

We are seeing the difference between a probable image transformation, flipping upside down, and an improbable one, only switching the first two pixels. Objects in the real world flip upside down regularly. But, unless we are stretching taffy or entertaining contortionists, it is unusual to see the middle of something switch places with its top.

After a probable transformation, the image retains the same patterns as any other image. After an improbable transformation the image contains improbable patterns.

Can we quantify the effect?

We can estimate the likelihood of each reordering given the pattern frequencies observed in the original order. For example, if a pattern, off-on-on, occurred 20% of the time across the original images, then it should occur roughly 20% of the time in a valid rearrangement of those images. Concretely, we use the multinomial distribution: for each reordering, we compute the log-likelihood of the pattern counts it produces under the pattern frequencies of the original order.
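
A rough sketch of that calculation in R (not the original code; it reuses the patterns vector from the earlier sketch, and a pattern never seen in the original order would score negative infinity):

all_codes = apply(expand.grid(0:1, 0:1, 0:1), 1, paste, collapse = "")   # the 8 possible patterns

reorder_patterns = function(p, ord) sapply(strsplit(p, ""), function(x) paste(x[ord], collapse = ""))

loglik_of_order = function(ord) {
  probs  = table(factor(patterns, levels = all_codes)) / length(patterns)       # original frequencies
  counts = table(factor(reorder_patterns(patterns, ord), levels = all_codes))   # counts after reordering
  dmultinom(as.numeric(counts), prob = as.numeric(probs), log = TRUE)
}

loglik_of_order(c(1, 2, 3))   # the original order scores best
loglik_of_order(c(3, 2, 1))   # the flip is the only other order that comes close
loglik_of_order(c(2, 1, 3))   # the improbable swap scores far worse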


order    log-likelihood
1 2 3            -19.92
1 3 2           -558.85
2 1 3           -596.12
2 3 1           -597.38
3 1 2           -583.60
3 2 1            -45.92

We see quantitative evidence that reinforces our visual proof and our intuition. The reverse order, "3 2 1", is more similar to the original order, "1 2 3", than any other possible transformation.

Let us consider a wider frame of reference.

Up until now, we have considered a very limited set of transformations, each possible order of three pixels. Now let's focus on a wider region. We will continue to use the three pixels as before, but we will compare them to a wider field, the 12 surrounding pixels, highlighted below.

[Figure: a digit with the original three pixels and the 12 surrounding pixels highlighted]

What transformations are likely?

Now let us consider each set of three pixels from this wider field in each of their possible orders. As before, we will use the multinomial distribution to determine how similar each set and order is to our original three pixels.
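
A brute-force sketch of that search in R (illustrative only): assume field is an n-images by 15 matrix of 0/1 pixel values whose first three columns are the original pixels and whose remaining columns are the surrounding twelve.

all_codes = apply(expand.grid(0:1, 0:1, 0:1), 1, paste, collapse = "")
code_of   = function(cols) factor(apply(field[, cols], 1, paste, collapse = ""), levels = all_codes)
ref_probs = as.numeric(table(code_of(1:3))) / nrow(field)   # frequencies of the original three pixels

score = function(cols) dmultinom(as.numeric(table(code_of(cols))), prob = ref_probs, log = TRUE)

# Every ordered triple of distinct pixels drawn from the 15, ranked by likelihood.
# (Slow but simple; 15 * 14 * 13 = 2730 candidates.)
triples = expand.grid(1:15, 1:15, 1:15)
triples = as.matrix(triples[apply(triples, 1, function(r) length(unique(r)) == 3), ])
ranked  = triples[order(apply(triples, 1, score), decreasing = TRUE), ]
head(ranked)   # the top candidate is the original 1, 2, 3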

[Figure: the candidate pixel sets and orderings ranked most likely, with the original middle pixel marked in red]

Here we see the sets of pixels most like our original three. The most likely set (our original three itself) occupies the top left corner and the subsequent images show other sets in order of descending likelihood. For clarity, a red dot has been put on the 'middle' pixel, where 'middle' is defined as the middle pixel in the original image.

First, note that the middle pixel always stays in the middle through all of the likely transformations. This is because likely transformations tend to maintain order (except that they may flip entirely as illustrated before).

Next, notice that we see examples of each of the likely transformations that we already know to exist.

  • Rotations are common: vertical and horizontal patterns appear nearly equally often.
  • Reversals are similarly common, though they cannot be seen here because the two ends are indistinguishable.
  • Translations are ubiquitous, as can be clearly seen by how the pixels shift within the region in focus from left to right and top to bottom.
  • Distortions are common: we very often see not-quite-straight lines, bent or stretched.
  • Scaling occurs, though it is rare: only occasionally do we see pixels more than one unit apart. I haven't determined why scaling is rare here. Obviously in the real world scaling is very common, as you can see if you press your face against the screen.
Through all of these transformations the basic pattern holds: the middle pixel in the original image remains the middle through each likely transformation. We see no examples of taffy-stretching contortionism.

What transformations are unlikely?

Next, let's look at the transformations ranked least likely. This should assure you that I have not hoodwinked you by divining patterns that would appear regardless of the ordering.

[Figure: the candidate pixel sets and orderings ranked least likely, with the original middle pixel marked in red]

Here we see the misfits, the unlikely patterns. The layout is as before, except that the top left is now occupied by the least likely transformation.

The thing to notice here is the breakdown of our basic pattern. What was the middle, marked in red, is no longer in the middle. Instead we see the two tails abutting one another and the red middle cast aside. It stands to reason that these transformations are as unlikely as taffy and contortionists.

Conclusion

We understand that images can transform, but computers, generally, are blind to this fact. Here, I have given a simple visual demonstration of how those rules of transformation can be discovered by a computer using real world datasets with little external expertise.

Certainly, huge strides in computational efficiency would be necessary to make this a practical approach to image recognition problems, and it may well be that we can describe the rules of image transformation so well that a computer need never discover them on its own.

However, it is important to realize that such discovery is possible. And this does show practical promise for several reasons:

  1. Not all transformations are easy to describe. Even this simple inquiry uncovered many likely image transformations which are difficult to describe formally. Likely images were stretched and bent in ways that are quite familiar to the eye, but difficult to describe using the geometric transformations we learned in high school. We can only explicitly teach computers those things that we can describe, for the rest, computers, like us, must learn on their own.
  2. Not all datasets are so well understood. I have focused on image and image transformation, a subject both intuitive and well studied. But there are many datasets for which we have much less intuition and understanding. Think for instance of weather systems for which we have only recently developed sophisticated datasets and modeling tools or genetic sequences which are still 95% mysterious to us. In these less-familiar domains we might find that an algorithm which discovers the rules of transformation can quickly outpace the experts who attempt the same.
  3. At some point new rules must be discovered. While we may be able to impart our computers with all the benefit of our expertise in the same way that we impart it upon children, at some point, they must reach the boundaries of what is known. If computers are to join us in the exploration of brave new domains of thought, they must become more adept thinkers. Discovering the rules of transformation as I have illustrated here is part of what makes us intelligent. It is a necessary step to producing truly intelligent machines.

Do they love to learn?

Why our standardized exams should measure student attitude.

Editorial By Ben
Great teachers inspire their students.  They show them the beauty of a subject and ignite within them a burning desire to learn.
The effect of such a teacher reaches far beyond his or her classroom. Our lives are shaped by these people because, once the will to learn is burning bright within us, it continues without the catalytic spark of its creator.
And yet, when we use standardized exams to measure our students, our classrooms, and our education system, we ignore attitude.  The teacher that inspired a lifelong passion for learning is little acknowledged for his or her labor, the fruits of which are spread across the remainder of the student's lifetime.
Why is this? Why do we ignore this fundamental outcome that aligns so closely with our ideal of great teaching?

Perhaps we cannot measure attitude?

At this point, you are probably thinking that we don't measure attitude for a good reason. Several objections come to mind:
  • Perhaps attitude cannot be measured reliably by a multiple choice survey?
  • Perhaps our intuition is wrong -- ability has little to do with attitude?
  • Perhaps measuring attitude is too difficult and time consuming?

Except that we already have.

I could address these plausible objections one by one using a combination of research and reasonable arguments. Fortunately, the Programme for International Student Assessment (PISA) has made my job easy.
In 2012, PISA administered an international Mathematics exam to half a million 15-year-olds around the world. This exam included a 30-minute survey on student attitudes. They have also been gracious enough to provide an extremely good write up of their results.
This quick assessment of attitude was highly informative.  For example, PISA measured students’ math anxiety by asking how much they agree or disagree with statements like, “I get very nervous doing mathematics problems.”  A student who was among the top 15% most anxious math students was likely to perform more than one full grade level below his or her peers with average math anxiety.  
Many attitudes had similar correlations with student success (see table 1).


Scale                          | Example Question                                                     | Difference in performance (grade levels)
Sense of belonging             | I feel like an outsider at school.                                   | 0.2
Perseverance                   | I give up easily.                                                    | 0.5
Openness to problem solving    | I am quick to understand things.                                     | 1
Perceived control of success   | Whether or not I do well in mathematics is completely up to me.      | 1
Intrinsic motivation to learn  | I do mathematics because I enjoy it.                                 | 0.5
Extrinsic motivation to learn  | I will learn many things in mathematics that will help me get a job. | 0.5
Self-efficacy                  | I can understand graphs presented in newspapers.                     | 1.5
Self-concept                   | I learn mathematics quickly.                                         | 1
Anxiety                        | I get very nervous doing mathematics problems.                       | 1
Table 1 - How math attitudes related to math aptitude
Students taking the PISA exam were asked several attitude questions in each of the above categories. A one standard deviation difference in their responses corresponded to the listed grade-level change in performance, where one grade level corresponds to one typical year of improvement. An example of each type of question is provided. (Summarized from data presented in the PISA report.)


Perhaps aptitude already tells us everything we need to know?

Okay, so attitude can be quickly measured and it correlates to aptitude.  But if we are going to include it on our standardized exams, it must provide additional value.  Perhaps aptitude already tells us everything we need to know?

Except that attitude precedes aptitude.

Intuitively we know that changes in attitude regularly precede changes in aptitude.  A student gets inspired, works hard at math, and then becomes better at math.
At PERTS, I work with a group that delivers growth mindset interventions.  These interventions are designed to convince students that the brain is like a muscle: it grows stronger with effort.  We work hard to measure the effect of these interventions both on attitude and aptitude.  The pattern that we see is reliable.  First, students become convinced that the brain is like a muscle, and then their grades go up.  We can predict the long term success of a student earlier and better if we measure their changes in attitude.
Moreover, there is solid evidence that GPA predicts college success better than standardized exam scores (using both is best of all).  Many researchers believe that this is because GPA measures 'non-cognitive factors', things like attitude and behavior. This is additional evidence that our standardized exams would be more powerful instruments if they measured attitude too.
However, more work in this area is needed.  We know that attitude precedes aptitude, but we don’t know by how much or how precisely we could detect the effect.  How much better would our long term predictions of success become if we measured student attitudes?  How much better could we assess the impact of teachers and teaching methodologies?  The answers to these questions require a study on the scale of the PISA exam; one that follows a group of students over an extended time period.

Ok, I'm convinced. What can I do to help?

PISA's work is a great start, and there are others too. But there is still a lot of work for us to do as a society in order to get attitude assessment included as a basic part of our standardized education assessments.
  • Tell your friends - Our education system is ultimately beholden to us. If we as a collective think that measuring student attitude is a priority, then it will become a reality.
  • Measure your students - If you are an educator, measure your students' attitudes and use these measurements to evaluate your impact.  PERTS developed an assessment that can be done online, or you can adapt the PISA pencil-and-paper assessment.
  • Help us quantify the effect of attitude on aptitude - If you help to administer education to large groups of students you could help the most by measuring student attitudes over time.  If someone can prove that measurable changes in attitude regularly precede and predict meaningful changes in aptitude, then the case for measuring attitude will be made much stronger.

And what will the future look like?

Imagine a world where 'teaching to the test' means inspiring your students, igniting within them a burning desire to learn that contributes to happiness and success throughout their entire lives.  This world can be ours if we can learn how to test what matters most.

Nice Virus

On the whole, viruses are actually good for their hosts.

Editorial by Ben
It is a common misconception that most biological viruses are bad for their hosts. The driving reason for this misunderstanding is that the viruses we care about are bad: influenza, HIV, and so on. These viruses hurt us, and we hurt them.
But these pathogenic viruses represent a very, very small minority of the viruses in the world. If you sample seawater you will find ten times more viral particles than bacterial cells. Could all of these packets of RNA and DNA be bad?
The answer is no. What we are calling a virus would be more accurately called a message. Just like messages on the internet, a small minority are pathogenic viruses aimed to hurt their hosts. But the vast majority are welcome messages. If they were not, then a cell, just like a computer, would simply stop listening to external messages.
If the cell is listening to these messages even though some of them are fatal, they must confer some evolutionary advantage. The total advantage of the useful messages must outweigh the detriment of the pathogenic ones. What sort of advantages are these?
In essence viruses play the role of a letter in a critical message exchange system. We don't know everything that RNA and DNA do, but we do know that viruses help to spread useful bits of code between cells. This includes useful proteins, regulatory sequences, and sequences that we do not yet understand the value of.

One in a Thousand


Yesterday I saw a report on an experiment that we were conducting at the Khan Academy.  Students in the experimental group were becoming proficient at more exercises than students in the control group.  The effect was highly significant (p < 0.001)!  Hurray, the experiment worked, time to make a change, right?

Not so fast.  This was an A vs A test, which means that the experimental condition was exactly the same as the control condition.  There should not have been any difference at all.  So what went wrong?

Let’s think about a p value < 0.001.  It means that, if there were no real effect at all, we would see a difference this large less than one time in a thousand.  In other words, we are claiming 1000:1 confidence that the effect is real.  Big confidence, right?  Maybe too good to be true?  How often are we this confident in life?  The answer is seldom.

How often would you take a 1000-to-one sports bet that one team would beat another?  Maybe if the Harlem Globetrotters were playing, except even they lose 1.5% of their games.  Even really low-performing teams usually win more than 1 in 1000 games.  It would have to be a very rare matchup to inspire such confidence.

What if I bet you that when you flip your light switch the light will turn on?  Easy bet to win, until the light bulb burns out.  And how many switches before a light bulb burns out?  Well, if you keep your light on for an hour at a time, an incandescent bulb will burn out after ~1200 switches.  And then there’s the chance of a brownout, a circuit breaker trip, or a rat gnawing on the wires.  A 1000-to-one bet might be reasonable, but just barely.

Now what if we were dealing with a teaching method?  Say I bet you 1000:1 that Montessori schooling is better for students than traditional public schools.  Is that a fair bet?  How could you possibly have the expertise to judge so strongly?

If you see a p value that is highly significant you should get excited.  Not that things are working, but that something is wrong.  Wrong with your statistics, wrong with your experiment, wrong with you.  You should vet the process every which way in search of the source of over-confidence.  

In our case it was bad stats: a few renegade bots were driving up the proficiency count, and our statistical test was brittle in the face of these outliers.  A non-parametric test would not have been confused.  But this is not an essay to extol the virtues of non-parametric tests.  It is an essay extolling the virtues of common sense, skepticism, and rational uncertainty.
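
To make the failure mode concrete, here is a toy simulation in R (invented data, not the actual Khan Academy experiment): two identical conditions, plus a few bot-like outliers in one of them.

control   = rpois(5000, lambda = 3)        # exercises mastered per student, control group
treatment = rpois(5000, lambda = 3)        # identical condition, so there is no real effect
treatment[1:30] = 500                      # a handful of "bots" racking up proficiencies

t.test(treatment, control)$p.value         # the mean-based test is usually wildly "significant"
wilcox.test(treatment, control)$p.value    # the rank-based test typically reports nothing close to p < 0.001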

The Residual Method


Recently I stumbled across an interesting method to leverage lots of partial data in order to accurately find a relationship among a little data.  I doubt I’m the first to come up with such a method, but it’s news to me and my statistically inclined friends, so I figured I’d better write it up.

First some motivation.  Imagine you are working in the promotions department for a new health insurance company.  You want to send out a mailing to potential customers, but you’d like to find customers that are going to be profitable, so you want to estimate their potential health costs and send mail to those whose costs will be low.

You don’t know too much about the people you are mailing to, just their age.  So you want to find the relationship between age and health insurance costs as accurately as possible.  Fortunately your research department has conducted a smallish study of 1000 people where they asked about their real medical costs, ages, and other things.

You can use this real data to figure out the effect of age on medical costs using a simple regression.

Cost ~ Age

Sounds lovely, but clearly age isn’t the only thing that is affecting medical costs.  There are many other factors, for example weight, which will contribute.  You’d be tempted to set up a model that’s a bit more complicated to account for these other factors.

Cost ~ Age + Weight + ...

This model is nice because these other factors account for the noise in your data and may produce a cleaner relationship between age and cost to use for your mailing.  However, there is a problem here because age and weight are related to one another.  Older people tend to be a little heftier than younger people, and so by putting weight in the model you will affect the relationship between cost and age.  This might be fine in some circumstances, but in this case your mailing will have no information on weight, and so the estimated effect of age will be misleading.

This point might not be obvious at first glance, so let’s invent some sample data to help us make the case.  Imagine that your survey data is composed of Cost, Age, and Weight.  Age is a normally distributed random variable.  Weight is the sum of Age and another normally distributed random variable, and Cost is the sum of Age, Weight, and a third error term.  Obviously this data and its relationships are contrived, but no matter, they will serve the point nonetheless.

In R we can build this kind of dataset with the following code.

n    = 1000
mean = 0
sd   = 1

Age = rnorm(n, mean, sd)
Weight = Age + rnorm(n, mean, sd)
Cost = Age + Weight + rnorm(n, mean, sd)

data = data.frame(Age, Weight, Cost)

Now if we build our basic relationship Cost ~ Age we might have:

summary(glm(Cost ~ Age, data=data))

            Estimate Std. Error    
(Intercept) -0.000305   0.046821    
Age          2.055623   0.045261

Basically, Cost is about 2 times Age, give or take a little error.  What happens when we throw Weight into the mix?

summary(glm(Cost ~ Age + Weight, data=data))

           Estimate Std. Error    
(Intercept)  0.01624    0.03260   
Age          1.02703    0.04459
Weight       1.02202    0.03135

You see, we’ve mangled the relationship between age and cost by throwing weight into the mix.  Since we know squat about weight for our mailing, this is quite useless to us, and we’d rather stick with the previous result.

If only we could account for the component of weight that does not depend on age.  Simple, you think!  I will find the relationship between weight and age, then subtract the portion of weight that is explained by age, leaving me with a weight residual.  I’ll add this to the model of cost and get an even better estimate of the relationship between cost and age.  In R this might look like:


data$Predicted = predict(glm(Weight ~ Age, data=data))
data$Residual = data$Weight - data$Predicted

summary(glm(Cost ~ Age + Residual, data=data))

            Estimate Std. Error    
(Intercept) -0.000305   0.032593   
Age          2.055623   0.031508
Residual     1.022017   0.031355

Marvelous!  The standard error on our estimate of age’s effect on cost has gone down!  We must have improved our estimate.  But lo, why is the estimate of the age coefficient exactly the same as it was in our first regression?  Surely if the estimate is a better one it must also be a different one?

It seems we have stumbled upon an old problem in statistics: the problem of the generated regressor (Pagan 1984).  The standard error is now leading us astray.  It is not to be trusted.  The basic problem is that we cannot be certain of the true relationship between weight and age.

But what if we could?  What if we somehow knew what the true relationship between weight and age actually was?  Well, in this case we do, because we made up the data.  Weight is simply Age plus an error term.  So all we have to do is subtract Age from Weight and we have the ‘true’ residual.  How does this kind of residual fare?

data$Residual = data$Weight - data$Age
summary(glm(Cost ~ Age + Residual, data=data))


           Estimate Std. Error    
(Intercept)  0.01624    0.03260    
Age          2.04905    0.03151
Residual     1.02202    0.03135

Now this really is something!  Once again the standard error of the estimate of age has shrunk, but more notably the estimate itself has gotten closer to 2 which is the value we actually know it to be!

If only we knew the real relationship between weight and age we would be in business.  Of course we can never know this real relationship without unlimited data, but what if we simply had a very good estimate?  Imagine that we had access to census records which reported age and weight for the entire country.  Then we could probably get an estimate of the relationship that was quite close to the truth, and we could use this to calculate the residual.  In R we might imagine:


n    = 100000
mean = 0
sd   = 1

Age = rnorm(n, mean, sd)
Weight = Age + rnorm(n, mean, sd)

census <- data.frame(Age, Weight)

data$Predicted = predict(glm(Weight ~ Age, data=census), newdata=data)
data$Residual = data$Weight - data$Predicted

summary(glm(Cost ~ Age + Residual, data=data))

           Estimate Std. Error
(Intercept)  0.01210    0.03260   
Age          2.05207    0.03151
Residual     1.02202    0.03135


How cool is this?  Once again we have a different, better estimate of the relationship between age and cost.  Now when we go to make our mailing, we can make a more informed one.

And that’s the residual method.  We leverage a large body of incomplete data to estimate a more precise relationship in a small amount of full data.  In this example the result is rather simple and of limited utility.  But I think it points in a bright new direction for statistics.

Humans are amazing at leveraging all the knowledge we have of the world to solve small new puzzles.  When a child learns the alphabet, they don’t need to see ten thousand examples before they have it mastered.  Several dozen will do.  This is because the child has seen millions of images in their lifetime, and the ones that present the alphabet are quite distinct.  They don’t need to learn how to account for lighting and orientation because they have solved those problems using the images that came before.  Instead they can focus on the crux of the matter.

The residual method is a step in this direction.  It works by exploiting knowledge of the world to get at the crux of the matter.

Simple English

At PERTS, a student motivation project that I work on, we need to communicate with students.  This means using language that 5th to 12th graders can understand, especially ESL students and those with poor reading skills, who are the main targets of our psychological interventions.

So we read through all the material that we give students and try carefully to compose our instructions in the most universal language we can.  Still, every year we make mistakes.  Teachers give us feedback on words that were confusing for their students.  For example, one classroom didn't know what a 'detention' was because their school did not have detentions.  Detecting these mistakes is hard; it is hard to know which words might trip someone up.

Thinking about Google's n-gram engine, I realized there is a simple solution.  We just needed a tool that would highlight the words in a paragraph that are rare or uncommon.  If we could eliminate the rare words, we would end up with text that more students could understand.

So I did what any self-respecting programmer would do: I googled around to find the tool I needed.  To my complete surprise, I found no such tool on the Internet.  But this kind of thing is easy; Peter Norvig wrote a 20-line spelling corrector, something much more profound than my goal.  So I resolved to create my own.

I needed a corpus of text to find which words were common or rare.  I didn't want to borrow books from Project Gutenberg, as Norvig had done, because these older texts were well over the reading level of my target students.  Instead I pulled some pages from the Simple English Wikipedia, stripped out all that annoying markup, and used this to generate a set of word counts.
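
The actual tool is a few lines of Python, but the core idea can be sketched in R as well. Here corpus_text stands for the scraped Simple English Wikipedia text and the rarity threshold is arbitrary; both are placeholders rather than details of the real tool.

tokenize = function(txt) {
  words = unlist(strsplit(tolower(txt), "[^a-z']+"))
  words[nchar(words) > 0]
}

counts = table(tokenize(corpus_text))                        # word frequencies in the simple corpus

flag_rare = function(draft, min_count = 5) {
  words = unique(tokenize(draft))
  words[is.na(counts[words]) | counts[words] < min_count]    # words the corpus rarely or never uses
}

flag_rare("You will get a detention if you skip the survey.")   # would likely flag 'detention'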

A few lines of Python later and I had a working prototype.  I made it pretty using a little Bootstrap and then pushed it online using Google's App Engine, which Steve Huffman had de-convoluted for me.  As you can see from the screenshot, this tool would have identified 'detention' as a word to be careful of.  Now we know.

For those that would like to try it out, you can find the tool at simpleenglishchecker.appspot.com and the code at github.