At PERTS, a student motivation project that I work on, we need to communicate with students. This means using language that 5th to 12th graders can understand, especially ESL students and those with poor reading skills, who are the main targets of our psychological interventions.
So we read through all the material that we give students and try to write our instructions in the most universal language we can. Still, every year we make mistakes. Teachers give us feedback on words that were confusing for their students. For example, one classroom didn't know what a 'detention' was because their school did not have detentions. Detecting these mistakes is hard because it is difficult to predict which words might trip someone up.
Thinking about Google's ngram engine, I realized there is a simple solution. We just needed a tool that would highlight the words in a paragraph that are rare in everyday text. If we could eliminate the rare words, we would end up with text that more students could understand.
So I did what any self-respecting programmer would do: I googled around to find the tool I needed. To my complete surprise, I found no such tool on the Internet. But this kind of thing is easy; Peter Norvig wrote a 20-line spelling corrector, which is a much more profound achievement than my goal. So I resolved to create my own.
I needed a corpus of text to find which words were common or rare. I didn't want to borrow books from Project Gutenberg, as Norvig had done, because these older texts were well over the reading level of my target students. Instead I pulled some pages from the Simple English Wikipedia, stripped out all that annoying markup, and used the result to generate a set of word counts.
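To give a rough idea of what generating word counts can look like, here is a minimal sketch. It is not the code from the actual tool: the function name `count_words`, the regular expression used to split text into words, and the tiny sample corpus are all assumptions made for illustration.

```python
import re
from collections import Counter

def count_words(text):
    """Return a Counter mapping each lowercased word to how often it appears."""
    # Assumption: a "word" is a run of letters, optionally with internal
    # apostrophes (so "didn't" counts as one word).
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)

# Hypothetical stand-in for the cleaned-up Simple English Wikipedia text.
sample_corpus = "The cat sat on the mat. The dog sat too."
counts = count_words(sample_corpus)
```

With a real corpus, the text would be read from the downloaded pages rather than a string literal. A `Counter` is convenient here because looking up a word it has never seen returns 0 instead of raising an error.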
A few lines of Python later and I had a working prototype. I made it pretty using a little Bootstrap and then pushed it online using Google's App Engine, which Steve Huffman had de-convoluted for me. As you can see from this screenshot, this tool would have identified 'detention' as a word to be careful of. Now we know.
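The core of such a checker can be sketched in a few lines. This is only an illustration under stated assumptions, not the prototype's real code: the function name `highlight_rare_words`, the `min_count` threshold, the `**` markers, and the made-up word counts are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, optionally with internal apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def highlight_rare_words(text, counts, min_count=2):
    """Wrap every word seen fewer than min_count times in the corpus in ** markers."""
    def mark(match):
        word = match.group(0)
        # Words missing from the corpus count as 0 occurrences.
        if counts.get(word.lower(), 0) < min_count:
            return "**" + word + "**"
        return word
    return WORD_RE.sub(mark, text)

# Hypothetical counts; real ones would come from the corpus.
corpus_counts = Counter({"you": 50, "will": 40, "get": 30, "a": 90,
                         "if": 25, "are": 45, "late": 5})
result = highlight_rare_words("You will get a detention if you are late.", corpus_counts)
print(result)  # You will get a **detention** if you are late.
```

A web version would wrap the rare words in HTML tags instead of `**`, and the threshold would need tuning against the size of the corpus.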
If you would like to try it out, you can find the tool at simpleenglishchecker.appspot.com and the code on GitHub.