Fun For Word Nerds: On the Google Ngram Viewer

I really like stuff like this. Google’s Ngram Viewer is a rather cool new data visualization tool, drawing on an impressive corpus from the Google book-scanning project: some 5.2 million scanned books, roughly 500 billion words.

Researchers at Harvard are applying some innovative quantitative approaches to humanities studies, tracking how frequently words occur in literature across different historical periods. Note the language employed to describe these projects, which is symbolically telling: research through scanned books becomes one part archaeology (a digital “fossil record” of human culture) and one part life sciences (in search of a “cultural genome”). The team of academics working on this project has dubbed the data-driven approach “culturomics,” and its potential scope is intriguingly broad, delving into topics such as “humanity’s collective memory, the adoption of technology, the dynamics of fame, and the effects of censorship and propaganda.” The ambition is considerable. From The Harvard Gazette: “It is the largest data release in the history of the humanities, the authors note, a sequence of letters 1,000 times longer than the human genome. If written in a straight line, it would reach to the moon and back 10 times over.” Some of the early findings may confirm what some of us already suspect (“Humanity is forgetting its past faster with each passing year”).
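For the curious, the core idea behind this kind of word-frequency tracking is simple to sketch: raw match counts for a word are normalized by the total number of words the corpus contains for each year, so that the growth of publishing itself doesn’t masquerade as a trend. The numbers below are made-up illustrative values, not real corpus data, and the function name is my own, not part of any Google tool:

```python
# A minimal sketch of ngram-style frequency tracking.
# All counts here are hypothetical, invented for illustration.

# year -> number of times a hypothetical word appears in books from that year
word_counts = {1900: 120, 1950: 940, 2000: 2210}

# year -> total words in the corpus for that year (also hypothetical)
total_counts = {1900: 1_000_000, 1950: 4_000_000, 2000: 8_000_000}

def relative_frequency(word_counts, total_counts):
    """Normalize raw match counts by corpus size for each year."""
    return {year: word_counts[year] / total_counts[year]
            for year in word_counts}

freqs = relative_frequency(word_counts, total_counts)
for year, f in sorted(freqs.items()):
    print(f"{year}: {f:.6%}")
```

Plotting those normalized values over time is essentially what the Ngram Viewer’s charts show.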

The Atlantic surveys some wry observations on the Ngram Viewer (for example, that we’re “Quicker to Adopt Technology and Forget Celebrities” than previous generations were. Cause, or correlation? Maybe technology simply makes it easier for us to re-remember and subsequently forget celebrities).

And as much as I love books for all that we can discover within them, there’s a spot-on observation from one of the researchers on the limits of what can be learned from quantitative analysis of books (and only books):

“Books are not representative of culture as a whole, even if our corpus contained 100% of all books ever published. Only certain types of people write books and get them published, and that subclass has changed over time, with the advent of things like public literacy.” Eventually, he says, the database will have to include “newspapers, manuscripts, maps, artwork, and a myriad of other human creations.”

For perspective, a great bit of book-scanning trivia from The New York Times (“In 500 Billion Words, New Window on Culture”):

“So far, Google has scanned more than 11 percent of the entire corpus of published books, about two trillion words.”

On the academic side of things, some well-known humanities scholars have shown cautious approval of the Google Ngram project and its possibilities, among them Harvard Library’s Robert Darnton and Harvard psychology professor Steven Pinker (though Louis Menand would like to see some book historians get in on the action, too).

