How Did Google Count All of the Books in the World?

How does one go about counting all of the books in the world? First, let’s check out the Google Books Blog (“Books of the world, stand up and be counted! All 129,864,880 of you”).

[N.B. — for a shorter version of all of this, you can check out Wired’s piece (“How Google Counted The World’s 129 Million Books”).]

To start, Google addresses one of my favorite philosophical questions: what do we mean by “book”?

“Just how many books are out there?”

“Well, it all depends on what exactly you mean by a “book.” We’re not going to count what library scientists call “works,” those elusive “distinct intellectual or artistic creations.” It makes sense to consider all editions of “Hamlet” separately, as we would like to distinguish between — and scan — books containing, for example, different forewords and commentaries.”

Looking at this from the book publishers’ side, you’d be surprised how easily and how often different book editions are carelessly marked.

“One definition of a book we find helpful inside Google when handling book metadata is a “tome,” an idealized bound volume. A tome can have millions of copies … or can exist in just one or two copies (such as an obscure master’s thesis languishing in a university library). This is a convenient definition to work with, but it has drawbacks. For example, we count hardcover and paperback books produced from the same text twice, but treat several pamphlets bound together by a library as a single book.”

ISBNs (International Standard Book Numbers) are one logical source of book information, but they can only tell us so much about the number of books in the world, for a number of reasons: ISBNs have only been in use since the 1960s; they are mostly a US and European convention; and many published books that weren’t initially intended to be made commercially available simply don’t have ISBNs. Not to mention, as Google Books rightfully points out, that ISBNs are used in very non-standard ways: “They have sometimes been assigned to multiple books: we’ve seen anywhere from two to 1,500 books assigned the same ISBN. They are also often assigned to things other than books. Even though they are intended to represent “books and book-like products,” unique ISBNs have been assigned to anything from CDs to bookmarks to t-shirts.”

Then, of course, there are other sources of metadata, such as the Library of Congress (Library of Congress Control Numbers) or OCLC (WorldCat accession numbers), but these two are far from perfect and, as it turns out, are rife with redundancies: “So what does Google do? We collect metadata from many providers (more than 150 and counting) that include libraries, WorldCat, national union catalogs and commercial providers. At the moment we have close to a billion unique raw records. We then further analyze these records to reduce the level of duplication within each provider, bringing us down to close to 600 million records.”
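The deduplication step Google describes (collapsing near-identical raw records into far fewer unique ones) can be sketched very loosely in Python. The field names and the matching key below are my own illustrative assumptions, not Google’s actual algorithm:

```python
# A toy sketch of bibliographic record deduplication: group raw metadata
# records by a normalized (title, author, year) key and keep one
# representative per group. Real pipelines are vastly more sophisticated.

def normalize(value: str) -> str:
    """Lowercase and keep only letters/digits, so 'H.P. Lovecraft'
    and 'H. P. Lovecraft' reduce to the same key."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

def deduplicate(records):
    """Collapse records that share a normalized (title, author, year) key."""
    seen = {}
    for rec in records:
        key = (normalize(rec["title"]), normalize(rec["author"]), rec.get("year"))
        seen.setdefault(key, rec)  # keep the first record seen for each key
    return list(seen.values())

records = [
    {"title": "At the Mountains of Madness", "author": "H.P. Lovecraft", "year": 1971},
    {"title": "At the mountains of madness!", "author": "H. P. Lovecraft", "year": 1971},
    {"title": "Hamlet", "author": "William Shakespeare", "year": 1948},
]
print(len(deduplicate(records)))  # prints 2: the two Lovecraft records collapse
```

Even this toy version shows why the problem is hard: the key you choose decides what counts as “the same book,” which is exactly the judgment call Google’s engineers are wrestling with.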

Book cataloguing has not historically been an exact science, which is what makes counting all of the world’s books so daunting. How could a computer resolve an ambiguity like the one in the following example?

“We tend to rely on publisher names, as they are cataloged, even less. While publishers are very protective of their names, catalogers are much less so. Consider two records for “At the Mountains of Madness and Other Tales of Terror” by H.P. Lovecraft, published in 1971. One claims that the book it describes has been published by Ballantine Books, another that the publisher is Beagle Books. Is this one book or two? This is a mystery, since Beagle Books is not a known publisher. Only looking at the actual cover of the book will clear this up. The book is published by Ballantine as part of “A Beagle Horror Collection”, which appears to have been mistakenly cataloged as a publisher name by a harried librarian. We also use publication years, volume numbers, and other information.”

While clearly this is far from exact, here is the official Google Books count, after all of the algorithmic flim-flam and redundancy parsing: “After we exclude serials, we can finally count all the books in the world. There are 129,864,880 of them.” About 130 million books. To be continued, though.

— — — — — — — — — — —

Ars Technica (“Google’s count of 130 million books is probably bunk”) takes a more academically interesting look at Google Books, and reviews some of the issues linguists, librarians, and other scholars have taken with the metadata of Google Books. Unlike other observers, I didn’t find the Ars Technica piece to be a hatchet job on the Google Books effort. As mentioned above, Google’s metadata isn’t perfect, and neither is the material it has to work with: “And how many errors must be corrected and subtle fixes made in between printings before a “new printing” gets promoted to a “new edition” — the answer can vary from publisher to publisher and from work to work.”

This is an old problem — dating back to whenever people first started caring about organizing books in some sort of methodical manner — but rather than be frustrated with what is or isn’t being done right now, why wouldn’t we be hopeful about how this presents a new way to tackle a very old problem?

“In the end, most of the “metadata problems” that Google’s engineers are trying to solve are very, very old. Distinguishing between different editions of a work, dealing with mistitled and misattributed works, and sorting out dates of publication — these are all tasks that have historically been carried out by human historians, codicologists, paleographers, library scientists, museum curators, textual critics, and learned lovers of books and scrolls since the dawn of writing. In trying to count the world’s books by identifying which copies of books (or records of books, or copies of records of books, or records of copies of books) signify the “same” printed and bound volume, Google has found itself on the horns of a very ancient dilemma.”

Realistically, no one group (academics, librarians, book historians, Google, etc.) can take on the question of the ‘online universal library’ by itself. So, why not work together? Google has the technical wherewithal, and academics and librarians have the numbers of interested minds and the experience to push things even further. [In fact, I have even more to say on this, but will save that for another day]. “Why not just focus on giving new tools to actual historians, and let them do their thing? The results of a more open, inclusive metadata curation process might never reveal how many books there really are in the world, but they would do a vastly better job of enabling scholars to work with the library that Google is building.”




I run the ThinkLab at the University of Cambridge, and research digital habits, productivity, and wellbeing.


