*Wait a minute... that stuff is hairy. It's Software Studies on its hind legs and walking around like Bigfoot.
"Just as petrified fossils tell us about the evolution of life on earth, the words written in books narrate the history of humanity. They words tell a story, not just through the sentences they form, but in how often they occur. Uncovering those tales isn’t easy – you’d need to convert books into a digital format so that their text can be analysed and compared. And you’d need to do that for millions of books.
"Fortunately, that’s exactly what Google have been doing since 2004. Together with over 40 university libraries, the internet titan has thus far scanned over 15 million books, creating a massive electronic library that represents 12% of all the books ever published. All the while, a team from Harvard University, led by Jean-Baptiste Michel and Erez Lieberman Aiden have been analysing the flood of data.
"Their first report is available today. Although it barely scratches the surface, it’s already a tantalising glimpse into the power of the Google Books corpus. It’s a record of human culture, spanning six centuries and seven languages. It shows vocabularies expanding and grammar evolving. It contains stories about our adoption of technology, our quest for fame, and our battle for equality. And it hides the traces of tragedy, including traces of political suppression, records of past plagues, and a fading connection with our own history.
"As the team says, the corpus “will furnish a great cache of bones from which to reconstruct the skeleton of a new science.” There are strong parallels to the completion of the human genome. Just as that provided an invaluable resource for biologists, Google’s corpus will allow social scientists and humanities scholars to study human culture in a rigorous way. There’s a good reason that the team are calling this field “culturomics”.
"The project began back in 2007, when the duo published a paper showing that verbs become more regular over time...."
"Contrary to warnings about its imminent demise at the hands of teenagers and Americans, English is booming. In the last 50 years, its vocabulary has expanded by over 70% and around 8500 words are being added every year. The team worked this out by scanning the corpus words for solo words that turned up at least once per billion. They took random samples and culled any non-words (“l8r”), typos and foreign words. By the end, they estimated that English had 544,000 words in 1900, rising to 1,022,000 in 2000 (see above left)
"Dictionaries aren’t keeping pace with this rapid change. Over half of the words added to the American Heritage Dictionary in 2000 were already part of the English language a century ago. There are plenty of missing words too. The current Oxford English Dictionary only has 615,000 solo words, and even proper nouns and compound words can’t explain the gulf between that and the million-plus count from the corpus.
"Instead, it seems that modern dictionaries aren’t very good at including rarer words. Both the OED and Merriam-Webster comprehensively list words that are found once in every hundred thousand words, but they only had a quarter of one-in-a-billion words (see above right). Missing words include technical words like ’aridification’ (the process by which a geographic region becomes dry) and obscure ones like ‘slenthem’ (a musical instrument). This hidden lexicon may be rare but it’s also massive, accounting for around 52% of the English words. The majority of our vocabulary isn’t documented in the big dictionaries...."
"The corpus allows you to chart the rise and fall of people, as well as verbs and dates. Michel and Lieberman-Aiden found that today’s stars, at the height of their celebrity, are more famous than their historical predecessors, but they’re being forgotten more quickly. The team took every one of the 740,000 people with their own Wikipedia pages, removed those who share a name, and sorted the rest by birth date....
"They found that in the early 19th century, celebrities started rising to fame at the age of 43 and it took 8 years for their prominence in books to double. By the mid-20th century, they were starting at 29 and doubling in just over 3 years. However, while the spotlight upon them is more intense, their time in it is briefer. Celebrities tend to peak in fame at the ripe old age of 75 (remember, this is measured by their mentions in books). A century ago, it took 120 years after that for their fame to halve; now, it takes just 71...."