Sunday, August 23, 2015

Big Data Culture

Google announced in 2004 that it intended to digitise all the world’s books. It hasn’t finished yet, but it has already scanned millions of them. From this humongous dataset, Erez Aiden and Jean-Baptiste Michel created a shadow dataset containing a record for each word and phrase (an “n-gram”) in those books. Each record consists of a list of numbers showing how frequently that particular n-gram appeared in books, year after year. With this dataset, they can computationally analyse many fascinating questions in language, culture and history.
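To make the idea of a "record" concrete, here is a minimal sketch in Python of how per-year counts for each n-gram could be built. It is only an illustration, not how Aiden and Michel or Google actually built theirs: the function names `ngrams` and `build_records` are made up, the three-line corpus is invented, and the sketch stores raw counts rather than frequencies normalised by the number of words published each year.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as one space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def build_records(books, max_n=2):
    """Map each n-gram to a Counter of {year: count}.

    `books` is a list of (year, text) pairs. Tokenisation here is a plain
    lowercase whitespace split, which is an assumption of this sketch.
    """
    records = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                records[gram][year] += 1
    return records


# Toy corpus, invented purely for illustration.
books = [
    (1935, "Shanghai is busy and Shanghai is loud"),
    (1995, "Hong Kong and Shanghai trade"),
    (1995, "Hong Kong harbour"),
]

records = build_records(books)
years = sorted({year for year, _ in books})

# One list of numbers per n-gram, year after year.
shanghai_series = [records["shanghai"][y] for y in years]
hong_kong_series = [records["hong kong"][y] for y in years]
```

Plotting such lists against the years, one line per term, gives the kind of chart discussed in this post.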


For example, they plot the use of the term “天安门” in books written in the simplified Chinese characters used in Mainland China, and the term “Tiananmen” in books written in English. In 1976, China’s ruling Gang of Four cracked down on protests and public mourning in Tiananmen Square, which were spurred by the death of the venerated premier Zhou Enlai. The 1976 incident leaves a huge fingerprint in the Chinese n-gram record, with a massive spike in mentions of 天安门, but it hardly registered in the English books. In 1989, another massive protest in Tiananmen Square was violently cleared. It received massive attention in English books. In the Chinese books, however, it generated a much smaller spike, and mentions soon returned to pre-1989 levels, while in the English-speaking world the interest has barely abated. Each of us can draw our own conclusions as to whether this is due to censorship or some other reason.


Google has made a similar dataset available on the Google Books Ngram Viewer web site. There, for example, one can compare the appearances of the terms Hong Kong, Shanghai, Beijing and Peking over the past 200 years. In the 1930s and early 1940s, Shanghai was receiving a lot of attention, but it fell into relative obscurity after 1949, when the Communists took over. Since the 1990s, it has been rising again. Peking has had its own ups and downs, but since the 1990s the city has increasingly been called Beijing. As for Hong Kong, interest has kept rising. We can now study language, culture and history digitally, or computationally.


We already have plans to use such techniques in our own work. Thank you, Aiden and Michel.



