Moving Beyond Social Media Towards News As “Big Data” In The Cloud Era

When we talk today about using “big data” to understand human society, we typically mean using “social media” data. In the social sciences “big data” and “social media” have become increasingly interchangeable due to the widespread availability of large social media datasets and their machine-friendly distribution firehoses and archives. Yet, accessible social media reflects only the narrowest windows of human society, blinding us to many of the areas and events we are most interested in. In contrast, other datasets like traditional news media far surpass both the geographic and temporal reach of social media. These alternative data sources remain largely inaccessible to most data mining efforts due to the lack of observational time series datasets and tools that would allow researchers to ask the questions they are most interested in. How might we broaden the world of “big data” societal research?

Social media datasets have become one of the dominant “big data” sources in the social sciences due to their widespread availability and machine-friendly distribution. Simple streaming JSON APIs offer realtime monitoring right out of the box, and the vast and continually growing ecosystem of tools and workflows makes it easy for researchers to get started.

On the other hand, accessible social data is extraordinarily biased and reflects only a small portion of human society. The most widely used social data, Twitter, captures just a minute fraction of the world’s voices.

Yet social media data certainly meets the common definitions of “big data,” with its immense size, rapid update rate and diverse media and message types.

Partly this is due to the commercial companies behind those social platforms, which emphasize their importance and viability through the size of their data holdings. Rather than reporting robust usage and engagement statistics that would shed light on their overall health, social media companies tout how many hundreds of petabytes of user data their servers hold, using file size as a replacement for meaningful metrics. By 2014 Facebook advertised that its data warehouse held more than 300 petabytes and grew at a rate of 4 petabytes per day.

In turn, this focus on data size rather than contents has made its way into the research community, with breathless press releases announcing that new social media research datasets will “rival the total amount of data that currently exists in the social sciences.”

By contrast, journalism data is typically seen as mundane and microscopic. After all, the entirety of the New York Times’ total output from 1945 to 2005 consisted of just 5.9 million articles totaling 2.9 billion words. In contrast, a month of the Twitter Decahose in 2012 contained 2.8TB of data, including 112.7GB of text containing over 14.3 billion words.

In other words, extrapolating from the Decahose’s roughly 10% sample of the full firehose, in 2012 Twitter published more words each day than the entire New York Times did over the preceding half century.
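The arithmetic behind that comparison can be sketched as follows. It rests on the Decahose being a roughly 10% sample of the full Twitter firehose and assumes a 30-day month; the figures are the ones quoted above.

```python
# Figures quoted above: one month of the 2012 Decahose vs. NYT 1945-2005.
decahose_words_per_month = 14_300_000_000  # words in a month of the Decahose
nyt_words_1945_2005 = 2_900_000_000        # total NYT output, 1945-2005

# The Decahose is a ~10% sample, so scale up to the full firehose,
# then divide by an assumed 30-day month.
full_twitter_words_per_day = decahose_words_per_month / 0.10 / 30

print(f"{full_twitter_words_per_day / 1e9:.1f} billion words/day")  # 4.8 billion
print(full_twitter_words_per_day > nyt_words_1945_2005)             # True
```

Roughly 4.8 billion words a day against 2.9 billion words in six decades of the Times.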

Of course, in the New York Times each article appears just once, whereas on Twitter all it takes is a click to retweet an entire post, meaning a 10-word tweet retweeted a hundred thousand times counts as a million words. Moreover, an unknown fraction of Twitter consists of content produced by bots and automatically generated responses, meaning Twitter data is far less rich, less reflective of society, and has a far lower signal-to-noise ratio than the Times’ professional reporting.

In short, word counts as a measure of data size are less than meaningful when those word counts may consist of the same exact post shared millions of times. For some analyses that repetition is a powerful signal of public or algorithmic interest, but in terms of new information, the totality of Twitter’s textual output is far from rich.

File sizes are also a largely irrelevant reporting metric. In the case of the Twitter Decahose dataset above, the total data received from GNIP was 2.8TB. However, the majority of that file size came from the large number of metadata fields and the use of the JSON file format. The actual text itself came out to only 112.7GB, just 4% of the total.

The shift from the machine-optimized, minimize-size-at-all-costs file formats of the early computing era towards verbose, human- and interchange-friendly formats like JSON has dramatically increased the size of the datasets we work with today, independent of their actual informational content. A particularly loquacious JSON format that separates every single datapoint into its own distinct field in proper JSON compliance could easily inflate some of my open data GDELT Project datasets from the multi-terabyte range to the multi-petabyte range. One experiment in developing a fully self-contained verbose JSON encoding for GDELT’s 10TB Global Knowledge Graph yielded a JSON dataset nearly a petabyte in size. While a verbose JSON file is easily understandable and readily imported into nearly any storage or analytic platform today, the jump from 10TB to 1PB brings its own unique challenges. Of course, JSON need not be so verbose, and there are many ways of making JSON representations far more compact, but the steady trend today is towards exceptionally verbose and wasteful data schemas.
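A toy illustration of how this inflation happens. The record and both schemas below are hypothetical inventions for this sketch, not GDELT’s actual formats; they simply encode the same three datapoints compactly and then with every value broken into its own self-describing field.

```python
import json

# Hypothetical record: an observation date plus a latitude/longitude pair.
# "compact" packs the three datapoints into terse fields; "verbose" gives
# every value its own nested, self-describing structure.
compact = {"d": "20180701", "lat": 48.85, "lon": 2.35}
verbose = {
    "record": {
        "observationDate": {"year": 2018, "month": 7, "day": 1},
        "location": {
            "latitude": {"value": 48.85, "units": "degrees"},
            "longitude": {"value": 2.35, "units": "degrees"},
        },
    }
}

compact_bytes = len(json.dumps(compact))
verbose_bytes = len(json.dumps(verbose))
# Same information, several times the bytes.
print(compact_bytes, verbose_bytes, round(verbose_bytes / compact_bytes, 1))
```

Even in this tiny example the verbose encoding is roughly four times the size of the compact one; compounded over billions of records, the same information balloons from terabytes into petabytes.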

It is important to recognize that modern file formats like JSON ensure that data is human accessible and easily exchangeable across data management and analytic platforms, but at an enormous cost in terms of the disk and computing power required to work with them.

What do large social sciences datasets look like today? Last year Facebook announced its inaugural research dataset, eventually to consist of “a petabyte of data with almost all public URLs Facebook users globally have clicked on, when, and by what types of people.” As of July 2018, the dataset was estimated to contain 30 billion rows covering activity since January 1, 2017 and, once constructed, to grow at a rate of 2 million unique URLs across 300 million posts per week.

While the dataset is described as eventually containing “almost all public URLs Facebook users globally have clicked on” since January 2017 that were shared by at least 20 people, 30 billion rows is a remarkably small number compared to the 4.75 billion pieces of content Facebook users were already sharing each day by May 2013.

On the one hand, 30 billion records is an enormous dataset compared to the total output of the world’s journalistic endeavors of the past decades. On the other, the majority of those shares represent the kinds of information-sparse “event stream” activities that have become the most common way of examining society. While a news article must describe an event or perspective, follow the standards of expression and grammar of the language it is written in, and fully express the relevant details in a cohesive, self-contained narrative that is both informative and entertaining, a social media post might be nothing more than a hyperlink. Millions of people might all share that link, doing nothing more than reposting it or adding a word or two of description. It is the rare social media user indeed who writes a 500-word treatise for every link they share on Facebook. In fact, the majority of links shared on social media are never even read by the person sharing them, merely forwarded on blindly by title alone.

This consolidation of expression as a form of information compression means that while the volume of social media may be vastly greater than traditional journalism, the actual novel informational content contained in all that content may be far less.

Social media data from platforms like Twitter and Facebook tends to be augmented by an array of enrichment metadata, from the make and model of the device that posted the content to the user’s time zone. This information is largely structured, helpfully broken into discrete fields, arrays and hash structures that make it trivial to process. Social-specific metaphors like hashtags similarly summarize and consolidate complex topics and events into simple discrete informational objects that can be readily extracted and analyzed.

In contrast, journalism is published nearly exclusively as free flowing textual and visual narratives designed for human consumption. Some outlets offer structured annotations in the form of Schema.org markup, but the majority of the world’s news output each day consists only of text and images provided as-is.

Rendering this freeform content into the structured data needed for statistical analytics requires a range of computer algorithms, from natural language systems capable of dealing with the world’s diverse languages to neural networks capable of understanding imagery sourced from across the planet.

Seen in this light, journalistic content is far richer than social media, but more difficult to understand due to the need to extract the natural structure of language and visual metaphors, rather than having the world prestructured according to the worldview of the major social platforms and the commercially-oriented metadata fields they produce.

In 2011 my Culturomics 2.0 study demonstrated how geocoding, entity extraction, sentiment mining and relationship inference could transform a collection of just 100 million news articles into a vast network of more than 100 trillion connections that could forecast the Arab Spring, pinpoint Bin Laden’s location to within 200km and even visualize the natural geographic divisions through which we see the world.

To put it another way, it is not that social media is larger or richer than news. It is that social media has helpfully broken out a set of predefined information into structured fields immediately amenable to analysis, and researchers have largely adjusted their research agendas to fit into those selected dimensions. In contrast, journalism offers what amounts to a highly compressed and non-machine-friendly representation that requires algorithms and approaches capable of extracting all of that richness.

In short, there is an immense richness of data that is contained in journalism, rivaling that of social media, if only we have the creativity to look for it.

Take for example the question of how news outlets manage the myriad realtime editorial and algorithmic decisions governing their homepages. Historically there have been only a small handful of news sites with regular high-resolution archives of their homepages, while web archives have only sporadic snapshots at best. Scholars wishing to study homepage editorial decisions at scale have been hamstrung by this lack of data.

Addressing this, my own open data GDELT Project launched its Global Frontpage Graph in March 2018. Over the last 10 months it has crawled 50,000 major news website homepages spanning all countries and 65 languages every hour on the hour, publishing an hourly catalog of all of the links contained on all 50,000 front pages. To date it has recorded more than 76 billion outlinks to 488 million distinct URLs. Each record contains six attributes, meaning the total dataset comprises nearly half a trillion datapoints.
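The “nearly half a trillion” figure follows directly from the two numbers quoted above:

```python
# Frontpage Graph arithmetic, using the figures quoted above.
outlink_records = 76_000_000_000   # outlinks recorded to date
attributes_per_record = 6          # attributes stored per record

datapoints = outlink_records * attributes_per_record
print(f"{datapoints:,}")  # 456,000,000,000 -- nearly half a trillion
```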

In terms of record count, it is more than two and a half times larger than the Facebook URL dataset, compiled in just half the time.

At first glance, news homepages might not seem like a “big data” information source. After all, they are just web pages. Yet, by transforming those human-oriented webpages into structured observational time series data and combining it with the tools capable of analyzing it at scale, suddenly an entirely new class of research questions becomes answerable.

More to the point, it reminds us that we need not abandon research questions merely because there is not an existing dataset that can address them. With a bit of creativity, we can create entirely new “big data” datasets that directly address society’s most pressing issues, such as how we see the world around us.

What happens when we begin to structure news content through data mining algorithms that can generate the kinds of structured insights we traditionally associate with social data?

In the case of the GDELT Project, over just the last four years it has processed more than 875 million global news articles in 65 languages, transforming them as of July 2018 into a massive structured open dataset: 6.6 billion location mentions encoding 52.9 billion datapoints, 29 billion thematic references encoding 58 billion datapoints, 1.9 billion person and 2.3 billion organization mentions encoding 3.8 billion and 4.7 billion datapoints respectively, and 8.1 billion capitalized phrase mentions totaling 16.3 billion datapoints. Add to this more than 2.3 trillion emotional assessments and the complete dataset encodes more than 2.5 trillion datapoints. Even a relatively small collection of a fraction of a billion news articles can yield trillions of datapoints, not to mention all of the interconnections and relationships encoded within.

What about visual news? How does one tractably research the underlying patterns of a petabyte of television news broadcasts? In the case of the Internet Archive’s Television News Archive, the Archive’s answer was to focus on the textual closed captioning streams of those broadcasts, transforming the then-inaccessible petabyte of video content into ordinary textual news that can be readily searched and analyzed. In turn, as deep learning-powered image recognition algorithms became accurate enough, this gradually expanded into pilot experiments cataloging their visual contents as well.

This reminds us that even breathtakingly large and non-traditional datasets can often be creatively transformed into modalities or subsets that are closely aligned with existing analytic workflows.

While video is growing in importance and ubiquity, traditional photographic imagery still forms the dominant visual representation of online journalism. Online coverage typically contains at least one illustrative or journalistic image.

Leveraging the incredible progress over the past half-decade in deep learning image recognition algorithms, it is possible today to process this imagery and render it computable in much the same way we have historically addressed text. Indeed, using Google’s Cloud Vision API, the GDELT Project has processed nearly half a billion news images over the past three years totaling almost a quarter trillion pixels and transformed them into 321 billion datapoints.

Of course, simply having access to such immense datasets is only half the problem. The real challenge is how to actually make sense of datasets so large that few data scientists have the workflows and skillsets to readily work with them. Even within the rarified world of social scientists comfortable using academic supercomputing resources, datasets with trillions of points or weighing in at many terabytes to petabytes are rarely seen.

Yet, to the commercial cloud, a petabyte is a rather ordinary number. Even startups now routinely manage petabytes, and multi-petabyte datasets have become so common they are almost a commodity. The cloud has made working with petabyte datasets so easy that Google’s BigQuery platform can perform a table scan over an entire petabyte in just under 3.7 minutes. At that rate, a population-scale analysis of the Internet Archive’s complete 20-year 15PB web archive would take just under 56 minutes.
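A quick back-of-the-envelope check of that scaling, assuming scan time grows linearly with bytes scanned (the 3.7-minutes-per-petabyte figure is the one cited above):

```python
# Linear scaling assumption: scan time proportional to bytes scanned.
minutes_per_petabyte = 3.7   # BigQuery full table scan of 1 PB (figure above)
archive_petabytes = 15       # Internet Archive's 20-year web archive

scan_minutes = minutes_per_petabyte * archive_petabytes
print(round(scan_minutes, 1))  # 55.5 -- "just under 56 minutes"
```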

In the commercial cloud all it takes to perform a petascale analysis is a single line of SQL, with results returned in minutes. Using reserved instances, companies can run essentially wall-to-wall petascale analyses of their data 24 hours a day for a flat rate cost, instantly scaling up as needed for special burst analyses. Instead of cobbling together fragile workflows that must be built fresh for every analysis, the public cloud provides all the tools for robust modern cloud era analytics.

Social media datasets may seem extraordinarily massive to academics unfamiliar with the digital world, but they are actually relatively small compared to the totality of the world’s data.

Why then do we still talk of social media as “big data” and most other social sciences and humanities datasets like news content as “small data?”

Partly this simply reflects the disconnect between the social sciences and the vast modern digital world. I myself still routinely interact with computational social scientists who view gigabyte datasets as beyond their means.

Partly this reflects that few in the research world have the expertise to build large observational datasets of their own. We think of journalism as small data because we haven’t thought of all the ways we can understand it.

Partly this is because the for-profit companies behind social media platforms have taught us to think of data through the lens of size, rather than insight. They’ve taught us to see the world through a specific set of monetizable dimensions they provide us, rather than take the leap to create new datasets that more directly answer the questions we care most about.

Putting this all together, we live in a world today in which “big data” all too often means “social media.” We’ve become accustomed to seeing the world through existing datasets, using the predefined dimensions offered by their social media creators, selected for their monetization potential rather than their ability to answer society’s most pressing questions. The structured schemas and machine-friendly formats of social data have made them the darlings of the research community, while the vastly richer contents of media like journalism have been largely absent due to the algorithmic intervention and human creativity needed to recover their richer but more compressed structure. Large datasets require tools and workflows nearly entirely absent from the academic community, but commonplace in the purpose-built commercial cloud, where a single line of SQL can plow through a petabyte in minutes.

In the end, perhaps if we step back and look upon the world with a bit more inquisitiveness and creativity, we will see there is so much more than just Twitter and Facebook and the cloud is there to help us explore it.
