Mining Hope:

Preserving and Exploring Twitter Data for Digital Visual Studies

Content and Scale

In recent history, the scale of digital data has changed dramatically. Paradoxically, digital content and its trace data have grown so large that they have toggled from being considered an ever-expanding problem (what has colloquially been called information overload or the data deluge) to a new commodity and resource of value. Read almost any industry handbook on data science or decision making for business management, and you will likely find the following phrase: data is the new oil, a conception of data that inverts older information-overload axioms such as “too many books, too little time” into a commodified, if not dystopian, motto.

In Future Shock, the book credited with popularizing “information overload” in the early 1970s, Alvin Toffler predicted the data-driven turn that has transformed data into a hot commodity. As Toffler explains, “Rational behavior, in particular, depends on a ceaseless flow of data from the environment. It depends upon the power of the individual to predict, with at least fair success, the outcome of his own actions.” Of course, Toffler admits this simple schema breaks down when “the individual is plunged into a fast and irregularly changing situation, or a novelty-loaded context.” In response to this problem, Toffler suggests that in order to compensate for higher speeds of change and greater amounts of information, a person "must scoop up and process far more information than before. […] In short, the more rapidly changing and novel the environment, the more information the individual needs to process in order to make effective, rational decisions” (180). In this conception, information overload transforms information, or data, into a commodity. Rather than an overload or surplus devaluing information and data as a whole (as supply-and-demand approaches to valuation would suggest), the surplus has instead, paradoxically, made data comparable in value to a natural resource we cannot live without.

Data has not just changed in quantity (and value) over the last few decades, of course. It has also changed in terms of how it is produced and stored, shaping what we consider digital content to be and, ultimately, how we study it. It took approximately 14 years, from 1991 to 2005, for global Internet access to reach 1 billion people, and by 2005 many aspects of the Document Object Model for rendering web pages on computer screens (allowing for dynamic websites and interactive cloud applications built with scripting languages such as JavaScript) began to receive full support from the major web browsers. By 2006, Myspace had become the most visited website in the United States, YouTube was acquired by Google, Facebook opened to the general public, Twitter officially launched, Reddit opened its news posts to interactive commenting, and Wikipedia reached 1 million articles and began shifting its focus to improving the quality of its crowd-sourced content. While the influence of Napster and peer-to-peer file sharing on 2005’s emerging “Web 2.0” environment should not be downplayed, one phrase seems more indicative of this new era of digital technology than all the others: the era of user-generated content.

With the advance of digital photography and mobile devices, the era of user-generated content gave rise to an immense expansion of digital data. Even though Apple was still a year away from releasing the first iPhone, and smartphone sales would not outpace standard mobile phone sales for another 7 years, in 2006 digital cameras captured approximately 150 billion images and mobile phones snapped an additional 100 billion. The transition to digital image files (as opposed to printed photographs) was so disruptive that Kodak, Sony, Fuji, and Nikon were all forced to undertake fundamental changes in how they provided technologies and services for image capture. Similar changes occurred in video recording and distribution. By 2006 YouTube was already serving nearly 100 million user-generated video streams per day, and in that year more than 1 billion music MP3s were shared over the Internet each day. In 2006 alone, 161 exabytes (161 billion gigabytes) of digital content and data were generated, equivalent to roughly 3 million times the total data contained in every book ever written (“Expanding”). Two decades earlier, in 1986, the total amount of global digital content and data in storage was 2.6 exabytes; by the end of 2006 this had grown to 295 exabytes in total storage, equal to 61 CD-ROM discs of data per person globally. By 2010, global digital storage had amassed 1,227 exabytes of data, outpacing previous estimates by 239 exabytes, and by 2020 global digital storage is estimated to reach 40,000 exabytes, equivalent to 5.2 terabytes for every human on the planet (“Digital”).

Of course, the total amount of global data generated is not simply a product of the data attributed to digital content or digital artifacts. As the industry study cited above notes, “the amount of information individuals create themselves—writing documents, taking pictures, downloading music, etc.—is far less than the amount of information being created about them in the digital universe” (“Digital,” emphasis added). Every query conducted within a search engine, every liked post on a social network, every visited website and clicked hyperlink, every digital purchase and online transaction, and every written document, shared image, and streamed video creates data about user behavior and activity—and about the artifacts themselves. Thus, as we head into the 6th decade of public networked computing—from the walled gardens of the intranets to the early years of the Internet, and then from the early Internet to the era of user-generated content—we are witnessing a new era of networked computing and digital content: the era of data-driven content. Browsing, surfing, and searching have been replaced with algorithms, filters, and feeds. Rather than humans searching the Internet or scanning social networks to discover digital visual content, new content is delivered to users directly through programs that "learn" their favorite types of content or through filters that select content based on the topics and categories users “engage” most frequently. And these programs respond to the data collected from user profiles, search histories, purchase patterns, viewing habits, likes, comments, interactive behaviors, and from a multitude of sensor data provided by mobile phones and other internet-connected devices—producing even more data.

Such massive amounts of data demand that digital visual studies scholars develop research methods that keep pace with the scales of content being produced every day by citizens and businesses alike. The term scale connotes balance, as in the scales of justice or a scale that weighs an object. For example, computer graphics are scalable when they retain their initial clarity as digitally rendered images expand or change significantly in size on a screen. In art, when a sketch or model is drawn to scale, its proportionality stays intact when the artist paints the sketch as a large mural or carves it into an immense stone statue. Broadly speaking, conventional systems and methods scale well within normal deviations in size, but with massive, exponential increases in scale, the same systems and methods fail to remain proportionally effective. For example, common techniques for constructing a twenty-floor office building will not erect a mile-high skyscraper. Propulsion systems that once pushed a spacecraft to the moon are incapable of transporting humans to the nearest neighboring star system. And of course, email, long the standard of digital communication, now seems as impotent as postal mail in a world with over 4.3 billion internet users. The lesson of scale is this: all systems and methods will likely fail to remain proportionally effective at dramatically different scales. Put simply, no system or method will scale infinitely. Therefore, if we as researchers want to develop effective descriptions of visual content, we must continue to invest in methods that keep pace with the ever-increasing scale of data about that content.

In both history and literature, substantial progress has been made in applying historical and literary theories to the new scales made available through resources such as the Google Books corpus, Project Gutenberg, and HathiTrust, and in applying data-driven tools and methods to the large-scale analysis of historical documents and texts. Matthew Jockers’s Macroanalysis: Digital Methods and Literary History, for instance, provides an extensive rationale for data-driven research in literary studies, and Jockers has gone so far as to produce a companion textbook to supplement this work: Text Analysis with R for Students of Literature. Likewise, in Exploring Big Historical Data: The Historian’s Macroscope, Shawn Graham, Ian Milligan, and Scott Weingart provide a comprehensive overview of data-driven methods for historians, building on other valuable resources in this area, such as the archive of tutorials available at The Programming Historian. Digital visual studies scholars have much to learn from such work in thinking about how to implement macroscopic methods in our own research. Gries’s work with circulation and “tracking” Obama Hope, in particular, acts as a precursor and a methodological bridge for seriously considering how to apply data-driven methods to pressing questions in digital visual studies. While Gries’s work introduces data-driven methods to account for the rhetorical life of Obama Hope, Gries has admittedly only gestured toward the possibilities these tools make available.

If digital visual studies is to flourish, we need to keep pushing to improve our ability to study digital networks, the artifacts that circulate within them, and the massive amounts of data they produce. Given the problems with obsolescence, replication, peer review, and access discussed in the previous section, this will mean continuing to study both the business models and the development practices that produce new digital technologies, so that we know when and how the networks change, since such changes directly affect how we study them. Data-driven methods have developed alongside digital technologies precisely to enable such studies: at their most fundamental level, they are designed to provide feedback on digital activities of all kinds, including substantive changes to networks and to the underlying systems and programs that shape our access to data and, ultimately, the stories and knowledge we are able to generate about digital visual artifacts.

Admittedly, studying the networks themselves remains a larger, long-term goal for macroscopic methodologies. Yet we can begin working toward that goal by improving our fluency with data-driven tools in general, applying them to the types of research we already conduct, with an eye toward inventing research practices that account for the artifacts and phenomena we are already studying, but at much larger scales. Macroscopic methods (here defined as exploratory and descriptive statistics) enable such research and push us to ask: what aspects of working with our massive image collections can be automated? When researching visual artifacts such as Obama Hope, how can we use already-available metadata, captions, and comments to aid our coding, interpretations, and analyses? And, finally, how can we make these practices reproducible, so that other researchers can make use of our programming scripts, data processing techniques, and visualization practices when testing our results (peer review) or when devising new ways to analyze or compare visual data?
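To make these questions more concrete, the following minimal sketch (written in Python with the pandas library) illustrates what a small, reproducible descriptive workflow might look like. The file name obama_hope_tweets.csv and its column names are hypothetical placeholders for the kind of tweet collection a researcher might assemble; they are not the dataset used in this study.

```python
import pandas as pd

# Hypothetical CSV of collected tweets; the column names below
# ("created_at", "text", "hashtags") are illustrative assumptions.
tweets = pd.read_csv("obama_hope_tweets.csv", parse_dates=["created_at"])

# Descriptive statistic: how many collected tweets appear in each year?
tweets_per_year = tweets["created_at"].dt.year.value_counts().sort_index()
print(tweets_per_year)

# Exploratory statistic: which hashtags most often accompany the artifact?
# Assumes hashtags are stored as a semicolon-separated string per tweet.
hashtags = (
    tweets["hashtags"]
    .dropna()
    .str.lower()
    .str.split(";")
    .explode()
    .str.strip()
)
print(hashtags.value_counts().head(20))
```

Because each step is recorded in a script rather than performed by hand, another researcher can rerun the same operations on the same data, or on new data, which is what makes the practice open to peer review.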

In the following section, we demonstrate how exploratory and descriptive methods can help make sense of the reactions, responses, and comments about visual artifacts on social media. In particular, we wanted to know what topics related to Obama Hope emerged on Twitter between 2008 and 2016, the years of Obama’s administration. In consultation with Gries, we explore what the data has to say about Obama Hope and how it offers useful insights into the artist behind the image, Shepard Fairey. We begin by defining exploratory and descriptive methods, then identify the methods we used to produce our results and visuals, and finally present and comment on our findings.
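As a preview of the kind of exploratory work the next section describes, the short sketch below (again in Python, reusing the hypothetical obama_hope_tweets.csv file and columns from the previous example) tallies how often a handful of illustrative terms appear in tweet text for each year between 2008 and 2016. The term list is an assumption chosen for demonstration, not the coding scheme applied in our actual analysis.

```python
import pandas as pd

tweets = pd.read_csv("obama_hope_tweets.csv", parse_dates=["created_at"])

# Restrict the collection to the 2008-2016 window of interest.
window = tweets[
    (tweets["created_at"] >= "2008-01-01") & (tweets["created_at"] <= "2016-12-31")
].copy()
window["year"] = window["created_at"].dt.year

# Illustrative terms only; a real study would derive its terms from the data.
terms = ["hope", "poster", "fairey", "parody"]
for term in terms:
    window[term] = window["text"].str.contains(term, case=False, na=False)

# Yearly frequency table: rows are years, columns are counts of tweets
# mentioning each term at least once.
yearly_counts = window.groupby("year")[terms].sum()
print(yearly_counts)
```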
