The Problems of Blackboxed Underlying Digital Data
Blackboxing, to be clear, occurs when technologies, and their underlying systems, are opaque (difficult to understand or make sense of), proprietary (closed to access for financial or legal reasons), or closed off from critique and accountability in some other way. One problem with blackboxing is that without adequate access to underlying data, academic researchers have a difficult time studying and acknowledging the harmful contaminants emerging from the shift to data-driven content: bots, click farms, ideological echo chambers, information cascades, the unintended consequences of algorithmic filtering, the surveillance economy, privacy infringements, and fake news/misinformation campaigns. When it comes to the circulation and consequentiality of Obama Hope, for instance, bots and algorithms have likely played a major role in driving the image’s mobility and virality. As digital visual studies scholars, we argue that without understanding how such technologies impact such phenomena, we run the risk of misrepresenting how those phenomena come to be—misrepresentations that, in turn, shape the theories and knowledge we generate and that may ultimately shape entire disciplines.
Another problem has to do with the obsolescence of the tools we use to study visual data due to the venture capital business model used to develop many technologies—obsolescence that impacts not only our ability to access researchers’ data visualizations but also our ability to replicate their data-driven research methods. In her data visualizations of Obama Hope, for instance, Gries used Google Fusion Tables to store her data and generate the geographical maps. Because Google decided to no longer support this technology, Google Fusion Tables are no longer available, and the data visualizations in Gries’ article are now broken (Gries). Gries did not realize the risks these tools posed at the time she published the article, but some of the data and results she shares in that article are now incomplete and inaccessible for new readers. This is a direct result of the business model used by Google and others to develop and forward data-driven technologies. This “free” venture capital model—where users are given free access to tools and technologies while companies build and test the tools as “proof of concept” for investors—has disastrous consequences for how we make and share knowledge in digital visual studies.1
As another example, in 2015, Eunsong Kim published “The Politics of Trending” to demonstrate how Twitter’s trend monitoring algorithms were ignoring—and potentially suppressing—trends associated with #blacklivesmatter. Kim used Topsy—a “free” cloud tool for analyzing the data within various social networks—to conduct her research. Topsy, however, is no longer usable. Apple purchased Topsy in 2013 for $200 million and then proceeded to shut Topsy down entirely in December of 2015, making it no longer possible to replicate Kim’s methods for our own research. The former director of business development at Topsy, Aaron Hayes-Roth—who was unhappy with Apple’s decision to close Topsy—wrote an editorial trying to answer a simple question: why would Apple pay $200 million for a business like Topsy, only to close its doors two years later? According to Hayes-Roth, Apple wanted to use Topsy’s data mining technology and comprehensive Twitter dataset to compete with Google’s search/advertising services on Apple’s mobile devices. The erasure of Kim’s methods was thus a consequence of Apple trying to compete with Google. Unfortunately, this example is not an isolated incident—it is a direct consequence of the venture capital business model that drives the development of most digital technologies. The sarcastic profile description on Topsy’s still-available Twitter page summarizes the problem well: “Every Tweet ever published. Previously at your fingertips” (emphasis added).
Certainly, some free cloud tools, such as Voyant (available both as a cloud application and as standalone open source software), are being developed precisely for academic research and provide an immense service for extending data science methods to digital visual studies. Voyant’s free cloud application provides natural language analytics that are easy to use, and we regularly use this tool in our classrooms and workshops to teach text mining and to introduce data-driven methods. Many other cloud tools like SumAll also appear useful for studying the circulation of digital visual artifacts. However, while useful, most of these tools were not built for peer-reviewed academic research—they were built explicitly to service a Big Data, venture capital business model. For example, SumAll’s justification for providing free access to data analytics is as follows: “Since we’re still working on building great products for small businesses, SumAll is free. Think of this as our soft launch—no credit cards and no subscriptions required.” SumAll is, in effect, telling users their free access will not be sustained. Furthermore, we should look to the future and ask: how long until SumAll is acquired, fundamentally changed, or no longer available?2 Similarly, we might ask: if digital visual studies produces research using data from SumAll, and SumAll is then acquired and closed like Topsy, how can future scholars trust research that relies on data visuals that cannot be replicated, tested, or otherwise evaluated? Cloud applications like SumAll usually do not provide users with access to the underlying data and the methods used to produce their analytics—as these are considered “proprietary”—and, therefore, users of Big Data tools simply have to trust the visuals (analytics) presented to them. This may not be a problem for business data visualization practices, but it is a problem for academic research that relies so heavily on methodological replication and peer review. How, for instance, can digital visual studies scholars peer review data visuals that are based on proprietary, blackboxed methods?
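To make the contrast concrete, consider how little code is required to reproduce the kind of term-frequency analytics a tool like Voyant surfaces, with every step of the method open to inspection, replication, and peer review. The sketch below is only an illustration, written in base R; the input file name is hypothetical and stands in for any plain-text collection of posts or documents a researcher has gathered.

# A minimal, fully inspectable term-frequency routine in base R.
# Unlike a proprietary cloud analytic, every step of the "method" is
# visible here and can be rerun by reviewers against the same data.
# "obama_hope_posts.txt" is a hypothetical file with one document per line.
texts <- tolower(readLines("obama_hope_posts.txt", warn = FALSE))

# Split on anything that is not a letter or apostrophe, then drop empties
tokens <- unlist(strsplit(texts, "[^a-z']+"))
tokens <- tokens[nchar(tokens) > 0]

# Sorted term frequencies: the raw material of a Voyant-style word list
term_freq <- sort(table(tokens), decreasing = TRUE)
head(term_freq, 20)

Nothing here is hidden behind a service agreement: the tokenization choices are visible in the code itself, which is exactly what peer review of a data-driven method requires.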
Finally, and this is something we have written about at length (Beveridge), the business model discussed throughout this section often limits access to the proprietary digital data we need to do digital visual studies at scale. In March of 2018, for instance, Facebook’s primary third-party data provider, Datasift, was acquired by Meltwater, a business analytics firm specializing in data mining technologies. Similarly, Twitter’s primary data partner GNIP, which Twitter purchased in 2014, has now been fully merged with Twitter’s new “Enterprise” developer services. While Facebook continues to systematically reduce free public access to its data, Twitter has shown renewed efforts to work closely with developers in addressing the many democratic challenges facing digital networks. This is a good sign for digital humanities researchers, as Twitter and GNIP have provided many useful datasets for answering challenging questions about the changing nature of social movements in digital environments and the effects of networking technologies on democratic discourse. However, the public availability of proprietary digital data (like Twitter data) for academic research has not significantly improved since 2015.
In our results section, we show how MassMine and R help address these accessibility issues by providing sustainable, open source options for exploratory and descriptive social media research. But first, we describe the governing methodology of our research, which we identify as extending macroscopic methodologies that have been emerging throughout the humanities. Macroscopic methodologies, as we define them here, refer to theories about digital data, scale, information, and content that are constantly evolving due to the rapid rate of technological change. These methodologies are vital to inventing and adapting the methods of data science for digital visual studies. The humanities, broadly speaking, have utilized various neologisms in discussing and applying these methods—such as distant reading, computational methods, or procedural practices—and outside the humanities other terminologies have been used to refer to data-driven methods, like the fourth paradigm, informatics, or eScience. The expanding utility of these methods in academic research is based on a simple yet powerful pragmatics: many of the systematic aspects of analysis or observation currently undertaken by human reading, or other forms of human observation, may be supplemented by computer programs that automate those aspects and make them reproducible for analyzing large collections of digital artifacts. In other words, the methods made available by data science and data mining toolsets continue to expand in value because they enable us to more accurately study digital artifacts at scale.
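As a brief illustration of this pragmatics, the sketch below, written in base R, automates one small act of “reading”: counting how many posts in a previously collected dataset mention a given phrase on each day. The file name and column names are hypothetical stand-ins for whatever a collection tool such as MassMine might export; the point is only that, once scripted, the procedure can be rerun verbatim across millions of posts.

# A small illustration of automating one systematic aspect of "reading":
# counting daily mentions of a phrase across a collection of posts.
# "collected_posts.csv" and its columns ("created_at", "text") are
# hypothetical; created_at is assumed to hold ISO-formatted dates.
posts <- read.csv("collected_posts.csv", stringsAsFactors = FALSE)

# Flag posts whose text mentions the phrase of interest
mentions <- grepl("obama hope", tolower(posts$text), fixed = TRUE)

# Aggregate the flags by calendar day to get a daily mention count
daily <- aggregate(mentions, by = list(day = as.Date(posts$created_at)), FUN = sum)
names(daily)[2] <- "mentions"
daily[order(daily$day), ]

A human reader could, in principle, perform the same count by hand; the script simply makes that act of observation repeatable at scales no human reader could manage.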
The following section delves into theories about digital data, information, content, and scale in order to explain why the ever-expanding size of digital data—the massive scales introduced by data-driven technologies—calls for digital visual studies to adopt and revise data science methods for our own research practices. Lev Manovich’s work with cultural analytics has made significant strides in observing, analyzing, and interpreting large datasets of visual images. As Manovich defines it, cultural analytics develops new data mining and data visualization research practices to analyze large collections of visual, multimodal, and other born-digital artifacts. Manovich’s Cultural Analytics Lab has studied the growth of image sharing on Twitter using 270 million geo-coded images, and his lab has conducted large-scale analyses of Instagram, asking: “What do millions of Instagram photographs tell us about the world?” These impressive studies show the immense potential of macroscopic methods for digital visual studies, but we argue that digital visual studies will benefit from expanding its repertoire of data-driven methods to include exploratory and descriptive statistics that also enable us to understand the conversations, contexts, and histories of posts that impact the circulation of visual artifacts. As the next section will show, the exigence for data-driven methods lies not only in the benefit they provide in exploring and describing the data we collect but also in their capacity to deal with the massive new scales introduced by data-driven technologies.
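To suggest what such exploratory and descriptive statistics might look like in practice, the short sketch below (again in base R, with a hypothetical file and column name) tallies the hashtags that co-occur with collected posts, offering a rough picture of the conversations and contexts in which an image circulates.

# A rough exploratory summary of conversational context: which hashtags
# co-occur with posts in a collected dataset? The file name and the
# "text" column are hypothetical.
posts <- read.csv("collected_posts.csv", stringsAsFactors = FALSE)

# Pull every hashtag out of every post
hashtags <- tolower(unlist(regmatches(posts$text, gregexpr("#\\w+", posts$text))))

# The most frequent co-occurring hashtags
head(sort(table(hashtags), decreasing = TRUE), 15)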
1. See Beyond the Camera Panopticon, wherein Aral Balkan explores the many problems stemming from the “free” model of technological development (Timestamp: 35:02–40:00).
2. Indeed, since this chapter was originally drafted, our prediction has been proven correct. SumAll is no longer available and the domain has been shut down.