Twitter Archives and the Challenges of "Big Social Data" for Media and Communication Research




big data, social media, internet studies

How to Cite

Burgess, J., & Bruns, A. (2012). Twitter Archives and the Challenges of "Big Social Data" for Media and Communication Research. M/C Journal, 15(5).
Vol. 15 No. 5 (2012): list
Published 2012-10-11

Lists and Social Media

Lists have long been an ordering mechanism for computer-mediated social interaction. While far from being the first such mechanism, blogrolls offered an opportunity for bloggers to provide a list of their peers; the present generation of social media environments similarly provide lists of friends and followers. Where blogrolls and other earlier lists may have been user-generated, the social media lists of today are more likely to have been produced by the platforms themselves, and are of intrinsic value to the platform providers at least as much as to the users themselves; both Facebook and Twitter have highlighted the importance of their respective “social graphs” (their databases of user connections) as fundamental elements of their fledgling business models. This represents what Mejias describes as “nodocentrism,” which “renders all human interaction in terms of network dynamics (not just any network, but a digital network with a profit-driven infrastructure).”

The communicative content of social media spaces is also frequently rendered in the form of lists. Famously, blogs are defined in the first place by their reverse-chronological listing of posts (Walker Rettberg), but the same is true for current social media platforms: Twitter, Facebook, and other social media platforms are inherently centred around an infinite, constantly updated and extended list of posts made by individual users and their connections.

The concept of the list implies a certain degree of order, and the orderliness of content lists as provided through the latest generation of centralised social media platforms has also led to the development of more comprehensive and powerful, commercial as well as scholarly, research approaches to the study of social media. Using the example of Twitter, this article discusses the challenges of such “big data” research as it draws on the content lists provided by proprietary social media platforms.

Twitter Archives for Research

Twitter is a particularly useful source of social media data: using the Twitter API (the Application Programming Interface, which provides structured access to communication data in standardised formats) it is possible, with a little effort and sufficient technical resources, for researchers to gather very large archives of public tweets concerned with a particular topic, theme or event. Essentially, the API delivers very long lists of hundreds, thousands, or millions of tweets, and metadata about those tweets; such data can then be sliced, diced and visualised in a wide range of ways, in order to understand the dynamics of social media communication.

Such research is frequently oriented around pre-existing research questions, but is typically conducted at unprecedented scale. The projects of media and communication researchers such as Papacharissi and de Fatima Oliveira, Wood and Baughman, or Lotan, et al.—to name just a handful of recent examples—rely fundamentally on Twitter datasets which now routinely comprise millions of tweets and associated metadata, collected according to a wide range of criteria. What is common to all such cases, however, is the need to make new methodological choices in the processing and analysis of such large datasets on mediated social interaction.

Our own work is broadly concerned with understanding the role of social media in the contemporary media ecology, with a focus on the formation and dynamics of interest- and issues-based publics. We have mined and analysed large archives of Twitter data to understand contemporary crisis communication (Bruns et al), the role of social media in elections (Burgess and Bruns), and the nature of contemporary audience engagement with television entertainment and news media (Harrington, Highfield, and Bruns).

Using a custom installation of the open source Twitter archiving tool yourTwapperkeeper, we capture and archive all the available tweets (and their associated metadata) containing a specified keyword (like “Olympics” or “dubstep”), name (Gillard, Bieber, Obama) or hashtag (#ausvotes, #royalwedding, #qldfloods). In their simplest form, such Twitter archives are commonly stored as delimited (e.g. comma- or tab-separated) text files, with each of the following values in a separate column:

 text:  contents of the tweet itself, in 140 characters or less
 to_user_id: numerical ID of the tweet recipient (for @replies)
 from_user:  screen name of the tweet sender 
 id:  numerical ID of the tweet itself
 from_user_id: numerical ID of the tweet sender
 iso_language_code: code (e.g. en, de, fr, ...) of the sender’s default language
 source: client software used to tweet (e.g. Web, Tweetdeck, ...)
 profile_image_url:  URL of the tweet sender’s profile picture 
 geo_type: format of the sender’s geographical coordinates
 geo_coordinates_0: first element of the geographical coordinates
 geo_coordinates_1:  second element of the geographical coordinates
 created_at:  tweet timestamp in human-readable format
 time:  tweet timestamp as a numerical Unix timestamp

In order to process the data, we typically run a number of our own scripts (written in the programming language Gawk) which manipulate or filter the records in various ways, and apply a series of temporal, qualitative and categorical metrics to the data, enabling us to discern patterns of activity over time, as well as to identify topics and themes, key actors, and the relations among them; in some circumstances we may also undertake further processes of filtering and close textual analysis of the content of the tweets. Network analysis (of the relationships among actors in a discussion; or among key themes) is undertaken using the open source application Gephi. While a detailed methodological discussion is beyond the scope of this article, further details and examples of our methods and tools for data analysis and visualisation, including copies of our Gawk scripts, are available on our comprehensive project website, Mapping Online Publics.

In this article, we reflect on the technical, epistemological and political challenges of such uses of large-scale Twitter archives within media and communication studies research, positioning this work in the context of the phenomenon that Lev Manovich has called “big social data.” In doing so, we recognise that our empirical work on Twitter is concerned with a complex research site that is itself shaped by a complex range of human and non-human actors, within a dynamic, indeed volatile media ecology (Fuller), and using data collection and analysis methods that are in themselves deeply embedded in this ecology.

“Big Social Data”

As Manovich’s term implies, the Big Data paradigm has recently arrived in media, communication and cultural studies—significantly later than it did in the hard sciences, in more traditionally computational branches of social science, and perhaps even in the first wave of digital humanities research (which largely applied computational methods to pre-existing, historical “big data” corpora)—and this shift has been provoked in large part by the dramatic quantitative growth and apparently increased cultural importance of social media—hence, “big social data.” As Manovich puts it:

For the first time, we can follow [the] imaginations, opinions, ideas, and feelings of hundreds of millions of people. We can see the images and the videos they create and comment on, monitor the conversations they are engaged in, read their blog posts and tweets, navigate their maps, listen to their track lists, and follow their trajectories in physical space. (Manovich 461)

This moment has arrived in media, communication and cultural studies because of the increased scale of social media participation and the textual traces  that this participation leaves behind—allowing researchers, equipped with digital tools and methods, to “study social and cultural processes and dynamics in new ways” (Manovich 461). However, and crucially for our purposes in this article, many of these scholarly possibilities would remain latent if it were not for the widespread availability of Open APIs for social software (including social media) platforms. APIs are technical specifications of how one software application should access another, thereby allowing the embedding or cross-publishing of social content across Websites (so that your tweets can appear in your Facebook timeline, for example), or allowing third-party developers to build additional applications on social media platforms (like the Twitter user ranking service Klout), while also allowing platform owners to impose de facto regulation on such third-party uses via the same code. While platform providers do not necessarily have scholarship in mind, the data access affordances of APIs are also available for research purposes.

As Manovich notes, until very recently almost all truly “big data” approaches to social media research had been undertaken by computer scientists (464). But as part of a broader “computational turn” in the digital humanities (Berry), and because of the increased availability to non-specialists of data access and analysis tools, media, communication and cultural studies scholars are beginning to catch up. Many of the new, large-scale research projects examining the societal uses and impacts of social media—including our own—which have been initiated by various media, communication, and cultural studies research leaders around the world have begun their work by taking stock of, and often substantially extending through new development, the range of available tools and methods for data analysis. The research infrastructure developed by such projects, therefore, now reflects their own disciplinary backgrounds at least as much as it does the fundamental principles of computer science. In turn, such new and often experimental tools and methods necessarily also provoke new epistemological and methodological challenges.

The Twitter API and Twitter Archives

The Open API was a key aspect of mid-2000s ideas about the value of the open Web and “Web 2.0” business models (O’Reilly), emphasising the open, cross-platform sharing of content as well as promoting innovation at the margins via third-party application development—and it was in this ideological environment that the microblogging service Twitter launched and experienced rapid growth in popularity among users and developers alike. As José van Dijck cogently argues, however, a complex interplay of technical, economic and social dynamics has seen Twitter shift from a relatively open, ad hoc and user-centred platform toward a more formalised media business:

For Twitter, the shift from being primarily a conversational communication tool to being a global, ad-supported followers tool took place in a relatively short time span. This shift did not simply result from the owner’s choice for a distinct business model or from the company’s decision to change hardware features. Instead, the proliferation of Twitter as a tool has been a complex process in which technological adjustments are intricately intertwined with changes in user base, transformations of content and choices for revenue models. (van Dijck 343)

The specifications of Twitter’s API, as well as the written guidelines for its use by developers (Twitter, “Developer Rules”) are an excellent example of these “technological adjustments” and the ways they are deeply interwined with Twitter’s search for a viable revenue model. These changes show how the apparent semantic openness or “interpretive flexibility” of the term “platform” allows its meaning to be reshaped over time as the business models of platform owners change (Gillespie).

The release of the API was first announced on the Twitter blog in September 2006 (Stone), not long after the service’s launch but after some popular third-party applications (like a mashup of Twitter with Google Maps creating a dynamic display of recently posted tweets around the world) had already been developed. Since then Twitter has seen a flourishing of what the company itself referred to as the “Twitter ecosystem” (Twitter, “Developer Rules”), including third-party developed client software (like Twitterific and TweetDeck), institutional use cases (such as large-scale social media visualisations of the London Riots in The Guardian), and parasitic business models (including social media metrics services like HootSuite and Klout).

While the history of Twitter’s API rules and related regulatory instruments (such as its Developer Rules of the Road and Terms of Use) has many twists and turns, there have been two particularly important recent controversies around data access and control. First, the company locked out developers and researchers from direct “firehose” (very high volume) access to the Twitter feed; this was accompanied by a crackdown on free and public Twitter archiving services like 140Kit and the Web version of Twapperkeeper (Sample), and coincided with the establishment of what was at the time a monopoly content licensing arrangement between Twitter and Gnip, a company which charges commercial rates for high-volume API access to tweets (and content from other social media platforms). A second wave of controversy among the developer community occurred in August 2012 in response to Twitter’s release of its latest API rules (Sippey), which introduce further, significant limits to API use and usability in certain circumstances.

In essence, the result of these changes to the Twitter API rules, announced without meaningful consultation with the developer community which created the Twitter ecosystem, is a forced rebalancing of development activities: on the one hand, Twitter is explicitly seeking to “limit” (Sippey) the further development of API-based third-party tools which support “consumer engagement activities” (such as end-user clients), in order to boost the use of its own end-user interfaces; on the other hand, it aims to “encourage” the further development of “consumer analytics” and “business analytics” as well as “business engagement” tools.

Implicit in these changes is a repositioning of Twitter users (increasingly as content consumers rather than active communicators), but also of commercial and academic researchers investigating the uses of Twitter (as providing a narrow range of existing Twitter “analytics” rather than engaging in a more comprehensive investigation both of how Twitter is used, and of how such uses continue to evolve). The changes represent an attempt by the company to cement a certain, commercially viable and valuable, vision of how Twitter should be used (and analysed), and to prevent or at least delay further evolution beyond this desired stage. Although such attempts to “freeze” development may well be in vain, given the considerable, documented role which the Twitter user base has historically played in exploring new and unforeseen uses of Twitter (Bruns), it undermines scholarly research efforts to examine actual Twitter uses at least temporarily—meaning that researchers are increasingly forced to invest time and resources in finding workarounds for the new restrictions imposed by the Twitter API.

Technical, Political, and Epistemological Issues

In their recent article “Critical Questions for Big Data,” danah boyd and Kate Crawford have drawn our attention to the limitations, politics and ethics of big data approaches in the social sciences more broadly, but also touching on social media as a particularly prevalent site of social datamining. In response, we offer the following complementary points specifically related to data-driven Twitter research relying on archives of tweets gathered using the Twitter API.

First, somewhat differently from most digital humanities (where researchers often begin with a large pre-existing textual corpus), in the case of Twitter research we have no access to an original set of texts—we can access only what Twitter’s proprietary and frequently changing API will provide.

The tools Twitter researchers use rely on various combinations of parts of the Twitter API—or, more accurately, the various Twitter APIs (particularly the Search and Streaming APIs). As discussed above, of course, in providing an API, Twitter is driven not by scholarly concerns but by an attempt to serve a range of potentially value-generating end-users—particularly those with whom Twitter can create business-to-business relationships, as in their recent exclusive partnership with NBC in covering the 2012 London Olympics.

The following section from Twitter’s own developer FAQ highlights the potential conflicts between the business-case usage scenarios under which the APIs are provided and the actual uses to which they are often put by academic researchers or other dataminers:

Twitter’s search is optimized to serve relevant tweets to end-users in response to direct, non-recurring queries such as #hashtags, URLs, domains, and keywords. The Search API (which also powers Twitter’s search widget) is an interface to this search engine. Our search service is not meant to be an exhaustive archive of public tweets and not all tweets are indexed or returned. Some results are refined to better combat spam and increase relevance. Due to capacity constraints, the index currently only covers about a week’s worth of tweets. (Twitter, “Frequently Asked Questions”)

Because external researchers do not have access to the full, “raw” data, against which we could compare the retrieved archives which we use in our later analyses, and because our data access regimes rely so heavily on Twitter’s APIs—each with its technical quirks and limitations—it is impossible for us to say with any certainty that we are capturing a complete archive or even a “representative” sample (whatever “representative” might mean in a data-driven, textualist paradigm).

In other words, the “lists” of tweets delivered to us on the basis of a keyword search are not necessarily complete; and there is no way of knowing how incomplete they are. The total yield of even the most robust capture system (using the Streaming API and not relying only on Search) depends on a number of variables: rate limiting, the filtering and spam-limiting functions of Twitter’s search algorithm, server outages and so on; further, because Twitter prohibits the sharing of data sets it is difficult to compare notes with other research teams.

In terms of epistemology, too, the primary reliance on large datasets produces a new mode of scholarship in media, communication and cultural studies: what emerges is a form of data-driven research which tends towards abductive reasoning; in doing so, it highlights tensions between the traditional research questions in discourse or text-based disciplines like media and communication studies, and the assumptions and modes of pattern recognition that are required when working from the “inside out” of a corpus, rather than from the outside in (for an extended discussion of these epistemological issues in the digital humanities more generally, see Dixon).

Finally, even the heuristics of our analyses of Twitter datasets are mediated by the API: the datapoints that are hardwired into the data naturally become the most salient, further shaping the type of analysis that can be done. For example, a common process in our research is to use the syntax of tweets to categorise it as one of the following types of activity:

original tweets:  tweets which are neither @reply nor retweet
retweets:  tweets which contain RT @user… (or similar)
       unedited retweets: retweets which start with RT @user
       edited retweets: retweets do not start with RT @user
genuine @replies: tweets which contain @user, but are not retweets
URL sharing: tweets which contain URLs

(Retweets which are made using the Twitter “retweet button,” resulting in verbatim passing-along without the RT @user syntax or an opportunity to add further comment during the retweet process, form yet another category, which cannot be tracked particularly effectively using the Twitter API.)

These categories are driven by the textual and technical markers of specific kinds of interactions that are built into the syntax of Twitter itself (@replies or @mentions, RTs); and specific modes of referentiality (URLs). All of them focus on (and thereby tend to privilege) more informational modes of communication, rather than the ephemeral, affective, or ambiently intimate uses of Twitter that can be illuminated more easily using ethnographic approaches: approaches that can actually focus on the individual user, their social contexts, and the broader cultural context of the traces they leave on Twitter.


In this article we have described and reflected on some of the sociotechnical, political and economic aspects of the lists of tweets—the structured Twitter data upon which our research relies—which may be gathered using the Twitter API.

As we have argued elsewhere (Bruns and Burgess)—and, hopefully, have begun to demonstrate in this paper—media and communication studies scholars who are actually engaged in using computational methods are well-positioned to contribute to both the methodological advances we highlight at the beginning of this paper and the political debates around computational methods in the “big social data” moment on which the discussion in the second part of the paper focusses.

One pressing issue in the area of methodology is to build on current advances to bring together large-scale datamining approaches with ethnographic and other qualitative approaches, especially including close textual analysis. More broadly, in engaging with the “big social data” moment there is a pressing need for the development of code literacy in media, communication and cultural studies.  In the first place, such literacy has important instrumental uses: as Manovich argues, much big data research in the humanities requires costly and time-consuming (and sometimes alienating) partnerships with technical experts (typically, computer scientists), because the free tools available to non-programmers are still limited in utility in comparison to what can be achieved using raw data and original code (Manovich, 472).

But code literacy is also a requirement of scholarly rigour in the context of what David Berry calls the “computational turn,” representing a “third wave” of Digital Humanities. Berry suggests code and software might increasingly become in themselves objects of, and not only tools for, research:

I suggest that we introduce a humanistic approach to the subject of computer code, paying attention to the wider aspects of code and software, and connecting them to the materiality of this growing digital world. With this in mind, the question of code becomes increasingly important for understanding in the digital humanities, and serves as a condition of possibility for the many new computational forms that mediate our experience of contemporary culture and society. (Berry 17)

A first step here lies in developing a more robust working knowledge of the conceptual models and methodological priorities assumed by the workings of both the tools and the sources we use for “big social data” research. Understanding how something like the Twitter API mediates the cultures of use of the platform, as well as reflexively engaging with its mediating role in data-driven Twitter research, promotes a much more materialist critical understanding of the politics of the social media platforms (Gillespie) that are now such powerful actors in the media ecology.


Berry, David M. “Introduction: Understanding Digital Humanities.” Understanding Digital Humanities. Ed. David M. Berry. London: Palgrave Macmillan, 2012. 1-20.

boyd, danah, and Kate Crawford. “Critical Questions for Big Data.” Information, Communication & Society 15.5 (2012): 662-79.

Bruns, Axel. “Ad Hoc Innovation by Users of Social Networks: The Case of Twitter.” ZSI Discussion Paper 16 (2012). 18 Sep. 2012 ‹›.

Bruns, Axel, and Jean Burgess. “Notes towards the Scientific Study of Public Communication on Twitter.” Keynote presented at the Conference on Science and the Internet, Düsseldorf, 4 Aug. 2012. 18 Sep. 2012

Bruns, Axel, Jean Burgess, Kate Crawford, and Frances Shaw. “#qldfloods and @QPSMedia: Crisis Communication on Twitter in the 2011 South East Queensland Floods.” Brisbane: ARC Centre of Excellence for Creative Industries and Innovation, 2012. 18 Sep. 2012 ‹›

Burgess, Jean E. & Bruns, Axel (2012) “(Not) the Twitter Election: The Dynamics of the #ausvotes Conversation in Relation to the Australian Media Ecology.” Journalism Practice 6.3 (2012): 384-402

Dixon, Dan. “Analysis Tool Or Research Methodology: Is There an Epistemology for Patterns?” Understanding Digital Humanities. Ed. David M. Berry. London: Palgrave Macmillan, 2012. 191-209.

Fuller, Matthew. Media Ecologies: Materialist Energies in Art and Technoculture. Cambridge, Mass.: MIT P, 2005.

Gillespie, Tarleton. “The Politics of ‘Platforms’.” New Media & Society 12.3 (2010): 347-64.

Harrington, Stephen, Highfield, Timothy J., & Bruns, Axel (2012) “More than a Backchannel: Twitter and Television.” Ed. José Manuel Noguera. Audience Interactivity and Participation. COST Action ISO906 Transforming Audiences, Transforming Societies, Brussels, Belgium, pp. 13-17. 18 Sept. 2012

Lotan, Gilad, Erhardt Graeff, Mike Ananny, Devin Gaffney, Ian Pearce, and danah boyd. “The Arab Spring: The Revolutions Were Tweeted: Information Flows during the 2011 Tunisian and Egyptian Revolutions.” International Journal of Communication 5 (2011): 1375-1405. 18 Sep. 2012 ‹›.

Manovich, Lev. “Trending: The Promises and the Challenges of Big Social Data.” Debates in the Digital Humanities. Ed. Matthew K. Gold. Minneapolis: U of Minnesota P, 2012. 460-75.

Mejias, Ulises A. “Liberation Technology and the Arab Spring: From Utopia to Atopia and Beyond.” Fibreculture Journal 20 (2012). 18 Sep. 2012 ‹›.

O’Reilly, Tim. “What is Web 2.0? Design Patterns and Business Models for the Next Generation of Software.” O’Reilly Network 30 Sep. 2005. 18 Sep. 2012 ‹›.

Papacharissi, Zizi, and Maria de Fatima Oliveira. “Affective News and Networked Publics: The Rhythms of News Storytelling on #Egypt.” Journal of Communication 62.2 (2012): 266-82.

Sample, Mark. “The End of Twapperkeeper (and What to Do about It).” ProfHacker. The Chronicle of Higher Education 8 Mar. 2011. 18 Sep. 2012 ‹›.

Sippey, Michael. “Changes Coming in Version 1.1 of the Twitter API.” 16 Aug. 2012. Twitter Developers Blog. 18 Sep. 2012 ‹›.

Stone, Biz. “Introducing the Twitter API.” Twitter Blog 20 Sep. 2006. 18 Sep. 2012 ‹›.

Twitter. “Developer Rules of the Road.” Twitter Developers Website 17 May 2012. 18 Sep. 2012 ‹›.

Twitter. “Frequently Asked Questions.” 18 Sep. 2012 ‹›.

Van Dijck, José. “Tracing Twitter: The Rise of a Microblogging Platform.” International Journal of Media and Cultural Politics 7.3 (2011): 333-48.

Walker Rettberg, Jill. Blogging. Cambridge: Polity, 2008.

Wood, Megan M., and Linda Baughman. “Glee Fandom and Twitter: Something New, or More of the Same Old Thing?” Communication Studies 63.3 (2012): 328-44.