Big Data and Social Science

Data science methods and tools for research and practice.

Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter and Julia Lane

Preface to the 2nd edition

The class on which this book is based was created in response to a very real challenge: how to introduce new ideas and methodologies about economic and social measurement into a workplace focused on producing high-quality statistics. Since the first edition of this book came out we have been fortunate to train over 450 participants in the Applied Data Analytics classes, resulting in increased data analytics capacity, both in terms of human and technical resources. What we learned in delivering these classes greatly influenced the 2nd edition. We also added an entire new chapter on Bias and Fairness in Machine Learning, and re-organized the book chapters somewhat.

As with any book, there are many people to be thanked. The Coleridge Initiative team at New York University, the University of Maryland and the University of Chicago were critical in shaping the format and structure - we are particularly grateful to Clayton Hunter, Jody Derezinski Williams, Graham Henke, Jonathan Morgan, Drew Gordon, Avishek Kumar, Brian Kim, Christoph Kern, and all the book chapter authors for their contributions to the second edition.

We also thank the critical reviewers solicited from CRC Press and everyone from whom we got revision suggestions online, in particular Stas Kolenikov, who carefully examined the first edition and suggested updates. We owe a great debt of gratitude to the project editor, Vaishali Singh, and the publisher, Rob Calver, for their hard work and dedication.


Open Access


Research Article

Enhancing big data in the social sciences with crowdsourcing: Data augmentation practices, techniques, and opportunities

Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing – original draft, Writing – review & editing


Affiliation Virginia Polytechnic Institute and State University, Blacksburg, Virginia, United States of America


Affiliation The Pennsylvania State University, State College, Pennsylvania, United States of America

Affiliation University of California, Los Angeles, California, United States of America

  • Nathaniel D. Porter, 
  • Ashton M. Verdery, 
  • S. Michael Gaddis


  • Published: June 10, 2020


Proponents of big data claim it will fuel a social research revolution, but skeptics challenge its reliability and decontextualization. The largest subset of big data is not designed for social research. Data augmentation–systematic assessment of measurement against known quantities and expansion of extant data with new information–is an important tool to maximize such data's validity and research value. Trained research assistants and specialized algorithms are common approaches to augmentation but may not scale to big data or appease skeptics. We consider a third alternative: data augmentation with online crowdsourcing. Three empirical cases illustrate strengths and limitations of crowdsourcing, using Amazon Mechanical Turk to verify automated coding, link online databases, and gather data on online resources. Using these, we develop best practice guidelines and a reporting template to enhance reproducibility. Carefully designed, correctly applied, and rigorously documented crowdsourcing helps address concerns about big data's usefulness for social research.

Citation: Porter ND, Verdery AM, Gaddis SM (2020) Enhancing big data in the social sciences with crowdsourcing: Data augmentation practices, techniques, and opportunities. PLoS ONE 15(6): e0233154.

Editor: Rashid Mehmood, King Abdulaziz University, SAUDI ARABIA

Received: July 10, 2019; Accepted: April 29, 2020; Published: June 10, 2020

Copyright: © 2020 Porter et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Data and code are available at a public repository: .

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.


Big data and computational approaches present a potential paradigm shift in the social sciences, particularly since they allow for measuring human behaviors that cannot be observed with survey research [ 1 , 2 , 3 ]. In fact, the transformative potential of big data for the social sciences has been compared to how “the invention of the telescope revolutionized the study of the heavens” [ 4 ]. However, some areas of social science have been slow to embrace big data. For instance, Lazer and Radford [ 5 ] note that only 15 of 422 articles (3.6%) published in the top journals in sociology between 2012 and 2016 contained analyses of big data. One reason why is “the need for advanced technical training to collect, store, manipulate, analyze, and validate massive quantities of semistructured data,” [ 6 ] training that remains nascent in many fields. But there are deeper, more fundamental constraints on the acceptance of big data among social scientists.

In this article, we make three points. First, we situate social science skepticism about big data in longstanding disciplinary concerns about validity and value. Though big data reveal many previously unseen elements of social life, they are often not created for research purposes, meaning that social researchers must assess whether measures derived from big data reflect their intended purpose (validity) and devise ways to incorporate big data into research questions of interest to social scientists (value). Second, we argue that the very features that make big data appealing as a novel source of information for social research–its size, granularity, and diversity–limit the application of traditional social science approaches to adding validity and value to orthodox sources of data, approaches which do not easily scale to the needs of big data in many research projects. Third, we consider a potential path forward: the use of online crowdsourcing techniques that blend traditional approaches to adding validity and value to social research and can be implemented at the scale necessary for use with big data.

Crowdsourcing is not the best solution to every data augmentation problem. Legal restrictions (such as the General Data Protection Regulation and Health Insurance Portability & Accountability Act) preclude certain crowdsourcing applications; such rules are complex and rapidly changing and outside the scope of this discussion. Both the treatment of workers and the content of the data itself may also raise ethical issues. While our discussion highlights certain ethical issues, as well as practical judgments of when crowdsourcing is likely to be useful, it is ultimately the responsibility of investigators to identify and address ethical concerns.

Our argument: A roadmap

Despite its promise, big data’s perceived limitations cast uncertainty on its applicability in the social sciences. Many scientists have rapidly embraced big data because of the unprecedented information it makes available. Typical taxonomic efforts from computer scientists and others to delineate big data from traditional forms of data focus on these novel characteristics in what is called the “three Vs” framework [ 7 , 8 ]: volume (or amount of data), velocity (or speed of data release), and variety (or data on rarely recorded activities). Volume, velocity, and variety are what make big data compelling and useful in a diverse array of fields.

All scientists are concerned with two other Vs: validity (or alternatively, veracity) and value [ 7 , 9 ], but social scientists have been especially skeptical about the presence of these Vs in the context of big data. For social measurement, the presence of these additional Vs, which indicate authenticity or truth (validity) and what we can do with and learn from the data (value), is often difficult to assess and infrequently discussed in academic big data research [ 8 , 9 ]. A search for “big data” in topics and titles indexed in Web of Science (2004–2019) reveals that most of this research has come from areas with a foundational interest in the mechanics of data itself: Computer Science (61.4%), Engineering (37.7%), and Mathematics (13.5%). Social scientists, however, may be less interested in the data itself than in the insight it may offer on the social processes that produced it or result from it, and that task requires accessible means to assess its validity and enhance its value.

Characteristic of social science skepticism around big data are concerns that “the reliability, statistical validity and generalizability of new forms of data are not well understood. This means that the validity of research based on such data may be open to question” [ 10 ]. The type of big data we focus on does not come from a heavily theorized and well-planned scientific research project–they “are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis”–which, at a minimum, creates discomfort among social scientists [ 5 , 11 ]. Instead, it is a byproduct of other activity, which “has led some scholars to ask whether [big] data can provide anything beyond crude description” [ 12 ]. Without additional contextual information to help “tame” it, the concern is such data will remain too “wild” for answering valuable questions of interest in the academic social sciences.

Without clear approaches to quantify and increase the validity and value of big data, we believe social science skepticism will remain high. Researchers need to be convinced of the validity and value of big data, without adding substantially to its use cost, all of which we suggest can be accomplished through data augmentation. We define data augmentation as the process of (a) systematic assessment of measurement against known quantities or (b) expansion of existing data by adding new information. Data augmentation is a standard technique throughout the social sciences that can assume a manual or automated approach. Traditionally, these tasks are accomplished using trained research assistants (manual) or specialized algorithms (automated) to detect erroneously coded or poorly measured data (validity) or append existing data sources with new material (value). An example of a big data project that researchers manually augmented to increase validity is a study of posts made by high-schoolers on Twitter that mention bullying. The authors used two human coders to classify whether each post that mentioned bullying (or bullied, or bully, etc.) was a report of adolescent bullying or whether it represented some other use of the search terms [ 13 ]. Another example that used manual augmentation to add value is a recent study where researchers employed graduate students to code whether thousands of Tweets by U.S. Senators contained partisan messages [ 14 ]. On the automated side, Yin et al. [ 15 ] demonstrate how algorithmic approaches can increase validity. They propose a general means of separating human and automated (bot) accounts with high accuracy on Twitter based on a Bayesian detection model, which has the benefit of removing non-human actors from analyses. An example of automated data augmentation used to increase value is a well-known experiment on Facebook [ 16 ]. 
In this experiment, the authors examined how respondents’ purported emotions changed after being shown more purportedly positive or negative posts from friends. Emotions and their associated positivity or negativity were assessed by applying a sentiment analysis method to the words used in posts. Sentiment analysis, in this case, serves as an automated way to gain additional information about big data (the posts), augmenting its value for research purposes. Of course, there are many more examples of both manual and automated approaches to data augmentation to add either validity or value or both [ 17 , 18 ].

Unfortunately, data augmentation can be challenging to implement at the scale required for big data projects while addressing social science skepticism around issues of validity and value. The manual data augmentation in the aforementioned study of bullying, for instance, was only feasible because researchers examined a manageable number of posts (N = 7,321).

Automated augmentation approaches, such as adding value through sentiment analysis, are also difficult to implement without advanced training and may themselves be of questionable validity. The Facebook experiment discussed above has been criticized by social scientists for the augmentation being of unknown and potentially low validity [ 19 ]. Of course, the validity of automated data augmentation approaches can be assessed and potentially improved through manual data augmentation, as is becoming more commonplace in big data projects through procedures such as supervised machine learning [ 20 ], but the size and complexity of most big data would require substantial time and expense for knowledgeable trained coders such as graduate assistants to check.

In this paper, we argue that online crowdsourcing platforms can complement both manual and automated approaches to data augmentation, increasing the validity and value of big data in the social sciences at a low cost to researchers. We show that such tools are underused for non-experimental designs in the social sciences and that workers on these platforms can efficiently and effectively perform many data augmentation tasks including verifying automated coding, finding errors in embedded metadata, and resolving missing data. In other words, we argue that online crowdsourcing applications offer a scalable blend of manual and automated approaches to data augmentation that can easily be harnessed to increase validity and value for big data applications to social science research questions. We build this case in five steps: (1) review the use and perceived limitations of big data in the social sciences, (2) describe the online crowdsourcing process and its documented strengths and limitations as a platform for academic research, (3) investigate current practices in academic use of the largest online crowdsourcing platform, (4) conduct three case studies implementing online crowdsourcing to enhance ongoing sociological research and test the utility of crowdsourcing across different circumstances, and (5) draw on the above, as well as experiments embedded within the case studies, to produce evidence-based recommendations on when and how to implement online crowdsourcing to augment big data for best results. Finally, in light of the inconsistent and frequently incomplete reporting of online crowdsourcing procedures, we provide a recommended reporting template for online crowdsourcing as an academic data augmentation platform. We believe that this paper offers a clear roadmap for social scientists to begin incorporating more big data into their research designs in ways that directly address issues of validity and value. 
We conclude by reflecting on the strengths and limits of online crowdsourcing approaches to data augmentation for these purposes.

Big data skepticism in the social sciences

Myriad actors such as corporations, governments, scientists, and even sports teams have embraced big data [ 21 , 22 , 23 ] but adoption has been slow thus far in many social sciences [ 5 ]. The literature indicates that the primary reason social scientists are making relatively rare contributions to big data research is that these fields hold deep skepticism about data that is not designed for academic research [ 5 ]. Even those optimistic about the promise of big data critique its validity and value, including its lack of standardized reporting [ 24 ], poor measurement [ 25 ], decontextualization [ 20 ], and tendency toward “big data hubris” [ 11 ] that ignores threats to validity [ 26 , 27 ]. Generalizability is another concern; most big data studies do not proceed with a clearly conceptualized population to which inference can be made [ 5 , 28 , 29 , 30 ]. Disciplinary divisions in computational skills [ 31 , 32 , 33 ] and epistemology pose additional challenges [ 34 ], as do divides between industry and academic research [ 28 , 35 ]. However, federal funders and several universities have funded a wide range of new training programs and other undertakings at the nexus of big data and the social sciences that may, over time, alleviate these pressures.

The broad range of concerns about big data from social scientists has led to a number of reflections on what steps can be taken to address this skepticism. However, our reading of the literature indicates that these reflections have focused more on the issues of generalizability than other, equally important concerns. For instance, in their review article, Lazer and Radford [ 5 ] list the vulnerabilities of big data research in sociology. The primary listing–indeed the “core issue”–is generalizability, “… who and what get represented” [ 5 ]. While these authors do acknowledge validity and value concerns, they are given only marginal discussion. The smaller number of studies that do pay careful attention to validity and value recognize that they are in the minority. Tufekci [ 36 ] details specific concerns about the validity and value of many social media analyses, also broadly true of other big data applications. These include platform bias, selection on the dependent variable, "algorithmic invisibility" (511), and intangible "field effects" (505). We argue that the oversight Tufekci observes is symptomatic of a fundamental gap between what researchers worry about with big data and what is being done to address those worries.

In general, the primary means of assessing and increasing the validity and value of data in the social sciences is data augmentation. Examples of past big data augmentation include converting less structured data to more analytically tractable forms, linking multiple existing data sources [ 37 ], or collecting additional variables to check for spurious relationships or causal mechanisms [ 12 , 38 ]. As reviewed above, there are both manual and automated approaches to data augmentation, but neither is likely to be sufficient to both scale to the problems posed by big data and address social science skepticism about it. Instead, we focus on a third option that can enhance both automated and manual approaches to data augmentation: using online crowdsourcing marketplaces such as Amazon Mechanical Turk (MTurk). Our work thus seeks to popularize and formalize a new tool within the nascent set of methodologies designed to increase the value and validity of online data collection efforts [ 12 , 39 ].

Online crowdsourcing is less technically demanding than automated approaches and can provide supplemental evidence of accuracy based on user judgment or augmented comparison with outside sources or both. Compared to common manual approaches, MTurk is nimbler and less costly, allowing increased scale of augmented analysis. Compared to purely automated approaches or even blended approaches like supervised machine learning, online crowdsourcing through MTurk has the ability to produce well-understood measures of validity like inter-rater reliability or to merge data with sources that are not amenable to automated discovery, as well as retaining the reassuring feature that actual human beings have examined the coding. While some social scientists are using MTurk for research [ 40 , 41 , 42 ], we argue that formalizing this approach to data augmentation will expedite the widespread acceptance of big data in the social sciences and overcome barriers to its application. In the next section, we review MTurk as a promising research platform that we argue allows researchers to undertake big data augmentation at scale more simply, quickly, and cheaply than data augmentation through traditional automated or manual approaches.
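To make the inter-rater reliability measure mentioned above concrete, here is a minimal sketch of Cohen's kappa for two workers who labeled the same items. The function name, the label values, and the six example judgments are hypothetical and chosen only for illustration; the article does not say which reliability statistic was computed.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty list of items")
    n = len(labels_a)
    # Observed agreement: share of items on which the two raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of each rater's label share.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments from two workers on six items (not data from the study).
worker_1 = ["match", "match", "no_match", "match", "no_match", "match"]
worker_2 = ["match", "no_match", "no_match", "match", "no_match", "match"]
kappa = cohens_kappa(worker_1, worker_2)
```

With more than two workers per item, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual substitute.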

MTurk as a research platform

The name “Mechanical Turk” is derived from the 18th century chess-playing “machine” commonly known simply as “the Turk”. The Turk consisted of a complex cabinet of gears with a magnetic chessboard on top and a model of a human similar to a mannequin dressed in Turkish robes with a turban. Human chess players could play against the “machine” and would often lose. The Turk toured Europe and the United States throughout the late 18th and early 19th centuries. However, the Turk was a hoax as it was not an automated machine but rather an elaborate fake with a man inside playing the actual chess game [ 43 , 44 ]. Amazon named their own version after the original Mechanical Turk to indicate that humans can still do things that computers cannot. Amazon’s MTurk is an online crowdsourcing marketplace that brokers what MTurk parlance refers to as Human Intelligence Tasks (HITs) between requesters and workers. The idea of a HIT is described succinctly by Amazon:

Amazon Mechanical Turk is based on the idea that there are still many things that human beings can do much more effectively than computers, such as identifying objects in a photo or video, performing data de-duplication, transcribing audio recordings, or researching data details. Traditionally, tasks like this have been accomplished by hiring a large temporary workforce (which is time consuming, expensive, and difficult to scale) or have gone undone.

Anyone eligible for employment in the U.S. or India can work on MTurk, although task completion requires reliable internet access. U.S.-based MTurk workers are on average younger, more educated, wealthier, more technologically savvy, and less racially diverse than average Americans [ 45 , 46 , 47 ]. As such, many worry that samples drawn from MTurk are less representative than population-based surveys [ 45 ], though not as fraught as convenience samples [ 48 ].

However, when considering MTurk as a big data augmentation platform, rather than a population to sample and survey, we argue that work quality matters more than worker representativeness. MTurk workers tend to pass screening tests at high rates [ 45 ] with high reliability between [ 49 ] and within workers [ 50 ]. At the same time, recruiting workers for data augmentation tasks through MTurk has three major limitations. First, workers lack specialized area knowledge; second, they cannot access restricted information (e.g. workers cannot download most academic journal articles); and third, MTurk compensation is based on task completion, not time, which presents challenges for fielding complex, judgment-based tasks [ 46 , 51 ]. We return to these ideas below. For now, it is worth noting that these limitations mean that crowdsourced tasks are most appropriate for data augmentation when they can be broken into concise and unambiguous chunks using open-access information.

MTurk in the academy: A content analysis

MTurk is popular with academic researchers; a recent report found that academics posted the plurality (36%) of all HIT groups during the study period [ 52 ]. Academics have hailed MTurk’s low costs and rapid results, and even expressed cautious optimism about it as a survey platform [ 30 , 53 ]. Its feasibility and reliability for data augmentation, however, remain unexplored.

To better understand how academics use MTurk, especially for data augmentation, as well as how they report on such use, we conducted a content analysis of a random with-replacement sample of 150 articles from Web of Science matching the topic search “mechanical turk” and published between 2011 and 2018. The search returned 1,684 total records that we then sampled. We removed 19 matches, 11 that did not use MTurk and 8 where full-text access was not available, yielding a final sample size of 129 articles (124 unique; statistics below are weighted for replacement sampling). In the online supplement, we provide metadata about these articles. We address three questions in this content analysis: a) who uses MTurk for academic purposes, b) what is it used for, and c) what details are reported about the use of the platform.
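The with-replacement sampling described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the record IDs are stand-ins, the seed is arbitrary, and weighting each unique article by the number of times it was drawn is one plausible reading of "weighted for replacement sampling".

```python
import random
from collections import Counter


def sample_with_replacement(record_ids, k, seed=0):
    """Draw k records with replacement and count how often each one was drawn."""
    rng = random.Random(seed)
    return Counter(rng.choices(record_ids, k=k))


def weighted_share(draw_counts, has_trait):
    """Share of draws whose article has a trait, weighting articles by draw count."""
    total = sum(draw_counts.values())
    hits = sum(count for rid, count in draw_counts.items() if has_trait(rid))
    return hits / total


# Stand-in IDs for the 1,684 search results reported in the text.
records = list(range(1, 1685))
draw_counts = sample_with_replacement(records, k=150)
n_draws = sum(draw_counts.values())   # number of draws, duplicates included
n_unique = len(draw_counts)           # number of distinct articles drawn

# Hypothetical trait for illustration: articles with an even stand-in ID.
share_even = weighted_share(draw_counts, lambda rid: rid % 2 == 0)
```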

Table 1 reports fields where articles in our sample were published. A plurality (41%) of the papers we examined were in psychology and related fields (psychiatry and social psychology), followed by allied health sciences (16%), with five other fields comprising at least 5% of the sample. Table 1 shows proportions and counts for all fields.



Article counts grew steadily from 2011, the first year in our sample, through 2016 and have remained high; 70% were published between 2016 and 2018. In general, these articles are cited frequently, with Web of Science’s citation counts indicating a mean of 19 citations (11 after removing one article with over 500 citations) for articles at least two years post-publication. These levels compare favorably to general article citation counts across many fields, where citation counts often average one per year or less.

We are also interested in what researchers use MTurk for, specifically how often it is used for data augmentation. Table 2 reports on the types of tasks academic researchers assign to MTurk workers. Because of psychology’s disproportionate use of MTurk, we disaggregate results by whether the article was in a psychological field. Most papers used MTurk to conduct surveys (66%), frequently with an embedded experiment (43%), although non-experimental surveys were more common in psychology (40%) than other disciplines (12%). In our sample, data augmentation was much rarer for both psychological (23%) and non-psychological (16%) studies. Of the studies involving data augmentation, workers are most often asked to perform tasks replicating other data, such as lab experiments (16%). Less frequently, they are asked to code data provided by the investigator (6%) or elaborate it with additional information (9%). In none of the studies were workers asked to collect publicly available data from the web.


Another question of interest is how academic researchers report on their use of MTurk to improve transparency and replicability, ensure quality of data augmentation or other tasks, and verify that workers are treated ethically. We found gaps in reporting standards that may impair the validity, value and replicability of MTurk as a data augmentation tool. Nearly every article we examined (98%) described data collection procedures like HIT content in detail, and most (86%) included at least basic summaries of worker demographics. However, few articles we examined reported required worker qualifications, criteria for work rejection, or validation criteria. Only 8% met what we define as minimal reporting standards across all three key areas for peer evaluation and replicability: a) a detailed description of the HITs and process (36%), b) information on worker qualifications, acceptance criteria and pay (25%), and c) descriptive statistics or multivariate analysis to evaluate sample characteristics (64%). We include more details on these standards below in the best practices section, our suggested reporting template, and in the online supplement.

The results of our content analysis highlight that academic use of MTurk is largely limited to experimental studies and surveys. In contrast to this typical use, we advocate that researchers expand their use of MTurk for data augmentation, which will have particular benefits for social science applications of big data that wish to address concerns about validity and value. We found that researchers are beginning to do this, but they do not offer enough detail on the process for formal evaluation or replication. In light of these opportunities and challenges, the remainder of this article examines three case studies and focuses on developing clear, evidence-based best practice guidelines on when and how researchers can successfully augment data with MTurk and report on doing so.

Case studies

We now present three case studies that apply MTurk to diverse sociological subfields to augment big data (cases 1 and 2) or test MTurk’s data augmentation capacities against known benchmarks from ongoing sociological data collection (case 3). These cases allow us to compare MTurk to other data augmentation approaches, both automated and manual. For cases 1 and 3, we collected analogous data automatically and manually, enabling validity comparisons. We also embedded design experiments in cases 2 and 3 to test how HIT design and implementation can affect cost, quality, and worker experience. Our goal is to develop insight into the benefits of big data augmentation through online crowdsourcing and how researchers can best move forward with such projects. The data from these case studies are not optimized for external validity in the sense of reuse in other contexts, but rather are selected as real-world applications of data augmentation in MTurk to avoid the need for expert coding of large samples. Our goal here is to demonstrate crowdsourcing as a tool for rapid use-specific data augmentation.

We designed all HITs based on past recommendations [ 45 , 47 , 48 ] and revised according to common worker concerns voiced in the popular MTurk forum turkernation and our own pilot studies. We collected all data between October 2015 and July 2016. The online supplement provides full versions of instruments and de-identified results.

Study 1: Academic affiliation—Overview and methods

Our first case shows how MTurk can enhance the validity of big data. It is part of a larger project on the role of interdisciplinary dissertation committees in knowledge production [ 54 ]. The original project used an algorithm to code the academic field of faculty based on their roles in doctoral committees. For instance, if a faculty member chaired committees in one field and was a member of committees in another, the algorithm assigned them to the field in which they chaired. Most cases were less clear cut, however, and required more complex assignment rules reviewed in greater depth in the original paper. Such algorithmic assignment indicated a surprising proportion (56%) of interdisciplinary dissertation committees. The credence given to these prevalence statistics, however, hinges on the accuracy of the automated coding. This represents a classic concern voiced by social science skeptics about automated augmentation of big data. For instance, compare the critique of sentiment analysis in the aforementioned Facebook experiment [ 16 , 19 ] or concerns about search term inclusion in Google Flu [ 11 , 55 ]. Manually verifying a sample–manual data augmentation–represents one way to check result validity. However, our tests indicated that finding and hand coding the fields of a sample of 2,000 of the 66,901 faculty (3%) would have demanded over 230 hours of trained coder work. This time commitment translates to more than three quarters of a semester of typical graduate research assistant support, assuming a 15-week semester at 20 hours a week.

Rather than training graduate student or other internal coders to verify these results, we tested the data augmentation capabilities of MTurk. We did so by creating three sequential tasks that split the process of validating the algorithmic coding of faculty members’ fields into discrete steps. First, we asked workers to find the departmental webpages of a random sample of faculty members using a search link that limited results to the official website of their academic institution (see Discussion and S1 Appendix for details). This step provided a sample of faculty whose academic field could be externally validated. Second, we asked workers to verify links obtained in task 1 and indicate whether each faculty member was listed in any of the 10 most common department names in the algorithmically coded field. This step helped to ensure that the links for specific faculty were correct. Finally, in the third task, we asked workers to evaluate whether any field on the faculty member’s page is associated with the field that was algorithmically assigned. For instance, if a faculty member listed “speech pathology” as their field and the assigned field is “speech and hearing sciences,” we want workers to indicate that these fields are associated. This step constituted our primary interest, quantifying the validity of the algorithmic coding. We adapted all tasks from MTurk templates using the HTML and JavaScript programming languages, and collected them from separate but potentially overlapping pools of workers within the MTurk interface. A graduate research assistant invested approximately 40 hours in learning and managing this MTurk data collection. In all, we used MTurk data augmentation to check 2,043 automated classifications of faculty member fields, at a total cost of $590 including fees and pilot costs. Attaining comparable labor costs with a single graduate coder and no external validation would require paying less than $3.11 per hour, including any benefits.
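The cost comparison above can be reproduced with a short calculation. This is one reading of the arithmetic, assuming the 40 hours of research assistant time spent managing MTurk are netted against the 230 hours of manual coding before dividing the $590 total; the variable names are illustrative.

```python
# Figures taken from the text.
mturk_total_cost_usd = 590.0   # includes fees and pilot costs
manual_coding_hours = 230.0    # estimated trained-coder time for about 2,000 faculty
ra_management_hours = 40.0     # assistant time spent learning and managing MTurk
semester_hours = 15 * 20       # 15-week semester at 20 hours a week

# Assumption: the management hours are subtracted from the manual-coding estimate,
# so the comparison covers only the coder time that MTurk actually replaced.
net_hours_replaced = manual_coding_hours - ra_management_hours
break_even_hourly_wage = mturk_total_cost_usd / net_hours_replaced

# Share of a semester of assistant support that manual coding would have consumed.
semester_share = manual_coding_hours / semester_hours
```

Under this assumption the break-even wage rounds to $3.11 per hour, matching the figure in the text, and manual coding would take roughly 77% of a semester of assistant support.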

Study 1: Academic affiliation—Results and discussion

Were MTurk workers, operating without substantial oversight or prior training, able to validate the results assigned by algorithm? This case speaks to MTurk’s ability to add validity to big data, used here to confirm the automated coding of a large data set and bound rates of coding error. Table 3 summarizes the combined results for Case 1. Workers in the initial HIT successfully located 85% of faculty, mostly on preferred page types (faculty homepage, administrative list, or curriculum vitae). Subsequent workers flagged only 3% of URLs that prior workers submitted as referring to the incorrect person or institution. Of cases with unflagged URLs, workers identified 94% of faculty members as matching either the field or department we provided, which suggests that the original automated coding of these big data succeeded at a high rate, even allowing for the possibility of substantial worker error. Mean hourly worker pay in this case ranged from $7 to $16 and was higher for workers completing multiple HITs.


This case revealed some important lessons. Early pilots combined all stages (page location, department classification, and field classification) into a single HIT, but we found that workers took longer and gave flagged results more often in such conditions. With later pilots, we found that dividing tasks into the three steps outlined above minimized worker time and let us build in cross-verification tests where subsequent workers verified both the faculty web pages and affiliations provided by earlier workers. The conclusions of the original study hinged on the accuracy of machine-coded disciplines and fields. Using MTurk, we were able to empirically evaluate that accuracy with speed and cost-efficiency that could not be replicated with trained coders.

Study 2: Linking to OpenLibrary—Overview and methods

Our second case highlights how data augmentation with MTurk can enhance the value of big data. Here, we asked workers to link related data sources, and we experimentally tested how HIT design may affect work quality. This case builds on a project investigating book co-purchasing patterns connecting cultural groups, operationalized with retailer metadata scraped from the web. Unfortunately, necessary metadata were often incomplete, missing, or of questionable quality. For example, a book written by the founder of one Protestant denomination (Martin Luther) was listed as the top-selling item associated with a completely different denomination. To supplement missing information, we matched 1,055 (58%) books to additional metadata using international standard book numbers (ISBNs), a unique code identifying books. For 765 remaining unmatched books, we tested MTurk’s data augmentation capacities by asking workers to search for the books on OpenLibrary. As an experiment to determine means of improving HIT design, we randomly assigned each worker into one of three task variants. The first variant included full instructions with design features to enhance clarity (e.g. highlighting key text); the second used brief instructions but retained design features; while the third included full instructions with minimal formatting. Figs 1 – 3 provide screen shots of each condition; note that Amazon uses the ${variable name} notation as code to substitute values from input data provided by the requester (code available in supplemental files).
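
To make the placeholder notation and the random assignment concrete, the following minimal sketch imitates both steps in Python. It is not the code used in the study: MTurk performs the substitution itself, and Python's string.Template is used here only because it happens to share the ${name} syntax. The HIT wording, column names, worker IDs, and variant labels are all hypothetical.

```python
import random
from string import Template

# Hypothetical HIT text; ${title} and ${author} stand in for input columns.
hit_template = Template("Search OpenLibrary for the book '${title}' by ${author}.")

# Hypothetical input rows, one per unmatched book.
books = [
    {"title": "Example Book A", "author": "Author One"},
    {"title": "Example Book B", "author": "Author Two"},
]

# Substitute each row's values into the template, producing one HIT per book.
hits = [hit_template.substitute(row) for row in books]

# The three instruction variants described above (labels are illustrative).
VARIANTS = ("full_formatted", "brief_formatted", "full_minimal")

# Randomly assign each worker to one variant; the fixed seed only makes
# this illustration reproducible.
rng = random.Random(42)
workers = ["W1", "W2", "W3", "W4", "W5", "W6"]
assignments = {worker: rng.choice(VARIANTS) for worker in workers}

print(hits[0])
print(assignments)
```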




Study 2: Linking to OpenLibrary—Results and discussion

Case 2 workers successfully found 283 potential matches (37%) for missing books in the original data. We followed up on HITs with comments and rejected submitted URLs outside the specified page types. A researcher checked every 20th HIT returned for accuracy during data collection and found very low rates of false matches (<1%) and false negatives (5%-10%). Checking during data collection (rather than using a simple random sample of all returned HITs) provides an opportunity to save money by cancelling remaining unclaimed HITs if design flaws are discovered. Consistent with case 1, the 33 workers who completed only one task in this case averaged 298 seconds, but the 50 workers who completed multiple tasks averaged only 126 seconds per task. Total cost for this case including fees was $235.
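
The rolling check described above amounts to systematic sampling of returned work. A minimal sketch follows, assuming HITs are held in a list in the order they were returned; the function name and HIT identifiers are hypothetical.

```python
def every_kth(items, k=20):
    """Return the k-th, 2k-th, 3k-th, ... items in their arrival order."""
    return items[k - 1::k]

# Hypothetical identifiers for 100 returned HITs, in order of return.
returned_hits = [f"HIT-{i}" for i in range(1, 101)]

# Items a researcher would review by hand while collection is still running.
to_check = every_kth(returned_hits, k=20)
print(to_check)
```

Because the sample is drawn as work arrives, a requester can review the first few checks and cancel unclaimed HITs before the full batch is paid for.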

The experiment embedded in this case illuminates how HIT design affects cost and quality. Workers presented with detailed instructions and design features spent less time per completed HIT (mean 171 seconds, S.D. 145) than those provided concise (230, S.D. 317) or minimally formatted (245, S.D. 233) instructions. Because of the small cell sizes in this task, such differences are not significant with two-tailed t-tests; nonetheless, we take the magnitude of the differences to indicate that better instructions are likely to yield better results. Though there is a general concern that paying workers per task may lead them to rush and skim longer instructions, yielding lower quality work, we did not find that this approach compromised accuracy in our testing. Instead, work accuracy in all three groups was high and statistically indistinguishable. We speculate that fuller instructions may reduce cognitive demands on workers and thus lead to lower completion times with comparable accuracy. By connecting retailer data to third-party data on the books in the study, MTurk provided means not only of verifying the validity of retailer topic coding, but also augmenting analysis with topic-modeling and additional layers of networked relationships between books using OpenLibrary data.

Study 3: Mental health websites—Overview and methods

Our third case study does not focus on a big data project directly. Instead, it tests the possible extent of MTurk’s data augmentation capacities and directly evaluates MTurk data augmentation against a “gold standard” benchmark from a set of trained coders in an existing sociological data set. This case reveals how task complexity affects MTurk results and it provides alternate methods of assessing the quality of MTurk data augmentation. In this case, we compare the performance of trained coders against MTurk workers in a study of college student mental health. The Healthy Minds Study Institutional Website Supplement (HMS-IWS) collects data on 74 topics across 8 areas related to resources, information, and the presentation of information on mental health services from college and university websites. It is, itself, adding value to a standard survey (the Healthy Minds Study) [ 56 , 57 , 58 ] through manual data augmentation.

For three years, the HMS-IWS team, including a Ph.D. researcher and two trained graduate research assistants, each coded relevant items from institutional websites. There is high inter-rater reliability in this manual data augmentation approach but also extensive costs and time. In this case study, we asked 40 MTurk workers to record information from one of three college or university websites. We provided workers with a brief explanation for each task (see S1 Appendix ) as well as the website link. We varied HIT construction across four categories to test how HIT organization and design affects work quality and cost. In HITs 1A and 1B, we gave workers a set of 21 items (18 yes/no and 3 open-ended) spanning four broad categories (general information, campus-specific information, information for individuals other than students, and diagnosis) and paid $1.50 for the task. In HITs 2A and 2B, we gave workers a set of 33 items that fit under a single category (services and treatment), including 30 yes/no and three open-ended questions, and paid $1.75 for the task. Finally, we varied the HITs between versions A and B, with the sole difference between versions being the addition of a paragraph in the B variants that told workers we would check accuracy and that users with too many inaccurate answers would not receive payment.

Study 3: Mental health websites—Results and discussion

To evaluate worker accuracy, we compare results from MTurk workers to results from the trained coders, which we take as a gold standard benchmark for accuracy. Three trained researchers first coded each of the 48 binary items for each of the three websites. The researchers initially agreed on 131 of the 144 total items (90%) across the three websites, and the remaining 13 items were rechecked until consensus was reached. In contrast, MTurk workers correctly answered binary items at a rate of 63% for HIT 1A, 70% for HIT 1B, 78% for HIT 2A, and 82% for HIT 2B. Given the binary response choices, these rates are generally low. Consistent with longstanding findings in statistics [ 59 , 60 ], using a majority vote decision rule to aggregate MTurk responses to the same question would have resulted in errors for 31% of items. The accuracy difference between HIT 1A and HIT 1B is significant using an unpaired t-test (p<0.05), while the difference between HIT 2A and HIT 2B is not significant under the same test. The pooled difference between HITs 1 and HITs 2 is also statistically significant (p<0.001). Moreover, the pooled results show that individuals given the A variants were more likely to have a low accuracy rate than those seeing the B variants at a rate of 22% to 8%, respectively (p<0.05).
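
A majority vote decision rule of the kind mentioned above can be sketched as follows. The item names, worker answers, and gold standard values are invented for illustration and are not data from the study; treating ties as unresolved is our assumption about how such a rule might handle them.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer, or None when the top two are tied."""
    counts = Counter(answers).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

# Hypothetical yes/no answers from five workers for each binary item.
responses = {
    "lists_counseling_hours": ["yes", "yes", "no", "yes", "no"],
    "mentions_crisis_line": ["no", "yes", "no", "no", "yes"],
    "describes_group_therapy": ["yes", "no", "no", "yes", "no"],
}

# Hypothetical gold standard codes from trained coders.
gold = {
    "lists_counseling_hours": "yes",
    "mentions_crisis_line": "yes",
    "describes_group_therapy": "no",
}

# Aggregate worker answers, then compare the consensus to the benchmark.
consensus = {item: majority_vote(answers) for item, answers in responses.items()}
errors = [item for item in gold if consensus[item] != gold[item]]
error_rate = len(errors) / len(gold)

print(consensus)
print(f"Items where the majority is wrong: {errors} ({error_rate:.0%})")
```

The second item shows why consensus does not guarantee accuracy: when most workers make the same mistake, aggregation reproduces it.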

In evaluating this case, we discovered an additional finding that pertains to best practices for MTurk data augmentation. Researchers might be tempted to proxy data quality with task completion time, discarding work completed in the shortest or longest amount of time, or both. However, we found little benefit from doing so. The correlation between accuracy and completion time is 0.34, and falls slightly (to 0.29) if we remove work completed in the bottom decile of completion times. If we remove work completed in the top decile, it increases (to 0.48). Removing both changes the correlation only marginally (to 0.44). On this basis, we conclude that completion time is a weak indicator of work quality. Some who complete the task quickly may simply be good at it, while some taking the longest amounts of time may have stepped away from the computer or worked on multiple tasks at once without sacrificing work quality. Recall that MTurk workers are paid by the task, not by completion time.
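
The trimming exercise described above can be sketched as follows: compute the correlation between completion time and accuracy, then recompute it after dropping the fastest or slowest decile. The records are invented, so the resulting coefficients will not match those reported in the text, and the helper names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def trim_by_time(records, drop_bottom=0.0, drop_top=0.0):
    """Drop the given fractions of the fastest and slowest records."""
    ordered = sorted(records, key=lambda record: record[0])
    n = len(ordered)
    start = int(n * drop_bottom)
    stop = n - int(n * drop_top)
    return ordered[start:stop]

def time_accuracy_correlation(records):
    times, accuracies = zip(*records)
    return pearson(times, accuracies)

# Hypothetical (completion seconds, share of items correct) per worker.
records = [
    (60, 0.55), (90, 0.60), (120, 0.70), (150, 0.65), (180, 0.80),
    (210, 0.75), (240, 0.85), (300, 0.70), (420, 0.90), (900, 0.60),
]

full = time_accuracy_correlation(records)
without_fastest = time_accuracy_correlation(trim_by_time(records, drop_bottom=0.1))
without_slowest = time_accuracy_correlation(trim_by_time(records, drop_top=0.1))
without_both = time_accuracy_correlation(trim_by_time(records, 0.1, 0.1))

print(f"all: {full:.2f}, no fastest: {without_fastest:.2f}, "
      f"no slowest: {without_slowest:.2f}, neither: {without_both:.2f}")
```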

Overall, results from this case show that not all data augmentation tasks can be done effectively by online crowdsourcing workers. We focused on simple yes/no questions and observed a 63% accuracy rate in one HIT iteration, only marginally better than random chance.

However, we can draw other important conclusions about using MTurk for data augmentation from this case: alerting workers to the possibility of payment loss from sloppy work improves accuracy [ 61 ], as does the careful ordering of work into logical groups. Finally, researchers should be careful when evaluating work accuracy, as high error rates were maintained under consensus coding and showed little relationship to completion time.

The use of online crowdsourcing for survey and quasi-experimental research is gaining acceptance in the social sciences. A series of studies that compare the results of parallel surveys and experiments using MTurk and traditional methods have evaluated online crowdsourcing with generally positive assessments [ 29 , 30 , 45 ]. Our content analysis of published social science papers that use MTurk indicated that such evaluations have generated a set of informal norms around design and reporting for quasi-experimental and survey-style MTurk studies.

We argued that online crowdsourcing as a data augmentation platform holds unique potential to add validity and value to applications of big data to social science research questions at low cost, and our content analysis suggests that researchers are beginning to use it for these purposes. However, in contrast to the emergence of norms for experimental and survey research with online crowdsourcing platforms, we found little evidence of standards for the design and reporting of data augmentation with such tools. We addressed that gap in the literature by presenting a series of three case studies designed to consider specific big data augmentation challenges, test MTurk data augmentation against known benchmarks, and improve the research community’s understanding of best practices of data augmentation through online crowdsourcing.

In this section, we consider the implications of both the content analysis and our three case studies in the context of past recommendations about online crowdsourcing for academic research. We aim to provide evidence-based guidance for researchers in two situations: (1) those exploring the viability of online crowdsourced data augmentation for a project, and (2) those seeking to improve the validity and value of data augmentation efforts with online crowdsourcing. While we believe this guidance will be most useful to researchers seeking to apply big data to social science research questions, we think that it may be of interest to researchers conducting more traditional social science analyses as well. Finally, we hope that future researchers, reviewers, and editors will find these considerations useful when evaluating data quality, reporting adequacy, and replicability in online crowdsourcing studies. To advance that goal we offer a model reporting template in the S1 Appendix .

Strengths and limitations of using online crowdsourcing for data augmentation

Our three case studies test whether and when online crowdsourcing is practical for adding validity and value to big data projects. We found that data augmentation through online crowdsourcing platforms performs best in instances like case 1, where target data are clearly defined and standardized, but it is too time-consuming, challenging, or costly to automate information discovery or for trained coders to manually recover and evaluate this information. In such tasks, workers on online crowdsourcing platforms can find and code information quickly and efficiently. The results of case 2 suggest that researchers must consider the importance of the specific output data and likely return on investment before fielding HITs. While results in this case were accurate, most books lacked a match, reducing the effective value of data augmentation through online crowdsourcing. Nonetheless, were this case focused on a larger project with tens of thousands of missing records, for instance, gains could be substantial. Case 3 looked at MTurk’s potential for research beyond simple data augmentation tasks, but it offers a more cautionary tale, wherein the non-specialized skills and task completion incentives of online crowdsourcing workers led to poor accuracy. While data augmentation through online crowdsourcing may not satisfy the complex needs of standard sociological studies such as the HMS-IWS, it can still save time and cost when used for smaller, more straightforward portions of the data collection process that would be necessary with data augmentation.

To the extent that each of the following is true, we argue that using online crowdsourcing for data augmentation should be considered more beneficial for potential cost and time savings:

  • Data collection cannot readily be automated.
  • Data can be found and/or coded by web-savvy persons without special training or knowledge.
  • Analytic needs for data are factual and do not include population estimates or comparisons with under-represented groups (minorities, individuals outside the US/India, older Americans, etc.).
  • Factual tasks can be split into smaller chunks without substantial duplication of effort.
  • Rapid results and the ability to test alternative instruments (e.g. pilot tests) are advantageous.

Best practices for academic requesters

Given the broad range of goals, methods, and tools used by academic requesters, this section provides evidence-based guidance for maximizing the validity and value of data augmentation using online crowdsourcing marketplaces. It assumes a researcher’s goal is data augmentation, but it is also broadly applicable to surveys and experiments, with differences as noted. Once the decision has been made to use online crowdsourcing for data augmentation, a typical workflow includes three phases: design, collection, and analysis.

The design phase is most critical; it sets conditions for success in subsequent phases. Clear visual design and precise, jargon-free instructions increase worker efficiency and lower the post-collection burden on requesters to manually check data quality. Based on experimental tests in cases 2 and 3, we recommend providing comprehensive instructions and examples, but highlighting (through size, color, placement, etc.) the most important instructions for task success, as well as how work will be evaluated in payment decisions. Formative pilot studies can help to identify problems with design. If using external tools, such as pairing MTurk with survey administration platforms, it is vital to pretest HITs and ensure the correct operation of validation processes for external task completion. Malfunctioning codes are a common complaint on worker forums, as workers who have invested as much as an hour in a survey may be unable to receive compensation. We recommend pre-testing all HITs on the requester sandbox and testing codes as part of this process.

Clear design for search or evaluation tasks faces the additional challenge of user customization and personalization. Major internet search engines often customize results based on user location and past search history. Requesters seeking to collect data that are comparable across cases should minimize variability by embedding custom search links in the directions, using non-personalized search engines such as DuckDuckGo, as we did in case study 1, and specifying how many results to use (e.g. the first 20). Search links can contain elements from the input that vary between cases, embed Boolean logic, and restrict results to specific domains.
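
A custom search link of the kind described above can be built from each case's input values. The sketch below is an illustration rather than the link format used in case study 1: it assumes DuckDuckGo's `q` query parameter and `site:` operator, and the function name, faculty name, and domain are hypothetical.

```python
from urllib.parse import urlencode

def search_link(name: str, domain: str) -> str:
    """Build a search URL for an exact name, restricted to one web domain."""
    # Quoting the name requests an exact phrase; site: limits the domain.
    query = f'"{name}" site:{domain}'
    return "https://duckduckgo.com/?" + urlencode({"q": query})

# Hypothetical input row: a faculty member and their institution's domain.
link = search_link("Jane Doe", "example.edu")
print(link)
```

Generating the link from the input data means every worker starts from the same query, which removes one source of variation between cases.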

Cases 1 and 3 demonstrated two additional principles specific to data augmentation and other factual HITs: a) iterative data collection, and b) related task grouping. Iterative data collection favors rapid and efficient collection of a limited range of data over single-shot data collections designed to answer numerous questions. With large online crowdsourcing marketplaces, a sizable labor force is always available, and researchers can easily integrate prior task output into subsequent input. Outside of tasks requiring extensive setup or training, delaying follow-up questions to later tasks or collecting data for a sample rather than every case poses little threat to data quality. The ease of redeployment and incremental expansion generally make it better to wait when it is unclear whether a researcher will need a specific piece of information, preparing follow-ups as necessary.

We refer to the splitting of work into smaller and more coherent tasks as related task grouping and argue that it improves work quality. Compared to initial single-shot versions of study 1, splitting the design into three HITs decreased cost and improved accuracy. Smart chunking lets workers self-select into tasks and not feel constrained to finish a longer task poorly to avoid sunk time. In both studies 1 and 2, a small proportion of the total number of workers completed most HITs, spending less time per HIT with at least equal accuracy. Related task grouping also avoids overpaying for work that is not completed. For example, a common application of big data augmentation through online crowdsourcing is asking workers to answer questions about a specific web link. If the link is invalid, any subsequent questions are inapplicable. If finding the initial links is also a goal, devoting a single task to identifying a suitable web address and asking subsequent workers to verify web address accuracy can save on excess pay while also providing cross-verification of the initial task’s success.

Big data augmentation with online crowdsourcing is often swift and hands-off once HITs are posted, but some simple steps before, during, and immediately following HITs can improve data quality and requester reputation. Before activating a HIT, requesters can freely specify minimum worker qualifications, such as by only requesting workers with evidence of past task success or who have completed pre-tests [ 62 and 63 discuss tools for requesters more extensively]. Requesters should also monitor their registered email during and immediately following HIT batches, as workers may contact them when they are unsure about the appropriate response, to report unclear directions or glitches, and to appeal rejections. Many circumstances, including browser malfunction, accidental user error, or common mistakes can result in rejection of ambiguous or good work, so researchers often accept all complete HITs and later remove poor quality data.

Of the phases of online crowdsourcing implementation, scholars have paid the least attention to analysis and reporting. The variety of big data, their relative lack of structure, and the priority of computer science and engineering over the social sciences in the field have contributed to inconsistent reporting. For data augmentation with online crowdsourcing tools to increase the validity and value of big data, transparency is imperative as to the procedure used to collect the data, how their integrity was verified, and relevant information on workers.

We provide a recommended reporting template in the S1 Appendix with both standard items that should be included in reporting all online crowdsourcing studies and items to use in reporting specifically for big data augmentation. We recommend researchers report on key study features, its purpose and implementation, and the exact criteria that they used to determine data quality, including at least one of several potential validity checks. Whenever possible, we suggest that both instruments and output data should be made available through public data repositories, such as the Open Science Framework and the Dataverse network, or other publicly accessible sites, such as GitHub repositories. In either case, standard confidentiality practices should be observed in removing unique worker numbers and other potentially identifying information before publishing data, and researchers must adhere to relevant human subjects research guidelines when appropriate.
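
One way to remove unique worker numbers while keeping each worker's rows linked is to replace the identifiers with arbitrary codes before release. This is a minimal sketch under our own assumptions: the field name, example identifiers, and code format are hypothetical, and any other identifying columns would need separate handling.

```python
import random

def anonymize_workers(rows, id_field="worker_id", seed=None):
    """Replace worker identifiers with arbitrary codes such as W001."""
    ids = sorted({row[id_field] for row in rows})
    # Shuffle so the codes do not reveal the ordering of the original IDs.
    random.Random(seed).shuffle(ids)
    codes = {wid: f"W{i:03d}" for i, wid in enumerate(ids, start=1)}
    # Return copies so the original rows are left untouched.
    return [{**row, id_field: codes[row[id_field]]} for row in rows]

# Hypothetical output rows; the identifiers are made up.
rows = [
    {"worker_id": "A1EXAMPLEID", "item": "q1", "answer": "yes"},
    {"worker_id": "A2EXAMPLEID", "item": "q1", "answer": "no"},
    {"worker_id": "A1EXAMPLEID", "item": "q2", "answer": "no"},
]

public_rows = anonymize_workers(rows)
print(public_rows)
```

The mapping from original identifiers to codes is discarded here; a researcher who needs to answer worker appeals would keep it in a private file that is never published.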

Worker compensation is a final issue that deserves discussion. Typical worker compensation among the few academic studies that report hourly pay on MTurk is $1–2 per hour, rates that prior work suggests produce reliable results [ 48 ]. These rates, however, are far below U.S. minimum wages and legal only because MTurk workers are self-employed contractors not subject to minimum wage laws. Buhrmester and colleagues [ 48 ] found that compensation was not the most commonly cited motivation for workers, but recent findings suggest many workers rely on MTurk as primary or supplemental income [ 52 , 64 , 65 ]. We worry that such low payment rates can damage the broader research community by hurting the reputation of academic researchers. A 2014 experiment [ 66 ] estimated that HITs from requesters with good reputations in the online review forum Turkopticon recruit workers at twice the rate of those with poor reputations [ 64 , 67 ]. We encourage researchers who wish to estimate costs to collect a small pilot study and target average hourly compensation of at least the U.S. federal minimum wage (currently $7.25).
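
A pilot study yields typical completion times, from which a per-HIT reward meeting an hourly target can be derived. The sketch below assumes the $7.25 federal minimum wage cited above and, purely as example inputs, the average completion times reported for study 2; the function names are hypothetical, and rewards are rounded up to the next cent so the target is not undershot.

```python
from math import ceil

FEDERAL_MINIMUM_WAGE = 7.25  # dollars per hour, as cited in the text

def reward_per_hit(seconds_per_hit, hourly_target=FEDERAL_MINIMUM_WAGE):
    """Smallest reward in whole cents that meets the hourly target."""
    exact = hourly_target * seconds_per_hit / 3600
    return ceil(exact * 100) / 100

def effective_hourly_pay(reward, seconds_per_hit):
    """Hourly pay implied by a reward and a typical completion time."""
    return reward * 3600 / seconds_per_hit

# Example inputs: average seconds per task reported for study 2.
reward_experienced = reward_per_hit(126)  # workers completing multiple tasks
reward_first_time = reward_per_hit(298)   # workers completing a single task

print(reward_experienced, round(effective_hourly_pay(reward_experienced, 126), 2))
print(reward_first_time, round(effective_hourly_pay(reward_first_time, 298), 2))
```

Setting the reward from the slower group's time is the more conservative choice, since it keeps first-time workers at or above the target as well.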

This paper offers data augmentation through online crowdsourcing as a scalable and low-cost means to address common concerns regarding the validity and value of big data in the social sciences. Whereas prior work has focused on the generalizability and ethics of big data, issues of validity and value have received considerably less attention. At the same time, while many have used online crowdsourcing marketplaces such as MTurk for drawing samples, or for experimental studies, few researchers have used them for data augmentation. In this paper, we attempted to bridge these literatures. We reviewed existing practices in academic research using online crowdsourcing and considered three empirical cases where big data augmentation through crowdsourcing enhanced ongoing research or illustrated the limits of data augmentation with such tools. Based on these analyses, we provided general guidance and best practices for academic research that uses online crowdsourcing for data augmentation and a standardized reporting framework. Although we emphasized the use of online crowdsourcing for big data augmentation, many of our findings and recommendations may be of value to researchers considering online crowdsourced labor for other tasks like fielding surveys. There is substantial promise in using online crowdsourcing to free up research assistant time without the need for highly skilled programmers, and this paper offers some first steps to formalize knowledge about the potential for using these tools to help answer social science research questions.

Supporting information

S1 Appendix. Reporting template and how to use it.

  • 4. Watts DJ. Everything Is Obvious: How Common Sense Fails Us. Random House; 2012.
  • 10. Entwisle B, Elias P. Changing Science: New Data for Understanding the Human Condition. OECD Global Science Forum Report on Data and Research Infrastructure for the Social Sciences. 2013 Paris, France: Organization for Economic Co-Operation and Development.
  • 15. Yin P, Ram N, Lee WC, Tucker C, Khandelwal S, Salathe M. Two Sides of a Coin: Separating Personal Communication and Public Dissemination Accounts in Twitter. In V.S. Tseng, Tu Bao Ho, Zhi-Hua Zhou, Arbee L.P. Chen and Hung-Yu Kao (eds.). PAKDD 2014, Part I 163–174.
  • 21. Lohr S. The Age of Big Data. New York Times Feb 12, 2012, p. 1(L).
  • 22. Mayer-Schönberger V, Cukier K. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt; 2013.
  • 36. Tufekci Z. Big questions for social media big data: Representativeness, validity and other methodological pitfalls. Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media. 2014.
  • 43. Levitt GM. Turk, Chess Automaton. McFarland & Company; 2016.
  • 44. Standage T. Mechanical Turk: The True Story of the Chess Playing Machine that Fooled the World. Penguin Books; 2004.
  • 52. Hitlin P. Research in the Crowdsourcing Age, a Case Study. Pew Research Center. 2016.
  • 54. Verdery A. Three Essays on Interdisciplinarity and Knowledge Production. 2015. Doctoral Dissertation, Department of Sociology, University of North Carolina at Chapel Hill.
  • 58. Gaddis SM, Ramirez D, Hernandez EL. Variations in Endorsed and Perceived Mental Health Treatment Stigma across U.S. Higher Education Institutions. Stigma and Health.
  • 60. Ipeirotis PG, Provost F, Wang J. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation (HCOMP ‘10). Association for Computing Machinery, New York, NY, USA; 2010. 64–67.
  • 62. Leeper TJ, Messing S, Murphy S, Chang J. MTurkR: R Client for the MTurk Requester API (version 0.6.17). 2015.
  • 64. Irani LC, Silberman MS. Turkopticon: Interrupting Worker Invisibility in Amazon Mechanical Turk. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, April, 611–20.
  • 66. Benson A, Sojourner AJ, Umyarov A. The Value of Employer Reputation in the Absence of Contract Enforcement: A Randomized Experiment 2015.
  • 67. Silberman SM. Human-Centered Computing and the Future of Work: Lessons from Mechanical Turk and Turkopticon, 2008–2015. 2015. PhD Dissertation, Irvine: University of California.

  • Open access
  • Published: 04 August 2020

Moving back to the future of big data-driven research: reflecting on the social in genomics

  • Melanie Goisauf,
  • Kaya Akyüz &
  • Gillian M. Martin

Humanities and Social Sciences Communications volume 7, Article number: 55 (2020)

  • Science, technology and society

With the advance of genomics, specific individual conditions have received increased attention in the generation of scientific knowledge. This spans the extremes of the aim of curing genetic diseases and identifying the biological basis of social behaviour. In this development, the ways knowledge is produced have gained significant relevance, as the data-intensive search for biology/sociality associations has repercussions on doing social research and on theory. This article argues that an in-depth discussion and critical reflection on the social configurations that are inscribed in, and reproduced by genomic data-intensive research is urgently needed. This is illustrated by debating a recent case: a large-scale genome-wide association study (GWAS) on sexual orientation that suggested partial genetic basis for same-sex sexual behaviour (Ganna et al. 2019b). This case is analysed from three angles: (1) the demonstration of how, in the process of genomics research, societal relations, understandings and categorizations are used and inscribed into social phenomena and outcomes; (2) the exploration of the ways that the (big) data-driven research is constituted by increasingly moving away from theory and methodological generation of theoretical concepts that foster the understanding of societal contexts and relations (Kitchin 2014a); and (3) the demonstration of how the assumption of ‘free from theory’ in this case does not mean free of choices made, which are themselves restricted by data that are available. In questioning how key sociological categories are incorporated in a wider scientific debate on genetic conditions and knowledge production, the article shows how underlying classification and categorizations, which are inherently social in their production, can have wide ranging implications. 
The conclusion cautions against the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.

With the advance of genomic research, specific individual conditions received increased attention in scientific knowledge generation. While understanding the genetic foundations of diseases has become an important driver for the advancement of personalized medicine, the focus of interest has also expanded from disease to social behaviour. These developments are embedded in a wider discourse in science and society about the opportunities and limits of genomic research and intervention. With the emergence of the genome as a key concept for ‘life itself’, understandings of health and disease, responsibility and risk, and the relation between present conditions and future health outcomes have shifted, impacting also the ways in which identities are conceptualized under new genetic conditions (Novas and Rose 2000). At the same time, the growing literature of postgenomics points to evolving understandings of what ‘gene’ and ‘environment’ are (Landecker and Panofsky 2013; Fox Keller 2014; Meloni 2016). The postgenomic genome is no longer understood as merely directional and static, but rather as a complex and dynamic system that responds to its environment (Fox Keller 2015), where the social as part of the environment becomes a signal for activation or silencing of genes (Landecker 2016). At the same time, genetic engineering, prominently known as the gene-editing technology CRISPR/Cas9, has received considerable attention, but also caused concerns regarding its ethical, legal and societal implications (ELSI) and governance (Howard et al. 2018; Jasanoff and Hurlbut 2018). Taking these developments together, the big question of nature vs. nurture has taken on a new significance.

Studies which aim to reveal how biology and culture are being put in relation to each other appear frequently and pursue a genomic re-thinking of social outcomes and phenomena, such as educational attainment (Lee et al. 2018 ) or social stratification (Abdellaoui et al. 2019 ). Yet, we also witness very controversial applications of biotechnology, such as the first known case of human germline editing by He Jiankui in China, which has impacted the scientific community both as an impetus of wide protests and insecurity about the future of gene-editing and its use, but also instigated calls towards public consensus to (re-)set boundaries to what is editable (Morrison and de Saille 2019 ).

Against this background, this article debates a particular case that appeared within the same timeframe as these developments: a large-scale genome-wide association study (GWAS) on sexual orientation Footnote 1, which suggested a partial genetic basis for same-sex sexual behaviour (Ganna et al. 2019b). Some scientists have claimed for years that sexual orientation is partly heritable and have tried to identify a genetic basis for it (Hamer et al. 1993); however, this was the first time that genetic variants were identified as statistically significant and replicated in an independent sample. We consider this GWAS not only by questioning the ways genes are associated with “the social” within this research, but also by exploring how the complexity of the social is reduced through specific data practices in research.

The sexual orientation study also constitutes an interesting case to reflect on how knowledge is produced at a time when the data-intensive search for biology/sociality associations has repercussions on doing social research and on theory (Meloni 2014). Large amounts of genomic data are needed to identify genetic variations and to find correlations with different biological and social factors. The rise of the genome corresponds to the rise of big data, as the collection and sharing of genomic data gains power with the development of big data analytics (Parry and Greenhough 2017). A growing number of correlations linking the genome to the social are being found, e.g. in the genomics of educational attainment (Lee et al. 2018; Okbay et al. 2016), increasingly blurring the established biological/social divide. These could open up new ways of understanding life and underpin the importance of culture, while, paradoxically, they may also carry the risk of a new genetic determinism and essentialism. The changing understanding of the now molecularised and datafied body also illustrates the changing significance of empirical research and sociology (Savage and Burrows 2007) in the era of postgenomics and ‘datafication’ (Ruckenstein and Schüll 2017). These developments are situated within methodological debates in which social sciences often appear through the perspective of ELSI.

As the field of genomics is progressing rapidly and intervention in the human genome is no longer science fiction, we argue that it is important to discuss and reflect now on the social configurations that are inscribed in, and reproduced by, genomic data-driven research. These may co-produce the conception of certain potentially editable conditions, i.e. create new, and reproduce existing, classifications that are largely shaped by societal understandings of difference and order. Such definitions could have real consequences, as Thomas and Thomas (1929) remind us, for individuals and societies, and mark what has been described as an epistemic shift in biomedicine from the clinical gaze to the ‘molecular gaze’ where the processes of “medicalisation and biomedicalisation both legitimate and compel interventions that may produce transformations in individual, familial and other collective identities” (Clarke et al. 2013, p. 23). While Science and Technology Studies (STS) has demonstrated how science and society are co-produced in research (Jasanoff 2004), we want to use the momentum of the current discourse to critically reflect on these developments from three angles: (1) we demonstrate how, in the process of genomics research, societal relations, understandings and categorizations are used and inscribed into social phenomena and outcomes; (2) we explore the ways that (big) data-driven research is constituted by increasingly moving away from theory and the methodological generation of theoretical concepts that foster the understanding of societal contexts and relations (Kitchin 2014a); and (3) using the GWAS case in focus, we show how the assumption of ‘free from theory’ (Kitchin 2014a) in this case does not mean free of choices made, choices which are themselves restricted by the data that are available.
We highlight Griffiths’ ( 2016 ) contention that the material nature of genes, their impacts on biological makeup of individuals and their socially and culturally situated behaviour are not deterministic, and need to be understood within the dynamic, culturally and temporally situated context within which knowledge claims are made. We conclude by making the important point that ignoring the social may lead to a distorted, datafied, genomised body which ignores the key fact that “genes are not stable but essentially malleable” (Prainsack 2015 ) and that this ‘malleability’ is rooted in the complex interplay between biological and social environments.

From this perspective, the body is understood through the lens of embodiment, considering humans ‘live’ their genome within their own lifeworld contexts (Rehmann-Sutter and Mahr 2016 ). We also consider this paper as an intervention into the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.

In the following reflections, we proceed step by step: First, we introduce the case of the GWAS on same-sex sexual behaviour, as well as its limits, context and impact. Second, we recall key sociological theory on categorizations and their implications. Third, we discuss the emergence of a digital-datafication of scientific knowledge production. Finally, we conclude by cautioning against the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.

Studying sexual orientation: The case of same-sex sexual behaviour

Currently, a number of studies at the intersection of genetic and social conditions appear on the horizon. Just as in the examples we have already mentioned, such as those on educational attainment (Lee et al. 2018 ), or social stratification (Abdellaoui et al. 2019 ), it is important to note that the limit to such studies is only the availability of the data itself. In other words, once the data is available, there is always the potential that it would eventually be used. This said, an analysis of the entirety of the genomic research on social outcomes and behaviour is beyond the scope of this article. Therefore, we want to exemplify our argument with reference to the research on the genetics of same-sex sexual behaviour.

Based on a sample of half a million individuals of European ancestry, the first large-scale GWAS of its kind claims that five genetic variants contribute to the assessed “same-sex sexual behaviour” (Ganna et al. 2019b). Among these variants, two are significant only for male–male sexual behaviour, one only for female–female sexual behaviour, and the remaining two for both. The data that led to this analysis were sourced from biobanks/cohorts with different methods of data collection. The authors conclude that these genetic variations are not predictive of sexual orientation; not only because genetics is supposedly only part of the picture, but also because the variations are only a small part (<1% of the variance in same-sex sexual behaviour, p. 4) of the approximated genetic basis (8–25% of the variance in same-sex sexual behaviour) that may be identified with large sample sizes (p. 1). The study is an example of how the ‘gay gene’ discourse, which has been around for years, is transformed by the data accumulating in biobanks and the consequent genomic analysis, offering only one facet of a complex social phenomenon: same-sex sexual behaviour.

The way the GWAS has been conducted was not novel in terms of data collection. Genome-wide studies of similar scale, e.g. on insomnia (Jansen et al. 2019 ) or blood pressure (Evangelou et al. 2018 ), often rely on already collected data in biobanks rather than trying to collect hundreds of thousands of individuals’ DNA from scratch. Furthermore, in line with wider developments, the study was preregistered Footnote 2 with an analysis plan for the data to be used by the researchers. Unlike other GWASes, however, the researchers partnered with an LGBTQIA+ advocacy group (GLAAD) and a science communication charity (Sense About Science), where individuals beyond the research team interpreted the findings and discussed how to convey the results Footnote 3 . Following these engagements, the researchers have produced a website Footnote 4 with potential frequently asked questions as well as a video about the study, highlighting what it does and what it does not claim.

Despite efforts to keep the study from drifting into genetically deterministic and discriminatory interpretations, it has been criticized by many Footnote 5. Indeed, the controversial “How gay are you?” Footnote 6 app on the GenePlaza website utilized the findings of the study, which in turn raised alarm bells, and the app was ultimately taken down after much debate. The application, however, showed how rapidly such findings can translate into individualized systems of categorization, and consequently feed into and be fed by the public imaginary. One of the study authors calls for the research to continue, noting that “[s]cientists have a responsibility to describe the human condition in a more nuanced and deeper way” (Maxmen 2019, p. 610). Critics, however, note that the context in which the data were collected from individuals may influence the findings; for instance, past developments (i.e. decriminalization of homosexuality, the HIV/AIDS epidemic, and legalization of same-sex marriage) are relevant to understanding the UK Biobank’s donor profile, and if the GWAS were to be redone according to the birth year of the individuals, different findings could have come out of the study (Richardson et al. 2019, p. 1461).

It has been pointed out that such research should be assessed by a competent ethical review board according to its potential risks and benefits (Maxmen 2019 , p. 610), in addition to the review and approval by the UK Biobank Access Sub-Committee (Ganna et al. 2019a , p. 1461). Another ethical issue of concern raised by critics is that the informed consent form of UK Biobank does not specify that it could be used for such research since “homosexuality has long been removed from disease classifications” and that the broad consent forms allow only “health-related research” (Holm and Ploug 2019 , p. 1460). We do not want to make a statement here for or against broad consent. However, we argue that discussions about informed consent showcase the complexities related to secondary use of data in research. Similarly, the ‘gay gene’ app developed in the wake of the sexual orientation study, revealed the difficulty of controlling how the produced knowledge may be used, including in ways that are openly denounced by the study authors.

To the best of our knowledge, there have not been similar genome-wide studies published on sexual orientation and, while we acknowledge the limitations associated with focusing on a single case in our discussion, we see this case as relevant to opening up the following question: How are certain social categorizations incorporated into the knowledge production practices? We want to answer this by first revisiting some of the fundamental sociological perspectives into categorizations and the social implications these may have.

Categorizing sex, gender, bodies, disease and knowledge

Sociological perspectives on categorizations.

Categorizations and classifications take a central role in the sociology of knowledge, social stratifications and data-based knowledge production. Categories like gender, race, sexuality and class (and their intersection, see Crenshaw 1989 ) have become key classifications for the study of societies and in understanding the reproduction of social order. One of the most influential theories about the intertwining of categories like gender and class with power relations was formulated by Bourdieu ( 2010 , 2001 ). He claimed that belonging to a certain class or gender is an embodied practice that ensures the reproduction of social structure which is shaped by power relations. The position of subjects within this structure reflects the acquired cultural capital, such as education. Incorporated dispositions, schemes of perception, appreciation, classification that make up the individual’s habitus are shaped by social structure, which actors reproduce in practices. One key mechanism of social categorization is gender classification. The gender order appears to be in the ‘nature of things’ of biologically different bodies, whereas it is in fact an incorporated social construction that reflects and constitutes power relations. Bourdieu’s theory links the function of structuring classifications with embodied knowledge and demonstrates that categories of understanding are pervaded by societal power relations.

In a similar vein Foucault ( 2003 , 2005 ) describes the intertwining of ordering classifications, bodies and power in his study of the clinic. Understandings of and knowledge about the body follow a specific way of looking at it—the ‘medical gaze’ of separating the patient’s body from identity and distinguishing healthy from the diseased, which, too, is a process pervaded by power differentials. Such classifications evolved historically. Foucault reminds us that all periods in history are characterized by specific epistemological assumptions that shape discourses and manifest in modalities of order that made certain kinds of knowledge, for instance scientific knowledge, possible. The unnoticed “order of things”, as well as the social order, is implemented in classifications. Such categorizations also evolved historically for the discourse about sexuality, or, in particular as he pointed out writing in the late 1970s, distinguishing sexuality of married couples from other forms, such as homosexuality (Foucault 1998 ).

Bourdieu and Foucault offer two influential approaches within the wider field of sociology of knowledge that provide a theoretical framework on how categorizations and classifications structure the world in conjunction with social practice and power relations. Their work demonstrates that such structuration is never free from theory, i.e. it does not exist prediscursively, but is embedded within a certain temporal and spatial context that constitutes ‘situated knowledge’ (Haraway 1988). Consequently, classifications create a (social) order that cannot be understood as ‘naturally’ given but as a result of relational social dynamics embedded in power differentials.

Feminist theory in the 1970s emphasized the inherently social dimension of male and female embodiment and distinguished between biological sex and socially rooted gender. This distinction built the basis for a variety of approaches that examined gender as a social phenomenon, as something that is (re-)constructed in social interaction and shaped by collectively held beliefs and normative expectations. Consequently, the difference between men and women was no longer simply understood as a given biological fact, but as something that is also a result of socialization and relational exchanges within social contexts (see, e.g., Connell 2005; Lorber 1994). Belonging to a gender or sex is a complex practice of attribution, assignment, identification and, consequently, classification (Kessler and McKenna 1978). The influential concept of ‘doing gender’ emphasized that not only gender, but also the assignment of sex, is based on socially agreed-upon biological classification criteria that form the basis of placing a person in a sex category, which needs to be practically sustained in everyday life. The analytical distinction between sex and gender eventually became implausible as it obscures the process in which the body itself is subject to social forces (West and Zimmerman 1991).

In a similar way, sexual behaviour and sexuality are also shaped by society, as societal expectations influence sexual attraction—in many societies within normative boundaries of gender binary and heteronormativity (Butler 1990 ). This also had consequences for a deviation from this norm, resulting for example in the medicalisation of homosexuality (Foucault 1998 ).

Reference to our illustrative case study on the recently published research into the genetic basis of sexuality brings the relevance of this theorization into focus. The study cautions against the ‘gay gene’ discourse, the use of the findings for prediction, and genetic determinism of sexual orientation, noting “the richness and diversity of human sexuality” and stressing that the results do not “make any conclusive statements about the degree to which ‘nature’ and ‘nurture’ influence sexual preference” (Ganna et al. 2019b , p. 6).

Coming back to categorizations, more recent approaches from STS are also based on the assumption that classifications are a “spatio-temporal segmentation of the world” (Bowker and Star 2000, p. 10), and that classification systems are, similar to concepts of gender theory (e.g. Garfinkel 1967), consistent, mutually exclusive and complete. The “International Classification of Diseases (ICD)”, a classification scheme of diseases based on their statistical significance, is an example of such a historically grown knowledge system. How the ICD is utilized in practice points to the ethical and social dimensions involved (Bowker and Star 2000). Such approaches help to unravel current epistemological shifts in medical research and intervention, including the removal of homosexuality from the disease classification half a century ago.

Re-classifying diseases in tandem with genetic conditions creates new forms of ‘genetic responsibilities’ (Novas and Rose 2000). For instance, this may result in a change of the ‘sick role’ (described early in Parsons 1951) by creating new obligations not only for diseased persons but also for actually healthy persons in relation to potential futures. Such genetic knowledge is increasingly produced using large-scale genomic databases; it creates new categories based on genetic risk and, consequently, may result in new categories of individuals who are ‘genetically at risk’ (Novas and Rose 2000). The question now is how these new categories will alter, structure or replace evolved categories in terms of constructing the social world and medical practice.

While advancement in genomics is changing understandings of bodies and diseases, the meanings of certain social categories for medical research remain rather stable. Developments of personalized medicine go along with “the ‘re-inscription’ of traditional epidemiological categories into people’s DNA” and adherence to “old population categories while working out new taxonomies of individual difference” (Prainsack 2015 , pp. 28–29). This, again, highlights the fact that knowledge production draws on and is shaped by categories that have a political and cultural meaning within a social world that is pervaded by power relations.

From categorization to social implication and intervention

While categorizations are inherently social in their production, their use in knowledge production has wide-ranging implications. Such is the case with the geneticisation of sexual orientation, an issue that has both troubled and comforted LGBTQIA+ communities. Despite the non-existence of an identified gene, the ‘gay gene’ has been part of societal discourse. Such circulation places unequal emphasis on biologized interpretations of sexual orientation, which may be portrayed differently in the media and appeal to groups of opposing views in contrasting ways (Conrad and Markens 2001). Geneticisation, especially through the media, moves sexual orientation into an oppositional framework between individual choice and biological consequence (Fausto-Sterling 2007), and there have been mixed opinions within LGBTQIA+ communities as to whether this would resolve the moralization of sexual orientation or be a move back into its medicalisation (Nelkin and Lindee 2004). Thus, while some activists support geneticisation, others resist it and work against the potential medicalisation of homosexuality (Shostak et al. 2008). The ease of communicating to the general public a simple genetic basis for complex social outcomes, which are genetically more complex than reported, contributes to the geneticisation process, while scientific failures to replicate ‘genetic basis’ claims do not get reported (Conrad 1999). In other words, while the idea of a genetic basis becomes entrenched in the public imaginary, research showing the opposite does not get an equal share in the media and societal discourse, and neither does the social sciences’ critique of knowledge production that has been discussed for decades.

A widely, and often quantitatively, studied aspect of the geneticisation of sexual orientation is how it plays out in the broader understanding of sexual orientation in society. While there are claims that geneticisation of sexual orientation can result in the depoliticization of identities (O’Riordan 2012), it may at the same time lead to the polarization of society. According to social psychologists, genetic attributions to conditions are likely to lead to perceptions of immutability, specificity in aetiology, homogeneity and discreteness, as well as to the naturalistic fallacy (Dar-Nimrod and Heine 2011). Despite the multitude of surveys suggesting that belief in a genetic basis of homosexuality correlates with acceptance, some studies suggest that learning about genetic attribution to homosexuality can be polarizing and can confirm previously held negative or positive attitudes (Boysen and Vogel 2007; Mitchell and Dezarn 2014). Such conclusions can be taken as a caution that, just as scientific knowledge production is social, its consequences are, too.

Looking beyond the case

We want to exemplify this argument by taking a detour to another case where the intersection between scientific practice, knowledge production and the social environment is of particular interest. While we have discussed the social implications of geneticisation with a focus on sexual orientation, recent developments in biomedical sciences and biotechnology also have the potential to reframe the old debates in entirely different ways. For instance, while ‘designer babies’ were only an imaginary concept until recently, the facility and affordability of processes such as in vitro selection of a baby’s genotype and germline genome editing have potentially important impacts in this regard. When the CRISPR/Cas9 technique was developed for rapid and easy gene editing, both the hopes and worries associated with its use were high. Martin and others (2020, pp. 237–238) claim gene editing is causing both disruption within the postgenomic regime, specifically to its norms and practices, and the convergence of various biotechnologies such as sequencing and editing. Against this background, He Jiankui’s announcement in November 2018 through YouTube Footnote 7 that twins were born with edited genomes was an unwelcome surprise for many. This unexpected move may have hijacked the discussions on the ethical, legal and societal implications of human germline genome editing, but it also rang alarm bells across the globe about similar “rogue” scientists planning experimentation with the human germline (Morrison and de Saille 2019). The facility to conduct germline editing is, logically, only one step away from ‘correcting’, and if there is a correction, then that would mean a return to a normative state. He’s construction of HIV infection as a genetic risk can be read as a placeholder for numerous questions about human germline editing: What are the variations that are “valuable” enough for a change in the germline?
For instance, there are plans by Denis Rebrikov in Russia to genome edit embryos to ‘fix’ a mutation that causes congenital deafness (Cyranoski 2019 ). If legalized, what would be the limits applied and who would be able to afford such techniques? At a time when genomics research into human sociality is booming, would the currently produced knowledge in this field and others translate into ‘corrective’ genome-editing? Who would decide?

The science in itself is still unclear at this stage: for many complex conditions, using gene editing to change one allele to another is often minuscule in effect, considering that numerous alleles together may affect a phenotype, while at the same time a single allele may affect multiple phenotypes. In another GWAS case, social genomicists claim that thousands of variations are influential for a particular social outcome such as educational attainment (Lee et al. 2018), with each having minimal effect. It has also been shown in the last few years that, as the same study is conducted with ever larger samples, more genomic variants are associated with the social outcome: 74 single nucleotide polymorphisms (SNPs) were associated with the outcome in a sample of 293,723 individuals (Okbay et al. 2016) and 1271 SNPs in a sample of 1.1 million individuals (Lee et al. 2018).

Applying this reasoning to the GWAS on same-sex sexual behaviour, it is highly probable that the findings will be superseded in the following years with similar studies of bigger data, increasing the number of associations.
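
The expectation that larger samples yield more associated variants follows from statistical power: when each variant has a very small effect, the chance of detecting it rises steeply with sample size. The following is a minimal sketch of that reasoning, not a reconstruction of any cited study's analysis. It assumes a two-sided z-test per variant and the conventional genome-wide significance threshold of 5e-8, neither of which is stated in the text above, and the per-variant share of variance explained is a purely hypothetical value chosen for illustration. Only the two sample sizes come from the studies cited above.

```python
from statistics import NormalDist

# Conventional genome-wide significance threshold (an assumption here;
# the article does not state which threshold the cited studies used).
GENOME_WIDE_ALPHA = 5e-8

# Hypothetical share of outcome variance explained by a single variant
# (0.01%); illustrative only, not taken from any cited study.
HYPOTHETICAL_VARIANCE_EXPLAINED = 0.0001


def gwas_power(n_samples: int, variance_explained: float,
               alpha: float = GENOME_WIDE_ALPHA) -> float:
    """Approximate power of a two-sided z-test for one variant.

    Uses the large-sample approximation in which the test statistic is
    normal with mean sqrt(n * variance_explained) and unit variance.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    noncentrality = (n_samples * variance_explained) ** 0.5
    upper_tail = 1 - std_normal.cdf(z_crit - noncentrality)
    lower_tail = std_normal.cdf(-z_crit - noncentrality)
    return upper_tail + lower_tail


# Sample sizes reported for Okbay et al. 2016 and Lee et al. 2018.
for n in (293_723, 1_100_000):
    power = gwas_power(n, HYPOTHETICAL_VARIANCE_EXPLAINED)
    print(f"n = {n:>9,}: approximate power = {power:.3f}")
```

Under these assumptions, the same hypothetical variant is far more likely to pass the significance threshold in the larger sample, which is consistent with the growth in the number of reported associations without implying that any single association is large.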

A genomic re-thinking?

The examples outlined here have served to show how focusing the discussion on “genetic determinism” is fruitless considering the complexity of the knowledge production practices and how the produced knowledge could both mirror social dynamics and shape these further. A genomic rethinking of the social necessitates a new formulation of social equality, where genomes are also relevant. Within the work of social genomics researchers, there has been cautious optimism about the contribution of findings from genomics research to understanding the social outcomes of policy change (Conley and Fletcher 2018; Lehrer and Ding 2019). Two fundamental thoughts govern this thinking. First, a genetic basis is not to be equated with fate; in other words, ‘genetic predispositions’ make sense only within the broader social and physical environmental frame, which often allows room for intervention. Second, genetics often relates to the heterogeneity of individuals within a population, such that the same policy may be positive, neutral or negative for different individuals due to their genes. In this respect, knowledge gained via social genomics may be imagined as a basis for a more equal society in ‘uncovering’ invisible variables, while, paradoxically, it may also be a justification for the exclusion of certain groups. For example, a case that initially raised the possibility that policies affect individuals differently because of their genetic background was a genetic variant that was correlated with being unaffected by tax increases on tobacco (Fletcher 2012). The study suggested that raising taxes may be an ineffective tool for lowering smoking rates below a certain level, since those who continue to smoke may be those who cannot easily stop due to their genetic predisposition to smoking.
Similar ideas could also apply to a diverse array of knowledge produced in social genomics, where the policies may be under scrutiny according to how they are claimed to variably influence the members of a society due to their genetics.

Datafication of scientific knowledge production

From theory to data-driven science.

More than a decade has gone by since Savage and Burrows ( 2007 ) described a crisis in empirical research, where the well-developed methodologies for collecting data about the social world would become marginal as such data are being increasingly generated and collected as a by-product of daily virtual transactions. Today, sociological research faces a widely datafied world, where (big) data analytics are profoundly changing the paradigm of knowledge production, as Facebook, Twitter, Google and others produce large amounts of socially relevant data. A similar phenomenon is taking place through opportunities that public and private biobanks, such as UK Biobank or 23andMe, offer. Crossing the boundaries of social sciences and biological sciences is facilitated through mapping correlations between genomic data, and data on social behaviour or outcomes.

This shift from theory to data-driven science misleadingly implies a purely inductive knowledge production, neglecting the fact that data is not produced free of preceding theoretical framing, methodological decisions, technological conditions and the interpretation of correlations—i.e. an assemblage situated within a specific place, time, political regime and cultural context (Kitchin 2014a ). It glosses over the fact that data cannot simply be treated as raw materials, but rather as “inherently partial, selective and representative”, the collection of which has consequences (Kitchin 2014b , p. 3). How knowledge of the body is generated starts with how data is produced and how it is used and mobilized. Through sequencing, biological samples are translated into digital data that are circulated and merged and correlated with other data. With the translation from genes into data, their meaning also changes (Saukko 2017 ). The kind of knowledge that is produced is also not free of scientific and societal concepts.

Categorical variables assigned to individual genomes have become important for genomic research and are shaping the ways in which identities are conceptualized under (social) genomic conditions. These variables include characteristics of social identity, such as gender, ethnicity, and educational and socioeconomic status. They are often used to study human genetic variation and individual differences on the basis of demographic and ascribed social characteristics, with the aim of advancing personalized medicine.

The sexual orientation study that is central to this paper can be read as a case where such categories intersect with the mode of knowledge production. As the largest contributor of data to the study, UK Biobank’s data used in this research are revealing since they are based on the answer to the following question “Have you ever had sexual intercourse with someone of the same sex?” along with the statement “Sexual intercourse includes vaginal, oral or anal intercourse.” Footnote 8 .

Furthermore, the authors accept having made numerous reductive assumptions and that their study has methodological limitations. For instance, Ganna et al. ( 2019b ) acknowledge both within the article (p. 1) and an accompanying website Footnote 9 that the research is based on a binary ‘sex’ system with exclusions of non-complying groups as the authors report that they “dropped individuals from [the] study whose biological sex and self-identified sex/gender did not match” (p. 2). However, both categorizing sexual orientation mainly on practice rather than attraction or desire, and building it on normative assumptions about sexuality, i.e. gender binary and heteronormativity, are problematic, as sexual behaviour is diverse and does not necessarily correspond with such assumptions.

The variations found in the sexual orientation study, as is true for other genome-wide association studies, are often relevant for the populations studied, which in this case mainly belong to certain age groups and are of European ancestry. While the study avoids critique by saying that its research concerns not the genetics of sexual orientation but rather that of same-sex sexual behaviour, whether such a genomic study would be possible is also questionable. This example demonstrates that, despite the increasing influence of big data, a fundamental problem with the datafication of many social phenomena is whether or not they are amenable to measurement. In the case of sexual orientation, whether the answer to the sexual orientation questions corresponds to “homosexuality” or to “willingness to reveal homosexuality”/“stated sexual orientation” is debatable, considering the social pressure and stigma that may be present in certain social contexts (Conley 2009, p. 242).

While our aim is to bring a social scientific perspective, biologists have raised at least two critical opinions on the knowledge production practice in the case of the sexual orientation study: first on the implications of the produced knowledge Footnote 10 and second on the problems and flaws of the search for a genetic basis Footnote 11. In STS, however, genetic differences that were hypothesized to be relevant for health, especially under the category of race in the US, were a major point of discussion within the genomic ‘inclusion’ debates of the 1990s (Reardon 2017, p. 49; Bliss 2015). In other words, a point of criticism towards the knowledge production was the focus on certain “racial” or racialized groups, such as Americans of European ancestry, which supposedly biased the findings and the downstream development of therapies for ‘other’ groups. However, measuring health and medical conditions against the background of groups that are constituted based on social or cultural categories (e.g. age, gender, ethnicity) may also result in a reinscription/reconstitution of the social inequalities attached to these categories (Prainsack 2015). At the same time, it may result in health justice being seen through a postgenomics lens, where postgenomics is “a frontline weapon against inequality” (Bliss 2015, p. 175). Socio-economic factors may recede into the background, while data, with their own often invisible politics, are foregrounded.

Unlike what Savage and Burrows suggested in 2007, the coming crisis can be seen not only as a crisis of sociology, but as one of science in general. Just as the shift of focus in the social sciences towards digital data is only one part of the picture, another part could be the developments in the genomisation of the social. Considering that censuses and large-scale statistics are not new, what distinguishes the current phenomenon is possibly the opportunity to individualize the data, while the categories themselves are often unable to capture the complexity, despite producing knowledge more efficiently. In that sense, the above-mentioned survey questions do not do justice to the complexity of social behaviour. What is most important to flag within these transformations is the lack of reflexivity regarding how big data come to represent the world and whether they add to and/or take away from the ways of knowing before big data. These developments and directions of genetic-based research and big data go far beyond the struggle of a discipline, namely sociology, with a paradigm shift in empirical research. They could set the stage for real consequences for individuals and groups. Just as what is defined as an editable condition is determined through a social process that relies on socio-political categories, the knowledge acquired from big data relies in a similar way on the same kind of categories.

The data choices and restrictions: ‘Free from theory’ or freedom of choice

Data, broadly understood, have become a fundamental part of our lives, from accepting and granting different kinds of consent for our data to travel on the internet, to gaining the ‘right to be forgotten’ in certain countries, to being able to retrieve collected information about ourselves from states, websites, even supermarket chains. While becoming part of our lives, the data collected about individuals in the form of big data are transferred between academic and non-academic research, and between scientific and commercial enterprises. The associated changes in knowledge production have important consequences for the ways in which we understand and live in the world (Jasanoff 2004). The co-productionist perspective in this sense does not relate to whether or how the social and the biological are co-produced; rather, it points to how knowledge produced in science is both shaped by and shaping societies. Thus, the increasing impact and authority of big data in general, and within the sexual orientation study in focus here, open up new avenues to claim, as some suggest, that we have reached the end of theory.

The “end of theory” has been actively debated within and beyond science. Kitchin (2014a) locates the recent origin of this debate in a piece in Wired, where the author states “Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson 2008). Others call this a paradigm shift towards data-intensive research that leaves behind the empirical and theoretical stages (Gray 2009, p. xviii). While Google and others form the basis for this data-driven understanding through their predictive capacity, or by letting the data speak, the idea that knowledge production is ‘free from theory’ seems in this case, at best, to ignore any data infrastructure and how the categories are formed within it.

Taking a deeper look at the same-sex sexual behaviour study from this angle suggests that such research cannot be free from theory, as it has to make an assumption regarding the role of genetics in the context of social dynamics. In other words, it has to move sexual orientation, at least partially in the form of same-sex sexual behaviour, out of the domain of the social towards the biological. In doing so, and just as the study concludes that sexual orientation is complex, the authors note in the informative video Footnote 12 on their website that “they found that about a third of the differences between people in their sexual behaviour could be explained by inherited genetic factors. But the environment also plays a large role in shaping these differences.” While the study points to a minuscule component of the biological, it also frames biology as the basis upon which the social, as part of the environment, acts.

Reconsidering how the biological and the social are represented in the study, we find that three theoretical choices are made due to the limitations of the data. First of all, the biological is taken to be “the genome-wide data” in the biobanks that the study relies on. This means sexual orientation is assumed to be within the SNPs, or points on the genome that are common variations across a population, and not in other kinds of variations that are rare or not captured by the genotyped SNPs. These differences include, but are not limited to, large-scale to small-scale duplications and deletions of genomic regions, rare variants, or even common variants in the population that the SNP chips do not capture. Such ignored differences are very important for a number of conditions, from cancer to neurobiology. Similarly, the genomic focus leaves aside the epigenetic factors that could theoretically be the missing link between genomes and environments. In noting this, we do not suggest that the authors of the study are unaware of or uninterested in epigenetics; however, regardless of their interest and/or knowledge, the availability of large-scale genome-wide data puts such data ahead of any other variation in the genome and epigenome. In other words, if the UK Biobank and 23andMe had similar amounts of epigenomic or whole genome data beyond the SNPs, the study would most probably have relied on these other variations in the genome. The search for a genetic basis within SNPs is a theoretical choice, and in this case this choice is pre-determined by the limitations of the data infrastructures.

The second choice that the authors make is to take three survey questions, in the case of the UK Biobank data, as encompassing enough of the complexity of sexual orientation for their research. As partly discussed earlier, these questions simply ask about sexual behaviour. Based on the UK Biobank’s definition of sexual intercourse as “vaginal, oral or anal intercourse”, the answers to the following questions were relevant for the research: “Have you ever had sexual intercourse with someone of the same sex?” (Data-Field 2159), “How many sexual partners of the same sex have you had in your lifetime?” (Data-Field 3669), and “About how many sexual partners have you had in your lifetime?” (Data-Field 2149). Answers to such questions do little justice to the complexity of the topic. Considering that they were not included in the biobank for the purpose of identifying a genetic basis for same-sex sexual behaviour, it is worth considering in what capacity they are useful for that purpose. It is worth noting here that the UK Biobank is primarily focused on health-related research, and thus these three survey questions could not have been asked with a genomic exploration of ‘same-sex sexual behaviour’ or ‘sexual orientation’ in mind. How successfully they have been used to identify the genetic basis for complex social behaviours is questionable.

The authors of the study consider the UK Biobank sample to be composed of relatively old individuals and regard this as a shortcoming Footnote 13. Similarly, the study authors claim that 23andMe samples may be biased because “[i]ndividuals who engage in same-sex sexual behaviour may be more likely to self-select the sexual orientation survey”, which then explains the high percentage of such individuals (18.9%) (Ganna et al. 2019b, p. 1). However, the authors do not problematize that there is at least a three-fold difference between the youngest and oldest generations in the UK Biobank sample in their response to the same-sex sexual behaviour question (Ganna et al. 2019b, p. 2). The study thus highlights the problematic issue of who should be regarded as the representative sample to be asked about their “same-sex sexual behaviour”. Still, this is a data choice that the authors make in drawing a universal explanation from a very specific and socially constrained collection of self-reported data that encompasses only part of what the researchers are interested in.

The third choice is a choice unmade. The study data mainly came from UK Biobank, following a proposal by Brendan Zietsch with the title “Direct test whether genetic factors predisposing to homosexuality increase mating success in heterosexuals” Footnote 14. The original research plan frames “homosexuality” as a condition that heterosexuals can be “predisposed” to. As this condition is not eliminated through evolution, scientists hypothesize that whatever genetic variation predisposes an individual to homosexuality may also be functional in increasing the individual’s reproductive capacity. Despite using such an evolutionary explanation as the theoretical basis for obtaining the data from the UK Biobank, the authors use evolution/evolutionary only three times in the article, whereas the concept “mating success” is entirely missing. Contrary to the expectation in the research proposal, the authors observe a lower number of offspring for individuals reporting same-sex sexual behaviour, and they conclude briefly: “This reproductive deficit raises questions about the evolutionary maintenance of the trait, but we do not address these here” (Ganna et al. 2019b, p. 2). In other words, the hypothesis that allowed the scientists to acquire the UK Biobank data becomes irrelevant for the researchers when they report their findings.

In this section, we have analysed how data choices are made at different steps of the research and hinted at how these choices reflect certain understandings of how society functions. These are evident in the ways sexual behaviour is represented and categorized according to quantitative data, and in the considerations of whether certain samples are contemporary enough (UK Biobank) or too self-selecting (same-sex sexual behaviour being too high in 23andMe). The study, however, does not problematize how the percentage of individuals reporting same-sex sexual behaviour steadily increases according to year of birth, at least tripling for males and increasing more than five-fold for females between 1940 and 1970 (for UK Biobank). Such details are among the data that the authors display as descriptive statistics in Fig. 1 (Ganna et al. 2019b, p. 2); however, these do not attract the kind of discussion that the genomic data receive. The study itself starts from the idea that genetic markers associated with same-sex sexual behaviour could have an evolutionary advantage and ends by saying that the behaviour is complex. Critics claim the “approach [of the study] implies that it is acceptable to issue claims of genetic drivers of behaviours and then lay the burden of proof on social scientists to perform post-hoc socio-cultural analysis” (Richardson et al. 2019, p. 1461).

In this paper, we have ‘moved back to the future’, taking stock of the present-day accelerated impact of big data and of its potential and real consequences. Using the sexual orientation GWAS as a point of reference, we have shown that claims to working under the premise of a ‘pure science’ of genomics are untenable, as the social is present by default: within the methodological choices made by the researchers, the impact on/of the social imaginary or epigenetic context.

By focusing on the contingency of knowledge production on social categories that are themselves reflections of the social in data practices, we have highlighted the relational processes at the root of knowledge production. We are experiencing a period in which the repertoire of what gets quantified increases continuously, and possibly exponentially. However, this does not necessarily mean that our understanding of complexity increases at the same rate; rather, it may lead to unintended simplification, where meaningful levels of understanding of causality are lost in the “triumph of correlations” in big data (Mayer-Schönberger and Cukier 2013; cited in Leonelli 2014). While sociology has much to offer through its qualitative roots, we think it should do more than critique, especially considering that culturally and temporally specific understandings of the social are also linked to socio-material consequences.

We want to highlight that now is the time to think about the broader developments in science and society, not merely from an external perspective, but within a new framework. Clearly, our discussion of a single case here cannot sustain suggestions for a comprehensive framework applicable to any study; however, we can flag how urgently such a framework is required. We have shown that, in the context of the rapid developments within big data-driven and socio-genomic research, it is necessary to renew the argument for bringing the social, and its interrelatedness to the biological, clearly back into focus. We strongly believe that reemphasizing this argument is essential to underline the analytical strength of the social science perspective, and to avoid losing sight of the complexity of social phenomena, which risk being oversimplified in mainly statistical data-driven science.

We can also identify three interrelated dimensions of scientific practice that the framework would valorize: (1) Recognition of the contingency of choices made within the research process, and sensibility of their consequent impact within the social context. (2) Ethical responsibilities that move beyond procedural contractual requirements, to sustaining a process rooted in clear understanding of societal environments. (3) Interdisciplinarity in analytical practice that potentiates the impact of each perspectival lens.

Such a framework would facilitate moving out of the disciplinary or institutionalized silos of ELSI, STS, sociology, genetics, or even emerging social genomics. Rather than competing for authority on ‘the social’, the aim should be to critically complement each other and refract the produced knowledge with a multiplicity of lenses. Zooming ‘back to the future’ within the field of socio-biomedical science, we would flag the necessity of re-calibrating to a multi-perspectival endeavour—one that does justice to the complex interplay of social and biological processes within which knowledge is produced.

The GWAS primarily uses the term “same-sex sexual behaviour” as one of the facets of “sexual orientation”, where the former becomes the component that is directly associable with the genes and the latter the broader phenomenon of interest. Thus, while the article refers to “same-sex sexual behaviour” in its title, it is editorially presented in the same Science issue under the Human Genetics heading with the subheading “The genetics of sexual orientation” (p. 880) (see Funk 2019). Furthermore, the request for data from UK Biobank by the corresponding author Brendan P. Zietsch (see footnote 14) refers only to sexual orientation and homosexuality and not to same-sex sexual behaviour. Therefore, we follow the same interchangeable use in this article.


In addition to footnotes 10 and 11, for a discussion please see: (04.03.2020).

Later “122 Shades of Grey”: (04.03.2020).


Abdellaoui A, Hugh-Jones D, Yengo L, Kemper KE, Nivard MG, Veul L, Holtz Y, Zietsch BP, Frayling TM, Wray NR (2019) Genetic correlates of social stratification in Great Britain. Nat Hum Behav 1–21.

Anderson C (2008) The end of theory: the data deluge makes the scientific method obsolete. Wired. Accessed 31 Mar 2020

Bliss C (2015) Defining health justice in the postgenomic era. In: Richardson SS, Stevens H (eds) Postgenomics: perspectives on biology after the genome. Duke University Press, Durham/London, pp. 174–191


Bourdieu P (2001) Masculine domination. Stanford University Press, Stanford


Bourdieu P (2010) Distinction: a social critique of the judgement of taste. Routledge, London/New York

Bowker GC, Star SL (2000) Sorting things out: classification and its consequences. MIT Press, Cambridge/London


Boysen GA, Vogel DL (2007) Biased assimilation and attitude polarization in response to learning about biological explanations of homosexuality. Sex Roles 57(9–10):755–762.


Butler J (1990) Gender trouble. Feminism and the subversion of identity. Routledge, New York

Clarke AE, Shim JK, Shostak S, Nelson A (2013) Biomedicalising genetic health, diseases and identities. In: Atkinson P, Glasner P, Lock M (eds) Handbook of genetics and society: mapping the new genomic era. Routledge, Oxon, pp. 21–40

Conley D (2009) The promise and challenges of incorporating genetic data into longitudinal social science surveys and research. Biodemogr Soc Biol 55(2):238–251.

Conley D, Fletcher J (2018) The genome factor: what the social genomics revolution reveals about ourselves, our history, and the future. Princeton University Press, Princeton/Oxford

Connell RW (2005) Masculinities. Polity, Cambridge

Conrad P (1999) A mirage of genes. Sociol Health Illn 21(2):228–241.

Conrad P, Markens S (2001) Constructing the ‘gay gene’ in the news: optimism and skepticism in the US and British press. Health 5(3):373–400.

Crenshaw K (1989) Demarginalizing the intersection of race and sex: a black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics, vol 1989(8). University of Chicago Legal Forum. Accessed 1 Apr 2020

Cyranoski D (2019) Russian ‘CRISPR-baby’ scientist has started editing genes in human eggs with goal of altering deaf gene. Nature 574(7779):465–466.


Dar-Nimrod I, Heine SJ (2011) Genetic essentialism: on the deceptive determinism of DNA. Psychol Bull 137(5):800–818.


Evangelou E, Warren HR, Mosen-Ansorena D, Mifsud B, Pazoki R, Gao H, Ntritsos G, Dimou N, Cabrera CP, Karaman I (2018) Genetic analysis of over 1 million people identifies 535 new loci associated with blood pressure traits. Nat Genet 50(10):1412–1425.


Fausto-Sterling A (2007) Frameworks of desire. Daedalus 136(2):47–57.

Fletcher JM (2012) Why have tobacco control policies stalled? Using genetic moderation to examine policy impacts. PLoS ONE 7(12):e50576.


Foucault M (1998) The history of sexuality 1: the will to knowledge. Penguin Books, London

Foucault M (2003) The birth of the clinic. Routledge, London/New York

Foucault M (2005) The order of things. Routledge, London/New York

Fox Keller E (2014) From gene action to reactive genomes. J Physiol 592(11):2423–2429.


Fox Keller E (2015) The postgenomic genome. In: Richardson SS, Stevens H (eds) Postgenomics: perspectives on biology after the genome. Duke University Press, Durham/London, pp. 9–31

Funk M (2019) The genetics of sexual orientation. Science 365(6456):878–880.


Ganna A, Verweij KJ, Nivard MG, Maier R, Wedow R, Busch AS, Abdellaoui A, Guo S, Sathirapongsasuti JF, Lichtenstein P (2019a) Genome studies must account for history—response. Science 366(6472):1461–1462.

Ganna A, Verweij KJ, Nivard MG, Maier R, Wedow R, Busch AS, Abdellaoui A, Guo S, Sathirapongsasuti JF, Lichtenstein P (2019b) Large-scale GWAS reveals insights into the genetic architecture of same-sex sexual behavior. Science 365(6456):eaat7693.

Garfinkel H (1967) Studies in ethnomethodology. Polity Press, Cambridge

Gray J (2009) Jim Gray on eScience: a transformed scientific method. In: Hey T, Tansley S, Tolle KM (eds) The fourth paradigm: data-intensive scientific discovery. Microsoft Research, Redmond, pp. xvii–xxxi

Griffiths DA (2016) Queer genes: realism, sexuality and science. J Crit Realism 15(5):511–529.

Hamer DH, Hu S, Magnuson VL, Hu N, Pattatucci AM (1993) A linkage between DNA markers on the X chromosome and male sexual orientation. Science 261(5119):321–327.

Haraway D (1988) Situated knowledges: the science question in feminism and the privilege of partial perspective. Fem Stud 14(3):575–599

Holm S, Ploug T (2019) Genome studies reveal flaws in broad consent. Science 366(6472):1460–1461.

Howard HC, van El CG, Forzano F, Radojkovic D, Rial-Sebbag E, de Wert G, Borry P, Cornel MC (2018) One small edit for humans, one giant edit for humankind? Points and questions to consider for a responsible way forward for gene editing in humans. Eur J Hum Genet 26(1):1.


Jansen PR, Watanabe K, Stringer S, Skene N, Bryois J, Hammerschlag AR, de Leeuw CA, Benjamins JS, Muñoz-Manchado AB, Nagel M, Savage JE, Tiemeier H, White T, Agee M, Alipanahi B, Auton A, Bell RK, Bryc K, Elson SL, Fontanillas P, Furlotte NA, Hinds DA, Huber KE, Kleinman A, Litterman NK, McCreight JC, McIntyre MH, Mountain JL, Noblin ES, Northover CAM, Pitts SJ, Sathirapongsasuti JF, Sazonova OV, Shelton JF, Shringarpure S, Tian C, Wilson CH, Tung JY, Hinds DA, Vacic V, Wang X, Sullivan PF, van der Sluis S, Polderman TJC, Smit AB, Hjerling-Leffler J, Van Someren EJW, Posthuma D, The 23andMe Research Team (2019) Genome-wide analysis of insomnia in 1,331,010 individuals identifies new risk loci and functional pathways. Nat Genet 51(3):394–403.

Jasanoff S (2004) The idiom of co-production. In: Jasanoff S (ed.) States of knowledge: the co-production of science and social order. Routledge, London, pp. 1–12

Jasanoff S, Hurlbut JB (2018) A global observatory for gene editing. Nature 555:435–437.

Kessler SJ, McKenna W (1978) Gender: an ethnomethodological approach. John Wiley & Sons, New York

Kitchin R (2014a) Big Data, new epistemologies and paradigm shifts. Big Data Soc.

Kitchin R (2014b) The data revolution. Big data, open data, data infrastructures and their consequences. Sage, London

Landecker H (2016) The social as signal in the body of chromatin. Sociol Rev 64(1_suppl):79–99.

Landecker H, Panofsky A (2013) From social structure to gene regulation, and back: a critical introduction to environmental epigenetics for sociology. Annu Rev Sociol 39:333–357.

Lee JJ, Wedow R, Okbay A, Kong E, Maghzian O, Zacher M, Nguyen-Viet TA, Bowers P, Sidorenko J, Linnér RK (2018) Gene discovery and polygenic prediction from a 1.1-million-person GWAS of educational attainment. Nat Genet 50(8):1112.

Lehrer SF, Ding W (2019) Can social scientists use molecular genetic data to explain individual differences and inform public policy? In: Foster G (ed.) Biophysical measurement in experimental social science research. Academic Press, London/San Diego/Cambridge/Oxford, pp. 225–265

Leonelli S (2014) What difference does quantity make? On the epistemology of Big Data in biology. Big Data Soc.

Lorber J (1994) Paradoxes of gender. Yale University Press, New Haven

Martin P, Morrison M, Turkmendag I, Nerlich B, McMahon A, de Saille S, Bartlett A (2020) Genome editing: the dynamics of continuity, convergence, and change in the engineering of life. New Genet Soc 39(2):219–242.

Maxmen A (2019) Controversial ‘gay gene’ app provokes fears of a genetic Wild West. Nature 574(7780):609–610.

Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt, Boston/New York

Meloni M (2014) Biology without biologism: social theory in a postgenomic age. Sociology 48(4):731–746.

Meloni M (2016) Political biology: Science and social values in human heredity from eugenics to epigenetics. Palgrave Macmillan, n.p.p

Mitchell RW, Dezarn L (2014) Does knowing why someone is gay influence tolerance? Genetic, environmental, choice, and “reparative” explanations. Sex Cult 18(4):994–1009.

Morrison M, de Saille S (2019) CRISPR in context: towards a socially responsible debate on embryo editing. Palgrave Commun 5(1):1–9.

Nelkin D, Lindee MS (2004) The DNA mystique: the gene as a cultural icon. University of Michigan Press, Ann Arbor

Novas C, Rose N (2000) Genetic risk and the birth of the somatic individual. Econ Soc 29(4):485–513.

O’Riordan K (2012) The life of the gay gene: from hypothetical genetic marker to social reality. J Sex Res 49(4):362–368.


Okbay A, Beauchamp JP, Fontana MA, Lee JJ, Pers TH, Rietveld CA, Turley P, Chen G-B, Emilsson V, Meddens SFW (2016) Genome-wide association study identifies 74 loci associated with educational attainment. Nature 533(7604):539–542.

Parry B, Greenhough B (2017) Bioinformation. Polity Press, Cambridge

Parsons T (1951) The social system. Free Press, New York

Prainsack B (2015) Is personalized medicine different? (Reinscription: the sequel) A response to Troy Duster. Br J Sociol 66(1):28–35.

Reardon J (2017) The postgenomic condition: ethics, justice, and knowledge after the genome. University of Chicago Press, Chicago/London

Rehmann-Sutter C, Mahr D (2016) The lived genome. In: Whitehead A, Woods A (eds) Edinburgh companion to the critical medical humanities. Edinburgh University Press, Edinburgh, pp. 87–103

Richardson SS, Borsa A, Boulicault M, Galka J, Ghosh N, Gompers A, Noll NE, Perret M, Reiches MW, Sandoval JCB (2019) Genome studies must account for history. Science 366(6472):1461.

Ruckenstein M, Schüll ND (2017) The datafication of health. Annu Rev Anthropol 46:261–278.

Saukko P (2017) Shifting metaphors in direct-to-consumer genetic testing: from genes as information to genes as big data. New Genet Soc 36(3):296–313.

Savage M, Burrows R (2007) The coming crisis of empirical sociology. Sociology 41(5):885–899.

Shostak S, Conrad P, Horwitz AV (2008) Sequencing and its consequences: path dependence and the relationships between genetics and medicalization. Am J Sociol 114(S1):S287–S316.

Thomas WJ, Thomas DS (1929) The child in America. Behavior problems and programs. Knopf, New York

West C, Zimmerman DH (1991) Doing gender. In: Lorber J, Farrell SA (eds) The social construction of gender. Sage, Newbury Park/London, pp. 13–37



Open access funding provided by University of Vienna. The authors thank Brígida Riso for contributing to a previous version of this article.

Author information

These authors contributed equally: Melanie Goisauf, Kaya Akyüz, Gillian M. Martin.

Authors and Affiliations

Department of Science and Technology Studies, University of Vienna, Vienna, Austria

Melanie Goisauf & Kaya Akyüz

BBMRI-ERIC, Graz, Austria

Department of Sociology, University of Malta, Msida, Malta

Gillian M. Martin


Corresponding authors

Correspondence to Melanie Goisauf or Kaya Akyüz.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


About this article

Cite this article.

Goisauf, M., Akyüz, K. & Martin, G.M. Moving back to the future of big data-driven research: reflecting on the social in genomics. Humanit Soc Sci Commun 7, 55 (2020).


Received: 15 November 2019

Accepted: 09 July 2020

Published: 04 August 2020





Scientific Research and Big Data

Big Data promises to revolutionise the production of knowledge within and beyond science, by enabling novel, highly efficient ways to plan, conduct, disseminate and assess research. The last few decades have witnessed the creation of novel ways to produce, store, and analyse data, culminating in the emergence of the field of data science, which brings together computational, algorithmic, statistical and mathematical techniques towards extrapolating knowledge from big data. At the same time, the Open Data movement (emerging from policy trends such as the push for Open Government and Open Science) has encouraged the sharing and interlinking of heterogeneous research data via large digital infrastructures. The availability of vast amounts of data in machine-readable formats provides an incentive to create efficient procedures to collect, organise, visualise and model these data. These infrastructures, in turn, serve as platforms for the development of artificial intelligence, with an eye to increasing the reliability, speed and transparency of processes of knowledge creation. Researchers across all disciplines see the newfound ability to link and cross-reference data from diverse sources as improving the accuracy and predictive power of scientific findings and helping to identify future directions of inquiry, thus ultimately providing a novel starting point for empirical investigation. As exemplified by the rise of dedicated funding, training programmes and publication venues, big data are widely viewed as ushering in a new way of performing research and challenging existing understandings of what counts as scientific knowledge.

This entry explores these claims in relation to the use of big data within scientific research, and with an emphasis on the philosophical issues emerging from such use. To this aim, the entry discusses how the emergence of big data—and related technologies, institutions and norms—informs the analysis of the following themes:

  • how statistics, formal and computational models help to extrapolate patterns from data, and with which consequences;
  • the role of critical scrutiny (human intelligence) in machine learning, and its relation to the intelligibility of research processes;
  • the nature of data as research components;
  • the relation between data and evidence, and the role of data as source of empirical insight;
  • the view of knowledge as theory-centric;
  • understandings of the relation between prediction and causality;
  • the separation of fact and value; and
  • the risks and ethics of data science.

These are areas where attention to research practices revolving around big data can benefit philosophy, and particularly work in the epistemology and methodology of science. This entry does not cover the vast scholarship in the history and social studies of science that has emerged in recent years on this topic, though references to some of that literature are included where conceptually relevant. Complementing historical and social scientific work in data studies, the philosophical analysis of data practices can also elicit significant challenges to the hype surrounding data science and foster a critical understanding of the role of data-fuelled artificial intelligence in research.

  • 1. What Are Big Data?
  • 2. Extrapolating Data Patterns: The Role of Statistics and Software
  • 3. Human and Artificial Intelligence
  • 4. The Nature of (Big) Data
  • 5. Big Data and Evidence
  • 6. Big Data, Knowledge and Inquiry
  • 7. Big Data between Causation and Prediction
  • 8. The Fact/Value Distinction
  • 9. Big Data Risks and the Ethics of Data Science
  • 10. Conclusion: Big Data and Good Science
  • Other Internet Resources
  • Related Entries

1. What Are Big Data?

We are witnessing a progressive “datafication” of social life. Human activities and interactions with the environment are being monitored and recorded with increasing effectiveness, generating an enormous digital footprint. The resulting “big data” are a treasure trove for research, with ever more sophisticated computational tools being developed to extract knowledge from such data. One example is the use of different types of data acquired from cancer patients, including genomic sequences, physiological measurements and individual responses to treatment, to improve diagnosis and treatment. Another example is the integration of data on traffic flow, environmental and geographical conditions, and human behaviour to produce safety measures for driverless vehicles, so that when confronted with unforeseen events (such as a child suddenly darting into the street on a very cold day), the data can be promptly analysed to identify and generate an appropriate response (the car swerving enough to avoid the child while also minimising the risk of skidding on ice and damaging other vehicles). Yet another instance is the understanding of the nutritional status and needs of a particular population that can be extracted from combining data on food consumption generated by commercial services (e.g., supermarkets, social media and restaurants) with data coming from public health and social services, such as blood test results and hospital intakes linked to malnutrition. In each of these cases, the availability of data and related analytic tools is creating novel opportunities for research and for the development of new forms of inquiry, which are widely perceived as having a transformative effect on science as a whole.

A useful starting point in reflecting on the significance of such cases for a philosophical understanding of research is to consider what the term “big data” actually refers to within contemporary scientific discourse. There are multiple ways to define big data (Kitchin 2014, Kitchin & McArdle 2016). Perhaps the most straightforward characterisation is as large datasets that are produced in a digital form and can be analysed through computational tools. Hence the two features most commonly associated with Big Data are volume and velocity. Volume refers to the size of the files used to archive and spread data. Velocity refers to the pressing speed with which data is generated and processed. The body of digital data created by research is growing at breakneck pace and in ways that are arguably impossible for the human cognitive system to grasp and thus require some form of automated analysis.

Volume and velocity are also, however, the most disputed features of big data. What may be perceived as “large volume” or “high velocity” depends on rapidly evolving technologies to generate, store, disseminate and visualise the data. This is exemplified by the high-throughput production, storage and dissemination of genomic sequencing and gene expression data, where both data volume and velocity have dramatically increased within the last two decades. Similarly, current understandings of big data as “anything that cannot be easily captured in an Excel spreadsheet” are bound to shift rapidly as new analytic software becomes established, and the very idea of using spreadsheets to capture data becomes a thing of the past. Moreover, data size and speed do not take account of the diversity of data types used by researchers, which may include data that are not generated in digital formats or whose format is not computationally tractable, and which underscores the importance of data provenance (that is, the conditions under which data were generated and disseminated) to processes of inference and interpretation. And as discussed below, the emphasis on physical features of data obscures the continuing dependence of data interpretation on circumstances of data use, including specific queries, values, skills and research situations.

An alternative is to define big data not by reference to their physical attributes, but rather by virtue of what can and cannot be done with them. In this view, big data is a heterogeneous ensemble of data collected from a variety of different sources, typically (but not always) in digital formats suitable for algorithmic processing, in order to generate new knowledge. For example, boyd and Crawford (2012: 663) identify big data with “the capacity to search, aggregate and cross-reference large datasets”, while O’Malley and Soyer (2012) focus on the ability to interrogate and interrelate diverse types of data, with the aim of consulting them as a single body of evidence. The examples of transformative “big data research” given above are all easily fitted into this view: it is not the mere fact that lots of data are available that makes a difference in those cases, but rather the fact that lots of data can be mobilised from a wide variety of sources (medical records, environmental surveys, weather measurements, consumer behaviour). This account makes sense of other characteristic “v-words” that have been associated with big data, including:

  • Variety in the formats and purposes of data, which may include objects as different as samples of animal tissue, free-text observations, humidity measurements, GPS coordinates, and the results of blood tests;
  • Veracity, understood as the extent to which the quality and reliability of big data can be guaranteed. Data with high volume, velocity and variety are at significant risk of containing inaccuracies, errors and unaccounted-for bias. In the absence of appropriate validation and quality checks, this could result in a misleading or outright incorrect evidence base for knowledge claims (Floridi & Illari 2014; Cai & Zhu 2015; Leonelli 2017);
  • Validity, which indicates the selection of appropriate data with respect to the intended use. The choice of a specific dataset as evidence base requires adequate and explicit justification, including recourse to relevant background knowledge to ground the identification of what counts as data in that context (e.g., Loettgers 2009, Bogen 2010);
  • Volatility, i.e., the extent to which data can be relied upon to remain available, accessible and re-interpretable despite changes in archival technologies. This is significant given the tendency of formats and tools used to generate and analyse data to become obsolete, and the efforts required to update data infrastructures so as to guarantee data access in the long term (Bowker 2006; Edwards 2010; Lagoze 2014; Borgman 2015);
  • Value, i.e., the multifaceted forms of significance attributed to big data by different sections of society, which depend as much on the intended use of the data as on historical, social and geographical circumstances (Leonelli 2016, D’Ignazio and Klein 2020). Alongside scientific value, researchers may impute financial, ethical, reputational and even affective value to data. The institutions involved in governing and funding research also have ways of valuing data, which may not always overlap with the priorities of researchers (Tempini 2017).

This list of features, though not exhaustive, highlights how big data is not simply “a lot of data”. The epistemic power of big data lies in their capacity to bridge between different research communities, methodological approaches and theoretical frameworks that are difficult to link due to conceptual fragmentation, social barriers and technical difficulties (Leonelli 2019a). And indeed, appeals to big data often emerge from situations of inquiry that are at once technically, conceptually and socially challenging, and where existing methods and resources have proved insufficient or inadequate (Sterner & Franz 2017; Sterner, Franz, & Witteveen 2020).

This understanding of big data is rooted in a long history of researchers grappling with large and complex datasets, as exemplified by fields like astronomy, meteorology, taxonomy and demography (see the collections assembled by Daston 2017; Aronova et al. 2017; Porter & Chadarevian 2018; as well as Aronova et al. 2010, Sepkoski 2013, Stevens 2016, Strasser 2019 among others). Similarly, biomedical research, and particularly subfields such as epidemiology, pharmacology and public health, has an extensive tradition of tackling data of high volume, velocity, variety and volatility, whose validity, veracity and value are regularly negotiated and contested by patients, governments, funders, pharmaceutical companies, insurers and public institutions (Bauer 2008). Throughout the twentieth century, these efforts spurred the development of techniques, institutions and instruments to collect, order, visualise and analyse data, such as: standard classification systems and formats; guidelines, tools and legislation for the management and security of sensitive data; and infrastructures to integrate and sustain data collections over long periods of time (Daston 2017).

This work culminated in the application of computational technologies, modelling tools and statistical methods to big data (Porter 1995; Humphreys 2004; Edwards 2010), increasingly pushing the boundaries of data analytics thanks to supervised learning, model fitting, deep neural networks, search and optimisation methods, complex data visualisations and various other tools now associated with artificial intelligence. Many of these tools are based on algorithms whose functioning and results are tested against specific data samples (a process called “training”). These algorithms are programmed to “learn” from each interaction with novel data: in other words, they have the capacity to change themselves in response to new information being inputted into the system, thus becoming more attuned to the phenomena they are analysing and improving their ability to predict future behaviour. The scope and extent of such changes is shaped by the assumptions used to build the algorithms and the capability of related software and hardware to identify, access and process information of relevance to the learning in question. There is however a degree of unpredictability and opacity to these systems, which can evolve to the point of defying human understanding (more on this below).
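
The idea of an algorithm that changes itself with each new observation can be made concrete with a minimal sketch. It is an illustration only, not a method drawn from the works cited above: it assumes a one-parameter linear model adjusted by a simple error-correction rule, and the names `online_update`, `learning_rate` and `stream` are hypothetical.

```python
def online_update(weight, x, y, learning_rate=0.1):
    """Adjust the model's single weight after seeing one observation (x, y)."""
    prediction = weight * x
    error = y - prediction
    # Move the weight in the direction that reduces the error on this observation.
    return weight + learning_rate * error * x


# Hypothetical data stream in which y is always twice x, replayed 50 times.
stream = [(k / 10, 2 * k / 10) for k in range(1, 11)] * 50

weight = 0.0      # the model starts with no knowledge of the pattern
history = []      # weight after each observation, to show gradual change
for x, y in stream:
    weight = online_update(weight, x, y)
    history.append(weight)

print(f"first weight: {history[0]:.4f}, final weight: {weight:.6f}")
```

Each observation nudges the weight towards the value that generated the data (here 2), which is one simple sense in which a system can become more attuned to a phenomenon. The form of the model and the learning rate are fixed in advance by the designer, so the sketch also shows how the scope of such change depends on the assumptions built into the algorithm.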

New institutions, communication platforms and regulatory frameworks also emerged to assemble, prepare and maintain data for such uses (Kitchin 2014), such as various forms of digital data infrastructures, organisations aiming to coordinate and improve the global data landscape (e.g., the Research Data Alliance), and novel measures for data protection, like the General Data Protection Regulation adopted by the European Union in 2016 and applicable since 2018. Together, these techniques and institutions afford the opportunity to assemble and interpret data at a much broader scale, while also promising to deliver finer levels of granularity in data analysis. [1] They increase the scope of any investigation by making it possible for researchers to link their own findings to those of countless others across the world, both within and beyond the academic sphere. By enhancing the mobility of data, they facilitate their repurposing for a variety of goals that may have been unforeseeable when the data were originally generated. And by transforming the role of data within research, they heighten their status as valuable research outputs in and of themselves. These technological and methodological developments have significant implications for philosophical conceptualisations of data, inferential processes and scientific knowledge, as well as for how research is conducted, organised, governed and assessed. It is to these philosophical concerns that I now turn.

2. Extrapolating Data Patterns: The Role of Statistics and Software

Big data are often associated with the idea of data-driven research, where learning happens through the accumulation of data and the application of methods to extract meaningful patterns from those data. Within data-driven inquiry, researchers are expected to use data as their starting point for inductive inference, without relying on theoretical preconceptions. Advocates describe this situation as “the end of theory”, in contrast to theory-driven approaches where research consists of testing a hypothesis (Anderson 2008, Hey et al. 2009). In principle at least, big data constitute the largest pool of data ever assembled and thus a strong starting point to search for correlations (Mayer-Schönberger & Cukier 2013). Crucial to the credibility of the data-driven approach is the efficacy of the methods used to extrapolate patterns from data and evaluate whether or not such patterns are meaningful, and what “meaning” may involve in the first place. Hence, some philosophers and data scholars have argued that

the most important and distinctive characteristic of Big Data [is] its use of statistical methods and computational means of analysis, (Symons & Alvarado 2016: 4)

such as for instance machine learning tools, deep neural networks and other “intelligent” practices of data handling.

The emphasis on statistics as key adjudicator of validity and reliability of patterns extracted from data is not novel. Exponents of logical empiricism looked for logically watertight methods to secure and justify inference from data, and their efforts to develop a theory of probability proceeded in parallel with the entrenchment of statistical reasoning in the sciences in the first half of the twentieth century (Romeijn 2017). In the early 1960s, Patrick Suppes offered a seminal link between statistical methods and the philosophy of science through his work on the production and interpretation of data models. As a philosopher deeply embedded in experimental practice, Suppes was interested in the means and motivations of key statistical procedures for data analysis such as data reduction and curve fitting. He argued that once data are adequately prepared for statistical modelling, all the concerns and choices that motivated data processing become irrelevant to their analysis and interpretation. This inspired him to differentiate between models of theory, models of experiment and models of data, noting that such different components of inquiry are governed by different logics and cannot be compared in a straightforward way. For instance,

the precise definition of models of the data for any given experiment requires that there be a theory of the data in the sense of the experimental procedure, as well as in the ordinary sense of the empirical theory of the phenomena being studied. (Suppes 1962: 253)

Suppes viewed data models as necessarily statistical: that is, as objects

designed to incorporate all the information about the experiment which can be used in statistical tests of the adequacy of the theory. (Suppes 1962: 258)

His formal definition of data models reflects this decision, with statistical requirements such as homogeneity, stationarity and order identified as the ultimate criteria to identify a data model Z and evaluate its adequacy:

Z is an N-fold model of the data for experiment \(\mathcal{Y}\) if and only if there is a set Y and a probability measure P on subsets of Y such that \(\mathcal{Y} = \langle Y, P\rangle\) is a model of the theory of the experiment, Z is an N-tuple of elements of Y, and Z satisfies the statistical tests of homogeneity, stationarity and order. (1962: 259)

This analysis of data models portrayed statistical methods as key conduits between data and theory, and hence as crucial components of inferential reasoning.

The focus on statistics as entry point to discussions of inference from data was widely promoted in subsequent philosophical work. Prominent examples include Deborah Mayo, who in her book Error and the Growth of Experimental Knowledge asked:

What should be included in data models? The overriding constraint is the need for data models that permit the statistical assessment of fit (between prediction and actual data); (Mayo 1996: 136)

and Bas van Fraassen, who also embraced the idea of data models as “summarizing relative frequencies found in data” (Van Fraassen 2008: 167). Closely related is the emphasis on statistics as means to detect error within datasets in relation to specific hypotheses, most prominently endorsed by the error-statistical approach to inference championed by Mayo and Aris Spanos (Mayo & Spanos 2009a). This approach aligns with the emphasis on computational methods for data analysis within big data research, and supports the idea that the better the inferential tools and methods, the better the chance to extract reliable knowledge from data.

When it comes to addressing methodological challenges arising from the computational analysis of big data, however, statistical expertise needs to be complemented by computational savvy in the training and application of algorithms associated with artificial intelligence, including machine learning but also other mathematical procedures for operating upon data (Bringsjord & Govindarajulu 2018). Consider for instance the problem of overfitting, i.e., the mistaken identification of patterns in a dataset, which can be greatly amplified by the training techniques employed by machine learning algorithms. There is no guarantee that an algorithm trained to successfully extrapolate patterns from a given dataset will be as successful when applied to other data. Common approaches to this problem involve the re-ordering and partitioning of both data and training methods, so that it is possible to compare the application of the same algorithms to different subsets of the data (“cross-validation”), combine predictions arising from differently trained algorithms (“ensembling”) or use hyperparameters (parameters whose value is set prior to data training) to prepare the data for analysis.
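
A small sketch may help to show how cross-validation exposes overfitting. It is an illustration under stated assumptions rather than a procedure taken from the sources cited here: the labels are deliberately random, so there is no genuine pattern to learn, and the model is a hypothetical one-nearest-neighbour classifier that simply memorises its training data. All function and variable names are invented for the example.

```python
import random


def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nearest_neighbour_predict(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(train, test):
    """Share of test points whose label the memorising model gets right."""
    hits = sum(nearest_neighbour_predict(train, x) == y for x, y in test)
    return hits / len(test)


random.seed(0)
# Labels are coin flips, so any "pattern" the model finds is spurious.
data = [(random.random(), random.randint(0, 1)) for _ in range(200)]

# Scored on the data it memorised, the model looks perfect.
train_accuracy = accuracy(data, data)

# Five-fold cross-validation: each fold is held out once and predicted
# by a model that only saw the other four folds.
scores = []
for fold in k_fold_indices(len(data), 5):
    held_out = set(fold)
    test = [data[i] for i in fold]
    train = [data[i] for i in range(len(data)) if i not in held_out]
    scores.append(accuracy(train, test))
cv_accuracy = sum(scores) / len(scores)

print(f"accuracy on training data: {train_accuracy:.2f}")
print(f"cross-validated accuracy:  {cv_accuracy:.2f}")
```

The gap between the two figures is the signal of interest: performance on the data used for training says little about performance on other data, and partitioning the data is one way to make that visible.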

Handling these issues, in turn, requires

familiarity with the mathematical operations in question, their implementations in code, and the hardware architectures underlying such implementations. (Lowrie 2017: 3)

For instance, machine learning

aims to build programs that develop their own analytic or descriptive approaches to a body of data, rather than employing ready-made solutions such as rule-based deduction or the regressions of more traditional statistics. (Lowrie 2017: 4)

In other words, statistics and mathematics need to be complemented by expertise in programming and computer engineering. The ensemble of skills thus construed results in a specific epistemological approach to research, which is broadly characterised by an emphasis on the means of inquiry as the most significant driver of research goals and outputs. This approach, which Sabina Leonelli characterised as data-centric, involves “focusing more on the processes through which research is carried out than on its ultimate outcomes” (Leonelli 2016: 170). In this view, procedures, techniques, methods, software and hardware are the prime motors of inquiry and the chief influence on its outcomes. Focusing more specifically on computational systems, John Symons and Jack Horner argued that much of big data research consists of software-intensive science rather than data-driven research: that is, science that depends on software for its design, development, deployment and use, and thus encompasses procedures, types of reasoning and errors that are unique to software, such as the problems generated by attempts to map real-world quantities to discrete-state machines, or by approximating numerical operations (Symons & Horner 2014: 473). Software-intensive science is arguably supported by an algorithmic rationality focused on the feasibility, practicality and efficiency of algorithms, which is typically assessed by reference to concrete situations of inquiry (Lowrie 2017).
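
The kind of error that arises from mapping real-valued quantities onto a discrete-state machine can be seen in a few lines. The sketch below is an illustration chosen for this purpose, not an example given by Symons and Horner: it uses standard binary floating-point arithmetic, in which decimal fractions such as 0.1 cannot be represented exactly.

```python
import math

# 0.1 and 0.2 are stored as the nearest binary fractions, so their sum
# is not exactly the stored value of 0.3.
total = 0.1 + 0.2
exact_match = (total == 0.3)
close_enough = math.isclose(total, 0.3)

# Rounding error accumulates when the same value is added repeatedly.
naive_sum = 0.0
for _ in range(10):
    naive_sum += 0.1

# math.fsum tracks the lost low-order bits and returns a correctly rounded sum.
careful_sum = math.fsum([0.1] * 10)

print(exact_match, close_enough)
print(naive_sum, careful_sum)
```

Whether such discrepancies matter depends on the analysis at hand, but they are a property of the software and hardware rather than of the data, which is why they count as errors specific to software-intensive science.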

Algorithms are enormously varied in their mathematical structures and underpinning conceptual commitments, and more philosophical work needs to be carried out on the specifics of computational tools and software used in data science and related applications—with emerging work in philosophy of computer science providing an excellent way forward (Turner & Angius 2019). Nevertheless, it is clear that whether or not a given algorithm successfully applies to the data at hand depends on factors that cannot be controlled through statistical or even computational methods: for instance, the size, structure and format of the data, the nature of the classifiers used to partition the data, the complexity of decision boundaries and the very goals of the investigation.

In a forceful critique informed by the philosophy of mathematics, Christian Calude and Giuseppe Longo argued that there is a fundamental problem with the assumption that more data will necessarily yield more information:

very large databases have to contain arbitrary correlations. These correlations appear only due to the size, not the nature, of data. (Calude & Longo 2017: 595)

They conclude that big data analysis is by definition unable to distinguish spurious from meaningful correlations and is therefore a threat to scientific research. A related worry, sometimes dubbed “the curse of dimensionality” by data scientists, concerns the extent to which the analysis of a given dataset can be scaled up in complexity and in the number of variables being considered. It is well known that the more dimensions one considers in classifying samples, for example, the larger the dataset needed to generalise accurately across those dimensions. This demonstrates the continuing, tight dependence between the volume and quality of data on the one hand, and the type and breadth of research questions for which data need to serve as evidence on the other hand.
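
The intuition that correlations can appear through size alone can be simulated, with the caveat that the sketch below does not reproduce Calude and Longo's formal argument. It assumes purely random, unrelated variables with 20 observations each, and reports the strongest pairwise correlation found as more variables are included. The names `standardise`, `strongest_correlation` and `strongest_by_width` are hypothetical.

```python
import itertools
import math
import random


def standardise(column):
    """Centre a column and scale it to unit length, so that the dot product
    of two standardised columns equals their Pearson correlation."""
    mean = sum(column) / len(column)
    centred = [value - mean for value in column]
    norm = math.sqrt(sum(value * value for value in centred))
    return [value / norm for value in centred]


def strongest_correlation(columns):
    """Largest absolute Pearson correlation over all pairs of columns."""
    return max(
        abs(sum(a * b for a, b in zip(left, right)))
        for left, right in itertools.combinations(columns, 2)
    )


random.seed(1)
N_OBSERVATIONS = 20
# 200 variables of pure noise: no variable is related to any other.
noise = [
    standardise([random.gauss(0, 1) for _ in range(N_OBSERVATIONS)])
    for _ in range(200)
]

# Search for the strongest correlation among the first 5, 50 and 200 variables.
strongest_by_width = {k: strongest_correlation(noise[:k]) for k in (5, 50, 200)}

for width, value in strongest_by_width.items():
    print(f"{width:>3} variables: strongest |r| = {value:.2f}")
```

Because each wider search includes all the pairs of the narrower one, the strongest correlation can only stay the same or grow as variables are added, even though nothing in the data is meaningfully related. With a fixed number of observations, adding variables therefore increases the chance of mistaking noise for a pattern.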

Determining the fit between inferential methods and data requires high levels of expertise and contextual judgement (a situation known within machine learning as the “no free lunch theorem”). Indeed, overreliance on software for inference and data modelling can yield highly problematic results. Symons and Horner note that the use of complex software in big data analysis makes margins of error unknowable, because there is no clear way to test them statistically (Symons & Horner 2014: 473). The path complexity of programs with high conditionality imposes limits on standard error correction techniques. As a consequence, there is no effective method for characterising the error distribution in the software except by testing all paths in the code, which is unrealistic and intractable in the vast majority of cases due to the complexity of the code.
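
A rough calculation indicates why testing all paths is intractable. The sketch assumes a simplified program made of independent two-way conditionals, so that the number of paths doubles with each branch, and a hypothetical testing rate of one billion paths per second. Both assumptions are illustrative and are not figures taken from Symons and Horner.

```python
import itertools


def enumerate_paths(n_branches):
    """Every combination of outcomes for n independent two-way conditionals."""
    return itertools.product((False, True), repeat=n_branches)


def years_to_test(n_branches, paths_per_second=1_000_000_000):
    """Time needed to run each of the 2**n paths once, in years."""
    seconds = 2 ** n_branches / paths_per_second
    return seconds / (365 * 24 * 3600)


# Ten branches can still be enumerated exhaustively.
small_program_paths = sum(1 for _ in enumerate_paths(10))

print(f"10 branches: {small_program_paths} paths")
print(f"64 branches: about {years_to_test(64):.0f} years at the assumed rate")
```

Real programs contain loops and dependent conditions, so the count is not simply a power of two, but the exponential growth shown here is enough to explain why exhaustive testing is unrealistic for software of any substantial size.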

3. Human and Artificial Intelligence

Rather than acting as a substitute, the effective and responsible use of artificial intelligence tools in big data analysis requires the strategic exercise of human intelligence. For this to happen, however, AI systems applied to big data need to be accessible to scrutiny and modification. Whether or not this is the case, and who is best qualified to exercise such scrutiny, is under dispute. Thomas Nickles argued that the increasingly complex and distributed algorithms used for data analysis follow in the footsteps of long-standing scientific attempts to transcend the limits of human cognition. The resulting epistemic systems may no longer be intelligible to humans: an “alien intelligence” within which “human abilities are no longer the ultimate criteria of epistemic success” (Nickles forthcoming). Such unbound cognition holds the promise of enabling powerful inferential reasoning from previously unimaginable volumes of data. The difficulties in contextualising and scrutinising such reasoning, however, cast doubt on the reliability of the results. It is not only machine learning algorithms that are becoming increasingly inaccessible to evaluation: beyond the complexities of programming code, computational data analysis requires a whole ecosystem of classifications, models, networks and inference tools which typically have different histories and purposes, and whose relation to each other, and effects when they are used together, are far from understood and may well be untraceable.

This raises the question of whether the knowledge produced by such data analytic systems is at all intelligible to humans, and if so, what forms of intelligibility it yields. It is certainly the case that deriving knowledge from big data may not involve an increase in human understanding, especially if understanding is construed as an epistemic skill (de Regt 2017). This may not be a problem to those who await the rise of a new species of intelligent machines, which may master new cognitive tools in a way that humans cannot. But as Nickles, Nicholas Rescher (1984), Werner Callebaut (2012) and others pointed out, even in that case “we would not have arrived at perspective-free science” (Nickles forthcoming). While the human histories and assumptions interwoven into these systems may be hard to disentangle, they still affect their outcomes; and whether or not these processes of inquiry are open to critical scrutiny, their telos, implications and significance for life on the planet arguably should be. As argued by Dan McQuillan (2018), the increasing automation of big data analytics may foster acceptance of a Neoplatonist machinic metaphysics, within which mathematical structures “uncovered” by AI would trump any appeal to human experience. Luciano Floridi echoes this intuition in his analysis of what he calls the infosphere:

The great opportunities offered by Information and Communication Technologies come with a huge intellectual responsibility to understand them and take advantage of them in the right way. (2014: vii)

These considerations parallel Paul Humphreys’s long-standing critique of computer simulations as epistemically opaque (Humphreys 2004, 2009)—and particularly his definition of what he calls essential epistemic opacity:

A process is essentially epistemically opaque to X if and only if it is impossible, given the nature of X, for X to know all of the epistemically relevant elements of the process. (Humphreys 2009: 618)

Different facets of the general problem of epistemic opacity are stressed within the vast philosophical scholarship on the role of modelling, computing and simulations in the sciences: the implications of lacking experimental access to the concrete parts of the world being modelled, for instance (Morgan 2005; Parker 2009; Radder 2009); the difficulties in testing the reliability of computational methods used within simulations (Winsberg 2010; Morrison 2015); the relation between opacity and justification (Durán & Formanek 2018); the forms of black-boxing associated with mechanistic reasoning implemented in computational analysis (Craver and Darden 2013; Bechtel 2016); and the debate over the intrinsic limits of computational approaches and related expertise (Collins 1990; Dreyfus 1992). Roman Frigg and Julian Reiss argued that such issues do not constitute fundamental challenges to the nature of inquiry and modelling, and in fact exist in a continuum with traditional methodological issues well-known within the sciences (Frigg & Reiss 2009). Whether or not one agrees with this position (Humphreys 2009; Beisbart 2012), big data analysis is clearly pushing computational and statistical methods to their limit, thus highlighting the boundaries to what even technologically augmented human beings are capable of knowing and understanding.

4. The Nature of (Big) Data

Research on big data analysis thus sheds light on elements of the research process that cannot be fully controlled, rationalised or even considered through recourse to formal tools.

One such element is the work required to present empirical data in a machine-readable format that is compatible with the software and analytic tools at hand. Data need to be selected, cleaned and prepared to be subjected to statistical and computational analysis. The processes involved in separating data from noise, clustering data so that it is tractable, and integrating data of different formats turn out to be highly sophisticated and theoretically structured, as demonstrated for instance by James McAllister’s (1997, 2007, 2011) and Uljana Feest’s (2011) work on data patterns, Marcel Boumans’s and Leonelli’s comparison of clustering principles across fields (forthcoming), and James Griesemer’s (forthcoming) and Mary Morgan’s (forthcoming) analyses of the peculiarities of datasets. Suppes was so concerned by what he called the “bewildering complexity” of data production and processing activities that he worried that philosophers would not appreciate the ways in which statistics can and does help scientists to abstract data away from such complexity. He described the large group of research components and activities used to prepare data for modelling as “pragmatic aspects” encompassing “every intuitive consideration of experimental design that involved no formal statistics” (Suppes 1962: 258), and positioned them as the lowest step of his hierarchy of models, at the opposite end from its pinnacle, the models of theory. Despite recent efforts to rehabilitate the methodology of inductive-statistical modelling and inference (Mayo & Spanos 2009b), this approach has been shared by many philosophers who regard processes of data production and processing as so chaotic as to defy systematic analysis. This explains why data have received so little consideration in philosophy of science when compared to models and theory.

The question of how data are defined and identified, however, is crucial for understanding the role of big data in scientific research. Let us now consider two philosophical views—the representational view and the relational view—that are both compatible with the emergence of big data, and yet place emphasis on different aspects of that phenomenon, with significant implications for understanding the role of data within inferential reasoning and, as we shall see in the next section, as evidence. The representational view construes data as reliable representations of reality which are produced via the interaction between humans and the world. The interactions that generate data can take place in any social setting regardless of research purposes. Examples range from a biologist measuring the circumference of a cell in the lab and noting the result in an Excel file, to a teacher counting the number of students in her class and transcribing it in the class register. What counts as data in these interactions are the objects created in the process of description and/or measurement of the world. These objects can be digital (the Excel file) or physical (the class register) and form a footprint of a specific interaction with the natural world. This footprint—“trace” or “mark”, in the words of Ian Hacking (1992) and Hans-Jörg Rheinberger (2011), respectively—constitutes a crucial reference point for analytic study and for the extraction of new insights. This is the reason why data form a legitimate foundation for empirical knowledge: the production of data is equivalent to “capturing” features of the world that can be used for systematic study. According to the representational view, data are objects with fixed and unchangeable content, whose meaning, in virtue of being representations of reality, needs to be investigated and revealed step-by-step through adequate inferential methods. The data documenting cell shape can be modelled to test the relevance of shape to the elasticity, permeability and resilience of cells, producing an evidence base to understand cell-to-cell signalling and development. The data produced by counting students in class can be aggregated with similar data collected in other schools, producing an evidence base to evaluate the density of students in the area and their school attendance frequency.

This reflects the intuition that data, especially when they come in the form of numerical measurements or images such as photographs, somehow mirror the phenomena that they are created to document, thus providing a snapshot of those phenomena that is amenable to study under the controlled conditions of research. It also reflects the idea of data as “raw” products of research, which are as close as it gets to unmediated knowledge of reality. This makes sense of the truth-value sometimes assigned to data as irrefutable sources of evidence—the Popperian idea that if data are found to support a given claim, then that claim is corroborated as true at least as long as no other data are found to disprove it. Data in this view represent an objective foundation for the acquisition of knowledge and this very objectivity—the ability to derive knowledge from human experience while transcending it—is what makes knowledge empirical. This position is well-aligned with the idea that big data is valuable to science because it facilitates the (broadly understood) inductive accumulation of knowledge: gathering data collected via reliable methods produces a mountain of facts ready to be analysed and, the more facts are produced and connected with each other, the more knowledge can be extracted.

Philosophers have long acknowledged that data do not speak for themselves and different types of data require different tools for analysis and preparation to be interpreted (Bogen 2009 [2013]). According to the representational view, there are correct and incorrect ways of interpreting data, which those responsible for data analysis need to uncover. But what is a “correct” interpretation in the realm of big data, where data are consistently treated as mobile entities that can, at least in principle, be reused in countless different ways and towards different objectives? Perhaps more than at any other time in the history of science, the current mobilisation and re-use of big data highlights the degree to which data interpretation—and with it, whatever data are taken to represent—may differ depending on the conceptual, material and social conditions of inquiry. The analysis of how big data travels across contexts shows that the expectations and abilities of those involved determine not only the way data are interpreted, but also what is regarded as “data” in the first place (Leonelli & Tempini forthcoming). The representational view of data as objects with fixed and contextually independent meaning is at odds with these observations.

An alternative approach is to embrace these findings and abandon the idea of data as fixed representations of reality altogether. Within the relational view, data are objects that are treated as potential or actual evidence for scientific claims in ways that can, at least in principle, be scrutinised and accounted for (Leonelli 2016). The meaning assigned to data depends on their provenance, their physical features and what these features are taken to represent, and the motivations and instruments used to visualise them and to defend specific interpretations. The reliability of data thus depends on the credibility and strictness of the processes used to produce and analyse them. The presentation of data; the way they are identified, selected, and included (or excluded) in databases; and the information provided to users to re-contextualise them are fundamental to producing knowledge and significantly influence its content. For instance, changes in data format—as most obviously involved in digitisation, data compression or archival procedures—can have a significant impact on where, when, and by whom the data are used as a source of knowledge.

This framework acknowledges that any object can be used as a datum, or stop being used as such, depending on the circumstances—a consideration familiar to big data analysts used to pick and mix data coming from a vast variety of sources. The relational view also explains how, depending on the research perspective interpreting it, the same dataset may be used to represent different aspects of the world (“phenomena” as famously characterised by James Bogen and James Woodward, 1988). When considering the full cycle of scientific inquiry from the viewpoint of data production and analysis, it is at the stage of data modelling that a specific representational value is attributed to data (Leonelli 2019b).

The relational view of data encourages attention to the history of data, highlighting their continual evolution and sometimes radical alteration, and the impact of this feature on the power of data to confirm or refute hypotheses. It explains the critical importance of documenting data management and transformation processes, especially with big data that transit far and wide over digital channels and are grouped and interpreted in different ways and formats. It also explains the increasing recognition of the expertise of those who produce, curate, and analyse data as indispensable to the effective interpretation of big data within and beyond the sciences; and the inextricable link between social and ethical concerns around the potential impact of data sharing and scientific concerns around the quality, validity, and security of data (boyd & Crawford 2012; Tempini & Leonelli, 2018).

Depending on which view on data one takes, expectations around what big data can do for science will vary dramatically. The representational view accommodates the idea of big data as providing the most comprehensive, reliable and generative knowledge base ever witnessed in the history of science, by virtue of its sheer size and heterogeneity. The relational view makes no such commitment, focusing instead on what inferences are being drawn from such data at any given point, how and why.

One thing that the representational and relational views agree on is the key epistemic role of data as empirical evidence for knowledge claims or interventions. While there is a large philosophical literature on the nature of evidence (e.g., Achinstein 2001; Reiss 2015; Kelly 2016), the relation between data and evidence has received less attention. This is arguably due to an implicit acceptance, by many philosophers, of the representational view of data. Within the representational view, the identification of what counts as data is prior to the study of what those data can be evidence for: in other words, data are “givens”, as the etymology of the word indicates, and inferential methods are responsible for determining whether and how the data available to investigators can be used as evidence, and for what. The focus of philosophical attention is thus on formal methods to single out errors and misleading interpretations, and the probabilistic and/or explanatory relation between what is unproblematically taken to be a body of evidence and a given hypothesis. Hence much of the expansive philosophical work on evidence avoids the term “data” altogether. Peter Achinstein’s seminal work is a case in point: it discusses observed facts and experimental results, and whether and under which conditions scientists would have reasons to believe such facts, but it makes no mention of data and related processing practices (Achinstein 2001).

By contrast, within the relational view an object can only be identified as datum when it is viewed as having value as evidence. Evidence becomes a category of data identification, rather than a category of data use as in the representational view (Canali 2019). Evidence is thus constitutive of the very notion of data and cannot be disentangled from it. This involves accepting that the conditions under which a given object can serve as evidence—and thus be viewed as datum—may change; and that should this evidential role stop altogether, the object would revert to an ordinary, non-datum item. For example, the photograph of a plant taken by a tourist in a remote region may become relevant as evidence for an inquiry into the morphology of plants from that particular locality; yet most photographs of plants are never considered as evidence for an inquiry into the features and functioning of the world, and of those that are, many may subsequently be discarded as uninteresting or no longer pertinent to the questions being asked.

This view accounts for the mobility and repurposing that characterises big data use, and for the possibility that objects that were not originally generated in order to serve as evidence may be subsequently adopted as such. Consider Mayo and Spanos’s “minimal scientific principle for evidence”, which they define as follows:

Data x₀ provide poor evidence for H if they result from a method or procedure that has little or no ability of finding flaws in H, even if H is false. (Mayo & Spanos 2009b)

This principle is compatible with the relational view of data since it incorporates cases where the methods used to generate and process data may not have been geared towards the testing of a hypothesis H: all it asks is that such methods can be made relevant to the testing of H, at the point at which data are used as evidence for H (I shall come back to the role of hypotheses in the handling of evidence in the next section).

The relational view also highlights the relevance of practices of data formatting and manipulation to the treatment of data as evidence, thus taking attention away from the characteristics of the data objects alone and focusing instead on the agency attached to and enabled by those characteristics. Nora Boyd has provided a way to conceptualise data processing as an integral part of inferential processes, and thus of how we should understand evidence. To this aim she introduced the notion of “line of evidence”, which she defines as:

a sequence of empirical results including the records of data collection and all subsequent products of data processing generated on the way to some final empirical constraint. (Boyd 2018:406)

She thus proposes a conception of evidence that embraces both data and the way in which data are handled, and indeed emphasises the importance of auxiliary information used when assessing data for interpretation, which includes

the metadata regarding the provenance of the data records and the processing workflow that transforms them. (2018: 407)

As she concludes,

together, a line of evidence and its associated metadata compose what I am calling an “enriched line of evidence”. The evidential corpus is then to be made up of many such enriched lines of evidence. (2018: 407)

The relational view thus fosters a functional and contextualist approach to evidence as the manner through which one or more objects are used as warrant for particular knowledge items (which can be propositional claims, but also actions such as specific decisions or modes of conduct/ways of operating). This chimes with the contextual view of evidence defended by Reiss (2015), John Norton’s work on the multiple, tangled lines of inferential reasoning underpinning appeals to induction (2003), and Hasok Chang’s emphasis on the epistemic activities required to ground evidential claims (2012). Building on these ideas and on Stephen Toulmin’s seminal work on research schemas (1958), Alison Wylie has gone one step further in evaluating the inferential scaffolding that researchers (and particularly archaeologists, who so often are called to re-evaluate the same data as evidence for new claims; Wylie 2017) need to make sense of their data, interpret them in ways that are robust to potential challenges, and modify interpretations in the face of new findings. This analysis enabled Wylie to formulate a set of conditions for robust evidential reasoning, which include epistemic security in the chain of evidence, causal anchoring and causal independence of the data used as evidence, as well as the explicit articulation of the grounds for calibration of the instruments and methods involved (Chapman & Wylie 2016; Wylie forthcoming). A similar conclusion is reached by Jessey Wright’s evaluation of the diverse data analysis techniques that neuroscientists use to make sense of functional magnetic resonance imaging of the brain (fMRI scans):

different data analysis techniques reveal different patterns in the data. Through the use of multiple data analysis techniques, researchers can produce results that are locally robust. (Wright 2017: 1179)

Wylie’s and Wright’s analyses exemplify how a relational approach to data fosters a normative understanding of “good evidence” which is anchored in situated judgement—the arguably human prerogative to contextualise and assess the significance of evidential claims. The advantages of this view of evidence are eloquently expressed by Nancy Cartwright’s critique of both philosophical theories and policy approaches that do not recognise the local and contextual nature of evidential reasoning. As she notes,

we need a concept that can give guidance about what is relevant to consider in deciding on the probability of the hypothesis, not one that requires that we already know significant facts about the probability of the hypothesis on various pieces of evidence. (Cartwright 2013: 6)

Thus she argues for a notion of evidence that is not too restrictive, takes account of the difficulties in combining and selecting evidence, and allows for contextual judgement on what types of evidence are best suited to the inquiry at hand (Cartwright 2013, 2019). Reiss’s proposal of a pragmatic theory of evidence is similarly one that

takes scientific practice [..] seriously, both in terms of its greater use of knowledge about the conditions under which science is practised and in terms of its goal to develop insights that are relevant to practising scientists. (Reiss 2015: 361)

A better characterisation of the relation between data and evidence, predicated on the study of how data are processed and aggregated, may go a long way towards addressing these demands. As aptly argued by James Woodward, the evidential relationship between data and claims is not “a purely formal, logical, or a priori matter” (Woodward 2000: S172–173). This again sits uneasily with the expectation that big data analysis may automate scientific discovery and make human judgement redundant.

Let us now return to the idea of data-driven inquiry, often suggested as a counterpoint to hypothesis-driven science (e.g., Hey et al. 2009). Kevin Elliott and colleagues have offered a brief history of hypothesis-driven inquiry (Elliott et al. 2016), emphasising how scientific institutions (including funding programmes and publication venues) have pushed researchers towards a Popperian conceptualisation of inquiry as the formulation and testing of a strong hypothesis. Big data analysis clearly points to a different and arguably Baconian understanding of the role of hypothesis in science. Theoretical expectations are no longer seen as driving the process of inquiry and empirical input is recognised as primary in determining the direction of research and the phenomena—and related hypotheses—considered by researchers.

The emphasis on data as a central component of research poses a significant challenge to one of the best-established philosophical views on scientific knowledge. According to this view, which I shall label the theory-centric view of science, scientific knowledge consists of justified true beliefs about the world. These beliefs are obtained through empirical methods aiming to test the validity and reliability of statements that describe or explain aspects of reality. Hence scientific knowledge is conceptualised as inherently propositional: what counts as an output are claims published in books and journals, which are also typically presented as solutions to hypothesis-driven inquiry. This view acknowledges the significance of methods, data, models, instruments and materials within scientific investigations, but ultimately regards them as means towards one end: the achievement of true claims about the world. Reichenbach’s seminal distinction between contexts of discovery and justification exemplifies this position (Reichenbach 1938). Theory-centrism recognises research components such as data and related practical skills as essential to discovery, and more specifically to the messy, irrational part of scientific work that involves value judgements, trial-and-error, intuition and exploration and within which the very phenomena to be investigated may not have been stabilised. The justification of claims, by contrast, involves the rational reconstruction of the research that has been performed, so that it conforms to established norms of inferential reasoning. Importantly, within the context of justification, only data that support the claims of interest are explicitly reported and discussed: everything else—including the vast majority of data produced in the course of inquiry—is lost to the chaotic context of discovery.

Much recent philosophy of science, and particularly work on modelling and experimentation, has challenged theory-centrism by highlighting the role of models, methods and modes of intervention as research outputs rather than simple tools, and stressing the importance of expanding philosophical understandings of scientific knowledge to include these elements alongside propositional claims. The rise of big data offers another opportunity to reframe understandings of scientific knowledge as not necessarily centred on theories and to include non-propositional components—thus, in Cartwright’s paraphrase of Gilbert Ryle’s famous distinction, refocusing on knowing-how over knowing-that (Cartwright 2019). One way to construe data-centric methods is indeed to embrace a conception of knowledge as ability, such as promoted by early pragmatists like John Dewey and more recently reprised by Chang, who specifically highlighted it as the broader category within which the understanding of knowledge-as-information needs to be placed (Chang 2017).

Another way to interpret the rise of big data is as a vindication of inductivism in the face of the barrage of philosophical criticism levelled against theory-free reasoning over the centuries. For instance, Jon Williamson (2004: 88) has argued that advances in automation, combined with the emergence of big data, lend plausibility to inductivist philosophy of science. Wolfgang Pietsch agrees with this view and has provided a sophisticated framework to understand just what kind of inductive reasoning is instigated by big data and related machine learning methods such as decision trees (Pietsch 2015). Following John Stuart Mill, he calls this approach variational induction and presents it as common to both big data approaches and exploratory experimentation, though the former can handle a much larger number of variables (Pietsch 2015: 913). Pietsch concludes that the problem of theory-ladenness in machine learning can be addressed by determining under which theoretical assumptions variational induction works (2015: 910ff).

Others are less inclined to see theory-ladenness as a problem that can be mitigated by data-intensive methods, and rather see it as a constitutive part of the process of empirical inquiry. Harking back to the extensive literature on perspectivism and experimentation (Gooding 1990; Giere 2006; Radder 2006; Massimi 2012), Werner Callebaut has forcefully argued that the most sophisticated and standardised measurements embody a specific theoretical perspective, and this is no less true of big data (Callebaut 2012). Elliott and colleagues emphasise that conceptualising big data analysis as atheoretical risks encouraging unsophisticated attitudes to empirical investigation as a

“fishing expedition”, having a high probability of leading to nonsense results or spurious correlations, being reliant on scientists who do not have adequate expertise in data analysis, and yielding data biased by the mode of collection. (Elliott et al. 2016: 880)

To address related worries in genetic analysis, Ken Waters has provided the useful characterisation of “theory-informed” inquiry (Waters 2007), which can be invoked to stress how theory informs the methods used to extract meaningful patterns from big data, and yet does not necessarily determine either the starting point or the outcomes of data-intensive science. This does not resolve the question of what role theory actually plays. Rob Kitchin (2014) has proposed to see big data as linked to a new mode of hypothesis generation within a hypothetico-deductive framework. Leonelli is more sceptical of attempts to match big data approaches, which are many and diverse, with a specific type of inferential logic. She has instead focused on the extent to which the theoretical apparatus at work within big data analysis rests on conceptual decisions about how to order and classify data, and has proposed that such decisions can give rise to a particular form of theorisation, which she calls classificatory theory (Leonelli 2016).

These disagreements point to big data as eliciting diverse understandings of the nature of knowledge and inquiry, and the complex iterations through which different inferential methods build on each other. Again, in the words of Elliott and colleagues,

attempting to draw a sharp distinction between hypothesis-driven and data-intensive science is misleading; these modes of research are not in fact orthogonal and often intertwine in actual scientific practice. (Elliott et al. 2016: 881, see also O’Malley et al. 2009, Elliott 2012)

Another epistemological debate strongly linked to reflection on big data concerns the specific kinds of knowledge emerging from data-centric forms of inquiry, and particularly the relation between predictive and causal knowledge.

Big data science is widely seen as revolutionary in the scale and power of predictions that it can support. Unsurprisingly perhaps, a philosophically sophisticated defence of this position comes from the philosophy of mathematics, where Marco Panza, Domenico Napoletani and Daniele Struppa argued for big data science as occasioning a momentous shift in the predictive knowledge that mathematical analysis can yield, and thus its role within broader processes of knowledge production. The whole point of big data analysis, they posit, is its disregard for causal knowledge:

answers are found through a process of automatic fitting of the data to models that do not carry any structural understanding beyond the actual solution of the problem itself. (Napoletani, Panza, & Struppa 2014: 486)

This view differs from simplistic popular discourse on “the death of theory” (Anderson 2008) and the “power of correlations” (Mayer-Schönberger & Cukier 2013) insofar as it does not side-step the constraints associated with knowledge and generalisations that can be extracted from big data analysis. Napoletani, Panza and Struppa recognise that there are inescapable tensions around the ability of mathematical reasoning to overdetermine empirical input, to the point of providing a justification for any and every possible interpretation of the data. In their words,

the problem arises of how we can gain meaningful understanding of historical phenomena, given the tremendous potential variability of their developmental processes. (Napoletani et al. 2014: 487)

Their solution is to clarify that understanding phenomena is not the goal of predictive reasoning, which is rather a form of agnostic science: “the possibility of forecasting and analysing without a structured and general understanding” (Napoletani et al. 2011: 12). The opacity of algorithmic rationality thus becomes its key virtue and the reason for the extraordinary epistemic success of forecasting grounded on big data. While “the phenomenon may forever remain hidden to our understanding” (ibid.: 5), the application of mathematical models and algorithms to big data can still provide meaningful and reliable answers to well-specified problems—similarly to what has been argued in the case of false models (Wimsatt 2007). Examples include the use of “forcing” methods such as regularisation or diffusion geometry to facilitate the extraction of useful insights from messy datasets.

This view is at odds with accounts that posit scientific understanding as a key aim of science (de Regt 2017), and the intuition that what researchers are ultimately interested in is

whether the opaque data-model generated by machine-learning technologies count as explanations for the relationships found between input and output. (Boon 2020: 44)

Within the philosophy of biology, for example, it is well recognised that big data facilitates effective extraction of patterns and trends, and that being able to model and predict how an organism or ecosystem may behave in the future is of great importance, particularly within more applied fields such as biomedicine or conservation science. At the same time, researchers are interested in understanding the reasons for observed correlations, and typically use predictive patterns as heuristics to explore, develop and verify causal claims about the structure and functioning of entities and processes. Emanuele Ratti (2015) has argued that big data mining within genome-wide association studies often used in cancer genomics can actually underpin mechanistic reasoning, for instance by supporting eliminative inference to develop mechanistic hypotheses and by helping to explore and evaluate generalisations used to analyse the data. In a similar vein, Pietsch (2016) proposed to use variational induction as a method to establish what counts as causal relationships among big data patterns, by focusing on which analytic strategies allow for reliable prediction and effective manipulation of a phenomenon.

Through the study of data sourcing and processing in epidemiology, Stefano Canali has instead highlighted the difficulties of deriving mechanistic claims from big data analysis, particularly where data are varied and embody incompatible perspectives and methodological approaches (Canali 2016, 2019). Relatedly, the semantic and logistical challenges of organising big data give reason to doubt the reliability of causal claims extracted from such data. In terms of logistics, having a lot of data is not the same as having all of them, and cultivating illusions of comprehensiveness is a risky and potentially misleading strategy, particularly given the challenges encountered in developing and applying curatorial standards for data other than the high-throughput results of “omics” approaches (see also the next section). The constant worry about the partiality and reliability of data is reflected in the care put by database curators in enabling database users to assess such properties; and in the importance given by researchers themselves, particularly in the biological and environmental sciences, to evaluating the quality of data found on the internet (Leonelli 2014, Fleming et al. 2017). In terms of semantics, we are back to the role of data classifications as theoretical scaffolding for big data analysis that we discussed in the previous section. Taxonomic efforts to order and visualise data inform causal reasoning extracted from such data (Sterner & Franz 2017), and can themselves constitute a bottom-up method—grounded in comparative reasoning—for assigning meaning to data models, particularly in situations where a full-blown theory or explanation for the phenomenon under investigation is not available (Sterner 2014).

It is no coincidence that much philosophical work on the relation between causal and predictive knowledge extracted from big data comes from the philosophy of the life sciences, where the absence of axiomatized theories has elicited sophisticated views on the diversity of forms and functions of theory within inferential reasoning. Moreover, biological data are heterogeneous both in their content and in their format; are curated and re-purposed to address the needs of highly disparate and fragmented epistemic communities; and present curators with specific challenges to do with tracking complex, diverse and evolving organismal structures and behaviours, whose relation to an ever-changing environment is hard to pinpoint with any stability (e.g., Shavit & Griesemer 2009). Hence in this domain, some of the core methods and epistemic concerns of experimental research—including exploratory experimentation, sampling and the search for causal mechanisms—remain crucial parts of data-centric inquiry.

At the start of this entry I listed “value” as a major characteristic of big data and pointed to the crucial role of valuing procedures in identifying, processing, modelling and interpreting data as evidence. Identifying and negotiating different forms of data value is an unavoidable part of big data analysis, since these valuation practices determine which data are made available to whom, under which conditions and for which purposes. What researchers choose to consider as reliable data (and data sources) is closely intertwined not only with their research goals and interpretive methods, but also with their approach to data production, packaging, storage and sharing. Thus, researchers need to consider what value their data may have for future research by themselves and others, and how to enhance that value—such as through decisions around which data to make public, how, when and in which format; or, whenever dealing with data already in the public domain (such as personal data on social media), decisions around whether the data should be shared and used at all, and how.

No matter how one conceptualises value practices, it is clear that their key role in data management and analysis prevents facile distinctions between values and “facts” (understood as propositional claims for which data provide evidential warrant). For example, consider a researcher who values both openness—and related practices of widespread data sharing—and scientific rigour—which requires a strict monitoring of the credibility and validity of conditions under which data are interpreted. The scale and manner of big data mobilisation and analysis create tensions between these two values. While the commitment to openness may prompt interest in data sharing, the commitment to rigour may hamper it, since once data are freely circulated online it becomes very difficult to retain control over how they are interpreted, by whom and with which knowledge, skills and tools. How a researcher responds to this conflict affects which data are made available for big data analysis, and under which conditions. Similarly, the extent to which diverse datasets may be triangulated and compared depends on the intellectual property regimes under which the data—and related analytic tools—have been produced. Privately owned data are often unavailable to publicly funded researchers; and many algorithms, cloud systems and computing facilities used in big data analytics are only accessible to those with enough resources to buy relevant access and training. Whatever claims result from big data analysis are, therefore, strongly dependent on social, financial and cultural constraints that condition the data pool and its analysis.

This prominent role of values in shaping data-related epistemic practices is not surprising given existing philosophical critiques of the fact/value distinction (e.g., Douglas 2009), and the existing literature on values in science—such as Helen Longino’s seminal distinction between constitutive and contextual values, as presented in her 1990 book Science as Social Knowledge —may well apply in this case too. Similarly, it is well-established that the technological and social conditions of research strongly condition its design and outcomes. What is particularly worrying in the case of big data is the temptation, prompted by hyped expectations around the power of data analytics, to hide or side-line the valuing choices that underpin the methods, infrastructures and algorithms used for big data extraction.

Consider the use of high-throughput data production tools, which enable researchers to easily generate a large volume of data in formats already geared to computational analysis. Just as in the case of other technologies, researchers have a strong incentive to adopt such tools for data generation; and may do so even in cases where such tools are not good or even appropriate means to pursue the investigation. Ulrich Krohs uses the term convenience experimentation to refer to experimental designs that are adopted not because they are the most appropriate ways of pursuing a given investigation, but because they are easily and widely available and usable, and thus “convenient” means for researchers to pursue their goals (Krohs 2012).

Appeals to convenience can extend to other aspects of data-intensive analysis. Not all data are equally easy to digitally collect, disseminate and link through existing algorithms, which makes some data types and formats more convenient than others for computational analysis. For example, research databases often display the outputs of well-resourced labs within research traditions which deal with “tractable” data formats (such as “omics”). And indeed, the existing distribution of resources, infrastructure and skills determines high levels of inequality in the production, dissemination and use of big data for research. Big players with large financial and technical resources are leading the development and uptake of data analytics tools, leaving much publicly funded research around the world at the receiving end of innovation in this area. Contrary to popular depictions of the data revolution as harbinger of transparency, democracy and social equality, the digital divide between those who can access and use data technologies, and those who cannot, continues to widen. A result of such divides is the scarcity of data relating to certain subgroups and geographical locations, which again limits the comprehensiveness of available data resources.

In the vast ecosystem of big data infrastructures, it is difficult to keep track of such distortions and assess their significance for data interpretation, especially in situations where heterogeneous data sources structured through appeal to different values are mashed together. Thus, the systematic aggregation of convenient datasets and analytic tools over others often results in a big data pool where the relevant sources and forms of bias are impossible to locate and account for (Pasquale 2015; O’Neill 2016; Zuboff 2017; Leonelli 2019a). In such a landscape, arguments for a separation between fact and value—and even a clear distinction between the role of epistemic and non-epistemic values in knowledge production—become very difficult to maintain without discrediting the whole edifice of big data science. Given the extent to which this approach has penetrated research in all domains, it is arguably impossible, however, to critique the value-laden structure of big data science without calling into question the legitimacy of science itself. A more constructive approach is to embrace the extent to which big data science is anchored in human choices, interests and values, and ascertain how this affects philosophical views on knowledge, truth and method.

In closing, it is important to consider at least some of the risks and related ethical questions raised by research with big data. As already mentioned in the previous section, reliance on big data collected by powerful institutions or corporations raises significant social concerns. Contrary to the view that sees big and open data as harbingers of democratic social participation in research, the way that scientific research is governed and financed is not challenged by big data. Rather, the increasing commodification and large value attributed to certain kinds of data (e.g., personal data) is associated with an increase in inequality of power and visibility between different nations, segments of the population and scientific communities (O’Neill 2016; Zuboff 2017; D’Ignazio and Klein 2020). The gap between those who can not only access data but also use it, and those who cannot, is widening, leading from a state of digital divide to a condition of “data divide” (Bezuidenhout et al. 2017).

Moreover, the privatisation of data has serious implications for the world of research and the knowledge it produces. Firstly, it affects which data are disseminated, and with which expectations. Corporations usually only release data that they regard as having lesser commercial value and that they need public sector assistance to interpret. This introduces another distortion in the sources and types of data that are accessible online, while more expensive and complex data are kept secret. Even many of the ways in which citizens (researchers included) are encouraged to interact with databases and data interpretation sites tend to encourage participation that generates further commercial value. Sociologists have recently described this type of social participation as a form of exploitation (Prainsack & Buyx 2017; Srnicek 2017). In turn, these ways of exploiting data strengthen their economic value over their scientific value. When it comes to the commerce of personal data between companies working in analysis, the value of the data as commercial products (which includes the evaluation of the speed and efficiency with which access to certain data can help develop new products) often has priority over scientific issues such as the representativity and reliability of the data and the ways they were analysed. This can result in decisions that are scientifically problematic, or in a simple lack of interest in investigating the consequences of the assumptions made and the processes used. This lack of interest easily translates into ignorance of discrimination, inequality and potential errors in the data considered. This type of ignorance is highly strategic and economically productive since it enables the use of data without concerns over social and scientific implications. In this scenario the evaluation of the quality of data shrinks to an evaluation of their usefulness towards short-term analyses or forecasting required by the client. There are no incentives in this system to encourage evaluation of the long-term implications of data analysis. The risk here is that the commerce of data is accompanied by an increasing divergence between data and their context. The interest in the history of the transit of data, the plurality of their emotional or scientific value and the re-evaluation of their origins tend to disappear over time, to be substituted by the increasing hold of the financial value of data.

The multiplicity of data sources and tools for aggregation also creates risks. The complexity of the data landscape is making it harder to identify which parts of the infrastructure require updating or have been put in doubt by new scientific developments. The situation worsens when considering the number of databases that populate every area of scientific research, each containing assumptions that influence the circulation and interoperability of data and that often are not updated in a reliable and regular way. Just to provide an idea of the numbers involved, the prestigious scientific publication Nucleic Acids Research publishes every year a special issue on new databases that are relevant to molecular biology, which included 56 new infrastructures in 2015, 62 in 2016, 54 in 2017 and 82 in 2018. These are just a small proportion of the hundreds of databases that are developed each year in the life sciences sector alone. The fact that these databases rely on short-term funding means that a growing percentage of resources remain available to consult online although they are long dead. This condition is not always visible to users of the databases, who trust them without checking whether they are actively maintained or not. At what point do these infrastructures become obsolete? What are the risks involved in weaving an ever more extensive tapestry of infrastructures that depend on each other, given the disparity in the ways they are managed and the challenges in identifying and comparing their prerequisite conditions, the theories and scaffolding used to build them? One of these risks is rampant conservatism: the insistence on recycling old data whose features and management elements become increasingly murky as time goes by, instead of encouraging the production of new data with features that specifically respond to the requirements and the circumstances of their users. In disciplines such as biology and medicine, which study living beings that are by definition continually evolving and developing, such trust in old data is particularly alarming. It is not the case, for example, that data collected on fungi ten, twenty or even a hundred years ago can reliably explain the behaviour of the same species of fungi now or in the future (Leonelli 2018).

Researchers of what Luciano Floridi calls the infosphere (the way in which the introduction of digital technologies is changing the world) are becoming aware of the destructive potential of big data and the urgent need to focus efforts for management and use of data in active and thoughtful ways towards the improvement of the human condition. In Floridi’s own words:

ICT yields great opportunity which, however, entails the enormous intellectual responsibility of understanding this technology to use it in the most appropriate way. (Floridi 2014: vii; see also British Academy & Royal Society 2017)

In light of these findings, it is essential that ethical and social issues are seen as a core part of the technical and scientific requirements associated with data management and analysis. The ethical management of data is not obtained exclusively by regulating the commerce of research and management of personal data nor with the introduction of monitoring of research financing, even though these are important strategies. To guarantee that big data are used in the most scientifically and socially forward-thinking way it is necessary to transcend the concept of ethics as something external and alien to research. An analysis of the ethical implications of data science should become a basic component of the background and activity of those who take care of data and the methods used to view and analyse it. Ethical evaluations and choices are hidden in every aspect of data management, including those choices that may seem purely technical.

This entry stressed how the emerging emphasis on big data signals the rise of a data-centric approach to research, in which efforts to mobilise, integrate, disseminate and visualise data are viewed as central contributions to discovery. The emergence of data-centrism highlights the challenges involved in gathering, classifying and interpreting data, and the concepts, technologies and institutions that surround these processes. Tools such as high-throughput measurement instruments and apps for smartphones are fast generating large volumes of data in digital formats. In principle, these data are immediately available for dissemination through internet platforms, which can make them accessible to anybody with a broadband connection in a matter of seconds. In practice, however, access to data is fraught with conceptual, technical, legal and ethical implications; and even when access can be granted, it does not guarantee that the data can be fruitfully used to spur further research. Furthermore, the mathematical and computational tools developed to analyse big data are often opaque in their functioning and assumptions, leading to results whose scientific meaning and credibility may be difficult to assess. This increases the worry that big data science may be grounded upon, and ultimately supporting, the process of making human ingenuity hostage to an alien, artificial and ultimately unintelligible intelligence.

Perhaps the most confronting aspect of big data science as discussed in this entry is the extent to which it deviates from understandings of rationality grounded on individual agency and cognitive abilities (on which much of contemporary philosophy of science is predicated). The power of any one dataset to yield knowledge lies in the extent to which it can be linked with others: this is what lends high epistemic value to digital objects such as GPS locations or sequencing data, and what makes extensive data aggregation from a variety of sources into a highly effective surveillance tool. Data production and dissemination channels such as social media, governmental databases and research repositories operate in a globalised, interlinked and distributed network, whose functioning requires a wide variety of skills and expertise. The distributed nature of decision-making involved in developing big data infrastructures and analytics makes it impossible for any one individual to retain oversight over the quality, scientific significance and potential social impact of the knowledge being produced.

Big data analysis may therefore constitute the ultimate instance of a distributed cognitive system. Where does this leave accountability questions? Many individuals, groups and institutions end up sharing responsibility for the conceptual interpretation and social outcomes of specific data uses. A key challenge for big data governance is to find mechanisms for allocating responsibilities across this complex network, so that erroneous and unwarranted decisions—as well as outright fraudulent, unethical, abusive, discriminatory or misguided actions—can be singled out, corrected and appropriately sanctioned. Thinking about the complex history, processing and use of data can encourage philosophers to avoid ahistorical, uncontextualized approaches to questions of evidence, and instead consider the methods, skills, technologies and practices involved in handling data—and particularly big data—as crucial to understanding empirical knowledge-making.

  • Achinstein, Peter, 2001, The Book of Evidence , Oxford: Oxford University Press. doi:10.1093/0195143892.001.0001
  • Anderson, Chris, 2008, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Wired Magazine , 23 June 2008.
  • Aronova, Elena, Karen S. Baker, and Naomi Oreskes, 2010, “Big science and big data in biology: From the International Geophysical Year through the International Biological Program to the Long Term Ecological Research (LTER) Network, 1957–present”, Historical Studies in the Natural Sciences , 40: 183–224.
  • Aronova, Elena, Christine von Oertzen, and David Sepkoski, 2017, “Introduction: Historicizing Big Data”, Osiris , 32(1): 1–17. doi:10.1086/693399
  • Bauer, Susanne, 2008, “Mining Data, Gathering Variables and Recombining Information: The Flexible Architecture of Epidemiological Studies”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 39(4): 415–428. doi:10.1016/j.shpsc.2008.09.008
  • Bechtel, William, 2016, “Using Computational Models to Discover and Understand Mechanisms”, Studies in History and Philosophy of Science Part A , 56: 113–121. doi:10.1016/j.shpsa.2015.10.004
  • Beisbart, Claus, 2012, “How Can Computer Simulations Produce New Knowledge?”, European Journal for Philosophy of Science , 2(3): 395–434. doi:10.1007/s13194-012-0049-7
  • Bezuidenhout, Louise, Leonelli, Sabina, Kelly, Ann and Rappert, Brian, 2017, “Beyond the Digital Divide: Towards a Situated Approach to Open Data”. Science and Public Policy , 44(4): 464–475. doi: 10.1093/scipol/scw036
  • Bogen, Jim, 2009 [2013], “Theory and Observation in Science”, in The Stanford Encyclopedia of Philosophy (Spring 2013 Edition), Edward N. Zalta (ed.).
  • –––, 2010, “Noise in the World”, Philosophy of Science , 77(5): 778–791. doi:10.1086/656006
  • Bogen, James and James Woodward, 1988, “Saving the Phenomena”, The Philosophical Review , 97(3): 303. doi:10.2307/2185445
  • Bokulich, Alisa, 2018, “Using Models to Correct Data: Paleodiversity and the Fossil Record”, in S.I.: Abstraction and Idealization in Scientific Modelling by Synthese , 29 May 2018. doi:10.1007/s11229-018-1820-x
  • Boon, Mieke, 2020, “How Scientists Are Brought Back into Science—The Error of Empiricism”, in A Critical Reflection on Automated Science , Marta Bertolaso and Fabio Sterpetti (eds.), (Human Perspectives in Health Sciences and Technology 1), Cham: Springer International Publishing, 43–65. doi:10.1007/978-3-030-25001-0_4
  • Borgman, Christine L., 2015, Big Data, Little Data, No Data , Cambridge, MA: MIT Press.
  • Boumans, M.J. and Sabina Leonelli, forthcoming, “From Dirty Data to Tidy Facts: Practices of Clustering in Plant Phenomics and Business Cycles”, in Leonelli and Tempini forthcoming.
  • Boyd, Danah and Kate Crawford, 2012, “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon”, Information, Communication & Society , 15(5): 662–679. doi:10.1080/1369118X.2012.678878
  • Boyd, Nora Mills, 2018, “Evidence Enriched”, Philosophy of Science , 85(3): 403–421. doi:10.1086/697747
  • Bowker, Geoffrey C., 2006, Memory Practices in the Sciences , Cambridge, MA: The MIT Press.
  • Bringsjord, Selmer and Naveen Sundar Govindarajulu, 2018, “Artificial Intelligence”, in The Stanford Encyclopedia of Philosophy (Fall 2018 edition), Edward N. Zalta (ed.).
  • British Academy & Royal Society, 2017, Data Management and Use: Governance in the 21st Century. A Joint Report of the Royal Society and the British Academy , British Academy & Royal Society 2017 available online (see Report).
  • Cai, Li and Yangyong Zhu, 2015, “The Challenges of Data Quality and Data Quality Assessment in the Big Data Era”, Data Science Journal , 14: 2. doi:10.5334/dsj-2015-002
  • Callebaut, Werner, 2012, “Scientific Perspectivism: A Philosopher of Science’s Response to the Challenge of Big Data Biology”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 69–80. doi:10.1016/j.shpsc.2011.10.007
  • Calude, Cristian S. and Giuseppe Longo, 2017, “The Deluge of Spurious Correlations in Big Data”, Foundations of Science , 22(3): 595–612. doi:10.1007/s10699-016-9489-4
  • Canali, Stefano, 2016, “Big Data, Epistemology and Causality: Knowledge in and Knowledge out in EXPOsOMICS”, Big Data & Society , 3(2): 205395171666953. doi:10.1177/2053951716669530
  • –––, 2019, “Evaluating Evidential Pluralism in Epidemiology: Mechanistic Evidence in Exposome Research”, History and Philosophy of the Life Sciences , 41(1): art. 4. doi:10.1007/s40656-019-0241-6
  • Cartwright, Nancy D., 2013, Evidence: For Policy and Wheresoever Rigor Is a Must , London School of Economics and Political Science (LSE), Order Project Discussion Paper Series [Cartwright 2013 available online ].
  • –––, 2019, Nature, the Artful Modeler: Lectures on Laws, Science, How Nature Arranges the World and How We Can Arrange It Better (The Paul Carus Lectures) , Chicago, IL: Open Court.
  • Chang, Hasok, 2012, Is Water H2O? Evidence, Realism and Pluralism , (Boston Studies in the Philosophy of Science 293), Dordrecht: Springer Netherlands. doi:10.1007/978-94-007-3932-1
  • –––, 2017, “VI—Operational Coherence as the Source of Truth”, Proceedings of the Aristotelian Society , 117(2): 103–122. doi:10.1093/arisoc/aox004
  • Chapman, Robert and Alison Wylie, 2016, Evidential Reasoning in Archaeology , London: Bloomsbury Publishing Plc.
  • Collins, Harry M., 1990, Artificial Experts: Social Knowledge and Intelligent Machines , Cambridge, MA: MIT Press.
  • Craver, Carl F. and Lindley Darden, 2013, In Search of Mechanisms: Discoveries Across the Life Sciences , Chicago: University of Chicago Press.
  • Daston, Lorraine, 2017, Science in the Archives: Pasts, Presents, Futures , Chicago: University of Chicago Press.
  • De Regt, Henk W., 2017, Understanding Scientific Understanding , Oxford: Oxford University Press. doi:10.1093/oso/9780190652913.001.0001
  • D’Ignazio, Catherine and Klein, Lauren F., 2020, Data Feminism , Cambridge, MA: The MIT Press.
  • Douglas, Heather E., 2009, Science, Policy and the Value-Free Ideal , Pittsburgh, PA: University of Pittsburgh Press.
  • Dreyfus, Hubert L., 1992, What Computers Still Can’t Do: A Critique of Artificial Reason , Cambridge, MA: MIT Press.
  • Durán, Juan M. and Nico Formanek, 2018, “Grounds for Trust: Essential Epistemic Opacity and Computational Reliabilism”, Minds and Machines , 28(4): 645–666. doi:10.1007/s11023-018-9481-6
  • Edwards, Paul N., 2010, A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming , Cambridge, MA: The MIT Press.
  • Elliott, Kevin C., 2012, “Epistemic and methodological iteration in scientific research”. Studies in History and Philosophy of Science , 43: 376–382.
  • Elliott, Kevin C., Kendra S. Cheruvelil, Georgina M. Montgomery, and Patricia A. Soranno, 2016, “Conceptions of Good Science in Our Data-Rich World”, BioScience , 66(10): 880–889. doi:10.1093/biosci/biw115
  • Feest, Uljana, 2011, “What Exactly Is Stabilized When Phenomena Are Stabilized?”, Synthese , 182(1): 57–71. doi:10.1007/s11229-009-9616-7
  • Fleming, Lora, Niccolò Tempini, Harriet Gordon-Brown, Gordon L. Nichols, Christophe Sarran, Paolo Vineis, Giovanni Leonardi, Brian Golding, Andy Haines, Anthony Kessel, Virginia Murray, Michael Depledge, and Sabina Leonelli, 2017, “Big Data in Environment and Human Health”, in Oxford Research Encyclopedia of Environmental Science, Oxford: Oxford University Press. doi:10.1093/acrefore/9780199389414.013.541
  • Floridi, Luciano, 2014, The Fourth Revolution: How the Infosphere is Reshaping Human Reality , Oxford: Oxford University Press.
  • Floridi, Luciano and Phyllis Illari (eds.), 2014, The Philosophy of Information Quality , (Synthese Library 358), Cham: Springer International Publishing. doi:10.1007/978-3-319-07121-3
  • Frigg, Roman and Julian Reiss, 2009, “The Philosophy of Simulation: Hot New Issues or Same Old Stew?”, Synthese , 169(3): 593–613. doi:10.1007/s11229-008-9438-z
  • Frigg, Roman and Stephan Hartmann, 2016, “Models in Science”, in The Stanford Encyclopedia of Philosophy (Winter 2016 edition), Edward N. Zalta (ed.).
  • Gooding, David C., 1990, Experiment and the Making of Meaning , Dordrecht & Boston: Kluwer.
  • Giere, Ronald, 2006, Scientific Perspectivism , Chicago: University of Chicago Press.
  • Griesemer, James R., forthcoming, “A Data Journey through Dataset-Centric Population Biology”, in Leonelli and Tempini forthcoming.
  • Hacking, Ian, 1992, “The Self-Vindication of the Laboratory Sciences”, In Science as Practice and Culture , Andrew Pickering (ed.), Chicago, IL: The University of Chicago Press, 29–64.
  • Harris, Todd, 2003, “Data Models and the Acquisition and Manipulation of Data”, Philosophy of Science , 70(5): 1508–1517. doi:10.1086/377426
  • Hey, Tony, Stewart Tansley, and Kristin Tolle, 2009, The Fourth Paradigm: Data-Intensive Scientific Discovery, Redmond, WA: Microsoft Research.
  • Humphreys, Paul, 2004, Extending Ourselves: Computational Science, Empiricism, and Scientific Method , Oxford: Oxford University Press. doi:10.1093/0195158709.001.0001
  • –––, 2009, “The Philosophical Novelty of Computer Simulation Methods”, Synthese , 169(3): 615–626. doi:10.1007/s11229-008-9435-2
  • Karaca, Koray, 2018, “Lessons from the Large Hadron Collider for Model-Based Experimentation: The Concept of a Model of Data Acquisition and the Scope of the Hierarchy of Models”, Synthese , 195(12): 5431–5452. doi:10.1007/s11229-017-1453-5
  • Kelly, Thomas, 2016, “Evidence”, in The Stanford Encyclopedia of Philosophy (Winter 2016 edition), Edward N. Zalta (ed.).
  • Kitchin, Rob, 2013, The Data Revolution: Big Data, Open Data, Data Infrastructures & Their Consequences , Los Angeles: Sage.
  • –––, 2014, “Big Data, new epistemologies and paradigm shifts”, Big Data and Society , 1(1) April-June. doi: 10.1177/2053951714528481
  • Kitchin, Rob and Gavin McArdle, 2016, “What Makes Big Data, Big Data? Exploring the Ontological Characteristics of 26 Datasets”, Big Data & Society , 3(1): 205395171663113. doi:10.1177/2053951716631130
  • Krohs, Ulrich, 2012, “Convenience Experimentation”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 52–57. doi:10.1016/j.shpsc.2011.10.005
  • Lagoze, Carl, 2014, “Big Data, data integrity, and the fracturing of the control zone,” Big Data and Society , 1(2) July-December. doi: 10.1177/2053951714558281
  • Leonelli, Sabina, 2014, “What Difference Does Quantity Make? On the Epistemology of Big Data in Biology”, Big Data & Society , 1(1): 205395171453439. doi:10.1177/2053951714534395
  • –––, 2016, Data-Centric Biology: A Philosophical Study , Chicago: University of Chicago Press.
  • –––, 2017, “Global Data Quality Assessment and the Situated Nature of ‘Best’ Research Practices in Biology”, Data Science Journal , 16: 32. doi:10.5334/dsj-2017-032
  • –––, 2018, “The Time of Data: Timescales of Data Use in the Life Sciences”, Philosophy of Science , 85(5): 741–754. doi:10.1086/699699
  • –––, 2019a, La Recherche Scientifique à l’Ère des Big Data: Cinq Façons Dont les Données Massives Nuisent à la Science, et Comment la Sauver, Milano: Éditions Mimésis.
  • –––, 2019b, “What Distinguishes Data from Models?”, European Journal for Philosophy of Science , 9(2): 22. doi:10.1007/s13194-018-0246-0
  • Leonelli, Sabina and Niccolò Tempini, 2018, “Where Health and Environment Meet: The Use of Invariant Parameters in Big Data Analysis”, Synthese , special issue on the Philosophy of Epidemiology , Sean Valles and Jonathan Kaplan (eds.). doi:10.1007/s11229-018-1844-2
  • –––, forthcoming, Data Journeys in the Sciences , Cham: Springer International Publishing.
  • Loettgers, Andrea, 2009, “Synthetic Biology and the Emergence of a Dual Meaning of Noise”, Biological Theory , 4(4): 340–356. doi:10.1162/BIOT_a_00009
  • Longino, Helen E., 1990, Science as Social Knowledge: Values and Objectivity in Scientific Inquiry , Princeton, NJ: Princeton University Press.
  • Lowrie, Ian, 2017, “Algorithmic Rationality: Epistemology and Efficiency in the Data Sciences”, Big Data & Society , 4(1): 1–13. doi:10.1177/2053951717700925
  • MacLeod, Miles and Nancy J. Nersessian, 2013, “Building Simulations from the Ground Up: Modeling and Theory in Systems Biology”, Philosophy of Science , 80(4): 533–556. doi:10.1086/673209
  • Massimi, Michela, 2011, “From Data to Phenomena: A Kantian Stance”, Synthese , 182(1): 101–116. doi:10.1007/s11229-009-9611-z
  • –––, 2012, “Scientific perspectivism and its foes”, Philosophica, 84: 25–52.
  • –––, 2016, “Three Tales of Scientific Success”, Philosophy of Science , 83(5): 757–767. doi:10.1086/687861
  • Mayer-Schönberger, Viktor and Kenneth Cukier, 2013, Big Data: A Revolution that Will Transform How We Live, Work, and Think, New York: Eamon Dolan/Houghton Mifflin Harcourt.
  • Mayo, Deborah G., 1996, Error and the Growth of Experimental Knowledge , Chicago: University of Chicago Press.
  • Mayo, Deborah G. and Aris Spanos (eds.), 2009a, Error and Inference , Cambridge: Cambridge University Press.
  • Mayo, Deborah G. and Aris Spanos, 2009b, “Introduction and Background”, in Mayo and Spanos (eds.) 2009a, pp. 1–27.
  • McAllister, James W., 1997, “Phenomena and Patterns in Data Sets”, Erkenntnis , 47(2): 217–228. doi:10.1023/A:1005387021520
  • –––, 2007, “Model Selection and the Multiplicity of Patterns in Empirical Data”, Philosophy of Science , 74(5): 884–894. doi:10.1086/525630
  • –––, 2011, “What Do Patterns in Empirical Data Tell Us about the Structure of the World?”, Synthese , 182(1): 73–87. doi:10.1007/s11229-009-9613-x
  • McQuillan, Dan, 2018, “Data Science as Machinic Neoplatonism”, Philosophy & Technology , 31(2): 253–272. doi:10.1007/s13347-017-0273-3
  • Mitchell, Sandra D., 2003, Biological Complexity and Integrative Pluralism , Cambridge: Cambridge University Press. doi:10.1017/CBO9780511802683
  • Morgan, Mary S., 2005, “Experiments versus Models: New Phenomena, Inference and Surprise”, Journal of Economic Methodology , 12(2): 317–329. doi:10.1080/13501780500086313
  • –––, forthcoming, “The Datum in Context”, in Leonelli and Tempini forthcoming.
  • Morrison, Margaret, 2015, Reconstructing Reality: Models, Mathematics, and Simulations , Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780199380275.001.0001
  • Müller-Wille, Staffan and Isabelle Charmantier, 2012, “Natural History and Information Overload: The Case of Linnaeus”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 4–15. doi:10.1016/j.shpsc.2011.10.021
  • Napoletani, Domenico, Marco Panza, and Daniele C. Struppa, 2011, “Agnostic Science. Towards a Philosophy of Data Analysis”, Foundations of Science , 16(1): 1–20. doi:10.1007/s10699-010-9186-7
  • –––, 2014, “Is Big Data Enough? A Reflection on the Changing Role of Mathematics in Applications”, Notices of the American Mathematical Society , 61(5): 485–490. doi:10.1090/noti1102
  • Nickles, Thomas, forthcoming, “Alien Reasoning: Is a Major Change in Scientific Research Underway?”, Topoi , first online: 20 March 2018. doi:10.1007/s11245-018-9557-1
  • Norton, John D., 2003, “A Material Theory of Induction”, Philosophy of Science , 70(4): 647–670. doi:10.1086/378858
  • O’Malley, Maureen A., Kevin C. Elliott, Chris Haufe, and Richard Burian, 2009, “Philosophies of funding”, Cell, 138: 611–615. doi:10.1016/j.cell.2009.08.008
  • O’Malley, Maureen A. and Orkun S. Soyer, 2012, “The Roles of Integration in Molecular Systems Biology”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 58–68. doi:10.1016/j.shpsc.2011.10.006
  • O’Neill, Cathy, 2016, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy , New York: Crown.
  • Parker, Wendy S., 2009, “Does Matter Really Matter? Computer Simulations, Experiments, and Materiality”, Synthese , 169(3): 483–496. doi:10.1007/s11229-008-9434-3
  • –––, 2017, “Computer Simulation, Measurement, and Data Assimilation”, The British Journal for the Philosophy of Science , 68(1): 273–304. doi:10.1093/bjps/axv037
  • Pasquale, Frank, 2015, The Black Box Society: The Secret Algorithms That Control Money and Information , Cambridge, MA: Harvard University Press.
  • Pietsch, Wolfgang, 2015, “Aspects of Theory-Ladenness in Data-Intensive Science”, Philosophy of Science , 82(5): 905–916. doi:10.1086/683328
  • –––, 2016, “The Causal Nature of Modeling with Big Data”, Philosophy & Technology , 29(2): 137–171. doi:10.1007/s13347-015-0202-2
  • –––, 2017, “Causation, probability and all that: Data science as a novel inductive paradigm”, in Frontiers in Data Science , Matthias Dehmer and Frank Emmert-Streib (eds.), Boca Raton, FL: CRC, 329–353.
  • Porter, Theodore M., 1995, Trust in Numbers: The Pursuit of Objectivity in Science and Public Life , Princeton, NJ: Princeton University Press.
  • Porter, Theodore M. and Soraya de Chadarevian, 2018, “Introduction: Scrutinizing the Data World”, Historical Studies in the Natural Sciences , 48(5): 549–556. doi:10.1525/hsns.2018.48.5.549
  • Prainsack, Barbara and Buyx, Alena, 2017, Solidarity in Biomedicine and Beyond , Cambridge, UK: Cambridge University Press.
  • Radder, Hans, 2009, “The Philosophy of Scientific Experimentation: A Review”, Automated Experimentation , 1(1): 2. doi:10.1186/1759-4499-1-2
  • Ratti, Emanuele, 2015, “Big Data Biology: Between Eliminative Inferences and Exploratory Experiments”, Philosophy of Science , 82(2): 198–218. doi:10.1086/680332
  • Reichenbach, Hans, 1938, Experience and Prediction: An Analysis of the Foundations and the Structure of Knowledge , Chicago, IL: The University of Chicago Press.
  • Reiss, Julian, 2015, “A Pragmatist Theory of Evidence”, Philosophy of Science , 82(3): 341–362. doi:10.1086/681643
  • Reiss, Julian, 2015, Causation, Evidence, and Inference , New York: Routledge.
  • Rescher, Nicholas, 1984, The Limits of Science , Berkeley, CA: University of California Press.
  • Rheinberger, Hans-Jörg, 2011, “Infra-Experimentality: From Traces to Data, from Data to Patterning Facts”, History of Science , 49(3): 337–348. doi:10.1177/007327531104900306
  • Romeijn, Jan-Willem, 2017, “Philosophy of Statistics”, in The Stanford Encyclopedia of Philosophy (Spring 2017), Edward N. Zalta (ed.), URL: .
  • Sepkoski, David, 2013, “Toward ‘a natural history of data’: Evolving practices and epistemologies of data in paleontology, 1800–2000”, Journal of the History of Biology , 46: 401–444.
  • Shavit, Ayelet and James Griesemer, 2009, “There and Back Again, or the Problem of Locality in Biodiversity Surveys*”, Philosophy of Science , 76(3): 273–294. doi:10.1086/649805
  • Srnicek, Nick, 2017, Platform capitalism , Cambridge, UK and Malden, MA: Polity Press.
  • Sterner, Beckett, 2014, “The Practical Value of Biological Information for Research”, Philosophy of Science , 81(2): 175–194. doi:10.1086/675679
  • Sterner, Beckett and Nico M. Franz, 2017, “Taxonomy for Humans or Computers? Cognitive Pragmatics for Big Data”, Biological Theory , 12(2): 99–111. doi:10.1007/s13752-017-0259-5
  • Sterner, Beckett W., Nico M. Franz, and J. Witteveen, 2020, “Coordinating dissent as an alternative to consensus classification: insights from systematics for bio-ontologies”, History and Philosophy of the Life Sciences , 42(1): 8. doi: 10.1007/s40656-020-0300-z
  • Stevens, Hallam, 2016, “Hadooping the Genome: The Impact of Big Data Tools on Biology”, BioSocieties , 11: 352–371.
  • Strasser, Bruno, 2019, Collecting Experiments: Making Big Data Biology , Chicago: University of Chicago Press.
  • Suppes, Patrick, 1962, “Models of data”, in Logic, Methodology and Philosophy of Science , Ernest Nagel, Patrick Suppes, & Alfred Tarski (eds.), Stanford: Stanford University Press, 252–261.
  • Symons, John and Ramón Alvarado, 2016, “Can We Trust Big Data? Applying Philosophy of Science to Software”, Big Data & Society , 3(2): 1-17. doi:10.1177/2053951716664747
  • Symons, John and Jack Horner, 2014, “Software Intensive Science”, Philosophy & Technology , 27(3): 461–477. doi:10.1007/s13347-014-0163-x
  • Tempini, Niccolò, 2017, “Till Data Do Us Part: Understanding Data-Based Value Creation in Data-Intensive Infrastructures”, Information and Organization , 27(4): 191–210. doi:10.1016/j.infoandorg.2017.08.001
  • Tempini, Niccolò and Sabina Leonelli, 2018, “Concealment and Discovery: The Role of Information Security in Biomedical Data Re-Use”, Social Studies of Science , 48(5): 663–690. doi:10.1177/0306312718804875
  • Toulmin, Stephen, 1958, The Uses of Arguments , Cambridge: Cambridge University Press.
  • Turner, Raymond and Nicola Angius, 2019, “The Philosophy of Computer Science”, in The Stanford Encyclopedia of Philosophy (Spring 2019 edition), Edward N. Zalta (ed.), URL = < >.
  • Van Fraassen, Bas C., 2008, Scientific Representation: Paradoxes of Perspective , Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780199278220.001.0001
  • Waters, C. Kenneth, 2007, “The Nature and Context of Exploratory Experimentation: An Introduction to Three Case Studies of Exploratory Research”, History and Philosophy of the Life Sciences , 29(3): 275–284.
  • Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, et al., 2016, “The FAIR Guiding Principles for Scientific Data Management and Stewardship”, Scientific Data , 3(1): 160018. doi:10.1038/sdata.2016.18
  • Williamson, Jon, 2004, “A Dynamic Interaction between Machine Learning and the Philosophy of Science”, Minds and Machines , 14(4): 539–54.
  • Wimsatt, William C., 2007, Re-Engineering Philosophy for Limited Beings: Piecewise Approximations to Reality , Cambridge, MA: Harvard University Press.
  • Winsberg, Eric, 2010, Science in the Age of Computer Simulation , Chicago: University of Chicago Press.
  • Woodward, James, 2000, “Data, phenomena and reliability”, Philosophy of Science , 67(supplement): Proceedings of the 1998 Biennial Meetings of the Philosophy of Science Association. Part II: Symposia Papers (Sep., 2000), pp. S163–S179.
  • –––, 2010, “Data, Phenomena, Signal, and Noise”, Philosophy of Science , 77(5): 792–803. doi:10.1086/656554
  • Wright, Jessey, 2017, “The Analysis of Data and the Evidential Scope of Neuroimaging Results”, The British Journal for the Philosophy of Science , 69(4): 1179–1203. doi:10.1093/bjps/axx012
  • Wylie, Alison, 2017, “How Archaeological Evidence Bites Back: Strategies for Putting Old Data to Work in New Ways”, Science, Technology, & Human Values , 42(2): 203–225. doi:10.1177/0162243916671200
  • –––, forthcoming, “Radiocarbon Dating in Archaeology: Triangulation and Traceability”, in Leonelli and Tempini forthcoming.
  • Zuboff, Shoshana, 2017, The Age of Surveillance Capitalism: The Fight for the Future at the New Frontier of Power , New York: Public Affairs.

artificial intelligence | Bacon, Francis | biology: experiment in | computer science, philosophy of | empiricism: logical | evidence | human genome project | models in science | Popper, Karl | science: theory and observation in | scientific explanation | scientific method | scientific theories: structure of | statistics, philosophy of


The research underpinning this entry was funded by the European Research Council (grant award 335925) and the Alan Turing Institute (EPSRC Grant EP/N510129/1).

Copyright © 2020 by Sabina Leonelli <s.leonelli@exeter.ac.uk>


The Stanford Encyclopedia of Philosophy is copyright © 2023 by The Metaphysics Research Lab , Department of Philosophy, Stanford University

Library of Congress Catalog Data: ISSN 1095-5054

Big Data Social Science

Welcome to Big Data Social Science

big data in social science research

Featured Projects

Working groups, big data and political science working group, big data and society working group, data science and public policy working group, building a research infrastructure for harnessing the data revolution and its social implications.

Big Data Social Science has three desired goals to better support big data and related research:

(1) Expand research support

(2) Help build an intellectual community around this work

(3) Help expand data science teaching

Research Support

Intellectual community, data science teaching, statistical support.

SSCERT will be providing Statistical and Research Design Support for divisional faculty and students. Contact information regarding this new service will be available soon at

Innovation Technology Studio

A new facility is currently under construction to demonstrate and assist in the use of interesting and new technologies relevant to research and teaching.  Contact Tom Phelan [email protected] for additional information.

California Census Research Data Center

The California Census Research Data Center (CCRDC) will soon be moving into its new home at SSCERT.  The Data Center provides researchers with access to micro-level census data in a secured environment.  More information is available at

Data Visualization

SSCERT can provide assistance with certain data management and visualization tasks.  Contact Joy Guey [email protected]  for further information.


© Copyright 2020 UCLA

Ethical Issues in Social Science Research Employing Big Data

  • Preventive Medicine

Research output : Contribution to journal › Article › peer-review

This paper analyzes the ethics of social science research (SSR) employing big data. We begin by highlighting the research gap found on the intersection between big data ethics, SSR and research ethics. We then discuss three aspects of big data SSR which make it warrant special attention from a research ethics angle: (1) the interpretative character of both SSR and big data, (2) complexities of anticipating and managing risks in publication and reuse of big data SSR, and (3) the paucity of regulatory oversight and ethical recommendations on protecting individual subjects as well as societies when conducting big data SSR. Against this backdrop, we propose using David Resnik’s research ethics framework to analyze some of the most pressing ethical issues of big data SSR. Focusing on the principles of honesty, carefulness, openness, efficiency, respect for subjects, and social responsibility, we discuss three clusters of ethical issues: those related to methodological biases and personal prejudices, those connected to risks arising from data availability and reuse, and those leading to individual and social harms. Finally, we advance considerations to observe in developing future ethical guidelines about big data SSR.

  • Computational Social Science
  • Open Science
  • Research Ethics
  • Research Integrity; Big Data
  • Social Science

ASJC Scopus subject areas

  • Health(social science)
  • Health Policy
  • Management of Technology and Innovation
  • Issues, ethics and legal aspects

Access to Document

  • 10.1007/s11948-022-00380-7



T1 - Ethical Issues in Social Science Research Employing Big Data

AU - Hosseini, Mohammad

AU - Wieczorek, Michał

AU - Gordijn, Bert

N1 - Publisher Copyright: © 2022, The Author(s).

PY - 2022/6

Y1 - 2022/6

KW - Computational Social Science

KW - Open Science

KW - Research Ethics

KW - Research Integrity; Big Data

KW - Social Science

U2 - 10.1007/s11948-022-00380-7

DO - 10.1007/s11948-022-00380-7

M3 - Article

C2 - 35705883

AN - SCOPUS:85132204781

SN - 1353-3452

JO - Science and Engineering Ethics

JF - Science and Engineering Ethics

  • Survey paper
  • Open access
  • Published: 25 January 2017

Conceptualizing Big Social Data

  • Ekaterina Olshannikova 1 ,
  • Thomas Olsson 1 ,
  • Jukka Huhtamäki 2 &
  • Hannu Kärkkäinen 3  

Journal of Big Data volume 4, Article number: 3 (2017)


The popularity of social media and computer-mediated communication has resulted in high-volume and highly semantic data about digital social interactions. This constantly accumulating data has been termed Big Social Data or Social Big Data, and various visions of how to utilize it have been presented. However, as these concepts are relatively new, they have no solid and commonly agreed definitions. We argue that the emerging research field around these concepts would benefit from an understanding of the very substance of the concept and the different viewpoints on it. With our review of earlier research, we highlight various perspectives on this multi-disciplinary field and point out conceptual gaps, the diversity of perspectives and the lack of consensus about what Big Social Data means. Based on a detailed analysis of related work and earlier conceptualizations, we propose a synthesized definition of the term and outline the types of data that Big Social Data covers. With this, we aim to foster future research activities around this intriguing, yet untapped type of Big Data.


We live in an “always-on society” [ 1 – 3 ], meaning that people constantly interact with each other. Due to the rapid development of social computing and the mushrooming of social media services, much social interaction is nowadays mediated by information technology and takes place in the digital realm. An average Internet user consumes and shares large amounts of digital content every day through popular social online services, such as Facebook, Twitter, YouTube, Instagram and Snapchat.

From a data perspective, this has led to the emergence of extensive amounts of human-generated data [ 4 , 5 ] with diverse social uses and rich meanings (for example, communication text, videos for entertainment and self-representation, and the sharing of news and other third-party content in social media). Such unstructured or semi-structured, yet semantically rich data has been argued to constitute 95% of all Big Data [ 6 ]. This explosion of social data has resulted in theorizations and studies on the emerging topic of Big Social Data (BSD).

Broadly speaking, BSD refers to large data volumes that relate to people or describe their behavior and technology-mediated social interactions in the digital realm. The sheer volume and semantic richness of such data open enormous possibilities for utilizing and analyzing it for personal [ 7 , 8 ], commercial [ 9 , 10 ] and societal purposes [ 11 – 13 ]. For example, the scattered social media landscape would benefit from meta-services that bring together all of a user’s content. Commercial uses could include even more targeted advertising, matchmaking services, or many as yet unimagined data-centered business models [ 14 , 15 ]. The search for beneficial applications and services based on BSD has only just begun.

Central concepts and goals of the research

In the research literature, the concept of Big Social Data has been defined and interpreted in many ways for various purposes; for example, the viewpoints from which it has been explored include social media, online social networks, social computing, and computational social science (CSS). The role of these fields in the scope of BSD is discussed in detail in the following sections.

As a rule, BSD is mainly used to extract insights from social media data and people’s online social interactions for descriptive or predictive purposes, in order to influence human decision-making in various application domains [ 16 – 18 ]. In general, researchers have focused on analytics and utilization, paying little attention to clarifying the very concept of BSD and understanding the related phenomena (for example, [ 19 – 21 ]).

In fact, there seems to be a lack of consensus about the definition of BSD and the related terms, as we will analyze in the upcoming sections. Neglecting proper conceptualization may create methodological challenges for researchers, especially in a field as inherently broad and multi-disciplinary as BSD.

Therefore, we argue for conceptual and theoretical work on the concept of BSD in order to inform future research activities and to foster the practical utilization of the data, which may carry social insight. There is a timely need to describe, review, and reflect on the BSD literature in order to bring clarity to the concept and an understanding of the opportunities it offers to practitioners of computational social science and other related research fields.

The potential value of this paper for readers is as follows:

Firstly, through the literature review we aim to clarify the various existing BSD concepts and their definitions. We discuss the relations between BSD and related fields of science in order to inform readers about the domains where this concept is currently applied. We believe these aspects will help researchers properly identify the scope and direction of their investigations on the topic;

Secondly, by providing a synthesized concept and definition of BSD, we want to motivate researchers to develop better conceptualizations and clarifications of what BSD means in their own research. Currently, the majority of papers on the topic focus on analytical tasks and methods without explaining what the researchers consider to be BSD and why. As a step towards a holistic approach to this emerging field, BSD practitioners can use the definition presented in this work, revising it according to their research objectives;

Thirdly, by providing a comprehensive list of BSD types, we aim to inform researchers about the categories of data that are currently available for research and analysis. This serves as a starting point for identifying research opportunities and practical means towards data-driven research. It is worth noting that there is no extensive taxonomy of BSD in the related literature, nor do we aim to design one; however, our classification of such data serves as an inducement for the research community to create this taxonomy collaboratively;

Moreover, by describing the key characteristics of BSD, we differentiate it from the concept of Big Data. In doing so, we expect the emphasis on its unique qualities to open new opportunities for multi-disciplinary research ventures.

In general, we expect this work to draw researchers’ attention to a holistic view of the BSD concept and to help them identify relevant sources of data to use in BSD studies.

Related concepts and literature

Due to the rapid development of online social services and the tremendous growth of data therein, various concepts have emerged in different research fields to help understand digital environments and their social effects. This section reviews concepts relevant to BSD and the relations between them, and outlines the existing literature on the topic (see Fig.  1 ).

Fig. 1 Conceptual map of various BSD/SBD interpretations in the related literature. This illustration depicts four main domains, which were studied by different researchers from various perspectives and at intersections of science fields and data types

There are many interpretations and terms to refer to the “social” aspect in Big Data. The most widespread terms so far are Social Big Data (SBD) and Big Social Data (BSD). Various definitions and approaches are presented and compared in the following, in order to outline the existing research directions.

Big Social Data as science: Ishikawa’s and Pentland’s concepts

Hiroshi Ishikawa is a central proponent of the Social Big Data concept, which he describes and defines in his book as the science of analyzing interconnections between physical-world data and social data for the public good:

“Analyzing both physical real world data (heterogeneous data with implicit semantics such as science data, event data, and transportation data) and social data (social media data with explicit semantics) by relating them to each other, is called Social Big Data science or Social Big Data for short”  [ 22 ].

It is worth noting that Ishikawa is one of the few who provide a proper conceptualization of their ideas and views on the social phenomenon in Big Data. Accordingly, he clarifies the relevant terms, data sources and analytical approaches and supports them with arguments.

Thus, he defines social data as social media data, which, in his opinion, is one kind of Big Data with four V characteristics: volume, variety, velocity and vague. While the first three, along with veracity, are already discussed in multiple studies on Big Data [ 23 – 26 ], vagueness first appears in this book as an essential characteristic of social data. It should not be confused with the vagueness proposed by Venkat Krishnamurthy at the Big Data Innovation Summit in Silicon Valley in 2014, which refers to the confusion over the meaning of Big Data [ 27 – 29 ]. According to Ishikawa, the vagueness characteristic results from combining various types of data for analysis, which leads to inconsistency and deficiency. It also relates to issues of privacy and data management, as social data involves individuals’ personal information.

Additionally, Ishikawa classifies the sources of social media data as follows: blogging, micro blogging, social network services, sharing and video communication services, social news and gaming, social search and crowd sourcing services, and collaboration services. All data in such services would therefore be regarded as Big Social Data.

Ishikawa is interested in the relationships between the physical and cyber worlds. He holds that SBD analysis should be bidirectional, covering the influences of the physical real world on social media and vice versa, in order to develop a complete model (theory). Such a theory may explain interactions between both realms and enable prediction, recommendation and problem solving. In other words, he suggests tracking social media data and physical-world data in order to reveal mutual interdependencies that in turn would yield actual insight. Ishikawa gives the example of traffic authorities predicting public transportation issues in the context of massive social events that are actively discussed in social networks, blogs, news, etc. Thus, data from social media could be analyzed to prevent traffic jams or to increase the amount of public transportation near the event location.

Ishikawa’s thinking is in line with Pentland’s concept of social physics [ 30 ]. According to Pentland, social physics is the “quantitative social science that describes reliable, mathematical connections between information and idea flow on the one hand and people’s behavior on the other”. While Ishikawa aims to bring clarity to analytical techniques for SBD (for example, modeling, data mining, multivariate analysis), Pentland envisions a data-driven society. Even though Pentland does not use the terms SBD or BSD directly in his conceptualization, he defines Big Data as the engine of social physics. He refers to data about human behavior, which consists of both human-generated content (from social media platforms) and data from the physical world (for instance, transactions, locations, call records), which is similar to Ishikawa’s vision of social data sources. The main goal of Pentland’s research is to show how this data, together with social science theories, could be applied in practical settings.

Data-driven approaches to Big Social Data

Guellil and Boukhalfa consider SBD to be a part of social computing [ 31 ]. To differentiate their view of SBD from general Big Data, the authors list certain characteristics, referring to the research of Tang et al. [ 32 ]: “the set of links (due to relationships between users), a nonstructural nature (due to the length of messages required by some microblogging, the presence of spelling mistakes or other) and the lack of completeness (due to certain user requirements for data privacy)”. The authors classify the research works on SBD and discuss various analytical approaches and related challenges.

Guellil and Boukhalfa build their vision of SBD on the works of Barbier [ 33 ], Mukkamala [ 34 ] and Nguyen [ 35 ]. Notably, Mukkamala and Nguyen use the terms SBD and BSD interchangeably and mention only social media data as a major data source. Even though Guellil and Boukhalfa point out the inconsistent use of terms in the related literature, they do not provide a clear conceptualization of SBD in their own research. In fact, the term SBD as used by Guellil and Boukhalfa might be interpreted as a synonym for social media data with qualities such as large volume, noisiness and dynamism, which were already identified in Barbier’s work.

From another perspective, Mark Coté attempts to distinguish the BSD concept from the broader category of Big Data [ 36 ]. In his view, Big Data is any data produced as a result of the quantification of the world, which may include data from sensors, multiple industrial and domestic networks as well as financial markets, whereas BSD “comes from the mediated communicative practices of our everyday lives, whenever we go online, use our smartphone, use an app or make a purchase.” Moreover, Coté provides reasoning for the importance of BSD. According to him, the concept is not novel, but may significantly affect media theory. Among his reasons are: the enormous size of human-generated data, which enables endless future analysis; the symbolic nature of social data, which is challenging to process even though it is produced in structured platform spaces; the highly distributed infrastructure of BSD, which requires scalable computer architecture and network capacity; and challenges related to processing, storage, costs and data regulations.

Purpose-driven approaches: Big Social Data for society

Jean Burgess and Axel Bruns discuss Big Data in terms of social media and use the BSD term to refer to this research area [ 37 ]. Their vision is based on Manovich’s ideas [ 38 ], which focus on bringing the potential of social or cultural data into the humanities and social sciences. Thus, Burgess and Bruns present the BSD concept by noting the shift of Big Data towards media, communication, cultural and computational social science, which has led to a wave of research on digital humanities [ 39 – 41 ]. According to Burgess and Bruns, such changes were “provoked in large part by the dramatic quantitative growth and apparently increased cultural importance of social media”, hence the term “big social data”. Their research aims to clarify the role of social media in the contemporary media ecology, with a focus on communication, societal events and the nature of human engagement, by applying computational methods to Twitter archives. Inspired by Manovich’s concept of BSD, they trialled the feasibility of research on the phenomenon in order to reveal potential technical, political and epistemological issues. They identified ethical concerns as well as challenges of data accessibility, authenticity and reliability. Based on the results, they stated that research on BSD requires the elaboration of mature conceptual models and methodological priorities.

Housley et al. [ 42 ] also take a society-oriented view of Big Data. The authors have been conducting observatory research on the opportunities and challenges of open source social media data in the context of the social sciences. They seek improvements in governance and organization through a sense of civil society by means of ‘big and broad’ social data. According to the authors, the term “big and broad” social data refers to the three V’s (volume, variety, velocity), already well-known dimensions of such data, which may also be real-time and dynamic. Accordingly, social media could be used to empower people’s engagement in civil society through the methodological approach to generating sociological insight proposed in the paper. Housley et al. characterize digital innovations by qualities such as interaction, participation and “social”, which affect the complicated relationships between data and analytical capacity, thus enabling a participatory infrastructure for public sociology. In this regard, the authors point to “citizen social science”, which aims to assist social scientists by reducing the challenges of social media data with the help of citizen volunteers [ 43 ]. Such members of the public may contribute to research by recording their knowledge, opinions and beliefs, thus connecting the social science academy and society [ 44 , 45 ].

Big Social Data as method

Bello-Orgaz et al. [ 46 ] consider SBD to be a combination of Big Data and social media. According to the authors, SBD is needed for the analysis of large amounts of data from diverse social media sources. They theorize the concept as follows: “Those processes and methods that are designed to provide sensitive and relevant knowledge to any user or company from social media data sources when data sources can be characterized by their different formats and contents, their very large size, and the online or streamed generation of information”.

Thus, the conceptual map of SBD by Bello-Orgaz et al. incorporates Big Data as the processing paradigm, social media as the main source of data, and data analysis as the method for gaining and analyzing knowledge. The authors review analytical methodologies for social media as well as new related applications and frameworks.

Summary of the related literature

Even though not all of the above-mentioned papers explicitly use BSD as a term, we consider these works relevant to the topic. Researchers try to clarify the phenomenon of the rapidly growing amount of human-related social data and seek ways to apply it for the good of society, data analytics and various fields of science. The key content of the approaches under discussion and of the theorizations about BSD is summarized in Table 1 .

One central commonality among the existing research directions is the presence of social media as a major data source and an orientation towards analytics. The conceptualizations in these scientific articles vary from fundamentally broad (e.g., Ishikawa [ 22 ] and Pentland [ 30 ]) to vaguely described (e.g., Guellil and Boukhalfa [ 31 ]). Additionally, there are only a few attempts to distinguish the concepts from mere Big Data. Importantly, there is also a lack of clarity regarding the relations between the researchers’ concepts and related fields: it is hard to outline how other sciences affect the scope of BSD/SBD and the directions of studies. Moreover, it is often unclear which data types are considered relevant and valuable for research, and it is hard to understand which data was utilized in the reported research.

We conclude that there are research gaps that researchers of BSD should bridge in order to achieve a holistic understanding of the concept of BSD and its characteristics. For example, it is essential to identify the data types that can be explored and studied in this domain. A sophisticated conceptualization and definition of BSD would help researchers build proper methods to process and analyze it. This is also essential because the growth in human-generated data creates new challenges, which require novel tools, frameworks and methodological approaches as well as multidisciplinary expertise.

Theoretical foundations of Big Social Data

Based on the literature overview we perceive the concept of BSD as a combination of four fields of science: social computing (including social media and social networks), Big Data science and data analytics as fields that enable and contribute to the existence of the data, and CSS as a field that primarily utilizes the data to gain insight and conduct research (see Fig.  2 ).

Fig. 2 Different fields of science contributing to and utilizing Big Social Data as a field. Four main fields of science contribute to and utilize Big Social Data as a research field: social computing, Big Data science, data analytics and CSS.

We emphasize that the concept should be understood in an interdisciplinary way in order to open new research avenues. The current and possible roles of each field of science in the context of BSD are discussed in the following.

Social computing

Social computing is a research and application field that integrates social and computational sciences [ 47 ]. According to Wang et al., the theoretical foundations of social computing incorporate social psychology, sociology, social network analysis and anthropology as well as theories of organization, communication, human–computer interaction and computing. In his work, Kling [ 48 ] addresses the idea of mutual interference between communication technologies and society. Social computing therefore favorably affects the development of both society and technology: on the one hand, it enables smooth socialization and social interaction through various computational systems, and on the other hand, it introduces social practices and theories into the development of computational systems and applications. In terms of BSD, social computing enables services for technology-mediated self-representation [ 49 ] and communication, and supports the building and maintaining of digital relationships through multiple technological infrastructures (for example, Web, database, multimedia and wireless technologies). In summary, social computing approaches the topic from the perspectives of applications, communication and business.

Big data science

Big Data science refers to a field that processes and manages high-volume, high-velocity and high-variety data in order to extract reliable and valuable insights [ 50 ]. Big Data aims to serve large-scale digital applications and computational systems. From the BSD perspective, Big Data science therefore provides solutions for processing and managing data that originates from technology-mediated social interactions in numerous social services and applications in the digital environment. There are both optimistic and realistic approaches to the recent interest in Big Data technology. One group of researchers (as a rule, business-oriented) discusses the potential benefits of utilizing Big Data [ 51 , 52 ] to study massive data about people, things and interactions, while other researchers raise critical questions, assumptions and issues that may arise when accessing such data [ 53 – 55 ]. It is crucial to take a critical view of the BSD concept, because data that primarily relates to digital human interactions is bound to raise controversial challenges (for example, data availability, regulations on accessing data, ethical issues and privacy). In summary, originating from computer science and information systems, Big Data is a broader category than BSD and has a mostly data- and infrastructure-centric perspective, focusing, for instance, on Hadoop, Spark, clusters and related infrastructural work.

Data analytics

Data analytics allows the extraction of insight or conclusions from existing massive data sets. Generally, it includes descriptive (describing data), exploratory (discovering unknown correlations in data), predictive (predicting events and trends) and prescriptive (suggesting actions) methods for gaining meaningful insight in different domains [ 56 , 57 ]. Social Network Analysis (SNA) is one of the most established fields of data analytics [ 58 , 59 ], providing tools, methods and theories for research on social networks in the digital realm. Other central areas that can be relevant for BSD include business analytics [ 60 , 61 ] and sentiment analytics [ 62 , 63 ]. Regardless of the intention and application area of the analysis, data analytics can be said to approach BSD from the perspective of data utilization (for example, service development, gaining insight, decision making).

Computational social science

Defining the concept is only one step towards a proper understanding of BSD. Duncan Watts described the potential of Big Data in the social domain with the claim that “we finally have our telescope” [ 64 ]. However, Macy challenges this statement [ 65 ] by referring to Gintis and Helbing [ 66 ], who point out that just having a telescope is not enough: “We also need to know where to point it, and for that we need the core analytical toolkit... Big data needs big theory” [ 65 ]. In terms of BSD, such a pointer or guide toward theory and meaningful applications is CSS [ 67 ]. This multidisciplinary field seeks theory-grounded models of social phenomena at the intersection of social and computational sciences [ 68 ]. CSS involves collaboration between social, behavioral, cognitive and computer scientists and agent theorists, mathematicians and physicists [ 69 ]. According to Conte, CSS goes beyond the traditional social science tools to unravel social complexity more deeply and from new perspectives [ 70 ]. The author highlights that CSS is not only about variables and equations; the major elements of this science are “people, ideas, human-made artifacts, and their relations within ecosystems”. The theorization and modeling of society by means of computational approaches aims to bring comprehension of social complexity and of the way social systems operate [ 71 ]. Thus, we argue that CSS utilizes BSD in order to “serve the public good and examine the public agenda” [ 72 ]. In other words, CSS can reveal the meaningful and relevant areas for the utilization of BSD, thus pointing out directions for the analysis, making sense of the findings and enabling predictions as well as sensible explanations.

In summary, the aforementioned areas are the central conceptual and theoretical foundations that contribute to the interdisciplinary concept of BSD. Social computing enables and serves technology-mediated social services and applications that in turn generate vast amounts of complex social data; such data are managed and processed through Big Data tools; insights and prescriptions are then derived with data analytics methods and algorithms. CSS is one of the key fields for defining the targets of and reasons for the analysis and for explaining its results.

Our synthesis and definition of Big Social Data

Drawing from our overview of the related literature and observation of contributing science fields we provide a meta-level definition of the synthesized BSD concept as follows:

Big Social Data is any high-volume, high-velocity, high-variety and/or highly semantic data that is generated from technology-mediated social interactions and actions in the digital realm, and which can be collected and analyzed to model social interactions and behavior.

This definition approaches the concept from the synthesized perspective including the description of social data characteristics, its sources and origins as well as purpose of use:

Characteristics: Briefly, in this context, volume refers to the exponential growth of social data. Variety relates to the various types and forms of social data sources: the data may be structured, semi-structured or unstructured. Variety can also mean a difference in formats (for instance, text, image, video). Velocity refers to the fact that social data is generated and distributed at tremendous speed. One can simply count one’s own activity in online services per hour to imagine the frequency with which billions of people create or share something online at this very moment. These characteristics define the size of the social data available for analysis as well as the real-time and dynamic nature of BSD. Volume, velocity and variety are traditional characteristics of any Big Data, while semantics is a characteristic more specific to BSD. It refers to the fact that all manually created content is highly symbolic, with various, often subjective meanings that require intelligent solutions to analyze. There have been studies on mining and analyzing such multimedia data [ 73 – 76 ]; however, we are still far from the degree of intelligence that could turn immense pools of user-generated content into meaningful insights.
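As a rough illustration of how three of these characteristics could be quantified for a collected sample, the following sketch counts items (volume), items per hour (velocity) and items per format (variety). It is only a sketch under stated assumptions: the record layout with `timestamp` and `format` fields and the function name `summarize_sample` are hypothetical and do not come from any particular platform or from the literature reviewed here. The semantic characteristic is deliberately left out, since it cannot be captured by simple counting.

```python
from collections import Counter
from datetime import datetime


def summarize_sample(records):
    """Return volume, velocity (items per hour) and variety (format counts).

    `records` is a hypothetical list of dicts, each with an ISO-8601
    `timestamp` string and a `format` label such as "text" or "image".
    """
    if not records:
        return {"volume": 0, "items_per_hour": 0.0, "formats": {}}

    times = [datetime.fromisoformat(r["timestamp"]) for r in records]
    span_hours = (max(times) - min(times)).total_seconds() / 3600

    # Identical timestamps give no measurable window; fall back to the raw count.
    velocity = len(records) / span_hours if span_hours > 0 else float(len(records))

    return {
        "volume": len(records),
        "items_per_hour": velocity,
        "formats": dict(Counter(r["format"] for r in records)),
    }


# Made-up sample: four items spread over a two-hour window.
sample = [
    {"timestamp": "2016-10-15T10:00:00", "format": "text"},
    {"timestamp": "2016-10-15T10:30:00", "format": "image"},
    {"timestamp": "2016-10-15T11:00:00", "format": "text"},
    {"timestamp": "2016-10-15T12:00:00", "format": "video"},
]
summary = summarize_sample(sample)
print(summary)
```

For a real study the same counts would be computed per service and per time window, because averaging over the whole sample hides the bursts that make social data dynamic.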

Data sources and origins: In the context of BSD, we consider technology-mediated social interactions to be the origins of social data types. This covers digital self-representation, technology-mediated communication and digital relationships data, which may appear not only in social network services but also in a variety of discussion forums, blogs, web and mobile chat applications, multi-player games and different web sites that are not intended for social purposes per se.

Purpose: Analyzing and modeling social interactions and behavior means that researchers may use the data to describe, understand and build models of the digital interactions taking place between people and of how people act (online) around these interactions (for example, profile building, self-expression and other activities that are not directly seen as interaction but, rather, as necessary prerequisites for it). The knowledge gained from the analysis may then be utilized in a variety of applications, meaning that BSD practitioners are free to choose which domain or research question to address. For instance, researchers may aim to solve fundamental societal issues or just explore tweets for the sake of testing new semantic algorithms.

The definition is further explicated in the following subsection with the classification of data types that relate to technology-mediated social interactions.

Types of Big Social Data

We emphasize that a central element of the BSD concept is the “digital human”, who uses Information and Communications Technology (ICT) for digital social interactions. The rapid evolution of ICT has shifted the role of the user from a consumer to an active producer and mediator of information, thus allowing people to control, personalize and apply the digital realm according to their values, social needs and preferences [ 70 ]. We adopt the term “digital human” to underline the shift towards a new sociality that lives in a hybrid reality [ 77 ], where the dynamism and constant availability of technology-mediated communication blur the boundaries between reality and virtuality. Thus, because of “always-on” social networking, people do not distinguish between their activities in online and physical environments. Similarly, Woolgar suggests the term “virtual community” and states that it is just a matter of choosing words: “In this usage, ‘virtual’, like ‘interactive’, ‘information’, ‘global’, ‘remote’, ‘distance’, ‘digital’, ‘electronic’ (or ‘e-’), ‘cyber-’, ‘network’, ‘tele-’, and so on, appears as an epithet applied to various existing activities and social institutions” [ 78 ].

Around digital human interactions, there are both machine-generated and human-generated data that could potentially turn into social insight. However, in this paper we argue that it is precisely human-generated data that makes the BSD concept unique and distinguishes it from the general field of Big Data. While machine-generated data could be analyzed with mere Big Data tools and applications, human-generated content requires more intelligent solutions to decode the semantics of people’s beliefs, opinions and behavior. Undoubtedly, Big Data may show what is changing in social interactions and how; however, it does not answer the question of why those changes and processes are happening. Therefore, we consider BSD to be the solution for properly investigating the semantics of human-generated content. From our perspective, it may provide practitioners in many research fields with both facts and reasoning.

Fig. 3 Overview of BSD types and sources. There are three major data types of Big Social Data: technology-mediated communication data, digital self-representation data and digital relationships data.

When discussing human-generated data, we mean content that is produced through people’s technology-mediated social interactions on social media platforms. This category may contain digital self-representation data, technology-mediated communication data and digital relationships data (see Fig.  3 ). These three categories define the types of data that could be interpreted and utilized as social data in the current digital environment (see Table 2 ). In other words, Table 2 serves as a simplified taxonomy of BSD; however, it is not meant as an exhaustive index of what data is BSD but, rather, as a list of currently existing BSD examples that could be available for research and analysis.

Digital self-representation

The first category to be discussed is digital self-representation. This is the initial step for “digital humans” to socialize and present themselves in the digital realm. These data types relate to the numerous virtual profiles that function as identity depiction and communicative body [ 49 ]. In other words, the data is meant to reveal some information (a “face”) to other users of the particular digital service. Albrechtslund proposes the concept of “sharing yourself”, which relates to the way a constructed identity participates in social networks and creates relations with others [ 79 ]. In the digital environment, people are limited in verbal and non-verbal impressions and compensate for this by means of text, pictures, videos and music, which can be placed in the following data categories:

Profile data: This includes login data (usually a name, nickname or e-mail address with which other people identify the user); identity data (depending on the digital environment; for example, some services require a real first and last name, mobile phone number, country, education and birthday); and personality data (e.g., profile pictures, tags of interest, slogan, personal signature in discussion forums). In many social media services, it is the personality data that other users particularly focus on and analyze to assess how interesting the user is.

Self-published content: This incorporates publicly disclosed or socially restricted data (visible to trusted users or specific communities), such as most status updates in social media, pictures, videos and other content that people add to services to represent themselves.

Data published by the community: Self-representation can be complemented by person-related content shared by other users. This refers to collaboratively created pictures, narrations, videos, etc.

Technology-mediated communication data

Technology-mediated communication data refers to the data generated in two-way communication, collaborative knowledge creation and information distribution in the digital environment, that is, the content and subjects of the communication. Technology mediates the constructed digital self-representation, allowing users to contribute information, edit existing contributions, comment on entries and discuss related matters. From a fundamental perspective, digital environments allow people to contribute to knowledge creation and distribution through various digital devices [ 80 ]. The digital environment facilitates physical communication channels, resulting in private communication (one-to-one), public communication (one-to-many) and collaborative communication (many-to-many) data. Depending on the context, public and collaborative communication can also be private within the group of participants, for example when it takes place in a private channel of an organization.
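The three communication modes can be told apart by how many users contribute to a channel and how many read it, while restricted visibility is a separate property. A minimal sketch of that distinction follows; the names `Mode`, `communication_mode` and `describe` and the parameters `n_contributors`, `n_audience` and `restricted` are hypothetical and introduced only for illustration.

```python
from enum import Enum


class Mode(Enum):
    PRIVATE = "one-to-one"
    PUBLIC = "one-to-many"
    COLLABORATIVE = "many-to-many"


def communication_mode(n_contributors: int, n_audience: int) -> Mode:
    """Classify a channel by how many users write to it and how many read it."""
    if n_contributors < 1 or n_audience < 1:
        raise ValueError("a channel needs at least one contributor and one reader")
    if n_contributors == 1:
        return Mode.PRIVATE if n_audience == 1 else Mode.PUBLIC
    # Several contributors: treated as collaborative regardless of audience size
    # (an assumption; the text does not discuss many-to-one channels).
    return Mode.COLLABORATIVE


def describe(n_contributors: int, n_audience: int, restricted: bool) -> str:
    """Combine the mode with the separate notion of restricted visibility."""
    mode = communication_mode(n_contributors, n_audience)
    visibility = "restricted to group" if restricted else "open"
    return f"{mode.value} ({visibility})"


print(describe(1, 500, restricted=False))  # a public post
print(describe(12, 12, restricted=True))   # an organization's private channel
```

Keeping visibility separate from the mode mirrors the observation that public and collaborative communication may still be private within a group of participants.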

Digital relationships data

Digital relationships data describes the explicit connections and ties between users in the services. Analysis of this data can reveal social relationship patterns, social network structures and various other sociological and network-level phenomena in the digital realm. The digital relationships category firstly contains explicit data, which refers to the digital friendships and followership that a user has intentionally and explicitly defined. Technology-mediated social services provide the possibility to build virtual communities based on both physical and online activities (to create networks based on existing connections in the physical world and/or to create new networks with people from the digital realm). There are two roles for users of such services: followee and follower. One can have followers or friends on various social platforms (Facebook, LinkedIn, Twitter, Instagram, and many others), and can in turn follow someone to maintain friendships or business relationships or to track important content from another relevant user. An interesting factor to be researched is people’s motivation for adding someone to their friend lists. Obviously such lists include friends and colleagues, but they may also contain public figures, interesting strangers or people with weak ties [ 81 – 83 ]. There is also implicit data, which can be revealed through analysis of technology-mediated communication data. For instance, tweets can be analyzed to infer individual connections between people, and from these individual connections we can build network representations of communities at the system level. As another example, two users with multiple common contacts (e.g., friend-of-a-friend) can be predicted to become explicit contacts in the future. When a user has, for example, liked or otherwise interacted with a non-contact user’s content or profile, an implicit tie can be seen to exist between the users [ 82 ]. However, such implicit data normally requires network analysis to be created, and there are few tools or methods that provide such data automatically.
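The two examples of implicit data, ties inferred from tweets and future contacts predicted from common contacts, can be sketched as follows. This is a minimal illustration in plain Python, not a method proposed in this paper: the `(author, text)` tweet layout, the function names `mention_graph` and `common_contact_candidates`, the use of @-mentions as the tie signal and the `min_common` threshold are all assumptions.

```python
import re
from collections import defaultdict
from itertools import combinations

MENTION = re.compile(r"@(\w+)")


def mention_graph(tweets):
    """Build an undirected graph linking each author to the users they mention."""
    graph = defaultdict(set)
    for author, text in tweets:
        for mentioned in MENTION.findall(text):
            if mentioned != author:
                graph[author].add(mentioned)
                graph[mentioned].add(author)
    return graph


def common_contact_candidates(graph, min_common=2):
    """Return unconnected user pairs that share at least `min_common` contacts."""
    candidates = {}
    for a, b in combinations(sorted(graph), 2):
        if b in graph[a]:
            continue  # already tied, nothing to predict
        shared = graph[a] & graph[b]
        if len(shared) >= min_common:
            candidates[(a, b)] = len(shared)
    return candidates


# Made-up tweets for illustration only.
tweets = [
    ("alice", "Great talk by @carol and @dave today"),
    ("bob", "@carol thanks for the slides, cc @dave"),
    ("carol", "@dave see you at the workshop"),
]
graph = mention_graph(tweets)
candidates = common_contact_candidates(graph)
print(candidates)
```

In this toy sample, alice and bob never mention each other but share two contacts, so they are the only pair flagged as a likely future tie. Counting common contacts is one of the simplest link-prediction heuristics; real analyses would typically weight ties by interaction frequency and validate predictions against later explicit connections.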

To summarize, we consider that this list of BSD types can help researchers outline the scope of their interests and guide them towards successful outcomes. Nevertheless, the research community has to remember that the accessibility of such data is a crucial challenge of BSD. The lack of access to data, which is often held by various service providers, hinders the utilization of and research opportunities related to this emerging concept. Thus, researchers should search for ways to collaborate with social media platforms.

Future work

The holistic overview of related concepts, research fields and research communities provides ideas regarding the methodological steps that should be taken to enable further research and utilization activities around BSD. We see a combination of three activities as the primary focus in order to open new avenues for utilization.

Collecting data: The initial step for all researchers who work with BSD is to collect the datasets needed for analysis. This step brings up ethical issues and challenges of data accessibility. Indeed, accessing the data is challenging because it is often held by various service providers, which hinders its utilization. Manovich notes this by stating that ‘only social media companies have access to really large social data’ [ 38 ]. Fortunately, we have recently seen various movements and joint efforts to bring together data that is, in theory, public but very challenging to collect in volumes high enough for research purposes (for example, the OSoMe project, footnote 1, which helps in analyzing Twitter data). One of the most troubling issues relates to ethics: the majority of people are not aware that their data is being collected and analyzed by different organizations (including governments and social media companies). Moreover, the regulations on accessing and using such data are neither clear nor completely unified. There are also practices that may violate privacy: collecting more private data than allowed; accessing data without permission; utilizing data for purposes that differ from the initial purpose of collection; misinterpreting the data; and changing the content. To make the collecting phase feasible, we need to fulfill the next step of our framework.

Collaboration: BSD is a multidisciplinary area that requires practitioners to build a proper team for the work. Our suggestion is to build collaboration with social media platforms or companies that have access to truly large data sets. For instance, research outcomes based on thousands of tweets would be questionable in comparison with research based on billions of human-generated content items from multiple channels. Collaboration with people or companies that have varied expertise and advantages in terms of social data availability can reduce the challenges of collecting data for one’s own study, extend the scale and scope of the work and provide access to multidisciplinary expertise.

Manipulating data: We argue that, to gain meaningful insights from BSD, researchers should design virtual environments in which they can access multiple data types and compare and control them. This may bring new opportunities for authentic and reliable research outcomes. In this regard, we agree with Watts [ 68 ] that we need a ‘social supercollider’ that gathers diverse social data streams, thus opening access to knowledge about people’s behavior on a massive scale. Artificial BSD environments could also give the opportunity to run virtual experiments and validate results with members of the related research community.

This paper aimed to bring clarity to the topic of BSD in general, for any application area. As for our intended future work, we aim to utilize BSD to foster serendipity and, thus, innovativeness in knowledge work organizations. Our objective is to obtain empirical evidence that the analysis of BSD can help identify relevant new people to collaborate with.

The multidisciplinary and multi-dimensional nature of Big Social Data makes it challenging to develop a useful conceptualization and definition of the concept. Our literature overview shows that the majority of related work on BSD focuses on the analysis of social data and gives less attention to describing what BSD actually is. This can lead to a lack of consensus, inconsistency and a vague understanding of what such data could be used for. To bring clarity and a more sophisticated understanding of BSD, we propose a synthesized conceptualization and definition of the concept and of this growing field. We reviewed existing literature that demonstrates a variety of applications and approaches for studying the phenomena around social data. Based on this, we outlined the fields of science that determine the scope of BSD (social computing, Big Data science, data analytics and CSS). We assume that knowledge about the involvement of each field will give researchers an understanding of the expertise required for conducting research in this field. Additionally, we proposed a classification of BSD types that, from our perspective, covers well the spectrum of data that BSD consists of. In summary, with this paper we aim to make researchers more informed about what BSD is and which data to focus on, and to motivate them to elaborate better conceptualizations in order to reach clear and desirable research outcomes.

Observatory on social media (OSoMe) project to study diffusion of information online and discriminate among mechanisms that drive the spread of memes on social media— .

Belsey B. Cyberbullying: an emerging threat to the “always on” generation. Recuperado el. 2005; 5. Retrieved from . Accessed 15 Oct 2016.

Katz JE. Handbook of mobile communication studies. London: The MIT Press; 2008.

Mandiberg M. The social media reader. New York: NYU Press, New York University; 2012.

Monash C. Three broad categories of data. 2010. . Accessed 15 Oct 2016.

Chen W. How to tame big bad data. 2010. . Accessed 15 Oct 2016.

Gandomi A, Haider M. Beyond the hype: big data concepts, methods, and analytics. Int J Inf Manag. 2015;35(2):137–44.

Marwick AE. Status update: celebrity, publicity, and branding in the social media age. New Haven, USA: Yale University Press; 2015.

Freire FC. Online digital social tools for professional self-promotion. A state of the art review. Revista Latina de Comunicación Social. 2015;70:288–99.

Shih C. The facebook era: tapping online social networks to build better products, reach new audiences, and sell more stuff. Upper Saddle River: Prentice Hall; 2009.

Stephen AT, Toubia O. Deriving value from social commerce networks. J Mark Res. 2010;47(2):215–28.

Musacchio M, Panizzon R, Zhang X, Zorzi V. A linguistically-driven methodology for detecting impending disasters and un-folding emergencies from social media messages. In: proceedings of LREC 2016 workshop. EMOT: emotions, metaphors, ontology and terminology during disasters; 2016. p. 26–33.

Aradau C, Blanke T. Politics of prediction: security and the time/space of governmentality in the age of big data. European Journal of Social Theory. 2016:1–19. Retrieved from . Accessed 15 Oct 2016.

Saldana-Perez AMM, Moreno-Ibarra M. Traffic analysis based on short texts from social media. Int J Knowl Soc Res. 2016;7(1):63–79.

Qualman E. Socialnomics: how social media transforms the way we live and do business. Hoboken: Wiley; 2010.

Kennedy H. Commercial mediations of social media data. London: Springer; 2016. p. 99–127.

Golbeck J, Robles C, Turner K. Predicting personality with social media. In: CHI’11 Extended abstracts on human factors in computing systems. Vancouver: ACM; 2011. p. 253–62.

Power DJ, Phillips-Wren G. Impact of social media and Web 2.0 on decision-making. J Decis Syst. 2011;20(3):249–61.

Golbeck J. Big social data predicting the future of you. Executive Tallent Mag. 2014;5:12–4.

Cambria E, Rajagopal D, Olsher D, Das D. Big social data analysis. In: Akerkar R, editor. Big Data Computing. Boca Raton, Florida: Chapman and Hall/CRC; 2013. p. 401–14.

Bravo-Marquez F, Mendoza M, Poblete B. Meta-level sentiment models for big social data analysis. Knowl Based Syst. 2014;69:86–99.

Pandarachalil R, Sendhilkumar S, Mahalakshmi G. Twitter sentiment analysis for large-scale data: an unsupervised approach. Cogn Comput. 2015;7(2):254–62.

Ishikawa H. Social big data mining. Boca Raton: Taylor & Francis Group, CRC Press; 2015.

Sicular S. Gartner’s big data definition consists of three parts, not to be confused with three “V’s”, vol. 27. Stamford: Gartner, Inc; 2013.

Kaisler S, Armour F, Espinosa JA, Money W. Big data: issues and challenges moving forward. In: 2013 46th Hawaii international conference on system sciences (HICSS). New York: IEEE; 2013. p. 995–1004.

Tole AA, et al. Big data challenges. Database Syst J. 2013;4(3):31–40.

Chen M, Mao S, Zhang Y, Leung VC. Big data: related technologies, challenges and future prospects. In: Springerbriefs in computer science. Cham: Springer; 2014.

Borne K. Top 10 big data challenges—a serious Look at 10 big data V’s. 2014. . Accessed 15 Oct 2016.

Kehoe M. What does it take to qualify as ’big data’? 2014. . Accessed 15 Oct 2016.

Moorthy J, Lahiri R, Biswas N, Sanyal D, Ranjan J, Nanath K, Ghosh P. Big data: prospects and challenges. J Decis Makers. 2015;40:74–96.

Pentland A. Social physics: how good ideas spread-the lessons from a new science. New York: The Penguin Press, Penguin Group; 2014.

Guellil I, Boukhalfa K. Social big data mining: a survey focused on opinion mining and sentiments analysis. In: 2015 12th international symposium on programming and systems (ISPS). New York: IEEE; 2015. p. 1–10.

Tang J, Chang Y, Liu H. Mining social media with social theories: a survey. ACM SIGKDD Explor Newsl. 2014;15(2):20–9.

Barbier G, Liu H. Data mining in social media. Berlin: Springer; 2011. p. 327–52.

Mukkamala RR, Hussain A, Vatrapu R. Fuzzy-set based sentiment analysis of big social data. In: Enterprise distributed object computing conference (EDOC), 2014 IEEE 18th international. New York: IEEE; 2014. p. 71–80.

Nguyen DT, Hwang D, Jung JJ. Time–frequency social data analytics for understanding social big data. Cham: Springer; 2015.

Coté M. Data motility: the materiality of big social data. Cult Stud Rev. 2014;20(1):121.

Burgess J, Bruns A. Twitter archives and the challenges of “big social data” for media and communication research. M/C J. 2012;15(5):1–7.

Manovich L. Trending: the promises and the challenges of big social data. Debates Digit Humanit. 2011;2:460–75.

Berry D. Understanding digital humanities. London: Palgrave Macmillan, Springer Nature; 2012.

Kaplan F. A map for big data research in digital humanities. Front Digit Humanit. 2015;2:1.

Svensson P. Big digital humanities: imagining a meeting place for the humanities and the digital. Ann Arbor: University of Michigan Press; 2016.

Housley W, Procter R, Edwards A, Burnap P, Williams M, Sloan L, Rana O, Morgan J, Voss A, Greenhill A. Big and broad social data and the sociological imagination: a collaborative response. Big Data Soc. 2014;1(2):2053951714545135.

Procter R, Housley W, Williams M, Edwards A, Burnap P, Morgan J, Rana O, Klein E, Taylor M, Voss A, Choi C, Mavros P, Hudson Smith A, Thelwall M, Ferne T, Greenhill A. Enabling social media research through citizen social science. In: Korn M, Colombino T, Lewkowicz M, editors. ECSCW 2013 adjunct proceedings, 13th European conference on computer supported cooperative work, 21–25 September 2013, Paphos, Cyprus.

Mossberger K, Tolbert CJ, McNeal RS. Digital citizenship: the internet, society, and participation. London: MIt Press; 2007.

Kullenberg C, Kasperowski D. What is citizen science? A scientometric meta-analysis. PLoS One. 2016;11(1):0147152.

Bello-Orgaz G, Jung JJ, Camacho D. Social big data: recent achievements and new challenges. Inf Fusion. 2016;28:45–59.

Wang F-Y, Carley KM, Zeng D, Mao W. Social computing: from social informatics to social intelligence. IEEE Intell Syst. 2007;22(2):79–83.

Kling R. What is social informatics and why does it matter? Inf Soc. 2007;23(4):205–20.

Boyd D, Heer J. Profiles as conversation: networked identity performance on friendster. In: Proceedings of the 39th annual Hawaii international conference on system sciences (HICSS’06), vol. 3. New York: IEEE; 2006. p. 59.

Demchenko Y, De Laat C, Membrey P. Defining architecture components of the big data ecosystem. In: 2014 international conference on collaboration technologies and systems (CTS). New York: IEEE. 2014. p. 104–12.

Beyer MA, Laney D. The importance of ’big data’: a definition. Stamford: Gartner; 2012. p. 2014–8.

James M, Michael C, Brad B, Jacques B, Richard D, Charles R, Angela H. Big data: the next frontier for innovation, competition, and productivity. New York: The McKinsey Global Institute; 2011.

Boyd D, Crawford K. Critical questions for big data: provocations for a cultural, technological, and scholarly phenomenon. Inf Commun Soc. 2012;15(5):662–79.

Akerkar R. Big data computing. Boca Raton: CRC Press, Taylor & Francis Group; 2013.

Vis F. A critical reflection on big data: considering APIs, researchers and tools as data makers. First Monday. 2013;18(10). Retrieved from Accessed 15 Oct 2016.

Davenport T. Analytics 3.0: In the new era, big data will power consumers products and services. Brighton, MA: Harvard Business Review. Retrieved from 2013. Accessed 15 Oct 2016.

Bendoly E. Fit, bias, and enacted sensemaking in data visualization: frameworks for continuous development in operations and supply chain management analytics. J Bus Logist. 2016;37(1):6–17.

Wasserman S, Faust K. Social network analysis: methods and applications, vol. 8. Cambridge: Cambridge University Press; 1994.

Book   MATH   Google Scholar  

Easley D, Kleinberg J. Networks, crowds, and markets: reasoning about a highly connected world. Cambridge: Cambridge University Press, University of Cambridge; 2010.

Phillips-Wren G, Iyer LS, Kulkarni U, Ariyachandra T. Business analytics in the context of big data. Commun Assoc Inf Syst. 2015;37:448–72.

Duan L, Xiong Y. Big data analytics and business analytics. J Manag Anal. 2015;2(1):1–21.

Chen C, Chen F, Cao D, Ji R. A cross-media sentiment analytics platform for microblog. In: Proceedings of the 23rd ACM international conference on multimedia. New York City: ACM; 2015. p. 767–9.

Boumaiza AD. A survey on sentiment analysis and visualization. In: Qatar foundation annual research conference proceedings, vol. 2016. Doha: HBKU Press Qatar; 2016. p. 1203.

Watts DJ. Everything is obvious: how common sense fails us. New York: Crown Business, Crown Publishing group; 2011.

Macy MW, et al. Big theory: a trojan horse for economics? Rev Behav Econ. 2015;2(1–2):161–6.

Gintis H, Helbing D, Durkheim E, King ML, Smith A. Homo socialis: an analytical core for sociological theory. Rev Behav Econ. 2015;2(1–2):1–59.

Lazer D, Friedman A. The network structure of exploration and exploitation. Adm Sci Q. 2007;52(4):667–94.

Watts DJ. Computational social science: exciting progress and future directions. Bridge Front Eng. 2013;43(4):5–10.

Wallach H. Computational social science: Toward a collaborative future. In: Alvarez RM, editor. Computational social science: Discovery and prediction. USA: Cambridge Universisty Press; 2016. p. 307–16.

Conte R, Gilbert N, Bonelli G, Cioffi-Revilla C, Deffuant G, Kertesz J, Loreto V, Moat S, Nadal J-P, Sanchez A, et al. Manifesto of computational social science. Eur Phys J Spec Top. 2012;214(1):325–46.

Cioffi-Revilla C. Introduction to computational social science: principles and applications. London: Springer; 2013.

MATH   Google Scholar  

Shah DV, Cappella JN, Neuman WR. Big data, digital media, and computational social science possibilities and perils. Ann Am Acad Political Soc Sci. 2015;659(1):6–13.

Zhu X, Wu X, Elmagarmid AK, Feng Z, Wu L. Video data mining: semantic indexing and event detection from the association perspective. IEEE Trans Knowl Data Eng. 2005;17(5):665–77.

Wu P, Hoi SCH, Zhao P, He Y. Mining social images with distance metric learning for automated image tagging. In: Proceedings of the fourth ACM international conference on web search and data mining. New York City: ACM; 2011. p. 197–206.

Hu X, Liu H. Text analytics in social media. New York: Springer; 2012. p. 385–414.

Naaman M. Social multimedia: highlighting opportunities for search and mining of multimedia data in social media applications. Multimed Tools Appl. 2012;56(1):9–34.

e Silva ADS. From cyber to hybrid mobile technologies as interfaces of hybrid spaces. Space Cult. 2006;9(3):261–78.

Woolgar S. Virtual society? Technology, cyberbole reality. New York: Oxford University Press; 2002.

Albrechtslund A. Online social networking as participatory surveillance. First Monday. 2008;13(3). Retrieved from . Accessed 15 Oct 2016.

Ruppert E, Law J, Savage M. Reassembling social science methods: the challenge of digital devices. Theory Cult Soc. 2013;30(4):22–46.

Granovetter MS. The strength of weak ties. Am J Sociology. 1973;78(6):1360–80.

Gilbert E, Karahalios K. Predicting tie strength with social media. In: Proceedings of the SIGCHI conference on human factors in computing systems. New York City: ACM; 2009. p. 211–20.

Haythornthwaite C. Strong, weak, and latent ties and the impact of new media. Inf Soc. 2002;18(5):385–401.

Download references

Authors' contributions

EO performed the primary literature review and analysis for this work and designed the graphics. The manuscript was drafted by EO, TO and JH. EO introduced the topic to the other authors and coordinated the work process to complete the manuscript. EO, TO, JH and HK worked together to develop the article’s framework and focus. All authors read and approved the final manuscript.


We thank all members of the COBWEB project.

Competing interests

The authors declare that they have no competing interests.

This work was supported by the Academy of Finland projects 295893, 295894 and 295895, “Enhancing Knowledge Work and Co-creation with Analysis of Weak Ties in Online Services (COBWEB)”.

Author information

Authors and affiliations.

Department of Pervasive Computing, Tampere University of Technology, Korkeakoulunkatu 10, 33720, Tampere, Finland

Ekaterina Olshannikova & Thomas Olsson

Department of Mathematics, Tampere University of Technology, Korkeakoulunkatu 10, 33720, Tampere, Finland

Jukka Huhtamäki

NOVI research group, Department of Information Management and Logistics, Tampere University of Technology, Korkeakoulunkatu 10, 33720, Tampere, Finland

Hannu Kärkkäinen


Corresponding author

Correspondence to Ekaterina Olshannikova.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article.

Olshannikova, E., Olsson, T., Huhtamäki, J. et al. Conceptualizing Big Social Data. J Big Data 4, 3 (2017).


Received: 01 November 2016

Accepted: 12 January 2017

Published: 25 January 2017



  • Big Social Data
  • Social Big Data
  • Digital human
  • Conceptualization
  • Social Data
  • Social media
  • Classification
  • Big Social Data analysis

big data in social science research


Published: 5 April 2024 Contributors: Tim Mucci, Cole Stryker

Big data analytics refers to the systematic processing and analysis of large amounts of data and complex data sets, known as big data, to extract valuable insights. Big data analytics allows for the uncovering of trends, patterns and correlations in large amounts of raw data to help analysts make data-informed decisions. This process allows organizations to leverage the exponentially growing data generated from diverse sources, including internet-of-things (IoT) sensors, social media, financial transactions and smart devices to derive actionable intelligence through advanced analytic techniques.

In the early 2000s, advances in software and hardware capabilities made it possible for organizations to collect and handle large amounts of unstructured data. With this explosion of useful data, open-source communities developed big data frameworks to store and process this data. These frameworks are used for distributed storage and processing of large data sets across a network of computers. Along with additional tools and libraries, big data frameworks can be used for:

  • Predictive modeling by incorporating artificial intelligence (AI) and statistical algorithms
  • Statistical analysis for in-depth data exploration and to uncover hidden patterns
  • What-if analysis to simulate different scenarios and explore potential outcomes
  • Processing diverse data sets, including structured, semi-structured and unstructured data from various sources.

Four main data analysis methods  – descriptive, diagnostic, predictive and prescriptive  – are used to uncover insights and patterns within an organization's data. These methods facilitate a deeper understanding of market trends, customer preferences and other important business metrics.


The main difference between big data analytics and traditional data analytics is the type of data handled and the tools used to analyze it. Traditional analytics deals with structured data, typically stored in relational databases. This type of database helps ensure that data is well-organized and easy for a computer to understand. Traditional data analytics relies on statistical methods and tools like structured query language (SQL) for querying databases.

Big data analytics involves massive amounts of data in various formats, including structured, semi-structured and unstructured data. The complexity of this data requires more sophisticated analysis techniques. Big data analytics employs advanced techniques like machine learning and data mining to extract information from complex data sets. It often requires distributed processing systems like Hadoop to manage the sheer volume of data.
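To make the idea of distributed processing more concrete, the following is a minimal, single-process sketch of the map, shuffle and reduce steps that underlie the MapReduce model used by Hadoop. It is illustrative only: the sample documents and the function names (`map_phase`, `shuffle`, `reduce_phase`) are assumptions, and a real cluster would run the map and reduce steps in parallel on separate machines.

```python
from collections import defaultdict

# Hypothetical input: each string stands in for a file stored on a different node.
documents = ["big data needs big tools", "data tools for data people"]

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the grouped values for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can in principle be spread across many machines, which is what lets this pattern scale to large data volumes.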

These are the four methods of data analysis at work within big data:

Descriptive analytics

The “what happened” stage of data analysis. Here, the focus is on summarizing and describing past data to understand its basic characteristics.

Diagnostic analytics

The “why it happened” stage. By delving deep into the data, diagnostic analysis identifies the root causes of the patterns and trends observed in descriptive analytics.

Predictive analytics

The “what will happen” stage. It uses historical data, statistical modeling and machine learning to forecast trends.

Prescriptive analytics

The “what to do” stage, which goes beyond prediction to provide recommendations for optimizing future actions based on insights derived from all previous stages.
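The four stages can be sketched on a toy data set. The example below is a rough illustration, not a recommended workflow: the sales and advertising figures are invented, the helper names (`pearson`, `linear_fit`) are hypothetical, and the "prescriptive" step is reduced to a single hand-written rule.

```python
from statistics import mean

# Hypothetical monthly figures (illustrative values only).
sales = [100, 110, 125, 130, 150, 160]
ad_spend = [10, 12, 15, 15, 20, 22]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Descriptive: what happened? Summarize past sales.
summary = {"mean": mean(sales), "min": min(sales), "max": max(sales)}

# Diagnostic: why did it happen? Check how strongly sales move with ad spend.
r = pearson(ad_spend, sales)

# Predictive: what will happen? Extrapolate the linear trend one month ahead.
months = list(range(len(sales)))
slope, intercept = linear_fit(months, sales)
forecast = slope * len(sales) + intercept

# Prescriptive: what to do? A deliberately simple decision rule on the forecast.
action = "increase inventory" if forecast > summary["max"] else "hold inventory"

print(summary, round(r, 3), round(forecast, 1), action)
```

A strong correlation in the diagnostic step suggests, but does not prove, a cause. In practice each stage would use richer models and domain knowledge.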

The following dimensions highlight the core challenges and opportunities inherent in big data analytics.

Volume

The sheer volume of data generated today, from social media feeds, IoT devices, transaction records and more, presents a significant challenge. Traditional data storage and processing solutions are often inadequate to handle this scale efficiently. Big data technologies and cloud-based storage solutions enable organizations to store and manage these vast data sets cost-effectively, protecting valuable data from being discarded due to storage limitations.

Velocity

Data is being produced at unprecedented speeds, from real-time social media updates to high-frequency stock trading records. The velocity at which data flows into organizations requires robust processing capabilities to capture, process and deliver accurate analysis in near real-time. Stream processing frameworks and in-memory data processing are designed to handle these rapid data streams and balance supply with demand.

Variety

Today's data comes in many formats, from structured and numeric data in traditional databases to unstructured text, video and images from diverse sources like social media and video surveillance. This variety demands flexible data management systems to handle and integrate disparate data types for comprehensive analysis. NoSQL databases, data lakes and schema-on-read technologies provide the necessary flexibility to accommodate the diverse nature of big data.

Veracity

Data reliability and accuracy are critical, as decisions based on inaccurate or incomplete data can lead to negative outcomes. Veracity refers to the data's trustworthiness, encompassing data quality, noise and anomaly detection issues. Techniques and tools for data cleaning, validation and verification are integral to ensuring the integrity of big data, enabling organizations to make better decisions based on reliable information.

Value

Big data analytics aims to extract actionable insights that offer tangible value. This involves turning vast data sets into meaningful information that can inform strategic decisions, uncover new opportunities and drive innovation. Advanced analytics, machine learning and AI are key to unlocking the value contained within big data, transforming raw data into strategic assets.

Data professionals, analysts, scientists and statisticians prepare and process data in a data lakehouse, which combines the performance of a data warehouse with the flexibility of a data lake to clean data and ensure its quality. The process of turning raw data into valuable insights encompasses several key stages:

  • Collect data: The first step involves gathering data, which can be a mix of structured and unstructured forms from myriad sources like cloud, mobile applications and IoT sensors. This step is where organizations adapt their data collection strategies and integrate data from varied sources into central repositories like a data lake, which can automatically assign metadata for better manageability and accessibility.
  • Process data: After being collected, data must be systematically organized, extracted, transformed and then loaded into a storage system to ensure accurate analytical outcomes. Processing involves converting raw data into a format that is usable for analysis, which might involve aggregating data from different sources, converting data types or organizing data into structured formats. Given the exponential growth of available data, this stage can be challenging. Processing strategies may vary between batch processing, which handles large data volumes over extended periods, and stream processing, which deals with smaller real-time data batches.
  • Clean data: Regardless of size, data must be cleaned to ensure quality and relevance. Cleaning data involves formatting it correctly, removing duplicates and eliminating irrelevant entries. Clean data prevents the corruption of output and safeguards reliability and accuracy.
  • Analyze data: Advanced analytics, such as data mining, predictive analytics, machine learning and deep learning, are employed to sift through the processed and cleaned data. These methods allow users to discover patterns, relationships and trends within the data, providing a solid foundation for informed decision-making.
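The four stages above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the in-memory CSV, its column names and the function names (`collect`, `process`, `clean`, `analyze`) are all hypothetical, and a production pipeline would read from real sources and write to a storage system rather than pass lists between functions.

```python
import csv
import io
from collections import Counter

# Hypothetical raw export with inconsistent casing, stray whitespace,
# one duplicate row and one row with a missing amount.
RAW_CSV = """customer,region,amount
alice, North ,120.50
bob,south,80
alice, North ,120.50
carol,North,
dave,SOUTH,45.25
"""

def collect(text):
    """Collect: read raw records from a source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))

def process(rows):
    """Process: normalize text fields and convert amounts to floats."""
    out = []
    for row in rows:
        amount = row["amount"].strip()
        out.append({
            "customer": row["customer"].strip().lower(),
            "region": row["region"].strip().lower(),
            "amount": float(amount) if amount else None,
        })
    return out

def clean(rows):
    """Clean: drop rows with a missing amount and exact duplicates."""
    seen, out = set(), []
    for row in rows:
        key = (row["customer"], row["region"], row["amount"])
        if row["amount"] is None or key in seen:
            continue
        seen.add(key)
        out.append(row)
    return out

def analyze(rows):
    """Analyze: total amount per region."""
    totals = Counter()
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)

totals = analyze(clean(process(collect(RAW_CSV))))
print(totals)
```

Keeping each stage as a separate function makes it easier to swap a batch source for a streaming one, or to tighten the cleaning rules, without touching the analysis step.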

Under the Analyze umbrella, there are potentially many technologies at work, including data mining, which is used to identify patterns and relationships within large data sets; predictive analytics, which forecasts future trends and opportunities; and deep learning, which mimics human learning patterns to uncover more abstract ideas.

Deep learning uses an artificial neural network with multiple layers to model complex patterns in data. Unlike traditional machine learning algorithms, deep learning can learn from images, sound and text without manual feature engineering. For big data analytics, this capability makes the volume and complexity of data far less of an obstacle.

Natural language processing (NLP) models allow machines to understand, interpret and generate human language. Within big data analytics, NLP extracts insights from massive unstructured text data generated across an organization and beyond.

Structured data

Structured data refers to highly organized information that is easily searchable and typically stored in relational databases or spreadsheets. It adheres to a rigid schema, meaning each data element is clearly defined and accessible in a fixed field within a record or file. Examples of structured data include:

  • Customer names and addresses in a customer relationship management (CRM) system
  • Transactional data in financial records, such as sales figures and account balances
  • Employee data in human resources databases, including job titles and salaries

Structured data's main advantage is its simplicity for entry, search and analysis, often using straightforward database queries like SQL. However, the rapidly expanding universe of big data means that structured data represents a relatively small portion of the total data available to organizations.
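As a small illustration of how a fixed schema makes structured data easy to query, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `sales` table, its columns and the sample rows are assumptions made for the example.

```python
import sqlite3

# In-memory relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema fixes the name and type of every field in advance.
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, "
    "region TEXT NOT NULL, amount REAL NOT NULL)"
)

# Hypothetical transactional records.
conn.executemany(
    "INSERT INTO sales (customer, region, amount) VALUES (?, ?, ?)",
    [("alice", "north", 120.5), ("bob", "south", 80.0), ("dave", "south", 45.25)],
)

# Because every row has the same fields, aggregation is a single declarative query.
rows = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(rows)
```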

Unstructured data

Unstructured data lacks a pre-defined data model, making it more difficult to collect, process and analyze. It comprises the majority of data generated today, and includes formats such as:

  • Textual content from documents, emails and social media posts
  • Multimedia content, including images, audio files and videos
  • Data from IoT devices, which can include a mix of sensor data, log files and time-series data

The primary challenge with unstructured data is its complexity and lack of uniformity, requiring more sophisticated methods for indexing, searching and analyzing. NLP, machine learning and advanced analytics platforms are often employed to extract meaningful insights from unstructured data.
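A very rough sketch of what imposing structure on free text can look like is shown below: posts are tokenized and word frequencies counted. The sample posts, the tiny stop-word list and the `tokenize` helper are assumptions for illustration. Real NLP pipelines use far more capable tokenizers and models.

```python
import re
from collections import Counter

# Hypothetical social media posts (free text with no predefined fields).
posts = [
    "Loving the new phone, battery life is great!",
    "Battery died after two hours. Not great.",
    "Great camera, great screen, terrible battery.",
]

# A deliberately tiny stop-word list, chosen for this example only.
STOPWORDS = {"the", "is", "after", "not", "a", "new", "two"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens and drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]

# Turning text into counts gives a simple structured view that can be queried.
counts = Counter(tok for post in posts for tok in tokenize(post))
print(counts.most_common(3))
```

Note that plain word counts miss meaning: "Not great" and "great" are counted alike, which is one reason more sophisticated methods are usually needed.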

Semi-structured data

Semi-structured data occupies the middle ground between structured and unstructured data. While it does not reside in a relational database, it contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Examples include:

  • JSON (JavaScript Object Notation) and XML (eXtensible Markup Language) files, which are commonly used for web data interchange
  • Email, where the data has a standardized format (e.g., headers, subject, body) but the content within each section is unstructured
  • NoSQL databases, which can store and manage semi-structured data more efficiently than traditional relational databases

Semi-structured data is more flexible than structured data but easier to analyze than unstructured data, providing a balance that is particularly useful in web applications and data integration tasks.
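The following sketch shows, under simple assumptions, how the tags and markers in semi-structured data can be used to recover fields. The JSON record, the sample email and the `flatten` helper are hypothetical. Only the standard-library `json` and `email` modules are used.

```python
import json
from email import message_from_string

# Hypothetical JSON record: nested keys and a list, but no fixed table schema.
record = json.loads(
    '{"user": {"id": 7, "name": "Ada"}, "tags": ["a", "b"], "active": true}'
)

def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into a single dict with dotted keys."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat

flat_record = flatten(record)
print(flat_record)

# Hypothetical email: the headers are structured, the body is free text.
raw_email = (
    "From: ada@example.org\n"
    "Subject: Quarterly numbers\n"
    "\n"
    "Hi team, the figures look fine to me."
)
msg = message_from_string(raw_email)
print(msg["Subject"], "|", msg.get_payload())
```

Flattening is one of several ways to load such records into a table. Document-oriented stores can instead keep the nesting as it is.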

Ensuring data quality and integrity, integrating disparate data sources, protecting data privacy and security and finding the right talent to analyze and interpret data can present challenges to organizations looking to leverage their extensive data volumes. What follows are the benefits organizations can realize once they see success with big data analytics:

Real-time intelligence

One of the standout advantages of big data analytics is the capacity to provide real-time intelligence. Organizations can analyze vast amounts of data as it is generated from myriad sources and in various formats. Real-time insight allows businesses to make quick decisions, respond to market changes instantaneously and identify and act on opportunities as they arise.

Better-informed decisions

With big data analytics, organizations can uncover previously hidden trends, patterns and correlations. A deeper understanding equips leaders and decision-makers with the information needed to strategize effectively, enhancing business decision-making in supply chain management, e-commerce, operations and overall strategic direction.  

Cost savings

Big data analytics drives cost savings by identifying business process efficiencies and optimizations. Organizations can pinpoint wasteful expenditures by analyzing large datasets, streamlining operations and enhancing productivity. Moreover, predictive analytics can forecast future trends, allowing companies to allocate resources more efficiently and avoid costly missteps.

Better customer engagement

Understanding customer needs, behaviors and sentiments is crucial for successful engagement and big data analytics provides the tools to achieve this understanding. Companies gain insights into consumer preferences and tailor their marketing strategies by analyzing customer data.

Optimized risk management strategies

Big data analytics enhances an organization's ability to manage risk by providing the tools to identify, assess and address threats in real time. Predictive analytics can foresee potential dangers before they materialize, allowing companies to devise preemptive strategies.

As organizations across industries seek to leverage data to drive decision-making, improve operational efficiencies and enhance customer experiences, the demand for skilled professionals in big data analytics has surged. Here are some prominent career paths that utilize big data analytics:

Data scientist

Data scientists analyze complex digital data to assist businesses in making decisions. Using their data science training and advanced analytics technologies, including machine learning and predictive modeling, they uncover hidden insights in data.

Data analyst

Data analysts turn data into information and information into insights. They use statistical techniques to analyze and extract meaningful trends from data sets, often to inform business strategy and decisions.

Data engineer

Data engineers prepare, process and manage big data infrastructure and tools. They also develop, maintain, test and evaluate data solutions within organizations, often working with massive datasets to assist in analytics projects.

Machine learning engineer

Machine learning engineers focus on designing and implementing machine learning applications. They develop sophisticated algorithms that learn from and make predictions on data.

Business intelligence analyst

Business intelligence (BI) analysts help businesses make data-driven decisions by analyzing data to produce actionable insights. They often use BI tools to convert data into easy-to-understand reports and visualizations for business stakeholders.

Data visualization specialist

These specialists focus on the visual representation of data. They create data visualizations that help end users understand the significance of data by placing it in a visual context.

Data architect

Data architects design, create, deploy and manage an organization's data architecture. They define how data is stored, consumed, integrated and managed by different data entities and IT systems.



Big Data and Infectious Disease Epidemiology: Bibliometric Analysis and Research Agenda

Lateef Babatunde Amusa

1 Centre for Applied Data Science, University of Johannesburg, Johannesburg, South Africa

2 Department of Statistics, University of Ilorin, Ilorin, Nigeria

Hossana Twinomurinzi

Edith Phalane

3 Pan African Centre for Epidemics Research (PACER) Extramural Unit, South African Medical Research Council/University of Johannesburg, Johannesburg, South Africa

4 Department of Environmental Health, Faculty of Health Sciences, University of Johannesburg, Johannesburg, South Africa

Refilwe Nancy Phaswana-Mafuya

Infectious diseases represent a major challenge for health systems worldwide. With the recent global pandemic of COVID-19, the need to research strategies to treat these health problems has become even more pressing. Although the literature on big data and data science in health has grown rapidly, few studies have synthesized these individual studies, and none has identified the utility of big data in infectious disease surveillance and modeling.

The aim of this study was to synthesize research and identify hotspots of big data in infectious disease epidemiology.

Bibliometric data from 3054 documents that satisfied the inclusion criteria retrieved from the Web of Science database over 22 years (2000-2022) were analyzed and reviewed. The search retrieval occurred on October 17, 2022. Bibliometric analysis was performed to illustrate the relationships between research constituents, topics, and key terms in the retrieved documents.

The bibliometric analysis revealed internet searches and social media as the most utilized big data sources for infectious disease surveillance or modeling. The analysis also placed US and Chinese institutions as leaders in this research area. Disease monitoring and surveillance, utility of electronic health (or medical) records, methodology framework for infodemiology tools, and machine/deep learning were identified as the core research themes.


Proposals for future studies are made based on these findings. This study will provide health care informatics scholars with a comprehensive understanding of big data research in infectious disease epidemiology.


Globally, the infectious disease burden continues to be substantial in countries with low and lower-middle income, while morbidity and mortality related to neglected tropical diseases and HIV infection, tuberculosis, and malaria remain high. Tuberculosis and malaria are endemic to many areas, imposing substantial but steady burdens. At the same time, other infections such as influenza fluctuate in pervasiveness and intensity, disrupting developing and developed settings alike when an outbreak or epidemic occurs. Additionally, deaths have persisted over the 21st century due to emerging and reemerging infectious diseases compared with seasonal and endemic infections. This portrays a new era of infectious disease, defined by outbreaks of emerging, reemerging, and endemic pathogens that spread quickly with the help of global mobility and climate change [ 1 ].

Moreover, the risk from infectious diseases is globally shared. While infectious diseases thrive in underresourced settings, inequalities and inequities in accessing health and health care create a favorable environment for infectious diseases to spread [ 2 , 3 ]. Addressing inequalities and inequities in accessing health care, and improving surveillance and monitoring of infectious diseases should be prioritized to minimize the emergence and spread of infections.

Recent years have witnessed the rapid emergence of big data and data science research, propelled by the increasing availability of digital traces [ 4 ]. The growing availability of electronic records and passive data generated by social media, the internet, and other digital sources can be mined for pattern discoveries and knowledge extraction. Like most buzz words, big data has no straightforward meaning and its definition is evolving. Broadly, big data refers to a large volume of structured or unstructured data, with largeness itself associated with three major terms known as the “3 Vs”: volume (large quantity), velocity (coming in at unprecedented real-time speeds), and variety (increasing collection from different data sources). Additional characteristics of big data include veracity, validity, volatility, and value [ 5 ]. For epidemiology and infectious diseases research, this means that in the last decade, there has been a significant spike in the number of studies with considerable interest in using digital epidemiology and big data tools to enhance health systems in terms of disease surveillance, modeling, and evidence-based responses [ 4 , 6 - 8 ]. Digital epidemiology uses digital data or online sources to gain insight into disease dynamics and health equity, and to inform public health programs and policies [ 9 , 10 ].

The success of infectious disease control relies heavily on surveillance systems tracking diseases, pathogens, and clinical outcomes [ 11 ]. However, conventional surveillance systems are known to frequently have severe time lags and limited spatial resolution; therefore, surveillance systems that are robust, local, and timely are critically needed. It is crucial to monitor and forecast emerging and reemerging infections [ 12 ] such as severe acute respiratory syndrome, pandemic influenza, Ebola, Zika, and drug-resistant pathogens, especially in resource-limited settings such as low-middle–income countries. Using big data to strengthen surveillance systems is critical for future pandemic preparedness. This approach provides big data streams that can be triangulated with spatial and temporal data. These big data streams include digital data sources such as mobile health apps, electronic health (or medical) records, social media, internet searches, mobile phone network data, and GPS mobile data. Many studies have demonstrated the usefulness of real-time data in health assessments [ 13 - 18 ]. Some of these studies have been used explicitly for the monitoring and forecasting of epidemics such as COVID-19 [ 19 ], Zika [ 13 ], Ebola [ 16 ], and influenza [ 14 ].

The body of extant literature at the nexus of big data, epidemiology, and infectious diseases is rapidly growing. However, despite its growth and dispersion, there has been limited synthesis of the applications. A previous study [ 20 ] performed a bibliometric analysis focusing only on HIV. A bibliometric analysis is a statistical or quantitative analysis of large-scale bibliographic metadata (or metrics of published studies) on a given topic. These quantitative analyses detect patterns, networks, and trends among the bibliographic metadata [ 21 , 22 ]. Thus, the aim of this study was to address the evolution of big data in epidemiology and infectious diseases to identify gaps and opportunities for further research. The study findings reveal interesting patterns and can inform trending research focus and future directions in big data–driven infectious diseases research.

Study Design

A bibliometric analysis was performed to understand and explore research on big data in infectious disease modeling and surveillance. The adopted bibliometric methodology involved three main phases: data collection, data analysis, and data visualization and reporting [ 23 ].

Search Strategy

Regarding data collection, which entails querying and exporting data from selected databases, we queried the Web of Science (WoS) core databases for publications using specific inclusion and exclusion criteria. Compared to other databases, the WoS has been shown to have better quality bibliometric information [ 23 , 24 ] and greater coverage of high-impact journals [ 25 ]. With the aid of domain knowledge experts from the fields of both big data and epidemiology, we iteratively developed a search strategy and selected the search terms. The following search string queried all documents’ titles, abstracts, and keywords, and generated 3235 publications in the WoS collection:

(Epidemic* OR “infectious disease*” OR “Disease surveillance” OR “disease transmission” OR “disease outbreak*” OR (“communicable disease*” NOT “non-communicable disease”) OR syndemic* OR HIV OR AIDS OR “human immunodeficiency virus” OR coronavirus* OR SARS-CoV-2 OR COVID-19 OR Influenza OR flu OR Zika OR Ebola OR MERS OR “Middle East respiratory syndrome” OR Tuberculosis OR “Monkey Pox” OR “Dengue virus” OR Hepatitis*)
(“BIG DATA” OR “web mining” OR “opinion mining” OR “Google Trend*” OR “Google search*” OR “Google quer*” OR “Internet search*” OR “Internet quer*” OR “search engine quer*” OR “Digital traces” OR “electronic health records” OR “Digital epidemiology”)

Screening Strategy

Documents that were not written in English or were not peer-reviewed, including editorial materials, letters, meeting abstracts, news items, and book reviews, as well as retracted publications, were removed from the data set given the focus on bibliometric analysis, leaving 3054 documents for the analytic sample ( Figure 1 ).

Figure 1. Flow chart of the literature selection process.

The 3054 bibliographic records were exported into the R package bibliometrix [ 23 ] for analysis. This package was specifically used to conduct performance analysis and science mapping of big data in infectious disease epidemiology. Performance analysis evaluates the production and impact of research constituents, including authors, institutions, countries, and journals. Science mapping examines the relationships between the research constituents by analyzing the topic’s conceptual, intellectual, and social structure.

There are several metrics available for bibliometric analysis. In this study, the primary metrics used for evaluating productivity and influence were the H-index and M-index. The H-index is the largest number h such that h published papers have each been cited at least h times [ 26 ]. The H-index can be computed for different bibliometric units of analysis: authors, journals, institutions, and countries. The M-index adjusts the H-index for academic age by dividing it by the number of years since the researcher’s first publication. Other utilized performance analysis metrics were obtained from yearly research output and citation counts. These metrics also contribute to identifying the main themes and the key actors in the research area.
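The two indices can be illustrated with a short sketch. The function names, the citation counts, and the publication years below are hypothetical, and academic age is taken here as the plain difference in years (some definitions add one), so this is an illustration of the definitions rather than the computation performed by bibliometrix.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def m_index(citations, first_pub_year, current_year):
    """H-index divided by academic age in years (assumed to be at least 1)."""
    academic_age = max(current_year - first_pub_year, 1)
    return h_index(citations) / academic_age


# Hypothetical citation counts for one author's papers.
example_citations = [25, 8, 5, 3, 3, 1, 0]
example_h = h_index(example_citations)
example_m = m_index(example_citations, first_pub_year=2012, current_year=2022)
```

With these made-up counts, three papers have at least three citations each but there are not four papers with at least four, so the H-index is 3; over an assumed academic age of 10 years the M-index is 0.3.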

In terms of science mapping, network maps were constructed for some selected bibliographic units of analysis [ 27 ]. These networks exhibit frequency distributions of the involved bibliographic data over time. For instance, international collaborations can be explored by assessing same-country publications. A cocitation network analysis was also used to analyze publication references. In addition, a co-occurrence analysis based on the Louvain clustering algorithm, a greedy modularity optimization technique [ 28 ], was used to understand the conceptual structure of the research area. The basic purpose of co-occurrence analysis is to investigate the link between keywords based on the number of times they appear together in a publication. Notable research topics and over-time trends were detected by generating clusters for author-provided keywords [ 29 ]. VOSviewer [ 30 ] was used to construct the network visualizations. Each network node represents a research constituent (eg, author, country, institution, article, document source, keyword). The node’s size is proportional to the occurrence frequency of the relevant parameters. The degree of association is represented by the thickness of the link between nodes, and the various colors reflect distinct clusters.
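The counting step behind a keyword co-occurrence network can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name, the keyword lists, and the threshold value are hypothetical, and the clustering step (Louvain) and the visualization (VOSviewer) are not reproduced here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(keyword_lists, min_occurrences=1):
    """Count how often each keyword pair appears in the same publication.

    Only keywords that occur in at least `min_occurrences` publications
    are kept, mirroring an occurrence threshold on network nodes.
    """
    # Normalise and de-duplicate keywords within each publication.
    cleaned = [sorted({kw.strip().lower() for kw in kws}) for kws in keyword_lists]
    occurrences = Counter(kw for kws in cleaned for kw in kws)
    kept = {kw for kw, n in occurrences.items() if n >= min_occurrences}

    pairs = Counter()
    for kws in cleaned:
        for a, b in combinations([kw for kw in kws if kw in kept], 2):
            pairs[(a, b)] += 1
    return occurrences, pairs


# Hypothetical author keywords for four publications.
papers = [
    ["COVID-19", "big data", "machine learning"],
    ["COVID-19", "Google Trends", "infodemiology"],
    ["big data", "machine learning", "deep learning"],
    ["COVID-19", "big data"],
]
node_sizes, link_weights = cooccurrence_counts(papers, min_occurrences=2)
```

Here `node_sizes` corresponds to the occurrence frequency that scales each node, and `link_weights` to the link thickness between two keywords.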

Descriptive Summary

The bibliographic data set comprises 3054 documents from 1600 sources, 14,351 authors, and 121,726 references. Of the 3054 documents, 2666 (87.30%) were original research articles and the remaining 388 (12.70%) were review papers. The research output before 2009 was relatively low. The annual publication output during the 27 years (1995-2022) grew steadily, with a yearly growth rate of 26.5%. Publication growth was steepest between 2013 and 2020 ( Figure 2 ). Table 1 presents the summary statistics of the primary characteristics of these 3054 publications, including the time span and information about documents and authors.
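The document-type shares above can be checked directly, and a yearly growth rate is commonly summarized as a compound annual growth rate. The text does not state which formula produced the 26.5% figure, so the growth function below is only an assumed, illustrative definition, and the yearly counts passed to it are hypothetical.

```python
def compound_annual_growth_rate(first_count, last_count, n_years):
    """Percent growth per year that turns first_count into last_count over n_years."""
    if first_count <= 0 or n_years <= 0:
        raise ValueError("first_count and n_years must be positive")
    return ((last_count / first_count) ** (1 / n_years) - 1) * 100


def share(part, total):
    """Percentage share rounded to two decimals."""
    return round(100 * part / total, 2)


# Document-type shares reported in the text.
articles_share = share(2666, 3054)
reviews_share = share(388, 3054)

# Hypothetical yearly counts: 1 paper in the first year, 8 papers three years later.
example_rate = compound_annual_growth_rate(1, 8, 3)
```

In the hypothetical example, output doubles each year for three years, which corresponds to a growth rate of 100% per year.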

Figure 2. Annual growth of publications related to big data in infectious diseases research.

Table 1. Main descriptive summary of the extracted bibliographic records from 1995 to 2022.

As shown in Table 2 , the most productive and influential sources publishing on topics related to big data and infectious diseases epidemiology were Journal of Medical Internet Research and PLoS One (H-index=18), followed by IEEE Access (H-index=13). In terms of productivity, Journal of Medical Internet Research produced a slightly higher number of publications (n=61) than the next best journal PLoS One (n=56). PLoS One had the highest number of total citations at 1893.

Table 2. Top 10 productive and influential publication sources ranked by H-index.

As shown in Table 3 , the most productive and influential author was Zhang Y (H-index=17), followed by Li X (H-index=13) and Wang J (H-index=12). Wang L had the highest total citations (n=1072), which was substantially higher than the next most impactful author Wang J (total citations=861).

Table 3. Top 10 productive and influential authors ranked by H-index and total citations.

a Not available.

The aims and scopes of the top 10 most influential journals, as listed in Table 2 , are to publish medical research, medical informatics, or multidisciplinary studies. It can thus be inferred that major future breakthroughs regarding big data in infectious diseases epidemiology will likely appear in these journals.

Figure 3 displays the top 20 most productive institutions. Institutional contributions were assessed by counting affiliations with at least one author in the publication. The top three institutions accounted for 21.3% of the publications among the top 20; apart from the University of California, these were medical schools: Harvard Medical School (7.9%) and Icahn School of Medicine at Mount Sinai (6.4%). Columbia University and Oxford University, each accounting for more than 6% of the total, completed the top 5. The other institutions in the top 20 are research universities with differing emphases: London School of Hygiene and Tropical Medicine focuses on global and public health, Taipei Medical University is medical-based, and Huazhong University of Science and Technology is focused on science and technology. US institutions made up the majority of the top 10 most productive institutions and most of the top 5.

Figure 3. Top 20 institutions by number of publications. CALIF: California; HARVARD MED SCH: Harvard Medical School; ICAHN SCH MED MT SINAI: Icahn School of Medicine at Mount Sinai; LONDON SCH HYG AND TROP MED: London School of Hygiene & Tropical Medicine; PENN: Pennsylvania; UNIV: University.

The 20 most productive countries ( Figure 4 ) are led by the United States and China, which together accounted for more than half (57.3%) of the total publication output. The United States alone accounted for 41.1% of the output in this field. The other countries in the top five were the United Kingdom (9.4%), India (4.4%), and Canada (3.3%).

Figure 4. Top 20 productive countries by number of publications.

Computer science was the most productive research domain in the bibliographic collection ( Figure 5 ), accounting for 17.6% of the top 10 subject areas. In order of productivity, the other research subjects in the top 5 were public environmental and occupational health (11.4%), health care services (9.6%), medical informatics (9.0%), and engineering (8.8%).

Figure 5. Top 10 key subject areas by number of publications.

Two major clusters of countries represent the collaboration patterns of the most productive countries ( Figure 6 ). The network was set to include only countries with at least 10 documents, resulting in 50 productive countries. The clustering results demonstrated a demarcation of European countries from the others. For instance, cluster 1 (red) represented most countries from Europe, with England, Germany, and Spain being the core countries. Non-European countries constituted the second cluster (green). The United States and China were the core countries of this group.

Figure 6. Network of country collaborations (≥10 documents, 50 countries, 2 clusters).

Regarding collaboration strength, the United States, with a total link strength of 570, featured the highest number of partners (48), accounting for almost all 50 countries in the network (96%). China, which distantly followed the United States, featured 38 partners and a total link strength of 304. This implies that collaboration is mainly regional.
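To make the two collaboration measures concrete, the sketch below counts, for each country, its number of distinct partners and its total link strength. It assumes that a link between two countries is weighted by the number of papers on which both appear and that total link strength is the sum of a country's link weights; the function name and the paper list are hypothetical, and the exact counting scheme used by the network tool may differ.

```python
from collections import defaultdict
from itertools import combinations


def collaboration_stats(paper_countries):
    """Return {country: (number_of_partners, total_link_strength)}.

    A link between two countries is weighted by the number of papers
    on which both appear; total link strength sums a country's weights.
    """
    weights = defaultdict(int)
    for countries in paper_countries:
        for a, b in combinations(sorted(set(countries)), 2):
            weights[(a, b)] += 1

    partners = defaultdict(set)
    strength = defaultdict(int)
    for (a, b), w in weights.items():
        partners[a].add(b)
        partners[b].add(a)
        strength[a] += w
        strength[b] += w
    return {c: (len(partners[c]), strength[c]) for c in partners}


# Hypothetical author-affiliation countries for four papers.
papers = [
    ["USA", "China"],
    ["USA", "England", "Germany"],
    ["USA", "China"],
    ["England", "Germany"],
]
stats = collaboration_stats(papers)
```

In this toy input the USA has three partners and a total link strength of 4, whereas China has one partner and a total link strength of 2, which shows how the two measures can diverge.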

Figure 7 shows a network map of cocited references in this research area, wherein the node’s size represents the citation strength of the individual studies. The network was set to include only studies with at least 25 citations, resulting in 37 studies. Ginsberg et al [ 31 ] published the most highly cited article (185 citations). This 13-year-old study presented a method that used Google search queries to track flu-like illnesses in a population. The second most cited study by Eysenbach [ 9 ] introduced the concept of infodemiology, the science of using the internet (eg, social media, search engines, blogs, and websites) to inform public health and public policy. Table 4 further summarizes the top 15 most cited references, including the title, year of publication, number of citations, type of disease, and data source.

Figure 7. Network of cocited references.

Table 4. Summary of the top 15 most cited references.

a NA: not applicable (eg, a review paper, no particular disease or data source for a case study).

b Online platform of real-time COVID-19 cases in China.

c Internet searches include Google Trends and Baidu Index.

d Weibo is a China-based social media platform.

The 37 studies in the network map of cocited references produced four thematic clusters ( Figure 7 ). The main topics discussed were disease monitoring and surveillance (cluster 1), utility of electronic health (or medical) records (cluster 2), methodology framework for infodemiology tools (cluster 3), and machine learning and deep learning methods (cluster 4).

Keyword co-occurrence analysis serves as a supplement to enrich the understanding of the thematic clusters derived from the reference cocitation analysis and helps identify the core topics and contents [ 29 ]. As shown in Figure 8 , the co-occurrence network displayed 100 relevant keywords after assigning a selection threshold of 10 for the number of keyword occurrences. The top 5 most frequently used keywords were COVID-19, big data, machine learning, coronavirus, and electronic health records.

Figure 8. Co-occurrence networks of author keywords.

The 100 author-derived keywords produced four clusters from the coword analysis ( Figure 8 ). Cluster 1 (yellow-green) is related to public health and infectious diseases, with top keywords such as COVID-19, SARS-CoV-2, epidemiology, and epidemics. Cluster 2 (green) is related to electronic storage and delivery of health care, with top keywords including electronic health records, clinical decision support, primary care, epidemiology, and telemedicine. Cluster 3 (blue) involves infodemiology tools, with top keywords including coronavirus, google trends, social media, infodemiology, and surveillance. Cluster 4 (red) is more coherent and broadly related to big data and artificial intelligence, including top keywords big data, machine learning, artificial intelligence, deep learning, and big data analytics.

Systematic Review of the Top 20 Papers

Further filtering of the top 20 papers was performed to determine if they met the following criteria: (1) addressed at least one infectious disease and (2) utilized a big data source. A review of these 20 papers (summarized in Table 5 ) was then performed. These selected studies were mainly characterized by papers that utilized novel data sources, including internet search engine data (Google Trends: n=11; Baidu or Weibo index: n=2; Yahoo: n=1) and social media data (Twitter: n=5). Other data sources included electronic health or medical records (n=3) and Tencent migration data (n=1). The most frequently studied diseases were COVID-19 (n=10) [ 35 , 36 , 39 , 42 , 45 - 50 ], followed by influenza (n=8) [ 37 , 40 , 43 , 44 , 51 - 54 ]. Only one study considered the Zika virus [ 55 ], and another considered the trio of meningitis, legionella pneumonia, and Ebola [ 56 ].

Table 5. Summary of top 20 studies that addressed an infectious disease and utilized a big data source.

Principal Findings

Novel big data streams have created interesting opportunities for infectious disease monitoring and control. The review of the top 20 papers suggests the dominance of high-volume electronic health records and digital traces such as internet searches and social media. Of note is the relatively high use of Google Trends. Most studies used Google Trends data by correlating them with official data on disease occurrence, spread, and outbreaks. Some of these studies further adopted nowcasting for disease surveillance. However, the use of Google Trends for forecasts and predictions in infectious diseases epidemiology remains a gap in the extant literature. Few studies have gone as far as predicting incidents and occurrences, even though data on reported cases of various health concerns and the associated Google Trends data have been correlated in many studies. Predicting the future is hard; hence, more reliable and efficient methodologies are needed for forecasting infectious disease outbreaks.

There are a few drawbacks to digital trace data that should be considered. Many of these data streams lack demographic information such as age and gender, which is essential in almost any epidemiological study. In addition, they represent a growing but still limited population segment, with infants absent and older adults underrepresented relative to younger people. Geographic heterogeneity in coverage exists, with underrepresentation in developing countries, although these biases tend to fade and are arguably less pronounced than those found in traditional global surveillance systems. Further, the retrieved data are subject to spatial and temporal uncertainty. Accordingly, hybrid systems should be developed that supplement rather than replace conventional surveillance systems and that improve prospects for accurate infectious disease models and forecasts.

Most studies, except for those in the United States and China, were conducted in the European context. Thus, more studies need to test the utility of these big data streams for infectious disease epidemiology in the context of more countries, especially in Africa. Future research questions should ask if any cross-cultural differences between countries affect the adoption and use of big data in infectious disease epidemiology.

The vast majority of infectious diseases have a global distribution. Apart from the coronavirus, influenza, Zika, and Ebola virus outbreaks that are featured in our review, the utility of these big data sources for more infectious diseases should be studied.


A few limitations were inherent in our study. First, like any bibliometric study, we are limited by the search terms and database used. This study utilized English publications from the WoS core collection; therefore, relevant publications may have been missed. However, our choice of WoS was informed by its greater coverage of high-impact journals. Second, some studies may have been published after we concluded document extraction. Accordingly, this study does not claim to be exhaustive but rather extensive.

Future Research Agenda and Conclusions

The bibliometric study identified the United States and China as research leaders in this field, with most affiliations from the Harvard Medical School and the University of California. Top authors were Zhang Yi and Li Xingwang. Journal of Medical Internet Research and PLoS One are the most productive and influential journals in this field. Internet searches and social media data are the most utilized data sources. COVID-19 and influenza were the most studied infectious diseases. The main research themes in this area of research were disease monitoring and surveillance, utility of electronic health (or medical) records, methodology framework for infodemiology tools, and machine/deep learning. Most research papers on big data in infectious diseases epidemiology were published in outlets related to computer science, public health, and health care services.

Opportunities for future research follow directly from the results of this study. Integrating multiple surveillance platforms, including big data tools, is critical to better understanding pathogen spread. It is also paramount that research needs align with a global view of disease risk. The risk of infectious disease is globally shared in an increasingly connected world. The COVID-19 pandemic, including the rapid global circulation of evolved strains, has emphasized the need for an interdisciplinary, collaborative, global framework for infectious disease research and control. There is a need to empower epidemiologists and public health scientists to leverage insights from big data for infectious disease prevention and control.


Conflicts of Interest: None declared.

Title: Subtle Biases Need Subtler Measures: Dual Metrics for Evaluating Representative and Affinity Bias in Large Language Models

Abstract: Research on Large Language Models (LLMs) has often neglected subtle biases that, although less apparent, can significantly influence the models' outputs toward particular social narratives. This study addresses two such biases within LLMs: representative bias, which denotes a tendency of LLMs to generate outputs that mirror the experiences of certain identity groups, and affinity bias, reflecting the models' evaluative preferences for specific narratives or viewpoints. We introduce two novel metrics to measure these biases: the Representative Bias Score (RBS) and the Affinity Bias Score (ABS), and present the Creativity-Oriented Generation Suite (CoGS), a collection of open-ended tasks such as short story writing and poetry composition, designed with customized rubrics to detect these subtle biases. Our analysis uncovers marked representative biases in prominent LLMs, with a preference for identities associated with being white, straight, and men. Furthermore, our investigation of affinity bias reveals distinctive evaluative patterns within each model, akin to 'bias fingerprints'. This trend is also seen in human evaluators, highlighting a complex interplay between human and machine bias perceptions.

A big data analysis of the adoption of quoting encouragement policy on Twitter during the 2020 U.S. presidential election

  • Research Article
  • Open access
  • Published: 19 May 2024

  • Amirhosein Bodaghi 1 &
  • Jonathan J. H. Zhu 2

This research holds significance for the fields of social media and communication studies through its comprehensive evaluation of Twitter’s quoting encouragement policy enacted during the 2020 U.S. presidential election. In addressing a notable gap in the literature, this study introduces a framework that assesses both the quantitative and qualitative effects of specific platform-wide policy interventions, an aspect lacking in existing research. Employing a big data approach, the analysis includes 304 million tweets from a randomly sampled cohort of 86,334 users, using a systematic framework to examine pre-, within-, and post-intervals aligned with the policy timeline. Methodologically, SARIMAX models and linear regression are applied to the time series data on tweet types within each interval, offering an examination of temporal trends. Additionally, the study characterizes short-term and long-term adopters of the policy using text and sentiment analyses on quote tweets. Results show a significant retweeting decrease and modest quoting increase during the policy, followed by a swift retweeting resurgence and quoting decline post-policy. Users with fewer connections or higher activity levels adopt quoting more. Emerging quoters prefer shorter, positive quote texts. These findings hold implications for social media policymaking, providing evidence for refining existing policies and shaping effective interventions.

The introduction of the quote tweet feature by Twitter in April 2015 marked a significant development in the platform’s functionality. While a conventional retweet merely reproduces the original tweet, serving as a symbol of agreement and endorsement between the users involved [ 1 ], the quote tweet feature allows users to include their own commentary when sharing a tweet. Consequently, this feature has given rise to various novel applications, including the expression of opinions, public replies, and content forwarding [ 2 ]. Notably, owing to the perennial significance of the US presidential elections [ 3 , 4 ], Twitter instituted a novel policy on October 9, 2020, advising users to abstain from mere retweeting and advocating instead for the use of quote tweets supplemented by individual perspectives. This policy remained in effect until December 16, 2020. Before the policy change, retweeting on Twitter was simple: with a single click, users could share a post with their followers. While the policy was in effect, however, clicking the retweet button no longer automatically shared the post. Instead, Twitter prompted users to add their own thoughts or comments before sharing, which essentially created a “Quote Tweet.” This extra step was intended to encourage users to share more thoughtfully. Importantly, adding text to the quote tweet was optional. Users could still leave the comment section blank and share the post without any additional commentary, which essentially replicated the old retweet functionality.

Significance of the research

This research holds significance in the realm of social media and communication studies, particularly in understanding the impact of policy interventions on user behavior. The significance can be delineated through various dimensions. First, the study provides a comprehensive evaluation of the effectiveness of Twitter’s quoting encouragement policy implemented during the 2020 U.S. presidential election. By employing a robust big data approach and sophisticated analytical methods, the research goes beyond anecdotal observations, offering a nuanced understanding of how such policies influence user engagement. This contribution is valuable for social media platforms seeking evidence-based insights into the outcomes of policy interventions, aiding in the refinement of existing policies and the formulation of new ones. Second, the findings offer actionable insights for social media policymakers and practitioners involved in the delicate task of shaping user behavior. Understanding the quantitative and qualitative effects of the policy shift allows for the optimization of future interventions, fostering more effective communication strategies on platforms like Twitter. Policymakers can leverage the identified user characteristics and behavioral patterns to tailor interventions that resonate with the diverse user base, thereby enhancing the impact of social media policies. Finally, the research enriches the theoretical landscape by applying the Motivation Crowding Theory, Theory of Planned Behavior (TPB), and Theory of Diffusion of Innovation (DOI) to the context of social media policy adoption. This interdisciplinary approach contributes to theoretical advancements, offering a framework that can be applied beyond the scope of this study. As theories from economics and psychology are employed to understand user behavior in the digital age, the research paves the way for cross-disciplinary collaborations and a more holistic comprehension of online interactions.

Research gap

Despite the existing body of literature on quoting behavior on Twitter, there is a conspicuous gap in addressing the unique policy implemented by Twitter from October 9 to December 16, 2020, encouraging users to quote instead of retweeting. Previous studies have explored the use of quotes in various contexts, such as political discourse and the spread of misinformation, but none have specifically examined the impact of a platform-wide policy shift promoting quoting behavior. In addition, while some studies have investigated user behaviors associated with quoting, retweeting, and other tweet types, there is a lack of a comprehensive framework that assesses the quantitative and qualitative effects of a specific policy intervention. The current study introduces a detailed evaluation framework, incorporating time series analysis, text analysis, and sentiment analysis, providing a nuanced understanding of the Twitter quoting encouragement policy’s impact on user engagement. Moreover, previous research has explored user characteristics in the context of social media engagement but has not specifically addressed how users' attributes may influence their response to a platform-wide policy change. The current study bridges this gap by investigating how factors like social network size and activity levels correlate with users’ adoption of the quoting encouragement policy. Finally, while some studies have assessed the immediate effects of policy interventions, there is a lack of research investigating the longitudinal impact after the withdrawal of such policies. The current study extends the temporal dimension by examining user behavior during the pre-, within-, and post-intervals, offering insights into the sustained effects and user adaptation following the cessation of the quoting encouragement policy. 
By addressing these research gaps, the current study seeks to provide a holistic examination of the quoting encouragement policy on Twitter, contributing valuable insights to the fields of social media studies, policy evaluation, and user behavior analysis.

Research objectives

This study aims to assess the effectiveness of the Twitter policy implemented from October 9 to December 16, 2020, which encouraged users to use the quote tweet feature instead of simple retweeting. Specifically, the research objectives are twofold: (1) to determine the adoption rate of this policy and evaluate its success, and (2) to identify user characteristics based on their reactions to this policy. The outcomes of this research contribute to both the evaluation of the Twitter policy and the development of more effective approaches to policymaking. Stier et al. [ 5 ] proposed a comprehensive framework comprising four phases for the policymaking process: agenda setting, policy formulation, policy implementation, and evaluation. According to this framework, the evaluation phase involves assessing the outcomes of the policy, considering the perspectives of stakeholders involved in the previous phases. In this context, the present research examines the Twitter quoting encouragement policy, which represents an intervention in the daily usage patterns of Twitter users, through both quantitative and qualitative analyses. The quantitative effects analysis, particularly the achievements observed, provides valuable insights for evaluating the efficacy of the quoting encouragement policy by Twitter. Additionally, the results obtained from the qualitative analyses facilitate policy implementation, which refers to the process of translating policies into practical action under the guidance of an authoritative body.

Quantitative effects

In this section, we present the hypotheses formulated to assess the quantitative effects of the Twitter quoting encouragement policy. The hypotheses are as follows:

H1: The intervention is expected to have a negative impact on users’ retweeting behavior. We hypothesize that the policy promoting the use of quote tweets instead of simple retweets will lead to a reduction in the frequency of retweeting among users.

H2: The intervention is unlikely to significantly affect other types of user behavior, such as posting original or reply tweets, as well as quotes. We anticipate that any observed changes in the rates of these tweet types would be of minimal magnitude and primarily influenced by factors unrelated to the intervention.

H3: The termination of the intervention is anticipated to have a positive effect on users' retweeting behavior. We hypothesize that the discontinuation of the policy encouraging quote tweets will result in an increase in users' retweeting activity.

H4: Similar to H2, the conclusion of the intervention is not expected to impact other tweet types (excluding quotes) in terms of posting behavior. This suggests the presence of a prevailing opinion inertia, where users tend to maintain their existing patterns and tendencies when posting original, reply, and non-quote tweets.

These hypotheses serve as a foundation for analyzing the quantitative effects of the Twitter quoting encouragement policy and investigating its influence on users’ tweet behaviors. Through rigorous analysis, we aim to shed light on the impact of the intervention and its implications for user engagement on the platform.

Qualitative effects

The qualitative effects can be examined from two distinct perspectives: User Characteristics and Text Characteristics. Moreover, the analysis encompasses three intervals, namely the Pre-Interval (prior to the policy implementation), Within Interval (during the policy implementation), and Post-Interval (after the policy withdrawal). The hypotheses for each perspective are as follows:

User characteristics

H5: Users with a larger social network (i.e., more friends) are expected to exhibit a lesser increase in their quoting rate during the Within Interval.

H6: Users who demonstrate a regular pattern of activity, characterized by a lower frequency of overall Twitter engagement (such as publishing at least one tweet type on more days), are more inclined to experience an elevation in their quoting rate during the Within Interval.

H7: Users who engage in a higher volume of retweeting activities during the Pre-Interval are more likely to observe an increase in their quoting rate during the Within Interval.

H8: The swifter users experience an increase in their quoting rate during the Within Interval, the sooner they are likely to discontinue quoting tweets upon entering the Post-Interval.

Text characteristics

H9: Short-term quoters tend to exhibit a comparatively smaller change in the length of their quote texts compared to long-term quoters. This is primarily due to the involuntary nature of the former, whereas the latter are more intentionally created.

H10: The sentiment of quote texts from short-term quoters is generally more likely to elicit a greater range of emotions compared to those from long-term quoters. This difference is attributable, at least in part, to the intervention's influence on short-term quoters.

H11: The quote texts of short-term quoters are generally more prone to receiving a higher number of retweets compared to those of long-term quoters. This can be attributed to factors such as longer text length, less deliberative content, and the presence of heightened emotional elements in the latter.

These hypotheses form the basis for analyzing the qualitative effects of the Twitter quoting encouragement policy, enabling a comprehensive understanding of user and text characteristics during different intervals. By examining these effects, we aim to shed light on the nuanced dynamics that underlie users’ quoting behavior and its implications on social interaction and engagement within the Twitter platform.

Theoretical framework

In alignment with the two main parts of this research, which examine the quantitative and qualitative effects of the recent Twitter policy, the theoretical framework is also divided into two contexts: one for quantitative analysis and the other for investigating qualitative effects. For the quantitative analyses, the motivation crowding theory has played a central role in shaping the corresponding hypotheses. This theory suggests that providing extrinsic incentives for specific behavior can sometimes undermine intrinsic motivation to engage in that behavior [ 6 ]. Although the motivation crowding theory originated in the realm of economics [ 7 ], this study aims to apply it to the adoption of policies within the context of Twitter. By treating the quoting encouragement policy as an incentive, this research seeks to quantify the impact of this incentive during its implementation and withdrawal. Hypotheses 1–4 have been formulated to guide these quantitative analyses and explore the potential influence of the undermining effect on the adoption rate after the policy withdrawal.

Regarding the qualitative analyses, the Theory of Planned Behavior (TPB) and the Diffusion of Innovations (DOI) theory serve as foundational frameworks for developing hypotheses related to user and text characteristics. The TPB explains behavior based on individuals' beliefs through three key components: attitude, subjective norms, and perceived behavioral control, which collectively shape behavioral intentions. Drawing on the TPB, hypotheses 5–8 aim to characterize different users based on their behaviors and attitudes toward the new policy. The DOI provides a platform for distinguishing users based on the time of adoption. In line with this theory, hypotheses 9–11 have been formulated to address characteristics that facilitate early adoption based on the content of quote texts. Figure 1 illustrates the theoretical framework of this study, highlighting its key components.

figure 1

Uniqueness and generalizability

To the best of our knowledge, this research represents the first comprehensive study to investigate the impact of the quoting encouragement policy implemented by Twitter. In comparison to the limited existing studies that have examined Twitter policies in the past, this research distinguishes itself through both the scale of the dataset utilized and the breadth of the analyses conducted. These unique aspects contribute to the applicability of this study in two key areas: methodology and findings. In terms of methodology, the presented approach incorporates an interrupted time series analysis framework, coupled with text and sentiment analyses, to examine policy interventions on a large scale. This framework enables researchers to develop various approaches for analyzing interventions within the realm of social media and big data. With regards to the findings, the extraction of qualitative and quantitative patterns from such a vast dataset yields novel insights. Particularly noteworthy is the ability to juxtapose these macro and micro results, leading to a deeper understanding of the policy’s effects. The findings of this study hold potential value for practitioners and policymakers not only on Twitter but also on other global platforms like Instagram and YouTube. However, it is important to consider certain modifications, such as adapting the follower-to-following ratio, when applying these findings to undirected networks like Facebook, where mutual agreement is necessary for link creation. Moreover, the analysis of this policy, which was implemented during the presidential election, provides valuable insights into its potential impact on public attention. Public attention has recently been identified as a critical factor in the success of presidential candidates [ 8 ]. Therefore, understanding the effects of the quoting encouragement policy can contribute to a better understanding of the dynamics surrounding public attention during such critical periods. 
Indeed, the uniqueness of this research lies in its pioneering examination of the Twitter quoting encouragement policy, extensive dataset, and comprehensive analyses. These distinct features enhance the applicability of the research in terms of methodology and findings, with potential implications for other global platforms and the study of public attention in political contexts.

Literature review

Given the nature of this research, which focuses on a novel Twitter policy that promotes quoting instead of retweeting, the literature review examines three perspectives: (1) Quote, (2) Engagement, and (3) Hashtag Adoption. These perspectives encompass relevant aspects that align with the scope of this study.

Quote

Garimella et al. [ 2 ] conducted a study on the utilization of the newly introduced “quote RT” feature on Twitter, specifically examining its role in political discourse and the sharing of political opinions within the broader social network. Their findings indicated that users who were more socially connected and had a longer history on Twitter were more likely to employ quote RTs. Furthermore, they observed that quotes facilitated the dissemination of political discourse beyond its original source. In a different context, Jang et al. [ 9 ] employed the rate of quotes as a measure to identify and detect fake news on Twitter. Their research focused on leveraging quotes as a means of analyzing the spread of misinformation and distinguishing it from authentic news. Li et al. [ 10 ] tried to identify users with high dissemination capability under different topics. Additionally, Bodaghi et al. [ 11 ] investigated the characteristics of users involved in the propagation of fake news, considering quotes and their combined usage with other tweet types such as retweets and replies. Their analysis aimed to gain insights into the user behaviors associated with the dissemination of false information. South et al. [ 12 ] utilized the quoter model, which mimics the information generation process of social media accounts, to evaluate the reliability and resilience of information flow metrics within a news–network ecosystem. This study focused on assessing the validity of these metrics in capturing the dynamics between news outlets engaged in a similar information dissemination process. By reviewing these studies, we can identify their relevance to the understanding of quoting behavior and its implications within different contexts, such as political discourse and the spread of misinformation. 
However, it is important to note that these previous works primarily focused on the usage of quotes and their effects without specifically addressing the Twitter policy under investigation in this study.

Engagement

The concept of engagement on social media platforms, particularly in relation to political communication and online interactions, has been extensively explored in previous studies. Boulianne et al. [ 13 ] conducted research on the engagement rate with candidates’ posts on social media and observed that attack posts tend to receive higher levels of engagement, while tagging is associated with a general trend of lower engagement. Lazarus et al. [ 14 ] focused on President Trump’s tweets and found that engagement levels vary depending on the substantive content of the tweet, with negatively toned tweets and tweets involving foreign policy receiving higher engagement compared to other types of tweets. Yue et al. [ 15 ] delved into how nonprofit executives in the U.S. engage with online audiences through various communication strategies and tactics. Ahmed et al. [ 16 ] examined Twitter political campaigns during the 2014 Indian general election. Bodaghi et al. [ 17 ] conducted a longitudinal analysis on Olympic gold medalists on Instagram, investigating their characteristics as well as the rate of engagement they receive from their followers. Hou et al. [ 18 ] studied the engagement differences between scholars and non-scholars on Twitter. Hoang et al. [ 19 ] aimed at predicting whether a post is going to be forwarded or not. Munoz et al. [ 20 ] proposed an index as a tool to measure engagement based on the tweet and follower approach.

The decision of an online social network user to join a discussion group is not solely influenced by the number of friends who are already members of the group. Backstrom et al. [ 21 ] discovered that factors such as the relationships between friends within the group and the level of activity in the group also play a significant role in the user’s decision. Hu et al. [ 22 ] performed an empirical study on Sina Weibo to understand the selectivity of retweeting behaviors. Moreover, Balestrucci et al. [ 23 ] studied how credulous users engage with social media content. Bodaghi et al. [ 24 ] explored the impact of dissenting opinions on the engagement rate during the process of information spreading on Twitter. Wells et al. [ 25 ] examined the interactions between candidate communications, social media, partisan media, and news media during the 2015–2016 American presidential primary elections. They found that social media activity, particularly in the form of retweets of candidate posts, significantly influenced news media coverage of specific candidates. Yang et al. [ 26 ] investigated the tweet features that trigger customer engagement and found a positive correlation between the rate of quoting and the number of positive quotes. Bodaghi et al. [ 27 ] studied the role of users’ position in Twitter graphs in their engagement with viral tweets. They demonstrated how different patterns of engagement can arise from various forms of graph structures, leading to the development of open-source software for characterizing spreaders [ 28 , 29 ].

Hashtag adoption

The adoption and usage of hashtags on Twitter have been investigated in several studies, shedding light on the factors influencing individual behavior and the role of social networks. Zhang et al. [ 30 ] explored the behavior of Twitter users in adopting hashtags and specifically focused on the effect of “structure diversity” on individual behavior. Their findings suggest that users' behavior in online social networks is not solely influenced by their friends but is also significantly affected by the number of groups to which these friends belong. Tian et al. [ 31 ] investigated the impact of preferred behaviors among a heterogeneous population on social propagation within multiplex-weighted networks. Their research shed light on the diverse adoption behaviors exhibited by individuals with varying personalities in real-world scenarios. Examining hashtag use on Twitter, Monster et al. [ 32 ] examined how social network size influences people's likelihood of adopting novel labels. They found that individuals who follow fewer users tend to use a larger number of unique hashtags to refer to events, indicating greater malleability and variability in hashtag use. Rathnayake [ 33 ] sought to conceptualize networking events from a platform-oriented view of media events, emphasizing the role of hashtags in bottom-up construction. Hashtags played a key role in this taxonomy, reflecting their significance in organizing and categorizing discussions around specific events. Furthermore, Bodaghi et al. [ 34 ] demonstrated that the size of a user's friend network also impacts broader aspects, such as their decision to participate in an information-spreading process. The characteristics and dynamics of an individual’s social network play a role in shaping their behavior and engagement with hashtags. 
These studies collectively contribute to our understanding of hashtag adoption and its relationship to social networks, providing insights into the factors that influence individuals’ decisions to adopt and use hashtags in online platforms like Twitter.

Method and analysis

Data collection.

For this study, a random sample of 86,334 users from the United States was selected. The data collection process involved crawling their tweets, specifically the last 3200 tweets if available, until October 2020. The crawling process continued for these users at seven additional time intervals until February 2021. This resulted in a total of eight waves of data, encompassing all the tweets from these 86,334 users starting from the 3200th tweet prior to the first crawling time in October 2020, up until their last tweet on February 2, 2021. The eight waves of crawled data were then merged into a final dataset, and any overlapping tweets were removed. The final dataset consists of a data frame containing 304,602,173 unique tweets from the 86,334 users. Each tweet in the dataset is associated with 23 features, resulting in a dataset size exceeding 31 GB. Additionally, another dataset was created by crawling the user characteristics of these 86,334 users, such as the number of followers, friends, and statuses. The dataset includes four types of tweets: Retweet, Quote, Reply, and Original. Each tweet in the dataset belongs to only one of these types (pure mode) or a combination of types (hybrid mode). The hybrid modes are represented in two forms: (1) a retweet of a quote and (2) a reply that contains a quote. To maintain consistency and focus on pure modes in the dataset, the former was considered solely as a retweet, and the latter was treated as a quote only. As a result, the approximate counts of the four tweet types (Retweet, Quote, Reply, and Original) in the dataset are 143 M, 23 M, 77 M, and 61 M, respectively. To ensure a more recent focus on activities, the analysis specifically considered data from October 9, 2019, onwards. This date, October 9, 2019, was chosen as it is one year prior to Twitter’s issuance of the quoting encouragement policy. 
By using this cutoff date, the analysis concentrates on the data relevant to the policy's implementation and subsequent effects.
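The merging and type-resolution rules described above can be illustrated with a minimal sketch. The field name `id` and the three boolean flags are hypothetical stand-ins for whatever the crawled tweet objects actually expose; only the precedence rule (a retweet of a quote counts as a retweet, a reply containing a quote counts as a quote) comes from the text.

```python
from typing import Iterable


def resolve_tweet_type(is_retweet: bool, is_quote: bool, is_reply: bool) -> str:
    """Map a tweet's flags to exactly one of the four pure types."""
    if is_retweet:
        return "Retweet"  # also covers hybrid form 1: a retweet of a quote
    if is_quote:
        return "Quote"    # also covers hybrid form 2: a reply that contains a quote
    if is_reply:
        return "Reply"
    return "Original"


def merge_waves(waves: Iterable[list]) -> list:
    """Merge crawl waves, keeping the first copy of each tweet ID."""
    seen = set()
    merged = []
    for wave in waves:
        for tweet in wave:
            if tweet["id"] not in seen:  # "id" is an assumed field name
                seen.add(tweet["id"])
                merged.append(tweet)
    return merged
```

Checking the retweet flag first is what makes the hybrid cases collapse into pure types without any extra branching.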

Data exploration

This section explores three aspects of the data: (1) the average number of tweets per user in each tweet type, (2) the number of active users in each tweet type, and (3) the usage of hashtags. The analysis includes all 86,334 users in the dataset. The exploration is conducted across three intervals: (1) pre-interval (from October 9, 2019, to October 8, 2020), (2) within-interval (from October 9, 2020, to December 15, 2020), and (3) post-interval (from December 16, 2020, to February 2, 2021). The code used for these explorations is publicly available (Footnote 1). Figure 2 presents the results for the first two aspects. The plots on the left-hand side illustrate the average number of tweets published in each tweet type, namely Original, Quote, Reply, and Retweet. The plots on the right-hand side display the number of active users in each tweet type. Active users in a specific type on a given day are defined as users who have published at least one tweet in that type on that day.
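As a rough illustration of the two daily measures, the sketch below computes, for each day and tweet type, the average number of tweets per user and the number of active users. It assumes tweets arrive as hypothetical `(user_id, day, tweet_type)` tuples and that the average is taken over all users in the sample; the study's published code may organize this differently.

```python
from collections import defaultdict


def daily_activity(tweets, n_users):
    """Return {(day, tweet_type): (avg tweets per user, active users)}."""
    counts = defaultdict(int)   # tweets per (day, type)
    active = defaultdict(set)   # distinct users per (day, type)
    for user_id, day, tweet_type in tweets:
        counts[(day, tweet_type)] += 1
        active[(day, tweet_type)].add(user_id)
    # Assumption: the denominator is the full sample size, not only active users.
    return {key: (counts[key] / n_users, len(active[key])) for key in counts}
```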

figure 2

Daily rates of user activities during pre-, within-, and post-intervals

To analyze the usage of hashtags, the first step is to identify political hashtags. This involves extracting all the hashtags used in the dataset from September 1, 2020, to February 1, 2021, excluding the last day of the dataset (February 2, 2021) due to incomplete data collection. The following intervals are defined based on this period:

Pre-Interval: September 1, 2020, to October 8, 2020.

Within-Interval: October 9, 2020, to December 15, 2020.

Post-Interval: December 16, 2020, to February 1, 2021.

The extraction process yields a total of 1,126,587 hashtags. From this set, the 100 most frequently used hashtags are selected for further analysis. These selected hashtags are then reviewed and annotated by two referees, considering their political context. Through consensus between the referees, 32 hashtags out of the initial 100 are identified as political. The results of the usage analysis on these selected political hashtags are presented in Fig. 3.
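The frequency-ranking step that precedes the manual annotation could look like the following sketch. The regular expression and the case-folding are assumptions made for illustration; the text does not say how hashtags were tokenized or normalized.

```python
import re
from collections import Counter

HASHTAG_PATTERN = re.compile(r"#(\w+)")  # assumed tokenization rule


def top_hashtags(texts, k=100):
    """Count hashtags across tweet texts and return the k most frequent."""
    counts = Counter()
    for text in texts:
        # Assumption: hashtags are compared case-insensitively.
        counts.update(tag.lower() for tag in HASHTAG_PATTERN.findall(text))
    return counts.most_common(k)
```

The resulting top-100 list would then go to the two referees for political annotation, a step that stays manual.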

figure 3

Usage of political hashtags. The left plot presents a word cloud depicting the 32 most frequently repeated political hashtags. The right plot displays the distribution of these political hashtags. The upper plot labels the significant dates associated with spikes in usage

Table 1 displays the key dates corresponding to the significant spikes observed in the plots depicted in Fig. 3. These events directly influenced the patterns observed in the dataset.

Measurements of quantitative effects

To perform quantitative analysis, the data frame of each user was extracted by segregating all tweets associated with the same user ID. This process resulted in the creation of 86,334 individual data frames, each corresponding to a unique user. Subsequently, each user's data frame was divided into three distinct time intervals as follows:

Pre Interval [2019-10-09 to 2020-10-08]: This interval encompasses the year prior to the implementation of the new Twitter policy on 2020-10-09. Hence, the end of this interval is set as 2020-10-08.

Within Interval [2020-10-09 to 2020-12-15]: This interval spans from the policy’s inception on the first day, i.e., 2020-10-09, until its termination by Twitter on the last day, i.e., 2020-12-15.

Post Interval [2020-12-16 to 2021-02-02]: This interval commences on the day immediately following the removal of the policy, i.e., 2020-12-16, and continues until the last day on which a user published a tweet within the dataset. The dataset's coverage concludes on 2021-02-02, which represents the latest possible end date for this interval if a user had any tweet published on that date.
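A small helper makes the three interval boundaries explicit. The dates are taken from the definitions above; the constant and function names are illustrative only.

```python
from datetime import date

PRE_START = date(2019, 10, 9)      # one year before the policy
POLICY_START = date(2020, 10, 9)   # first day of the policy
POLICY_END = date(2020, 12, 15)    # last day of the policy
DATA_END = date(2021, 2, 2)        # last day covered by the dataset


def interval_of(day):
    """Return 'pre', 'within', or 'post' for a date, or None if out of range."""
    if PRE_START <= day < POLICY_START:
        return "pre"
    if POLICY_START <= day <= POLICY_END:
        return "within"
    if POLICY_END < day <= DATA_END:
        return "post"
    return None
```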

Impact analysis of the Twitter policy

The objective of this analysis is to assess the individual impact of the new Twitter policy, which promotes quoting instead of retweeting, on each user. Specifically, we aim to examine how the rate and quantity of published tweets per day have been altered following the implementation or removal of the new policy. Figure  4 illustrates the slopes and levels of a selected tweet type (quote) within each interval for a given user. Given the presence of four tweet types and three intervals, it is necessary to fit a total of 12 models for each user, corresponding to each tweet type within each interval.

figure 4

Slope and levels of number of quotes published by users during three intervals. This figure displays the slope and levels of the three intervals (pre-, within-, and post-intervals) for the number of quotes published by each user. The green lines depict the linear regression of the time series for each interval. The slope of pre-interval, within-interval, and post-interval corresponds to the slope of AB, CD, and EF lines, respectively. The start/end levels of pre-interval, within-interval, and post-interval are represented by A/B, C/D, and E/F, respectively

To analyze the impact of the new policy for each tweet type within a specific interval, we applied linear regression using the Ordinary Least Squares method (Eq.  1 ) in Python for users who had at least 7 data points with non-zero values.

\[ y = \alpha x + \beta + \varepsilon \tag{1} \]

where y is the number of tweets per day, x is the number of days, \(\alpha\) is the coefficient representing the slope, \(\varepsilon\) is the error, and \(\beta\) is the level. We then checked for the presence of autocorrelation in the residuals using the Durbin–Watson test (Eq. 2). If no autocorrelation was detected, we used linear regression to calculate the slopes and levels.

\[ d = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^{2}}{\sum_{i=1}^{n} e_i^{2}} \tag{2} \]

where d is the Durbin–Watson statistic, \(e_i\) is the residual at observation i, and n is the number of observations. The Durbin–Watson statistic ranges from 0 to 4. A value around 2 indicates no autocorrelation, while values significantly less than 2 suggest positive autocorrelation, and values significantly greater than 2 suggest negative autocorrelation. However, if autocorrelation was present, we employed linear regression with autoregressive errors (Eq. 3).

\[ y = \alpha x + \beta + \delta \tag{3} \]

where the error term at observation i is \({\delta }_{i} = {\widehat{\phi }}_{1}{\delta }_{i-1}+ {\widehat{\phi }}_{2}{\delta }_{i-2}+\dots + {\widehat{\phi }}_{p}{\delta }_{i-p}-{\widehat{\theta }}_{1}{e}_{i-1}- {\widehat{\theta }}_{2}{e}_{i-2}-\dots -{\widehat{\theta }}_{q}{e}_{i-q}+{\varepsilon }_{i}\)

In this equation, the errors are modelled using an ARIMA (p, d, q), where p and q represent the lags in the autoregressive (AR) and moving-average (MA) models, respectively, and d is the differencing value. We utilized the SARIMAX (p, d, q) model in Python to implement this regression, where the exogenous variable X (in Eq.  1 ) represents the number of days.
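The first two steps of this procedure, the ordinary least squares fit and the Durbin–Watson check, can be sketched in plain Python. This is an illustration under stated assumptions rather than the study's code: the text does not say how far the statistic may deviate from 2 before autocorrelation is declared, so the 1.5 and 2.5 cutoffs below are placeholders, and the function names are hypothetical.

```python
def ols_fit(days, counts, min_nonzero=7):
    """Fit counts = slope * day + level; return None with too few non-zero points."""
    if sum(1 for c in counts if c != 0) < min_nonzero:
        return None
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, counts))
    slope = sxy / sxx
    level = mean_y - slope * mean_x
    residuals = [y - (slope * x + level) for x, y in zip(days, counts)]
    return slope, level, residuals


def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def has_autocorrelation(residuals, low=1.5, high=2.5):
    """Flag autocorrelation when d falls outside [low, high] (assumed cutoffs)."""
    d = durbin_watson(residuals)
    return d < low or d > high
```

When `has_autocorrelation` returns `True`, the series would be passed on to the regression with autoregressive errors instead of keeping the plain OLS estimates.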

To determine the best values for the model's parameters, we conducted a grid search to generate a set of potential parameter combinations. We then evaluated the results for each combination based on the following criteria: (1) all estimated coefficients must be significant, (2) the Akaike Information Criterion (AIC) based on Eq. ( 4 ) should be less than 5000, and (3) the Ljung–Box test based on Eq. ( 5 ) should not be significant (p-value > 0.05), indicating no remaining autocorrelation in the residuals.

\[ \mathrm{AIC} = 2k - 2\mathcal{L} \tag{4} \]

where \(\mathcal{L}\) is the maximum log-likelihood of the model and k is the number of estimated parameters. A lower AIC value indicates a better estimation of the model orders.

\[ Q = n(n+2)\sum_{k=1}^{h}\frac{\rho_k^{2}}{n-k} \tag{5} \]

where n is the sample size, \({\rho }_{k}\) is the sample autocorrelation at lag k, and h is the number of lags considered. The test statistic follows a chi-squared distribution with h degrees of freedom. The null hypothesis is that there is no autocorrelation up to the specified lag. A p-value greater than 0.05 suggests that there is no significant autocorrelation in the residuals, indicating an adequate fit. Finally, among the selected results, the model with the lowest sigma2 shown by Eq. ( 6 ), indicating less variance in the residuals, was chosen as the best-fit model.

where \({e}_{i}\) is the residual at observation i, \(\overline{e }\) is the mean of the residuals, and n is the number of observations. In the case of time series analysis, the residuals are the differences between the observed values and the values predicted by the ARIMA or SARIMAX model. The parameter values corresponding to this model were considered the optimal fit. The entire process for obtaining the slope and level findings is depicted in Fig.  5 , and the results are presented in Table  2 .
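The selection logic can be summarized in a short sketch. It assumes each candidate fit has already been reduced to a dictionary with hypothetical keys (`order`, `coef_pvalues`, `aic`, `ljung_box_pvalue`, `sigma2`); the grid bounds are also placeholders, since the text does not report the ranges searched.

```python
from itertools import product


def candidate_orders(max_p=2, max_d=1, max_q=2):
    """Enumerate (p, d, q) combinations for the grid search (assumed bounds)."""
    return list(product(range(max_p + 1), range(max_d + 1), range(max_q + 1)))


def select_best(fits, alpha=0.05, max_aic=5000.0):
    """Apply the three criteria, then pick the fit with the lowest sigma2."""
    eligible = [
        fit for fit in fits
        if all(p < alpha for p in fit["coef_pvalues"])  # (1) significant coefficients
        and fit["aic"] < max_aic                        # (2) AIC below the cap
        and fit["ljung_box_pvalue"] > alpha             # (3) no residual autocorrelation
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda fit: fit["sigma2"])
```

Returning `None` when no candidate passes keeps the caller responsible for deciding what to do with users whose series cannot be modelled.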

figure 5

Flowchart illustrating the overall analysis procedure for slope and level assessments

Table 2 illustrates variations in the slope and level of tweeting between intervals. For instance, a level change of − 1.025 indicates a daily decrease of approximately 1.025 quotes from pre-interval to within-interval. Similarly, a slope change of 0.003 reflects an increase of around 0.003 quotes per day in the slope of quoting during the same transition. The table provides additional insights into slope and level changes for other tweet types across different intervals.

Analysis of qualitative effects

In this section, we aim to investigate the changes in user behavior towards the Twitter policy based on user characteristics such as the number of followers, number of friends, and number of statuses. To achieve this, we consider users whose obtained models are significant in both paired intervals (pre-within or within-post). We calculate the correlations between the values of these characteristics and the rate of change in the slope of each tweet type between the intervals. The results of this analysis are presented in Table  3 .
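To make the correlation step concrete, the sketch below computes a Pearson coefficient between a user characteristic and the per-user slope change. The text does not name the correlation measure, so Pearson is an assumption here, and the function name is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

In this setting `xs` would hold, for example, the number of friends of each eligible user and `ys` the change in that user's quote slope from the pre-interval to the within-interval.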

For instance, Table 3 shows a notable negative correlation of −0.042 between the number of friends a user has and the rate of slope change for quote publishing from the pre-interval to the within-interval. Additionally, a significant negative correlation of −0.079 is evident between the number of quotes published in the post-interval and the number of retweets previously published in the within-interval. Further detailed explanations and implications are presented in the “Results” section.

The analysis of text characteristics focuses on examining the impact of the new policy on the length and sentiment of quote texts. Specifically, we are interested in understanding how the quote texts of two different user groups, namely “short-term quoters” and “long-term quoters,” have changed in terms of length and sentiment from the pre-interval to the within-interval. We define the two groups as follows:

Short-term Quoter: A user who did not engage in quoting during the pre-interval but started quoting in the within interval.

Long-term Quoter: A user who engaged in quoting during the pre-interval and continued to do so in the within interval. A quoter is defined as a user whose average number of quotes in all tweets exceeds a certain threshold.
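The two group definitions can be expressed as a small classifier. This is a sketch under assumptions: the threshold is applied to the share of quotes among all tweets in each interval, and its default value of 0.05 is illustrative, since the definition above only refers to "a certain threshold".

```python
def quote_share(n_quotes, n_tweets):
    """Fraction of a user's tweets that are quotes (0.0 if no tweets)."""
    return n_quotes / n_tweets if n_tweets else 0.0


def classify_quoter(pre_quotes, pre_tweets, within_quotes, within_tweets,
                    threshold=0.05):
    """Return 'short-term', 'long-term', or None for a single user."""
    pre = quote_share(pre_quotes, pre_tweets)
    within = quote_share(within_quotes, within_tweets)
    if pre_quotes == 0 and within > threshold:
        return "short-term"  # no quoting before the policy, quoting during it
    if pre > threshold and within > threshold:
        return "long-term"   # quoting both before and during the policy
    return None
```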

For the analysis, we extract three characteristics from the quote text: (1) the number of characters (excluding spaces), (2) the sentiment score, and (3) the number of times the quote has been retweeted. We preprocess the text by performing tasks such as removing non-ASCII characters, emojis, mentions, and hashtags. To calculate the sentiment score, we utilize the sentiment analyzer from the Python NLTK package, which is based on VADER (Footnote 2), a lexicon- and rule-based sentiment analysis tool specifically designed for sentiments expressed in social media. The sentiment score calculated by VADER is a compound score that represents the overall sentiment of a text. The score is computed based on the valence (positivity or negativity) of individual words in the text (Eq. 7).

where \({S}_{compound}\) is the compound sentiment score, V is the valence score of word and is normalized to be between − 1 (most negative) and 1 (most positive), and I is the intensity of word. The weights are determined by the intensity of each word's sentiment. The rate of change in the average value of these characteristics from the pre to within intervals is then calculated for each user. Finally, we compute the average rates of change separately for short-term and long-term quoters, as presented in Table  4 .
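The preprocessing and length measurement described for the quote texts can be sketched with the standard library alone. The regular expressions are assumptions about how mentions and hashtags were stripped; sentiment scoring is left out because it depends on the NLTK VADER lexicon being installed.

```python
import re


def clean_quote_text(text):
    """Remove mentions, hashtags, and non-ASCII characters (including emojis)."""
    text = re.sub(r"[@#]\w+", "", text)                    # assumed mention/hashtag pattern
    text = text.encode("ascii", "ignore").decode("ascii")  # drops emojis and accented letters
    return re.sub(r"\s+", " ", text).strip()               # collapse leftover whitespace


def char_count(text):
    """Number of characters in the cleaned text, excluding spaces."""
    return sum(1 for ch in clean_quote_text(text) if not ch.isspace())
```

The cleaned string is also what would be handed to the sentiment analyzer, so that length and sentiment are measured on the same text.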

As illustrated in Table  4 , 34,317 users in the dataset exhibited a quote-publishing rate exceeding 0.05 during the pre and within intervals, indicating more than 5 quotes per every 100 published tweets. These users observed a marginal increase (0.006) in the average sentiment of their tweets from pre-intervals to within-intervals. Conversely, 5900 users in the dataset, who had no quotes in the pre-interval but exceeded 0.01 of all their tweets as quotes during the pre and within intervals, experienced a decrease of 0.242 per day in their rate of retweeting from pre-intervals to within-intervals. Further detailed explanations and implications are presented in the “ Results ” section.

Impact of the Twitter policy

The findings of the impact analysis are presented in Table  2 , illustrating the changes in slopes for different tweet types. It is observed that the slope of each tweet type, except for quotes, decreased upon entering the within interval, while quotes experienced a slight increase (0.0031). Notably, prior to the implementation of the new policy, there was a substantial increase in the number of daily tweets across all types. Therefore, the decline in levels during the within intervals relative to the pre-intervals can be attributed to this initial surge in activity. Another significant result is the considerable decrease in the number of daily published quotes during the post-interval compared to the within-interval. Additionally, a significant decrease (− 2.785) and increase (11.587) are observed in the slope of retweets per day during the within and post intervals, respectively. These notable changes in both retweet and quote rates highlight the impact of Twitter's new policy. When examining these results from a broader perspective, two trends emerge: (1) during the transition from the pre-interval to the within interval, the slope of all tweet types, except for quotes, decreased, and (2) from the within-interval to the post-interval, the slope of all tweet types increased, except for quotes. These trends underscore the pronounced impact of the new policy implemented by Twitter. In conclusion, it can be inferred that the policy has achieved some progress. However, determining the true success of the policy requires considering Twitter's overarching goals within a broader context, encompassing both short-term and long-term consequences.

The correlations between user characteristics and slope changes in each tweet type during different intervals are presented in Table  3 . The results, particularly the correlations between slope changes in quoting and retweeting and other user characteristics, can be examined from three perspectives: the pre-within transition, within-post transition, and a comparison of pre-within to post-within.

Pre-within transitions

Regarding the pre-within transition, several noteworthy relationships can be observed. Firstly, there is an inverse relationship between the number of friends a user has and the slope change for the quote type. This suggests that users with a larger number of friends exhibit less improvement in their quoting rate during the within-interval (following the implementation of the Twitter policy). Similarly, the number of statuses published by users also demonstrates a negative correlation with the slope change for quotes. In other words, users who tend to publish a higher number of statuses show less inclination to increase their quoting rate during the within interval. Additionally, significant relationships emerge between the slope change in quoting during the pre-within interval and both retweet counts and the number of data points. This indicates that users who have engaged in more retweets are more likely to exhibit a propensity for quoting during the within-interval. Similar relationships can be observed between the slope change in quoting during the pre-within interval and other tweet types, suggesting that more active users are more influenced by the changes in the quoting behavior.

Within-post transitions

Analyzing the within-post transitions, several significant relationships can be observed. Firstly, the slope change in retweeting during the within-post interval exhibits a significant relationship with the number of quotes and original tweets during the within interval. This implies that users who have a higher number of quotes and original tweets in their activity would experience a greater increase in the retweeting rate after the policy cancellation (post-interval). However, the slope change in retweeting during the within-post interval does not show a significant relationship with the slope change in any other tweet type, except for an inverse relationship with original tweets. In other words, users who engage more in original tweets during the within-interval are likely to exhibit a lower increase in the rate of retweeting during the post-interval. Regarding the slope change in quoting during the within-post interval, a significant negative relationship is observed with the number of retweets during the within interval. This indicates that users who have a higher number of retweets during the within-interval are likely to experience a lower increase in the quoting rate during the post-interval. This relationship holds true for users who have quoted more during the within interval as well.

Pre-within to within-post comparison

Comparing the slope change in quoting and retweeting between the pre-within and within-post transitions, it can be observed that users who experienced an increase in their quoting or retweeting rate during the pre-within transition tend to exhibit a higher inclination to decrease it during the within-post transition. Additionally, a significant inverse relationship is evident between the slope change in quoting during the pre-within interval and the slope change in retweeting during the within-post interval. This implies that users who witnessed a greater increase in their quoting rate during the pre-within transition are likely to experience a larger decrease in their retweeting rate during the within-post transition.

The results of the text analysis, specifically length, sentiment, and the number of retweets, are presented in Table  4 . Examining the results reveals several key findings. Firstly, the quote texts of long-term quoters have undergone a reduction in length during the within interval compared to the pre-interval, across all threshold levels. However, for short-term quoters, this reduction in quote length only occurs at threshold levels equal to or above 0.05. Furthermore, among those whose quote texts have been shortened (at threshold levels of 0.05 and 0.075), short-term quoters experience a greater reduction in length compared to long-term quoters. Regarding sentiment analysis, the results indicate an overall increase in the sentiment score of quote texts from the pre-interval to the within interval. However, this increase is more pronounced for short-term quoters compared to long-term quoters.

Additionally, for both categories and across all threshold levels, the number of retweets received by quotes has decreased from the pre-interval to the within-interval. This decrease is particularly significant for long-term quoters, except at threshold level 0 for short-term quoters. This observation aligns with expectations since short-term quoters did not have any quotes during the pre-interval, resulting in their sentiment score being subtracted from the sentiment scores of quotes during the within interval. Notably, the decrease in the number of retweets is more substantial for long-term quoters, except at a threshold level of 0.075, where it is slightly higher for short-term quoters. In summary, by considering a threshold of 0.075 as an indicator, we can conclude that the Twitter policy has influenced quote texts in the following ways: (1) There is a greater reduction in the number of characters for short-term quoters compared to long-term quoters, and (2) The increase in sentiment score is more significant for short-term quoters relative to long-term quoters.
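The grouping of users into long-term and short-term quoters at a given threshold can be sketched as follows. This is one possible reading of the description, stated as an assumption: the quote rate is computed over the pre and within intervals combined, a user with no quotes in the pre-interval counts as a short-term quoter, and a user with quotes in the pre-interval counts as a long-term quoter. The function name and arguments are hypothetical.

```python
def classify_quoter(pre_quotes, pre_total, within_quotes, within_total, threshold):
    """Return 'long-term', 'short-term', or None for a single user.

    Assumption: the quote rate is the share of quotes among all tweets
    published during the pre and within intervals combined.
    """
    total = pre_total + within_total
    if total == 0:
        return None  # the user published nothing in either interval
    rate = (pre_quotes + within_quotes) / total
    if rate <= threshold:
        return None  # the user does not exceed the threshold
    # Users above the threshold are split by whether they quoted before the policy.
    return "short-term" if pre_quotes == 0 else "long-term"


print(classify_quoter(6, 100, 8, 100, 0.05))
print(classify_quoter(0, 100, 4, 100, 0.01))
print(classify_quoter(1, 100, 2, 100, 0.05))
```

Once users are grouped this way, measures such as mean quote length or mean sentiment score can be compared between the pre-interval and the within-interval for each group and threshold.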

The findings pertaining to the hypotheses are outlined in Table  5 .

Quantitative findings

The quantitative analysis, based on hypotheses H1–4, reveals that the intervention has a negative impact on users’ retweeting behavior, while other tweet types remain relatively unaffected. However, the cessation of the intervention leads to an increase in the retweeting rate and a decrease in the quoting rate. When considering only the period when the policy was in effect, namely the within-interval, it can be concluded that the policy was partially successful. Despite only a minor increase in the quoting rate, the significant decline in retweeting indicates a positive outcome. However, when examining the long-term effects after discontinuation of the policy, i.e., the post-interval, the policy can be regarded as a failure, as the retweeting rate experienced a dramatic increase while the quoting rate decreased substantially. Although Twitter neither forced users to quote instead of retweeting nor provided any explicit promotion or reward for quoting, the quoting encouragement policy may have influenced users' perceptions and served as a virtual external incentive for initiating quoting behavior. This phenomenon can be explained by the adaptive nature of the brain in perceiving rewards based on recent levels and ranges of rewards, fictive outcomes, social comparisons, and other relevant factors [ 35 , 36 ]. The motivation crowding theory offers a framework for discussing this observation. When an extrinsic reward is removed, the level of intrinsic motivation diminishes compared to a scenario where no additional reward was initially provided [ 37 ]. In the case of Twitter's policy, users may have perceived the extrinsic incentive of adding a few extra characters to a retweet as rewarding and complied accordingly. However, once this external incentive was eliminated, the residual intrinsic motivation decreased below its initial level. This explains the subsequent decline in the quoting rate during the post-interval, accompanied by a surge in retweeting activity.

Qualitative findings

The qualitative analysis, focusing on hypotheses H5–11, reveals several noteworthy patterns. Users with a smaller number of friends and higher levels of overall tweet activity are more inclined to align with the policy and increase their quoting rate during the within interval. Furthermore, users who experienced an increase in their quoting rate during the within interval are more likely to decrease their quoting rate following the policy withdrawal in the post interval. Additionally, users who adopted quoting behavior as a result of the policy during the within interval demonstrated a tendency to publish quotes with shorter text length and more positive emotions. Two observed patterns can be explained respectively by the TPB and the DOI. The TPB posits that an individual’s behavioral intentions are influenced by three components, with subjective norms playing a significant role [ 38 ]. The impact of subjective norms is contingent upon the connections an individual has with others. Users with a smaller number of friends have fewer channels through which subjective norms can exert pressure. Consequently, these users are less influenced by societal norms that have not yet accommodated the new policy. Hence, users with fewer friends are more likely to be early adopters of the policy. Moreover, recent research [ 39 ] suggests that TPB, along with the theory of the Spiral of Silence, can potentially explain the avoidance of adoption, particularly when adoption involves expressing individual beliefs. Furthermore, the DOI provides insights into the adoption process, suggesting that adopters can be categorized into distinct groups based on the timing of their adoption [ 40 ]. Through this categorization, shared characteristics in terms of personality, socioeconomic status, and communication behaviors emerge. 
Early adopters, characterized by a greater tolerance for uncertainty and change, often exhibit higher levels of upward mobility within their social and socioeconomic contexts, as well as enhanced self-efficacy [ 41 ]. These characteristics are reflected in the more positive emotions expressed in their quote posts.


This study carries implications from both practical and theoretical perspectives. From a practical standpoint, the findings provide valuable guidance for practitioners in developing a multistage model that captures users’ behavior towards a new social media policy at an aggregate level. Such a model is crucial for designing efficient strategies aimed at expediting the adoption process among the majority of users. Leveraging the quantitative analysis method employed in this study, practitioners can first evaluate the impact of the policy, and then, using the qualitative analysis method, identify users who are more inclined to adopt or reject the policy based on their characteristics and text behavior. Gaining insights into user tendencies towards policy adoption or rejection in advance can inform a series of initiatives, including targeted user categorization to introduce or withhold the policy during its initial stages. An illustrative study by Xu et al. [ 42 ] explored public opinion on Twitter during Hurricane Irma across different stages, analyzing over 3.5 million tweets related to the disaster to discern distinct thematic patterns that emerged during each stage. Their findings assist practitioners in utilizing Twitter data to devise more effective strategies for crisis management. From a theoretical perspective, the findings contribute to the advancement of theories such as the TPB and the DOI in the realm of cyberspace. According to TPB, subjective norms play a significant role in shaping human behavior. This study revealed that users with a smaller number of friends are more inclined to accept the new policy. This suggests that users who have fewer connections are more likely to deviate from the prevailing norm in which the adoption of the new policy has not yet gained traction. 
Furthermore, the higher rates of positivity observed in the quote texts of short-term quoters, relative to their long-term counterparts, contribute to the extension of the DOI regarding policy adoption and expand our understanding of the possible manifestations of early adopters' characteristics in the context of social media.

For a more nuanced understanding, it is noteworthy to explore the impact of events on user behavior. While events like debates can undeniably influence user activity levels, this impact is likely experienced across all types of users, such as quoters and retweeters. Our analysis, examining individual users across multiple time intervals that encompass these events, allows us to observe user-specific behavioral evolution. The extracted patterns thus represent dominant shifts in spreading behavior observed in the majority, irrespective of their original preference (retweeting or quoting). This observed consistency suggests that the policy's influence may extend beyond just event-driven fluctuations. The consistent shift in information-sharing behavior throughout the study period points towards the possible contribution of additional factors beyond isolated events.

Conclusion and future work

This research employed a big data approach to analyze the Twitter quoting encouragement policy, examining both its quantitative and qualitative effects. The research timeline was divided into three distinct intervals: pre, within, and post-intervals. Time series analysis was then utilized to identify changes in the rates of different tweet types across these intervals. Additionally, text and sentiment analysis, along with correlation methods, were applied to explore the relationships between user characteristics and their responses to the policy. The results revealed a short-term success followed by a long-term failure of the policy. Moreover, a set of user characteristics was identified, shedding light on their adherence to the policy and their quoting tendencies during the policy’s implementation. These findings have significant implications for the development and evaluation of new policies in the realm of social media, offering valuable insights for the design of more effective strategies.

The study of policy adoption on social media is still in its early stages, particularly in the realm of data analytics and behavioral research [ 43 ]. Future studies can build upon this research and explore additional factors and techniques to deepen our understanding. For example, the impact of aggregations, such as crowd emotional contagion, convergence behavior, and adaptive acceptance, can be modelled as exogenous factors in the analysis [ 44 , 45 ]. Additionally, incorporating new techniques for sentiment analysis, as highlighted in studies by Zhao et al. [ 46 ], and Erkantarci et al. [ 47 ], as well as semantic techniques [ 48 ], can further enhance computational analyses. Moreover, future research can consider factors related to the continuance of use [ 49 ] to examine the reasons behind policy rejection by users who initially adopted it. The inclusion of census data, search logs of users [ 50 ], user demographics [ 51 ], and the analysis of interconnections within a graph [ 52 ] would be valuable additions to the analysis. These additional data sources can provide a more comprehensive understanding of user behaviors and interactions. Furthermore, it is important to consider bot filtering techniques to ensure the accuracy and reliability of the findings. This step is particularly crucial for extending the research beyond Twitter and examining policy adoption in non-cyber spaces. By exploring these avenues of research, future studies can advance our knowledge of policy adoption on social media, providing valuable insights into user behaviors, motivations, and the effectiveness of policy interventions. Finally, this study’s data collection and storage methods share similarities with those employed in prior efforts [ 53 ]. However, there remains significant potential for innovation in this area.

Data availability

The datasets analyzed during the current study are available from the corresponding author on reasonable request.

Code availability

The code is publicly available at this link: .

VADER: Valence Aware Dictionary and sEntiment Reasoner.

Weber, I., Garimella, V. R. K., & Batayneh, A. (2013). Secular vs. Islamist polarization in Egypt on twitter . ASONAM.

Garimella, K., Weber, I., & Choudhury, M.D. (2016). Quote RTs on Twitter: Usage of the new feature for political discourse. WebSci’ 16 Germany.

Gallego, M., & Schofield, N. (2017). Modeling the effect of campaign advertising on US presidential elections when differences across states matter. Mathematical Social Sciences, 90 , 160–181.

Jones, M. A., McCune, D., & Wilson, J. M. (2020). New quota-based apportionment methods: The allocation of delegates in the Republican Presidential Primary. Mathematical Social Sciences, 108 , 122–137.

Stier, S., Schünemann, W. J., & Steiger, S. (2018). Of activists and gatekeepers: Temporal and structural properties of policy networks on Twitter. New Media and Society, 20 (5), 1910–1930.

Frey, B. S., & Jegen, R. (2001). Motivation crowding theory. Journal of Economic Surveys, 15 , 589–611.

Kreps, D. (1997). Intrinsic motivation and extrinsic incentives. American Economic Review, 87 , 359–364.

Stiles, E. A., Swearingen, C. D., & Seiter, L. M. (2022). Life of the party: Social networks, public attention, and the importance of shocks in the presidential nomination process. Social Science Computer Review .

Jang, Y., Park, C. H., & Seo, Y. S. (2019). Fake news analysis modeling using quote retweet. Electronics, 8 (12), 1377.

Li, K., Zhu, H., Zhang, Y., & Wei, J. (2022). Dynamic evaluation method on dissemination capability of microblog users based on topic segmentation. Physica A: Statistical Mechanics and its Applications, 608 , 128264.

Bodaghi, A., & Oliveira, J. (2020). The characteristics of rumor spreaders on Twitter: A quantitative analysis on real data. Computer Communications, 160 , 674–687.

South, T., Smart, B., Roughan, M., & Mitchell, L. (2022). Information flow estimation: A study of news on Twitter. Online Social Networks and Media, 31 , 100231.

Boulianne, S., & Larsson, A. O. (2021). Engagement with candidate posts on Twitter, Instagram, and Facebook during the 2019 election. New Media and Society, 1–22.

Lazarus, J., & Thornton, J. R. (2021). Bully pulpit? Twitter users’ engagement with President Trump’s tweets. Social Science Computer Review, 39 (5), 961–980.

Yue, C. A., Qin, Y. S., Vielledent, M., Men, L. R., & Zhou, A. (2021). Leadership going social: How U.S. nonprofit executives engage publics on Twitter. Telematics and Informatics, 65 , 101710.

Ahmed, S., Jaidka, K., & Cho, J. (2021). The 2014 Indian elections on Twitter: A comparison of campaign strategies of political parties. Telematics and Informatics, 33 (4), 1071–1087.

Bodaghi, A., & Oliveira, J. (2022). A longitudinal analysis on Instagram characteristics of Olympic champions. Social Network Analysis and Mining, 12 , 3.

Hou, J., Wang, Y., Zhang, Y., & Wang, D. (2022). How do scholars and non-scholars participate in dataset dissemination on Twitter. Journal of Informetrics, 16 (1), 101223.

Hoang, T. B. N., & Mothe, J. (2018). Predicting information diffusion on Twitter—Analysis of predictive features. Journal of Computational Science, 28 , 257–264.

Munoz, M. M., Rojas-de-Gracia, M.-M., & Navas-Sarasola, C. (2022). Measuring engagement on Twitter using a composite index: An application to social media influencers. Journal of Informetrics, 16 (4), 101323.

Backstrom, L., Huttenlocher, D., Kleinberg, J., & Lan, X. (2006). Group formation in large social networks: membership, growth, and evolution. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’06) (pp. 44–54). Association for Computing Machinery.

Hu, J., Luo, Y., & Yu, J. (2018). An empirical study on selectivity of retweeting behaviors under multiple exposures in social networks. Journal of Computational Science, 28 , 228–235.

Balestrucci, A., De Nicola, R., Petrocchi, M., & Trubiani, C. (2021). A behavioural analysis of credulous Twitter users. Online Social Networks and Media, 23 , 100133.

Bodaghi, A., & Goliaei, S. (2018). A novel model for rumor spreading on social networks with considering the influence of dissenting opinions. Advances in Complex Systems, 21 , 1850011.

Wells, C., Shah, D., Lukito, J., Pelled, A., Pevehouse, J. C., & Yang, J. (2020). Trump, Twitter, and news media responsiveness: A media systems approach. New Media and Society, 22 (4), 659–682.

Yang, D., & Fujimura, S. (2019). What Will Influence customer's engagement the strategies and goals of tweet. IEEE international conference on industrial engineering and engineering management ( IEEM ), pp. 364–368.

Bodaghi, A., & Oliveira, J. (2022). The theater of fake news spreading, who plays which role? A study on real graphs of spreading on Twitter. Expert Systems with Applications, 189 , 116110.

Bodaghi, A., Oliveira, J., & Zhu, J. J. H. (2021). The fake news graph analyzer: An open-source software for characterizing spreaders in large diffusion graphs. Software Impacts. 100182.

Bodaghi, A., Oliveira, J., & Zhu, J. J. H. (2022). The Rumor Categorizer: An open-source software for analyzing rumor posts on Twitter. Software Impacts. 100232.

Zhang, A., Zheng, M., & Pang, B. (2018). Structural diversity effect on hashtag adoption in Twitter. Physica A: Statistical Mechanics and its Applications, 493 , 267–275.

Tian, Y., Tian, H., Cui, Y., Zhu, X., & Cui, Q. (2023). Influence of behavioral adoption preference based on heterogeneous population on multiple weighted networks. Applied Mathematics and Computation, 446 , 127880.

Monster, I., & Lev-Ari, S. (2018). The effect of social network size on hashtag adoption on Twitter. Cognitive Science, 42 (8), 3149–3158.

Rathnayake, C. (2021). Uptake, polymorphism, and the construction of networked events on Twitter. Telematics and Informatics, 57 , 101518.

Bodaghi, A., Goliaei, S., & Salehi, M. (2019). The number of followings as an influential factor in rumor spreading. Applied Mathematics and Computation, 357 , 167–184.

Seymour, B., & McClure, S. M. (2008). Anchors, scales and the relative coding of value in the brain. Current Opinion in Neurobiology, 18 , 173–178.

Murayama, K., Matsumoto, M., Izuma, K., & Matsumoto, K. (2010). Neural basis of the undermining effect of monetary reward on intrinsic motivation. Proceedings of the National Academy of Sciences of USA, 107 , 20911–20916.

Camerer, C. (2010). Removing financial incentives demotivates the brain. Proceedings of the National Academy of Sciences, 107 (49), 20849–20850.

Ajzen, I. (1991). The theory of planned behavior. Organizational Behavior and Human Decision Processes, 50 (2), 179–211.

Wu, T. Y., Xu, X., & Atkin, D. (2020). The alternatives to being silent: Exploring opinion expression avoidance strategies for discussing politics on Facebook. Internet Research, 30 (6), 1709–1729.

Rogers, E. M. (2003). Diffusion of innovations (5th ed.). Simon and Schuster. ISBN 978-0-7432-5823-4.

Straub, E. T. (2009). Understanding technology adoption: Theory and future directions for informal learning. Review of Educational Research, 79 (2), 625–649.

Xu, Z., Lachlan, K., Ellis, L., & Rainear, A. M. (2020). Understanding public opinion in different disaster stages: A case study of Hurricane Irma. Internet Research, 30 (2), 695–709.

Motiwalla, L., Deokar, A. V., Sarnikar, S., & Dimoka, A. (2019). Leveraging data analytics for behavioral research. Information Systems Frontiers, 21 , 735–742.

Mirbabaie, M., Bunker, D., Stieglitz, S., & Deubel, A. (2020). Who sets the tone? Determining the impact of convergence behaviour archetypes in social media crisis communication. Information Systems Frontiers, 22 , 339–351.

Iannacci, F., Fearon, C., & Pole, K. (2021). From acceptance to adaptive acceptance of social media policy change: A set-theoretic analysis of B2B SMEs. Information Systems Frontiers, 23 , 663–680.

Zhao, X., & Wong, C. W. (2023). Automated measures of sentiment via transformer- and lexicon-based sentiment analysis (TLSA). Journal of Computational Social Science .

Erkantarci, B., & Bakal, G. (2023). An empirical study of sentiment analysis utilizing machine learning and deep learning algorithms. Journal of Computational Social Science .

Bodaghi, A., & Oliveira, J. (2024). A financial anomaly prediction approach using semantic space of news flow on twitter. Decision Analytics Journal, 10 , 100422.

Franque, F. B., Oliveira, T., Tam, C., & Santini, F. O. (2020). A meta-analysis of the quantitative studies in continuance intention to use an information system. Internet Research, 31 (1), 123–158.

Feng, Y., & Shah, C. (2022). Unifying telescope and microscope: A multi-lens framework with open data for modeling emerging events. Information Processing and Management, 59 (2), 102811.

Brandt, J., Buckingham, K., Buntain, C., Anderson, W., Ray, S., Pool, J. R., & Ferrari, N. (2020). Identifying social media user demographics and topic diversity with computational social science: A case study of a major international policy forum. Journal of Computational Social Science, 3 , 167–188.

Antonakaki, D., Fragopoulou, P., & Ioannidis, S. (2021). A survey of Twitter research: Data model, graph structure, sentiment analysis and attacks. Expert Systems with Applications, 164 , 114006.

Bodaghi, A. (2019). Newly emerged rumors in Twitter. Zenodo.

The study was funded by City University of Hong Kong Centre for Communication Research (No. 9360120) and Hong Kong Institute of Data Science (No. 9360163). We would also like to express our sincere appreciation to Pastor David Senaratne and his team at Haggai Tourist Bungalow in Colombo, Sri Lanka, for their generous hospitality. Their support provided a conducive environment for the corresponding author to complete parts of this manuscript.

Author information

Authors and affiliations.

School of Computing, Ulster University, Belfast, Northern Ireland, UK

Amirhosein Bodaghi

Department of Media and Communication, City University of Hong Kong, Kowloon, Hong Kong

Jonathan J. H. Zhu

Corresponding author

Correspondence to Amirhosein Bodaghi .

Ethics declarations

Conflict of interest.

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit .

About this article

Bodaghi, A., Zhu, J.J.H. A big data analysis of the adoption of quoting encouragement policy on Twitter during the 2020 U.S. presidential election. J Comput Soc Sc (2024).

Received : 06 January 2024

Accepted : 07 May 2024

Published : 19 May 2024

  • Quote retweets
  • Social media
  • Time series analysis
  • Text analysis
  • Policy intervention

big data in social science research

by Michael Friedrich

As part of a new series profiling participants in SSRC’s Criminal Justice Innovation Fellowship program, Romaine Campbell talks about his research on police and prison policies. This is a cross-posting with Arnold Ventures.

Recently, the Social Science Research Council (SSRC), with support from Arnold Ventures (AV), launched the Criminal Justice Innovation (CJI) Fellowship program, which supports early-career researchers who are exploring what works to make communities safer and the criminal justice system fairer and more effective.

“These CJI fellows will spend the next three years investing in their own policy-relevant research, as well as conducting policy analyses for AV that will directly inform our work,” Jennifer Doleac, executive vice president of criminal justice at AV, says. “We are eager to know if particular policies and programs are working, and this group of researchers will figure that out. I’m thrilled to get to work with these brilliant, talented scholars.”

According to Anna Harvey, president of the SSRC, this new fellowship program will uniquely foster innovative and rigorous causal research on criminal justice policies. “By supporting ‘people, not projects,’ the CJI fellowships will give these exceptional young researchers the time and freedom to pursue novel and creative approaches to evaluating criminal justice policies and practices. We can’t wait to see what they produce,” she says.

In part one of a new series profiling the CJI fellows, AV spoke with Romaine Campbell, a Ph.D. candidate in economics at Harvard University whose work addresses racial disparities in the criminal justice system.

Romaine Campbell: Police Behavior and Community Safety

A labor economist by training, Campbell will produce research as a fellow through the CJI fellowship program over the next two years before joining the faculty at Cornell University’s Brooks School of Public Policy. His research will focus on how federal scrutiny impacts police behavior and community safety, as well as the effects of higher education in prison on the outcomes of people who are incarcerated, among other topics. 

Campbell, who is originally from the Caribbean, says that he has seen how rigorous empirical research can help to explain the things that are important for his community. “A lot of my work looks at how we can improve law enforcement in the United States,” he says. “Policing serves an important role in ensuring the public safety of communities, but increasingly we’re aware of the social costs that can sometimes come with policing. My work examines policies that can help balance the important work that officers do with trying to mitigate the harms that come out of the excesses of policing.”

In 2023, Campbell published a working paper on the results of federal oversight of policing in Seattle. Using administrative data from the Seattle Police Department, the paper found that federal oversight resulted in a 26% reduction in police stops in the city — mostly by reducing stop-and-frisk style stops. Importantly, that reduction had no impact on the rates of serious crime or other community safety measures. 

As part of the new fellowship, Campbell expects to expand his work on the impacts of police oversight. By working with other police departments across the country, he will explore how officers respond to federal investigations, how it affects their behavior, and what types of policing are actually effective for crime reduction. Some policymakers, Campbell notes, have expressed concerns that adding oversight to police departments causes them to pull back from policing, which can damage community safety. As such, policies are needed that reduce the harms of policing while also allowing officers to address serious crime and build trust with the communities they serve. “As our society considers the best ways to improve policing,” he says, “it’s going to be important to document the types of policies that can achieve this without having deleterious effects for communities.” 

Additionally, working in partnership with the Philadelphia District Attorney’s Office, Campbell and colleagues intend to explore the impact of Brady Lists — public-facing records of information about police misconduct, decertification, use-of-force reports, and other metrics — to understand how prosecutors use such information in charging decisions in their cases. 

Separately, Campbell and colleagues plan to launch a project to understand how the provision of higher education in prison affects short- and long-term outcomes of people who are incarcerated, especially their social and economic mobility. He will focus on Iowa, where agreements with the state’s department of corrections, department of education, and workforce development agency will provide him with the necessary data. 

Campbell says that rigorous research is important for decision-making about public policy in the criminal justice system. “When you operate in public policy spaces, you really want to build out evidence-based policy,” he explains. “We can all have our feelings and intuitions about what will happen when a policy goes into effect, but the gold standard should be to implement policies that are supported by data.”

  1. Big Data and Social Science

    Since the first edition of this book came out we have been fortunate to train over 450 participants in the Applied Data Analytics classes, resulting in increased data analytics capacity, both in terms of human and technical resources. What we learned in delivering these classes greatly influenced the 2nd edition. We also added an entire new ...

  2. Big Data in Social Research

    These data often have major differences in their origins, structure, and attributes compared to the data typically used in social science research. Big Data Management: The big data management phase of the BDaaP framework involves both processes and supporting technologies for acquiring, storing, preparing, and retrieving the information for ...

  3. Enhancing big data in the social sciences with crowdsourcing: Data

    Introduction. Big data and computational approaches present a potential paradigm shift in the social sciences, particularly since they allow for measuring human behaviors that cannot be observed with survey research [1, 2, 3]. In fact, the transformative potential of big data for the social sciences has been compared to how "the invention of the telescope revolutionized the study of the ...

  4. Methods for big data in social sciences

    In this special issue "Methods for big data in social sciences" Luis Martinez-Uribe shows that collections prepared by libraries can be used as big data. He uses network coincidence analysis, a method for combining co-incidence and social network analyses, on more than three million records, which represent 800,000 person names and 300,000 ...

  5. Causation, Correlation, and Big Data in Social Science Research

    The emergence of big data offers not only a potential boon for social scientific inquiry, but also raises distinct epistemological issues for this new area of research. Drawing on interviews conducted with researchers at the forefront of big data research, we offer insight into questions of causal versus correlational research, the use of ...

  7. Big Data and Social Science: Data Science Methods and Tools for

    The book belongs to the CRC Series of Statistics in the Social and Behavioral Sciences, and presents the 2nd edition of its original version published in 2016. It is devoted to the modern developments of the analytics tools for operating with big data in application to social science problems.

  8. The data revolution in social science needs qualitative research

    We see at least seven reasons why qualitative research will be essential to 'big data' social science (Fig. 1). Fig. 1: Qualitative research and big data. Seven roles for qualitative research ...

  9. Big Data, new epistemologies and paradigm shifts

    For positivistic scholars in the social sciences, Big Data offers a significant opportunity to develop more sophisticated, wider-scale, finer-grained models of human life. ... With respect to the sciences, access to Big Data and new research praxes has led some to proclaim the emergence of a new fourth paradigm, one rooted in data-intensive ...

  10. Moving back to the future of big data-driven research: reflecting on

    The conclusion cautions against the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual ...

  11. We Have Big Data, But Do We Need Big Theory? Review-Based Remarks on an

    From a philosophy-of-social-science perspective on big data, some researchers have discussed a paradigmatic shift toward "new empiricism" (based on a stronger focus on data evidence; Arbia 2021) or "digital positivism" (related to computer-generated evidence about the world; Fuchs 2017). More specifically, Chin-Yee and Upshur (2019) have identified three major philosophical problems ...

  12. Big data in social and psychological science: theoretical and

    Big data presents unprecedented opportunities to understand human behavior on a large scale. It has been increasingly used in social and psychological research to reveal individual differences and group dynamics. There are a few theoretical and methodological challenges in big data research that require attention. In this paper, we highlight four issues, namely data-driven versus theory-driven ...

  13. Big Data and Social Science Data Science Methods and Tools for Research

    Description. Big Data and Social Science: Data Science Methods and Tools for Research and Practice, Second Edition shows how to apply data science to real-world problems, covering all stages of a data-intensive social science or policy project. Prominent leaders in the social sciences, statistics, and computer science as well as the field of ...

  14. Big Data: Methodological Challenges and Approaches for Sociological

    The emergence of Big Data is both promising and challenging for social research. This article suggests that realising this promise has been restricted by the methods applied in social science research, which undermine our potential to apprehend the qualities that make Big Data so appealing, not least in relation to the sociology of networks and flows.

  15. Big Data in Computational Social Science and Humanities

    This edited volume focuses on big data implications for computational social science and humanities from management to usage. The first part of the book covers geographic data, text corpus data, and social media data, and exemplifies their concrete applications in a wide range of fields including anthropology, economics, finance, geography, history, linguistics, political science, psychology ...

  16. Opportunities and challenges of big data for the social sciences: The

    These data in conjunction with genome-wide genotype data and social science measures can reveal new insights to important research questions, for example, which genes of interest are subject to social regulation, how the social environment provokes the dynamics, and what social, psychological and biological mechanisms mediate the effects.

  17. The role of administrative data in the big data revolution in social

    1. Introduction. Big data is heralded as a powerful new resource for social science research. The excitement around big data emerges from the recognition of the opportunities it may offer to advance our understanding of human behaviour and social phenomena in a way that has never been possible before (see for example Burrows and Savage, 2014, Kitchin, 2014a, Kitchin, 2014b, Manovich, 2011 ...

  18. Ethical Issues in Social Science Research Employing Big Data

    This paper analyzes the ethics of social science research (SSR) employing big data. We begin by highlighting the research gap found on the intersection between big data ethics, SSR and research ethics. We then discuss three aspects of big data SSR which make it warrant special attention from a research ethics angle: (1) the interpretative character of both SSR and big data, (2) complexities of ...

  19. Scientific Research and Big Data

    Scientific Research and Big Data. First published Fri May 29, 2020. Big Data promises to revolutionise the production of knowledge within and beyond science, by enabling novel, highly efficient ways to plan, conduct, disseminate and assess research. The last few decades have witnessed the creation of novel ways to produce, store, and analyse ...

  20. Big Data Social Science

    Building a Research Infrastructure for Harnessing the Data Revolution and Its Social Implications. Big Data Social Science has three desired goals to better support big data and related research: (1) Expand research support. (2) Help build an intellectual community around this work. (3) Help expand data science teaching.

  21. Big Data in social science research. What matters about the size?

    The term Big Data is often used in social science research. Borrowing Big Data's 3Vs (Volume, Variety and Velocity) from Doug Laney, this article tries to explain big data within the research ...

  22. Ethical Issues in Social Science Research Employing Big Data

    This paper analyzes the ethics of social science research (SSR) employing big data. We begin by highlighting the research gap found on the intersection between big data ethics, SSR and research ethics. We then discuss three aspects of big data SSR which make it warrant special attention from a research ethics angle: (1) the interpretative ...

  23. Conceptualizing Big Social Data

    Big data science. Big data science refers to a field that processes and manages high-volume, high-velocity and high-variety data in order to extract reliable and valuable insights [50]. Big Data aims to serve large-scale digital applications and computational systems. Therefore, from a BSD perspective, Big Data science provides solutions to ...

  24. What is Big Data Analytics?

    What is big data analytics? Big data analytics refers to the systematic processing and analysis of large amounts of data and complex data sets, known as big data, to extract valuable insights. Big data analytics allows for the uncovering of trends, patterns and correlations in large amounts of raw data to help analysts make data-informed decisions.

  25. Big Data and Infectious Disease Epidemiology: Bibliometric Analysis and

    Recent years have witnessed the rapid emergence of big data and data science research, propelled by the increasing availability of digital traces. The growing availability of electronic records and passive data generated by social media, the internet, and other digital sources can be mined for pattern discoveries and knowledge extraction ...

  26. [2405.14555] Subtle Biases Need Subtler Measures: Dual Metrics for

    Research on Large Language Models (LLMs) has often neglected subtle biases that, although less apparent, can significantly influence the models' outputs toward particular social narratives. This study addresses two such biases within LLMs: representative bias, which denotes a tendency of LLMs to generate outputs that mirror the experiences of certain identity groups, and ...

  27. X data for academic research

    Learn the fundamentals of using X data for academic research with tailored get-started guides. Or, take your current use of the API further with tutorials, code samples, and tools. Curated datasets. Free, no-code datasets are intended to make it easier for academics to study topics that are of frequent interest to the research community.

  28. A big data analysis of the adoption of quoting encouragement ...

    This research holds significance for the fields of social media and communication studies through its comprehensive evaluation of Twitter's quoting encouragement policy enacted during the 2020 U.S. presidential election. In addressing a notable gap in the literature, this study introduces a framework that assesses both the quantitative and qualitative effects of specific platform-wide policy ...

  29. "The gold standard should be to implement policies that are supported

    The Social Science Research Council fosters innovative research, nurtures new generations of social scientists, deepens how inquiry is practiced within and across disciplines, and mobilizes necessary knowledge on important public issues. ... Data Fluencies - Research grants and convenings to identify data-centric practices that advance well ...

  30. A Book Outlines the Social Study of Science

    By Eve Glasberg, May 20, 2024. Until the middle of the 20th century, few thought of science as a social system, instead seeing scientific discovery as the work of individual geniuses. Columbia's Department of Sociology played a pivotal role in advancing the social study of science. Researchers from the Columbia program analyzed how science ...