Big Data and Social Science

Data science methods and tools for research and practice.

Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter and Julia Lane

Preface to the 2nd edition

The class on which this book is based was created in response to a very real challenge: how to introduce new ideas and methodologies about economic and social measurement into a workplace focused on producing high-quality statistics. Since the first edition of this book came out, we have been fortunate to train over 450 participants in the Applied Data Analytics classes, resulting in increased data analytics capacity, in terms of both human and technical resources. What we learned in delivering these classes greatly influenced the second edition. We also added an entirely new chapter on Bias and Fairness in Machine Learning, and reorganized the book chapters somewhat.

As with any book, there are many people to be thanked. The Coleridge Initiative teams at New York University, the University of Maryland, and the University of Chicago were critical in shaping the format and structure; we are particularly grateful to Clayton Hunter, Jody Derezinski Williams, Graham Henke, Jonathan Morgan, Drew Gordon, Avishek Kumar, Brian Kim, Christoph Kern, and all the book chapter authors for their contributions to the second edition.

We also thank the critical reviewers solicited from CRC Press and everyone from whom we got revision suggestions online, in particular Stas Kolenikov, who carefully examined the first edition and suggested updates. We owe a great debt of gratitude to the project editor, Vaishali Singh, and the publisher, Rob Calver, for their hard work and dedication.

Open access | Published: 04 August 2020

Moving back to the future of big data-driven research: reflecting on the social in genomics

Melanie Goisauf (ORCID: 0000-0002-3909-8071), Kaya Akyüz (ORCID: 0000-0002-2444-2095) & Gillian M. Martin (ORCID: 0000-0002-5281-8117)

Humanities and Social Sciences Communications, volume 7, Article number: 55 (2020)


Subjects: Science, technology and society

With the advance of genomics, specific individual conditions have received increased attention in the generation of scientific knowledge. This spans the extremes of the aim of curing genetic diseases and identifying the biological basis of social behaviour. In this development, the ways knowledge is produced have gained significant relevance, as the data-intensive search for biology/sociality associations has repercussions on doing social research and on theory. This article argues that an in-depth discussion and critical reflection on the social configurations that are inscribed in, and reproduced by, genomic data-intensive research is urgently needed. This is illustrated by debating a recent case: a large-scale genome-wide association study (GWAS) on sexual orientation that suggested a partial genetic basis for same-sex sexual behaviour (Ganna et al. 2019b). This case is analysed from three angles: (1) the demonstration of how, in the process of genomics research, societal relations, understandings and categorizations are used and inscribed into social phenomena and outcomes; (2) the exploration of the ways that (big) data-driven research is constituted by increasingly moving away from theory and from the methodological generation of theoretical concepts that foster the understanding of societal contexts and relations (Kitchin 2014a); and (3) the demonstration of how the assumption of being 'free from theory' in this case does not mean free of choices made, which are themselves restricted by the data that are available. In questioning how key sociological categories are incorporated into a wider scientific debate on genetic conditions and knowledge production, the article shows how underlying classifications and categorizations, which are inherently social in their production, can have wide-ranging implications. The conclusion cautions against the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.


Introduction

With the advance of genomic research, specific individual conditions received increased attention in scientific knowledge generation. While understanding the genetic foundations of diseases has become an important driver for the advancement of personalized medicine, the focus of interest has also expanded from disease to social behaviour. These developments are embedded in a wider discourse in science and society about the opportunities and limits of genomic research and intervention. With the emergence of the genome as a key concept for ‘life itself’, understandings of health and disease, responsibility and risk, and the relation between present conditions and future health outcomes have shifted, impacting also the ways in which identities are conceptualized under new genetic conditions (Novas and Rose 2000 ). At the same time, the growing literature of postgenomics points to evolving understandings of what ‘gene’ and ‘environment’ are (Landecker and Panofsky 2013 ; Fox Keller 2014 ; Meloni 2016 ). The postgenomic genome is no longer understood as merely directional and static, but rather as a complex and dynamic system that responds to its environment (Fox Keller 2015 ), where the social as part of the environment becomes a signal for activation or silencing of genes (Landecker 2016 ). At the same time, genetic engineering, prominently known as the gene-editing technology CRISPR/Cas9, has received considerable attention, but also caused concerns regarding its ethical, legal and societal implications (ELSI) and governance (Howard et al. 2018 ; Jasanoff and Hurlbut 2018 ). Taking these developments together, the big question of nature vs. nurture has taken on a new significance.

Studies which aim to reveal how biology and culture relate to each other appear frequently and pursue a genomic re-thinking of social outcomes and phenomena, such as educational attainment (Lee et al. 2018) or social stratification (Abdellaoui et al. 2019). Yet we also witness very controversial applications of biotechnology, such as the first known case of human germline editing by He Jiankui in China, which shook the scientific community, prompting widespread protest and uncertainty about the future of gene-editing and its use, and instigating calls for public consensus to (re-)set boundaries on what is editable (Morrison and de Saille 2019).

Against this background, this article debates a particular case that appeared within the same timeframe as these developments: a large-scale genome-wide association study (GWAS) on sexual orientation Footnote 1, which suggested a partial genetic basis for same-sex sexual behaviour (Ganna et al. 2019b). Some scientists have claimed for years that sexual orientation is partly heritable and have tried to identify a genetic basis for it (Hamer et al. 1993); however, this was the first time that genetic variants were identified as statistically significant and replicated in an independent sample. We consider this GWAS not only by questioning the ways genes are associated with "the social" within this research, but also by exploring how the complexity of the social is reduced through specific data practices in research.

The sexual orientation study also constitutes an interesting case for reflecting on how knowledge is produced at a time when the data-intensive search for biology/sociality associations has repercussions on doing social research and on theory (Meloni 2014). Large amounts of genomic data are needed to identify genetic variations and to find correlations with different biological and social factors. The rise of the genome corresponds to the rise of big data, as the collection and sharing of genomic data gain power with the development of big data analytics (Parry and Greenhough 2017). A growing number of correlations, e.g. in the genomics of educational attainment (Lee et al. 2018; Okbay et al. 2016), are being found that link the genome to the social, increasingly blurring the established biological/social divide. These could open up new ways of understanding life and underpin the importance of culture, while, paradoxically, they may also carry the risk of a new genetic determinism and essentialism. The changing understanding of the now molecularised and datafied body also illustrates the changing significance of empirical research and sociology (Savage and Burrows 2007) in the era of postgenomics and 'datafication' (Ruckenstein and Schüll 2017). These developments are situated within methodological debates in which the social sciences often appear through the perspective of ELSI.

As the field of genomics is progressing rapidly and intervention in the human genome is no longer science fiction, we argue that it is important to discuss and reflect now on the social configurations that are inscribed in, and reproduced by, genomic data-driven research. These may co-produce the conception of certain potentially editable conditions, i.e. create new, and reproduce existing, classifications that are largely shaped by societal understandings of difference and order. Such definitions could have real consequences for individuals and societies, as Thomas and Thomas (1929) remind us, and mark what has been described as an epistemic shift in biomedicine from the clinical gaze to the 'molecular gaze', where the processes of "medicalisation and biomedicalisation both legitimate and compel interventions that may produce transformations in individual, familial and other collective identities" (Clarke et al. 2013, p. 23). While Science and Technology Studies (STS) has demonstrated how science and society are co-produced in research (Jasanoff 2004), we want to use the momentum of the current discourse to critically reflect on these developments from three angles: (1) we demonstrate how, in the process of genomics research, societal relations, understandings and categorizations are used and inscribed into social phenomena and outcomes; (2) we explore the ways that (big) data-driven research is constituted by increasingly moving away from theory and from the methodological generation of theoretical concepts that foster the understanding of societal contexts and relations (Kitchin 2014a); and (3) using the GWAS case in focus, we show how the assumption of being 'free from theory' (Kitchin 2014a) in this case does not mean free of choices made, choices which are themselves restricted by the data that are available. We highlight Griffiths' (2016) contention that the material nature of genes, their impacts on the biological makeup of individuals and their socially and culturally situated behaviour are not deterministic, and need to be understood within the dynamic, culturally and temporally situated context within which knowledge claims are made. We conclude by making the important point that ignoring the social may lead to a distorted, datafied, genomised body which ignores the key fact that "genes are not stable but essentially malleable" (Prainsack 2015) and that this 'malleability' is rooted in the complex interplay between biological and social environments.

From this perspective, the body is understood through the lens of embodiment, considering humans ‘live’ their genome within their own lifeworld contexts (Rehmann-Sutter and Mahr 2016 ). We also consider this paper as an intervention into the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.

In the following reflections, we proceed step by step: First, we introduce the case of the GWAS on same-sex sexual behaviour, as well as its limits, context and impact. Second, we recall key sociological theory on categorizations and their implications. Third, we discuss the emergence of a digital-datafication of scientific knowledge production. Finally, we conclude by cautioning against the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.

Studying sexual orientation: The case of same-sex sexual behaviour

Currently, a number of studies at the intersection of genetic and social conditions are appearing on the horizon. Just as in the examples we have already mentioned, such as those on educational attainment (Lee et al. 2018) or social stratification (Abdellaoui et al. 2019), it is important to note that the only limit to such studies is the availability of the data itself. In other words, once the data are available, there is always the potential that they will eventually be used. This said, an analysis of the entirety of genomic research on social outcomes and behaviour is beyond the scope of this article. We therefore exemplify our argument with reference to the research on the genetics of same-sex sexual behaviour.

Based on a sample of half a million individuals of European ancestry, the first large-scale GWAS of its kind claims that five genetic variants contribute to the assessed "same-sex sexual behaviour" (Ganna et al. 2019b). Among these variants, two are associated only with male–male sexual behaviour, one only with female–female sexual behaviour, and the remaining two with both. The data that led to this analysis were sourced from biobanks/cohorts with different methods of data collection. The authors conclude that these genetic variations are not predictive of sexual orientation: not only because genetics is supposedly only part of the picture, but also because the variations account for only a small part (<1% of the variance in same-sex sexual behaviour, p. 4) of the approximated genetic basis (8–25% of the variance in same-sex sexual behaviour) that may be identified with large sample sizes (p. 1). The study is an example of how the 'gay gene' discourse that has been around for years is transformed by the data accumulating in biobanks and the consequent genomic analysis, offering only one facet of a complex social phenomenon: same-sex sexual behaviour.
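To make these magnitudes concrete, the gap between the five replicated variants and the 8–25% aggregate estimate can be illustrated with the standard additive-model expression for the variance explained by a single biallelic SNP; the numerical values below are our own illustrative assumptions, not figures taken from the study.

$$ r^{2} = 2\,p\,(1-p)\,\beta^{2} $$

where $p$ is the allele frequency and $\beta$ the per-allele effect on a standardized trait. For a hypothetical common variant with $p = 0.4$ and $\beta = 0.02$, this gives $2 \times 0.4 \times 0.6 \times 0.02^{2} \approx 0.0002$, i.e. roughly 0.02% of the variance; five such variants jointly explain on the order of 0.1%, consistent with the reported <1%, whereas the 8–25% figure aggregates the small contributions of many variants that individually never reach genome-wide significance.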

The way the GWAS was conducted was not novel in terms of data collection. Genome-wide studies of similar scale, e.g. on insomnia (Jansen et al. 2019) or blood pressure (Evangelou et al. 2018), often rely on data already collected in biobanks rather than trying to collect hundreds of thousands of individuals' DNA from scratch. Furthermore, in line with wider developments, the study was preregistered Footnote 2 with an analysis plan for the data to be used by the researchers. Unlike other GWASes, however, the researchers partnered with an LGBTQIA+ advocacy group (GLAAD) and a science communication charity (Sense About Science), where individuals beyond the research team interpreted the findings and discussed how to convey the results Footnote 3. Following these engagements, the researchers produced a website Footnote 4 with anticipated frequently asked questions as well as a video about the study, highlighting what it does and does not claim.

Despite efforts to prevent the study from drifting into genetically deterministic and discriminatory interpretations, the study has been criticized by many Footnote 5. Indeed, the controversial "How gay are you?" Footnote 6 app on the GenePlaza website utilized the findings of the study, which in turn raised alarm bells and, ultimately, was taken down after much debate. The application, however, showed how rapidly such findings can translate into individualized systems of categorization, and consequently feed into and be fed by the public imaginary. One of the study authors calls for the research to continue, noting that "[s]cientists have a responsibility to describe the human condition in a more nuanced and deeper way" (Maxmen 2019, p. 610). Critics, however, note that the context in which data were collected from individuals may have influenced the findings; for instance, past developments (i.e. the decriminalization of homosexuality, the HIV/AIDS epidemic, and the legalization of same-sex marriage) are relevant to understanding the UK Biobank's donor profile, and if the GWAS were redone according to the birth year of the individuals, different findings could have emerged (Richardson et al. 2019, p. 1461).

It has been pointed out that such research should be assessed by a competent ethical review board according to its potential risks and benefits (Maxmen 2019, p. 610), in addition to the review and approval by the UK Biobank Access Sub-Committee (Ganna et al. 2019a, p. 1461). Another ethical concern raised by critics is that the informed consent form of the UK Biobank does not specify that data could be used for such research, since "homosexuality has long been removed from disease classifications" and the broad consent forms allow only "health-related research" (Holm and Ploug 2019, p. 1460). We do not want to make a statement here for or against broad consent. However, we argue that these discussions about informed consent showcase the complexities related to the secondary use of data in research. Similarly, the 'gay gene' app developed in the wake of the sexual orientation study revealed the difficulty of controlling how the produced knowledge may be used, including in ways that are openly denounced by the study authors.

To the best of our knowledge, no similar genome-wide studies on sexual orientation have been published and, while we acknowledge the limitations associated with focusing on a single case in our discussion, we see this case as relevant to opening up the following question: how are certain social categorizations incorporated into knowledge production practices? We want to answer this by first revisiting some of the fundamental sociological perspectives on categorizations and the social implications these may have.

Categorizing sex, gender, bodies, disease and knowledge

Sociological perspectives on categorizations

Categorizations and classifications take a central role in the sociology of knowledge, social stratification and data-based knowledge production. Categories like gender, race, sexuality and class (and their intersection, see Crenshaw 1989) have become key classifications for the study of societies and for understanding the reproduction of social order. One of the most influential theories about the intertwining of categories like gender and class with power relations was formulated by Bourdieu (2010, 2001). He claimed that belonging to a certain class or gender is an embodied practice that ensures the reproduction of a social structure which is shaped by power relations. The position of subjects within this structure reflects their acquired cultural capital, such as education. The incorporated dispositions and schemes of perception, appreciation and classification that make up the individual's habitus are shaped by social structure, which actors in turn reproduce in their practices. One key mechanism of social categorization is gender classification. The gender order appears to lie in the 'nature of things' of biologically different bodies, whereas it is in fact an incorporated social construction that reflects and constitutes power relations. Bourdieu's theory links the function of structuring classifications with embodied knowledge and demonstrates that categories of understanding are pervaded by societal power relations.

In a similar vein, Foucault (2003, 2005) describes the intertwining of ordering classifications, bodies and power in his study of the clinic. Understandings of and knowledge about the body follow a specific way of looking at it: the 'medical gaze' that separates the patient's body from identity and distinguishes the healthy from the diseased, a process that is itself pervaded by power differentials. Such classifications evolved historically. Foucault reminds us that all periods in history are characterized by specific epistemological assumptions that shape discourses and manifest in modalities of order that make certain kinds of knowledge, for instance scientific knowledge, possible. The unnoticed "order of things", as well as the social order, is implemented in classifications. Such categorizations also evolved historically in the discourse about sexuality, in particular, as Foucault pointed out writing in the late 1970s, in distinguishing the sexuality of married couples from other forms, such as homosexuality (Foucault 1998).

Bourdieu and Foucault offer two influential approaches within the wider field of the sociology of knowledge that provide a theoretical framework for how categorizations and classifications structure the world in conjunction with social practice and power relations. Their work demonstrates that such structuration is never free from theory, i.e. classifications do not exist prediscursively but are embedded within a certain temporal and spatial context that constitutes 'situated knowledge' (Haraway 1988). Consequently, classifications create (social) order that cannot be understood as 'naturally' given but as the result of relational social dynamics embedded in power differentials.

Feminist theory in the 1970s emphasized the inherently social dimension of male and female embodiment, distinguishing between biological sex and socially rooted gender. This distinction built the basis for a variety of approaches that examined gender as a social phenomenon, as something that is (re-)constructed in social interaction, shaped by collectively held beliefs and normative expectations. Consequently, the difference between men and women was no longer understood simply as a given biological fact, but as something that is, also, a result of socialization and relational exchanges within social contexts (see, e.g., Connell 2005; Lorber 1994). Belonging to a gender or sex is a complex practice of attribution, assignment, identification and, consequently, classification (Kessler and McKenna 1978). The influential concept of 'doing gender' emphasized that not only gender but also the assignment of sex is based on socially agreed-upon biological classification criteria that form the basis for placing a person in a sex category, which needs to be practically sustained in everyday life. The analytical distinction between sex and gender eventually became implausible, as it obscures the process in which the body itself is subject to social forces (West and Zimmerman 1991).

In a similar way, sexual behaviour and sexuality are also shaped by society, as societal expectations influence sexual attraction, in many societies within the normative boundaries of gender binary and heteronormativity (Butler 1990). This also had consequences for deviations from this norm, resulting, for example, in the medicalisation of homosexuality (Foucault 1998).

Reference to our illustrative case study on the recently published research into the genetic basis of sexuality brings the relevance of this theorization into focus. The study cautions against the ‘gay gene’ discourse, the use of the findings for prediction, and genetic determinism of sexual orientation, noting “the richness and diversity of human sexuality” and stressing that the results do not “make any conclusive statements about the degree to which ‘nature’ and ‘nurture’ influence sexual preference” (Ganna et al. 2019b , p. 6).

Coming back to categorizations, more recent approaches from STS are also based on the assumption that classifications are a "spatio-temporal segmentation of the world" (Bowker and Star 2000, p. 10), and that classification systems are, similar to concepts of gender theory (e.g. Garfinkel 1967), consistent, mutually exclusive and complete. The International Classification of Diseases (ICD), a classification scheme of diseases based on their statistical significance, is an example of such a historically grown knowledge system. How the ICD is utilized in practice points to the ethical and social dimensions involved (Bowker and Star 2000). Such approaches help to unravel current epistemological shifts in medical research and intervention, including the removal of homosexuality from the disease classification half a century ago.

Re-classifying diseases in tandem with genetic conditions creates new forms of 'genetic responsibility' (Novas and Rose 2000). For instance, this may result in a change of the 'sick role' (described early on in Parsons 1951), creating new obligations not only for the diseased but also for currently healthy persons in relation to potential futures. Such genetic knowledge is increasingly produced using large-scale genomic databases and creates new categories based on genetic risk; consequently, it may result in new categories of individuals who are 'genetically at risk' (Novas and Rose 2000). The question now is how these new categories will alter, structure or replace evolved categories in the construction of the social world and of medical practice.

While advancement in genomics is changing understandings of bodies and diseases, the meanings of certain social categories for medical research remain rather stable. Developments of personalized medicine go along with “the ‘re-inscription’ of traditional epidemiological categories into people’s DNA” and adherence to “old population categories while working out new taxonomies of individual difference” (Prainsack 2015 , pp. 28–29). This, again, highlights the fact that knowledge production draws on and is shaped by categories that have a political and cultural meaning within a social world that is pervaded by power relations.

From categorization to social implication and intervention

While categorizations are inherently social in their production, their use in knowledge production has wide-ranging implications. Such is the case with the geneticisation of sexual orientation, an issue that has both troubled and comforted LGBTQIA+ communities. Despite the absence of an identified gene, the 'gay gene' has been part of societal discourse. Such circulation disseminates an unequal emphasis on biologized interpretations of sexual orientation, which may be portrayed differently in the media and appeal to groups of opposing views in contrasting ways (Conrad and Markens 2001). Geneticisation, especially through the media, moves sexual orientation into an oppositional framework between individual choice and biological consequence (Fausto-Sterling 2007), and there have been mixed opinions within LGBTQIA+ communities as to whether this would resolve the moralization of sexual orientation or mark a move back to its medicalisation (Nelkin and Lindee 2004). Thus, while some activists support geneticisation, others resist it and work against the potential medicalisation of homosexuality (Shostak et al. 2008). The ease of communicating to the general public a simple genetic basis for complex social outcomes, which are genetically more complex than reported, contributes to the geneticisation process, while scientific failures to replicate 'genetic basis' claims do not get reported (Conrad 1999). In other words, while finding a genetic basis becomes entrenched as an idea in the public imaginary, research showing the opposite does not get an equal share of media and societal discourse; nor, of course, does the social sciences' critique of knowledge production that has been discussed for decades.

A widely, and often quantitatively, studied aspect of the geneticisation of sexual orientation is how it plays out in the broader understanding of sexual orientation in society. While there are claims that geneticisation of sexual orientation can result in a depoliticization of identities (O'Riordan 2012), it may at the same time lead to a polarization of society. According to social psychologists, genetic attributions to conditions are likely to lead to perceptions of immutability, specificity in aetiology, homogeneity and discreteness, as well as to naturalistic fallacy (Dar-Nimrod and Heine 2011). Despite the multitude of surveys suggesting that belief in a genetic basis of homosexuality correlates with acceptance, some studies suggest that learning about genetic attributions of homosexuality can be polarizing and confirmatory of previously held negative or positive attitudes (Boysen and Vogel 2007; Mitchell and Dezarn 2014). Such conclusions can be taken as a caution that, just as scientific knowledge production is social, its consequences are, too.

Looking beyond the case

We want to exemplify this argument by taking a detour to another case where the intersection between scientific practice, knowledge production and the social environment is of particular interest. While we have discussed the social implications of geneticisation with a focus on sexual orientation, recent developments in biomedical sciences and biotechnology also have the potential to reframe old debates in entirely different ways. For instance, while 'designer babies' were only an imaginary concept until recently, the facility and affordability of processes such as in vitro selection of a baby's genotype and germline genome editing have potentially important impacts in this regard. When the CRISPR/Cas9 technique was developed for rapid and easy gene editing, both the hopes and the worries associated with its use were high. Martin and others (2020, pp. 237–238) claim gene editing is causing both disruption within the postgenomic regime, specifically to its norms and practices, and the convergence of various biotechnologies such as sequencing and editing. Against this background, He Jiankui's announcement in November 2018 through YouTube Footnote 7 that twins had been born with edited genomes was an unwelcome surprise for many. This unexpected move may have hijacked the discussions on the ethical, legal and societal implications of human germline genome-editing, but it also rang alarm bells across the globe about similar "rogue" scientists planning experimentation with the human germline (Morrison and de Saille 2019). The facility to conduct germline editing is, logically, only one step away from 'correcting', and if there is a correction, that implies a return to a normative state. He's construction of HIV infection as a genetic risk can be read as a placeholder for numerous questions about human germline editing: What variations are "valuable" enough to warrant a change in the germline? For instance, there are plans by Denis Rebrikov in Russia to genome-edit embryos to 'fix' a mutation that causes congenital deafness (Cyranoski 2019). If legalized, what limits would be applied and who would be able to afford such techniques? At a time when genomics research into human sociality is booming, would the knowledge currently produced in this field and others translate into 'corrective' genome-editing? Who would decide?

The science itself is still unclear at this stage: for many complex conditions, using gene editing to change one allele to another often has only a minuscule effect, since numerous alleles together may affect a phenotype, while a single allele may affect multiple phenotypes. In another GWAS case, social genomicists claim there are thousands of variations that influence a particular social outcome such as educational attainment (Lee et al. 2018), each having a minimal effect. It has also been shown in the last few years that, as the same study is conducted with ever larger samples, more genomic variants become associated with the social outcome, i.e. 74 single nucleotide polymorphisms (SNPs) associated with the outcome in a sample size of 293,723 (Okbay et al. 2016) and 1271 SNPs associated with the outcome in a sample size of 1.1 million individuals (Lee et al. 2018).
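The pattern of ever more variants reaching significance as samples grow follows directly from statistical power. The short sketch below is our own illustration, not part of the cited studies: it uses the common approximation that the 1-degree-of-freedom association test has a non-centrality parameter of roughly N times the fraction of variance a variant explains, with the per-variant values chosen hypothetically.

# Illustrative GWAS power calculation (our sketch, not from the cited studies).
# Approximation: the 1-df chi-square association statistic for a variant explaining
# a fraction r2 of trait variance has non-centrality parameter ~ N * r2.
from scipy.stats import chi2, ncx2

ALPHA = 5e-8                        # conventional genome-wide significance threshold
CRIT = chi2.ppf(1 - ALPHA, df=1)    # critical value of the test statistic

def power(n, r2):
    """Approximate power to detect a variant explaining r2 of the variance with n samples."""
    return ncx2.sf(CRIT, df=1, nc=n * r2)

for n in (300_000, 500_000, 1_100_000):   # sample sizes in the range discussed above
    for r2 in (1e-4, 5e-5, 2e-5):         # hypothetical per-variant variance explained
        print(f"N={n:>9,}  r2={r2:.0e}  power={power(n, r2):.2f}")

Variants explaining a few hundredths of a percent of the variance are essentially undetectable with 300,000 participants but become detectable with a million, which is why the count of significant SNPs grows with sample size rather than reflecting any change in the underlying biology.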

Applying this reasoning to the GWAS on same-sex sexual behaviour, it is highly probable that its findings will be superseded in the following years by similar studies with bigger data, increasing the number of associations.

A genomic re-thinking?

The examples outlined here have served to show how focusing the discussion on "genetic determinism" is fruitless, considering the complexity of knowledge production practices and how the produced knowledge can both mirror social dynamics and shape them further. A genomic rethinking of the social necessitates a new formulation of social equality in which genomes are also relevant. Within the work of social genomics researchers, there has been cautious optimism about the contribution of findings from genomics research to understanding the social outcomes of policy change (Conley and Fletcher 2018; Lehrer and Ding 2019). Two fundamental thoughts govern this thinking. First, a genetic basis is not to be equated with fate; in other words, 'genetic predispositions' make sense only within the broader social and physical environmental frame, which often allows room for intervention. Second, genetics often relates to the heterogeneity of individuals within a population, in ways that the same policy may be positive, neutral or negative for different individuals due to their genes. In this respect, knowledge gained via social genomics may be imagined as a basis for a more equal society in 'uncovering' invisible variables, while, paradoxically, it may also be a justification for the exclusion of certain groups. For example, a case that initially raised the possibility that policies affect individuals differently because of their genetic background was a genetic variant correlated with being unaffected by tax increases on tobacco (Fletcher 2012). The study suggested that raising taxes may be an ineffective tool for lowering smoking rates below a certain level, since those who continue to smoke may be those who cannot easily stop due to their genetic predisposition to smoking. Similar ideas could also apply to a diverse array of knowledge produced in social genomics, where policies may come under scrutiny according to how they are claimed to variably influence the members of a society due to their genetics.

Datafication of scientific knowledge production

From theory to data-driven science

More than a decade has gone by since Savage and Burrows ( 2007 ) described a crisis in empirical research, where the well-developed methodologies for collecting data about the social world would become marginal as such data are being increasingly generated and collected as a by-product of daily virtual transactions. Today, sociological research faces a widely datafied world, where (big) data analytics are profoundly changing the paradigm of knowledge production, as Facebook, Twitter, Google and others produce large amounts of socially relevant data. A similar phenomenon is taking place through opportunities that public and private biobanks, such as UK Biobank or 23andMe, offer. Crossing the boundaries of social sciences and biological sciences is facilitated through mapping correlations between genomic data, and data on social behaviour or outcomes.

This shift from theory to data-driven science misleadingly implies a purely inductive knowledge production, neglecting the fact that data is not produced free of preceding theoretical framing, methodological decisions, technological conditions and the interpretation of correlations—i.e. an assemblage situated within a specific place, time, political regime and cultural context (Kitchin 2014a ). It glosses over the fact that data cannot simply be treated as raw materials, but rather as “inherently partial, selective and representative”, the collection of which has consequences (Kitchin 2014b , p. 3). How knowledge of the body is generated starts with how data is produced and how it is used and mobilized. Through sequencing, biological samples are translated into digital data that are circulated and merged and correlated with other data. With the translation from genes into data, their meaning also changes (Saukko 2017 ). The kind of knowledge that is produced is also not free of scientific and societal concepts.

Categorical variables assigned to individual genomes have become important for genomic research and are impacting the ways in which identities are conceptualized under (social) genomic conditions. These characteristics include those of social identity, such as gender, ethnicity, educational and socioeconomic status. They are often used for the study of human genetic variation and individual differences with the aim of advancing personalized medicine, and they are based on demographic and ascribed social characteristics.

The sexual orientation study that is central to this paper can be read as a case where such categories intersect with the mode of knowledge production. As the largest contributor of data to the study, the UK Biobank data used in this research are revealing, since they are based on the answer to the following question: "Have you ever had sexual intercourse with someone of the same sex?", accompanied by the statement "Sexual intercourse includes vaginal, oral or anal intercourse." Footnote 8.

Furthermore, the authors acknowledge having made numerous reductive assumptions and that their study has methodological limitations. For instance, Ganna et al. (2019b) acknowledge both within the article (p. 1) and on an accompanying website Footnote 9 that the research is based on a binary 'sex' system with exclusions of non-complying groups, as the authors report that they "dropped individuals from [the] study whose biological sex and self-identified sex/gender did not match" (p. 2). However, both categorizing sexual orientation mainly on the basis of practice rather than attraction or desire, and building it on normative assumptions about sexuality, i.e. gender binary and heteronormativity, are problematic, as sexual behaviour is diverse and does not necessarily correspond with such assumptions.

The variations found in the sexual orientation study, as is true for other genome-wide association studies, are relevant mainly for the populations studied, and in this case those populations largely belong to certain age groups and to European ancestry. While the study deflects critique by stating that it concerns the genetics of same-sex sexual behaviour rather than of sexual orientation, whether a genomic study of the latter would even be possible is questionable. This example demonstrates that, despite the increasing influence of big data, a fundamental problem with the datafication of many social phenomena is whether or not they are amenable to measurement. In the case of sexual orientation, whether the answer to the sexual orientation questions corresponds to "homosexuality" or to "willingness to reveal homosexuality"/"stated sexual orientation" is debatable, considering the social pressure and stigma that may be an element in certain social contexts (Conley 2009, p. 242).

While our aim is to bring a social scientific perspective, biologists have raised at least two different critical opinions on the knowledge production practice in the case of the sexual orientation study: first, on the implications of the produced knowledge Footnote 10, and second, on the problems and flaws of the search for a genetic basis Footnote 11. In STS, by contrast, genetic differences that were hypothesized to be relevant for health, especially under the category of race in the US, have been a major point of discussion within the genomic 'inclusion' debates of the 1990s (Reardon 2017, p. 49; Bliss 2015). In other words, a point of criticism towards knowledge production was the focus on certain "racial" or racialized groups, such as Americans of European ancestry, which supposedly biased the findings and the downstream development of therapies for 'other' groups. However, measuring health and medical conditions against the background of groups that are constituted on the basis of social or cultural categories (e.g. age, gender, ethnicity) may also result in a reinscription/reconstitution of the social inequalities attached to these categories (Prainsack 2015), and at the same time result in health justice becoming a topic seen through a postgenomics lens, where postgenomics is "a frontline weapon against inequality" (Bliss 2015, p. 175). Socio-economic factors may recede into the background, while data, with their own often invisible politics, are foregrounded.

Unlike what Savage and Burrows suggested in 2007, the coming crisis can be seen not only as a crisis of sociology but of science in general. Just as the shift of focus in the social sciences towards digital data is only one part of the picture, another part could be the developments in the genomisation of the social. Considering that censuses and large-scale statistics are not new, what distinguishes the current phenomenon is possibly the opportunity to individualize the data, while the categories themselves are often unable to capture the complexity, despite producing knowledge more efficiently. In that sense, the above-mentioned survey questions do not do justice to the complexity of social behaviour. What is most important to flag within these transformations is the lack of reflexivity regarding how big data comes to represent the world and whether it adds to and/or takes away from the ways of knowing that preceded big data. These developments and directions of genetic-based research and big data go far beyond the struggle of a discipline, namely sociology, with a paradigm shift in empirical research. They could set the stage for real consequences for individuals and groups. Just as what is defined as an editable condition is determined through a social process that relies on socio-political categories, the knowledge acquired from big data relies in a similar way on the same kinds of categories.

The data choices and restrictions: ‘Free from theory’ or freedom of choice

Data, broadly understood, have become a fundamental part of our lives, from accepting and granting different kinds of consent for our data to travel on the internet, to gaining the 'right to be forgotten' in certain countries, to being able to retrieve collected information about ourselves from states, websites and even supermarket chains. While becoming part of our lives, the data collected about individuals in the form of big data are transferred between academic and non-academic research, scientific and commercial enterprises. The associated changes in knowledge production have important consequences for the ways in which we understand and live in the world (Jasanoff 2004). The co-productionist perspective in this sense does not relate to whether or how the social and the biological are co-produced; rather, it points to how knowledge produced in science both shapes and is shaped by societies. Thus, the increasing impact and authority of big data in general, and within the sexual orientation study in focus here, opens up new avenues to claim, as some suggest, that we have reached the end of theory.

The "end of theory" has actively been debated within and beyond science. Kitchin (2014a) locates the recent origin of this debate in a piece in Wired, where the author states "Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all" (Anderson 2008). Others call this a paradigm shift towards data-intensive research that leaves behind the empirical and theoretical stages (Gray 2009, p. xviii). While Google and others form the basis for this data-driven understanding in their predictive capacity, or in letting the data speak, the idea that knowledge production is 'free from theory' in this case seems to be, at best, ignorant of the data infrastructure and of how the categories are formed within it.

Taking a deeper look at the same-sex sexual behaviour study from this angle suggests that such research cannot be free from theory, as it has to make an assumption regarding the role of genetics in the context of social dynamics. In other words, it has to move sexual orientation, at least partially in the form of same-sex sexual behaviour, out of the domain of the social and towards the biological. In doing so, just as the study concludes that sexual orientation is complex, the authors note in their informative video Footnote 12 on their website that "they found that about a third of the differences between people in their sexual behaviour could be explained by inherited genetic factors. But the environment also plays a large role in shaping these differences." While the study points to a minuscule component of the biological, it also frames biology as the basis on which the social, as part of the environment, acts.

Reconsidering how the biological and the social are represented in the study, three theoretical choices are made due to the limitations of the data. First of all, the biological is taken to be "the genome-wide data" in the biobanks that the study relies on. This means sexual orientation is assumed to reside within the SNPs, points on the genome that are common variations across a population, and not in other kinds of variation that are rare or not captured by the genotyped SNPs. These differences include, but are not limited to, large- to small-scale duplications and deletions of genomic regions, rare variants, or even common variants in the population that the SNP chips do not capture. Such ignored differences are very important for a number of conditions, from cancer to neurobiology. Similarly, the genomic focus leaves aside the epigenetic factors that could theoretically be the missing link between genomes and environments. In noting this, we do not suggest that the authors of the study are unaware of or uninterested in epigenetics; however, regardless of their interest and/or knowledge, the availability of large-scale genome-wide data puts such data ahead of any other variation in the genome and epigenome. In other words, if the UK Biobank and 23andMe had similar amounts of epigenomic or whole-genome data beyond the SNPs, the study would most likely have relied on these other variations in the genome. The search for a genetic basis within SNPs is a theoretical choice, and in this case the choice is pre-determined by the limitations of the data infrastructures.

The second choice the authors make is to take three survey questions, in the case of the UK Biobank data, as encompassing enough of the complexity of sexual orientation for their research. As partly discussed earlier, these questions simply ask about sexual behaviour. Based on the UK Biobank's definition of sexual intercourse as "vaginal, oral or anal intercourse", the answers to the following questions were relevant for the research: "Have you ever had sexual intercourse with someone of the same sex?" (Data-Field 2159), "How many sexual partners of the same sex have you had in your lifetime?" (Data-Field 3669), and "About how many sexual partners have you had in your lifetime?" (Data-Field 2149). Answers to such questions do little justice to the complexity of the topic. Considering that they were not included in the biobank for the purpose of identifying a genetic basis of same-sex sexual behaviour, there is much to consider regarding the capacity in which they are useful for that. It is worth noting here that the UK Biobank is primarily focused on health-related research, and thus these three survey questions could not have been asked with a genomic exploration of 'same-sex sexual behaviour' or 'sexual orientation' in mind. The degree of success with which they can be used to identify a genetic basis for complex social behaviours is questionable.
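To make the reduction tangible, the following sketch shows how a binary 'same-sex sexual behaviour' phenotype could be derived from the three survey fields quoted above. This is our hypothetical illustration of the data choice being described, not the authors' actual pipeline; the column names simply mirror the cited Data-Field IDs and the records are invented.

import pandas as pd

# Hypothetical toy records keyed by the UK Biobank Data-Field IDs quoted above
# (this mock table is ours; it is not UK Biobank data or the authors' code).
df = pd.DataFrame({
    "eid": [1, 2, 3, 4],
    "f2159": [1, 0, 1, None],      # "Have you ever had sexual intercourse with someone of the same sex?"
    "f3669": [1, None, 20, None],  # "How many sexual partners of the same sex have you had in your lifetime?"
    "f2149": [5, 3, 25, 2],        # "About how many sexual partners have you had in your lifetime?"
})

# A lifetime of attraction, desire, identity and context is collapsed into one bit:
df["same_sex_behaviour"] = df["f2159"].map({1: True, 0: False})

# Participants without a usable answer (and, in the study, those whose reported
# sex and gender did not match) are simply dropped from the analysis.
analysed = df.dropna(subset=["same_sex_behaviour"])
print(analysed[["eid", "same_sex_behaviour"]])

Everything a GWAS of this kind subsequently 'explains' refers to such a derived column, not to sexual orientation in any richer sense.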

The authors of the study consider the UK Biobank sample to be composed of relatively old individuals and regard this as a shortcoming Footnote 13. Similarly, they claim that the 23andMe sample may be biased because "[i]ndividuals who engage in same-sex sexual behaviour may be more likely to self-select the sexual orientation survey", which then explains the high percentage of such individuals (18.9%) (Ganna et al. 2019b, p. 1). However, the authors do not problematize the fact that there is at least a three-fold difference between the youngest and oldest generations in the UK Biobank sample in their responses to the same-sex sexual behaviour question (Ganna et al. 2019b, p. 2). The study thus highlights the problematic issue of who should be regarded as a representative sample to be asked about their "same-sex sexual behaviour". Still, this is a data choice the authors make in drawing a universal explanation from a very specific and socially constrained collection of self-reported data that encompasses only part of what the researchers are interested in.

The third choice is a choice unmade. The study data mainly came from the UK Biobank, following a proposal by Brendan Zietsch with the title "Direct test whether genetic factors predisposing to homosexuality increase mating success in heterosexuals" Footnote 14. The original research plan frames "homosexuality" as a condition that heterosexuals can be "predisposed" to; since this condition has not been eliminated through evolution, the scientists hypothesize that whatever genetic variation predisposes an individual to homosexuality may also function to increase the individual's reproductive capacity. Despite using such an evolutionary explanation as the theoretical basis for obtaining the data from the UK Biobank, the authors use the terms evolution/evolutionary only three times in the article, whereas the concept of "mating success" is missing entirely. Contrary to the expectation in the research proposal, the authors observe a lower number of offspring for individuals reporting same-sex sexual behaviour, and they conclude briefly: "This reproductive deficit raises questions about the evolutionary maintenance of the trait, but we do not address these here" (Ganna et al. 2019b, p. 2). In other words, the hypothesis that allowed the scientists to acquire the UK Biobank data becomes irrelevant when they report their findings.

In this section, we have analysed how data choices are made at different steps of the research and hinted at how these choices reflect certain understandings of how society functions. These are evident in the ways sexual behaviour is represented and categorized according to quantitative data, and in the considerations of whether certain samples are contemporary enough (UK Biobank) or too self-selecting (reported same-sex sexual behaviour being too high in 23andMe). The study, however, does not problematize how the percentage of individuals reporting same-sex sexual behaviour steadily increases with year of birth, at least tripling for males and increasing more than five-fold for females between 1940 and 1970 (for the UK Biobank). Such details are among the data that the authors display as descriptive statistics in Fig. 1 (Ganna et al. 2019b, p. 2); however, they do not attract the discussion that the genomic data receive. The study itself starts from the idea that genetic markers associated with same-sex sexual behaviour could have an evolutionary advantage and ends by saying the behaviour is complex. Critics claim the "approach [of the study] implies that it is acceptable to issue claims of genetic drivers of behaviours and then lay the burden of proof on social scientists to perform post-hoc socio-cultural analysis" (Richardson et al. 2019, p. 1461).

In this paper, we have 'moved back to the future', taking stock of the present-day accelerated impact of big data and of its potential and real consequences. Using the sexual orientation GWAS as a point of reference, we have shown that claims to be working under the premise of a 'pure science' of genomics are untenable, as the social is present by default: within the methodological choices made by the researchers, and in the impact on and of the social imaginary and the epigenetic context.

By focusing on the contingency of knowledge production on social categories that are themselves reflections of the social in data practices, we have highlighted the relational processes at the root of knowledge production. We are experiencing a period in which the repertoire of what gets quantified continuously, and possibly exponentially, increases; however, this does not necessarily mean that our understanding of complexity increases at the same rate. Rather, it may lead to unintended simplification, where meaningful levels of understanding of causality are lost in the "triumph of correlations" in big data (Mayer-Schönberger and Cukier 2013; cited in Leonelli 2014). While sociology has much to offer through its qualitative roots, we think it should do more than critique, especially considering that culturally and temporally specific understandings of the social are also linked to socio-material consequences.

We want to highlight that now is the time to think about the broader developments in science and society, not merely from an external perspective, but within a new framework. Clearly, our discussion of a single case here cannot sustain suggestions for a comprehensive and applicable framework for any study; however, we can flag the urgency of its requirement. We have shown that, in the context of the rapid developments within big data-driven, and socio-genomic research, it is necessary to renew the argument for bringing the social, and its interrelatedness to the biological, clearly back into focus. We strongly believe that reemphasizing this argument is essential to underline the analytical strength of the social science perspective, and in order to avoid the possibility of losing sight of the complexity of social phenomena, which risk being oversimplified in mainly statistical data-driven science.

We can also identify three interrelated dimensions of scientific practice that the framework would valorize: (1) recognition of the contingency of choices made within the research process, and sensitivity to their consequent impact within the social context; (2) ethical responsibilities that move beyond procedural contractual requirements, to sustaining a process rooted in a clear understanding of societal environments; and (3) interdisciplinarity in analytical practice that potentiates the impact of each perspectival lens.

Such a framework would facilitate moving out of the disciplinary or institutionalized silos of ELSI, STS, sociology, genetics, or even emerging social genomics. Rather than competing for authority on ‘the social’, the aim should be to critically complement each other and refract the produced knowledge with a multiplicity of lenses. Zooming ‘back to the future’ within the field of socio-biomedical science, we would flag the necessity of re-calibrating to a multi-perspectival endeavour—one that does justice to the complex interplay of social and biological processes within which knowledge is produced.

Notes

Footnote 1. The GWAS primarily uses the term "same-sex sexual behaviour" as one of the facets of "sexual orientation", where the former becomes the component that is directly associable with the genes and the latter the broader phenomenon of interest. Thus, while the article refers to "same-sex sexual behaviour" in its title, it is editorially presented in the same Science issue under the Human Genetics heading with the subheading "The genetics of sexual orientation" (p. 880) (see Funk 2019). Furthermore, the request for data from the UK Biobank by the corresponding author Brendan P. Zietsch (see Footnote 14) refers only to sexual orientation and homosexuality and not to same-sex sexual behaviour. We therefore follow the same interchangeable use in this article.

Footnote 2. Source: https://osf.io/xwfe8 (04.03.2020).

Footnote 3. Source: https://www.wsj.com/articles/research-finds-genetic-links-to-same-sex-behavior-11567101661 (04.03.2020).

Footnote 4. Source: https://geneticsexbehavior.info (04.03.2020).

Footnote 5. In addition to Footnotes 10 and 11, for a discussion please see: https://www.nytimes.com/2019/08/29/science/gay-gene-sex.html (04.03.2020).

Footnote 6. Later "122 Shades of Grey": https://www.geneplaza.com/app-store/72/preview (04.03.2020).

Footnote 7. Source: https://www.youtube.com/watch?v=th0vnOmFltc (04.03.2020).

Footnote 8. Source: http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=2159 (04.03.2020).

Footnote 9. Source: https://geneticsexbehavior.info/ (04.03.2020).

Footnote 10. Source: https://www.broadinstitute.org/blog/opinion-big-data-scientists-must-be-ethicists-too (04.03.2020).

Footnote 11. Source: https://medium.com/@cecilejanssens/study-finds-no-gay-gene-was-there-one-to-find-ce5321c87005 (03.03.2020).

Footnote 12. Source: https://videos.files.wordpress.com/2AVNyj7B/gosb_subt-4_dvd.mp4 (04.03.2020).

Footnote 13. Source: https://geneticsexbehavior.info/what-we-found/ (04.03.2020).

Footnote 14. Source: https://www.ukbiobank.ac.uk/2017/04/direct-test-whether-genetic-factors-predisposing-to-homosexuality-increase-mating-success-in-heterosexuals/ (04.03.2020).

Abdellaoui A, Hugh-Jones D, Yengo L, Kemper KE, Nivard MG, Veul L, Holtz Y, Zietsch BP, Frayling TM, Wray NR (2019) Genetic correlates of social stratification in Great Britain. Nat Hum Behav 1–21. https://doi.org/10.1038/s41562-019-0757-5

Anderson C (2008) The end of theory: the data deluge makes the scientific method obsolete, Wired https://www.wired.com/2008/06/pb-theory/ . Accessed 31 Mar 2020

Bliss C (2015) Defining health justice in the postgenomic era. In: Richardson SS, Stevens H (eds) Postgenomics: perspectives on biology after the genome. Duke University Press, Durham, Durham/London, pp. 174–191

Bourdieu P (2001) Masculine domination. Stanford University Press, Stanford

Bourdieu P (2010) Distinction: a social critique of the judgement of taste. Routledge, London/New York

Bowker GC, Star SL (2000) Sorting things out: classification and its consequences. MIT Press, Cambridge/London

Boysen GA, Vogel DL (2007) Biased assimilation and attitude polarization in response to learning about biological explanations of homosexuality. Sex Roles 57(9–10):755–762. https://doi.org/10.1007/s11199-007-9256-7

Butler J (1990) Gender trouble. Feminism and the subversion of identity. Routledge, New York

Clarke AE, Shim JK, Shostak S, Nelson A (2013) Biomedicalising genetic health, diseases and identities. In: Atkinson P, Glasner P, Lock M (eds) Handbook of genetics and society: mapping the new genomc era. Routledge, Oxon, pp. 21–40

Conley D (2009) The promise and challenges of incorporating genetic data into longitudinal social science surveys and research. Biodemogr Soc Biol 55(2):238–251. https://doi.org/10.1080/19485560903415807

Conley D, Fletcher J (2018) The genome factor: what the social genomics revolution reveals about ourselves, our history, and the future. Princeton University Press, Princeton/Oxford

Connell RW (2005) Masculinities. Polity, Cambridge

Conrad P (1999) A mirage of genes. Sociol Health Illn 21(2):228–241. https://doi.org/10.1111/1467-9566.00151

Conrad P, Markens S (2001) Constructing the ‘gay gene’ in the news: optimism and skepticism in the US and British press. Health 5(3):373–400. https://doi.org/10.1177/136345930100500306

Crenshaw K (1989) Demarginalizing the intersection of race and sex: a black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics, vol 1989(8). University of Chicago Legal Forum. http://chicagounbound.uchicago.edu/uclf/vol1989/iss1/8 . Accessed 1 Apr 2020

Cyranoski D (2019) Russian ‘CRISPR-baby’ scientist has started editing genes in human eggs with goal of altering deaf gene. Nature 574(7779):465–466. https://doi.org/10.1038/d41586-019-03018-0

Dar-Nimrod I, Heine SJ (2011) Genetic essentialism: on the deceptive determinism of DNA. Psychol Bull 137(5):800–818. https://doi.org/10.1037/a0021860

Evangelou E, Warren HR, Mosen-Ansorena D, Mifsud B, Pazoki R, Gao H, Ntritsos G, Dimou N, Cabrera CP, Karaman I (2018) Genetic analysis of over 1 million people identifies 535 new loci associated with blood pressure traits. Nat Genet 50(10):1412–1425. https://doi.org/10.1038/s41588-018-0205-x

Fausto-Sterling A (2007) Frameworks of desire. Daedalus 136(2):47–57. https://doi.org/10.1162/daed.2007.136.2.47

Fletcher JM (2012) Why have tobacco control policies stalled? Using genetic moderation to examine policy impacts. PLoS ONE 7(12):e50576. https://doi.org/10.1371/journal.pone.0050576

Foucault M (1998) The history of sexuality 1: the will to knowledge. Penguin Books, London

Foucault M (2003) The birth of the clinic. Routledge, London/New York

Foucault M (2005) The order of things. Routledge, London/New York

Fox Keller E (2014) From gene action to reactive genomes. J Physiol 592(11):2423–2429. https://doi.org/10.1113/jphysiol.2014.270991

Fox Keller E (2015) The postgenomic genome. In: Richardson SS, Stevens H (eds) Postgenomics: perspectives on biology after the genome. Duke University Press, Durham/London, pp. 9–31

Funk M (2019) The genetics of sexual orientation. Science 365(6456):878–880. https://doi.org/10.1126/science.365.6456.878-k

Ganna A, Verweij KJ, Nivard MG, Maier R, Wedow R, Busch AS, Abdellaoui A, Guo S, Sathirapongsasuti JF, Lichtenstein P (2019a) Genome studies must account for history—response. Science 366(6472):1461–1462. https://doi.org/10.1126/science.aaz8941

Ganna A, Verweij KJ, Nivard MG, Maier R, Wedow R, Busch AS, Abdellaoui A, Guo S, Sathirapongsasuti JF, Lichtenstein P (2019b) Large-scale GWAS reveals insights into the genetic architecture of same-sex sexual behavior. Science 365(6456):eaat7693. https://doi.org/10.1126/science.aat7693

Garfinkel H (1967) Studies in ethnomethodology. Polity Press, Cambridge

Gray J (2009) Jim Gray on eScience: a transformed scientific method. In: Hey T, Tansley S, Tolle KM (eds) The fourth paradigm: data-intensive scientific discovery. Microsoft Research, Redmond, pp. xvii–xxxi

Griffiths DA (2016) Queer genes: realism, sexuality and science. J Crit Realism 15(5):511–529. https://doi.org/10.1080/14767430.2016.1210872

Hamer DH, Hu S, Magnuson VL, Hu N, Pattatucci AM (1993) A linkage between DNA markers on the X chromosome and male sexual orientation. Science 261(5119):321–327. https://doi.org/10.1126/science.8332896

Haraway D (1988) Situated knowledges: the science question in feminism and the privilege of partial perspective. Fem Stud 14(3):575–599

Holm S, Ploug T (2019) Genome studies reveal flaws in broad consent. Science 366(6472):1460–1461. https://doi.org/10.1126/science.aaz3797

Howard HC, van El CG, Forzano F, Radojkovic D, Rial-Sebbag E, de Wert G, Borry P, Cornel MC (2018) One small edit for humans, one giant edit for humankind? Points and questions to consider for a responsible way forward for gene editing in humans. Eur J Hum Genet 26(1):1. https://doi.org/10.1038/s41431-017-0024-z

Jansen PR, Watanabe K, Stringer S, Skene N, Bryois J, Hammerschlag AR, de Leeuw CA, Benjamins JS, Muñoz-Manchado AB, Nagel M, Savage JE, Tiemeier H, White T, Agee M, Alipanahi B, Auton A, Bell RK, Bryc K, Elson SL, Fontanillas P, Furlotte NA, Hinds DA, Huber KE, Kleinman A, Litterman NK, McCreight JC, McIntyre MH, Mountain JL, Noblin ES, Northover CAM, Pitts SJ, Sathirapongsasuti JF, Sazonova OV, Shelton JF, Shringarpure S, Tian C, Wilson CH, Tung JY, Hinds DA, Vacic V, Wang X, Sullivan PF, van der Sluis S, Polderman TJC, Smit AB, Hjerling-Leffler J, Van Someren EJW, Posthuma D, The 23andMe Research, T. (2019) Genome-wide analysis of insomnia in 1,331,010 individuals identifies new risk loci and functional pathways. Nat Genet 51(3):394–403. https://doi.org/10.1038/s41588-018-0333-3

Jasanoff S (2004) The idiom of co-production. In: Jasanoff S (ed.) States of knowledge: the co-production of science and social order. Routledge, London, pp. 1–12

Jasanoff S, Hurlbut JB (2018) A global observatory for gene editing. Nature 555:435–437. https://doi.org/10.1038/d41586-018-03270-w

Kessler SJ, McKenna W (1978) Gender: an ethnomethodological approach. John Wiley & Sons, New York

Kitchin R (2014a) Big Data, new epistemologies and paradigm shifts. Big Data Soc. https://doi.org/10.1177/2053951714528481

Kitchin R (2014b) The data revolution. Big data, open data, data infrastructures and their consequences. Sage, London

Landecker H (2016) The social as signal in the body of chromatin. Sociol Rev 64(1_suppl):79–99. https://doi.org/10.1111/2059-7932.12014

Landecker H, Panofsky A (2013) From social structure to gene regulation, and back: a critical introduction to environmental epigenetics for sociology. Annu Rev Sociol 39:333–357. https://doi.org/10.1146/annurev-soc-071312-145707

Lee JJ, Wedow R, Okbay A, Kong E, Maghzian O, Zacher M, Nguyen-Viet TA, Bowers P, Sidorenko J, Linnér RK (2018) Gene discovery and polygenic prediction from a 1.1-million-person GWAS of educational attainment. Nat Genet 50(8):1112. https://doi.org/10.1038/s41588-018-0147-3

Lehrer SF, Ding W (2019) Can social scientists use molecular genetic data to explain individual differences and inform public policy? In: Foster G (ed.) Biophysical measurement in experimental social science research. Academic Press, London/San Diego/Cambridge/Oxford, pp. 225–265

Leonelli S (2014) What difference does quantity make? On the epistemology of Big Data in biology. Big Data Soc. https://doi.org/10.1177/2053951714534395

Lorber J (1994) Paradoxes of gender. Yale University Press, New Haven

Martin P, Morrison M, Turkmendag I, Nerlich B, McMahon A, de Saille S, Bartlett A (2020) Genome editing: the dynamics of continuity, convergence, and change in the engineering of life. New Genet Soc 39(2):219–242. https://doi.org/10.1080/14636778.2020.1730166

Maxmen A (2019) Controversial ‘gay gene’ app provokes fears of a genetic Wild West. Nature 574(7780):609–610. https://doi.org/10.1038/d41586-019-03282-0

Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt, Boston/New York

Meloni M (2014) Biology without biologism: social theory in a postgenomic age. Sociology 48(4):731–746. https://doi.org/10.1177/0038038513501944

Meloni M (2016) Political biology: Science and social values in human heredity from eugenics to epigenetics. Palgrave Macmillan, n.p.p

Mitchell RW, Dezarn L (2014) Does knowing why someone is gay influence tolerance? Genetic, environmental, choice, and “reparative” explanations. Sex Cult 18(4):994–1009. https://doi.org/10.1007/s12119-014-9233-6

Morrison M, de Saille S (2019) CRISPR in context: towards a socially responsible debate on embryo editing. Palgrave Commun 5(1):1–9. https://doi.org/10.1057/s41599-019-0319-5

Nelkin D, Lindee MS (2004) The DNA mystique: the gene as a cultural icon. University of Michigan Press, Ann Arbor

Novas C, Rose N (2000) Genetic risk and the birth of the somatic individual. Econ Soc 29(4):485–513. https://doi.org/10.1080/03085140050174750

O’Riordan K (2012) The life of the gay gene: from hypothetical genetic marker to social reality. J Sex Res 49(4):362–368. https://doi.org/10.1080/00224499.2012.663420

Okbay A, Beauchamp JP, Fontana MA, Lee JJ, Pers TH, Rietveld CA, Turley P, Chen G-B, Emilsson V, Meddens SFW (2016) Genome-wide association study identifies 74 loci associated with educational attainment. Nature 533(7604):539–542. https://doi.org/10.1038/nature17671

Parry B, Greenhough B (2017) Bioinformation. Polity Press, Cambridge

Parsons T (1951) The social system. Free Press, New York

Prainsack B (2015) Is personalized medicine different? (Reinscription: the sequel) A response to Troy Duster. Br J Sociol 66(1):28–35. https://doi.org/10.1111/1468-4446.12117

Reardon J (2017) The postgenomic condition: ethics, justice, and knowledge after the genome. University of Chicago Press, Chicago/London

Rehmann-Sutter C, Mahr D (2016) The lived genome. In: Whitehead A, Woods A (eds) Edinburgh companion to the critical medical humanities. Edinburgh University Press, Edinburgh, pp. 87–103

Richardson SS, Borsa A, Boulicault M, Galka J, Ghosh N, Gompers A, Noll NE, Perret M, Reiches MW, Sandoval JCB (2019) Genome studies must account for history. Science 366(6472):1461. https://doi.org/10.1126/science.aaz6594

Ruckenstein M, Schüll ND (2017) The datafication of health. Annu Rev Anthropol 46:261–278. https://doi.org/10.1146/annurev-anthro-102116-041244

Saukko P (2017) Shifting metaphors in direct-to-consumer genetic testing: from genes as information to genes as big data. New Genet Soc 36(3):296–313. https://doi.org/10.1080/14636778.2017.1354691

Savage M, Burrows R (2007) The coming crisis of empirical sociology. Sociology 41(5):885–899. https://doi.org/10.1177/0038038507080443

Shostak S, Conrad P, Horwitz AV (2008) Sequencing and its consequences: path dependence and the relationships between genetics and medicalization. Am J Sociol 114(S1):S287–S316. https://doi.org/10.1086/595570

Thomas WJ, Thomas DS (1929) The child in America. Behavior problems and programs. Knopf, New York

West C, Zimmerman DH (1991) Doing gender. In: Lorber J, Farrell SA (eds) The social construction of gender. Sage, Newbury Park/London, pp. 13–37

Acknowledgements

Open access funding provided by University of Vienna. The authors thank Brígida Riso for contributing to a previous version of this article.

Author information

These authors contributed equally: Melanie Goisauf, Kaya Akyüz, Gillian M. Martin.

Authors and Affiliations

Department of Science and Technology Studies, University of Vienna, Vienna, Austria

Melanie Goisauf & Kaya Akyüz

BBMRI-ERIC, Graz, Austria

Department of Sociology, University of Malta, Msida, Malta

Gillian M. Martin

Corresponding authors

Correspondence to Melanie Goisauf or Kaya Akyüz .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Goisauf, M., Akyüz, K. & Martin, G.M. Moving back to the future of big data-driven research: reflecting on the social in genomics. Humanit Soc Sci Commun 7 , 55 (2020). https://doi.org/10.1057/s41599-020-00544-5

Download citation

Received : 15 November 2019

Accepted : 09 July 2020

Published : 04 August 2020

DOI : https://doi.org/10.1057/s41599-020-00544-5

Big Data and Social Science

TABLE OF CONTENTS

Chapter 1 (22 pages): Introduction

Part I (95 pages): Capture and Curation
  • Chapter 2 (18 pages): Working with Web Data and APIs
  • Chapter 3 (23 pages): Record Linkage
  • Chapter 4 (33 pages): Databases
  • Chapter 5 (17 pages): Scaling up through Parallel and Distributed Computing

Part II (129 pages): Modeling and Analysis
  • Chapter 6 (22 pages): Information Visualization
  • Chapter 7 (50 pages): Machine Learning
  • Chapter 8 (27 pages): Text Analysis
  • Chapter 9 (27 pages): Networks: The Basics

Part III (91 pages): Inference and Ethics
  • Chapter 10 (30 pages): Data Quality and Inference Errors
  • Chapter 11 (32 pages): Bias and Fairness
  • Chapter 12 (19 pages): Privacy and Confidentiality
  • Chapter 13 (7 pages): Workbooks

2nd Edition

Big Data and Social Science Data Science Methods and Tools for Research and Practice

Description

Big Data and Social Science: Data Science Methods and Tools for Research and Practice, Second Edition shows how to apply data science to real-world problems, covering all stages of a data-intensive social science or policy project. Prominent leaders in the social sciences, statistics, and computer science as well as the field of data science provide a unique perspective on how to apply modern social science research principles and current analytical and computational tools. The text teaches you how to identify and collect appropriate data, apply data science methods and tools to the data, and recognize and respond to data errors, biases, and limitations.

  • Takes an accessible, hands-on approach to handling new types of data in the social sciences
  • Presents the key data science tools in a non-intimidating way to both social and data scientists while keeping the focus on research questions and purposes
  • Illustrates social science and data science principles through real-world problems
  • Links computer science concepts to practical social science research
  • Promotes good scientific practice
  • Provides freely available workbooks with data, code, and practical programming exercises, through Binder and GitHub

New to the Second Edition:

  • Increased use of examples from different areas of social sciences
  • New chapter on dealing with Bias and Fairness in Machine Learning models
  • Expanded chapters focusing on Machine Learning and Text Analysis
  • Revamped hands-on Jupyter notebooks to reinforce concepts covered in each chapter

This classroom-tested book fills a major gap in graduate- and professional-level data science and social science education. It can be used to train a new generation of social data scientists to tackle real-world problems and improve the skills and competencies of applied social scientists and public policy practitioners. It empowers you to use the massive and rapidly growing amounts of available data to interpret economic and social activities in a scientific and rigorous manner.

Table of Contents

1. Introduction
2. Working with Web Data and APIs - Cameron Neylon
3. Record Linkage - Joshua Tokle and Stefan Bender
4. Databases - Ian Foster and Pascal Heus
5. Scaling up through Parallel and Distributed Computing - Huy Vo and Claudio Silva
6. Information Visualization - M. Adil Yalcin and Catherine Plaisant
7. Machine Learning - Rayid Ghani and Malte Schierholz
8. Text Analysis - Evgeny Klochikhin and Jordan Boyd-Graber
9. Networks: The Basics - Jason Owen-Smith
10. Data Quality and Inference Errors - Paul P. Biemer
11. Bias and Fairness - Kit T. Rodolfa, Pedro Saleiro, and Rayid Ghani
12. Privacy and Confidentiality - Stefan Bender, Ron Jarmin, Frauke Kreuter, and Julia Lane
13. Workbooks - Brian Kim, Christoph Kern, Jonathan Scott Morgan, Clayton Hunter, and Avishek Kumar

Ian Foster, PhD, is a professor of computer science at the University of Chicago as well as a senior scientist and distinguished fellow at Argonne National Laboratory. His research addresses innovative applications of distributed, parallel, and data-intensive computing technologies to scientific problems in such domains as climate change and biomedicine. Methods and software developed under his leadership underpin many large national and international cyberinfrastructures. He is a fellow of the American Association for the Advancement of Science, the Association for Computing Machinery, and the British Computer Society. He earned a PhD in computer science from Imperial College London.

Rayid Ghani is a professor in the Machine Learning Department (in the School of Computer Science) and the Heinz College of Information Systems and Public Policy at Carnegie Mellon University. His research focuses on developing and using Machine Learning, AI, and Data Science methods for solving high impact social good and public policy problems in a fair and equitable way across criminal justice, education, healthcare, energy, transportation, economic development, workforce development and public safety. He is also the founder and director of the “Data Science for Social Good” summer program for aspiring data scientists to work on data mining, machine learning, big data, and data science projects with social impact. Previously Rayid Ghani was a faculty member at University of Chicago, and prior to that, served as the Chief Scientist for Obama for America (Obama 2012 Campaign).

Ron Jarmin, PhD, is the Deputy Director at the U.S. Census Bureau. He earned a PhD in economics from the University of Oregon and has published in the areas of industrial organization, business dynamics, entrepreneurship, technology and firm performance, urban economics, Big Data, data access and statistical disclosure avoidance. He oversees the Census Bureau’s large portfolio of data collection, research and dissemination activities for critical economic and social statistics including the 2020 Decennial Census of Population and Housing.

Frauke Kreuter, PhD, is Professor at the University of Maryland in the Joint Program in Survey Methodology, Professor of Statistics and Methodology at the University of Mannheim and head of the Statistical Methods group at the Institute for Employment Research in Nuremberg, Germany. She is founder of the International Program in Survey and Data Science, co-founder of the Coleridge Initiative, fellow of the American Statistical Association (ASA), and recipient of the WSS Cox and the ASA Links Lecture Awards. Her research focuses on data quality, privacy, and the effects of bias in data collection on statistical estimates and algorithmic fairness.

Julia Lane, PhD, is a professor at the NYU Wagner Graduate School of Public Service. She is also an NYU Provostial Fellow for Innovation Analytics. She co-founded the Coleridge Initiative as well as UMETRICS and STAR METRICS programs at the National Science Foundation, established a data enclave at NORC/University of Chicago, and co-founded the Longitudinal Employer-Household Dynamics Program at the U.S. Census Bureau and the Linked Employer Employee Database at Statistics New Zealand. She is the author/editor of 10 books and the author of more than 70 articles in leading journals, including Nature and Science. She is an elected fellow of the American Association for the Advancement of Science and a fellow of the American Statistical Association.

Critics' Reviews

"Like the first edition, the new edition will continue to play an important role for the intended audience and a wider professional community. The much-needed second edition is timely and showcases a wide range of examples and application examples from different areas of the social sciences to demonstrate how the methods are implemented using several real datasets. As expected with this kind of book, the topics of this text are diverse in nature, but interesting none the less. As it is well known, machine learning techniques are subject to inherited bias in model selection and consequently negatively impacts post estimation and prediction. This new edition includes a new chapter on dealing with bias and fairness in machine learning models, a much-needed fair and welcome edition! Further, the authors have done an excellent job in expanding the material on machine learning and text analysis. Like the first edition, the main strength of the book is that it offers a wide variety of applications that are based on real datasets emerging from social science perspectives and useful for both academic and professional communes. As Jupyter has become more popular as the data scientists’ computational notebook of choice, the book has new and improved hands-on Jupyter notebooks to complement each chapter’s material. In conclusion, this new edition has an impressive collection of material on useful and interesting topics on big data. The book will be equally useful to graduate students and researchers interested in gaining perspectives and knowledge on this important topic. The new volume comprises of a wealth of information, a kind of one-stop shop, and can be served as a textbook and research reference book." - S. Ejaz Ahmed , Brock University, Canada Praise For First Edition "This book builds a nice bridge connecting social science and big data methodology. Big data such as social media and electronic health records, empowered by the advances in information technology, are an emerging phenomenon in recent years and present unprecedented opportunities for social science research. This book was written by pioneering scientists in applying big data methods to address social science problems. As shown by numerous examples in the book, social science could benefit significantly by embracing the new mode of big data and taking advantage of the technical progress in analysing such data. If you work in social science and would like to explore the power of big data, this book is clearly for you. Indeed, if you do not have previous experience in dealing with big data, you should read this book first, before implementing a big-data project. As indicated by the title, this book acts as a practical guide and targets readers with minimum big data experience, hence it is very hands-on. … It covers all necessary steps to finish a big data project: collecting raw data, cleaning and preprocessing data, applying various modelling tools to analyze the data, evaluating results, protecting privacy, and addressing ethical problems. … All the important topics concerning big data are covered, making this book a good reference that you should always keep on your desk." — Guoqiang Yu , Virginia Tech, in Journal of the American Statistical Association, July 2017 "…In summary, although there is a growing number of books related to social science and big data, this volume contains several non-trivial aspects which make it worth to have in the library, possibly along with other similar textbooks as a good complement to them." 
— Stefano M. Iacus , University of Milan, in Journal of Statistical Software, June 2017 "This is a well-written book and showcases a good number of examples and applications to demonstrate how the methods are actually used in real life situation using real datasets. Further, topics at hand are motivated by social science data. … The chapters are nicely structured, well presented and motivated by data examples. The main strength of the book is that it still offers a good number of applications that are based on real datasets emerging from social science perspectives. The book will be useful to students, practitioners, and data analyst in the respective fields. The editors did a very good job introducing the book, it aims and goals, intendent audience, clarifying underneath concepts and phrases, a must read before moving to other chapters." — S. Ejaz Ahmed , in Technometrics, April 2017 "Economists and Social Scientist have a lot to learn from Machine Learning, and Engineers have a lot to learn from Econometricians and Statisticians. This two way sharing is long overdue and it is time to start the conversation. This book is a tour-de-force for anyone interested in participating in such a discussion." — Roberto Rigobon , Society of Sloan Fellows Professor of Applied Economics, MIT "This ambitious sweep through data science techniques provides an invaluable introduction to the toolbox of big data methodologies, as applied to social science data. It provides tremendous value not only to beginners in the field, but also to experienced data scientists wishing round out their knowledge of this broad and dynamic field." — Kenneth Benoit , Department of Methodology, London School of Economics and Political Science "Most social scientists would agree that ‘big data’ – the term we use to encapsulate the huge amount of electronic information we generate in our everyday lives – provide the potential for path-breaking research not just into our economic, social, and political lives but also the physical environment we create and inhabit. However, few have the knowledge, or critically, the tools that equip them to realize this potential. This book provides a bridge between computer science, statistics, and the social sciences, demonstrating this new field of ‘data science’ via practical applications. The book is remarkable in many ways. It originates from classes taught by leading practitioners in this area to federal agency research staff, drawing in particular upon the example of a hugely successful project that linked federal research spending to outcomes in terms of patents, job creation, and the subsequent career development of researchers. By making these workbooks accessible, the book takes the novice on a step-by-step journey through complex areas such as database dynamics, data linkage, text analysis, networks and data visualization. The book is a treasure trove of information. It leads the field in the important task of bringing together computer science, statistics, and social science. I strongly recommend that all social scientists with an interest in ‘big data’ immerse themselves in this book." — Professor Peter Elias CBE, University of Warwick "The explosive growth in big data and in new technologies to analyze these data is transforming the practice of research in a variety of fields. Foster, et al. provides a well-timed, valuable guide to the new methods and tools associated with big data that can be used to address critical research questions in the social science field. 
The breadth of the material is impressive, providing a comprehensive summary of the methods and tools as well as practical guidance for their use. A key feature of the guide is the use of a case study to illustrate how big data techniques can be used to address a research question from beginning to end of the project, including providing examples of computer code targeted to specific steps in the project. Any researcher will find this unique guide to be useful, and it is essential reading for any social science practitioner that wants to use the best available data to conduct influential research in the near future." — Paul Decker , President and CEO, Mathematica Policy Research "The typical statistics pedagogy has changed. In the past, textbooks assumed that data was hard to obtain, but neatly organized in a single file. Today, data is very easy to obtain from a number of data sources, often very messy, and analysts are now responsible for organizing it in addition to deriving useful insights. Foster, Ghani, Jarmin, Kreuter, and Lane have assembled a book that gives a pointed overview of tools to facilitate the entire digital lifespan of data in this era of analytics. Big Data and Social Science gives an evenhanded look at the myriad of ways to obtain data--whether scraping the web, web APIs, or databases--to conducting statistical analysis to doing analysis when your data cannot fit on a single computer. Meanwhile, they provide sound, diligent advice on pitfalls that still, and will always, exist. A book like this is useful for social scientists, experienced statisticians, econometricians, and computer programmers who want to see the tools available to them. It will also be a helpful text for a budding data scientist who wants a fairly technical preview of the landscape." — Tom Schenk Jr. , Chief Data Officer, City of Chicago "In Big Data and Social Science, the authors have deftly crafted one of the very best "how-to" books on big data that researchers, enterprise analysts, and government practitioners will find equally valuable. From Nodes, to Edges, to Arcs, the book takes the reader along a near-perfect path to understanding the fundamental elements of constructing a practical and realistic model for Big Data Analysis that any organization can execute by simply following the path outlined in this book. Elegant in its simplicity, Big Data and Social Science is one of those books that every research group and data-analysis team will want to have on their reference shelf." — Tom Herzog , Former Deputy Commissioner, NY State Department of Corrections and Community Supervision


Scholarly Borderlands Big Data and Historical Social Science

While “big data” often connotes new opportunities for understanding the present, largely through the analysis of social media and search engine data, other newly available kinds of rich data sources create huge possibilities for reimagining the past. In recent years, millions of previously difficult-to-access documents and massive archival data structures have become widely available to scholars of human history and the general public.  The project on Big Data and Historical Social Science brings together researchers across a range of disciplines, methods, and research strategies to explore the intersection of classical historical and social science problems with big data. How can access to new kinds of historical data, and new capacities to manipulate and analyze them, allow scholars to address historical questions in new ways?  The first demonstration project of this group is entitled  “Reclaiming Lost Data on American Racial Inequality: 1865-1940.”  Participants in this project include economists, historians, political scientists, and sociologists from across the country. After several planning meetings, the scholars working on this project were awarded a grant through the Russell Sage Foundation initiative on Computational Social Science to create accessible, linked datasets that will help social scientists of all disciplines gain access to more accurate information about African American populations in United States history. 

Another outcome of this working group is the SSRC’s project on “ American Slavery’s Legacy across Space and Time .” This research initiative combines qualitative, community-grounded social science with innovative big-data methodologies to bring to light how entire communities were transformed by the United States’ legacy of slavery, while also establishing ethical norms for this type of emerging research.

Current participants  

Marcella Alsan (Stanford University), Audrey Augenbraum (Columbia University), Peter Bearman (Columbia University), Leah Boustan (Princeton University), Karida Brown (University of North Carolina), James Feigenbaum (Boston University), Megan Ming Francis (University of Washington), Trevon Logan (The Ohio State University), Mara Loveman (University of California, Berkeley), Christopher Muller (University of California, Berkeley), Suresh Naidu (Columbia University), Evan Roberts (University of Minnesota), Eric Schickler (University of California, Berkeley), Benjamin Schmidt (Northeastern University), and Vesla Weaver (Johns Hopkins University).


Big Data Social Science

Welcome to Big Data Social Science

Featured Projects

Working Groups

  • Big Data and Political Science Working Group
  • Big Data and Society Working Group
  • Data Science and Public Policy Working Group

Building a Research Infrastructure for Harnessing the Data Revolution and Its Social Implications

Big Data Social Science has three desired goals to better support big data and related research:

(1) Expand research support

(2) Help build an intellectual community around this work

(3) Help expand data science teaching

Research Support

Intellectual Community

Data Science Teaching

Statistical Support

SSCERT will be providing Statistical and Research Design Support for divisional faculty and students. Contact information regarding this new service will be available soon at sscert.ucla.edu

Innovation Technology Studio

A new facility is currently under construction to demonstrate and assist in the use of interesting and new technologies relevant to research and teaching.  Contact Tom Phelan [email protected] for additional information.

California Census Research Data Center

The California Census Research Data Center (CCRDC) will soon be moving into its new home at SSCERT. The Data Center provides researchers access to micro level census data in a secured environment. More information is available at http://ccrdc.ucla.edu/

Data Visualization

SSCERT can provide assistance with certain data management and visualization tasks.  Contact Joy Guey [email protected]  for further information.


Ethical Issues in Social Science Research Employing Big Data

  • Original Research/Scholarship
  • Open access
  • Published: 15 June 2022
  • Volume 28, article number 29 (2022)

Cite this article

  • Mohammad Hosseini   ORCID: orcid.org/0000-0002-2385-985X 1 ,
  • Michał Wieczorek   ORCID: orcid.org/0000-0003-3688-9684 2 &
  • Bert Gordijn   ORCID: orcid.org/0000-0002-3686-8659 2  

14k Accesses

14 Citations

257 Altmetric

30 Mentions

Explore all metrics

This paper analyzes the ethics of social science research (SSR) employing big data. We begin by highlighting the research gap found on the intersection between big data ethics, SSR and research ethics. We then discuss three aspects of big data SSR which make it warrant special attention from a research ethics angle: (1) the interpretative character of both SSR and big data, (2) complexities of anticipating and managing risks in publication and reuse of big data SSR, and (3) the paucity of regulatory oversight and ethical recommendations on protecting individual subjects as well as societies when conducting big data SSR. Against this backdrop, we propose using David Resnik’s research ethics framework to analyze some of the most pressing ethical issues of big data SSR. Focusing on the principles of honesty, carefulness, openness, efficiency, respect for subjects, and social responsibility, we discuss three clusters of ethical issues: those related to methodological biases and personal prejudices, those connected to risks arising from data availability and reuse, and those leading to individual and social harms. Finally, we advance considerations to observe in developing future ethical guidelines about big data SSR.


Introduction

This paper explores ethical issues of employing big data Footnote 1 in social science research (SSR) with a specific focus on how these practices challenge the integrity and ethics of research. In recent years, the research community has witnessed the introduction of new technologies that collect and process big data. Social scientists have particularly benefited from these developments as their research increasingly generates big data sets or reuses existing ones such as those collected by public institutions and federal agencies (Foster et al., 2016 , pp. 1–9), those generated and collected by social media platforms (Townsend & Wallace, 2016 ), e.g., Facebook analytics, and those generated by developers of digital devices and services (Lazer et al., 2009 ), e.g., Google Trends.

With the increasing use and reuse of big data sets in SSR, new ethical concerns emerge that need to be recognized, communicated to the research community, and mentioned in research ethics guidelines and protocols. Exploring these issues becomes more relevant when we consider the surge of studies that source their data from countries with dissimilar standards or employ publicly available data (e.g., harvested from social media platforms) without addressing ethical issues (OECD, 2016 ). As shown in a recent paper, 64% of studies (n = 132) that used big data “did not discuss ethical issues, mostly claiming the data were publicly available” (Stommel & de Rijke, 2021 , p. 1).

Despite the significance of the topic from a research ethics and integrity perspective, an exploratory scoping search conducted for this study showed that the published literature has paid little attention to the challenges posed by big data SSR for upholding the norms of research ethics and integrity (for this purpose, the Web of Science core collection was searched on 18/06/2021 with the following string “social science*” AND “big data” AND “ethics”. Using this string yielded 22 items, only one of which exclusively discussed ethics of big data SSR). In fact, a recent review of the literature (n = 892) concludes that big data ethics are mainly discussed in relation to health and technology (Kuc-Czarnecka & Olczyk, 2020 ). This could be due to the historical roots of the discipline of ethics and its closer ties with biomedical sciences (Resnik, 2015 ), or big data’s closer ties to discussions about technology as the “term refers to the vast amounts of digital data which are being produced in technologically and algorithmically mediated practices” (Richterich, 2018 , p. 4).
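
To make the mechanics of such a scoping search concrete, the short sketch below runs a comparable keyword query against the public Crossref REST API and reports only the hit count. This is an illustrative stand-in, not the authors' protocol: Crossref's coverage and query handling differ from the Web of Science core collection (it does not interpret Boolean operators or wildcards in the same way), so the resulting count will not match the 22 items reported above.

```python
import requests

# Illustrative approximation of a bibliographic scoping search.
# Crossref is used here only because it is openly accessible; the search cited
# above used the Web of Science core collection, which indexes and queries
# records differently.
QUERY = '"social science" "big data" "ethics"'

resp = requests.get(
    "https://api.crossref.org/works",
    params={"query": QUERY, "rows": 0},  # rows=0: request only the hit count
    timeout=30,
)
resp.raise_for_status()
total = resp.json()["message"]["total-results"]
print(f"Crossref records matching the query: {total}")
```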

In contexts where big data SSR is discussed, authors have raised concerns about consent, privacy, potential harm to research subjects and data ownership (Lipworth et al., 2017; Lunshof et al., 2008; Mittelstadt & Floridi, 2016; Metcalf & Crawford, 2016; Rothstein, 2015; Starkbaum & Felt, 2019; Zimmer, 2018). Sometimes the methodological problems associated with the move to a data-driven/computational SSR paradigm have received more attention than ethical aspects (with some notable exceptions such as Weinhardt’s study (2020) and Salganik’s book (2017), but even within these contributions, the ethical issues are either not analyzed systematically or the impact of the interpretative nature of SSR on ethical issues is neglected Footnote 2 ). Some existing studies develop tools for analyzing big data in SSR or note difficulties that arise when big data analysis methods developed for biomedical/engineering purposes are employed in SSR. Authors of these studies mostly mention, but do not elaborate on, challenges related to privacy and consent (Chang et al., 2014; Connelly et al., 2016; González-Bailón, 2013; Liu, 2016) or legal and liability issues (Bender et al., 2016).

Furthermore, although papers in two special issues of the American Behavioral Scientist (Volume 63, Issue 5 and 6, 2019) and a special issue of Social Science Computer Review (Volume 38, Issue 1, 2020) provide useful perspectives on the ethical issues of SSR Footnote 3 , only one of these contributions uses a normative framework to provide a systematic analysis of ethical issues. These papers discuss big data’s impact on social interpretations and context (Camfield, 2019 ; Feldman & Shaw, 2019 ; Frey et al., 2020 ; Hesse et al., 2019 ), data representativeness (Hargittai, 2020 ), data accuracy and inclusiveness (Popham et al., 2020 ), data sharing and replicability (Mannheimer et al., 2019 ; Sterett, 2019 ), press and personal freedom (Shahin & Zheng, 2020 ) as well as issues related to the prioritization of big data as a source and the impact of big data tools on research questions and results (Hesse et al., 2019 ; Mauthner 2019 ). Hossain & Scott-Villiers ( 2019 ) explicitly base their analysis on an ethical framework, but since their adopted approach only captures qualitative SSR (similar to other papers in the American Behavioral Scientist special issues), they problematize relationships between researchers and subjects based on the quality of relationships without discussing biases/prejudices. Thus, we believe that applying a research ethics framework and paying specific attention to the interpretive nature of SSR in this paper, expands the scope of the current debate about big data SSR.

In what follows we first distinguish three reasons why ethics of big data SSR matters. Then we employ David Resnik’s research ethics framework to systematically analyze the ethics of big data SSR. Consequently, we advance suggestions for researchers, data repositories and research institutions to minimize the likelihood of ethical issues in big data SSR.

Three Reasons for Ethical Concerns About Big Data SSR

Without claiming to be exhaustive, we highlight three factors that motivated our concerns about the ethics of big data SSR: (1) the interpretative aspects of SSR provide fertile grounds for different forms of bias, (2) anticipating and managing risks in the publication and reuse of big data SSR is complicated, and (3) regulatory oversight and ethical recommendations on protecting subjects and societies when conducting big data SSR remain scarce.

1) While some approaches to social science define it as a discipline concerned with studying facts about society to formulate theories and predictions about it (Popper, 1961 ), we endorse the view that social sciences interpret societies’ norms and practices through the lens of values and beliefs held by researchers (Richardson & Fowers, 1998 ; Taylor, 1971 ). Especially in cases where SSR focuses on subjective concepts and phenomena such as culture, behavior, social relations, shared imagination and beliefs, results are markedly interpretative and reflect the cultural context, the historical circumstances in which they are produced, as well as the worldviews of involved researchers (Feldman & Shaw, 2019 ; Taylor, 1971 ). Although interpretative practices allow us to make sense of the social world, they can expose research and its outcomes to external factors such as researchers’ moral beliefs, prejudices, stereotypes, values or even the used language. Using big data in SSR further complicates this problem because big data technologies can potentially affix problematic interpretations into research when third-party technology and services are employed in data collection or analysis (Barocas & Selbst, 2016 ). Of course, sometimes this problem is exacerbated by using big data processing techniques designed for STEM disciplines (arguably a misfit for studying people, beliefs and behavior). Footnote 4 Moreover, a positivist view of data (i.e., data as an objective entity), can be in conflict with the interpretative aspects of SSR (Hesse et al., 2019 ).
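
As a concrete illustration of how a third-party analysis tool can affix its own interpretations onto social data, the minimal sketch below scores a few invented posts with an off-the-shelf sentiment analyzer; it assumes the open-source vaderSentiment package (not discussed in the studies cited above) is installed, and the posts are hypothetical. Whatever numbers it returns, the point is that the tool's pre-built, English social-media lexicon decides how slang, irony and negation are read, so those interpretive choices enter the analysis independently of the researcher's own framework.

```python
# pip install vaderSentiment  (a third-party, lexicon- and rule-based sentiment tool)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Invented example posts, chosen only because their "correct" reading depends
# on interpretation (slang, irony, negation).
posts = [
    "This policy is sick, honestly.",         # slang: praise or criticism?
    "Great, another tuition hike. Love it.",  # irony
    "The service wasn't bad at all.",         # negation
]

analyzer = SentimentIntensityAnalyzer()
for post in posts:
    scores = analyzer.polarity_scores(post)  # dict with neg/neu/pos/compound scores
    print(f"{post!r} -> compound sentiment {scores['compound']:+.3f}")
```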

2) There is no such thing as raw data or big data sets that simply represent facts (Gitelman, 2013 ; Barrowman, 2018 ). Arguably, big data is always already interpreted by those who generated data sets or, in the case of automatically created data sets, by employed algorithms and their designers. Researchers engaged with pre-processed data or data reuse could further divorce it from rawness by attributing meaning to it over the course of subsequent analyses. These future uses and analyses are not always in line with data generators’ objectives. Therefore, dissemination of big data SSR results may involve risks that are hard to identify/manage even for researchers strongly determined to uphold research ethics and integrity norms. Furthermore, algorithmic tools that analyze and interpret big data SSR might influence results by operating under assumptions that are not endorsed by researchers or their subjects (e.g., what should be considered normal in each population, cf. Neff & Nafus, 2016 , 48–49). Indeed, big data sets could reveal unforeseen connections, patterns and information, making it difficult for investigators to anticipate the outcomes and consequences of future analyses (Mittelstadt & Floridi, 2016 ). These challenges not only threaten methodological soundness, but also have ethical implications when big data SSR generates unpredictable results that could justify discrimination, symbolic violence Footnote 5 and other harmful practices that are difficult to anticipate when research is being designed, conducted or published. In particular, since data literacy is a specialized skill unequally possessed by researchers, policymakers, and the public (Wolff et al., 2016 ), results produced by big data SSR might confuse various stakeholders (Pangrazio & Sefton-Green, 2020 ) about their intended purpose or their actual meaning (boyd & Metcalf, 2014 ).
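
One way to see how an analytic tool's built-in notion of "normal" shapes results is the self-contained sketch below, which uses synthetic data only (not drawn from any study discussed here): the same simple outlier rule flags very different shares of the same group depending on which reference population its threshold is computed from, a choice that is often fixed inside the tool rather than made explicitly by the researcher.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily posting counts for two hypothetical groups whose typical
# behaviour genuinely differs.
group_a = rng.poisson(lam=3, size=1000)    # posts rarely
group_b = rng.poisson(lam=12, size=1000)   # posts often

def flagged(values, reference, z=2.0):
    """Flag values more than z standard deviations above the reference mean."""
    cutoff = reference.mean() + z * reference.std()
    return values > cutoff

pooled = np.concatenate([group_a, group_b])

# Identical individuals, different baked-in baselines of "normal":
print("Group B flagged against its own baseline:  ", flagged(group_b, group_b).mean())
print("Group B flagged against a pooled baseline: ", flagged(group_b, pooled).mean())
print("Group B flagged against Group A's baseline:", flagged(group_b, group_a).mean())
```

The same logic applies to thresholds, reference categories and default parameters buried in larger analytic pipelines: who ends up labelled "abnormal" is partly an artefact of assumptions the researcher may never have endorsed.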

3) Methods and devices employed to collect health-related information are subjected to strict regulatory oversight and their reliability is demonstrated in elaborate trials (Kramer et al., 2020 ). Such stringent requirements are not applied to SSR, and if applied, they are considered a misfit (National Research Council, 2003 ). Using a biomedical understanding of ethical principles and issues “such as avoiding harm and doing good, informed consent, confidentiality, etc.” for SSR, could result in misjudging the impact of SSR on research subjects and societies (Gurzawska & Benčin, 2015 , p. 5). Accordingly, big data SSR could serve as a justification for discriminatory policy decisions against research subjects or create and reinforce harmful stereotypes about social groups. Especially since many researchers engaged in big data SSR are not social scientists by training, they might be insufficiently trained/prepared to anticipate likely harms arising from SSR (Hesse et al., 2019 ). Experts have argued that one reason why these issues are not adequately addressed during the design, data collection, analysis and publication of big data SSR is that available ethical frameworks are not well-equipped to address them (Boyd, 2017 ). Footnote 6 In addition, regulatory bodies, Institutional Review Boards (IRBs) and Research Ethics Committees, are inadequately equipped to evaluate ethical issues of big data SSR (Favaretto et al., 2020 ; Vitak et al., 2017 ). It is challenging to capture ethical issues of big data SSR as they evolve alongside big data technologies. The necessity to continuously revise guidelines, even those that are developed for a specific data collection method e.g., Internet Research: Ethical Guidelines (franzke et al., 2020 ) demonstrates the dynamic landscape of this domain and calls for the improvement of current guidelines (Hollingshead et al., 2021 ).

Big Data SSR Through the Lens of Resnik’s Principles

To explore the ethical issues of big data SSR in a systematic manner, we employ the normative framework developed in David Resnik’s Ethics of science (2005). This framework consists of twelve principles: honesty, carefulness, openness, freedom, credit, education, social responsibility, legality, opportunity, mutual respect, efficiency, and respect for subjects. Although all twelve principles are relevant to big data SSR, in our analysis we focus on the six principles of honesty, carefulness, openness, efficiency, social responsibility and respect for subjects. Employing these six principles in three pairs enables us to systematically explore what we deem to be the three most pressing reasons for ethical concern in the context of big data SSR. In what follows we discuss three clusters, each addressing two principles. These include ethical issues about bias (the principles of honesty and carefulness), risks relating to publication and reuse of big data (the principles of openness and efficiency) and ethical concerns about individuals and societies (the principles of social responsibility and respect for subjects).

First, Resnik’s framework allows us to distinguish between two types of bias. One type (discouraged by the principle of carefulness) pertains to biases that might be embedded in the methodologies and techniques used in research processes (what we call methodological biases, which, as explained in the previous section, are pronounced when using big data). The second type (discouraged by the principle of honesty) relates to researchers’ personal values, worldviews, preferences, language use, etc., which may affect their observations, inferences or conclusions (what we call prejudice). Given the aforementioned weaknesses (e.g., misfit) of big data analysis methods for SSR, and the hermeneutic nature of SSR, distinguishing between these two types of bias helps articulate ethical issues more specifically. These two forms of bias are discouraged by the principles of honesty and carefulness and are explored in detail in Sect. 3.1:

Honesty: “scientists should not fabricate, falsify, or misrepresent data or results. They should be objective, unbiased, and truthful in all aspects of the research process” [emphasis added] (Resnik, 2005, p. 48). Carefulness: “Scientists should avoid errors in research, especially in presenting results. They should minimize experimental, methodological, and human errors and avoid self-deception, bias, and conflicts of interest” [emphasis added] (Resnik, 2005, p. 51).

Second, Resnik’s principles of openness and efficiency are also particularly useful in exploring ethical issues related to the publication/reuse of big data and the associated risks.

Openness: “Scientists should share data, results, methods, ideas, techniques, and tools. They should allow other scientists to review their work and be open to criticism and new ideas” (Resnik, 2005, p. 52). Efficiency: “Scientists should use resources efficiently” (Resnik, 2005, p. 60).

When it comes to using big data, the principles of openness and efficiency are not only connected but also inseparable, making both relevant to exploring the risks of big data publication and reuse. While openness of data enables efficient use of resources (e.g., data reuse), efficient use of resources requires openness of data. However, as Sect. 3.2 shall demonstrate, attempts to uphold both in the context of big data SSR contribute to specific risks.

Third, Resnik’s framework was developed with a recognition of the social impacts of SSR (e.g., the influence of results on social and political agendas) in addition to personal harms (Resnik, 2005, p. 133). Accordingly, it allows us to identify and explore two forms of ethical concern, one related to research subjects (e.g., dignity) and one to societies (e.g., harms to society), both formulated as normative principles:

Respect for subjects: “scientists should not violate rights or dignity when using human subjects in experiments” (Resnik, 2005, p. 61). Social responsibility: “scientists should avoid causing harms to society and they should attempt to produce social benefits. Scientists should be responsible for the consequences of their research and they should inform the public about those consequences” (Resnik, 2005, p. 57).

As will be shown in Sect. 3.3, in the context of big data SSR, respect for subjects might not necessarily prevent harms to societies, and attempts to uphold both of these principles might not always succeed.

Prejudices and Biases

Recent developments in big-data-generating technologies have opened new possibilities for social scientists, some of which might infuse new forms of prejudice and bias into research outcomes. The prejudices and biases discussed in this section not only hinder researchers’ adherence to the principles of honesty and carefulness, but may be so subtle that even the most diligent researchers are unable to neutralize them.

While researchers have more control over the methods used to generate original data sets (compared with reusing existing data sets), they cannot always identify biases introduced by the technologies they employ. Although this difficulty is present to a degree in all kinds of research, we argue that the sheer variety, velocity and volume of information in big data sets increase researchers’ dependence on technology while reducing their control over its impact, thus exacerbating ethical issues. Accordingly, by employing data sets generated with the help of technology, services or software delivered by third parties (whether generating their own data sets or reusing available ones), social scientists might face specific ethical challenges regarding bias. Depending on the stage(s) at which third-party technology is used, its inherent biases might corrupt data collection, study design and analysis, for example through a lack of consideration for relevant characteristics of respondents (e.g., membership of vulnerable groups or endorsement of certain political views). These challenges might hamper social scientists’ ability to identify, let alone avoid, methodological biases as demanded by the principle of carefulness. To articulate some of these biases more clearly, we use self-tracking Footnote 7 and crowdsourcing platforms employed in SSR as examples that complicate researchers’ adherence to the principles of honesty and carefulness.

I) In some SSR contexts (e.g., psychology, anthropology, sport and health sociology), researchers employ automated data collection devices (e.g., self-tracking devices) worn or used by research subjects to explore movement, health and/or productivity (Neff & Nafus, 2016; Lupton, 2016). These data collections are not always accurate; hence, the resulting conclusions might not be as objective and unbiased as they appear. Research shows that self-tracking devices cannot always reliably detect particular kinds of movement, which leads them to over- or underestimate activity metrics while still framing them as accurate and objective (Hoy, 2016; Piwek et al., 2016; Moore & Piwek, 2017). Moreover, even if self-tracking devices could accurately capture all possible movements, their designers might categorize and understand these in ways that differ from researchers’ understanding. For example, since the definition of an intense workout and the recommended activity levels for each individual remain rather ambiguous, different technologies use dissimilar parameters to define specific variables. Consequently, devices from two different manufacturers might provide altogether different results for the same subject, even in measurements as seemingly uncomplicated as step-counting (Crawford et al., 2015). According to Crawford and colleagues, this issue becomes even more pronounced when complex parameters, such as the difference between light and deep sleep, are considered. These parameters might be important for social scientists investigating, for example, the relationship between physical and mental health and the quality of the neighborhood in which research subjects live (Hale et al., 2013). Although the objectivity and accuracy of such results cannot always be fully trusted, upon publication (and partly due to the varied levels of data literacy among stakeholders, as mentioned in Sect. 2), results can be interpreted (and reproduced in popular media) with blind faith because they are expressed numerically and therefore resemble objective measurement (Mills, 2018).
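The dependence of step counts on undisclosed parameter choices can be illustrated with a minimal sketch. The step-detection rule and all parameters below are invented for illustration; real manufacturers’ algorithms are proprietary and likely far more elaborate. The point is only that two plausible parameterizations yield different counts from the identical movement trace:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated acceleration-magnitude trace: a rhythmic movement signal with
# uneven stride strength plus sensor noise (60 s sampled at 100 Hz).
t = np.linspace(0, 60, 6000)
signal = np.abs(np.sin(2 * np.pi * 1.0 * t))      # rhythmic peaks
signal *= rng.uniform(0.6, 1.4, size=t.size)      # uneven movement intensity
signal += rng.normal(0, 0.15, size=t.size)        # sensor noise

def count_steps(trace, threshold, min_gap):
    """Count upward threshold crossings that are at least `min_gap` samples apart."""
    above = trace > threshold
    crossings = np.flatnonzero(above[1:] & ~above[:-1])
    steps, last = 0, -min_gap
    for c in crossings:
        if c - last >= min_gap:
            steps += 1
            last = c
    return steps

# Two hypothetical devices with different, undisclosed parameter choices.
print("Device A counts:", count_steps(signal, threshold=0.8, min_gap=40))
print("Device B counts:", count_steps(signal, threshold=1.1, min_gap=60))
```

Neither count is “the” true number of steps; each is an artifact of design decisions the researcher usually cannot inspect.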

Furthermore, collected data may lack contextual information because researchers are unable or unmotivated to examine and disclose contextually relevant factors that shaped the data sets. For example, even though self-tracking data about geolocation and physical activity might be highly beneficial for a study investigating people’s mobility and public health risks, such data does not necessarily provide all the contextual information required to draw accurate conclusions about the studied cohort. The one-size-fits-all approach of data-collection devices does not account for variables such as childcare responsibilities or the injury history of research subjects, which can influence the extent and intensity of daily movements (Neff & Nafus, 2016; Selke, 2016). Footnote 8 Consequently, while some researchers might be inclined to draw seemingly objective and science-based conclusions when employing big data in SSR, a careful evaluation of what information is missing from the used data sets, and of the implications of that missing information for the overall conclusions, could reveal undisclosed limitations and biases (cf. Camfield, 2019).
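A short sketch can show how an unrecorded contextual variable produces a misleading headline result. The cohorts, step counts and the “caregiving” variable below are entirely fabricated for illustration; the structure of the example is the familiar omitted-variable problem:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical scenario: the tracker records daily steps but not caregiving load,
# and the share of caregivers differs between the two neighborhoods compared.
def simulate(n, share_caregivers):
    caregiver = rng.random(n) < share_caregivers
    steps = np.where(caregiver,
                     rng.normal(11_000, 1_500, n),   # many low-intensity errand steps
                     rng.normal(7_000, 1_500, n))
    return caregiver, steps

care_a, steps_a = simulate(5_000, share_caregivers=0.7)
care_b, steps_b = simulate(5_000, share_caregivers=0.2)

print(f"Raw means:        A = {steps_a.mean():.0f}   B = {steps_b.mean():.0f}")
print(f"Caregivers only:  A = {steps_a[care_a].mean():.0f}   B = {steps_b[care_b].mean():.0f}")
print(f"Others only:      A = {steps_a[~care_a].mean():.0f}   B = {steps_b[~care_b].mean():.0f}")
```

The raw comparison suggests that one neighborhood is substantially “more active”, yet within each (unrecorded) caregiving stratum the two groups are indistinguishable: the apparent difference is produced entirely by context the device never captured.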

II) Algorithmic bias and the limitations of third-party technologies remain mostly undisclosed; hence, researchers cannot always employ measures to offset biases. Data-generating devices process collected information using algorithms that operate in line with the instructions and assumptions of their developers. As designers of algorithmic tools might be unaware of their own presuppositions and prejudices, or might not actively take steps to avoid biases when designing algorithms, many contemporary technologies have been shown to exhibit various forms of algorithmic bias (Friedman & Nissenbaum, 1996; Sharon, 2017). Self-tracking devices are reported to be accurate only in gathering data related to particular types of activity or particular users, while producing unreliable or even plainly wrong results for others. For example, women using wearable fitness trackers or the step-counting functionalities embedded in most contemporary smartphones commonly report that some of their daily movements (e.g., pushing prams) remain unregistered, or that their smartphones register different statistics when kept in handbags instead of pockets (Criado Perez, 2020, pp. 159–160; Lupton & Maslen, 2018).

Technologies that collect and process data do not always account for the racial, gender and age diversity of the general population. For example, they might be more likely to produce reliable results for white, young, male users (if these were overrepresented in the development process) than for other groups (Obermeyer et al., 2019). Moreover, the functioning of algorithms and the rationale for the design of hardware employed in data-collecting devices are rarely disclosed by developers (Crawford et al., 2015). This has implications for those arguing that the genealogy of data needs to be untangled by researchers (Mauthner, 2019). Such views, however, seem to overlook the fact that untangling this genealogy might not always be possible, especially when companies with commercial interests hide the exact technical specifications of their devices and algorithms, or even attempt to mislead users (and researchers) about the actual operations of their technologies by hiding relevant information in purposefully unclear terms of service and privacy policy documents (Kreitmair & Cho, 2017; Danaher et al., 2018). It is therefore reasonable to argue that biases inherent in the devices and algorithms used for collecting and processing data make it likely that the generated big data sets are biased as well. However, since data is framed as accurate and objective, and potential biases or limitations are not always diligently disclosed, it is difficult for researchers to identify potential biases of generated data sets.
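The mechanism behind such group-dependent accuracy can be sketched in a few lines. The detector, the calibration threshold and the “wrist amplitude” values below are hypothetical; the sketch only shows how a single parameter calibrated on one movement style systematically under-records another, even when the underlying behavior is identical:

```python
import numpy as np

rng = np.random.default_rng(3)

TRUE_STEPS = 8_000  # both users actually walk the same amount per day

def recorded_steps(wrist_amplitude, n_days=100):
    """Toy wrist-worn detector: a step only registers when the per-step wrist
    movement exceeds a fixed threshold calibrated during device development."""
    threshold = 0.9  # calibrated on users walking with a free arm swing
    per_step = rng.normal(wrist_amplitude, 0.3, size=(n_days, TRUE_STEPS))
    return (per_step > threshold).sum(axis=1)

free_swing = recorded_steps(wrist_amplitude=1.2)  # arm swings freely
pram_push = recorded_steps(wrist_amplitude=0.7)   # hands fixed on a pram handle

print("Mean recorded steps, free arm swing:", int(free_swing.mean()))
print("Mean recorded steps, pushing a pram:", int(pram_push.mean()))
print("Actual steps taken in both cases   :", TRUE_STEPS)
```

A data set built from such recordings would “show” that one group is far less active, although the difference originates in the calibration choice rather than in the subjects’ behavior.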

III) Users’ and third parties’ financial and non-financial conflicts of interest exacerbate biases. Crowdsourcing platforms such as CrowdFlower, Clickworker, Toluna, and Amazon’s Mechanical Turk are regularly used by social scientists to generate big data sets. When crowdsourcing platforms are used, the financial incentives offered to participants (a payment per completed survey) and the lower cost of data collection for researchers (who incur lower costs than when collecting data manually) might not only contribute to, but also encourage unethical practices (Quinton & Reynolds, 2017). Research subjects might try to increase their earnings by completing surveys hastily to maximize the number of completed surveys per day, or researchers might exploit subjects by not fully informing them about the time required to complete a survey, thereby (inadvertently) encouraging sloppy behavior and increasing the likelihood of generating biased data sets (Semuels, 2018; Starkbaum & Felt, 2019). Furthermore, the low financial rewards offered by most crowdsourcing platforms increase the chances of obtaining biased data sets. Crowdsourced surveys might entail non-inclusive samples, as the low rewards do not incentivize individuals from high-income countries, whereas for individuals based in low-income countries working full time on crowdsourcing platforms could yield a sufficient income. Footnote 9
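A minimal sketch illustrates both the problem and the limits of a common remedy. The completion times, group sizes and the screening rule below are invented for illustration; many survey platforms offer comparable time-based quality checks, but no particular product is implied:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical crowdsourced survey: completion times in minutes.
# "Speeders" click through answers to maximize the number of paid tasks per hour.
careful = rng.normal(12, 2.5, size=700)
speeders = rng.normal(3, 1.0, size=300)
times = np.concatenate([careful, speeders])

# A common but imperfect screen: drop responses faster than half the median time.
cutoff = 0.5 * np.median(times)
kept = times >= cutoff

print(f"Median completion time: {np.median(times):.1f} min; cutoff: {cutoff:.1f} min")
print(f"Responses kept: {kept.sum()} of {times.size}")
print(f"Speeders that survive the screen: {kept[700:].sum()}")
```

Such screens remove most, but not all, hasty responses, and they also discard some legitimately fast respondents, so the resulting data set is cleaned at the price of a new, researcher-introduced selection.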

Moreover, when big data sets are generated using social networking sites, it might be impossible to isolate data sourced from fake and bot accounts, some of which might have been created with specific financial and political agendas. Consequently, the information contained within such data sets might have been subject to manipulation by third parties engaged in disinformation campaigns, or otherwise tainted by trolls and malicious actors.

Risks Arising from Reuse of Data

Social scientists commonly reuse data sets generated for other studies (Curty, 2016). In fact, Resnik’s principles of openness and efficiency demand that data sets be made openly available and reused. However, reusing big data sets in SSR to uphold these two principles might contribute to, and even facilitate, violations of other principles, as we demonstrate in this section. Although some of these issues are connected to individual and social harms, as well as to the prejudices and biases discussed in the neighboring subsections, we believe it is essential to highlight the risks involved when openly available data is reused by third parties (e.g., other researchers or non-academic parties).

Administrative data generated by public institutions is particularly useful for SSR, especially when it is in the public domain and contains demographic and financial information (Connelly et al., 2016). For instance, the European Union Open Data portal ( https://data.europa.eu/euodp/en/data/ ) contains 1,306,410 data sets (as of February 2022), ranging from national opinion trends to medicine, mobility, demographic and gender issues. Footnote 10 The American equivalent, the Data.gov catalog ( https://catalog.data.gov/dataset ), contains 341,876 data sets (as of February 2022) on topics from property sales per county to the health status of groups of Medicare beneficiaries. Besides gaining access to data that would be impossible to collect without public or governmental resources, social scientists can use advanced big data analytic techniques to extract useful information from these data sets without engaging in time-consuming or costly data collection efforts. Footnote 11 From an ethical perspective, this degree of data availability creates three dilemmas.

Although reusing data sets is efficient, it has a significant (epistemic) downside: researchers were not involved in the data collection process, so they have no influence on, and potentially limited insight into, how the data was collected. Accordingly, researchers are unable to anticipate and account for undisclosed biases embedded in data sets. Especially where data sets are not linked to a published manuscript or lack supplementary information about the methodology used, researchers are unaware (and unable to become aware) of biases and limitations (Mittelstadt & Floridi, 2016; Lazer et al., 2014). Hence, researchers cannot determine whether the data was collected diligently and responsibly (Wallis & Borgman, 2011), which poses a threat to the integrity of research.

While the public availability of data enables critical scrutiny and assessment of results and facilitates efficiency, it also makes data vulnerable to unethical practices or, worse, accessible to abusive actors. Besides benefiting academic scholars, the regulatory push for making research data FAIR (Findable, Accessible, Interoperable, Reusable) has also allowed various non-academic parties to benefit from free research data (Wilkinson et al., 2016). When reusing data, non-academic users might not adhere to the norms and values that academic researchers are expected to uphold. Researchers are (usually) required by their institutions to attend research ethics and integrity trainings and to have their proposals and methodology vetted by IRBs or ethics committees. However, since mechanisms for regulating non-academic research are generally less rigorous (Polonetsky et al., 2015), data availability might contribute to unforeseen ethical challenges. While the number of data sets stored in repositories such as the European Union Open Data portal and the American data catalog shows researchers’ and public institutions’ willingness to share data sets, citizens should be concerned about who will reuse these data sets and for what purposes. Furthermore, data sets are vulnerable to cyber-attacks and so-called data leaks. Even when data sets generated through research practice are seemingly protected, corrupt researchers (Cass, 1999) or other non-academic parties might steal existing data or hack data repositories to extract valuable information (Mello, 2018).

Data availability also facilitates data aggregation and reaching unforeseen conclusions. Whereas a study might focus on people’s mobility patterns or earning potential, combining and enriching its results with data points retrieved from other data sets multiplies the possibilities for making seemingly meaningful conclusions. For example, administrative data sets used to determine citizens’ earnings might be linked with data about the distribution of people with a particular social or ethnic background across communities, thereby allowing researchers to find correlations and arrive at prejudiced conclusions that they would not have reached if the information triggering such questions had not been readily available. Footnote 12 Accordingly, social scientists employing big data sets generated by public institutions, shared by other researchers, or provided by commercial companies might inadvertently violate principles of research integrity (e.g., by using data for specific objectives without subjects’ consent). Footnote 13
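How little effort such linkage requires can be shown with a hypothetical sketch. The districts, incomes and demographic shares below are invented, and the resulting figure is a district-level (ecological) correlation that says nothing about any individual; the point is only that a single join produces an analysis neither data publisher anticipated:

```python
import pandas as pd

# Two hypothetical open data sets published independently; all values invented.
earnings = pd.DataFrame({
    "district": ["A", "B", "C", "D"],
    "median_income": [58_000, 31_000, 72_000, 27_000],
})
demographics = pd.DataFrame({
    "district": ["A", "B", "C", "D"],
    "minority_share": [0.12, 0.46, 0.08, 0.55],
})

# A single join on a shared geographic key enables a new, unanticipated analysis.
linked = earnings.merge(demographics, on="district")
print(linked)
print("District-level correlation:",
      round(linked["median_income"].corr(linked["minority_share"]), 2))
```

The ease of this operation, rather than any property of either data set on its own, is what generates the ethical risk described above.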

These three dilemmas are further intensified because most citizens who engage in online interactions rarely understand, or are rarely informed about, the potential uses of their information in future research projects. Accordingly, different views are debated: while some argue that utilizing information in ways that go beyond reasonable user expectations is a violation of privacy (Nissenbaum & Patterson, 2016), others believe that research subjects should be directly prompted about data reuse (Mannheimer et al., 2019). Either way, since the notion of reasonable user expectation is open to interpretation and reaching out to the subjects of past projects is not always possible, in practice the onus seems to be on data collectors to anticipate and/or communicate potential reuse, or to revise their ethics protocols with amendments and obtain consent if necessary (Remenyi et al., 2011).

Individual and Social Harms

In cases where SSR exposes participants’ personal characteristics and vulnerabilities (Nissenbaum & Patterson, 2016), using big data sets might enable researchers to predict participants’ future behavior (and behavioral patterns), which complicates upholding the principles of respect for subjects and social responsibility. Footnote 14 When predictive research efforts are coupled with commercial interests, they have resulted in the unfair exclusion of vulnerable groups from opportunities (e.g., access to credit) or led to predatory marketing campaigns (Madden et al., 2017). These practices are particularly egregious when research results rationalize policies and practices that target or even discriminate against a particular group through data categorization – a viable practice even when data is anonymized (Ajana, 2017). Footnote 15 Indeed, some who argue that there is much more information available about us online than we might realize have directly linked this issue with political power, claiming that this abundance of information makes democracies vulnerable: the more that is known about each of us, the more predictable we become, and hence the more predictable our political choices become (Véliz, 2020).

Consequently, uncertainties associated with the (future) processing of data sets might impede researchers’ ability to uphold the principles of social responsibility and respect for subjects. In employing big data sets, researchers or other users may apply data processing methods to achieve objectives that participants have not consented to or, worse, use the data against participants’ social, political or financial interests without any regulatory oversight. Examples include zip code categorization to prioritize services (e.g., providing faster delivery times to neighborhoods predominantly populated by wealthy white customers, cf. Ingold & Soper, 2016), gerrymandering to change the political dynamics of communities (Duchin, 2019), or increasing insurance premiums based on the demographic segmentation of communities.

The use of big data sets has also facilitated questionable research practices such as HARKing (Hypothesizing After the Results are Known) and question trolling, which involves searching data across several constructs or relationships to find notable results (Kerr, 2016; Murphy & Aguinis, 2019). From a methodological perspective, these practices suggest a move from a hypothesis-driven to a hypothesis-free research paradigm (Pasquetto, 2018) – sometimes called the end of social theory (Anderson, 2008) – but they also challenge the ethical principles of respect for subjects and social responsibility. While both HARKing and question trolling nullify individuals’ consent (e.g., by formulating questions or hypotheses that were not communicated to subjects in information sheets), in SSR they may also exacerbate the harmful effects of research on society by giving more control (over individuals and societies) to those who can access and/or analyze users’ data.
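The statistical trap that makes HARKing attractive in large data sets can be shown with a minimal sketch using nothing but random noise; the sample size, variable count and significance rule below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

n_subjects, n_variables = 500, 200
outcome = rng.normal(size=n_subjects)                  # e.g., a behavioral score
features = rng.normal(size=(n_subjects, n_variables))  # 200 unrelated noise variables

# "Question trolling": scan every variable for a notable association, then
# hypothesize about (and report) whatever happens to clear the bar.
r = np.array([np.corrcoef(features[:, j], outcome)[0, 1] for j in range(n_variables)])
critical = 1.96 / np.sqrt(n_subjects)  # approximate |r| needed for p < 0.05 here

print("Variables scanned:", n_variables)
print("'Significant' correlations found in pure noise:", int((np.abs(r) > critical).sum()))
```

Scanning two hundred unrelated variables reliably yields a handful of nominally significant “findings”, which is precisely why hypotheses formulated after the scan cannot carry the evidential weight of hypotheses stated in advance.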

In terms of the principle of respect for subjects, some projects “scoop up personal information” from users’ online activities or even fitness trackers (Madden et al., 2017, p. 64). This information is then combined with personal evaluation metrics (e.g., credit history, criminal background records, educational testing scores) to tag users with specific characteristics, thereby governing users’ access or privileges (especially those of low-income people) in relation to various public and private services (e.g., education, insurance). These practices create digital representations of individuals and groups of individuals, sometimes called data doubles (Haggerty & Ericson, 2000; Ruckenstein, 2014). Data doubles are created through pattern recognition methods and then used at massive scale to build predictive behavioral models (Fire et al., 2014). Subsequently, data scientists willing to engage in HARKing only need to look for patterns in data sets (also called data mining). Such data mining methods are commonly used by social scientists aiming “to maximize the overall predictive power” in testing social or psychological hypotheses (Attewell et al., 2015, p. 14). The unrestricted processing of data about the behavior of large groups (or clusters within groups) might expose the characteristics, vulnerabilities and decision-making processes of specific cohorts, thereby putting them in a weaker position relative to the researchers, institutions or companies that have access to, and can interpret, these results. Such knowledge about cohorts’ decision making might allow parties with financial or political agendas to target studied groups with specific strategies based on their predicted behavioral profiles, enabling, for example, manipulation aided by information derived through HARKing.

In relation to the principle of social responsibility, the high global environmental costs of big data storage and processing are rarely considered when discussing the ethical impact of big data analytics. Crawford (2021) argues that euphemistic terms such as cloud computing can make us falsely believe that data-processing algorithms function in a sleek and frictionless manner. Crawford adds that the devices used to store and process big data are constructed using large quantities of rare minerals, whose extraction has disastrous effects on the environment and the local communities of mined areas. Additionally, these devices consume enormous amounts of electricity and exacerbate the climate crisis. Material and energy requirements are also relevant from the standpoint of the principle of efficiency, as in many cases the use of big data methods might not be the most efficient way of allocating resources once the overall environmental impact of a study is considered.

Furthermore, the distance between researchers and subjects might contribute to individual harms. Researchers involved in big data research do not directly engage with the people described by the data, as opposed to SSR that relies on interviews, focus groups or surveys that do not result in big data sets. For example, when studying patients’ self-reported feelings about long-term cardiac treatment, Lomborg et al. (2020) noted that, as a result of interviews, researchers felt connected to subjects and their situation. Although these researchers had access to detailed information about subjects’ emotional dispositions and medical history (supplied by data-collecting devices), they only recognized the personal dimensions of the research during direct contact with subjects. Footnote 16 Big data SSR, however, might not require any personal contact with subjects. The ethical concern is that big data’s technological mediation increasingly detaches researchers from participants and dilutes their perception of human subjects (Zimmer, 2018). Involved researchers might forget that specific data points within data sets are connected to subjects with expectations, rights and vulnerabilities that should be respected. Consequently, subjects are more likely to be harmed through the objectification and instrumentalization of their data.

Suggestions for Developing Ethics Guidelines

In this paper, we have argued that big data SSR involves distinct ethical issues related to prejudices and biases, risks arising from the publication and reuse of data, and individual and social harms. We showed that these ethical issues complicate and/or impede researchers’ adherence to the principles of honesty, carefulness, openness, efficiency, respect for subjects and social responsibility, as articulated in Resnik’s research ethics framework.

Despite the wide range of potential ethical issues in big data SSR, these issues have received relatively little regulatory and ethical scrutiny. While some codes of conduct note individual ethical issues relevant to big data SSR, they rarely capture the complexities of this field to a satisfactory degree and are neither globally endorsed nor enforced. Consequently, researchers willing to uphold ethical standards in conducting big data SSR might find it difficult to locate relevant ethical guidance. As mentioned in Sect. 2 of this paper, in the absence of comprehensive and universally accepted research ethics procedures for big data SSR, research ethics committees are not subjecting big data SSR to appropriate ethical scrutiny, as they currently lack the tools and knowledge necessary to do so in a satisfactory manner.

As the volume, variety and velocity of big data increase, harnessing information from big data sets for the purposes of SSR will become ever more appealing to researchers. To the best of our knowledge, this paper is the first attempt to adopt a research ethics normative framework to explore the complicated landscape of the ethics of big data SSR. We believe it should serve as a call to action for the scientific community and regulatory bodies to devote more attention to the growing complexity and variety of ethical aspects of big data SSR. The formulation of clear guidelines for big data SSR would be one of the first steps required to reduce the likelihood of ethical issues. In line with the issues identified using Resnik’s framework, we offer the following considerations for developing future guidelines on big data SSR:

Prejudices and biases.

When sharing their data sets as a stand-alone research output or as part of a manuscript, researchers should disclose the limitations and biases of generated or reused data sets. In the absence of such information, adding disclaimers should be mandatory.

Data repositories should mandate and prompt researchers to disclose limitations and biases when storing data sets (e.g., by adding a new mandatory textbox to fill in; a minimal sketch of what such a machine-readable disclosure could look like follows this list).

Funders, academic/non-academic research institutions and IRB/research ethics committees should provide guidance and best practices on how to minimize biases embedded in data sets and third-party technologies, and those resulting from researchers’ personal prejudices.
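To illustrate the first two recommendations, the following sketch shows one possible shape of a machine-readable limitations disclosure attached to a deposited data set. The field names and the dataset described are hypothetical and do not correspond to any existing metadata standard:

```python
import json

# A minimal, hypothetical machine-readable disclosure that could accompany a
# deposited data set; the field names are illustrative, not an existing standard.
limitations_record = {
    "dataset": "commuting-activity-2021",
    "collection_instrument": "consumer wrist-worn fitness trackers (mixed vendors)",
    "known_biases": [
        "step detection under-counts low arm-swing movement (e.g., pushing prams)",
        "participants recruited via a crowdsourcing platform; sample not representative",
    ],
    "missing_context": ["caregiving responsibilities", "injury history"],
    "reuse_caveats": "group-level correlations must not be read as individual-level claims",
}

print(json.dumps(limitations_record, indent=2))
```

Structured disclosures of this kind could be validated by repositories at deposit time and surfaced automatically to anyone who later reuses the data.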

Reuse of big data and the associated risks.

Researchers should be required to obtain research subjects’ explicit consent for the use of their information in big data SSR, as well as for the possible future reuse of their information in other studies, with the possibility to opt out of future use of their data.

Funders, academic/non-academic research institutions and IRB/research ethics committees should mandate researchers to inform their subjects about the consequences of the openness of data and instruct them about the likely future uses of data.

Data repositories should assign a DOI to every stored data set (and each of its subsequent versions) to enable and encourage researchers and data watchdogs to improve data set tracing.

Individual and social harms.

Researchers should be required to follow procedures that anticipate and determine potential social and individual impacts of their study and results (e.g., by performing an anticipatory analysis similar to those gaining popularity in ethics of technology, cf. Brey, 2012 ).

Funders, academic/non-academic research institutions and IRB/research ethics committees should mandate researchers to explicitly inform their subjects about the potential social impacts of studies employing their data.

Researchers employing big data tools should consider local and environmental impacts, and choose providers while considering their environmental footprints, sustainability of supply chains and efficiency of adopted methodologies.

Our working definition of big data is: Large sets of data compiled from various sources (e.g., existing administrative data, online interactions, data collected by devices) and stored in a digital form to be analyzed with computers. Big data has been characterized by three v’s: volume (the large amount of information), variety (the diverse scope of information) and velocity (the high speed at which new data is generated and analyzed) (Kitchin & McArdle, 2016 ).

While Weinhardt’s study claims to address ethical issues in big data SSR, in our view, it does not explore a single ethical issue that is unique to big data SSR. The examples he uses to illustrate social dimensions of big data research are less specific than what we describe in this paper. For example, while “the development of stock prices around the world, the tracking of trucks in automated toll systems for real-time forecasting of GDP developments, or the extraction of rental housing market information from websites and dedicated portals to estimate the development of rents over time” (Weinhardt, 2020, p. 358) could be interesting subjects for big data research, they are unrelated to the interpretative nature of SSR and only concern hypotheses that require big data. Salganik’s book (Bit by Bit), on the other hand, not only mentions but also elaborates on ethical issues of big data SSR. Salganik uses the four principles of Respect for Persons, Beneficence, Justice and Respect for Law and Public Interest introduced in The Menlo Report for ICT research (Dittrich & Kenneally, 2012). Since the Menlo report is built on the Belmont report, it is a better fit for computational biomedical research. Consequently, Salganik too neglects the ethical challenges introduced by the interpretive nature of SSR and the potential for prejudices and biases. In addition to highlighting ethical issues linked to the interpretative nature of SSR, our work specifically discusses ethical issues related to research integrity and environmental sustainability.

Two peer reviewers brought these special issues to our attention.

As discussed by González-Bailón ( 2013 ), tools focusing merely on the content of the processed information can neither account for the context, nor consider the agency of people involved.

We use the term symbolic violence after Bourdieu to designate non-physical harms, such as derogatory or stigmatizing language, social exclusion, and lack of representation, which are inflicted upon individuals with the purpose of entrenching the existing stratification of society and the associated inequality and injustice (Bourdieu, 1991 ; Bourdieu & Wacquant, 1992 ).

The European Commission has published specific guidelines entitled Ethics in Social Sciences and Humanities (European Commission, 2018). This document highlights ethical issues relevant to data collection efforts that are internet-mediated and/or use social media, but it does not capture all issues raised in this article. Furthermore, endorsed by the academies of sciences of more than 40 countries, the European Code of Conduct for Research Integrity specifically notes that “researchers, research institutions and organizations provide transparency about how to access or make use of their data and research materials” (ALLEA, 2017, p. 6). While the notion of ‘how to make use of their data and research materials’ could also imply the disclosure of biases and limitations of data sets to facilitate ethical use of data, to the best of our knowledge, none of the major repositories (even the EU Open Data portal) require such disclosures. Although these kinds of disclosures are more common in published manuscripts (wherein study limitations are mentioned), similar practices have not been suggested for data sets.

Self-tracking technologies include devices and smartphone apps that enable users to collect data about themselves and their daily activities (Neff & Nafus, 2016 ). Popular examples include Fitbit fitness bands and Apple Watch that collect information about users’ physical activity, sleep patterns and mood. Since these technologies enable the collection of a variety of behavioral information about subjects with little difficulty and costs, they benefit SSR. For example, Lomborg et al., ( 2020 ) used Fitbits to study how live monitoring of heartrate could impact cardiac patients’ mood, while also discussing patients’ skills and cultural contexts when making sense of their medical information.

As self-tracking technologies reduce qualitative phenomena to their quantifiable characteristics, they often fail to provide contextual factors that could be relevant for assessing the information in the generated data sets.

Crowdsourcing platforms can be seen as inherently exploitative. For example, Crawford ( 2021 ) observed that many users of crowdsourcing platforms receive less than their local minimum wage for their contributions. Since platforms like Mechanical Turk can be the main source of income for some people, and as these platforms often effectively outsource data collection to regions where labor is much cheaper, researchers should envisage that lowering the financial cost of conducting research might have high ethical costs .

A regulatory push from the European Commission to “make as much information available for re-use as possible” by public agencies/institutions has increased availability of data sets (European Commission 2020 , paragraph 1). Additionally, due to the international support and mandates for Open Access publication of data to realize the ambition of “open research data per default (but allowing for opt-outs)” (European Commission n.d., paragraph 3), results and the data associated of thousands of research projects are publicly available for reuse.

In the US, this trend was exacerbated when in 2013 the Obama Administration made open data the default method of disseminating research conducted by the federal government. Accordingly, data sets that include information on health, climate, small business and manufacturing opportunities, crime, education, and public domain information on the federal workforce should be made publicly available. Marion Royal (the director of data.gov) notes that “the model of preserving privacy by individual consent might be obsolete when so much data is passively captured by sensors, and the abundance of social media and search data collected by private companies makes anonymization ‘virtually impossible,’ … Privacy as a concept is becoming less clear as technology increases and big data becomes more prevalent, and available” (Mazmanian, 2014 , paragraph 4–7).

As internet companies commonly track cookies across multiple websites to collect users’ data (e.g., Facebook has admitted to collecting data even on non-members by tracking cookies across partnering websites, cf. Brandom, 2018 ), it is often practically impossible for users to establish which data was willingly and knowingly shared. Moreover, since data is exchanged among a wide range of vendors, it is virtually impossible to determine a full life cycle or value chain of users’ data. For instance, The New York Times website lists among its “nonessential” cookies 19 marketing and 8 advertising trackers which send information about readers’ activity to companies such as Google, Facebook, Microsoft (cf. https://www.nytimes.com/privacy/cookie-policy ) with access to enormous datasets and capability to process/aggregate data. Innocuous data about reading habits could be used to target specific groups with e.g., marketing/political campaigns across other platforms.

In 2013, the New York City Taxi & Limousine Commission released an anonymized dataset with information about 173 million individual cab rides – including pickup and drop-off times, locations, fare and tip amount. After the release, researchers that freely accessed the database were able to reveal private and sensitive information about the taxi-drivers (e.g., religious belief, average income and even an estimation of their home address), thus demonstrating the ease with which databases can be processed to reveal information about individuals (Franceschi-Bicchierai, 2015 ).

Practices such as psychographic targeting, which involves targeting users based on their personality traits (Gibney, 2018), or the Big Five scale test, which measures users’ five personality traits (i.e., openness, conscientiousness, extraversion, agreeableness and neuroticism) based on their Facebook likes (Kosinski et al., 2013), are among the methods that allow predicting, but also influencing, human behavior.

In data categorization practices, individuals are targeted not based on unique characteristics (e.g., browsing data or employment history) or identifying features (e.g., biometric data), but as a result of their membership of a group purported statistically more likely to exhibit certain behaviors. For example, financial institutions could (unfairly) deny a loan to an individual because according to their data, people belonging to the individual’s ethnic or social group are statistically more likely to default on loans.

Interestingly, Lomborg et al. (2020) also noted that they were not required to obtain ethical approval for their research despite being intimately involved in their subjects’ lives.

Ajana, B. (2017). Digital health and the biopolitics of the quantified self. Digital Health , 3 , 1–18. https://doi.org/10.1177/2055207616689509


All European Academies (ALLEA) (2017). The European code of conduct for research integrity-revised edition. Accessed 11 January 2021. http://ec.europa.eu/research/participants/data/ref/h2020/other/hi/h2020-ethics_code-of-conduct_en.pdf

Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired . https://www.wired.com/2008/06/pb-theory/

Attewell, P., Monaghan, D. B., & Kwong, D. (2015). Data mining for the social sciences: An introduction . University of California Press. https://www.jstor.org/stable/10.1525/j.ctt13x1gcg

Barocas, S., & Selbst, A. D. (2016). Big data’s disparate impact. California Law Review , 104 (671). http://www.jstor.org/stable/24758720

Barrowman, N. (2018). Why data is never raw. The New Atlantis, Summer/Fall 2018. Accessed 14 January 2021. http://www.thenewatlantis.com/publications/why-data-is-never-raw

Bender, S., Jarmin, R., Kreuter, F., & Lane, J. (2016). In I. Foster, R. Ghani, R. S. Jarmin, F. Kreuter, & J. Lane (Eds.), Big data and social science: A practical guide to methods and tools . Chapman and Hall

Bourdieu, P. (1991). Language and symbolic power . Polity Press


Bourdieu, P., & Wacquant, L. (1992). An invitation to reflexive sociology . The University of Chicago Press

Boyd, K. M. (2017). Why the biomedical research ethics model is inappropriate for social sciences: A response to ‘Responsible to Whom? Obligations to participants and society in social science research’ by Matt Sleat. In Finding common ground: Consensus in research ethics across the social sciences, Vol. 1, (pp. 55–60). Emerald Publishing Limited. https://doi.org/10.1108/S2398-601820170000001006

boyd, d., & Metcalf, J. (2014). Example “Big Data” research controversies (p. 4). Council for Big Data, Ethics, and Society. Accessed 17 May 2021 https://bdes.datasociety.net/wp-content/uploads/2016/10/ExampleControversies.pdf

Brandom, R. (2018). Shadow profiles are the biggest flaw in Facebook’s privacy defense’. The Verge , 11 April 2018. https://www.theverge.com/2018/4/11/17225482/facebook-shadow-profiles-zuckerberg-congress-data-privacy

Brey, P. A. E. (2012). Anticipatory ethics for emerging technologies. NanoEthics , 6 (1), 1–13. https://doi.org/10.1007/s11569-012-0141-7

Camfield, L. (2019). Rigor and ethics in the world of big-team qualitative data: Experiences from research in international development. American Behavioral Scientist , 63 (5), 604–621. https://doi.org/10.1177/0002764218784636

Cass, S. (1999). Researcher charged with data theft. Nature Medicine , 5 (5), 474–474. https://doi.org/10.1038/8350

Chang, R. M., Kauffman, R. J., & Kwon, Y. O. (2014). Understanding the paradigm shift to computational social science in the presence of big data. Decision Support Systems , 63 (July), 67–80. https://doi.org/10.1016/j.dss.2013.08.008

Connelly, R., Playford, C. J., Gayle, V., & Dibben, C. (2016). The role of administrative data in the big data revolution in social science research. Social Science Research , 59 , 1–12. https://doi.org/10.1016/j.ssresearch.2016.04.015

Crawford, K. (2021). Atlas of AI: Power, politics, and the planetary costs of artificial intelligence . Yale University Press


Crawford, K., Lingel, J., & Karppi, T. (2015). Our metrics, ourselves: A hundred years of self-tracking from the weight scale to the wrist wearable device. European Journal of Cultural Studies , 18 (4–5), 479–496. https://doi.org/10.1177/1367549415584857

Criado Perez, C. (2020). Invisible women . Vintage

Curty, R. G. (2016). Factors influencing research data reuse in the social sciences: An exploratory study. International Journal of Digital Curation , 11 (1), 96–117. https://doi.org/10.2218/ijdc.v11i1.401

Danaher, J., Nyholm, S., & Earp, B. D. (2018). The quantified relationship.  The American Journal of Bioethics , 18 (2), 3–19. https://doi.org/10.1080/15265161.2017.1409823

Dittrich, D., & Kenneally, E. (2012). The Menlo report: Ethical principles guiding information and communication technology research. US Department of Homeland Security. https://www.caida.org/catalog/papers/2012_menlo_report_actual_formatted/menlo_report_actual_formatted.pdf

Duchin, M. (2019). Geometry v. gerrymandering. In M. Pitici (Ed.), The best writing on mathematics 2019 (pp. 1–11). Princeton University Press

European Commission (2020, March 8). European legislation on open data and the re-use of public sector information . Shaping Europe’s Digital Future - European Commission. https://ec.europa.eu/digital-single-market/en/european-legislation-reuse-public-sector-information

European Commission. (n.d.). Open access [Text]. European Commission - European Commission. Accessed 28 (January 2021). from https://ec.europa.eu/info/research-and-innovation/strategy/goals-research-and-innovation-policy/open-science/open-access_en

European Commission (2018). Ethics in social science and humanities. Accessed 23 April 2021. https://ec.europa.eu/info/sites/default/files/6._h2020_ethics-soc-science-humanities_en.pdf

Favaretto, M., De Clercq, E., Briel, M., & Elger, B. S. (2020). Working through ethics review of big data research projects. Journal of Empirical Research on Human Research Ethics. https://doi.org/10.1177/1556264620935223

Feldman, S., & Shaw, L. (2019). The epistemological and ethical challenges of archiving and sharing qualitative data. American Behavioral Scientist , 63 (6), 699–721. https://doi.org/10.1177/0002764218796084

Fire, M., Goldschmidt, R., & Elovici, Y. (2014). Online social networks: Threats and solutions. IEEE Communications Surveys & Tutorials , 16 (4), 2019–2036. Accessed 23 May 2021 http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6809839

Foster, I., Ghani, R., Jarmin, R. S., Kreuter, F., & Lane, J. (2016). Big data and social science: A practical guide to methods and tools . Chapman and Hall

Franceschi-Bicchierai, L. (2015). Redditor cracks anonymous data trove to pinpoint Muslim cab drivers. Mashable , Available at: https://mashable.com/2015/01/28/redditor-muslim-cab-drivers/#0_uMsT8dnPqP (Accessed June 2020)

franzke, a. s., Bechmann, A., Zimmer, M., Ess, C., & the Association of Internet Researchers. (2020). Internet research: Ethical guidelines 3.0. https://aoir.org/reports/ethics3.pdf

Frey, W. R., Patton, D. U., Gaskell, M. B., & McGregor, K. A. (2020). Artificial intelligence and inclusion: Formerly gang-involved youth as domain experts for analyzing unstructured twitter data. Social Science Computer Review , 38 (1), 42–56. https://doi.org/10.1177/0894439318788314

Friedman, B., & Nissenbaum, H. (1996). Bias in computer systems. Computer Ethics , 14 (3), 215–232. https://doi.org/10.4324/9781315259697-23

Gibney, E. (2018). The scant science behind Cambridge Analytica’s controversial marketing techniques. Nature . https://doi.org/10.1038/d41586-018-03880-4

Gitelman, L. (2013). ‘Raw data’ is an oxymoron . MIT Press

González-Bailón, S. (2013). Social science in the era of big data. Policy & Internet , 5 (2), 147–160. https://doi.org/10.1002/1944-2866.POI328

Gurzawska, A., & Benčin, R. (2015). Ethics assessment in different fields of social sciences: A comparative analysis of practices and institutions in the EU and selected other countries (Deliverable 1.1, Stakeholders Acting Together on the Ethical Impact Assessment of Research and Innovation – SATORI Project). Accessed 5 June 2021 https://satoriproject.eu/media/2.d-Social-Sciences.pdf

Haggerty, K., & Ericson, R. (2000). The surveillant assemblage. The British Journal of Sociology , 51 (4), 605–622

Hale, L., Hill, T. D., Friedman, E., Javier Nieto, F., Galvao, L. W., Engelman, C. D. … Peppard, P. E. (2013). Perceived neighborhood quality, sleep quality, and health status: Evidence from the Survey of the Health of Wisconsin. Social Science & Medicine , 79 , 16–22. https://doi.org/10.1016/j.socscimed.2012.07.021

Hargittai, E. (2020). Potential biases in big data: Omitted voices on social media. Social Science Computer Review , 38 (1), 10–24. https://doi.org/10.1177/0894439318788322

Hesse, A., Glenna, L., Hinrichs, C., Chiles, R., & Sachs, C. (2019). Qualitative research ethics in the big data era. American Behavioral Scientist , 63 (5), 560–583

Hollingshead, W., Quan-Haase, A., & Chen, W. (2021). Ethics and privacy in computational social science: A call for pedagogy. In Handbook of computational social science (1 vol.). Routledge.

Hossain, N., & Scott-Villiers, P. (2019). Ethical and methodological issues in large qualitative participatory studies. American Behavioral Scientist , 63 (5), 584–603. https://doi.org/10.1177/0002764218775782

Hoy, M. B. (2016). Personal activity trackers and the quantified self. Medical Reference Services Quarterly , 35 (1), 94–100

Ingold, D., & Soper, S. (2016, April 21). Amazon doesn’t consider the race of its customers. Should it? Bloomberg . Accessed 15 May 2021 http://www.bloomberg.com/graphics/2016-amazon-same-day/

Kerr, N. L. (2016). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review , 2 (3), 196–217. https://doi.org/10.1207/s15327957pspr0203_4

Kitchin, R., & McArdle, G. (2016). What makes big data, big data? Exploring the ontological characteristics of 26 datasets. Big Data & Society , 3 (1), 1–10. https://doi.org/10.1177/2053951716631130

Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences , 110 (15), 5802–5805. https://doi.org/10.1073/pnas.1218772110

Kramer, D. B., Xu, S., & Kesselheim, A. S. (2020). Regulation of medical devices in the United States and European union. The Ethical Challenges of Emerging Medical Technologies , 41–49. https://doi.org/10.4324/9781003074984-3

Kreitmair, K., & Cho, M. K. (2017). The neuroethical future of wearable and mobile health technology. In J. Illes (Ed.), Neuroethics: Anticipating the future (pp. 80–107). Oxford University Press. https://doi.org/10.1093/oso/9780198786832.003.0005


Kuc-Czarnecka, M., & Olczyk, M. (2020). How ethics combine with big data: A bibliometric analysis. Humanities and Social Sciences Communications , 7 (1), 1–9. https://doi.org/10.1057/s41599-020-00638-0

Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A. L., Brewer, D., et al. (2009). Computational social science. Science , 323 (5915), 721–723. https://doi.org/10.1126/science.1167742

Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google Flu: Traps in big data analysis. Science , 343 (6176), 1203–1205. https://doi.org/10.1126/science.1248506

Lipworth, W., Mason, P. H., Kerridge, I., & Ioannidis, J. P. A. (2017). Ethics and epistemology in big data research. Journal of Bioethical Inquiry , 14 (4), 489–500. https://doi.org/10.1007/s11673-017-9771-3

Liu, H. (2016). Opportunities and challenges of big data for the social sciences: The case of genomic data. Social Science Research , 59 , 13–22. https://doi.org/10.1016/j.ssresearch.2016.04.016

Lomborg, S., Langstrup, H., & Andersen, T. O. (2020). Interpretation as luxury: Heart patients living with data doubt, hope, and anxiety. Big Data & Society , 7 (1), 1–13. https://doi.org/10.1177/2053951720924436

Lunshof, J. E., Chadwick, R., Vorhaus, D. B., & Church, G. M. (2008). From genetic privacy to open consent. Nature Reviews Genetics , 9 (5), 406–411. https://doi.org/10.1038/nrg2360

Lupton, D. (2016). Quantified self . Polity Press

Lupton, D., & Maslen, S. (2018). The more-than-human sensorium: Sensory engagements with digital self-tracking technologies. The Senses and Society , 13 (2), 190–202. https://doi.org/10.1080/17458927.2018.1480177

Madden, M., Gilman, M., Levy, K., & Marwick, A. (2017). Privacy, poverty, and big data: A matrix of vulnerabilities for poor Americans. Washington University Law Review , 95 , 74

Mannheimer, S., Pienta, A., Kirilova, D., Elman, C., & Wutich, A. (2019). Qualitative data sharing: Data repositories and academic libraries as key partners in addressing ethical challenges. American Behavioral Scientist , 63 (5). https://doi.org/10.1177/0002764218784991

Mauthner, N. S. (2019). Toward a posthumanist ethics of qualitative research in a big data era. American Behavioral Scientist , 63 (6), 669–698. https://doi.org/10.1177/0002764218792701

Mazmanian, B. A. (2014, May 13). The mosaic effect and big data. FCW. https://fcw.com/articles/2014/05/13/fose-mosaic.aspx

Mello, S. (2018). Data breaches in higher education institutions [University of New Hampshire]. Accessed 12 May 2021. https://scholars.unh.edu/cgi/viewcontent.cgi?article=1407&context=honors

Metcalf, J., & Crawford, K. (2016). Where are human subjects in big data research? The emerging ethics divide. Big Data & Society , 3 (1), 1–14. https://doi.org/10.1177/2053951716650211

Mills, K. A. (2018). What are the threats and potentials of big data for qualitative research? Qualitative Research , 18 (6), 591–603. https://doi.org/10.1177/1468794117743465

Mittelstadt, B. D., & Floridi, L. (2016). The ethics of big data: Current and foreseeable issues in biomedical contexts. Science and Engineering Ethics , 22 (2), 303–341. https://doi.org/10.1007/s11948-015-9652-2

Moore, P., & Piwek, L. (2017). Regulating wellbeing in the brave new quantified workplace. Employee Relations , 39 (3), 308–316. https://doi.org/10.1108/ER-06-2016-0126

Murphy, K. R., & Aguinis, H. (2019). HARKing: How badly can cherry-picking and question trolling produce bias in published results? Journal of Business and Psychology , 34 (1), 1–17. https://doi.org/10.1007/s10869-017-9524-7

National Research Council. (2003). Protecting participants and facilitating social and behavioral sciences research . National Academies Press

Neff, G., & Nafus, D. (2016). Self-tracking . The MIT Press

Nissenbaum, H., & Patterson, H. (2016). Biosensing in context: Health privacy in a connected world. In D. Nafus (Ed.), Quantified: Biosensing technologies in everyday life . The MIT Press

Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science , 366 (6464), 447–453. https://doi.org/10.1126/science.aax2342

OECD. (2016). Research ethics and new forms of data for social and economic research (OECD Science, Technology and Industry Policy Papers, No. 34). OECD Publishing. https://doi.org/10.1787/5jln7vnpxs32-en

Pangrazio, L., & Sefton-Green, J. (2020). The social utility of data literacy. Learning Media and Technology , 45 (2), 208–220. https://doi.org/10.1080/17439884.2020.1707223

Pasquetto, I. V. (2018). Beyond privacy: The emerging ethics of data reuse. UCLA: Center for knowledge infrastructures. Accessed 14 April 2021 https://escholarship.org/uc/item/92k1b265

Piwek, L., Ellis, D. A., Andrews, S., & Joinson, A. (2016). The rise of consumer health wearables: Promises and barriers. PLOS Medicine , 13 (2), e1001953. https://doi.org/10.1371/journal.pmed.1001953

Polonetsky, J., Tene, O., & Jerome, J. (2015). Beyond the common rule: Ethical structures for data research in non-academic settings. Colorado Technology Law Journal , 13 (2), 333–368

Popham, J., Lavoie, J., & Coomber, N. (2020). Constructing a public narrative of regulations for big data and analytics: Results from a community-driven discussion. Social Science Computer Review , 38 (1), 75–90. https://doi.org/10.1177/0894439318788619

Popper, K. (1961). The poverty of historicism . Harper & Row Publishers

Quinton, S., & Reynolds, N. (2017). The changing roles of researchers and participants in digital and social media research: Ethics challenges and forward directions. In K. Woodfield (Ed.), The ethics of online research, Vol. 2, (pp. 53–78). Emerald Publishing Limited. https://doi.org/10.1108/S2398-601820180000002003

Remenyi, D., Swan, N., & Assem, B. V. D. (2011). Ethics protocols and research ethics committees: Successfully obtaining approval for your academic research . Academic Conferences Limited

Resnik, D. B. (2005). The ethics of science: An introduction . Routledge

Resnik, D. (2015). What is ethics in research & Why is it important?, David B. Resnik, J.D., Ph.D. National Institute of Environmental Health Sciences. Accessed 17 February 2021 https://www.niehs.nih.gov/research/resources/bioethics/whatis/index.cfm

Richardson, F. C., & Fowers, B. J. (1998). Interpretative social science: An overview. American Behavioral Scientist , 41 (1), 465–495. https://doi.org/10.1177/0002764298041004003

Richterich, A. (2018). The big data agenda: Data ethics and critical data studies . University of Westminster Press

Rothstein, M. A. (2015). Ethical issues in big data health research: Currents in contemporary bioethics. The Journal of Law Medicine & Ethics , 43 (2), 425–429. https://doi.org/10.1111/jlme.12258

Ruckenstein, M. (2014). Visualized and interacted life: Personal analytics and engagements with data doubles. Societies , 4 (1), 68–84. https://doi.org/10.3390/soc4010068

Salganik, M. (2017). Bit by bit: Social research in the digital age . Princeton University Press

Selke, S. (2016). Rational discrimination and lifelogging: The expansion of the combat zone and the new taxonomy of the social. In S. Selke (Ed.), Lifelogging: Digital self-tracking and lifelogging – between disruptive technology and cultural transformation (pp. 345–372). Springer

Semuels, A. (2018, January 23). The internet is enabling a new kind of poorly paid hell. The Atlantic. Accessed 16 May 2021 https://www.theatlantic.com/business/archive/2018/01/amazon-mechanical-turk/551192/

Shahin, S., & Zheng, P. (2020). Big data and the illusion of choice: Comparing the evolution of India’s Aadhaar and China’s social credit system as technosocial discourses. Social Science Computer Review , 38 (1), 25–41. https://doi.org/10.1177/0894439318789343

Sharon, T. (2017). Self-tracking for health and the quantified self: Re-articulating autonomy, solidarity, and authenticity in an age of personalized healthcare. Philosophy & Technology , 30 (1), 93–121. 

Starkbaum, J. & Felt, U. Negotiating the reuse of health-data: Research, big data, and the European general data protection regulation, https://doi.org/10.1177/2053951719862594 (2019).

Sterett, S. M. (2019). Data access as regulation. American Behavioral Scientist , 63 (5), 622–642. https://doi.org/10.1177/0002764218797383

Stommel, W., & de Rijk, L. (2021). Ethical approval: None sought. How discourse analysts report ethical issues around publicly available online data. Research Ethics . https://doi.org/10.1177/1747016120988767

Taylor, C. (1971). Interpretation and the sciences of man. The Review of Metaphysics , 25 (1), 3–51. http://www.jstor.org/stable/20125928

Townsend, L., & Wallace, C. (2016). Social media research: A guide to ethics. University of Aberdeen , 1 , 1–16. https://www.gla.ac.uk/media/Media_487729_smxx.pdf

Véliz, C. (2020). Privacy is power: Why and how you should take back control of your data . Bantam Press

Vitak, J., Proferes, N., Shilton, K., & Ashktorab, Z. (2017). Ethics regulation in social computing research: Examining the role of institutional review boards. Journal of Empirical Research on Human Research Ethics , 12 (5), 372–382. https://doi.org/10.1177/1556264617725200

Wallis, J. C., & Borgman, C. L. (2011). Who is responsible for data? An exploratory study of data authorship, ownership, and responsibility. Proceedings of the American Society for Information Science and Technology , 48 , 1–10. https://doi.org/10.1002/meet.2011.14504801188

Weinhardt, M. (2020). Ethical issues in the use of big data for social research. Historical Social Research / Historische Sozialforschung , 45 (3), 342–368. https://www.jstor.org/stable/26918416

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A. … Mons, B. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data , 3 (1), https://doi.org/10.1038/sdata.2016.18

Wolff, A., Gooch, D., Montaner, J., Rashid, U., & Kortuem, G. (2016). Creating an understanding of data literacy for a data-driven society. The Journal of Community Informatics , 12 (3), 9–26. https://doi.org/10.15353/joci.v12i3.3275

Zimmer, M. (2018). Addressing conceptual gaps in big data research ethics: An application of contextual integrity. Social Media + Society , 4 (2), https://doi.org/10.1177/2056305118768300

Download references

Acknowledgements

We thank the journal editor and three anonymous reviewers for their constructive and valuable feedback. We also thank Dr. Maddalena Favaretto for her valuable suggestions that improved this manuscript.

At the time of initial submission, Mohammad Hosseini received funding from the EnTIRE Consortium (Mapping Normative Frameworks for Ethics and Integrity of Research), which is supported by the European Union’s Horizon 2020 research and innovation program under Grant Agreement No. 741782. During the review and resubmission period, Mohammad Hosseini was funded by the Northwestern University Clinical and Translational Sciences Institute (NUCATS, UL1TR001422). Michał Wieczorek received funding from the PROTECT project, which is supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 813497. The funders have not played a role in the design, analysis, decision to publish, or preparation of the manuscript.

Author information

Authors and affiliations.

Feinberg School of Medicine, Northwestern University, Chicago, USA

Mohammad Hosseini

Institute of Ethics, Dublin City University, Dublin, Ireland

Michał Wieczorek & Bert Gordijn


Contributions

CRediT roles: Mohammad Hosseini: Conceptualization, Investigation, Methodology, Writing-Original Draft, Writing-Review & Editing. Michał Wieczorek: Investigation, Methodology, Writing-Original Draft, Writing-Review & Editing. Bert Gordijn: Validation, Writing-Review & Editing, Supervision.

Corresponding author

Correspondence to Mohammad Hosseini .

Ethics declarations

Conflict of interest

The authors declare no conflicting interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Hosseini, M., Wieczorek, M. & Gordijn, B. Ethical Issues in Social Science Research Employing Big Data. Sci Eng Ethics 28 , 29 (2022). https://doi.org/10.1007/s11948-022-00380-7


Received : 25 August 2021

Revised : 11 March 2022

Accepted : 05 May 2022

Published : 15 June 2022

DOI : https://doi.org/10.1007/s11948-022-00380-7


  • Research Ethics
  • Research Integrity
  • Big Data
  • Social Science
  • Computational Social Science
  • Open Science

Re-use of research data in the social sciences. Use and users of digital data archive

1 Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland

Michael Ochsner

2 Swiss Centre of Expertise in the Social Sciences, University of Lausanne, Lausanne, Switzerland

Associated Data

All relevant data are within the manuscript and its Supporting Information files.

Abstract

The aim of this paper is to investigate the re-use of research data deposited in a digital data archive in the social sciences. The study examines the quantity, type, and purpose of data downloads by analyzing enriched user log data collected from a Swiss data archive. The findings show that quantitative datasets are increasingly downloaded from the digital archive and that downloads are heavily concentrated on a small share of the datasets. The most frequently downloaded datasets are survey datasets collected by research organizations that offer possibilities for longitudinal studies. Users typically download only one dataset, but a group of heavy downloaders accounts for a remarkable share of all downloads. The main user group downloading data from the archive are students, who use the data in their studies. Furthermore, datasets downloaded for research purposes often, but not always, serve as a basis for scholarly publications. Enriched log data from data archives offer an interesting macro-level perspective on the use and users of these services and help in understanding the increasing role of repositories in the social sciences. The study provides insights into the potential of collecting and using log data for studying and evaluating data archive use.

Introduction

In the context of the Open Science agenda and the Responsible Research and Innovation movement, nations and organizations have put a lot of effort into building research infrastructures that support scholars in open science practice. Research data archives (also referred to as data repositories) are part of this infrastructure, and their aim is to capture and share digital research datasets. Archiving digital research data aims to improve the quality of research and to generate economic savings, on the assumption that data, once archived, will be useful to and used by others [ 1 ]. Interest in facilitating data sharing and re-use is high, which is evident in the efforts of funding agencies, research organizations, publishers and archives in drafting policies that regulate data sharing and management [ 2 ]. Also, data openness and sharing are increasingly important factors in the evaluation of impact, concerning both research infrastructures and scholars [ 3 , 4 ].

In the social sciences there is a long tradition of re-using time-series datasets such as those by the World Bank or the OECD. However, in the era of open science, data sharing has widened to individual scholars uploading their own data, which most likely make up most of the content of digital data archives. Yet, despite the massive financial and intellectual investments, it is still unclear how extensively, by whom and for what purposes research datasets are downloaded from the archives [ 5 ]. The proposed benefits of open data will only be fully realised if the available data are used or re-used by others [ 5 ]. Also, the importance of creating quantitative metrics for evaluating the impact of research infrastructures is widely recognized [ 6 ].

Will it be possible to realise the optimistic promises of open, responsible science when the social sciences go digital? While open research data and data infrastructures have drawn a lot of attention, is there a demand for open data, do differences in re-use exist across types of data, how broad is the base of potential users, where is there potential to develop, and what service portfolios should be developed? Answering these questions is vital to understand evolving knowledge-creation practices, the impact of open data and the development of open science and its implementation in research practice. Additionally, this information is important for the archives to better understand the potential needs of their user base. Most of the earlier work has been based on self-reported data re-use and has focused especially on the experiences and needs of scholars [e.g. 7 – 12 ]. However, until data citation practices are fully formalized in the social sciences, log data and numbers of downloads are useful for measuring the frequency of data re-use [ 5 , 13 , 14 ]. Also, Khan, Thelwall and Kousha [ 12 ] call for more comprehensive disciplinary information about repository uptake to enhance sustainable data sharing.

By now, only very few studies relying on user log data gathered from social science archives exist. For example, Borgman and colleagues analysed user log data, in terms of numbers of downloads and users, to identify data re-use in the Dutch interdisciplinary data archive DANS [ 5 ]. Focusing on data re-use in the social sciences, Late and Kekäläinen [ 15 ] studied the use of the Finnish research data archive in more detail based on enriched log data. Applying their methodology to enriched log data, we study the use of the Swiss data repository FORSbase, which archives both qualitative and quantitative social science research data. Our study supplements the findings of Late and Kekäläinen [ 15 ] by providing comparative evidence from another context. We investigate whether there is a demand for open data in the social sciences and address the following research questions:

  • How many times and by how many users are datasets downloaded from FORSbase?
  • What type of datasets are downloaded from the archive most often?
  • What roles do the users of the archive represent?
  • For what purposes are datasets downloaded?

The article is structured as follows. First, we present related literature concerning open data, data archives and data re-use in the social sciences. We then describe the research setting and present the results, which are discussed and put into the context of policy and research practice to draw conclusions.

Research data and data archives in the social sciences

The European Commission [ 16 ] defines research infrastructures as “facilities that provide resources and services for research communities to conduct research and foster innovation” (p. 1). Research data archives are thus part of the infrastructure supporting and enabling open science by storing, managing, and disseminating research data with public (or private) funding, without a fee for users. Although studies have shown that many scholars rely on their personal data storage for sharing data [ 12 , 17 ], there is a long-standing tradition of using and providing open research data and of maintaining large data repositories in the social sciences [ 18 ]. International organisations like the World Bank, the International Monetary Fund, Freedom House, the OECD or EUROSTAT have provided valuable data for social scientists for decades, as have national public statistical offices [see e.g., 19 – 21 ]. Furthermore, for more specific data, national and international data infrastructures, such as the General Social Survey in the US (since 1972), the World Values Survey, the European Values Study, the International Social Survey Programme, or the Inter-university Consortium for Political and Social Research (ICPSR), have been offering rich datasets in open access to social scientists [see, e.g., 22 , 23 ]. Also, individual scholars or teams have generated and shared data, such as the Democracy Index [ 24 ], the Polity Project [ 25 ], or the World Inequality Database [ 26 ]. Social science data archives providing a hub of sources for secondary analyses were established in the 1960s in the US as well as in Europe [ 18 , 27 ]; for example, CESSDA, the Consortium of European Social Science Data Archives, has existed since 1976 [ 27 ]. Established data archives provide support and curation for long-term data preservation across the entire data life cycle as well as tools for data search [ 28 , 29 ].

However, while international comparative quantitative social science in particular has this long-standing tradition, in other sub-fields, such as psychology, it has been usual that data and measurement instruments were part of a business model and available only in closed access. Qualitative social science does not look back on a similar tradition of sharing data, even though the Qualitative Data Archival Resource Center was established at the University of Exeter in 1994 to foster the re-use of qualitative data [ 30 ]. However, this policy-based request has resulted in a heated debate over whether it is ethical to share qualitative data, because such data are potentially sensitive [ 30 , 31 ]. The shift to open science in the STEM fields has changed the attention of policy makers and put pressure on those sub-fields in the social sciences where open data sharing has not yet been part of the tradition and, at the same time, has opened new opportunities for, and increased the reputation of, the existing shared data infrastructures.

Research data has thus been seen as a theory-laden concept with a long history [ 5 ]. Data can take different forms in different disciplines, and a particular combination of interests, abilities and accessibility determines what is identified as data in each instance [ 32 ]. Borgman [ 33 ] defines data as “entities used as evidence of phenomena for the purposes of research or scholarship” (p. 25). Data are seen not only as by-products of research but as research outputs, valuable commodities, and public objects [ 1 ]. Data in the social sciences can remain relevant for analysis for a long time, as societal developments and historical perspectives can offer researchers new opportunities of, and approaches to, the analysis of historical data.

Re-use of research data in social sciences

Open access to research data is an essential aspect of open science because, among other things, it facilitates the verification of published results and enhances the effectiveness of research through the re-use of data. However, negative aspects of data re-use have also been identified, such as narrowing the scope and increasing the bias of research [ 34 , 35 ] and leading to injustice in the division of work: when data collectors document and share their data, others may simply take advantage of the work accomplished, as data stewardship is not yet acknowledged [ 36 ]. Furthermore, not all kinds of data can be opened, due to data protection and ethical principles [ 37 ]. This is a frequent issue in the social sciences, and earlier studies have reported relatively low levels of data sharing and re-use [ 38 – 41 ]. However, some data are frequently re-used in the social sciences; for example, the open data published by the European Social Survey led to at least 5000 scientific English-language publications between 2003 and 2020 [ 42 ].

The whole concept of data re-use needs to be understood far more deeply. Re-use of data can mean, for example, re-using data to reproduce research, re-using data independently, or integrating data with other data [ 12 , 33 ]. Re-using a dataset in its original form can be difficult, even if adequate documentation and tools are available, since much must be understood about why the data were collected and why various decisions about data collection, cleaning, and analysis were made [ 33 , 43 , 44 ]. Combining datasets is far more challenging, as extensive information must be known about each dataset if they are to be interpreted and trusted sufficiently to draw conclusions [ 45 ].

By now, several studies have analysed scholars’ needs, experiences and perceptions of data re-use relying on surveys and interviews [ 7 – 12 , 46 , 47 ]. In a recent survey [ 12 ], almost half of the respondents representing the social sciences reported re-using data, although there was some variation between research fields. Data re-use was more frequent among experienced scholars and among those sharing their own data. When selecting data for re-use, scholars consider proper documentation, openness, information on the usability of the data, availability of the data in a universal standard format and evidence that the dataset has an associated publication as important factors [ 12 ].

Social scientists re-using data value data that are comprehensive, easy to obtain, easy to manipulate, and credible [ 46 ]. Identified obstacles to data re-use are, for example, barriers to access, lacking interoperability and lack of support [ 47 ]. Faniel, Frank and Yakel [ 9 ] identified ICPSR data users’ information needs in 12 contexts relating to how the data were originally produced, the repository in which they have been archived, and previous re-use of the data. They argue that scholars representing different disciplines differ distinctly in the types of information they desire, which should be considered in service development. For example, information about missing data was important for the social scientists. Studies focusing on data re-use by novice scholars emphasize the importance of details about the data collection and coding procedures and of peer support for data use [ 8 ]. Re-using data may contribute to the knowledge-creation skills of junior scholars and help them socialize into their disciplinary communities [ 48 ].

Studies have also focused on how data are searched for [ 49 , 50 ] and have witnessed scholars struggling to find datasets to re-use [ 12 , 51 ]. Most typically, data are found from relevant papers, by conducting web searches, and by searching disciplinary and interdisciplinary data archives [ 12 ]. Recently, Lafia, Million and Hemphill [ 52 ] studied data search, basing their analysis on usage data from the ICPSR website. They identified three user paths for navigating the website: direct, orienting, and scenic. Direct and scenic paths targeted dataset retrieval, while orienting paths aimed at gathering contextual information. They argue that data archives should support both direct and serendipitous data discovery.

Only a few studies have investigated the use of data archives in the social sciences relying on log data. Borgman and colleagues studied the use of the Dutch Data Archiving and Networked Services (DANS) using transaction logs, documentation, and interviews, and showed that the communities of data infrastructures can be amorphous and evolve considerably over time [ 5 ]. They argue that trust plays an important role in the re-use of a dataset collected by someone else, and that the reputation of the hosting archive and of the organizations responsible for the curation process are important elements in trust creation.

Late and Kekäläinen [ 15 ] studied the use and users of the Finnish research data archive for the social sciences by analysing user log data from 2015 to 2018. According to their study, most datasets were downloaded at least once during the time frame, and a clear majority of the downloaded data were quantitative. They discovered that datasets from the archive were downloaded most often for the purposes of studying or for master’s or bachelor’s theses; one fifth of the downloads were made for research purposes. Similarly, Bishop and Kuula-Luumi’s study [ 53 ] of the re-use of qualitative datasets showed that data were downloaded for studying, master’s theses, teaching and research, indicating that data re-use is even less prevalent for qualitative studies. According to Late and Kekäläinen [ 15 ], the most typical downloaded dataset was survey data. The Finnish research data archive was most often used by social scientists from Finnish universities; however, there were also users from other European countries, from outside Europe, and from other organizations. Borgman and colleagues [ 5 ] argue that user behaviour tends to correlate with existing data practices in a field, and that archives tend to be tailored accordingly. However, the results by Late and Kekäläinen [ 15 ] showed that users of the archive for social science data represented all major disciplines. Thus, data practices in several fields must be considered when developing the services.

Research setting

The context: FORSbase

The research data archive investigated in this study is FORSbase. FORS, the Swiss Centre of Expertise in the Social Sciences, offers data and consulting services in the social sciences, conducts national and international surveys, and offers data and research information services to researchers and academic institutions [ 54 ]. FORSbase was the archive for research projects and research data in the social sciences in Switzerland managed by FORS. It was established in 1992 and was replaced in December 2021 by SWISSUbase ( https://www.swissubase.ch/ ), which is based at the same institution and issued in collaboration with several partners; it includes the functions of FORSbase but serves as the national data repository across disciplines in Switzerland.

Research data from FORSbase and SWISSUbase can be accessed from the online catalogue ( https://forsbase.unil.ch/ and https://www.swissubase.ch/ ). The catalogue is available in English, German and French. Datasets can be downloaded free of charge, but users are required to register before downloading. The database has a special structure: it is centred around research projects. Each project can have several datasets and each dataset can have several versions, while only the latest available version is downloadable.
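
To make the project–dataset–version hierarchy described above concrete, here is a minimal sketch in Python; the class and field names (Project, Dataset, DatasetVersion) and the example values are our own illustration and are not taken from the actual FORSbase or SWISSUbase schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DatasetVersion:
    version_number: int
    release_date: str  # hypothetical, e.g. "2019-06-01"

@dataclass
class Dataset:
    dataset_id: str
    title: str
    data_type: str                                   # "quantitative" or "qualitative"
    versions: List[DatasetVersion] = field(default_factory=list)

    def latest_version(self) -> Optional[DatasetVersion]:
        # Only the latest available version is offered for download.
        if not self.versions:
            return None
        return max(self.versions, key=lambda v: v.version_number)

@dataclass
class Project:
    project_id: str
    title: str
    datasets: List[Dataset] = field(default_factory=list)

# Example: one project with one dataset that has two versions (dates made up).
shp = Dataset("d1", "SHP Data Waves 1-19", "quantitative",
              [DatasetVersion(1, "2018-06-01"), DatasetVersion(2, "2019-06-01")])
project = Project("p1", "Swiss Household Panel", [shp])
print(project.datasets[0].latest_version())  # -> version 2 is the downloadable one
```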

The FORSbase and SWISSUbase data services follow the FAIR data principles [ 1 ] and have obtained the official CoreTrustSeal certification. CoreTrustSeal is a community-based non-profit organization promoting sustainable and trustworthy data infrastructures. FORSbase is a member of CESSDA. The change from FORSbase to SWISSUbase does not have any impact on our analysis and its conclusions, because the FORSbase service is integrated into SWISSUbase. The main difference is that the services have been upscaled to accommodate research data and projects from other disciplines (and transdisciplinary research).

Data collection and analyses

The study is based on quantitative user log data collected from FORSbase for the time window from 29 February 2016 to 9 February 2020. This time window represents the full user data available for FORSbase from its rebuild in 2016 until the start of our project. The log data contain information about the number of downloads and the downloaded datasets. The data are enriched with (a) project information collected from the database and (b) data coming from the registration survey that users have to fill in when downloading data. The project information entails information about the archived datasets, such as the dataset type. The registration survey data entail information about the users, including their role and the purpose of data use. This information is collected each time a person downloads data from FORSbase.

The data are structured as follows (see Fig 1 ): the main unit is a download; downloads are cross-nested across datasets and users (a user downloading a dataset thus creates a unique download). Each download also points to the version of the dataset that was downloaded. The raw number of observations (downloads) in the data was 6661. Removing downloads made for testing purposes by the FORSbase team resulted in 6656 observations.

Fig 1. Structure of the download data (image file: pone.0303190.g001.jpg).
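
As an illustration of how such an enriched download table could be assembled, the following sketch joins a toy download log with dataset metadata and registration-survey information and removes test downloads; all column names and values are hypothetical stand-ins, not the actual FORSbase export.

```python
import pandas as pd

# Toy stand-ins for the FORSbase exports; all column names are our own illustration.
downloads = pd.DataFrame({
    "download_id": [1, 2, 3, 4],
    "user_id": ["u1", "u1", "u2", "test"],
    "dataset_id": ["d1", "d1", "d2", "d1"],
    "version": [1, 2, 1, 2],
    "download_date": pd.to_datetime(["2017-03-02", "2018-04-10", "2019-10-05", "2019-10-06"]),
})
datasets = pd.DataFrame({"dataset_id": ["d1", "d2"],
                         "data_type": ["quantitative", "qualitative"]})
users = pd.DataFrame({"user_id": ["u1", "u2", "test"],
                      "role": ["student", "professor", "archive staff"],
                      "purpose": ["research", "research", "research"]})

# Enrich each download with dataset metadata and user information.
enriched = downloads.merge(datasets, on="dataset_id").merge(users, on="user_id")

# Remove downloads made by the archive team for testing purposes
# (in the study this reduced the 6661 raw observations to 6656).
enriched = enriched[enriched["role"] != "archive staff"]
print(enriched)
```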

The process continued with variable selection and coding. The nine variables analysed in this study are presented in Table 1 , along with the research question(s) each variable is used to address. Information for variables 1 to 4 is collected automatically, whereas information for variable 5 is constructed in two steps, the name being drawn automatically from the database and the type of dataset then assigned manually from the project information in the FORSbase online catalogue. Information for variables 6 to 9 is asked from the users in a survey format during registration and when downloading data.

Table 1. Variables, analyses and research questions.

| Variable | Analyses | Research question |
| --- | --- | --- |
| 1. Date of the download | Distribution of downloads per year; average number of downloads per day, month, and year | RQ1 |
| 2. ID number of the user | Number of active users (downloaders); average number of downloads per unique user | RQ1 |
| 3. ID number of the dataset | Number of unique downloaded datasets; version of the dataset | RQ1, RQ2 |
| 4. Type of dataset | Share of qualitative or quantitative datasets | RQ2 |
| 5. Name of dataset | Name of the 10 most downloaded datasets | RQ2 |
| 6. The role of the user | Distribution of users by role | RQ3 |
| 7. Use purpose | Distribution of downloads for teaching and research | RQ4 |
| 8. Research description (open-ended question) | Categorisation of use purposes other than teaching/research; type of studying purpose | RQ4 |
| 9. Is publication expected (yes/no) | Distribution of downloads aiming and not aiming for publication | RQ4 |

To identify how many times and by how many users datasets are downloaded from FORSbase (RQ1), we analyse the download date, user ID, and dataset ID ( Table 1 , variables 1–3). Concerning the number of downloads, we analyse both the full number and share of downloaded datasets and the unique user-dataset downloads ( Table 1 , variable 3) to control for downloads of dataset updates and to exclude duplicate downloads. By analysing the unique dataset downloads we can identify whether the same user downloaded the same dataset twice or two versions of it. Concerning the number of users, we analyse the average number of downloads for registered users and for active users ( Table 1 , variable 2). Registered users are those who have registered with FORSbase for archiving and downloading data; the number of registered users was obtained from the archive personnel at the time of data collection in 2020. Active users are those who downloaded data during the time window of the data collection. Each user is identified in the data with a unique user ID number automatically provided by the system during registration ( Table 1 , variable 2).

To identify what type of datasets are downloaded from the archive most often (RQ2), we use the ID of the dataset, the type of dataset (quantitative or qualitative data) and the name of the dataset ( Table 1 , variables 3, 4, 5). The name of the downloaded dataset ( Table 1 , variable 5) was also used to study the 10 most downloaded datasets in more detail. For these datasets, information (i.e., descriptive details) was traced from the FORSbase online catalogue.

To analyse what roles the users of the archive represent (RQ3), we use the role of the downloading user ( Table 1 , variable 6). Originally, users were provided with a list of 11 roles from which they selected the most suitable one. For the analyses, some categories were combined to form a shorter list of seven roles (i.e., student, doctoral student, lecturer/post-doc, professor, other researcher/project manager, teacher, and non-academic).

Finally, to identify for what purposes datasets are downloaded (RQ4), we use information on the use purpose of the data, the research description and whether a publication is expected ( Table 1 , variables 7, 8, 9). When downloading datasets from FORSbase, users were asked whether the dataset was downloaded for research or for teaching purposes ( Table 1 , variable 7). Although these categories did not serve students downloading datasets for their course work well, they were forced to choose between the two options. Therefore, for the purposes of this study, a new use-purpose type, “studying”, was constructed manually in two steps. First, all users who identified themselves as students were identified in the data ( Table 1 , variable 6). Second, the coding was assigned by thoroughly reading the research descriptions ( Table 1 , variable 8) written by the students to find out the purpose of the download. Based on these descriptions we also categorised, where possible, the sub-type of studying purpose (e.g., bachelor’s thesis, master’s thesis). However, the research description was asked only for those downloads where users indicated research ( Table 1 , variable 7) as the purpose of the download; consequently, this information is missing for downloads where users indicated teaching as the purpose. Obviously, this also applies to students who had selected teaching as the use purpose; these were categorised as studying, as we assume that students do not yet teach but chose teaching because there was no option for studying. Downloads for doctoral dissertations were categorized as “research”.
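
A minimal sketch of this two-step recoding, with hypothetical column names and toy rows; in the study the sub-types were assigned by reading the free-text descriptions manually, so the keyword matching below is only a rough stand-in.

```python
import pandas as pd

# Hypothetical downloads table with the three relevant columns.
df = pd.DataFrame({
    "role": ["student", "student", "professor", "doctoral student"],
    "purpose": ["research", "teaching", "teaching", "research"],
    "description": ["Data for my master's thesis", None, None, "Chapter of my doctoral dissertation"],
})

# Step 1: every download by a self-identified student is recoded as "studying",
# regardless of whether the student originally chose "research" or "teaching".
df["purpose_recoded"] = df["purpose"]
df.loc[df["role"] == "student", "purpose_recoded"] = "studying"

# Step 2: flag thesis sub-types from the free-text description where available
# (manual reading in the study; keyword matching here only as illustration).
df["subtype"] = None
desc = df["description"].fillna("").str.lower()
df.loc[(df["purpose_recoded"] == "studying") & desc.str.contains("bachelor"), "subtype"] = "bachelor's thesis"
df.loc[(df["purpose_recoded"] == "studying") & desc.str.contains("master"), "subtype"] = "master's thesis"

# Doctoral dissertations remain coded as research.
print(df[["role", "purpose_recoded", "subtype"]])
```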

Variable 9 ( Table 1 ) was used to study the purpose of research use of the dataset by asking whether the user expected a publication to result from the downloaded dataset. This information was asked only of those downloading data for research purposes; thus, it is missing for downloads where users indicated teaching as the purpose.

For the analysis step, the data were gathered into one dataset and analysed with Stata 16. Given that we analyse the full data, we do not apply inferential statistics. Whenever we are interested in differences between groups, we apply bootstrapped 95 per cent stability intervals to indicate the precision of the estimates. Differences were then also tested using bootstrapping procedures, either with regression models (numbers of downloads per user group) or with tests on the equality of proportions [ 55 ] for the intention to publish across user groups.
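
The analyses were run in Stata 16; purely to illustrate the bootstrap logic, the sketch below computes a bootstrapped 95 per cent stability interval for a share (here a made-up 0/1 intention-to-publish indicator) using 1000 resamples, mirroring the procedure described above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up 0/1 indicator (e.g., intention to publish) for n downloads.
y = rng.binomial(1, 0.43, size=500)

# Bootstrap: resample with replacement 1000 times and recompute the share each time.
boot_shares = np.array([
    rng.choice(y, size=y.size, replace=True).mean()
    for _ in range(1000)
])

point_estimate = y.mean()
lower, upper = np.percentile(boot_shares, [2.5, 97.5])  # 95% stability interval
print(f"share = {point_estimate:.3f}, 95% interval = [{lower:.3f}, {upper:.3f}]")
```

The same resampling scheme extends to group comparisons by bootstrapping the difference in shares (or a regression coefficient) and checking whether the resulting interval excludes zero.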

Number of dataset downloads

In February 2020, at the time of our data collection, FORSbase had 6628 registered users. The archive contained 725 datasets, the majority of which were quantitative. Within the time window, which covers 49 months, a total of 6656 downloads were made from FORSbase ( Table 2 ). This results in an average of 136 downloads per month, or 5 downloads per day. When excluding the incomplete months of 2016 and 2020 from our dataset, we cover a total of 6593 downloads over 47 months, leading to a mean of 140 downloads per month ( range = 40–286, median = 122). From 2017 to 2018, the number of downloads increased by 18 per cent, and from 2018 to 2019 by 16 per cent. The downloads per month show high volatility, as can be seen from Fig 2 , which shows the downloads per month for fully covered months, i.e., March 2016 to January 2020, and a smoothed moving average. The figure makes visible an increase of downloads over time with a tendency to stabilise. Note that March, April, October and November show the highest downloads while July and August show the lowest, reflecting semester beginnings for the highs and the semester break for the lows.

Fig 2. Downloads per month, March 2016 to January 2020, with a smoothed moving average (image file: pone.0303190.g002.jpg). The smoothed moving average is calculated using weights as suggested in [ 56 ].
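
We do not reproduce the exact weighting scheme of [ 56 ]; as a rough stand-in, the sketch below smooths a made-up monthly download series with a centred five-term weighted moving average using triangular weights.

```python
import numpy as np
import pandas as pd

# Made-up monthly download counts (the real series runs March 2016 to January 2020).
months = pd.period_range("2016-03", "2017-02", freq="M")
downloads = pd.Series([60, 90, 85, 50, 40, 45, 110, 120, 95, 70, 80, 100], index=months)

# Centred 5-term weighted moving average with triangular weights
# (an illustrative choice, not the weights suggested in [56]).
weights = np.array([1, 2, 3, 2, 1], dtype=float)
weights /= weights.sum()

smoothed = downloads.rolling(window=5, center=True).apply(lambda x: np.dot(x, weights))
print(pd.DataFrame({"downloads": downloads, "smoothed": smoothed}))
```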

Table 2. Number of downloads per year.

| Year | Frequency | Percent | Avg. per month** |
| --- | --- | --- | --- |
| 2016* | 839 | 12.6 | 84 |
| 2017 | 1577 | 23.7 | 131 |
| 2018 | 1860 | 27.9 | 155 |
| 2019 | 2161 | 32.5 | 180 |
| 2020* | 219 | 3.3 | n/a |
| Total | 6656 | 100.0 | |

Notes. * The time window does not cover the full year for 2016 (February 29th–December 31st) and 2020 (January 1st–February 9th).

** Only full months are taken into account: January–February 2016 and January–February 2020 are excluded from the calculations.

Of the 725 datasets archived in FORSbase, 470 were downloaded at least once, representing 65 per cent of all archived datasets. One fifth of the downloaded datasets were downloaded once and 13 per cent twice; consequently, 67 per cent were downloaded three times or more (see Table 3 ). Datasets, however, can be updated and new versions released; users are informed so that they can download the new version. This means that some datasets are downloaded more often than others. Additionally, users can download the same dataset twice (e.g., on two different workstations). To control for updates and to have a measure that better reflects the number of times a dataset is used (as opposed to downloaded), we identified duplicates, i.e., cases where the same user downloaded the same dataset twice or two versions of it. These were counted as one unique user-dataset download (see Table 3 , columns on the right). Both measures are somewhat imperfect. On the one hand, regarding the full count measure, a dataset that is published quickly and corrected afterwards will score more downloads than one that is not updated. On the other hand, regarding the corrected measure, the same user might download the same data multiple times for different persons, e.g., as teacher and student (a situation that is not compliant with the user agreement), or for different uses. Additionally, the database does not clearly define what a “version” is. It is usually an update of the same dataset, but it could also be used to update a dataset with new waves, while another study would create a new dataset for each new wave added. We did our best to control for the latter and to treat a study (and each wave) as a dataset if archived separately.
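
A minimal sketch of the deduplication step, again with hypothetical column names: repeated downloads of the same dataset (or of different versions of it) by the same user are collapsed into one unique user-dataset download.

```python
import pandas as pd

# Hypothetical download log: user u1 downloaded dataset d1 twice (two versions).
log = pd.DataFrame({
    "user_id":    ["u1", "u1", "u2", "u3", "u3"],
    "dataset_id": ["d1", "d1", "d1", "d2", "d3"],
    "version":    [1,    2,    2,    1,    1],
})

n_downloads = len(log)                                             # full count: 5
unique_user_dataset = log.drop_duplicates(subset=["user_id", "dataset_id"])
n_unique = len(unique_user_dataset)                                # unique user-dataset downloads: 4

# Downloads per dataset under both measures (cf. Table 3).
per_dataset_full = log.groupby("dataset_id").size()
per_dataset_unique = unique_user_dataset.groupby("dataset_id").size()
print(n_downloads, n_unique)
print(per_dataset_full, per_dataset_unique, sep="\n")
```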

Table 3. Number of downloads per dataset.

| Number of downloads | Frequency | Percentage of total archived datasets | Percentage of datasets downloaded at least once | Frequency of unique user-dataset downloads | Percentage of unique user-dataset downloads | Percentage of unique user-datasets downloaded at least once |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 255 | 35.2 | n/a | 255 | 35.2 | n/a |
| 1 | 101 | 13.9 | 21.5 | 106 | 14.6 | 22.6 |
| 2 | 55 | 7.6 | 11.7 | 68 | 9.4 | 14.5 |
| 3+ | 314 | 43.3 | 66.8 | 296 | 40.8 | 63.0 |
| Total | 725 | 100.0 | 100.0 | 725 | 100 | 100 |

* Column sums exceeding 100% are due to rounding.

Table 4 shows that the main download statistics of the two measures differ only slightly. The mean amounts to 9 downloads per dataset (8 if only unique user-dataset downloads are counted), but the distribution is highly skewed, with a first quartile of 0 downloads, a median of 2 downloads and a third quartile of 6 downloads, irrespective of how dataset downloads are counted.

Table 4. Distribution of downloads per dataset.

| | Mean | 1st quartile | Median | 3rd quartile | Min | Max |
| --- | --- | --- | --- | --- | --- | --- |
| All downloads | 9.2 | 0 | 2 | 6 | 0 | 638 |
| Unique user-dataset downloads | 8.1 | 0 | 2 | 6 | 0 | 527 |
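
A sketch of how the figures in Table 4 can be derived, using made-up counts; the key step is reindexing the per-dataset download counts over the full catalogue so that the 255 never-downloaded datasets enter the distribution as zeros.

```python
import pandas as pd

# Hypothetical per-dataset download counts; datasets missing from the log were never downloaded.
downloads_per_dataset = pd.Series({"d1": 12, "d2": 3, "d3": 1})
all_dataset_ids = ["d1", "d2", "d3", "d4", "d5"]   # full catalogue, including never-downloaded d4 and d5

counts = downloads_per_dataset.reindex(all_dataset_ids, fill_value=0)

# Mean, quartiles, minimum and maximum of downloads per dataset (cf. Table 4).
summary = counts.describe(percentiles=[0.25, 0.5, 0.75])
print(summary[["mean", "25%", "50%", "75%", "min", "max"]])
```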

Type of most downloaded datasets

FORSbase allows the archiving of both quantitative and qualitative data; qualitative data can only be archived since 2017. Of the 725 datasets, only 15 were archived as qualitative datasets, which corresponds to 2 per cent. Of the 470 datasets that were downloaded at least once, 5 were qualitative (1%). At the level of downloads, the vast majority (98%) of downloads concerned quantitative datasets. Qualitative datasets were downloaded only 15 times (13 times if we consider only unique user-dataset downloads): two of them were downloaded once, two twice and one nine times (7 times if only unique user-dataset downloads are counted).

Ten datasets were downloaded more than 100 times (see Table 5 ). The downloads of these 10 datasets represent almost 40 per cent of all downloads from FORSbase in the given time window. FORS was the collector of eight of the ten most downloaded datasets; the other two were collected by Swiss universities. The most downloaded datasets were all quantitative and either cumulative datasets or single-year issues of longitudinal (cross-sectional or panel) surveys collected at regular intervals. These surveys can be considered social science data infrastructures of national or even international importance and are designed for secondary data analysis.

Table 5. The ten most downloaded datasets.

| Title of the dataset | Number of downloads | Percentage of total downloads (N = 6656)** | Number of unique user-dataset downloads | Percentage of unique user-dataset downloads (N = 5842)** | Collector* |
| --- | --- | --- | --- | --- | --- |
| 1. SHP Data Waves 1–19 | 638 | 9.6 (8.9–10.3) | 527 | 9.0 (8.3–9.8) | FORS |
| 2. Selects 2015 Post-electoral study | 400 | 6.0 (5.5–6.6) | 308 | 5.3 (4.7–5.9) | FORS |
| 3. CCS Wave II—Cumulative Dataset 2013–2018 | 268 | 4.0 (3.6–4.5) | 206 | 3.5 (3.1–4.0) | FORS |
| 4. CCS Wave I—Cumulative Dataset 2005–2013 | 265 | 4.0 (3.5–4.5) | 212 | 3.6 (3.2–4.1) | FORS |
| 5. Selects, cumulated file 1971–2015 | 222 | 3.3 (2.9–3.8) | 185 | 3.2 (2.7–3.6) | FORS |
| 6. Selects 2015 Panel / Rolling cross-section study | 216 | 3.2 (2.8–3.7) | 169 | 2.9 (2.5–3.4) | FORS |
| 7. TREE, cohort 1 | 213 | 3.2 (2.8–3.7) | 172 | 2.9 (2.5–3.4) | University of Bern |
| 8. Selects 2015 Candidate survey | 185 | 2.8 (2.4–3.2) | 164 | 2.8 (2.4–3.3) | FORS |
| 9. VoxIt: standardized post-vote surveys | 124 | 1.9 (1.6–2.2) | 106 | 1.8 (1.5–2.2) | Universities of Geneva and Zurich, FORS |
| 10. Swiss Volunteering Survey 2016 | 120 | 1.8 (1.5–2.2) | 109 | 1.9 (1.5–2.3) | University of Bern |
| Total | 2651 | 39.8*** | 2158 | 36.9*** | |

* Traced from FORSbase online catalogue.

** Bootstrapped 95% stability intervals (in parentheses) based on 1000 resamples.

*** Column sums differing from cell sums are due to rounding.

The most downloaded dataset, SHP Data Waves 1–19, is the Swiss annual household panel study based on a random sample of private households in Switzerland, interviewing all household members mainly by telephone. The SHP is provided free of charge through FORSbase for the scientific community [ 57 ]. The other datasets relate to Swiss elections or popular votes (datasets 2, 3, 4, 5, 6, 9) or to education and civil society (datasets 7, 10).

The fact that the share of the ten most downloaded datasets decreases slightly if duplicates and versions of the same dataset are excluded ( Table 5 , “Percentage of total downloads” vs. “Percentage of unique user-dataset downloads”) shows that the most downloaded datasets are updated more often than the other datasets. However, the ranking of the most downloaded datasets does not change substantially, showing that duplicates and versions are spread quite evenly across these highly downloaded datasets. The bootstrapped 95% stability intervals (see Table 5 , column 3, in brackets) show that the ranking consists of four parts: a clear leader (dataset 1) and a clear second place (dataset 2), followed by a middle part (datasets 3 to 8), while datasets 9 and 10 form the fourth group.

Users of the archive

During the examined time window, 2281 unique users downloaded data from FORSbase. These users are referred to as “active users” in Table 6 . In February 2020, there were 6628 registered users in FORSbase; thus, only a third of the registered users downloaded a dataset during the time window (note that to upload data, one also needs to register as a user). Half of the active users downloaded only one dataset during the given time period ( Table 6 , right-hand column), one fifth downloaded two datasets and 28 per cent downloaded three or more. There was a group of heavy users downloading five or more datasets (5% of the registered users and 13% of the active users). At the other end of the scale, one user downloaded 149 datasets during the time window. The group of 306 users downloading at least five datasets accounted for more than half (51.7%) of all downloads during the time window. On average, considering all registered users, one user downloaded one dataset, while considering only active users, a user downloaded 2.9 datasets.

Table 6. Number of downloaded datasets per user (all downloads).

| Number of downloaded datasets | Number of registered users | Percentage of registered users | Percentage of active users |
| --- | --- | --- | --- |
| 0 | 4347 | 65.6 | n/a |
| 1 | 1187 | 17.9 | 52.0 |
| 2 | 457 | 6.9 | 20.0 |
| 3 | 210 | 3.2 | 9.2 |
| 4 | 121 | 1.8 | 5.3 |
| 5+ | 306 | 4.6 | 13.4 |
| Total | 6628 | 100.0 | 100.0 |

Looking at unique user-dataset downloads ( Table 7 ), 58 per cent of the active users downloaded only one unique dataset, whereas 21 per cent downloaded two and 22 per cent three or more. The group of heavy users (5+ downloaded datasets) amounts to 4 per cent of all registered users and 11 per cent of the active users. The person who downloaded the most datasets downloaded 140 unique datasets. If only unique user-dataset downloads are considered, the average is 0.9 downloads per registered user and 2.6 downloads per active user.

Table 7. Number of unique downloaded datasets per user.

| Number of downloaded datasets | Number of registered users downloading unique datasets | Percentage of registered users downloading unique datasets | Percentage of active users downloading unique datasets |
| --- | --- | --- | --- |
| 0 | 4347 | 65.6 | n/a |
| 1 | 1311 | 19.8 | 57.5 |
| 2 | 474 | 7.2 | 20.8 |
| 3 | 160 | 2.4 | 7.0 |
| 4 | 95 | 1.4 | 4.2 |
| 5+ | 241 | 3.6 | 10.6 |
| Total | 6628 | 100 | 100 |

A clear majority of users downloaded only quantitative datasets (99%); 8 users downloaded both quantitative and qualitative data, and 4 users only qualitative data.

Regarding the role of users, the majority of downloads were made by users registered as students, while doctoral students, lecturers/post-docs, professors and other researchers downloaded less, and teachers and non-academics the least ( Table 8 ).

Table 8. Downloads by user group.

| User group | Frequency | Percent |
| --- | --- | --- |
| Student | 3874 | 58.2 |
| Doctoral student | 954 | 14.3 |
| Lecturer / post-doc | 603 | 9.1 |
| Professor | 513 | 7.7 |
| Other researcher, project manager | 403 | 6.1 |
| Teacher | 196 | 2.9 |
| Non-academic | 113 | 1.7 |
| Total | 6656 | 100.0 |

Regarding download frequency across user groups, students were more likely to download many datasets compared to scholars, teachers, and non-academics (see Fig 3 ). Note that, using bootstrapped regression, only the differences between students and scholars, teachers and non-academics were significant. If only unique user-dataset downloads are taken into account, students downloaded significantly more unique datasets than all other groups except non-academics (as the latter show large variability). However, the user roles are not clear-cut entities, as the same person can indicate a different role for each download; this means that for unique user-dataset downloads only the first role is retained.

Fig 3. Average number of downloads per user group with bootstrapped 95% stability intervals using 1000 resamples, on the basis of (a) all downloads and (b) only unique user-dataset downloads (image file: pone.0303190.g003.jpg).

Purpose of the downloads

The majority of downloads were made for studying purposes (see Table 9 ). Of those downloading data for study purposes, at least 13 per cent (n = 497) downloaded the dataset for a bachelor’s thesis and at least 12 per cent (n = 452) for a master’s thesis (together accounting for 14.3% of all downloads). However, these numbers represent minima, because not all users described the purpose of their download in such detail, and users not describing the purpose in detail might have used the data for a thesis as well.

Table 9. Purpose of the downloads.

| Purpose of download | Frequency | Percent |
| --- | --- | --- |
| Studying | 3878 | 58.3 |
| Research | 2565 | 38.5 |
| Teaching | 213 | 3.2 |
| Total | 6656 | 100 |

Almost 40 per cent of the downloads served research purposes. Of the downloads used for research, at least 5 per cent were for a doctoral thesis (2% of the total downloads). However, the real share of downloads for doctoral theses is probably much higher, since more than 14 per cent of the users were registered as doctoral students.

Finally, only 3 per cent of the downloads served teaching purposes. This is surprising given that the biggest user group are students, and one would expect that it is the teachers who inform students about the dataset(s) used in their courses. However, users can indicate only one purpose per download but can of course use the data for many purposes afterwards. It might also mean that some teachers invite students to download the data themselves, while others download it and distribute the data to the students, which would mean that even more users would be students, as the data cover only those students who downloaded the data themselves.

Users downloading datasets were also asked whether they expected to write publications using the downloaded dataset. This was asked only if they indicated that they were using the data for research and not teaching. The question also has a high share of non-response (463, or 7% of those who indicated research as the use of the download). Of those who replied to the question, a large majority (77.4%) did not expect to publish and just over one fifth expected to do so. Those downloading the dataset for research purposes were most likely to expect to write a publication (43%). Expectedly, professors, lecturers/postdoctoral researchers, and doctoral students expected a publication more often than students ( Table 10 ). Indeed, professors, lecturers/post-docs and, more unexpectedly, non-academics have a similar percentage intending to publish, as the bootstrapped differences are not significant. All other groups differ significantly from these three groups and between each other. The relationship between role and intention to publish is quite strong, with a Cramér’s V of 0.43.

Table 10. Intention to publish by user role.

| User role | Percentage intending to publish | Bootstrapped 95% stability intervals |
| --- | --- | --- |
| Professor (n = 513) | 47.2 | 42.9–51.5 |
| Lecturer / post-doc (n = 603) | 48.1 | 44.0–52.3 |
| Doctoral student (n = 954) | 40.5 | 37.5–43.5 |
| Other researcher, project manager (n = 403) | 31.8 | 27.4–36.4 |
| Student (n = 3343) | 7.2 | 6.4–8.1 |
| Non-academic (n = 96) | 52.1 | 42.2–61.8 |
| Total (N = 5912) | 22.6 | Cramér’s V = 0.43 |

Note. Bootstrapped stability intervals were calculated using 1000 resamples.
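
For illustration, Cramér’s V can be computed from a role-by-intention contingency table as in the sketch below; the counts here are made up and do not reproduce the study’s data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up contingency table: rows = user roles, columns = (intends to publish, does not).
table = np.array([
    [240, 270],    # professors
    [290, 310],    # lecturers / post-docs
    [390, 560],    # doctoral students
    [240, 3100],   # students
])

chi2, p, dof, expected = chi2_contingency(table)
n = table.sum()
r, c = table.shape
cramers_v = np.sqrt(chi2 / (n * (min(r, c) - 1)))  # V = sqrt(chi2 / (n * (min(r,c)-1)))
print(f"Cramér's V = {cramers_v:.2f}")
```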

Discussion

This study investigated whether there is a demand for open data in the social sciences by examining the use and users of a research data archive. It continued a discussion started by Late and Kekäläinen [ 15 ], who studied the use of social science research data archives based on user log data. The results show that there is a demand for research data, as datasets have been downloaded frequently from FORSbase, i.e., on average 145 downloads per month. As in Finland [ 15 ], the number of downloads increased in Switzerland from 2016 to 2019. During the time window of the study, a large majority (65%) of the datasets archived in FORSbase were downloaded at least once. The share of downloaded datasets was similar to the Finnish results (70%) [ 15 ].

An overwhelming majority of the downloaded datasets are quantitative. The number of archived qualitative datasets in FORSbase is very low, which explains the low number of downloads. Earlier studies have discussed the obstacles to data sharing and re-use in the social sciences [ 38 – 40 , 58 ]. Our results suggest that there might be strong differences in the habit of downloading open data from repositories across different specialisations: in qualitative social science, data sharing seems to be far less prominent than in quantitative social science. There is little evidence about the re-use of qualitative datasets, and further studies are needed to understand the potential and pitfalls of open data policies for qualitative studies [ 53 , 58 ]. The lack of data sharing and re-use certainly has several reasons, but ethical issues play an important role [ 59 ].

In this study, of the 725 archived datasets, the ten most frequently downloaded ones were investigated in more detail. Each of these datasets was downloaded more than 100 times, the most popular more than 600 times. The downloads of these ten datasets amount to almost 40 per cent of all downloads from the archive, which indicates that, similar to publications [ 60 ], a small share of datasets gains most of the attention. The same phenomenon was observed by Late and Kekäläinen [ 15 ]. The most frequently downloaded datasets share a few properties: all of them are longitudinal or time-series survey data collected not by individual scholars or research groups but by organizations or consortia such as FORS. Also, these datasets are local survey projects, and the analysed archive, FORSbase, is the main source for obtaining the data. International longitudinal or time-series datasets were not among the ten most downloaded, even though local versions of these datasets are available in the archive; researchers interested in such cross-national datasets are more likely to download the datasets containing data from several countries from the international repository. Again, these results are in line with the study of Late and Kekäläinen [ 15 ]: in Finland, the most downloaded datasets were local and national surveys. However, in the Finnish archive, the most downloaded datasets also included large international statistics collected by a single scholar. Qualitative datasets were also more often downloaded from the Finnish archive than from the Swiss archive.

The fact that the most downloaded datasets were collected by prestigious and well-known organizations is in line with the argument raised in earlier studies [ 5 , 9 ] that scholars’ trust in data is essential for data re-use. However, what is considered trustworthy may differ between disciplines. For social scientists, reputation along with the data selection and cleaning process plays an important role in trust creation [ 61 ]. Systematic documentation and the provision of high-quality paradata (i.e., data about the data) are valued by data users [ 8 , 9 , 12 , 62 ]. Other factors influencing users’ trust in data archives are recommendations, frequency of use, past experiences, and perceptions of the role of the archive [ 10 ]. However, frequently downloaded datasets are probably better known and thus more visible to users. Data findability is another critical point for data re-use that should be better supported [ 12 , 52 ]. Furthermore, archives can increase their own visibility and prestige by archiving high-quality and well-known datasets and by establishing collection strategies and profiles for certain topics and data types to gain competitive advantage and reputation. However, the value of non-used (or non-downloaded) datasets cannot be overlooked, since they may become valuable in the future, as needs are difficult to predict (i.e., delayed recognition in science [ 63 ]).

Earlier studies have not investigated the number of users of data archives, although it can be considered an important metric for evaluating the impact of archives. Our results show that FORSbase was used by more than 2000 unique users, as one third of the registered users downloaded data from FORSbase. Most of them downloaded only one dataset. However, there was a smaller group of heavy users of the archive downloading several datasets and forming a remarkable share of all downloads. This might be an indication of field-specific differences: in some fields of the social sciences, data can be and are re-used more often. It might also indicate personal differences between users: users who have found datasets useful come back to download more relevant data or new versions of the datasets. Indeed, other studies have shown that scholars sharing their data are also more active re-users of data shared by others [ 12 ]. Our results show, however, that not all registered users download data, which might indicate that some users of FORSbase use it for archiving, not data retrieval. Late and Kekäläinen [ 15 ] showed that users represented several countries, disciplines, and organisations; our data did not allow for such analyses.

Earlier research has focused mainly on scholars’ data sharing and re-use practices and has shown experienced scholars to be the most active data re-users [ 12 ]. Yet, our findings confirm the results of Late and Kekäläinen [ 15 ] that students form the largest user group of the data archive. Students as a special user group should be given particular consideration by data archives and service providers, since there is great potential in this group as future data users and providers. Re-using data is important for developing knowledge-creation skills and for socializing into the discipline [ 48 ]. Novice users have specific needs for data re-use and are influenced by the experiences of their mentors [ 8 ]. Therefore, data archives need to pay special attention to what services could be offered especially for students and what guidance students need. More research, for example on the data management skills of students, is certainly needed. This is not only relevant for students who want to become future academics: data are becoming an important part of many professions in a digitalised society, and skills in data use, management, archiving, and documentation will be relevant competences students need to learn. Also, scholars themselves wish for training in data management skills [ 64 ]. The role of data archives, along with data managers and libraries, has been identified as central in fostering such skills [ 17 ].

Only three per cent of the downloads served teaching purposes. However, studies by Late and Kekäläinen [ 15 ] and Bishop and Kuula-Luumi [ 53 ] show a higher share of downloads for teaching purposes from the Finnish and UK archives. There might be several reasons for the difference. Users of FORSbase can indicate only one use purpose per download, while they may use the data for several purposes: researchers can download a dataset for a research project and then use this project and the dataset in teaching without re-downloading the data and registering teaching as a purpose. Also, they may ask students to download the data themselves, for example in a research methods seminar. The high share of students among the users suggests that teaching is a frequent use of the datasets downloaded from FORSbase. However, an important question for future research is what data re-use means in teaching: is it mainly to teach research methods, or also to replicate studies and foster the idea of responsible research already in teaching? Familiarizing students with open research infrastructures might be an effective way to promote open science ideals.

More than one third of the downloads were made for research purposes. The share of research use was lower in the study by Late and Kekäläinen [ 15 ], covering only one fifth of total use. In the Swiss archive, about half of the downloads for research were expected to result in a publication; professors, lecturers, and post-doctoral scholars were the most likely to plan to use a dataset for a publication. However, there is little evidence on how often re-used data are actually utilized in publications and for what purposes [ 65 ]. Unfortunately, our data provide no further information on research purposes other than publication. Regarding Responsible Research and Innovation, it would be interesting to follow how often data are re-used for validation or replication rather than for new publications.

Given the policy demand for open science and open data, the recognition of data sharing becomes relevant. Data stewardship is not yet rewarded in academic career development, which may weaken the motivation to share and document data sufficiently [ 36 , 39 ]. However, European guidelines for responsible research assessment already include data and data sharing as research outputs and activities to be recognized in evaluation [ 66 ]. Further efforts should therefore be made to study how (and how often) re-used datasets are cited in publications and how archives guide users to cite data. Data citation practices in the social sciences are still evolving, as citations have been shown to be often incomplete or erroneous [ 15 , 67 – 69 ], and not all re-used research data are cited, at least not formally [ 15 ]. Developing more formal data citation practices would enable a quantitative evaluation of the impact of data re-use; the challenge is to get scholars to cite data systematically [ 70 ]. This would also serve the need for quantitative metrics to evaluate the impact of research infrastructures [ 6 ]. User log data can provide information on the number of downloads, but evaluating the impact on research requires further studies that exploit, for example, bibliometric methods.
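
A simple bibliometric check of this kind could, for example, scan the reference lists of publications for dataset identifiers. The sketch below is purely illustrative: the DOIs and reference strings are hypothetical, and real data citations are often informal and would escape such matching, as noted above.

```python
# Illustrative sketch only: counting formal data citations by matching
# hypothetical dataset DOIs against hypothetical reference strings.
import re

dataset_dois = {
    "10.12345/forsbase-ds-1000",   # hypothetical dataset identifiers
    "10.12345/forsbase-ds-2000",
}

references = [
    "Swiss sample survey, wave 3 (2019). Dataset. https://doi.org/10.12345/forsbase-ds-1000",
    "Author, A. (2021). Re-using survey data. Journal of Examples, 5(2), 1-20.",
]

doi_pattern = re.compile(r'10\.\d{4,9}/[^\s"<>]+')

cited = {
    doi.rstrip(".")
    for ref in references
    for doi in doi_pattern.findall(ref)
    if doi.rstrip(".") in dataset_dois
}
print(f"{len(cited)} of {len(dataset_dois)} datasets formally cited")
```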

Practical implications and limitations of the study

The results have several practical implications for using user log data to evaluate digital data archive use and as a source of research data. First, it is important for archives to define clearly what a data "version" is and to separate updates from new waves that constitute a new dataset, because new versions and updates influence user behaviour and the number of downloads and should therefore be taken into account when user log data are used in archive evaluation or in research. The most frequently downloaded datasets come in various versions and are updated more often than datasets provided by individual scholars. In our study we analysed both the full number of downloads and unique downloads in order to identify the share of duplicates; the differences were minor but present. Second, our results have implications for collecting user log data: data collection should cover all kinds of users and use types. In FORSbase, for example, "studying" was not offered as a data re-use purpose. This underlines the importance of user studies for service providers to truly know who their clients are. Given the relevance of replication and open research data in science policy and the lack of knowledge on open research data practices, it is also advisable for archives to collect meaningful log data so that ethical considerations can be supplemented with empirical evidence on data re-use.
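
Separating the full download count from unique downloads is essentially a matter of deduplicating repeated downloads of the same dataset by the same user. A minimal sketch, with assumed column names rather than the study's actual procedure:

```python
# Minimal sketch (assumed column names; not the study's exact procedure):
# comparing the full number of downloads with unique user-dataset pairs.
import pandas as pd

log = pd.DataFrame({
    "user_id":    [1, 1, 2, 3, 3, 3],
    "dataset_id": ["A", "A", "A", "B", "B", "C"],   # users 1 and 3 re-download a dataset
})

full_downloads = len(log)
unique_downloads = log.drop_duplicates(subset=["user_id", "dataset_id"]).shape[0]
duplicate_share = 1 - unique_downloads / full_downloads

print(full_downloads, unique_downloads, round(duplicate_share, 2))  # 6 4 0.33
```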

This study has limitations. Drawing conclusions about data re-use from user log data is somewhat unreliable, since not all downloaded datasets are necessarily used, and some may be used many times or for other purposes than expected. Generalizing findings across organizations may be challenging because download metrics may depend on the specific characteristics of the data archive or related organisations [ 4 ]; for example, datasets used as course material can generate hundreds of downloads [ 15 ]. Additionally, log data cannot provide qualitative insights into data re-use (e.g., why a dataset was selected and how it was used). Still, user log data can give useful insights into the re-use of research data and the users of data archives at the macro level, beyond self-reported data re-use and from the point of view of the archive [ 5 ]. Our findings show that data are downloaded from the archive for various purposes and by various user groups; studying data re-use based, for example, on citations therefore captures only part of it. The results of this study provide grounds for future studies in this respect. In addition, we analysed log data from only one archive. However, as our results are in line with a similar study conducted in Finland [ 15 ], we believe they can be generalised to similar national social science data archives. Future research will show how the frequency of data downloads develops as open data practices become established in the social sciences.

Conclusions

This study contributes to our understanding of the utilization of digital data archives in the social sciences. The findings indicate demand for social science data, as evidenced by the increasing number of downloads from a Swiss data archive. Notably, while the majority of the archived datasets were downloaded at least once, a limited set of longitudinal and time-series survey datasets compiled by organizations rather than individual scholars accounted for a substantial share of the downloads. Since the case archive primarily houses quantitative data, the re-use of qualitative data was marginal. Students constituted a significant proportion of the users, accessing the archive to acquire data for educational purposes. Nonetheless, the user base encompassed diverse roles, including experienced and novice scholars as well as non-academics. As the findings are in line with previous research [ 15 ], similar patterns are likely to be found across data archives specialised in the social sciences. The increasing availability of digital datasets for re-use may create new data practices within the social sciences.

Enriched log data capturing the use of a digital data archive provide a macro-level understanding of data re-use from a single archive. To obtain more comprehensive insights into data re-use and evolving data practices within the social sciences, future research applying both quantitative and qualitative approaches is needed. A future research agenda on data re-use would include comparative studies of different archives (which would require prior agreement between archives on the collection of metadata), studies into the (epistemological and empirical) meanings and definitions of re-use of research data in the social sciences, and studies into the trade-offs between collecting new data and re-using existing data. Data citation practices in the social sciences are a particularly important issue. For further development of research infrastructures, user studies are needed to address how users interact with the infrastructures, what obstacles they face, and what support they desire.

Supporting information

Acknowledgments

We thank Dr. Jaana Kekäläinen for her valuable comments on the manuscript.

Funding Statement

This research was partially funded by Academy of Finland ( https://www.aka.fi/en/ ) grant 351247 (EL) and benefitted from a Short Term Scientific Mission of the COST Action CA 15137 ‘European Network for Research Evaluation in the SSH (ENRESSH)’, supported by European Cooperation in Science and Technology ( https://www.cost.eu/ ) (EL, MO). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability

Center for Scalable Data Analytics and Artificial Intelligence


ScaDS.AI Dresden/Leipzig

ScaDS.AI (Center for Scalable Data Analytics and Artificial Intelligence) Dresden/Leipzig is a center for Data Science, Artificial Intelligence and Big Data with locations in Dresden and Leipzig. It is one of the five new AI centers in Germany funded under the federal government's AI strategy by the Federal Ministry of Education and Research (BMBF) and the Free State of Saxony. It is established as a permanent research facility at both locations with strong connections to the local universities: TUD Dresden University of Technology and Leipzig University.


Research on AI and Big Data

ScaDS.AI Dresden/Leipzig expands the former Big Data competence center ScaDS Dresden/Leipzig, founded in 2014, and combines the AI and data science expertise of the partner institutions to close the gap between the efficient use of mass data, knowledge management and advanced AI. For this reason, the center conducts interdisciplinary research with an international team of over 60 Principal Investigators, more than 180 employees and 4 junior research groups in Dresden and Leipzig.



UC Berkeley launches AI training program to address criminal justice system inequalities

UC Berkeley researchers launched a pioneering interdisciplinary training program this week that will blend criminal justice and computer science in ways that experts say will help reduce long-standing, systemic inequities in the criminal legal system.

The program, called Computational Research for Equity in the Legal System (CRELS), is being made possible with a $3-million National Science Foundation grant. Launched by a multidisciplinary research team that includes Berkeley’s Division of Social Sciences , Social Science Matrix , D-Lab , College of Computing, Data Science, and Society , Berkeley Institute for Data Science , Institute for the Study of Societal Issues , Human Technology Futures group, Possibility Lab , Eviction Research Network and EPIC Data Lab , the CRELS program will bring together researchers in the social sciences, computer science and statistics. It will equip a new generation of diverse Ph.D. students with the skills needed to tackle problems at the intersection of inequality, criminal legal systems, data science, artificial intelligence and big data.

"This program leverages Berkeley's expertise in social sciences, data science and AI to bring a transformative approach to the study of criminal justice systems,” said David Harding, the grant’s principal investigator and chair of Berkeley’s Department of Sociology. “We will train doctoral students to harness the power of large-scale data to develop novel interventions to reduce inequities in criminal justice systems and foster a more just society.” 


This innovative program aligns with the NSF’s Big Ideas, including Harnessing the Data Revolution, Growing Convergence Research and Transforming Education and Career Pathways. It seeks to create a link between these ambitious goals and Berkeley’s faculty expertise in the social sciences, criminal legal systems, data science, and the ethics and social implications of AI. CRELS is supported by a $3 million, five-year grant from NSF’s NRT Research Traineeship Program.

"We're excited to be at the forefront of this crucial convergence of disciplines,” Berkeley Social Sciences Dean Raka Ray said. “The CRELS program reflects our commitment to using data science and technology to address some of society's most pressing social issues, such as the need for criminal justice system reform.” 

BIDS Executive Director Ashish Sahni added that this is a great opportunity to do interdisciplinary work while creating knowledge and providing opportunity.

CRELS seeks to examine the use and misuse of AI within justice institutions. This research could contribute to a broader understanding of the social implications of AI, a topic of increasing importance as AI technologies become more pervasive in our society. The program aims to generate new scientific knowledge and develop novel tools for large-scale data integration and analysis. 

The program will not only prepare a workforce capable of addressing complex societal issues with cutting-edge tools, but also contribute to the diversification of the scientific workforce by recruiting and training a broader representation of graduate students in these critical fields and implementing diversity, equity, inclusion and belonging values. 

"By actively building these values into program design, CRELS seeks to enrich the data science workforce, creating inclusive intellectual spaces and expanding opportunities for traditionally underrepresented students," Harding said. “Its innovative approach will shape the future of graduate training programs and foster enduring interdisciplinary collaborations among faculty."

This story was first published by Berkeley Social Sciences .


Spotlight on researchers for South Asian Heritage Month

School of Law news Wednesday 17 July 2024


South Asian Heritage Month in the UK was co-founded in 2019 by Jasvir Singh CBE and Dr Binita Kane.

Jasvir Singh CBE and Dr Binita Kane's mission was to use the heritage month to 'deepen people's understanding of the rich and diverse contributions of South Asian communities to British society'. First celebrated in 2020, it commemorates, marks and celebrates South Asian cultures, histories, and communities. The dates of the month (18 July – 17 August) are designed to span several Independence Days across the region (Maldives, Bhutan, Pakistan, India). It also roughly coincides with Saravan/Sawan, the primary monsoon month during which the region's habitat undergoes renewal.

Our academic strategy for 2020 to 2030 ‘Universal Values, Global Change’, sets a blueprint for a values-driven university that harnesses expertise in research and education to help shape a better future for humanity. The School of Law is proud to have staff that strengthen the vital link between South Asia and the UK by conducting impactful research that addresses global inequalities, highlights the significant contributions of South Asian countries on the international stage, and makes a tangible difference in the world.

Read on to discover their research profiles.

Big data and the Global South

Professor Subhajit Basu , Professor of Law and Technology, investigates the numerous challenges posed by digital technologies across various societal sectors, including transport, education, healthcare, and social justice. Renowned internationally for pioneering interdisciplinary research, Professor Basu's work particularly emphasizes the Global South. Recognizing the potential for ‘big data’ to control lives, Professor Basu is deeply interested in enhancing consumer information and empowerment. Additionally, he seeks to update legal frameworks to better protect privacy and provide the public with the necessary knowledge to make informed decisions. From 2018 to 2021, he served as the Chair of the  British and Irish Law Education Technology Association . In 2020, he was honoured with the  Hind Rattan  by the  Non-Resident Indians Welfare Society of India  for outstanding contributions to education and achievements in Information Technology Law.

He is an Adjunct Professor at  Parul University , as well as a Visiting Scholar at  West Bengal National University of Juridical Sciences . He also became the International Advisor of  The Dialogue  think tank last year. He sits on the Editorial Advisory Board of two Indian Law School journals:  NUJS Journal of Regulatory Studies  and  NALSAR Law Review .

This year alone, he has  delivered 7 Keynotes or been Invited Speaker at conferences organised in India , including a keynote at the prestigious  Rajiv Gandhi School of Intellectual Property Law, IIT Kharagpur . This year he also published two articles centring India:

Evaluating ICT Adoption in the Indian Judiciary: Challenges, Opportunities, and the Impact of the eCourts Project  and  Silenced Voices: Unravelling India's Dissent Crisis Through Historical and Contemporary Analysis of Free Speech and Suppression .

Professor Basu is a member of the  Centre for Business Law and Practice .

You can find him on X: basu_subhajit and LinkedIn: Subhajit Basu.

Challenging exclusionary city streets

Dr Sanjay Jain, an outstanding blind scholar who was formerly the Principal of the  Indian Law Society College of Law University  in Pune, is now based at the  National Law School of India University, Bengaluru.

Dr Jain’s publications, which have been quoted by the Indian Supreme Court, include leading works on Indian Constitutional Law as well as on issues of disability and human rights in India. His advice is regularly sought by members of the judiciary, the administration and civil society.

Dr Jain, and his institutions in Pune and Bengaluru, have been collaborating with Professor Anna Lawson and colleagues at the School of Law, University of Leeds, since 2018. In that year, Dr Jain began acting as the Indian partner for the Leeds-based  Inclusive Public Space  research project. This project, funded by the  European Research Council , explores ways in which law can more effectively be used to challenge the disadvantage (to disabled and older pedestrians in particular) caused by inaccessible, exclusionary aspects of city streets. The project shines a light on such barriers and legal initiatives in ten cities across five countries: India being one of them. Dr Jain is the lead author of an extensive report on relevant Indian law and policy and has played a vital role in facilitating and supporting the fieldwork in India.

Besides Dr Jain’s collaboration with Leeds through the Inclusive Public Space project, the Leeds  Centre for Disability Studies  has supported and co-hosted two international conferences led by Dr Jain during his time at Pune. He will support three events in India in July and August. Dr Jain will be visiting Leeds in September 2024 – when he will take part in the Inclusive Public Space final conference on 16-17 September as well as a number of other events.

Migrant stories and experiences

Dr Amrita Limbu ’s research delves into the lived experiences of individuals from migrant communities and from low- and moderate-income backgrounds. She is currently a Postdoctoral Research Fellow at the School of Law working on the  Making it to the Registers: Documenting Migrant Carers’ Experiences of Registration and Fitness to Practice  project with  Professor Marie-Andrée Jacob  (Primary Investigator) and  Dr Priyasha Saksena  (Co-Investigator). In this role, she is involved in archival and qualitative research exploring the migrant and refugee health professionals’ experience with professional registration in the UK.

She completed her PhD at the Institute for Culture and Society, Western Sydney University, Australia, on migration and affective family relations across two migration pathways from Nepal: education migration and labour migration.

She is interested in migrant stories and experiences, and experiences of transnational family life owing to migration and living away from her family. Prior to her PhD, she was a researcher at  Social Science Baha ’s  Centre for the Study of Labour and Mobility (CESLAM)  in Kathmandu, Nepal. At CESLAM, she conducted research and fieldwork for several projects focused on labour migration from Nepal to the Persian Gulf and Malaysia.

In 2024 she completed a research project on ‘Migration and the Persistence of Inequality’ as part of the  University of Leeds Michael Beverley Innovation Fellowship , to understand the inequality and the continual cycle of intergenerational migration from Nepal to the Persian Gulf countries.

She is part of a University of Reading-led consortium on  transnational families  – and presented at their symposium  Migration, Care and Intersecting Inequalities  in June 2024, with ‘Care, inequality, and intergenerational migration: Cultural insights on care and migration in Nepal’. Dr Limbu was the lead author of their policy briefing paper:  Impact of COVID-19 on migrant families in the UK , published in March 2024. In May this year she gave a paper at the  Britain-Nepal Academic Council Nepal Study Days.

She is a member of the  Centre for Law and Social Justice .

Policing, community resilience and climate change

Dr Ali Malik is a lecturer in Criminal Justice. Dr Malik leads the project  ‘Policing and community resilience in the context of climate change’ ,  funded by  Economic and Social Research Council’s (ESRC)  Vulnerability & Policing Futures Research Centre . His current research focuses on the role of police and local governance actors in preparing for and responding to climate disasters and extreme weather events. He is interested in exploring how police and local governance actors perceive, categorise, and track climate vulnerability, and how they leverage community-based actors to inform local emergency planning and disaster response activity.  He is also leading a project funded by the University of Leeds’  Research Culture  Research Equity, Diversity and Inclusion (REDI) Fund to raise awareness about the impacts of climate change on marginalised communities and public services in the UK through the use of visual (photography) and aural (stories) narratives.

As the holder of the  Michael Beverley Innovation Fellowship  (Cohort 4, 2023-24), Dr Malik has been involved in fostering collaborative ties with local police forces and national bodies such as  HMICS , the  College of Policing , and the  Police Foundation  to garner support for co-produced research examining the impact of climate change on local communities and local police and first responders. Additionally, to develop links with international scholars and researchers in this field, in December 2023, Dr Malik participated in a  symposium on Policing the Climate Crisis  as part of the  Australian and New Zealand Criminology Conference , held in Melbourne.

His book,  The Politics of Police Governance: Scottish Police Reform, Localism, and Epistocracy  (Policy Press) was launched in May 2024. In the book he developed an innovative framework that synthesised the concept of epistocracy with the broader scholarship on democratic policing, public administration, and police governance and accountability.

He is a Fellow of the Higher Education Academy and co-Deputy Director of the  Centre for Criminal Justice Studies . He also appeared in the University of Leeds’ Celebrate Our Staff for the month of May 2024.

Find him on Twitter/X: @DrAliMalik_

Equity in international relations

Professor Surya Subedi  OBE, KC, DCL, is Professor of International Law.

He has published 12 books and more than 60 scholarly articles in all major areas of international law in leading international law journals throughout his academic career. His publications emphasize the promotion of equity in international relations and the advancement of human rights.

At his OBE investiture he was described as having:

...made a highly distinguished contribution to our understanding of international law, and to its evolution" while his work in international law had "spanned almost every aspect of it – with a special focus on issues ... which make a real difference to people's lives." – British Foreign Secretary

He has been an advisor to: the British Foreign Secretary (2010-2015); World Conservation Congress of the International Union for Conservation of Nature (2021); and a member of the Task Force on Investment Policy of the World Economic Forum (2015). He served for six years as the UN’s Special Rapporteur for human rights in Cambodia. In Nepal, he assisted the Prime Minister and other political leaders in resolving a 10-year Maoist conflict and in writing a new democratic constitution. 

This year, Prime Minister Pushpa Kamal Dahal 'Prachanda' commended his achievements as a significant member of the Nepali diaspora. He said:

"I sought Professor Surya Subedi's assistance while drafting the Constitution of Nepal and reviewing past treaty agreements with India. He played a crucial role in those endeavours. Even in the recent Millennium Challenge Corporation (MCC) agreement, his input provided a middle-path which we embraced." – Prime Minister Pushpa Kamal Dahal 'Prachanda'

Between 2015 and 2022, he was Chairman of the Board of Editors of the Asian Journal of International Law, which is published by Cambridge University Press. He is also the editor of a Routledge series of books on ‘Human Rights and International Law’.

He was recently elected a Council Member of the Royal Asiatic Society of Great Britain and Ireland.

Professor Subedi is a member of  Centre for Business Law and Practice .

The rights of abducted children

Dr Nazia Yaqub  is a lecturer in law, and her research interests span international human rights law, with her publishing record covering family law, child rights, law and religion, Islamic family law and cross-border parental child abduction. Dr Yaqub is a Solicitor of the Supreme Court of England and Wales and previously represented clients in Criminal, Mental Health, Family and Children’s law.

Dr Yaqub’s 2022 book  Child Abduction to Islamic Law Countries  examines statistical and empirical data she collated to explore how domestic and international law policies should be developed to uphold the rights of abducted children. Dr Yaqub is invited by the Permanent Bureau of the Hague Conference on Private International Law (HCCH) to share this research  later this year  with government officials and judges at its  Fifth Malta Conference . 

Dr Yaqub continues to work on policy developments in this area to prevent abductions and to assess the implications of Islamic countries' accession to the private international law treaty, the 1980 Hague Abduction Convention. In this endeavour, she examines whether GPS monitoring can be viewed as a bodyguard rather than a prison guard in reducing the risk of cross-border parental child abduction; this work is to be published in the leading journal, the Modern Law Review. She also received funding as a Michael Beverly Innovation Fellow to disseminate this novel research in video format.

In other projects, Dr Yaqub is working with adoption agencies in community engagement work, to improve adoption law processes for Muslim communities in the UK. And on the legal subject of ‘fam-migration’, she is working with colleagues at the Universities of  Liverpool  and  Birmingham , together with NGOs:  Social Workers without Borders  and  Bid  (Bail for immigration detainees) investigating the complex interplay between family and immigration law court processes and decision-making.

Dr Yaqub is a fellow of the Higher Education Academy. She is a member of the  Centre for Law and Social Justice  and the  Centre for Criminal Justice Studies . You can find her on X/Twitter: @DrNazia_Yaqub

The School of Law takes immense pride in counting such brilliant researchers among our staff, reflecting our commitment to using research to tackle some of the most important issues facing the global community today.


Policy Basics: Where Do Our Federal Tax Dollars Go?

The federal government collects taxes to finance various public services. As policymakers and the public weigh key decisions about revenues and expenditures, it is important to examine what the government does with the money it collects.

In fiscal year 2023, the federal government spent $6.1 trillion, amounting to 22.7 percent of the nation’s gross domestic product (GDP). About nine-tenths of the total went toward federal programs; the remainder went toward interest payments on the federal debt. Of that $6.1 trillion, over $4.4 trillion was financed by federal revenues. The remaining amount was financed by borrowing.
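
The amount financed by borrowing follows directly from these two figures. A rough arithmetic check, using the rounded numbers quoted above (so the result is approximate):

```python
# Rough arithmetic check using the rounded figures quoted above (in trillions of dollars).
spending = 6.1   # total federal outlays, fiscal year 2023
revenues = 4.4   # portion financed by federal revenues

borrowed = spending - revenues
print(f"Financed by borrowing: about ${borrowed:.1f} trillion "
      f"({borrowed / spending:.0%} of total spending)")
```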

As the chart below shows, three major areas of program spending make up the majority of the budget:

  • Health insurance: Four health insurance programs — Medicare, Medicaid, the Children’s Health Insurance Program (CHIP), and Affordable Care Act (ACA) marketplace health insurance subsidies — together accounted for 24 percent of the budget in 2023, or $1.6 trillion. Roughly half of this amount, or $848 billion, went to Medicare, which in March 2023 provided health coverage to around 65.7 million people who are age 65 or older or have disabilities. The rest of this amount funded the federal costs of Medicaid and CHIP ($633 billion) and ACA subsidy and marketplace costs ($91 billion). Both Medicaid and CHIP require states to pay some of their total costs.

In March 2023, Medicaid and CHIP provided health coverage or long-term care to 93.9 million low-income children, parents, older adults, and people with disabilities. That was significantly higher than the 70.9 million enrollees before the pandemic because of temporary pandemic-related coverage protection, which expired in April 2023. With its expiration, enrollment dropped to 82.8 million by March 2024 and is likely to fall further, though projections are highly uncertain.

In February 2023, 14.3 million of the 15.7 million people enrolled in health insurance through ACA marketplaces received subsidies that lowered their premiums and out-of-pocket costs. Additionally, 20.8 million people opted for ACA marketplace coverage during the 2024 open enrollment period, a significant increase over enrollment in 2023.

  • Social Security: In 2023, 21 percent of the budget, or $1.4 trillion, was spent on Social Security, which provided monthly retirement benefits in March 2023 averaging $1,833 to 49.1 million retired workers. Social Security also provided benefits to 2.7 million spouses and children of retired workers, 5.9 million surviving children and spouses of deceased workers, and 8.7 million disabled workers and their eligible dependents.
  • Defense: Another 13 percent of the budget, or $820 billion, was spent on national defense activities. About 95 percent of the spending in this category reflects the underlying costs of the Defense Department, largely for operations and maintenance; military personnel; procurement of weapons; and research, development, testing, and evaluation.

Three other categories together account for the remaining program spending:

Chart: The Majority of the Budget Goes for Health, Social Security, and Defense

  • Economic security programs: About 8 percent (or $545 billion) of the 2023 federal budget supported programs that provide aid (other than health insurance or Social Security benefits) to individuals and families facing hardship. Economic security programs include: the refundable portions of the Earned Income Tax Credit and Child Tax Credit, which assist low- and moderate-income working families; programs that provide cash payments to eligible individuals or households, including unemployment insurance and Supplemental Security Income for low-income people who are over age 65 or disabled; various forms of in-kind assistance for low-income people, including the Supplemental Nutrition Assistance Program (formerly known as food stamps), school meals, low-income housing assistance, child care assistance, and help meeting home energy bills; and other programs such as aid for abused or neglected children.

Economic security programs keep millions of people above the poverty line each year. They also reduce, but do not eliminate, racial and ethnic differences in poverty rates.

  • Benefits for veterans and federal retirees: In 2023, about 7 percent (or $481 billion) of the budget provided benefits to veterans and former career employees of the federal government, both civilian and military. About nine-tenths of the benefits available to all veterans are either disability payments or medical care, which is often specialized to deal with the unusual conditions that military service may impose. In March 2023 there were more than 18 million veterans of U.S. military service, of whom 2.2 million were retired career military who have earned monthly retiree pensions or survivor benefits. There were also 2.7 million federal civil service retirees.
  • All other program spending: As the chart above shows, the remaining federal program spending — the dark blue slice of the pie — supports a variety of other public services. They include investing in education; investing in basic infrastructure such as roads, bridges, and airports; maintaining natural resources, farms, and the environment; investing in scientific and medical research; enforcing the nation’s laws to promote justice; and other basic duties of the federal government. A very small slice — 1 percent of the budget — goes to programs that operate internationally, including humanitarian aid and the operation of U.S. embassies and consulates.

In addition to program spending, the federal government makes regular interest payments on the money it has borrowed to finance past and current deficits. The net federal debt reached $23.7 trillion by the end of fiscal year 2023 and led to $658 billion in interest payments in 2023, or 10 percent of the budget. Interest costs reflect debt accumulated over the nation’s history — that is, the net impact of deficits and surpluses since 1789 — and therefore result from both revenue levels and program costs, past and present.

While critics often decry “government spending” in the abstract, it is important to determine whether the actual public services and investments that government programs provide are valuable. Federal revenue is used to pay for these services and investments. Consequently, when thinking about the costs that taxes impose, those costs should be weighed against the benefits the nation receives from the expenditure of those funds.

This backgrounder discusses total federal spending and thus does not distinguish between programs financed by general revenue and those financed by dedicated revenue (for example, the payroll taxes that support Social Security). For more information, see Policy Basics: Federal Payroll Taxes .

Our figures for fiscal year 2023 are derived from a database of account-level expenditures accompanying the President’s budget, released by the Office of Management and Budget on March 11, 2024. (Fiscal year 2023 ran from October 1, 2022 to September 30, 2023.)

The broad expenditure categories presented in this paper are constructed from official classifications commonly used by budget agencies. The categories consist of related programs and activities in different functions and subfunctions, as described below.

Health insurance:

This category consists of the Medicare function (570), including benefits, administrative costs, and premiums, as well as the “Grants to States for Medicaid” account, the “Children’s health insurance fund” account, the ACA’s “Refundable Premium Tax Credit and Cost Sharing Reductions” account, and the ACA’s “Risk Adjustment Program Payments” account (all in function 550).

Social Security:

This category consists of all expenditures in the Social Security function (650), including both benefits and administrative costs.

Defense:

This category is the national defense function (050).

Economic security programs:

This category includes all programs in the income security function (600) except those in the following two subfunctions: federal employees’ retirement and disability (602) and general retirement and disability insurance (601). The latter contains the Pension Benefit Guaranty Corporation and covers programs that provide pension and disability benefits to certain small groups of private sector workers.

Benefits for veterans and federal retirees:

This category combines the veterans’ benefits and services function (700) and the federal employee retirement and disability subfunction (602), which is part of the income security function.

Interest on debt:

This category consists of the net interest function (900).

Remaining program areas:

This category includes all federal expenditures not included in one of the six categories defined above. The subcomponents of this category that are displayed in the chart are defined as follows (a minimal code sketch of the full category mapping appears after this list):

  • Education : This subcategory combines three subfunctions of the education, training, employment, and social services function: elementary, secondary, and vocational education; higher education; and research and general educational aids (subfunctions 501, 502, and 503, respectively).
  • Transportation : This subcategory consists of the entire transportation function (400).
  • Natural resources and agriculture : This subcategory consists of the natural resources and environment function (300) and the agriculture function (350).
  • Science and medical research : This subcategory consists of the general science, space, and technology function (250) and the health research and training subfunction (552).
  • Law enforcement : This subcategory consists of the administration of justice function (750).
  • International : This subcategory consists of the international affairs function (150).
  • All other : This subcategory consists of all other federal expenditures.
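
Taken together, these definitions amount to a simple mapping from budget function and subfunction codes to the broad categories used in this piece. The sketch below is illustrative only: it is not CBPP's own classification code, and it simplifies the account-level exceptions noted above (such as the specific function 550 accounts under health insurance and the subfunction 601/602 carve-outs from income security).

```python
# Illustrative sketch (not CBPP's own code): mapping budget function and
# subfunction codes to the broad categories defined above.
FUNCTION_TO_CATEGORY = {
    "570": "Health insurance",                     # Medicare; selected accounts in function 550 are added separately
    "650": "Social Security",
    "050": "Defense",
    "600": "Economic security programs",           # excluding subfunctions 601 and 602
    "700": "Benefits for veterans and federal retirees",
    "602": "Benefits for veterans and federal retirees",
    "900": "Interest on debt",
    # Subcomponents of the remaining program areas shown in the chart:
    "501": "Education", "502": "Education", "503": "Education",
    "400": "Transportation",
    "300": "Natural resources and agriculture", "350": "Natural resources and agriculture",
    "250": "Science and medical research", "552": "Science and medical research",
    "750": "Law enforcement",
    "150": "International",
}

def categorise(code: str) -> str:
    """Return the broad category for a budget function or subfunction code."""
    return FUNCTION_TO_CATEGORY.get(code, "All other")

print(categorise("650"))   # Social Security
print(categorise("750"))   # Law enforcement
print(categorise("800"))   # All other
```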

The Center on Budget and Policy Priorities is a nonprofit, nonpartisan research organization and policy institute that conducts research and analysis on a range of government policies and programs. It is supported primarily by foundation grants.

