Research: Articulating Questions, Generating Hypotheses, and Choosing Study Designs
Mary P Tully
Address correspondence to: Dr Mary P Tully, Manchester Pharmacy School, University of Manchester, Oxford Road, Manchester M13 9PT UK, e-mail: [email protected]
INTRODUCTION
Articulating a clear and concise research question is fundamental to conducting a robust and useful research study. Although “getting stuck into” the data collection is the exciting part of research, this preparation stage is crucial. Clear and concise research questions are needed for a number of reasons. Initially, they are needed to enable you to search the literature effectively. They will allow you to write clear aims and generate hypotheses. They will also ensure that you can select the most appropriate research design for your study.
This paper begins by describing the process of articulating clear and concise research questions, assuming that you have minimal experience. It then describes how to choose research questions that should be answered and how to generate study aims and hypotheses from your questions. Finally, it describes briefly how your question will help you to decide on the research design and methods best suited to answering it.
TURNING CURIOSITY INTO QUESTIONS
A research question has been described as “the uncertainty that the investigator wants to resolve by performing her study” 1 or “a logical statement that progresses from what is known or believed to be true to that which is unknown and requires validation”. 2 Developing your question usually starts with having some general ideas about the areas within which you want to do your research. These might flow from your clinical work, for example. You might be interested in finding ways to improve the pharmaceutical care of patients on your wards. Alternatively, you might be interested in identifying the best antihypertensive agent for a particular subgroup of patients. Lipowski 2 described in detail how work as a practising pharmacist can be used to great advantage to generate interesting research questions and hence useful research studies. Ideas could come from questioning received wisdom within your clinical area or the rationale behind quick fixes or workarounds, or from wanting to improve the quality, safety, or efficiency of working practice.
Alternatively, your ideas could come from searching the literature to answer a query from a colleague. Perhaps you could not find a published answer to the question you were asked, and so you want to conduct some research yourself. However, just searching the literature to generate questions is not to be recommended for novices—the volume of material can feel totally overwhelming.
Use a research notebook, where you regularly write ideas for research questions as you think of them during your clinical practice or after reading other research papers. It has been said that the best way to have a great idea is to have lots of ideas and then choose the best. The same would apply to research questions!
When you first identify your area of research interest, it is likely to be either too narrow or too broad. Narrow questions (such as “How is drug X prescribed for patients with condition Y in my hospital?”) are usually of limited interest to anyone other than the researcher. Broad questions (such as “How can pharmacists provide better patient care?”) must be broken down into smaller, more manageable questions. If you are interested in how pharmacists can provide better care, for example, you might start to narrow that topic down to how pharmacists can provide better care for one condition (such as affective disorders) for a particular subgroup of patients (such as teenagers). Then you could focus it even further by considering a specific disorder (depression) and a particular type of service that pharmacists could provide (improving patient adherence). At this stage, you could write your research question as, for example, “What role, if any, can pharmacists play in improving adherence to fluoxetine used for depression in teenagers?”
TYPES OF RESEARCH QUESTIONS
Being able to consider the type of research question that you have generated is particularly useful when deciding what research methods to use. There are 3 broad categories of question: descriptive, relational, and causal.
Descriptive
One of the most basic types of question is designed to ask systematically whether a phenomenon exists. For example, we could ask “Do pharmacists ‘care’ when they deliver pharmaceutical care?” This research would initially define the key terms (i.e., describing what “pharmaceutical care” and “care” are), and then the study would set out to look for the existence of care at the same time as pharmaceutical care was being delivered.
When you know that a phenomenon exists, you can then ask description and/or classification questions. The answers to these types of questions involve describing the characteristics of the phenomenon or creating typologies of variable subtypes. In the study above, for example, you could investigate the characteristics of the “care” that pharmacists provide. Classifications usually use mutually exclusive categories, so that various subtypes of the variable will have an unambiguous category to which they can be assigned. For example, a question could be asked as to “what is a pharmacist intervention” and a definition and classification system developed for use in further research.
When seeking further detail about your phenomenon, you might ask questions about its composition. These questions necessitate deconstructing a phenomenon (such as a behaviour) into its component parts. Within hospital pharmacy practice, you might be interested in asking questions about the composition of a new behavioural intervention to improve patient adherence, for example, “What is the detailed process that the pharmacist implicitly follows during delivery of this new intervention?”
Relational
After you have described your phenomena, you may then be interested in asking questions about the relationships between several phenomena. If you work on a renal ward, for example, you may be interested in looking at the relationship between hemoglobin levels and renal function, so your question would look something like this: “Are hemoglobin levels related to level of renal function?” Alternatively, you may have a categorical variable such as grade of doctor and be interested in the differences between them with regard to prescribing errors, so your research question would be “Do junior doctors make more prescribing errors than senior doctors?” Relational questions could also be asked within qualitative research, where a detailed understanding of the nature of the relationship between, for example, the gender and career aspirations of clinical pharmacists could be sought.
Causal
Once you have described your phenomena and have identified a relationship between them, you could ask about the causes of that relationship. You may be interested to know whether an intervention or some other activity has caused a change in your variable, and your research question would be about causality. For example, you may be interested in asking, “Does captopril treatment reduce blood pressure?” Generally, however, if you ask a causality question about a medication or any other health care intervention, it ought to be rephrased as a causality–comparative question. Without comparing what happens in the presence of an intervention with what happens in the absence of the intervention, it is impossible to attribute causality to the intervention. Although a causality question would usually be answered using a comparative research design, asking a causality–comparative question makes the research design much more explicit. So the above question could be rephrased as, “Is captopril better than placebo at reducing blood pressure?”
The acronym PICO has been used to describe the components of well-crafted causality–comparative research questions. 3 The letters in this acronym stand for Population, Intervention, Comparison, and Outcome. They remind the researcher that the research question should specify the type of participant to be recruited, the type of exposure involved, the type of control group with which participants are to be compared, and the type of outcome to be measured. Using the PICO approach, the above research question could be written as “Does captopril [ intervention ] decrease rates of cardiovascular events [ outcome ] in patients with essential hypertension [ population ] compared with patients receiving no treatment [ comparison ]?”
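As an aside that is not part of the original article, the four PICO components can be captured in a small data structure; the class, field, and method names below are our own invention, and the assembled wording only approximates the example in the text.

```python
from dataclasses import dataclass

@dataclass
class PicoQuestion:
    """Holds the four PICO components of a causality-comparative question."""
    population: str
    intervention: str
    comparison: str
    outcome: str

    def as_sentence(self) -> str:
        # Assemble the components into a single research question.
        return (f"Does {self.intervention} change {self.outcome} "
                f"in {self.population}, compared with {self.comparison}?")

# The captopril example from the text, expressed as PICO components.
question = PicoQuestion(
    population="patients with essential hypertension",
    intervention="captopril",
    comparison="patients receiving no treatment",
    outcome="rates of cardiovascular events",
)
print(question.as_sentence())
```

Writing the components out separately in this way makes it easy to check that none of the four elements has been forgotten before the question is finalized.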
DECIDING WHETHER TO ANSWER A RESEARCH QUESTION
Just because a question can be asked does not mean that it needs to be answered. Not all research questions deserve to have time spent on them. One useful set of criteria is to ask whether your research question is feasible, interesting, novel, ethical, and relevant. 1 The need for research to be ethical will be covered in a later paper in the series, so is not discussed here. The literature review is crucial to finding out whether the research question fulfils the remaining 4 criteria.
Conducting a comprehensive literature review will allow you to find out what is already known about the subject and any gaps that need further exploration. You may find that your research question has already been answered. However, that does not mean that you should abandon the question altogether. It may be necessary to confirm those findings using an alternative method or to translate them to another setting. If your research question has no novelty, however, and is not interesting or relevant to your peers or potential funders, you are probably better off finding an alternative.
The literature will also help you learn about the research designs and methods that have been used previously and hence to decide whether your potential study is feasible. As a novice researcher, it is particularly important to ask if your planned study is feasible for you to conduct. Do you or your collaborators have the necessary technical expertise? Do you have the other resources that will be needed? If you are just starting out with research, it is likely that you will have a limited budget, in terms of both time and money. Therefore, even if the question is novel, interesting, and relevant, it may not be one that is feasible for you to answer.
GENERATING AIMS AND HYPOTHESES
All research studies should have at least one research question, and they should also have at least one aim. As a rule of thumb, a small research study should not have more than 2 aims as an absolute maximum. The aim of the study is a broad statement of intention and aspiration; it is the overall goal that you intend to achieve. The wording of this broad statement of intent is derived from the research question. If it is a descriptive research question, the aim will be, for example, “to investigate” or “to explore”. If it is a relational research question, then the aim should state the phenomena being correlated, such as “to ascertain the impact of gender on career aspirations”. If it is a causal research question, then the aim should include the direction of the relationship being tested, such as “to investigate whether captopril decreases rates of cardiovascular events in patients with essential hypertension, relative to patients receiving no treatment”.
The hypothesis is a tentative prediction of the nature and direction of relationships between sets of data, phrased as a declarative statement. Therefore, hypotheses are really only required for studies that address relational or causal research questions. For the study above, the hypothesis being tested would be “Captopril decreases rates of cardiovascular events in patients with essential hypertension, relative to patients receiving no treatment”. Studies that seek to answer descriptive research questions do not test hypotheses, but they can be used for hypothesis generation. Those hypotheses would then be tested in subsequent studies.
CHOOSING THE STUDY DESIGN
The research question is paramount in deciding what research design and methods you are going to use. There are no inherently bad research designs. The rightness or wrongness of the decision about the research design is based simply on whether it is suitable for answering the research question that you have posed.
It is possible to select completely the wrong research design to answer a specific question. For example, you may want to answer one of the research questions outlined above: “Do pharmacists ‘care’ when they deliver pharmaceutical care?” Although a randomized controlled study is considered by many as a “gold standard” research design, such a study would just not be capable of generating data to answer the question posed. Similarly, if your question was, “Is captopril better than placebo at reducing blood pressure?”, conducting a series of in-depth qualitative interviews would be equally incapable of generating the necessary data. However, if these designs are swapped around, we have 2 combinations (pharmaceutical care investigated using interviews; captopril investigated using a randomized controlled study) that are more likely to produce robust answers to the questions.
The language of the research question can be helpful in deciding what research design and methods to use. Subsequent papers in this series will cover these topics in detail. For example, if the question starts with “how many” or “how often”, it is probably a descriptive question to assess the prevalence or incidence of a phenomenon. An epidemiological research design would be appropriate, perhaps using a postal survey or structured interviews to collect the data. If the question starts with “why” or “how”, then it is a descriptive question to gain an in-depth understanding of a phenomenon. A qualitative research design, using in-depth interviews or focus groups, would collect the data needed. Finally, the term “what is the impact of” suggests a causal question, which would require comparison of data collected with and without the intervention (i.e., a before–after or randomized controlled study).
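The rule of thumb in this paragraph can be sketched as a toy lookup. This is our own illustration of the heuristic described above, not a tool from the article, and the suggested designs are only the broad categories named in the text.

```python
# Hypothetical helper (not from the article): map the opening words of a
# research question to the broad design category suggested in the text above.
STEM_TO_DESIGN = {
    "how many": "descriptive (prevalence/incidence); e.g., postal survey or structured interviews",
    "how often": "descriptive (prevalence/incidence); e.g., postal survey or structured interviews",
    "why": "descriptive (in-depth understanding); e.g., qualitative interviews or focus groups",
    "how": "descriptive (in-depth understanding); e.g., qualitative interviews or focus groups",
    "what is the impact of": "causal; e.g., before-after or randomized controlled study",
}

def suggest_design(question: str) -> str:
    """Return a rough design suggestion based on how the question starts."""
    q = question.lower().strip()
    for stem, design in STEM_TO_DESIGN.items():
        if q.startswith(stem):
            return design
    return "no suggestion; rephrase the question or consult the literature"

print(suggest_design("How often are prescribing errors made on this ward?"))
```

The point of the sketch is simply that the opening words of the question already constrain the family of designs worth considering, as the subsequent papers in the series discuss in detail.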
CONCLUSIONS
This paper has briefly outlined how to articulate research questions, formulate your aims, and choose your research methods. It is crucial to realize that articulating a good research question involves considerable iteration through the stages described above. It is very common that the first research question generated bears little resemblance to the final question used in the study. The language is changed several times, for example, because the first question turned out not to be feasible and the second question was a descriptive question when what was really wanted was a causality question. The books listed in the “Further Reading” section provide greater detail on the material described here, as well as a wealth of other information to ensure that your first foray into conducting research is successful.
This article is the second in the CJHP Research Primer Series, an initiative of the CJHP Editorial Board and the CSHP Research Committee. The planned 2-year series is intended to appeal to relatively inexperienced researchers, with the goal of building research capacity among practising pharmacists. The articles, presenting simple but rigorous guidance to encourage and support novice researchers, are being solicited from authors with appropriate expertise.
Previous article in this series:
Bond CM. The research jigsaw: how to get started. Can J Hosp Pharm . 2014;67(1):28–30.
Competing interests: Mary Tully has received personal fees from the UK Renal Pharmacy Group to present a conference workshop on writing research questions and nonfinancial support (in the form of travel and accommodation) from the Dubai International Pharmaceuticals and Technologies Conference and Exhibition (DUPHAT) to present a workshop on conducting pharmacy practice research.
- 1. Hulley S, Cummings S, Browner W, Grady D, Newman T. Designing clinical research. 4th ed. Philadelphia (PA): Lippincott, Williams and Wilkins; 2013.
- 2. Lipowski EE. Developing great research questions. Am J Health Syst Pharm. 2008;65(17):1667–70. doi: 10.2146/ajhp070276.
- 3. Richardson WS, Wilson MC, Nishikawa J, Hayward RS. The well-built clinical question: a key to evidence-based decisions. ACP J Club. 1995;123(3):A12–3.
Further Reading
- Cresswell J. Research design: qualitative, quantitative and mixed methods approaches. London (UK): Sage; 2009.
- Haynes RB, Sackett DL, Guyatt GH, Tugwell P. Clinical epidemiology: how to do clinical practice research. 3rd ed. Philadelphia (PA): Lippincott, Williams & Wilkins; 2006.
- Kumar R. Research methodology: a step-by-step guide for beginners. 3rd ed. London (UK): Sage; 2010.
- Smith FJ. Conducting your pharmacy practice research project. London (UK): Pharmaceutical Press; 2005.
Data-Driven Hypothesis Generation in Clinical Research: What We Learned from a Human Subject Study?
Affiliations
- 1 Department of Public Health Sciences, College of Behavioral, Social and Health Sciences, Clemson University, Clemson, SC.
- 2 Informatics Institute, School of Medicine, University of Alabama, Birmingham, Birmingham, AL.
- 3 Cognitive Studies in Medicine and Public Health, The New York Academy of Medicine, New York City, NY.
- 4 Department of Educational Studies, Patton College of Education, Ohio University, Athens, OH.
- 5 Department of Clinical Sciences and Community Health, Touro University California College of Osteopathic Medicine, Vallejo, CA.
- 6 Department of Electrical Engineering and Computer Science, Russ College of Engineering and Technology, Ohio University, Athens, OH.
- 7 Department of Health Science, California State University Channel Islands, Camarillo, CA.
- PMID: 39211055
- PMCID: PMC11361316
- DOI: 10.18103/mra.v12i2.5132
Hypothesis generation is an early and critical step in any hypothesis-driven clinical research project. Because it is not yet a well-understood cognitive process, the need to improve it often goes unrecognized. Without an impactful hypothesis, the significance of any research project can be questionable, regardless of the rigor or diligence applied in other steps of the study, e.g., study design, data collection, and result analysis. In this perspective article, the authors first provide a literature review on the following topics: scientific thinking, reasoning, medical reasoning, literature-based discovery, and a field study to explore scientific thinking and discovery. Over the years, research on scientific thinking has made excellent progress in cognitive science and its applied areas: education, medicine, and biomedical research. However, a review of the literature reveals a lack of original studies on hypothesis generation in clinical research. The authors then summarize their first human participant study exploring data-driven hypothesis generation by clinical researchers in a simulated setting. The results indicate that a secondary data analytical tool, VIADS (a visual interactive analytic tool for filtering, summarizing, and visualizing large health data sets coded with hierarchical terminologies), can shorten the time participants need, on average, to generate a hypothesis and also reduces the number of cognitive events needed per hypothesis. As a counterpoint, the exploration also indicates that hypotheses generated with VIADS received significantly lower feasibility ratings. Despite its small scale, the study confirmed the feasibility of conducting a human participant study to directly explore the hypothesis generation process in clinical research. It provides supporting evidence for conducting a larger-scale study with a specifically designed tool to facilitate the hypothesis-generation process among inexperienced clinical researchers. A larger study could provide generalizable evidence, which in turn can potentially improve clinical research productivity and the overall clinical research enterprise.
Keywords: Clinical research; data-driven hypothesis generation; medical informatics; scientific hypothesis generation; translational research; visualization.
Grants and funding
- P20 GM121342/GM/NIGMS NIH HHS/United States
- R15 LM012941/LM/NLM NIH HHS/United States
- T15 LM013977/LM/NLM NIH HHS/United States
Hypothesis. In: Ishikawa H. Hypothesis Generation and Interpretation. Studies in Big Data, vol 139. Springer, Cham; 2024. https://doi.org/10.1007/978-3-031-43540-9_2
Hiroshi Ishikawa, Department of Systems Design, Tokyo Metropolitan University, Hino, Tokyo, Japan
This chapter will explain the definition and properties of a hypothesis, the related concepts, and basic methods of hypothesis generation as follows.
- Describe the definition, properties, and life cycle of a hypothesis.
- Describe relationships between a hypothesis and a theory, a model, and data.
- Categorize and explain research questions that provide hints for hypothesis generation.
- Explain how to visualize data and analysis results.
- Explain the philosophy of science and scientific methods in relation to hypothesis generation in science.
- Explain deduction, induction, plausible reasoning, and analogy concretely as reasoning methods useful for hypothesis generation.
- Explain problem solving as hypothesis generation methods by using familiar examples.
This article is a preprint and has not yet been peer reviewed by a journal.
How do clinical researchers generate data-driven scientific hypotheses? Cognitive events using think-aloud protocol
Brooke N Draghi, Mytchell A Ernst, Vimla L Patel, James J Cimino, Jay H Shubrook, Yuchun Zhou, Sonsoles de Lacalle
Co-authors contributed equally
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License , which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
Objectives:
This study aims to identify the cognitive events related to information use (e.g., “Analyze data”, “Seek connection”) during hypothesis generation among clinical researchers. Specifically, we describe hypothesis generation using cognitive event counts and compare them between groups.
Methods:
The participants used the same datasets, followed the same scripts, and used VIADS (a visual interactive analysis tool for filtering and summarizing large data sets coded with hierarchical terminologies) or other analytical tools (as control) to analyze the datasets and generate hypotheses while following the think-aloud protocol. Their screen activities and audio were recorded, transcribed, and coded for cognitive events.
Results:
The VIADS group exhibited the lowest mean number of cognitive events per hypothesis and the smallest standard deviation. The experienced clinical researchers had approximately 10% more valid hypotheses than the inexperienced group. The VIADS users among the inexperienced clinical researchers exhibited a trend similar to that of the experienced clinical researchers in terms of the number of cognitive events and their respective percentages out of all cognitive events. The highest percentages of cognitive events in hypothesis generation were “Using analysis results” (30%) and “Seeking connections” (23%).
Conclusion:
VIADS helped inexperienced clinical researchers use fewer cognitive events to generate hypotheses than the control group. This suggests that VIADS may guide participants to be more structured during hypothesis generation compared with the control group. The results provide evidence to explain the shorter average time needed by the VIADS group in generating each hypothesis.
Keywords: Scientific hypothesis generation, Clinical research, Cognitive events, Think-aloud method, Data-driven hypothesis generation, Secondary data analytical tool
Introduction
A research hypothesis is an educated guess regarding relationships among different variables [1, 2]. A research question typically comprises one to several scientific hypotheses that drive the direction of most research projects [1, 3–5]. If we consider the life cycle of a research project, hypothesis generation constitutes its starting point. Without a significant, insightful, and novel hypothesis to begin with, it is difficult to have an impactful research project regardless of the study design, experiment implementation, and results analysis. Therefore, hypothesis generation plays a critical role in a research project. Several studies have investigated how researchers generate scientific hypotheses, both in science (e.g., Klahr and Dunbar [6, 7]) and in clinical medicine (e.g., Joseph and Patel [8, 9]). However, none of these studies address how an analytic tool can be used to facilitate the hypothesis-generation process.
At least two categories of hypothesis are used frequently in scientific research. One is a hypothesis originating from experimental observations, e.g., any unusual phenomena observed during experiments in the context of “wet lab”. The other category is a hypothesis originating from the context of data analysis, for example, studies in epidemiology, genomics, and informatics [ 10 – 12 ]. Observations of unique or unusual phenomena in the first category and observations of trends in the second category are both critical in developing hypotheses [ 7 , 13 ]. Herein, we focus on the hypothesis generation within the second category.
In the past decades, there has been much work toward understanding scientific thinking and reasoning, medical reasoning, analogy, and working memory [7, 14]. Educational settings and math problems were used to explore the reasoning process [15–17]. However, scientific hypothesis generation was not addressed, and the mechanism of explicit cognitive processes during scientific hypothesis generation remains unclear. The main differences between scientific reasoning and hypothesis generation include: a) the starting points of the two processes differ; many studies of scientific reasoning start from an existing problem or puzzle [17–20], whereas data-driven hypothesis generation searches for a problem or a focus area to begin with, termed open discovery by Henry et al. [21]; and b) the mechanisms between the start and end points of the two processes may differ, with convergent thinking used more in scientific reasoning, when a question or a puzzle needs to be solved [7], and divergent thinking used more in data-driven scientific hypothesis generation. Meanwhile, hypothesis generation in medical diagnosis starts with a presented medical case or symptoms [19, 22], which is similar to scientific reasoning.
We previously developed a conceptual framework for scientific hypothesis generation and its contributing factors [ 23 ]. Researchers have explored the possibilities of automatically generating scientific hypotheses in the past [ 10 , 24 – 28 ]; however, these authors recognized the challenges faced by an automated tool for such an advanced cognitive process [ 24 , 29 , 30 ].
Our study aims to obtain a better understanding of the scientific hypothesis generation process in clinical research. Considering that hypotheses directly shape and guide the direction of any research project, the findings of this work can potentially impact the clinical research enterprise. The research protocol [31], the usability of VIADS [32–34] (a visual interactive analytic tool for filtering and summarizing large health data sets coded with hierarchical terminologies, a secondary data analytical tool developed by our team) [35], and the quality evaluation of the hypotheses generated by participants [23] have all been published. This manuscript focuses on the cognitive events used by experienced and inexperienced clinical researchers during hypothesis generation.
Methods
Study flow and data sets used
The 2 × 2 study compared the hypothesis generation process of clinical researchers with and without VIADS on the same datasets (Appendix A), with the same study scripts (Appendix B), and within the same timeframe (2 hours per study session), and all participants followed the think-aloud method. The participants were separated into experienced and inexperienced clinical researchers based on predetermined criteria [31], e.g., years of experience and number of publications as significant contributors. The data were extracted from the National Ambulatory Medical Care Survey (NAMCS) conducted by the Centers for Disease Control and Prevention in 2005 and 2015 [36]. We preprocessed the NAMCS data sets by calculating and aggregating the ICD-9-CM diagnostic and procedural codes and their frequencies. During the study sessions, the participants were asked to analyze the data, generate hypotheses, and articulate their thoughts and actions throughout the process. The screen activities and conversations between participants and the study facilitator were recorded via BBFlashback, and the recordings were transcribed by a professional service.
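As an illustrative sketch of the preprocessing step (not the authors' actual code), the snippet below aggregates ICD-9-CM code frequencies from a toy visit-level table; the column names, codes, and values are hypothetical and the real NAMCS extracts differ.

```python
import pandas as pd

# Illustrative only: assume one row per visit with up to three ICD-9-CM codes.
visits = pd.DataFrame({
    "DIAG1": ["250.00", "401.9", "250.00"],
    "DIAG2": ["401.9", None, "272.4"],
    "DIAG3": [None, None, None],
})

# Stack the diagnosis columns into one series and count code frequencies,
# mirroring the "calculate and aggregate code frequencies" preprocessing step.
code_counts = (
    visits[["DIAG1", "DIAG2", "DIAG3"]]
    .stack()                      # drops missing codes, one code per row
    .value_counts()
    .rename_axis("icd9_code")
    .reset_index(name="frequency")
)
print(code_counts)
```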
Cognitive events coding for the hypothesis generation recordings
Based on the experience of conducting all study sessions, initial data analysis, the feedback from the investigation team, and literature review [ 1 , 13 , 37 – 41 ], a preliminary conceptual framework of the cognitive hypothesis generation process was developed before coding ( Figure 1 ). The conceptual framework served as a foundational framework to formulate the initial codes and code groups ( Appendix C ) that were used to code the transcriptions of the recordings, mainly for cognitive events (e.g., seek connections, analogy) in the hypothesis generation process. For example, “Analogy” was used when a participant compared one’s last study with the analysis results in front of him/her. “Use PICOT” was used when a participant used PICOT (i.e., patient, intervention, comparison, outcome, type of study) to formulate an idea into a formal hypothesis.
Figure 1. Initial version of the framework of cognitive events during hypothesis generation
The transcription of one study session was utilized as a pilot coding case to set the initial coding principles ( Appendix D ). The pilot coding sessions were used as training sessions for the two coders as well. The rest of the transcriptions were coded by the two coders independently and separately first. The two coders compared their coding results, discussed any discrepancies, and reached a consensus on coding later by including the study facilitator and modifying the coding principles. More codes and code groups were added while the coding progressed. After coding all the study session transcripts, the two coders also organized each hypothesis generation as an independent process and labeled the cognitive events during each hypothesis generation. We investigated the possible hypothesis generation processes based on coded cognitive events.
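To make the unit of analysis concrete, the sketch below (ours, not from the paper; the event labels and counts are invented) shows how coded segments could be tallied into per-hypothesis event counts and overall event frequencies, the quantities analyzed in the next section.

```python
from collections import Counter

# Hypothetical coded segments (hypothesis id, cognitive event code); the real
# coding was applied to full transcripts using the codebook in Appendix C.
coded_segments = [
    (1, "Analyze data"), (1, "Seek connections"), (1, "Use analysis results"),
    (2, "Analyze data"), (2, "Pause/think"), (2, "Use analysis results"),
    (2, "Seek connections"),
]

# Number of cognitive events applied to each hypothesis.
events_per_hypothesis = Counter(hyp_id for hyp_id, _ in coded_segments)
# Overall frequency of each cognitive event across all hypotheses.
event_frequencies = Counter(code for _, code in coded_segments)

print(events_per_hypothesis)   # e.g., Counter({2: 4, 1: 3})
print(event_frequencies)
```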
Data analytics strategy
This study used the cognitive events and the aggregated frequencies of these events to demonstrate the possible hypothesis generation process. While analyzing the cognitive events, we considered the results at four levels: (1) each hypothesis generation as a unit, examining all hypotheses (n = 199); (2) each participant as a unit, examining all participants (n = 16); (3) the group of participants who used VIADS as a unit (n = 9); and (4) the group of participants who did not use VIADS as a unit (n = 7). Correspondingly, the results were also organized at these four levels. We performed independent t-tests to compare the cognitive events (a) between participants in the VIADS and control groups and (b) between the experienced (3 participants, 36 hypotheses) and inexperienced clinical researchers (13 participants, 163 hypotheses). The study sessions of two participants (both inexperienced clinical researchers in the control group) were missing from the coding data because technical failures resulted in only partial recordings of their sessions, and their data were excluded from the analysis.
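A minimal sketch of the group comparison described above, using an independent two-sample t-test from SciPy; the per-hypothesis event counts below are invented placeholder values, not the study data.

```python
import numpy as np
from scipy import stats

# Illustrative numbers only: cognitive events counted per hypothesis for the
# VIADS and control groups of inexperienced clinical researchers.
viads_events_per_hypothesis   = np.array([4, 5, 3, 6, 4, 5, 4, 3, 6])
control_events_per_hypothesis = np.array([7, 9, 5, 8, 12, 6, 7])

# Independent t-test, as used for the group-wise comparisons in this study.
t_stat, p_value = stats.ttest_ind(viads_events_per_hypothesis,
                                  control_events_per_hypothesis)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```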
All hypotheses were rated by an expert panel of seven members using the same metrics for quality evaluation [23, 42]. We deemed a hypothesis invalid if three or more experts rated it as 1 (the lowest rating) on validity (significance and feasibility are the two additional dimensions used for evaluation). However, we report the results both for all hypotheses and for valid hypotheses only.
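The validity rule described above can be expressed as a small helper function; this is our own illustrative sketch, and the rating values in the examples are hypothetical.

```python
def is_valid(validity_ratings: list[int], lowest: int = 1, threshold: int = 3) -> bool:
    """Apply the rule described above: a hypothesis is invalid if three or
    more of the seven expert raters scored its validity at the lowest level (1)."""
    return sum(1 for rating in validity_ratings if rating == lowest) < threshold

# Example: seven expert validity ratings for one hypothesis (illustrative values).
print(is_valid([1, 2, 1, 3, 1, 2, 2]))  # False: three raters gave the lowest rating
print(is_valid([2, 3, 1, 4, 2, 1, 3]))  # True: only two raters gave the lowest rating
```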
Ethics statement
The study was approved by the Institutional Review Board of Clemson University, South Carolina (IRB2020–056) and Ohio University Institutional Review Boards (18-X-192).
Results
Hypothesis generation framework
Figure 2 is a refined and evolving version of the initial framework shown in Figure 1, our preliminary understanding of hypothesis generation. Figure 2 was instrumental in directly guiding the coding of the cognitive events. The predominant cognitive events within the processing evidence category include “Using analysis results” (30%), “Seeking connections” (23%), and “Analyze data” (20.81%, Figure 2). Appendix E illustrates the processes and the percentages of events used while generating hypotheses. Appendix F presents the individual cognitive events used for all hypotheses and for valid hypotheses, respectively.
Figure 2. Cognitive process framework for scientific hypothesis generation in clinical research; the cognitive events used in the highest percentages by clinical researchers are highlighted.
Overall cognitive events usage during hypothesis generation
Sixteen participants generated 199 hypotheses during the 2-hour study sessions, with 163 originating from the inexperienced groups (Table 1). We used 20 distinct codes (i.e., cognitive events) and 6 code groups (Figure 2). Appendix C presents the complete codebook, and Appendix D delineates the rationale and principles established during the coding phase. In total, cognitive events were applied 1216 times across the 199 hypotheses. On average, inexperienced clinical researchers in the control group applied 7.38 cognitive events per hypothesis. Conversely, inexperienced clinical researchers in the VIADS group used 4.48 cognitive events per hypothesis (p < 0.001 versus control), with the lowest standard deviation (SD, 2.43). Experienced clinical researchers employed 6.15 cognitive events per hypothesis (p < 0.01 versus junior VIADS group). Notably, the inexperienced clinical researchers in the control group demonstrated the highest average number of cognitive events used, with the largest SD (5.02), whether we considered all hypotheses or just valid ones (Table 1). The experienced participants had an approximately 10% higher proportion of valid hypotheses (72.22% vs. 63.19%) than the inexperienced participants.
Table 1. Group-wise comparison of cognitive events used while generating hypotheses
Note: SD, standard deviation;
p < 0.001 between junior C and junior V;
p < 0.01 between junior V and experienced.
Cognitive events comparison between VIADS and non-VIADS participants
Furthermore, we compared the percentages of cognitive event counts between the VIADS and non-VIADS groups among inexperienced clinical researchers (Figure 3). “Use analysis results” (31.3% vs. 27.1%, p < 0.001), “Seek connections” (25.4% vs. 17.8%, p < 0.001), and “Analyze data” (22.1% vs. 21.1%) were the events with the highest percentages. “Seek connections”, “Use analysis results”, and “Pause/think” (3.8% vs. 9.3%, p < 0.05) all showed statistically significant differences between the VIADS and control groups by t tests. Our results indicate that the participants in the VIADS group registered higher event counts during “Preparation”, “Analyzing results”, and “Seeking connections”. Conversely, the control group exhibited greater event counts in categories such as “Needing further study”, “Inferring”, “Pausing”, “Using checklists”, and “Using PICOT”.
Figure 3. Comparison of cognitive events between the VIADS and control groups among inexperienced clinical researchers while generating hypotheses
Cognitive events comparison between experienced and inexperienced clinical researchers
We also examined the differences between experienced and inexperienced clinical researchers regarding the percentages of cognitive events they used (Figure 4). “Use analysis results” (31.7% vs. 29.4%, p < 0.01), “Seek connections” (27.6% vs. 21.9%, p < 0.01), and “Analyze data” (17.5% vs. 21.6%, p < 0.01) were the events with the highest percentages of use. The data suggest that experienced clinical researchers exhibited higher percentages for these cognitive events: “Using analysis results”, “Seeking connections”, “Inferring”, and “Pausing”. Conversely, inexperienced clinical researchers demonstrated elevated percentages in cognitive events such as “Preparation”, “Data analysis”, “Utilizing suggestions”, “Utilizing checklists”, and “Utilizing PICOT”.
Figure 4. Comparison of cognitive events between experienced and inexperienced clinical researchers while generating hypotheses
Discussion
Summary of results
The inexperienced clinical researchers in the VIADS group used the fewest cognitive events to generate each hypothesis on average versus the control group (p < 0.001) and the experienced clinical researchers (p < 0.01, Table 2). The most frequently used cognitive events were “Use analysis results” (29.85%), “Seek connections” (23.03%), and “Analyze data” (20.81%) during hypothesis generation ( Figure 2 ). It seems the inexperienced clinical researchers in the VIADS group demonstrated a similar trend to experienced clinical researchers ( Figures 3 and 4 ).
Results interpretation
Several findings of this study were notable. The experienced clinical researchers had an approximately 10% higher percentage of valid hypotheses than the inexperienced clinical researchers (72.22% vs. 63.19%; Table 1), consistent with our expectations and experience. Another interesting observation concerns the average number of cognitive events used by the different groups: the junior VIADS group used far fewer events per hypothesis than the control or experienced groups (4.38 vs. 7.38 vs. 6.15, Table 1) and exhibited the lowest SD. This is notable because it indicates that the VIADS group, despite comprising inexperienced clinical researchers, used fewer cognitive events to generate each hypothesis on average. This result supports our hypothesis that VIADS facilitates hypothesis generation. It is also consistent with our earlier finding that the VIADS group needed a shorter time to generate each hypothesis on average [23].
Our results show that clinical researchers spent ≥ 70% of their cognitive events on processing evidence during hypothesis generation. The top three cognitive events used by clinical researchers during hypothesis generation were “Using analysis results” (29.85%), “Seeking connections” (23.03%), and “Analyzing data” (20.81%, Figure 2).
Figure 3 presents the cognitive events and their distributions between the VIADS and control groups comprising the inexperienced clinical researchers. The participants in the VIADS group showed a higher number of cognitive events for interpreting the results, and the participants in the control group showed a higher number of cognitive events for external help, such as checklists and PICOT, during hypothesis generation. Figures 3 and 4 show that the VIADS group exhibits cognitive event trends similar to those of the experienced group in terms of “Using analysis results” and “Seeking connections”:
- “Using analysis results”: VIADS versus control, 31.35% versus 27.11% (p < 0.001); experienced versus inexperienced, 31.71% versus 29.38% (p < 0.01)
- “Seeking connections”: VIADS versus control, 25.38% versus 17.78% (p < 0.001); experienced versus inexperienced, 27.64% versus 21.86% (p < 0.01)
The results indicate that VIADS may help inexperienced clinical researchers move in a direction that aligns more closely with that of experienced clinical researchers. A more carefully designed study is needed to support or refute such a statement. However, the current quantitative evidence on cognitive events and their distributions among all cognitive events appears to support such a trend.
Significance of the work
We consider this study to have the following significance: 1) it developed a cognitive framework for hypothesis generation in the clinical research context and provided quantitative evidence for the framework through cognitive events; 2) it identified and elaborated evidence-based cognitive mechanisms that might underlie hypothesis generation; 3) it showed that experienced clinical researchers generate a considerably higher rate of valid hypotheses in a 2-hour window than inexperienced clinical researchers; 4) it demonstrated that VIADS may help inexperienced clinical researchers use fewer cognitive events during hypothesis generation than participants who did not use VIADS, which suggests that VIADS provides a structured way of thinking during hypothesis generation; and 5) it established baseline measures of cognitive events in hypothesis generation, with the following event groups used in descending order: processing evidence, seeking evidence, and preparation.
Comparison with other studies
Patel et al. have explored medical reasoning through diagnosis, and their work significantly influenced the design of the current study [7, 8, 20, 22]. From their studies, we know that there were differences in the reasoning processes and thinking steps between experienced and inexperienced clinicians in medical diagnosis [9, 19, 22, 43, 44]. Therefore, we separated the participants into experienced and inexperienced groups before randomly assigning them to the VIADS or control groups. The findings of this study mostly align with those of Patel et al. despite the different settings: medical diagnosis versus scientific hypothesis generation in clinical research. The experienced participants used fewer cognitive events than inexperienced participants on average, although the VIADS group used the lowest number of cognitive events despite comprising inexperienced clinical researchers.
Klahr and Dunbar’s landmark study published in 1988 [6] also informed our study. Their study taught participants to use an electronic device, and the participants then had to figure out a function of the device that they had not encountered before. The process was employed to study hypothesis generation, reasoning, and testing iteratively. They concluded that searching memory and using results from prior experiments are critical for hypothesis generation. The primary differences between the two studies are twofold: (1) the tasks given to the participants and (2) the types of hypotheses generated. In Klahr and Dunbar’s study, hypotheses had correct answers, i.e., problem-solving with one or multiple correct answers. Most likely, the participants used convergent thinking [7]. Their study used a simulated lab environment to assess scientific thinking. Conversely, the hypothesis generation in our study is open discovery without correct answers, and the participants in our study used more divergent thinking during the process [7]. The hypothesis generation process in our study was substantially messier, more unpredictable, and more challenging to evaluate consistently compared with their well-defined problems.
Limitations and challenges
One of the main limitations is that only three experienced clinical researchers participated in our study, generating 36 hypotheses. We compared the inexperienced and experienced groups regarding all the hypotheses and cognitive events used. However, we could not compare the cognitive events between the VIADS and control groups among the experienced clinical researchers. We made similar efforts to recruit inexperienced and experienced clinical researchers via comparable platforms; however, the recruitment results were considerably worse in the experienced group.
Another limitation of the study concerns the information that could be captured via the think-aloud protocol. We acknowledge that we only captured the events verbalized during the study sessions, which are a subset of the conscious process and a small portion of the real process. Our coding, aggregation, and analysis are based on these captured events.
In addition, we faced challenges in terms of unexpected technical failures and the unpredictability inherent in a study with human participants. The audio recordings of two participants were partial because of a technical failure. One mitigation strategy would be to conduct a test recording for every participant at each session, which is particularly critical if a new device is introduced in the middle of the study.
Future work
Several avenues for future research emerge from our study. First, we aim to explore the sequence pattern of cognitive events to furnish additional insights into hypothesis generation. Furthermore, juxtaposing the frequencies of cognitive events with the quality evaluation results of the generated hypotheses might illuminate the potential patterns, further enriching our understanding of the process. Finally, a larger scale study encompassing a larger participant sample size and situated in a more natural environment can enhance the robustness of our findings.
Conclusion
Experienced clinical researchers exhibit a higher valid hypothesis rate than inexperienced clinical researchers. The VIADS group of inexperienced clinical researchers used the fewest cognitive events, with the lowest standard deviation, to generate each hypothesis compared with experienced and inexperienced clinical researchers not using VIADS. This efficiency is further underscored by the VIADS group taking the least average time to generate a hypothesis. Notably, the inexperienced VIADS cohort mirrored the trend observed in experienced clinical researchers in terms of cognitive event distribution. Such findings indicate that VIADS may provide structured guidance during hypothesis generation. Further studies, ideally on a larger scale and in a more natural environment, could offer a deeper understanding of the process. Our research provides foundational metrics on cognitive event measures during hypothesis generation in clinical research, demonstrating the viability of executing such experiments in a simulated setting and unraveling the intricacies of the hypothesis generation process through these experiments.
Supplementary Material
What is already known on this topic:
Prior studies have examined how hypotheses are generated when solving a puzzle or a medical case, and the reasoning differences between experienced and inexperienced physicians.
What this study adds:
Our study advances understanding of how clinical researchers generate hypotheses with secondary data analytical tools and datasets, and of the cognitive events used during hypothesis generation in an open discovery context.
How this study might affect research, practice, or policy:
Our work suggests secondary data analytical tools and visualization may facilitate hypothesis generation among inexperienced clinical researchers regarding the number of hypotheses, average time, and the cognitive events needed per hypothesis.
Acknowledgments
We want to thank all participants sincerely for their precious time, courage, and expertise in helping us understand this critical but less-known hypothesis generation process better. This project received support from the National Library of Medicine (R15LM012941) and was funded partially by the National Institute of General Medical Sciences of the National Institutes of Health (P20 GM121342). The intellectual environment and research training resources provided by the NIH/NLM T15 SC BIDS4Health (T15LM013977) enriched this work.
Appendices:
Appendix A : Datasets used by participants during study sessions
Appendix B : Study session scripts followed by all participants
Appendix C : Codes and code group used during study session transcription analysis
Appendix D : Rationale and guidelines for coding data-driven hypothesis generation recordings
Appendix E : Cognitive events and their percentages during hypothesis generation in clinical research
Appendix F : Cognitive events used while generating data-driven hypotheses
- 1. Supino P, Borer J. Principles of research methodology: A guide for clinical investigators. 2012.
- 2. Parahoo A. Nursing research: Principles, process & issues. 1997.
- 3. Hulley S, Cummings S, Browner W, Grady D, Newman T. Designing clinical research. 2013.
- 4. Browner W, Newman T, Cummings S, et al. Designing Clinical Research. 5th ed. Philadelphia, PA: Wolters Kluwer, 2023.
- 5. Gallin JI, Ognibene FP. Principles and Practice of Clinical Research. Burlington: Elsevier Science & Technology, 2007.
- 6. Klahr D, Dunbar K. Dual Space Search During Scientific Reasoning. Cognitive Science 1988;12(1):1–48. doi: 10.1207/s15516709cog1201_1
- 7. The Oxford Handbook of Thinking and Reasoning. New York, NY: Oxford University Press, 2012.
- 8. Joseph G-M, Patel VL. Domain knowledge and hypothesis generation in diagnostic reasoning. Medical Decision Making 1990;10:31–46.
- 9. Arocha J, Patel V, Patel Y. Hypothesis generation and the coordination of theory and evidence in novice diagnostic reasoning. Medical Decision Making 1993;13:198–211.
- 10. Spangler S. Accelerating discovery: mining unstructured information for hypothesis generation. 2016.
- 11. Petric I, Ligeti B, Gyorffy B, Pongor S. Biomedical hypothesis generation by text mining and gene prioritization. Protein Pept Lett 2014;21(8):847–57. doi: 10.2174/09298665113209990063
- 12. Biesecker L. Hypothesis-generating research and predictive medicine. Genome Res 2013;23:1051–53.
- 13. Kitano H. Nobel Turing Challenge: creating the engine for scientific discovery. npj Systems Biology and Applications 2021;7(1):29. doi: 10.1038/s41540-021-00189-3
- 14. The Cambridge Handbook of Thinking and Reasoning. New York: Cambridge University Press, 2005.
- 15. Sprenger AM, Dougherty MR, Atkins SM, et al. Implications of cognitive load for hypothesis generation and probability judgment. Front Psychol 2011;2:129. doi: 10.3389/fpsyg.2011.00129
- 16. Thomas R, Dougherty M, Sprenger A, Harbison J. Diagnostic hypothesis generation and human judgment. Psychological Review 2008;115(1):155–85. doi: 10.1037/0033-295X.115.1.155
- 17. Klauer KC, Stegmaier R, Meiser T. Working Memory Involvement in Propositional and Spatial Reasoning. Thinking & Reasoning 1997;3(1):9–47. doi: 10.1080/135467897394419
- 18. Dunbar K, Fugelsang J. Causal thinking in science: How scientists and students interpret the unexpected. In: Gorman M, Kincannon A, Gooding D, Tweney R, eds. New directions in scientific and technical thinking. Mahwah, NJ: Erlbaum, 2004:57–59.
- 19. Patel VL, Groen GJ, Arocha JF. Medical expertise as a function of task difficulty. Memory & Cognition 1990;18(4):394–406.
- 20. Patel VL, Arocha JF, Zhang J. Chapter 30: Thinking and Reasoning in Medicine. In: Holyoak KJ, Morrison RG, eds. The Cambridge Handbook of Thinking and Reasoning. New York: Cambridge University Press, 2005:727–50.
- 21. Henry S, McInnes BT. Literature Based Discovery: Models, methods, and trends. J Biomed Inform 2017;74:20–32. doi: 10.1016/j.jbi.2017.08.011
- 22. Patel V, Groen G. Knowledge Based Solution Strategies in Medical Reasoning. Cognitive Sci 1986;10:91–116. doi: 10.1207/s15516709cog1001_4
- 23. Jing X, Cimino JJ, Patel VL, et al. Data-driven hypothesis generation among inexperienced clinical researchers: A comparison of secondary data analyses with visualization (VIADS) and other tools. Journal of Clinical and Translational Science, under review 2023. doi: 10.1101/2023.05.30.23290719v1
- 24. Spangler S, Wilkins AD, Bachman BJ, et al. Automated hypothesis generation based on mining scientific literature. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, New York, USA: Association for Computing Machinery, 2014:1877–86.
- 25. Akujuobi U, Spranger M, Palaniappan SK, Zhang X. T-PAIR: Temporal Node-Pair Embedding for Automatic Biomedical Hypothesis Generation. IEEE Transactions on Knowledge and Data Engineering 2022;34(6):2988–3001. doi: 10.1109/TKDE.2020.3017687
- 26. Sybrandt J, Shtutman M, Safro I. Large-Scale Validation of Hypothesis Generation Systems via Candidate Ranking. Proc IEEE Int Conf Big Data 2018;2018:1494–503. doi: 10.1109/bigdata.2018.8622637
- 27. Sybrandt J, Carrabba A, Herzog A, Safro I. Are Abstracts Enough for Hypothesis Generation? 2018 IEEE International Conference on Big Data (Big Data); 2018; Seattle, WA, USA. IEEE; 1504–13.
- 28. Sybrandt J, Shtutman M, Safro I. Moliere: Automatic biomedical hypothesis generation system. ACM, 2017.
- 29. Wittkop T, TerAvest E, Evani US, et al. STOP using just GO: a multi-ontology hypothesis generation tool for high throughput experimentation. BMC Bioinformatics 2013;14:53. doi: 10.1186/1471-2105-14-53
- 30. Callahan A, Dumontier M, Shah NH. HyQue: evaluating hypotheses using Semantic Web technologies. Journal of Biomedical Semantics 2011;2:NA.
- 31. Jing X, Patel VL, Cimino JJ, et al. The Roles of a Secondary Data Analytics Tool and Experience in Scientific Hypothesis Generation in Clinical Research: Protocol for a Mixed Methods Study. JMIR Res Protoc 2022;11(7):e39414. doi: 10.2196/39414
- 32. Jing X, Emerson M, Masters D, et al. A visual interactive analysis tool for filtering and summarizing large data sets coded with hierarchical terminologies (VIADS). BMC Med Inform Decis Mak 2019;19(31). doi: 10.1186/s12911-019-0750-y
- 32. Jing X, Emerson M, Masters D, et al. A visual interactive analysis tool for filtering and summarizing large data sets coded with hierarchical terminologies (VIADS). BMC Med Inform Decis Mak 2019;19(31) doi: 10.1186/s12911-019-0750-y [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 33. Jing X, Cimino JJ. A complementary graphical method for reducing and analyzing large data sets: Case studies demonstrating thresholds setting and selection. Methods of Information in Medicine 2014;53 doi: 10.3414/ME13-01-0075 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 34. Jing X, Cimino JJ. Graphical methods for reducing, visualizing and analyzing large data sets using hierarchical terminologies. AMIA 2011. Washington DC, 2011:635–43. [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 35. Jing X, Patel VL, Cimino JJ, et al. A Visual Analytic Tool (VIADS) to Assist the Hypothesis Generation Process in Clinical Research: Mixed Methods Usability Study. JMIR Human Factors 2023;10:e44644. doi: doi: 10.2196/44644 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 36. Statistics CNCfH. NAMCS datasets and documentation. 2017
- 37. Farrugia P, Petrisor B, Farrokhyar F, Bhandari M. Research questions, hypotheses and objectives. J Can Chir 2010;50 [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 38. Pruzan P. Research Methodology: The Aims, Practices and Ethics of Science: Springer International Publishing; Switzerland, 2016. [ Google Scholar ]
- 39. Hicks CM. Research methods for clinical therapists: Applied project design and analysis. 1999 [ DOI ] [ PubMed ]
- 40. Misra DP, Gasparyan AY, Zimba O, Yessirkepov M, Agarwal V, Kitas GD. Formulating Hypotheses for Different Study Designs. J Korean Med Sci 2021;36(50):e338. doi: 10.3346/jkms.2021.36.e338 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 41. Foster JG, Rzhetsky A, Evans JA. Tradition and Innovation in Scientists’ Research Strategies. American Sociological Review 2015;80(5):875–908. [ Google Scholar ]
- 42. Jing X, Zhou Y, Cimino J, et al. Development, validation, and usage of metrics to evaluate clinical research hypothesis quality. BMC Medical Research Methodology, under review 2023. doi: 10.1101/2023.01.17.23284666v2 [ DOI ]
- 43. Patel V, Groen G, Patel Y. Cognitive aspects of clinical performance during patient workup: The role of medical expertise. Advances in Health Sciences Education 1997;2:95–114. [ DOI ] [ PubMed ] [ Google Scholar ]
- 44. Kushniruk A, Patel V, Marley A. Small worlds and medical expertise: implications for medical cognition and knowledge engineering. Int J Med Inform 1998;49:255–71. [ DOI ] [ PubMed ] [ Google Scholar ]
Scientific hypothesis generation process in clinical research: a secondary data analytic tool versus experience study protocol
Xia Jing (corresponding author: [email protected])
Background Scientific hypothesis generation is a critical step in scientific research that determines the direction and impact of any investigation. Despite its vital role, we have limited knowledge of the process itself, hindering our ability to address some critical questions.
Objective To what extent can secondary data analytic tools facilitate scientific hypothesis generation during clinical research? Are the processes used to develop clinical diagnoses in practice similar to those used to develop scientific hypotheses for clinical research projects? We explore the process of scientific hypothesis generation in the context of clinical research. The study is designed to compare the roles of VIADS, our web-based interactive secondary data analysis tool, and of participants' experience levels in the scientific hypothesis generation process.
Methods Inexperienced and experienced clinical researchers are recruited. In this 2×2 study design, all participants use the same data sets during scientific hypothesis-generation sessions and follow pre-determined scripts. Within each experience group, participants are randomly assigned to use either VIADS or other analytic tools. Study sessions are recorded, capturing participants' screen activities and audio. Participants follow the think-aloud protocol during the study sessions. After each study session, every participant completes a follow-up survey; participants who use VIADS complete an additional modified System Usability Scale (SUS) survey. A panel of clinical research experts will assess the generated scientific hypotheses using pre-developed metrics. All data will be anonymized, transcribed, aggregated, and analyzed.
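For readers who think in code, the 2×2 design described above can be summarized as a small assignment sketch. This is an illustration of the design only; the group labels, seed, and function below are ours and not part of the published protocol.

```python
import random

# Illustrative sketch of the 2x2 design described above: experience level
# (observed, not randomized) crossed with tool condition (VIADS vs. other
# analytic tools, randomized). All names here are hypothetical.

def assign_tool_condition(participant_id: str, experience: str,
                          rng: random.Random) -> dict:
    """Assign one recruited participant to a tool condition at random."""
    assert experience in {"inexperienced", "experienced"}
    tool = rng.choice(["VIADS", "other analytic tools"])
    return {"id": participant_id, "experience": experience, "tool": tool}

rng = random.Random(2022)  # fixed seed so the illustration is reproducible
roster = [("P01", "inexperienced"), ("P02", "experienced"),
          ("P03", "inexperienced"), ("P04", "experienced")]
for pid, exp in roster:
    print(assign_tool_condition(pid, exp, rng))
```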
Results This study is currently underway. Recruitment is ongoing via a brief online survey [1]. The preliminary results show that study participants can generate from a few to over a dozen scientific hypotheses during a 2-hour study session, regardless of whether they use VIADS or other analytic tools. A metric to assess scientific hypotheses within a clinical research context more accurately, comprehensively, and consistently has also been developed.
Conclusion The scientific hypothesis-generation process is an advanced and complex cognitive activity. Our current results show that clinical researchers can quickly generate initial scientific hypotheses from data sets and prior experience. However, refining these scientific hypotheses is much more time-consuming. To uncover the fundamental mechanisms of generating scientific hypotheses, we need breakthroughs that capture thinking processes more precisely.
Competing Interest Statement
The authors have declared no competing interest.
Clinical Trial
This study is not a clinical trial per NIH definition.
Funding Statement
The project is supported by a grant from the National Library of Medicine of the United States National Institutes of Health (R15LM012941) and partially supported by the National Institute of General Medical Sciences of the National Institutes of Health (P20 GM121342). The content is solely the author's responsibility and does not necessarily represent the official views of the National Institutes of Health.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The study has been approved by the Institutional Review Board (IRB) at Clemson University (IRB2020-056).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Data Availability
This manuscript is the study protocol. After we analyze and publish the results, transcribed, aggregated, de-identified data can be requested from the authors.
- Open access
- Published: 10 October 2012
Approaches to informed consent for hypothesis-testing and hypothesis-generating clinical genomics research
Flavia M Facio 1, Julie C Sapp 1, Amy Linn 1,2 & Leslie G Biesecker 1
BMC Medical Genomics volume 5, Article number: 45 (2012)
Massively-parallel sequencing (MPS) technologies create challenges for informed consent of research participants given the enormous scale of the data and the wide range of potential results.
We propose that the consent process in these studies be based on whether they use MPS to test a hypothesis or to generate hypotheses. To demonstrate the differences in these approaches to informed consent, we describe the consent processes for two MPS studies. The purpose of our hypothesis-testing study is to elucidate the etiology of rare phenotypes using MPS. The purpose of our hypothesis-generating study is to test the feasibility of using MPS to generate clinical hypotheses, and to approach the return of results as an experimental manipulation. Issues to consider in both designs include: volume and nature of the potential results, primary versus secondary results, return of individual results, duty to warn, length of interaction, target population, and privacy and confidentiality.
The categorization of MPS studies as hypothesis-testing versus hypothesis-generating can help to clarify the issue of so-called incidental or secondary results for the consent process, and aid the communication of the research goals to study participants.
Advances in DNA sequencing technologies and concomitant cost reductions have made the use of massively-parallel sequencing (MPS) in clinical research practicable for many researchers. Implementations of MPS include whole genome sequencing and whole exome sequencing, which we consider to be the same, for the purposes of informed consent. A challenge for researchers employing these technologies is to develop appropriate informed consent [ 1 , 2 ], given the enormous amount of information generated for each research participant, and the wide range of medically-relevant genetic results. Most of the informed consent challenges raised by MPS are not novel – what is novel is the scale and scope of genetic interrogation, and the opportunity to develop novel clinical research paradigms.
Massively-parallel sequencing has the capacity to detect nearly any disease-causing gene variant, including late-onset disorders, such as neurologic or cancer-susceptibility syndromes, subclinical disease or endo-phenotypes, such as impaired fasting glucose, and heterozygous carriers of traits inherited in a recessive pattern. Not only is the range of the disorders broad, but the variants have a wide range of relative risks from very high to nearly zero. This is a key distinction of MPS when compared to common SNP variant detection (using so-called gene chips). Because some variants discovered by MPS can be highly penetrant, the detection of such variants can have enormous medical and counseling impact. While many of these informed consent issues have been addressed previously [ 1 , 3 ], the use of MPS in clinical research combines these issues and is on a scale that is orders of magnitude greater than previous study designs.
The initial clinical research uses of MPS were a brute-force approach to the identification of mutations for rare Mendelian disorders [4]. This is a variation of positional cloning (also known as gene mapping) and thus a form of classical hypothesis-testing research. The hypothesis is that the phenotype under study is caused by a genetic variant, and a suite of techniques (in this case MPS) is employed to identify that causative variant. The application of this technology in this setting holds great promise and will identify causative gene variants for numerous traits, with some predicting that the majority of Mendelian disorders will be elucidated in 5–10 years.
The second of these pathways to discovery is a more novel approach of generating and then sifting MPS results as the raw material to allow the generation of clinical hypotheses, which are in turn used to design clinical experiments to discover the phenotype that is associated with that genotype. This approach we term hypothesis-generating clinical genomics. These hypothesis-generating studies require a consent process that provides the participant with an understanding of scale and scope of the interrogation, which is based on a contextual understanding of the goal and overall organization of the research since specific risks and benefits can be difficult to delineate [ 5 , 6 ]. Importantly, participants need to understand the notion that the researcher is exploring their genomes in an open-ended fashion, that the goal of the experiment is not predictable at the outset, and that the participant will be presented with downstream situations that are not currently foreseeable.
We outline here our approaches to informed consent for our hypothesis-testing and hypothesis-generating MPS research studies. We propose that the consent process be tailored depending on which of these two designs is used, and whether the research aims include study of the return of results.
General issues regarding return of results
Participants in our protocols have the option to learn their potentially clinically relevant genetic variant results. The issue of return of results is controversial, and the theoretical arguments for and against the return of results have been extensively debated [7]. Although an increasing body of literature describes the approaches taken by a few groups, no clear consensus exists in either the clinical genomics or bioethics community [8]. At one end of the spectrum are those who argue that no results should be returned [9]; at the other end, others contend that the entire sequence should be presented to the research participant [10–12]. In between these extremes lies a qualified or intermediate disclosure policy [13, 14]. We take the intermediate position in both of our protocols by giving research participants the choice to receive results, including variants deemed to be clinically actionable [3, 15]. Additionally, both protocols are investigating participants' intentions towards receiving different types of results in order to inform the disclosure policies within the projects and in the broader community [16]. Because one of our research goals is to study the issues surrounding return of results, it is appropriate and necessary to return results. Thus, the following discussion focuses on issues pertinent to studies that plan to return results.
Issues to consider
Issue #1: Primary versus secondary variant results and the open-ended nature of clinical genomics
In our hypothesis-testing study we distinguish variants as either primary or secondary variants, the distinction reflecting the purpose of the study. A primary variant is a mutation that causes the phenotype that is under study, i.e., the hypothesis that is being tested in the study. A secondary variant is any mutation result not related to the disorder under study, but discovered as part of the quest for the primary variant.
We prefer the term ‘secondary’ to ‘incidental’ because the latter is an adjective indicating chance occurrence, and the discovery of a disease causing mutation by MPS cannot be considered a chance occurrence. The word ‘incidental’ also suggests a lesser degree of importance or impact and it is important to recognize that secondary findings can be of greater medical or personal impact than primary findings.
The consent discussion about results potentially available from participation in a hypothesis-testing study is framed in terms of the study goal, and we assume a high degree of alignment between participants’ goals and the researchers’ aims with respect to primary variants. Participants are, in general, highly motivated to learn the primary variant result and we presume that this motivation contributed to their decision to enroll in the study, similar to motivations for those who have been involved in positional cloning studies. This motivation may not hold for secondary variants, but our approach is to offer them the opportunity to learn secondary and actionable variants that may substantially alter susceptibility to, or reproductive risk for, disease.
In the hypothesis-generating study design no categorical distinction (primary vs. secondary) is made among pathogenic variants, i.e., all variants are treated the same without the label of ‘primary’ or ‘secondary’. This is because we are not using MPS to uncover genetic variants for a specific disease, and any of the variants could potentially be used for hypothesis generation. We suggest that this is the most novel issue with respect to informed consent as the study is open-ended regarding its goals and downstream research activities. This is challenging for informed consent because it is impossible to know what types of hypotheses may be generated at the time of enrollment and consent.
Because the downstream research topics and activities are impossible to predict in hypothesis-generating research, subjects must be consented initially to the open-ended nature of the project. During the course of the study, they must be iteratively re-consented as hypotheses are generated from the genomic data and more specific follow-up studies are designed and proposed to test those newly generated hypotheses. These downstream, iterative consents will vary in their formality and in the degree to which they need to be reviewed and approved. Some general procedures can be approved in advance; for example, it may be anticipated that segregation studies would be useful to determine causality for sequence variants, or the investigator may simply wish to obtain some additional targeted medical or family history from the research subject. This could be approved prospectively by the IRB, with the iterative consent with the subject comprising a verbal discussion of the nature of the trait for which the segregation analysis or additional information is being sought. More specific, more invasive, or riskier iterative analyses would necessitate review and approval by the IRB with written informed consent.
Informed consent approach
The informed consent process must reflect the fundamental study design distinction of hypothesis-testing versus hypothesis-generating clinical genomics research. For the latter, the challenge is to help the research subjects understand that they are enrolling in a study that could lead to innumerable downstream research activities and goals. The informed consent process must be, like the research, iterative, and involve ongoing communication and consent with respect to those downstream activities.
Issue #2: Volume and nature of information
Whole genome sequencing can elucidate an enormous number of variations for a given individual. A typical whole genome sequence yields ~4,000,000 sequence variations. A whole exome sequence limits the interrogation to the coding regions of genes (about 1–1.5% of the genome) and typically generates 30,000–50,000 gene variants. While most are benign or of unknown consequence, some are associated with a significantly increased risk of disease for the individual and/or their family members. For example, the typical human is a carrier for three to five deleterious genetic variants or mutations that cause severe recessive diseases [17, 18]. In addition, there are over 30 known cancer susceptibility syndromes, which in aggregate may affect more than 1 in 500 patients, and the sequence variants that cause these disorders can be readily detected with MPS. These variants can have extremely high relative risks. For some disorders, a rare variant can be associated with a relative risk of greater than 1,000. This is in contrast with common SNP typing, which detects variants associated with small relative risks (typically on the order of 1.2–1.5). It is arguable whether the latter type of variant has any clinical utility as an individual test.
Conveying the full scope of genomic interrogation planned for each sample and the volume of information generated for a given participant is impossible. The goal and challenge in this instance is to give the participant as realistic a picture as possible of the likely amount of clinically actionable results the technology can generate. Our approach is two-fold: to give the subjects the clear message that the number and nature of the findings are enormous and literally impossible to describe in a comprehensive manner, and to use illustrative examples of the spectrum of these results.
To provide examples, we bin genetic variants into broad categories, as follows: heterozygous carriers of genetic variants implicated in recessive conditions (e.g., CFTR p.Phe508del and cystic fibrosis); variants that cause a treatable disorder that may be present, but asymptomatic or undiagnosed (e.g., LDLR p.Trp87X, familial hypercholesterolemia); variants that predispose to later-onset conditions (e.g., BRCA2 c.5946delT (commonly known as c.6174delT), breast and ovarian cancer susceptibility); variants that predispose to late-onset but untreatable disorders (e.g., frontotemporal dementia MAPT p.Pro301Leu).
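A minimal sketch of these example bins as a data structure may help readers scan them; the exemplar variants are taken directly from the text above, while the dictionary layout and helper function are our own illustrative assumptions.

```python
# Illustrative encoding of the example bins described above.
# Exemplar variants come from the text; the structure itself is hypothetical.
VARIANT_BINS = {
    "recessive carrier": ("CFTR p.Phe508del", "cystic fibrosis carrier state"),
    "treatable, possibly undiagnosed": ("LDLR p.Trp87X", "familial hypercholesterolemia"),
    "later-onset predisposition": ("BRCA2 c.5946delT", "breast and ovarian cancer susceptibility"),
    "late-onset, untreatable": ("MAPT p.Pro301Leu", "frontotemporal dementia"),
}

def describe_bin(name: str) -> str:
    variant, condition = VARIANT_BINS[name]
    return f"{name}: e.g., {variant} ({condition})"

for bin_name in VARIANT_BINS:
    print(describe_bin(bin_name))
```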
Additionally, the scale and scope of the results determines a near certainty that all participants will be found to harbor disease-causing mutations. This is because the interrogation of all genes brings to light the fact that the average human carries 3–5 recessive deleterious genes in addition to the risks for later onset or incompletely penetrant dominant disorders. This reality can be unsettling and surprising to research subjects and we believe it is important to address this early in the process, not downstream in the iterative phase. It is essential for the participants to choose whether MPS research is appropriate for them, taking into account their personal views and values.
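A small worked calculation, under an assumed simple model, shows why an actionable finding in every participant is a near certainty; the Poisson assumption and the mean of 4 carrier variants are ours, chosen only to match the 3–5 range quoted above.

```python
import math

# If the number of severe recessive carrier variants per person is modeled as
# Poisson with mean 4 (an assumption consistent with the 3-5 range above),
# the chance of carrying none at all is tiny.
mean_carriers = 4
p_zero = math.exp(-mean_carriers)   # P(no carrier variants) ~ 1.8%
p_at_least_one = 1 - p_zero         # ~98.2% carry at least one

print(f"P(no severe recessive carrier variant) ≈ {p_zero:.3f}")
print(f"P(at least one) ≈ {p_at_least_one:.3f}")
```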
Communicate to participants the overwhelming scale and scope of genetic results they may opt to receive, and provide them with specific disease examples that illustrate the kinds of decisions they may need to make as the results become available. These examples should also assist the research subjects in deciding whether to participate in the study and, if so, in anticipating the kinds of decisions they may need to make in the future as results become available.
Issue #3: Return of individual genotype results
The return of individual genotype results from MPS presents a new challenge in the clinical research environment, again because of the scale and breadth of the results. The genetic and medical counseling can be challenging because of the volume of results generated, participants’ expectations, the many different categories of results, and the length of time for the information to be available. We suggest that the most reasonable practice is to take a conservative approach and disclose only clinically actionable results. To this end, the absence of a deleterious gene variant (or a negative result) would not be disclosed to research participants. It is our understanding that it is mandatory to validate any individual results that are returned to research subjects in a CLIA-certified laboratory. Using current clinical practice as a standard or benchmark, we suggest that until other approaches are shown to be appropriate and effective, disclosure should take place during a face-to-face encounter involving a multidisciplinary team (geneticist, genetic counselor, and specialists on an ad-hoc basis based on the phenotype in question).
During the initial consent, participants are alerted to the fact that in the future the study team will contact them by telephone and their previously-stated preferences and impressions about receiving primary and secondary variant results will be reviewed. The logistics and details of this future conversation feature prominently in the initial informed consent session, as it is challenging to make and to receive such calls. Participants make a choice to learn or not learn a result each time a result becomes available. Once a participant makes the decision to learn a genotype result, the variant is confirmed in a CLIA lab, and a report is generated. The results are communicated to the participant during a face-to-face meeting with a geneticist and genetic counselor, and with the participation of other specialists depending on the case and the participant’s preferences. These phone discussions are seen as an extension of the initial informed consent process and as opportunities for the participants to make decisions in a more relevant and current context (compared to the original informed consent session). We see this as an iterative approach to consent, also known as circular consent [ 5 ]. Participants who opt not to learn a specific result can still be contacted later if other results become available, unless they choose not to be contacted by us any longer.
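Read as a sequence of steps, the disclosure pathway described above is: phone contact, a per-result choice, CLIA confirmation, then a face-to-face disclosure meeting. The sketch below restates that sequence; the class, function names, and the stubbed confirmation step are our own assumptions, not the study's software or procedure.

```python
from dataclasses import dataclass

@dataclass
class Participant:
    name: str
    wants_result: bool  # the choice is made each time a result becomes available

def clia_confirm(variant: str) -> bool:
    """Stand-in for confirmation of a research finding in a CLIA-certified lab."""
    return True  # assumed confirmed, for illustration only

def return_result(p: Participant, variant: str) -> str:
    if not p.wants_result:
        return f"{p.name}: declined this result; may be contacted again later"
    if not clia_confirm(variant):
        return f"{p.name}: variant not confirmed; no report generated"
    return (f"{p.name}: disclose {variant} face-to-face with a geneticist, "
            f"genetic counselor, and ad hoc specialists")

print(return_result(Participant("P01", wants_result=True), "LDLR p.Trp87X"))
```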
This approach to returning results is challenged by the hypothesis-generating genomics research approach. Participants in our hypothesis-generating protocol are not asked to make a decision about learning individual genotype results at the time of consent. This is because we cannot know the nature of the future potential finding at the time of the original consent. Rather, they are engaged in a discussion of what they currently imagine their preferences might be at some future date, again using exemplar disorders and hypothetical scenarios of hypothesis-generating studies.
In the hypothesis-generating study, we have distinct approaches for variants in known disease-causing genes versus variants in genes that are hypothesized to cause disease (the latter being the operative hypothesis-generating activity). For the former, the results are handled in a manner quite similar to the hypothesis-testing study. In the latter case, the participant may be asked if they would be willing to return for further phenotyping to help us determine the nature of the variant of uncertain clinical significance (VUCS). The participant is typically informed that they have a sequence variant and that we would like to learn, through clinical research, whether this variant has any phenotypic or clinical significance. It is emphasized that current knowledge does not show that the variant causes any phenotype and that the chances are high that the variant is benign. However, neither the gene nor the sequence variant is disclosed, and the research finding is not confirmed in a CLIA-certified lab. This type of VUCS would only be communicated back to the participant if the clinical research showed that the variant was causative, if the return of the result was determined medically appropriate by our Mutation Advisory Committee, and following confirmation in a CLIA-certified laboratory.
For the return of mutations in known, disease causing genes, the initial consent cannot comprehensively inform subjects of the nature of the diseases, because of the scale and scope of the potential results. Instead, exemplars are given to elicit general preferences, which are then affirmed or refined at the time results are available. Hypothesis-generating studies require that subjects receive sufficient information to make an informed choice about participation in the specific follow-up study, with return of individual results only if the cause and effect relationship is established, with appropriate oversight.
Issue #4: Duty to warn
Given the breadth of MPS gene interrogation, it is reasonable to anticipate that occasional participants may have mutations that pose a likely severe negative consequence, which we classify as “panic” results. This models clinical and research practice for the return of results such as a pulmonary mass or high serum potassium level. In contrast to the above-mentioned autosomal recessive carrier states that are expected to be nearly universal, genetic panic results should be uncommon. However, they should not be considered unanticipated – it is obvious that such variants will be detected, and the informed consent process should anticipate them. Examples would be deleterious variants for malignant hyperthermia or Long QT syndrome, either of which carries a substantial risk of sudden death that can be mitigated.
Both our hypothesis-testing and hypothesis-generating studies include mechanisms for the participants to indicate the types of results that they wish to have returned to them. In the hypothesis-testing mode of research this is primarily to respect the autonomy of the participants; in addition, for the hypothesis-generating study we are assessing the motivations and interests of the subjects in various types of results and manipulating the return of results as an experimental aim. It is our clinical research experience that participants are challenged by making decisions regarding possible future results that are rare but potentially severe. As well, the medical and social contexts of the subjects evolve over time, and the consent that was obtained at enrollment may not be relevant or appropriate at a later time when such a result arises. This is particularly relevant for a research study that is ongoing for substantial periods of time (see also point #7, below).
To address these issues we have consented the subjects to the potential return of “panic” results, irrespective of their preferences at the initial consent session. In effect, the consent process is for some participants a consent to override their preference.
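The duty-to-warn position above reduces to a simple decision rule: stated preferences govern disclosure except for “panic” results, which may be returned regardless. A hedged sketch of that rule follows; the severity labels and their boundaries are illustrative assumptions, not categories defined by the protocols.

```python
# Sketch of the disclosure rule implied above. Severity labels are ours.
PANIC = "panic"            # e.g., malignant hyperthermia, Long QT syndrome
ACTIONABLE = "actionable"  # clinically actionable but not immediately life-threatening
OTHER = "other"            # not returned under the conservative approach

def should_disclose(severity: str, opted_in: bool) -> bool:
    if severity == PANIC:
        return True        # duty to warn can override the stated preference
    if severity == ACTIONABLE:
        return opted_in    # respect the participant's choice
    return False

assert should_disclose(PANIC, opted_in=False)           # returned despite opting out
assert not should_disclose(ACTIONABLE, opted_in=False)  # preference respected
assert not should_disclose(OTHER, opted_in=True)        # non-actionable results withheld
```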
In both hypothesis-testing and hypothesis-generating research it is important to outline circumstances in which researchers’ duty-to-warn may result in a return of results that may be contrary to the preferences of the subject. It is essential that the subjects understand this approach to unusually severe mutation results. Subjects who are uncomfortable with this approach to return of results are encouraged to decline enrollment.
Issue #5: Length of researcher and participant interaction
Approaches to MPS data are evolving rapidly, and it is anticipated that this ongoing research into the significance of DNA variants will continue for years or decades. The different purposes of the two study designs lead to different endpoints in terms of the researcher's responsibility to analyze results. In our hypothesis-testing research, discussion of the relationship of the participants to the researchers is framed in terms of the discovery of the primary variant. We ask participants to be willing to interact with us for a period of months or years, as it is impossible for us to set a specific timeline to determine the cause of the disorder under investigation (if it is ever discovered). While attempts to elucidate the primary variant are underway, participants' genomic data are periodically annotated using the most current bioinformatic methodologies available. We conceptualize our commitment to return re-annotated and updated results to participants as diminishing, but not disappearing, after this initial results disclosure. As the primary aim of the study has been accomplished, less attention will be directed to the characterization of ancillary genomic data, yet we believe we retain an obligation to share highly clinically actionable findings with participants should we obtain them.
In the hypothesis-generating study the researcher’s responsibility to annotate participants’ genomes/exomes is ongoing. This is ongoing because, as noted above, one of the experimental aims is to study the motivations and interests of the subjects in these types of results. Determining how this motivation and interest fares over time is an important research goal. During the informed consent discussion it is emphasized that the iterative nature of result interpretation will lead to multiple meetings for the disclosure of clinically actionable results, and that the participant may be contacted months or years after the date of enrollment. Additionally, it is outlined that the participant will make a choice about learning the result each time he/she is re-contacted about the availability of a research finding, and that finding will only be confirmed in a CLIA-certified laboratory if the participant opts to learn the information. Participants who return to discuss results are reminded that they will be contacted in the future if and when other results deemed to be clinically actionable are found for that individual.
Describe the nature, mutual commitments, and duration of the researcher-participant relationship to participants. For hypothesis-testing studies it is appropriate that the intensity of the clinical annotation of secondary variants may decline when the primary goal of the study is met. For hypothesis-generating studies, such interactions may continue for as long as there are variants to be further evaluated and as long as the subject retains an interest in participating.
Issue #6: Target population
The informed consent process needs to take into account the target population in terms of their disease phenotype, age, and whether the goal is to enroll individual participants or families. These considerations represent the greatest divergence in approaches to informed consent when comparing hypothesis-testing and hypothesis-generating research. In our two studies, the hypothesis-testing study focuses on rare diseases and often family participation, whereas the hypothesis-generating study focuses on more common diseases and unrelated index cases. There are an infinite number of study designs and investigators may adapt our approaches to informed consent for their own designs.
Our hypothesis-testing protocol enrolls both individual participants and families (most commonly trios), the latter being more common. In hypothesis-testing research, many participants are either affected by a genetic disease or are a close relative (typically a parent) of a person with a genetic disease. The research participants must weigh their hope for, and personal meaning ascribed to, learning the genetic cause for their disorder against the possibility of being in a position to learn a significant amount of unanticipated information. Discussing and addressing the potential discrepancy of the participants’ expectations of the value of their results and what they may realistically stand to learn (both desired and undesired information) is a central component of the informed consent process.
In our hypothesis-testing protocol, when parents are consenting on behalf of a minor child, we review with them the issues surrounding genetic testing of children and discuss their attitudes regarding their child’s autonomy and their parental decision-making values. Because family trios (most often mother-father-child) are enrolled together, we discuss how one individual’s preferences regarding results may be disrupted or superseded by another family member’s choice and communication of that individual’s knowledge.
In contrast, our hypothesis-generating protocol enrolls as probands or primary participants older, unrelated individuals [ 19 ]. Most participants are self-selected in terms of their decision to enroll and are not enrolled because they or a relative have a rare disease. Participants in the hypothesis-generating protocol are consented for future exploration of any and all possible phenotypes. This is a key distinguishing feature of this hypothesis-generating approach to research, which is a different paradigm – going from genotype to phenotype. The participants may be invited for additional phenotyping. In fact, multiple satellite studies are ongoing to evaluate various subsets of participants for different phenotypes. The key with the consent for these subjects is to initially communicate to the subjects the general approach – that their genome will be explored, variations will be identified, and they may be re-contacted for a potential follow-up study to understand the potential relationship of that variant to their phenotype. These subsequent consents for follow-up studies are considered an iterative consent process, which is similar to the Informed Cohort concept [ 20 ].
Hypothesis-generating research is a novel approach to clinical research design and requires an ongoing, iterative approach to informed consent. For hypothesis-testing research a key informed consent issue is for the subjects to balance the desire for information on the primary disease causing mutation with the pros and cons of obtaining possibly undesired information on secondary variants.
Issue #7: Privacy and confidentiality
In MPS studies, privacy and confidentiality is a complex and multifaceted issue. Some potential challenges include: the deposition of genetic and phenotypic data in public databases, the placement of CLIA-validated results in the individual’s medical chart, and the discovery of secondary variants in relatives of affected probands in family-based (typically hypothesis-testing) research.
The field of genomics has a tradition of deposition of data in publicly accessible databases. Participants in our protocols are informed that the goal of sharing de-identified information in public databases is to advance research, and that there are methods in place to maximize the privacy and confidentiality of personally identifiable information. However, the deposition of genomic-scale data for an individual participant, such as an MPS sequence, is far above the minimal amount of data needed to uniquely identify the sample [21, 22]. Therefore, the participants should be made aware that the scale of the data could allow analysts to connect sequence data to individuals by matching variants in the deposited research data to other data from that person. As well, the public deposition of data in some cases is an irrevocable decision. Once the data are deposited and distributed, it may be impossible to remove the data from all computer servers, should the subject decide to withdraw from the study.
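A back-of-envelope calculation illustrates why genome-scale data far exceed what is needed to identify an individual. The SNP count and genotype frequencies below are illustrative assumptions in the spirit of the cited estimates [21, 22], not figures taken from those papers.

```python
# Rough identifiability arithmetic under simplifying assumptions:
# ~40 statistically independent common SNPs, each with genotype
# frequencies of roughly 0.25 / 0.50 / 0.25 (allele frequency 0.5).
match_per_snp = 0.25**2 + 0.50**2 + 0.25**2   # chance two unrelated people share a genotype at one SNP
n_snps = 40
profile_match = match_per_snp ** n_snps        # ~9e-18: chance of matching at all 40 SNPs
expected_matches = 8e9 * profile_match         # expected coincidental matches among ~8 billion people

print(f"per-SNP match probability: {match_per_snp}")
print(f"expected coincidental matches worldwide: {expected_matches:.1e}")  # ~7e-8, effectively zero
```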
Additionally, participants are informed that once a result is CLIA-certified, that result is placed in the individual’s medical chart of the clinical research institution and may be accessible by third parties. Although there are state and federal laws to protect individuals against genetic discrimination, including GINA, this law has not yet been tested in the courts. This is explained to participants up front at the time of enrollment and a more detailed discussion takes place at the time of results disclosure. To offer additional protection in the event of a court subpoena, a Certificate of Confidentiality has been obtained in the hypothesis-testing and hypothesis-generating protocols. The discussion surrounding privacy and confidentiality is approached in a similar manner in both protocols.
The third issue regarding confidentiality is that MPS can generate many results in each individual, and it is highly likely that some, if not all, of the variants detected in one research participant may be present in another research participant (e.g., a parent). This is again a consequence of the scale and breadth of MPS: the large number of variants that can be detected in each participant makes it exceedingly likely that their relatives share many of these variants and that their relatives' genetic risks of rare diseases may be measurably altered. It is important to communicate to the participants that such variants are likely to be detected, that they may have implications for other members of the family, and that the consented individuals, or their parents, may need to communicate those results to other members of the family.
The informed consent should include discussion of public deposition of data, the entry of CLIA-validated results into medical records, and the likely discovery of variants with implications for family members.
We describe an approach to the informed consent process as a mutual opportunity for researchers and participants to assess one another's goals in MPS protocols that employ both hypothesis-generating and hypothesis-testing methodologies. The use of MPS in clinical research requires adaptation of established processes of human subjects protections. The potentially overwhelming scale of information generated by MPS necessitates that investigators and IRBs adapt traditional approaches to consenting subjects. Because nearly all subjects will have a clinically actionable result, investigators must implement a thoughtful plan for consent regarding results disclosure, including setting a threshold for the types of information that should be disclosed to the participants.
While some of the informed consent issues for MPS are independent of the study design, others should be adapted based on whether the research study is employing MPS to test a hypothesis (i.e., find the cause of a rare condition in an affected cohort), or to generate hypotheses (i.e., find deleterious or potentially deleterious variants that warrant participant follow-up and further investigation). For example, the health-related attributes of the study cohort (healthy individuals versus disease patients) are likely to influence participants’ motivations and expectations of MPS, and in the case of a disease cohort create the need to dichotomize the genetic variants into primary and secondary. Conversely, issues inherent to MPS technology are central to the informed consent approach in both types of studies. The availability of MPS allows a paradigm shift in genetics research – no longer are investigators constrained to long-standing approaches of hypothesis-testing modes of research. The scale of MPS allows investigators to proceed from genotype to phenotype, and leads to new challenges for genetic and medical counseling. Research participants receiving results from MPS might not present with a personal and/or family history suggestive of conditions revealed by their genotypic variants, and consequently might not perceive their a priori risk to be elevated for those conditions.
Participants’ motivations to have whole genome/exome sequencing at this early stage are important to take into consideration in the informed consent process. Initial qualitative data suggest that individuals enroll in the hypothesis-generating study because of altruism in promoting research, and a desire to learn about genetic factors that contribute to their own health and disease risk [ 23 ]. Most participants expect that genomic information will improve the overall knowledge of disease causes and treatments. Moreover, data on research participants’ preferences to receive different types of genetic results suggest that they have strong intentions to receive all types of results [ 16 ]. However, they are able to discern between the types and quality of information they could learn, and demonstrate stronger attitudes to learn clinically actionable and carrier status results when compared to results that are uncertain or not clinically actionable. These findings provide initial insights into the value these early adopters place on information generated by high-throughput sequencing studies, and help us tailor the informed consent process to this group of individuals. However, more empirical data are needed to guide the informed consent process, including data on research participants’ ability to receive results for multiple disorders and traits.
Participants in both types of studies are engaged in a discussion of the complex and dynamic nature of genomic annotation so that they may make an informed decision about participation and may be aware of the need to revisit results learned at additional time points in the future. As well, we advocate a process whereby investigators retain some latitude with respect to the most serious, potentially life-threatening mutations. While it is mandatory to respect the autonomy of research subjects, this does not mean that investigators must accede to the research subject’s views of these “panic” results. In a paradoxical way, the research participant and the researcher can agree that the latter can maintain a small, but initially ambiguous degree of latitude with respect to these most serious variants. In the course of utilizing MPS technology for further elucidation of the genetic architecture of health and disease, it is imperative that research participants and researchers be engaged in a continuous discussion about the state of scientific knowledge and the types of information that could potentially be learned from MPS. Although resource-intensive, this “partnership model” [ 2 ] or informed cohort approach to informed consent promotes respect for participants, and allows evaluation of the benefits and harms of disclosure in a more timely and relevant manner.
We have here proposed a categorization of massively-parallel clinical genomics research studies as hypothesis-testing versus hypothesis-generating to help clarify the issue of so-called incidental or secondary results for the consent process, and aid the communication of the research goals to study participants. By using this categorization approach and considering seven important features of this kind of research (Primary versus secondary variant results and the open-ended nature of clinical genomics, Volume and nature of information, Return of individual genotype results, Duty to warn, Length of researcher and participant interaction, Target population, and Privacy and confidentiality) researchers can design an informed consent process that is open, transparent, and appropriately balances risks and benefits of this exciting approach to heritable disease research.
This study was supported by funding from the Intramural Research Program of the National Human Genome Research Institute. The authors have no conflicts to declare.
Netzer C, Klein C, Kohlhase J, Kubisch C: New challenges for informed consent through whole genome array testing. J Med Genet. 2009, 46: 495-496. 10.1136/jmg.2009.068015.
McGuire AL, Beskow LM: Informed consent in genomics and genetic research. Annu Rev Genomics Hum Genet. 2010, 11: 361-381. 10.1146/annurev-genom-082509-141711.
Bookman EB, Langehorne AA, Eckfeldt JH, Glass KC, Jarvik GP, Klag M, Koski G, Motulsky A, Wilfond B, Manolio TA, Fabsitz RR, Luepker RV, NHLBI Working Group: Reporting genetic results in research studies: Summary and recommendations of an NHLBI Working Group. Am J Med Genet A. 2006, 140: 1033-1040.
Ng PC, Kirkness EF: Whole genome sequencing. Methods Mol Biol. 2010, 628: 215-226. 10.1007/978-1-60327-367-1_12.
Mascalzoni D, Hicks A, Pramstaller P, Wjst M: Informed consent in the genomics era. PLoS Med. 2008, 5: e192-10.1371/journal.pmed.0050192.
Rotimi CN, Marshall PA: Tailoring the process of informed consent in genetic and genomic research. Genome Med. 2010, 2: 20-10.1186/gm141.
Bredenoord AL, Kroes HY, Cuppen E, Parker M, van Delden JJ: Disclosure of individual genetic data to research participants: the debate reconsidered. Trends Genet. 2011, 27: 41-47. 10.1016/j.tig.2010.11.004.
Kronenthal C, Delaney SK, Christman MF: Broadening research consent in the era of genome-informed medicine. Genet Med. 2012, 14: 432-436. 10.1038/gim.2011.76.
Forsberg JS, Hansson MG, Eriksson S: Changing perspectives in biobank research: from individual rights to concerns about public health regarding the return of results. Eur J Hum Genet. 2009, 17: 1544-1549. 10.1038/ejhg.2009.87.
Shalowitz DI, Miller FG: Disclosing individual results of clinical research: implications of respect for participants. JAMA. 2005, 294: 737-740. 10.1001/jama.294.6.737.
Fernandez CV, Kodish E, Weijer C: Informing study participants of research results: an ethical imperative. IRB. 2003, 25: 12-19.
McGuire AL, Lupski JR: Personal genome research: what should the participant be told?. Trends Genet. 2010, 26: 199-201. 10.1016/j.tig.2009.12.007.
Wolf SM, Lawrenz FP, Nelson CA, Kahn JP, Cho MK, Clayton EW, Fletcher JG, Georgieff MK, Hammerschmidt D, Hudson K, Illes J, Kapur V, Keane MA, Koenig BA, Leroy BS, McFarland EG, Paradise J, Parker LS, Terry SF, Van Ness B, Wilfond BS: Managing incidental findings in human subjects research: analysis and recommendations. J Law Med Ethics. 2008, 36: 219-248. 10.1111/j.1748-720X.2008.00266.x.
Kohane IS, Taylor PL: Multidimensional results reporting to participants in genomic studies: Getting it right. Sci Transl Med. 2010, 2: 37cm19-10.1126/scitranslmed.3000809.
Fabsitz RR, McGuire A, Sharp RR, Puggal M, Beskow LM, Biesecker LG, Bookman E, Burke W, Burchard EG, Church G, Clayton EW, Eckfeldt JH, Fernandez CV, Fisher R, Fullerton SM, Gabriel S, Gachupin F, James C, Jarvik GP, Kittles R, Leib JR, O'Donnell C, O'Rourke PP, Rodriguez LL, Schully SD, Shuldiner AR, Sze RK, Thakuria JV, Wolf SM, Burke GL, National Heart, Lung, and Blood Institute working group: Ethical and practical guidelines for reporting genetic research results to study participants: updated guidelines from a national heart, lung, and blood institute working group. Circ Cardiovasc Genet. 2010, 3: 574-580. 10.1161/CIRCGENETICS.110.958827.
Facio FM, Fisher T, Eidem H, Brooks S, Linn A, Biesecker LG, Biesecker BB: Intentions to receive individual results from whole-genome sequencing among participants in the ClinSeq™ study. Eur J Hum Genet. in press
Morton NE: The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am J Hum Genet. 1956, 8: 80-96.
Morton NE: The mutational load due to detrimental genes in man. Am J Hum Genet. 1960, 12: 348-364.
Biesecker LG, Mullikin JC, Facio FM, Turner C, Cherukuri PF, Blakesley RW, Bouffard GG, Chines PS, Cruz P, Hansen NF, Teer JK, Maskeri B, Young AC, Manolio TA, Wilson AF, Finkel T, Hwang P, Arai A, Remaley AT, Sachdev V, Shamburek R, Cannon RO, Green ED, NISC Comparative Sequencing Program: The ClinSeq Project: piloting large-scale genome sequencing for research in genomic medicine. Genome Res. 2009, 19: 1665-1674. 10.1101/gr.092841.109.
Kohane IS, Mandl KD, Taylor PL, Holm IA, Nigrin DJ, Kunkel LM: Medicine. Reestablishing the researcher-patient compact. Science. 2007, 316: 836-837. 10.1126/science.1135489.
Lin Z, Owen AB, Altman RB: Genomic Research and Human Subject Privacy. Science. 2004, 305: 183-10.1126/science.1095019.
Homer N, Szelinger S, Redman M, Duggan D, Tembe W, Muehling J, Pearson JV, Stephan DA, Nelson SF, Craig DW: Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 2008, 4: e1000167.
Facio FM, Brooks S, Loewenstein J, Green S, Biesecker LG, Biesecker BB: Motivators for participation in a whole-genome sequencing study: implications for translational genomics research. Eur J Hum Genet. 2011, 19: 1213-1217. 10.1038/ejhg.2011.123.
Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1755-8794/5/45/prepub
Author information
Authors and affiliations.
National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
Flavia M Facio, Julie C Sapp, Amy Linn & Leslie G Biesecker
Kennedy Krieger Institute, Baltimore, MD, USA
Corresponding author
Correspondence to Leslie G Biesecker.
Additional information
Competing interests.
LGB is an uncompensated consultant to, and collaborates with, the Illumina Corp.
Authors’ contributions
FMF and JCS drafted the initial manuscript. LGB organized and edited the manuscript. All authors read and approved the final manuscript.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article.
Facio, F.M., Sapp, J.C., Linn, A. et al. Approaches to informed consent for hypothesis-testing and hypothesis-generating clinical genomics research. BMC Med Genomics 5, 45 (2012). https://doi.org/10.1186/1755-8794-5-45
Received: 07 November 2011
Accepted: 05 October 2012
Published: 10 October 2012
DOI: https://doi.org/10.1186/1755-8794-5-45
- Whole genome sequencing
- Whole exome sequencing
- Informed consent
February 3rd, 2016
Putting hypotheses to the test: we must hold ourselves accountable to decisions made before we see the data..
We are giving $1,000 prizes to 1,000 scholars simply for making clear when data were used to generate or test a hypothesis. Science is the best tool we have for understanding the way the natural world works. Unfortunately, it is in our imperfect hands. Though scientists are curious and can be quite clever, we also fall victim to biases that can cloud our vision. We seek rewards from our community, we ignore information that contradicts what we believe, and we are capable of elaborate rationalizations for our decisions. We are masters of self-deception.
Yet we don’t want to be. Many scientists choose their career because they are curious and want to find real answers to meaningful questions. In its idealized form, science is a process of proposing explanations and then using data to expose their weaknesses and improve them. This process is both elegant and brutal. It is elegant when we find a new way to explain the world, a way that no one has thought of before. It is brutal in a way that is familiar to any graduate student who has proposed an experiment to a committee or to any researcher who has submitted a paper for peer review. Logical errors, alternative explanations, and falsification are not just common – they are baked into the process. The two actions, building and tearing down, are both crucial to advancing our knowledge. Building pushes our potential knowledge a bit further than it was before. Tearing down separates the wheat from the chaff. It exposes that new potential explanation to every conceivable test to see if it survives.
Image credit: Winnowing Grain, Eastman Johnson (Museum of Fine Arts, Boston)
Using data to generate potential discoveries and using data to subject those discoveries to tests are distinct processes. This distinction is known as exploratory (or hypothesis-generating) research and confirmatory (or hypothesis-testing) research. In the daily practice of doing research, it is easy to confuse which one is being done. But there is a way – preregistration. Preregistration defines how a hypothesis or research question will be tested – the methodology and analysis plan. It is written down in advance of looking at the data, and it maximizes the diagnosticity of the statistical inferences used to test the hypothesis. After the confirmatory test, the data can then be subjected to any exploratory analyses to identify new hypotheses that can be the focus of a new study. In this way, preregistration provides an unambiguous distinction between exploratory and confirmatory research.
To illustrate how confirmatory and exploratory approaches can be easily confused, picture a path through a garden, forking at regular intervals, as it spreads out into a wide tree. Each split in this garden of forking paths is a decision that can be made when analysing a data set. Do you exclude these samples because they are too extreme? Do you control for income/age/height/wealth? Do you use the mean or median of the measurements? Each decision can be perfectly justifiable and seem insignificant in the moment. After a few of these decisions there exists a surprisingly large number of reasonable analyses. One quickly reaches the point where there are so many of these reasonable analyses that the traditional threshold of statistical significance, p < .05, or 1 in 20, can be obtained by chance alone.
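To get a feel for the arithmetic behind that claim, here is a rough simulation (the parameters are made up for illustration: 20 analysis paths, data that contain no real effect). It is an idealized sketch that treats each path as an independent look at the data, which real forking paths are not, but the inflation it shows is the point:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_studies, n_paths, n_per_group = 5000, 20, 30

studies_with_a_hit = 0
for _ in range(n_studies):
    for _ in range(n_paths):
        # Each "reasonable analysis" is modeled as a fresh two-group comparison of pure noise.
        a = rng.normal(size=n_per_group)
        b = rng.normal(size=n_per_group)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            studies_with_a_hit += 1
            break

print(f"Null studies yielding at least one p < .05: {studies_with_a_hit / n_studies:.0%}")
# Roughly 1 - 0.95**20, i.e. around 64%, rather than the nominal 5%.
```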
If we don’t have strong reasons to make these decisions ahead of time, we are simply exploring the dataset for the path that tells the most interesting story. Once we find that interesting story, bolstered by the weight of statistical significance, every decision on that path becomes even more justified, and all of the reasonable, alternative paths are forgotten. Without us realizing what we have done, the diagnosticity of our statistical inferences is gone. We have no idea if our significant result is a product of accumulated luck with random error in the data, or if it is revealing a truly unusual result worthy of interpretation.
This is why we must hold ourselves accountable to decisions made before seeing the data. Without putting those reasons into a time-stamped, uneditable plan, it becomes nearly impossible to avoid making decisions that lead to the most interesting story. This is what preregistration does. Without preregistration, we effectively change our hypothesis as we make those decisions along the forking path. The work that we thought was confirmatory becomes exploratory without us even realizing it.
I am advocating for a way to make sure the data we use to create our explanations is separated from the data that we use to test those explanations. Preregistration does not put science in chains. Scientists should be free to explore the garden and to advance knowledge. Novelty, happenstance, and unexpected findings are core elements of discovery. However, when it comes time to put our new explanations to the test, we will make progress more efficiently and effectively by being as rigorous and as free from bias as possible.
Preregistration is effective. After the United States required that all clinical trials of new treatments on human subjects be preregistered, the rate of finding a significant effect on the primary outcome variable fell from 57% to just 8% within a group of 55 cardiovascular studies. This suggests that flexibility in analytical decisions had an enormous effect on the analysis and publication of these large studies. Preregistration is supported by journals and research funders. Taking this step will show that you are taking every reasonable precaution to reach the most robust conclusions possible, and will improve the weight of your assertions.
Most scientists, when testing a hypothesis, do not specify key analytical decisions prior to looking through a dataset. It’s not what we’re trained to do. We at the Center for Open Science want to change that. We will be giving 1,000 researchers $1,000 prizes for publishing the results of preregistered work. You can be one of them. Begin your preregistration by going to https://cos.io/prereg.
Note: This article gives the views of the author(s), and not the position of the LSE Impact blog, nor of the London School of Economics. Please review our Comments Policy if you have any concerns on posting a comment below.
About the Author:
David Mellor is a Project Manager at the Center for Open Science and works to encourage preregistration. He received his PhD from Rutgers University in Ecology and Evolution and has been an active researcher in the behavioral ecology and citizen science communities.
I strongly agree with almost all of this. One question, though. I sometimes take part in studies that use path models. It can happen that a referee suggests an additional pathway that makes sense to us. But this would not have been in the original specification of the model. Come to think of it this kind of thing must happen pretty often. How would you view that?
That is a great point and is a very frequent occurrence. I think that the vast majority of papers come out of peer review with one or more changes in how the data are analyzed. The best way to handle that is with transparency: “The following, additional paths (or tests, interactions, correlations, etc.) were conducted after data collection was complete…” The important distinction is to not present those new pathways as simply part of the a-priori tests or to lump them with the same analyses presented initially and planned ahead of time. This way, the reader will be able to properly situate those new tests in the complete body of evidence presented in the paper. After data collection and initial analysis, any new tests were created by being influenced by the data and are, in essence, a new hypothesis that is now being tested with the same data that was used to create it. That new test can be confirmed with a later follow-up study using newly collected data.
Doesn’t this just say – we can only be honest by being rigid? It carries hypothetico-deductive ‘logic’ to a silly extreme, ignoring the inherently iterative process of theorization, recognition of interesting phenomena, and data analysis. But, creative research is not like this. How can you formulate meaningful hypotheses without thinking about and recognizing patterning in the data – the two go hand in hand, and are not the same as simply ‘milking’ data for significant results.
Hi Patrick, Thank you for commenting. I very much agree that meaningful hypotheses cannot be made without recognizing patterns in the data. That may be the best way to make a reasonable hypothesis. However, the same data that are used to create the hypothesis cannot be used to test that same hypothesis, and this is what preregistration does. It makes it clear to ourselves exactly what the hypothesis is before seeing the data, so that the data aren’t then used to subtly change/create a new hypothesis. If it does, fine, great! But that is hypothesis building, not hypothesis testing. That is exploratory work, not confirmatory work.
- Open access
- Published: 09 July 2024
Automating psychological hypothesis generation with AI: when large language models meet causal graph
- Song Tong (ORCID: 0000-0002-4183-8454)
- Kai Mao
- Zhen Huang
- Yukun Zhao
- Kaiping Peng
Humanities and Social Sciences Communications volume 11, Article number: 896 (2024)
- Science, technology and society
Leveraging the synergy between causal knowledge graphs and a large language model (LLM), our study introduces a groundbreaking approach for computational hypothesis generation in psychology. We analyzed 43,312 psychology articles using an LLM to extract causal relation pairs. This analysis produced a specialized causal graph for psychology. Applying link prediction algorithms, we generated 130 potential psychological hypotheses focusing on “well-being”, then compared them against research ideas conceived by doctoral scholars and those produced solely by the LLM. Interestingly, our combined approach of an LLM and causal graphs mirrored expert-level insights in terms of novelty, clearly surpassing the LLM-only hypotheses (t(59) = 3.34, p = 0.007 and t(59) = 4.32, p < 0.001, respectively). This alignment was further corroborated using deep semantic analysis. Our results show that combining LLMs with machine learning techniques such as causal knowledge graphs can revolutionize automated discovery in psychology, extracting novel insights from the extensive literature. This work stands at the crossroads of psychology and artificial intelligence, championing a new enriched paradigm for data-driven hypothesis generation in psychological research.
Introduction.
In an age in which the confluence of artificial intelligence (AI) with various subjects profoundly shapes sectors ranging from academic research to commercial enterprises, dissecting the interplay of these disciplines becomes paramount (Williams et al., 2023 ). In particular, psychology, which serves as a nexus between the humanities and natural sciences, consistently endeavors to demystify the complex web of human behaviors and cognition (Hergenhahn and Henley, 2013 ). Its profound insights have significantly enriched academia, inspiring innovative applications in AI design. For example, AI models have been molded on hierarchical brain structures (Cichy et al., 2016 ) and human attention systems (Vaswani et al., 2017 ). Additionally, these AI models reciprocally offer a rejuvenated perspective, deepening our understanding from the foundational cognitive taxonomy to nuanced esthetic perceptions (Battleday et al., 2020 ; Tong et al., 2021 ). Nevertheless, the multifaceted domain of psychology, particularly social psychology, has exhibited a measured evolution compared to its tech-centric counterparts. This can be attributed to its enduring reliance on conventional theory-driven methodologies (Henrich et al., 2010 ; Shah et al., 2015 ), a characteristic that stands in stark contrast to the burgeoning paradigms of AI and data-centric research (Bechmann and Bowker, 2019 ; Wang et al., 2023 ).
In the journey of psychological research, each exploration originates from a spark of innovative thought. These research trajectories may arise from established theoretical frameworks, daily event insights, anomalies within data, or intersections of interdisciplinary discoveries (Jaccard and Jacoby, 2019 ). Hypothesis generation is pivotal in psychology (Koehler, 1994 ; McGuire, 1973 ), as it facilitates the exploration of multifaceted influencers of human attitudes, actions, and beliefs. The HyGene model (Thomas et al., 2008 ) elucidated the intricacies of hypothesis generation, encompassing the constraints of working memory and the interplay between ambient and semantic memories. Recently, causal graphs have provided psychology with a systematic framework that enables researchers to construct and simulate intricate systems for a holistic view of “bio-psycho-social” interactions (Borsboom et al., 2021 ; Crielaard et al., 2022 ). Yet, the labor-intensive nature of the methodology poses challenges, which requires multidisciplinary expertise in algorithmic development, exacerbating the complexities (Crielaard et al., 2022 ). Meanwhile, advancements in AI, exemplified by models such as the generative pretrained transformer (GPT), present new avenues for creativity and hypothesis generation (Wang et al., 2023 ).
Building on this, notably large language models (LLMs) such as GPT-3, GPT-4, and Claude-2, which demonstrate profound capabilities to comprehend and infer causality from natural language texts, a promising path has emerged to extract causal knowledge from vast textual data (Binz and Schulz, 2023 ; Gu et al., 2023 ). Exciting possibilities are seen in specific scenarios in which LLMs and causal graphs manifest complementary strengths (Pan et al., 2023 ). Their synergistic combination converges human analytical and systemic thinking, echoing the holistic versus analytic cognition delineated in social psychology (Nisbett et al., 2001 ). This amalgamation enables fine-grained semantic analysis and conceptual understanding via LLMs, while causal graphs offer a global perspective on causality, alleviating the interpretability challenges of AI (Pan et al., 2023 ). This integrated methodology efficiently counters the inherent limitations of working and semantic memories in hypothesis generation and, as previous academic endeavors indicate, has proven efficacious across disciplines. For example, a groundbreaking study in physics synthesized 750,000 physics publications, utilizing cutting-edge natural language processing to extract 6368 pivotal quantum physics concepts, culminating in a semantic network forecasting research trajectories (Krenn and Zeilinger, 2020 ). Additionally, by integrating knowledge-based causal graphs into the foundation of the LLM, the LLM’s capability for causative inference significantly improves (Kıcıman et al., 2023 ).
To this end, our study seeks to build a pioneering analytical framework, combining the semantic and conceptual extraction proficiency of LLMs with the systemic thinking of the causal graph, with the aim of crafting a comprehensive causal network of semantic concepts within psychology. We meticulously analyzed 43,312 psychological articles, devising an automated method to construct a causal graph, and systematically mining causative concepts and their interconnections. Specifically, the initial sifting and preparation of the data ensures a high-quality corpus, and is followed by employing advanced extraction techniques to identify standardized causal concepts. This results in a graph database that serves as a reservoir of causal knowledge. In conclusion, using node embedding and similarity-based link prediction, we unearthed potential causal relationships, and thus generated the corresponding hypotheses.
To gauge the pragmatic value of our network, we selected 130 hypotheses on “well-being” generated by our framework, comparing them with hypotheses crafted by novice experts (doctoral students in psychology) and the LLM models. The results are encouraging: Our algorithm matches the caliber of novice experts, outshining the hypotheses generated solely by the LLM models in novelty. Additionally, through deep semantic analysis, we demonstrated that our algorithm contains more profound conceptual incorporations and a broader semantic spectrum.
Our study advances the field of psychology in two significant ways. Firstly, it extracts invaluable causal knowledge from the literature and converts it to visual graphics. These aids can feed algorithms to help deduce more latent causal relations and guide models in generating a plethora of novel causal hypotheses. Secondly, our study furnishes novel tools and methodologies for causal analysis and scientific knowledge discovery, representing the seamless fusion of modern AI with traditional research methodologies. This integration serves as a bridge between conventional theory-driven methodologies in psychology and the emerging paradigms of data-centric research, thereby enriching our understanding of the factors influencing psychology, especially within the realm of social psychology.
Methodological framework for hypothesis generation
The proposed LLM-based causal graph (LLMCG) framework encompasses three steps: literature retrieval, causal pair extraction, and hypothesis generation, as illustrated in Fig. 1. In the literature gathering phase, ~140k psychology-related articles were downloaded from public databases. In step two, GPT-4 was used to distil causal relationships from these articles, culminating in the creation of a causal relationship network based on 43,312 selected articles. In the third step, an in-depth examination of these data was executed, adopting link prediction algorithms to forecast the dynamics within the causal relationship network and to identify concept pairs with a high probability of causal linkage.
Note: LLM stands for large language model; LLMCG algorithm stands for LLM-based causal graph algorithm, which includes the processes of literature retrieval, causal pair extraction, and hypothesis generation.
Step 1: Literature retrieval
The primary data source for this study was a public repository of scientific articles, the PMC Open Access Subset. Our decision to utilize this repository was informed by several key attributes that it possesses. The PMC Open Access Subset boasts an expansive collection of over 2 million full-text XML science and medical articles, providing a substantial and diverse base from which to derive insights for our research. Furthermore, the open-access nature of the articles not only enhances the transparency and reproducibility of our methodology, but also ensures that the results and processes can be independently accessed and verified by other researchers. Notably, the content within this subset originates from recognized journals, all of which have undergone rigorous peer review, lending credence to the quality and reliability of the data we leveraged. Finally, an added advantage was the rich metadata accompanying each article. These metadata were instrumental in refining our article selection process, ensuring coherent thematic alignment with our research objectives in the domains of psychology.
To identify articles relevant to our study, we applied a series of filtering criteria. First, the presence of certain keywords within article titles or abstracts was mandatory. Some examples of these keywords include “psychol”, “clin psychol”, and “biol psychol”. Second, we exploited the metadata accompanying each article. The classification of articles based on these metadata ensured alignment with recognized thematic standards in the domains of psychology and neuroscience. Upon the application of these criteria, we managed to curate a subset of approximately 140K articles that most likely discuss causal concepts in both psychology and neuroscience.
Step 2: Causal pair extraction
The process of extracting causal knowledge from vast troves of scientific literature is intricate and multifaceted. Our methodology distils this complex process into four coherent steps, each serving a distinct purpose. (1) Article selection and cost analysis: Determines the feasibility of processing a specific volume of articles, ensuring optimal resource allocation. (2) Text extraction and analysis: Ensures the purity of the data that enter our causal extraction phase by filtering out nonrelevant content. (3) Causal knowledge extraction: Uses advanced language models to detect, classify, and standardize the causal factors and relationships present in texts. (4) Graph database storage: Facilitates structured storage, easy retrieval, and the possibility of advanced relational analyses for future research. This streamlined approach ensures accuracy, consistency, and scalability in our endeavor to understand the interplay of causal concepts in psychology and neuroscience.
Text extraction and cleaning
After a meticulous cost analysis detailed in Appendix A, our selection process identified 43,312 articles. This selection was strategically based on the criterion that the journal titles must incorporate the term “Psychol”, signifying their direct relevance to the field of psychology. The distributions of publication sources and years can be found in Table 1. Extracting the full texts of the articles from their PDF sources was an essential initial step, and, for this purpose, the PyPDF2 Python library was used. This library allowed us to seamlessly extract and concatenate titles, abstracts, and main content from each PDF article. However, a challenge arose with the presence of extraneous sections, such as references or tables, in the extracted texts. The implemented procedure, employing regular expressions in Python, was not only adept at identifying variations of the term “references” but also ascertained whether this section appeared as an isolated segment. This check was critical to ensure that the identified “references” section was indeed distinct, marking the start of a reference list without continuation into other text. Once identified as a standalone entity, the next step in the method was to efficiently remove the reference section and its subsequent content.
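A minimal sketch of this extraction-and-cleaning step might look as follows (the file path and the exact regular expression are illustrative assumptions rather than the authors' code):

```python
import re
from PyPDF2 import PdfReader

def extract_text(pdf_path: str) -> str:
    """Concatenate the text of all pages of a PDF article."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def strip_references(full_text: str) -> str:
    """Drop everything from a standalone 'References' heading onward."""
    match = re.search(r"^\s*references?\s*$", full_text,
                      flags=re.IGNORECASE | re.MULTILINE)
    return full_text[:match.start()] if match else full_text

article_text = strip_references(extract_text("example_article.pdf"))
```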
Causal knowledge extraction method
In our effort to extract causal knowledge, the choice of GPT-4 was not arbitrary. While several models were available for such tasks, GPT-4 emerged as a frontrunner due to its advanced capabilities (Wu et al., 2023 ), extensive training on diverse data, with its proven proficiency in understanding context, especially in complex scientific texts (Cheng et al., 2023 ; Sanderson, 2023 ). Other models were indeed considered; however, the capacity of GPT-4 to generate coherent, contextually relevant responses gave our project an edge in its specific requirements.
The extraction process commenced with the segmentation of the articles. Due to the token constraints inherent to GPT-4, it was imperative to break down the articles into manageable chunks, specifically those of 4000 tokens or fewer. This approach ensured a comprehensive interpretation of the content without omitting any potential causal relationships. The next phase was prompt engineering. To effectively guide the extraction capabilities of GPT-4, we crafted explicit prompts. A testament to this meticulous engineering is demonstrated in a directive in which we asked the model to elucidate causal pairs in a predetermined JSON format. For a clearer understanding, readers are referred to Table 2 , which elucidates the example prompt and the subsequent model response. After extraction, the outputs were not immediately cataloged. A filtering process was initiated to ascertain the standardization of the concept pairs. This process weeded out suboptimal outputs. Aiding in this quality control, GPT-4 played a pivotal role in the verification of causal pairs, determining their relevance, causality, and ensuring correct directionality. Finally, while extracting knowledge, we were aware of the constraints imposed by the GPT-4 API. There was a conscious effort to ensure that we operated within the bounds of 60 requests and 150k tokens per minute. This interplay of prompt engineering and stringent filtering was productive.
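The chunking-and-prompting loop can be sketched roughly as below; the prompt wording, JSON schema, and model identifier are illustrative assumptions (the actual prompt is shown in Table 2), and the OpenAI client is used here simply as a generic way to call GPT-4:

```python
import json
import tiktoken
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
encoder = tiktoken.encoding_for_model("gpt-4")

PROMPT = ("Extract causal relationships from the text below and return JSON of the form "
          '[{"cause": "...", "effect": "...", "explanation": "..."}].\n\nTEXT:\n')

def chunk_by_tokens(text: str, max_tokens: int = 4000) -> list[str]:
    """Split an article into chunks that respect the model's token budget."""
    tokens = encoder.encode(text)
    return [encoder.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

def extract_causal_pairs(article_text: str) -> list[dict]:
    pairs = []
    for chunk in chunk_by_tokens(article_text):
        reply = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": PROMPT + chunk}],
            temperature=0,
        )
        try:
            pairs.extend(json.loads(reply.choices[0].message.content))
        except json.JSONDecodeError:
            continue  # discard malformed outputs, mirroring the filtering step
    return pairs
```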
In addition, we conducted an exploratory study to assess GPT-4’s discernment between “causality” and “correlation”, involving four graduate students (mean age 31 ± 10.23), each evaluating relationship pairs extracted from psychology articles familiar to them. The experimental details and results can be found in Appendix A and Table A1. The results showed that out of 289 relationships identified by GPT-4, 87.54% were validated. Notably, when GPT-4 classified relationships as causal, only 13.02% (31/238) were recognized as non-relationships, while 65.55% (156/238) were agreed upon as causal. This shows that GPT-4 can accurately extract relationships (causality or correlation) in psychological texts, underscoring its potential as a tool for the construction of causal graphs.
To enhance the robustness of the extracted causal relationships and minimize biases, we adopted a multifaceted approach. Recognizing the indispensable role of human judgment, we periodically subjected random samples of extracted causal relationships to the scrutiny of domain experts. Their valuable feedback was instrumental in fine-tuning the extraction process in real time. Instead of heavily relying on referenced hypotheses, our focus was on extracting causal pairs, primarily from the findings mentioned in the main texts. This systematic methodology ultimately resulted in a refined text corpus distilled from 43,312 articles, which contained many conceptual insights and was primed for rigorous causal extraction.
Graph database storage
Our decision to employ Neo4j as the database system was strategic. Neo4j, as a graph database (Thomer and Wickett, 2020), is inherently designed to capture and represent complex relationships between data points, an attribute that is essential for understanding intricate causal relationships. Beyond its technical prowess, Neo4j provides advantages such as scalability, resilience, and efficient querying capabilities (Webber, 2012). It is particularly adept at traversing interconnected data points, making it an excellent fit for our causal relationship analysis. The mined causal knowledge finds its abode in the Neo4j graph database. Each causal concept is represented as a node, and each causal pair as a relationship between nodes, with its directionality and interpretation stored as attributes. Relationships thus bind related concepts together. Storing the knowledge graph in Neo4j allows for the execution of graph algorithms to analyze concept interconnectivity and reveal potential relationships.
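As a minimal sketch, causal pairs could be written to Neo4j with the official Python driver roughly as follows (the node label, property names, credentials, and the example pair are illustrative assumptions):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def store_causal_pair(tx, cause: str, effect: str, explanation: str):
    # MERGE avoids duplicating concepts that recur across many articles.
    tx.run(
        """
        MERGE (c:Concept {name: $cause})
        MERGE (e:Concept {name: $effect})
        MERGE (c)-[r:CAUSES]->(e)
        SET r.explanation = $explanation
        """,
        cause=cause, effect=effect, explanation=explanation,
    )

with driver.session() as session:
    session.execute_write(store_causal_pair, "self-efficacy", "life satisfaction",
                          "Higher self-efficacy was reported to increase life satisfaction.")
driver.close()
```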
The graph database contains 197k concepts and 235k connections. Table 3 encapsulates the core concepts and provides a vivid snapshot of the most recurring themes, helping us to understand the central topics that dominate the current psychological discourse. In a comprehensive examination of the core concepts extracted from 43,312 psychological papers, several distinct patterns and focal areas emerged. In particular, there is a clear balance between health and illness in psychological research. The prominence of terms such as “depression”, “anxiety”, and “symptoms of depression” highlights the discipline’s commitment to understanding and addressing mental illnesses. However, juxtaposed against these are positive terms such as “life satisfaction” and “sense of happiness”, suggesting that psychology not only fixates on challenges but also delves deeply into the nuances of positivity and well-being. Furthermore, the significance given to concepts such as “life satisfaction”, “sense of happiness”, and “job satisfaction” underscores an increasing recognition of emotional well-being and job satisfaction as integral to overall mental health. Intertwining the realms of psychology and neuroscience, terms such as “microglial cell activation”, “cognitive impairment”, and “neurodegenerative changes” signal a growing interest in understanding the neural underpinnings of cognitive and psychological phenomena. In addition, the emphasis on “self-efficacy”, “positive emotions”, and “self-esteem” reflects the profound interest in understanding how self-perception and emotions influence human behavior and well-being. Concepts such as “age”, “resilience”, and “creativity” further expand the canvas, showcasing the eclectic and comprehensive nature of inquiries in the field of psychology.
Overall, this analysis paints a vivid picture of modern psychological research, illuminating its multidimensional approach. It demonstrates a discipline that is deeply engaged with both the challenges and triumphs of human existence, offering holistic insight into the human mind and its myriad complexities.
Step 3: Hypothesis generation using link prediction
In the quest to uncover novel causal relationships beyond direct extraction from texts, the technique of link prediction emerges as a pivotal methodology. It hinges on the premise of proposing potential causal ties between concepts that our knowledge graph does not explicitly connect. The process intricately weaves together vector embedding, similarity analysis, and probability-based ranking. Initially, concepts are transposed into a vector space using node2vec, which is valued for its ability to capture topological nuances. Here, every pair of unconnected concepts is assigned a similarity score, and pairs that do not meet a set benchmark are quickly discarded. As we dive deeper into the higher echelons of these scored pairs, the likelihood of their linkage is assessed using the Jaccard similarity of their neighboring concepts. Subsequently, these potential causal relationships are organized in descending order of their derived probabilities, and the elite pairs are selected.
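The following sketch outlines that pipeline on a toy graph (the concept names, the similarity benchmark, and the node2vec hyperparameters are illustrative assumptions, not the study's settings):

```python
from itertools import combinations
import networkx as nx
from node2vec import Node2Vec

# Toy causal graph; in the study the graph is loaded from Neo4j.
G = nx.Graph()
G.add_edges_from([("self-efficacy", "life satisfaction"),
                  ("self-efficacy", "resilience"),
                  ("positive emotions", "life satisfaction"),
                  ("positive emotions", "resilience")])

# 1. Embed concepts with node2vec (placeholder hyperparameters).
model = Node2Vec(G, dimensions=32, walk_length=10, num_walks=50, quiet=True).fit()

def jaccard(u, v):
    nu, nv = set(G.neighbors(u)), set(G.neighbors(v))
    return len(nu & nv) / len(nu | nv) if nu | nv else 0.0

# 2. Score unconnected pairs by embedding similarity; 3. rank survivors by neighbour overlap.
candidates = [(u, v, jaccard(u, v)) for u, v in combinations(G.nodes, 2)
              if not G.has_edge(u, v) and model.wv.similarity(u, v) > 0.5]

for u, v, score in sorted(candidates, key=lambda x: -x[2]):
    print(f"candidate causal link: {u} -> {v} (neighbour Jaccard = {score:.2f})")
```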
An illustration of this approach is provided in the case highlighted in Figure A1. For instance, the behavioral inhibition system (BIS) exhibits ties to both the behavioral activation system (BAS) and the subsequent behavioral response of the BAS when encountering reward stimuli, termed the BAS reward response. Simultaneously, another concept, interference, finds itself bound to both the BAS and the BAS Reward Response. This configuration hints at a plausible link between the BIS and interference. Such highly probable causal pairs are not mere intellectual curiosity. They act as springboards, catalyzing the genesis of new experimental designs or research hypotheses ripe for empirical probing. In essence, this capability equips researchers with a cutting-edge instrument, empowering them to navigate the unexplored waters of the psychological and neurological domains.
Using pairs of highly probable causal concepts, we pushed GPT-4 to conjure novel causal hypotheses that bridge concepts. To further elucidate the process of this method, Table 4 provides some examples of hypotheses generated from the process. Such hypotheses, as exemplified in the last row, underscore the potential and power of our method for generating innovative causal propositions.
Hypotheses evaluation and results
In this section, we present an analysis focusing on quality in terms of novelty and usefulness of the hypotheses generated. According to existing literature, these dimensions are instrumental in encapsulating the essence of inventive ideas (Boden, 2009 ; McCarthy et al., 2018 ; Miron-Spektor and Beenen, 2015 ). These parameters have not only been quintessential for gauging creative concepts, but they have also been adopted to evaluate the caliber of research hypotheses (Dowling and Lucey, 2023 ; Krenn and Zeilinger, 2020 ; Oleinik, 2019 ). Specifically, we evaluate the quality of the hypotheses generated by the proposed LLMCG algorithm in relation to those generated by PhD students from an elite university who represent human junior experts, the LLM model, which represents advanced AI systems, and the research ideas refined by psychological researchers which represents cooperation between AI and humans.
The evaluation comprises three main stages. In the first stage, the hypotheses are generated by all contributors, including steps taken to ensure fairness and relevance for comparative analysis. In the second stage, the hypotheses from the first stage are independently and blindly reviewed by experts who represent the human academic community. These experts are asked to provide hypothesis ratings using a specially designed questionnaire to ensure statistical validity. The third stage delves deeper by transforming each research idea into the semantic space of a bidirectional encoder representation from transformers (BERT) (Lee et al., 2023 ), allowing us to intricately analyze the intrinsic reasons behind the rating disparities among the groups. This semantic mapping not only pinpoints the nuanced differences, but also provides potential insights into the cognitive constructs of each hypothesis.
Evaluation procedure
Selection of the focus area for hypothesis generation.
Selecting an appropriate focus area for hypothesis generation is crucial to ensure a balanced and insightful comparison of the hypothesis generation capacities between various contributors. In this study, our goal is to gauge the quality of hypotheses derived from four distinct contributors, with measures in place to mitigate potential confounding variables that might skew the results among groups (Rubin, 2005). Our choice of domain is informed by two pivotal criteria: the intricacy and subtlety of the subject matter and familiarity with the domain. It is essential that our chosen domain boasts sufficient complexity to prompt meaningful hypothesis generation and offer a robust assessment of both AI and human contributors’ depth of understanding and creativity. Furthermore, while human contributors should be well-acquainted with the domain, their expertise need not match the vast corpus knowledge of the AI.
In terms of overarching human pursuits such as the search for happiness, positive psychology distinguishes itself by avoiding narrowly defined, individual-centric challenges (Seligman and Csikszentmihalyi, 2000). This alignment with our selection criteria is epitomized by well-being, a salient concept within positive psychology, as shown in Table 3. Well-being, with its multidimensional essence that encompasses emotional, psychological, and social facets, and its central stature in both research and practical applications of positive psychology (Diener et al., 2010; Fredrickson, 2001; Seligman and Csikszentmihalyi, 2000), becomes the linchpin of our evaluation. The growing importance of well-being in the current global context offers myriad novel avenues for hypothesis generation and theoretical advancement (Forgeard et al., 2011; Madill et al., 2022; Otu et al., 2020). Adding to our rationale, the Positive Psychology Research Center at Tsinghua University is a globally renowned hub for cutting-edge research in this domain. Leveraging this stature, we secured participation from specialized Ph.D. students, reinforcing positive psychology as the most fitting domain for our inquiry.
Hypotheses comparison
In our study, the generated psychological hypotheses were categorized into four distinct groups, consisting of two experimental groups and two control groups. The experimental groups encapsulate hypotheses generated by our algorithm, either through random selection or handpicking by experts from a pool of generated hypotheses. On the other hand, control groups comprise research ideas that were meticulously crafted by doctoral students with substantial academic expertise in the domains and hypotheses generated by representative LLMs. In the following, we elucidate the methodology and underlying rationale for each group:
LLMCG algorithm output (Random-selected LLMCG)
Following the requirement of generating hypotheses centred on well-being, the LLMCG algorithm crafted 130 unique hypotheses. These hypotheses were derived by LLMCG’s evaluation of the most likely causal relationships related to well-being that had not been previously documented in research literature datasets. From this refined pool, 30 research ideas were chosen at random for this experimental group. These hypotheses represent the algorithm’s ability to identify causal relationships and formulate pertinent hypotheses.
LLMCG expert-vetted hypotheses (Expert-selected LLMCG)
For this group, two seasoned psychological researchers, one male aged 47 and one female aged 46, both with in-depth expertise in the realm of Positive Psychology, conscientiously handpicked 30 of the most promising hypotheses from the refined pool, excluding those from the Random-selected LLMCG category. The selection criteria centered on a holistic understanding of both the novelty and practical relevance of each hypothesis. With an illustrious postdoctoral journey and a robust portfolio of publications in positive psychology to their names, they rigorously sifted through the hypotheses, pinpointing those that showcased a perfect confluence of originality and actionable insight. These hypotheses were meticulously appraised for their relevance, structural coherence, and potential academic value, representing the nexus of machine intelligence and seasoned human discernment.
PhD students’ output (Control-Human)
We enlisted the expertise of 16 doctoral students from the Positive Psychology Research Center at Tsinghua University. Under the guidance of their supervisor, each student was provided with a questionnaire geared toward research on well-being. The participants were given a period of four working days to complete and return the questionnaire, which was distributed during vacation to ensure minimal external disruptions and commitments. The specific instructions provided in the questionnaire are detailed in Table B1, and each participant was asked to complete 3–4 research hypotheses. By the stipulated deadline, we received responses from 13 doctoral students, with a mean age of 31.92 years (SD = 7.75 years), cumulatively presenting 41 hypotheses related to well-being. To maintain uniformity with the other groups, a random selection was made to shortlist 30 hypotheses for further analysis. These hypotheses reflect the integration of core theoretical concepts with the latest insights into the domain, presenting an academic interpretation rooted in their rigorous training and education. Including this group in our study not only provides a natural benchmark for human ingenuity and expertise but also underscores the invaluable contribution of human cognition in research ideation, serving as a pivotal contrast to AI-generated hypotheses. This juxtaposition illuminates the nuanced differences between human intellectual depth and AI’s analytical progress, enriching the comparative dimensions of our study.
Claude model output (Control-Claude)
This group exemplifies the pinnacle of current LLM technology in generating research hypotheses. Since LLMCG is a nascent technology, its assessment requires a comparative study with well-established counterparts, creating a key paradigm in comparative research. Currently, Claude-2 and GPT-4 represent the apex of AI technology. For example, Claude-2, with an accuracy rate of 54.4%, excels in reasoning and answering questions, substantially outperforming other models such as Falcon, Koala, and Vicuna, which have accuracy rates of 17.1–25.5% (Wu et al., 2023). To facilitate a more comprehensive evaluation of the new model by researchers and to increase the diversity and breadth of comparison, we chose Claude-2 as the control model. Using the detailed instructions provided in Table B2, Claude-2 was iteratively prompted to generate research hypotheses, generating ten hypotheses per prompt, culminating in a total of 50 hypotheses. Although the sheer number and range of these hypotheses accentuate the capabilities of Claude-2, to ensure compatibility in terms of complexity and depth between all groups, a subsequent refinement was considered essential. With minimal human intervention, GPT-4 was used to evaluate these 50 hypotheses and select the top 30 that exhibited the most innovative, relevant, and academically valuable insights. This process ensured the infusion of both the LLM’s analytical prowess and a layer of qualitative rigor, thus giving rise to a set of hypotheses that not only align with the overarching theme of well-being but also resonate with current academic discourse.
Hypotheses assessment
The assessment of the hypotheses encompasses two key components: the evaluation conducted by eminent psychology professors emphasizing novelty and utility, and the deep semantic analysis involving BERT and t-distributed stochastic neighbor embedding (t-SNE) visualization to discern semantic structures and disparities among hypotheses.
Human academic community
The review task was entrusted to three eminent psychology professors (all male, mean age = 42.33), who have a decade-long legacy in guiding doctoral and master’s students in positive psychology and editorial stints in renowned journals; their task was to conduct a meticulous evaluation of the 120 hypotheses. Importantly, to ensure unbiased evaluation, the hypotheses were presented to them in a completely randomized order in the questionnaire.
Our emphasis was undeniably anchored to two primary tenets: novelty and utility (Cohen, 2017 ; Shardlow et al., 2018 ; Thompson and Skau, 2023 ; Yu et al., 2016 ), as shown in Table B3 . Utility in hypothesis crafting demands that our propositions extend beyond mere factual accuracy; they must resonate deeply with academic investigations, ensuring substantial practical implications. Given the inherent challenges of research, marked by constraints in time, manpower, and funding, it is essential to design hypotheses that optimize the utilization of these resources. On the novelty front, we strive to introduce innovative perspectives that have the power to challenge and expand upon existing academic theories. This not only propels the discipline forward but also ensures that we do not inadvertently tread on ground already covered by our contemporaries.
Deep semantic analysis
While human evaluations provide invaluable insight into the novelty and utility of hypotheses, to objectively discern and visualize semantic structures and the disparities among them, we turn to the realm of deep learning. Specifically, we employ the power of BERT (Devlin et al., 2018). BERT, as highlighted by Lee et al. (2023), has remarkable potential to assess the innovation of ideas. By translating each hypothesis into a high-dimensional vector in the BERT domain, we obtain the profound semantic core of each statement. However, such granularity in dimensions presents challenges when aiming for visualization.
To alleviate this and to intuitively understand the clustering and dispersion of these hypotheses in semantic space, we deploy the t-SNE (t-distributed stochastic neighbor embedding) technique (Van der Maaten and Hinton, 2008), which is adept at reducing the dimensionality of the data while preserving the relative pairwise distances between the items. Thus, when we map our BERT-encoded hypotheses onto a 2D t-SNE plane, we gain an immediate visual grasp of how closely or distantly related our hypotheses are in terms of their semantic content. Our intent is twofold: to understand the semantic terrains carved out by the different groups and to infer the potential reasons why some of the hypotheses garnered heightened novelty or utility ratings from experts. The convergence of human evaluations and semantic layouts, as delineated by Algorithm 1 in Appendix B, reveals the interplay between human intuition and the inherent semantic structure of the hypotheses.
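A condensed sketch of this embedding-and-projection pipeline is given below (the BERT checkpoint, the mean-pooling choice, and the example hypotheses are illustrative assumptions; the paper does not specify these details here):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.manifold import TSNE

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

hypotheses = [  # illustrative stand-ins for the rated hypotheses
    "Community gratitude rituals increase collective well-being.",
    "Robot companionship reduces loneliness in older adults.",
    "Perceived autonomy at work buffers the effect of stress on life satisfaction.",
    "Neighbourhood green space fosters positive emotions in adolescents.",
]

def embed(texts):
    """Mean-pool BERT's last hidden state to get one vector per hypothesis."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state          # (batch, tokens, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # (batch, 768)

vectors = embed(hypotheses).numpy()
# Project to 2D; perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=min(30, len(vectors) - 1),
              random_state=0).fit_transform(vectors)
print(coords)
```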
Qualitative analysis by topic analysis
To better understand the underlying thought processes and the topical emphasis of both PhD students and the LLMCG model, qualitative analyses were performed using visual tools such as word clouds and connection graphs, as detailed in Appendix B . The word cloud, as a graphical representation, effectively captures the frequency and importance of terms, providing direct visualization of the dominant themes. Connection graphs, on the other hand, elucidate the relationships and interplay between various themes and concepts. Using these visual tools, we aimed to achieve a more intuitive and clear representation of the data, allowing for easy comparison and interpretation.
Observations drawn from both the word clouds and the connection graphs in Figures B1 and B2 provide us with a rich tapestry of insights into the thought processes and priorities of Ph.D. students and the LLMCG model. For instance, the emphasis in the Control-Human word cloud on terms such as “robot” and “AI” indicates a strong interest among Ph.D. students in the nexus between technology and psychology. It is particularly fascinating to see a group of academically trained individuals focusing on the real-world implications and intersections of their studies, as shown by their apparent draw toward trending topics. This not only underscores their adaptability but also emphasizes the importance of contextual relevance. Conversely, the LLMCG groups, particularly the Expert-selected LLMCG group, emphasize the community, collective experiences, and the nuances of social interconnectedness. This denotes a deep-rooted understanding and application of higher-order social psychological concepts, reflecting the model’s ability to dive deep into the intricate layers of human social behavior.
Furthermore, the connection graphs support these observations. The Control-Human graph, with its exploration of themes such as “Robot Companionship” and its relation to factors such as “heart rate variability (HRV)”, demonstrates a confluence of technology and human well-being. The other groups, especially the Random-selected LLMCG group, yield themes that are more societal and structural, hinting at broader determinants of individual well-being.
Analysis of human evaluations
To quantify the agreement among the raters, we employed Spearman correlation coefficients. The results, as shown in Table B5, reveal a spectrum of agreement levels between the reviewer pairs, showcasing the subjective dimension intrinsic to the evaluation of novelty and usefulness. In particular, the correlation between reviewer 1 and reviewer 2 in novelty (Spearman r = 0.387, p < 0.0001) and between reviewer 2 and reviewer 3 in usefulness (Spearman r = 0.376, p < 0.0001) suggests a meaningful level of consensus, particularly highlighting their capacity to identify valuable insights when evaluating hypotheses.
The variations in correlation values, such as between reviewer 2 and reviewer 3 ( r = 0.069, p = 0.453), can be attributed to the diverse research orientations and backgrounds of each reviewer. Reviewer 1 focuses on social ecology, reviewer 3 specializes in neuroscientific methodologies, and reviewer 2 integrates various views using technologies like virtual reality, and computational methods. In our evaluation, we present specific hypotheses cases to illustrate the differing perspectives between reviewers, as detailed in Table B4 and Figure B3. For example, C5 introduces the novel concept of “Virtual Resilience”. Reviewers 1 and 3 highlighted its originality and utility, while reviewer 2 rated it lower in both categories. Meanwhile, C6, which focuses on social neuroscience, resonated with reviewer 3, while reviewers 1 and 2 only partially affirmed it. These differences underscore the complexity of evaluating scientific contributions and highlight the importance of considering a range of expert opinions for a comprehensive evaluation.
This assessment is divided into two main sections: Novelty analysis and usefulness analysis.
Novelty analysis
In the dynamic realm of scientific research, measuring and analyzing novelty is gaining paramount importance (Shin et al., 2022). ANOVA was used to analyze the novelty scores represented in Fig. 2a, and we identified a significant influence of the group factor on the mean novelty score between different reviewers. Initially, z-scores were calculated for each reviewer’s ratings to standardize the scoring scale, which were then averaged. The distinct differences between the groups, as visualized in the boxplots, are statistically underpinned by the results in Table 5. The ANOVA results revealed a pronounced effect of the grouping factor (F(3, 116) = 6.92, p = 0.0002), with variance explained by the grouping factor (R-squared) of 15.19%.
Box plots on the left (a) and (b) depict distributions of novelty and usefulness scores, respectively, while smoothed line plots on the right show the novelty and usefulness scores in descending order, subjected to a moving average with a window size of 2. * denotes p < 0.05, ** denotes p < 0.01.
Further pairwise comparisons using the Bonferroni method, as delineated in Table 5 and visually corroborated by Fig. 2a, revealed significant disparities between Random-selected LLMCG and Control-Claude (t(59) = 3.34, p = 0.007) and between Control-Human and Control-Claude (t(59) = 4.32, p < 0.001). The Cohen’s d values of 0.8809 and 1.1192 respectively indicate that the novelty scores for the Random-selected LLMCG and Control-Human groups are significantly higher than those for the Control-Claude group. Additionally, when considering the cumulative distribution plots to the right of Fig. 2a, we observe the distributional characteristics of the novelty scores. For example, it can be observed that the Expert-selected LLMCG curve portrays a greater concentration in the middle score range when compared to the Control-Claude curve, but dominates in the high novelty scores (highlighted in the dashed rectangle). Moreover, comparisons involving Control-Human with both Random-selected LLMCG and Expert-selected LLMCG did not manifest statistically significant variances, indicating aligned novelty perceptions among these groups. Finally, the comparisons between Expert-selected LLMCG and Control-Claude (t(59) = 2.49, p = 0.085) suggest a trend toward significance, with a Cohen’s d value of 0.6226 indicating generally higher novelty scores for Expert-selected LLMCG compared to Control-Claude.
To mitigate potential biases due to individual reviewer inclinations, we expanded our evaluation to include both median and maximum z-scores from the three reviewers for each hypothesis. These multifaceted analyses enhance the robustness of our results by minimizing the influence of extreme values and potential outliers. First, when analyzing the median novelty scores, the ANOVA test demonstrated a notable association with the grouping factor (F(3, 116) = 6.54, p = 0.0004), which explained 14.41% of the variance. As illustrated in Table 5, pairwise evaluations revealed significant disparities between Control-Human and Control-Claude (t(59) = 4.01, p = 0.001), with Control-Human performing significantly higher than Control-Claude (Cohen’s d = 1.1031). Similarly, there were significant differences between Random-selected LLMCG and Control-Claude (t(59) = 3.40, p = 0.006), where Random-selected LLMCG also significantly outperformed Control-Claude (Cohen’s d = 0.8875). Interestingly, the comparison of Expert-selected LLMCG with Control-Claude (t(59) = 1.70, p = 0.550) and other group pairings did not yield statistically significant differences.
Subsequently, turning our attention to maximum novelty scores provided crucial insights, especially where outlier scores may carry significant weight. The influence of the grouping factor was evident (F(3, 116) = 7.20, p = 0.0002), indicating an explained variance of 15.70%. In particular, clear differences emerged between Control-Human and Control-Claude (t(59) = 4.36, p < 0.001), and between Random-selected LLMCG and Control-Claude (t(59) = 3.47, p = 0.004). A particularly intriguing observation was the significant difference between Expert-selected LLMCG and Control-Claude (t(59) = 3.12, p = 0.014). The Cohen’s d values of 1.1637, 1.0457, and 0.6987 respectively indicate that the novelty scores for the Control-Human, Random-selected LLMCG, and Expert-selected LLMCG groups are significantly higher than those for the Control-Claude group. Together, these analyses offer a multifaceted perspective on novelty evaluations. Specifically, the results of the median analysis echo and support those of the mean, reinforcing the reliability of our assessments. The discerned significance between Control-Claude and Expert-selected LLMCG in the maximum-score data emphasizes the intricate differences, while also pointing to broader congruence in novelty perceptions.
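For readers who wish to reproduce this kind of analysis, a compact sketch of the scoring-and-testing pipeline is given below (the ratings are synthetic stand-ins generated at random; the actual reviewer data are not reproduced here):

```python
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
groups = ["Control-Human", "Control-Claude", "Random-selected LLMCG", "Expert-selected LLMCG"]
# Synthetic stand-in: 30 hypotheses per group, rated by 3 reviewers on a 7-point novelty scale.
raw = np.vstack([rng.integers(1, 8, size=(30, 3)) for _ in groups]).astype(float)

# Standardize each reviewer's ratings across all 120 hypotheses, then average per hypothesis.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
mean_scores = {g: z[i * 30:(i + 1) * 30].mean(axis=1) for i, g in enumerate(groups)}

# One-way ANOVA across the four groups.
f_stat, p_val = stats.f_oneway(*mean_scores.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4f}")

# Bonferroni-corrected pairwise t-tests (6 comparisons).
pairs = list(combinations(groups, 2))
for g1, g2 in pairs:
    t, p = stats.ttest_ind(mean_scores[g1], mean_scores[g2])
    print(f"{g1} vs {g2}: t = {t:.2f}, adjusted p = {min(p * len(pairs), 1.0):.3f}")
```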
Usefulness analysis
Evaluating the practical impact of hypotheses is crucial in scientific research assessments. For mean usefulness scores, the grouping factor did not exert a significant influence (F(3, 116) = 5.25, p = 0.553). Figure 2b presents the utility score distributions between groups. The narrow interquartile range of Control-Human suggests a relatively consistent assessment among reviewers. On the other hand, the spread and outliers in the Control-Claude distribution hint at varied utility perceptions. Both LLMCG groups cover a broad score range, demonstrating a mixture of high and low utility scores, while the Expert-selected LLMCG gravitates more toward higher usefulness scores. The smoothed line plots accompanying Fig. 2b further detail the score densities. For instance, Random-selected LLMCG boasts several high utility scores, counterbalanced by a smattering of low scores. Interestingly, the distributions for Control-Human and Expert-selected LLMCG appear to be closely aligned. While mean utility scores provide an overarching view, the nuances within the boxplots and smoothed plots offer deeper insights. This comprehensive understanding can guide future endeavors in content generation and evaluation, spotlighting key areas of focus and potential improvements.
Comparison between the LLMCG and GPT-4
To evaluate the impact of integrating a causal graph with GPT-4, we performed an ablation study comparing the hypotheses generated by GPT-4 alone and those of the proposed LLMCG framework. For this experiment, 60 hypotheses were created using GPT-4, following the detailed instructions in Table B2 . Furthermore, 60 hypotheses for the LLMCG group were randomly selected from the remaining pool of 70 hypotheses. Subsequently, both sets of hypotheses were assessed by three independent reviewers for novelty and usefulness, as previously described.
Table 6 shows a comparison between the GPT-4 and LLMCG groups, highlighting a significant difference in novelty scores (mean value: t (119) = 6.60, p < 0.0001) but not in usefulness scores (mean value: t (119) = 1.31, p = 0.1937). This indicates that the LLMCG framework significantly enhances hypothesis novelty (all Cohen’s d > 1.1) without affecting usefulness compared to the GPT-4 group. Figure B6 visually contrasts these findings, underlining the causal graph’s unique role in fostering novel hypothesis generation when integrated with GPT-4.
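A compact sketch of the ablation comparison follows. The score arrays are placeholders standing in for the reviewers' mean ratings, and the random draw of 60 LLMCG hypotheses from the pool of 70 mirrors the sampling step described above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def cohens_d(a, b):
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) +
                      (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled

# Placeholder arrays standing in for the reviewers' mean novelty scores.
gpt4_scores = rng.normal(0.0, 1.0, 60)    # 60 GPT-4-only hypotheses
llmcg_pool = rng.normal(1.2, 1.0, 70)     # remaining pool of 70 LLMCG hypotheses
llmcg_scores = rng.choice(llmcg_pool, size=60, replace=False)

t, p = stats.ttest_ind(llmcg_scores, gpt4_scores)
print(f"novelty: t = {t:.2f}, p = {p:.4g}, "
      f"d = {cohens_d(llmcg_scores, gpt4_scores):.2f}")
```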
The t-SNE visualizations (Fig. 3) illustrate the semantic relationships between the different groups, capturing patterns of novelty and usefulness. Notably, a distinct clustering among the PhD students suggests shared academic influences, while the LLMCG groups display broader topic dispersion, hinting at wider semantic coverage. The size of each bubble reflects the novelty and usefulness scores, emphasizing the diverse perceptions of what is considered innovative versus beneficial. The numbers near the yellow dots are participant IDs, showing that hypotheses from the same participant, such as H05 or H06, sit close together in the semantic space. In Fig. B4, a distinct clustering of examples is observed, particularly the close proximity of hypotheses C3, C4, and C8 within the semantic space. This observation is further elucidated in Appendix B, aiding the comprehension of BERT's semantic representation. Rather than relying solely on surface-level textual descriptions, this analysis probes the underlying conceptual structure of the semantic space, a topic also explored in recent research (Johnson et al., 2023).
Comparison of ( a ) novelty and ( b ) usefulness scores (bubble size scaled by 100) among the different groups.
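A hedged sketch of the semantic mapping underlying Fig. 3 is given below: each hypothesis is encoded with a BERT-family model and the embeddings are projected to two dimensions with t-SNE (Van der Maaten and Hinton, 2008). The model name, mean-pooling strategy, and column names are assumptions rather than the authors' exact configuration.

```python
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

df = pd.read_csv("hypotheses.csv")   # assumed columns: text, group, novelty, usefulness

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pooled BERT embeddings (the pooling strategy is an assumption)."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

emb = embed(df["text"].tolist())
xy = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(emb)

# Bubble plot in the style of Fig. 3: marker size scaled by the novelty score.
plt.scatter(xy[:, 0], xy[:, 1], s=df["novelty"] * 100, alpha=0.5)
plt.show()
```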
In the distribution of semantic distances (Fig. 4), the Control-Human group exhibits a distinctly greater semantic distance than the other groups, emphasizing its unique semantic orientation. Statistical support for this observation comes from the ANOVA results, with a significant F-statistic (F(3,1652) = 84.1611, p < 0.00001) underscoring the impact of the grouping factor; this factor explains 86.96% of the variance, as indicated by the R-squared value. Multiple comparisons, shown in Table 7, further elucidate these group differences. Control-Human and Control-Claude exhibit a significant contrast in their semantic distances, as highlighted by the t value of 16.41 and the adjusted p value (< 0.0001), indicating distinct thought patterns or emphases in the two groups; notably, Control-Human demonstrates the greater semantic distance (Cohen's d = 1.1630). Similarly, comparing Control-Claude with the LLMCG groups reveals pronounced differences (Cohen's d > 0.9), more so with the Expert-selected LLMCG (p < 0.0001). Comparing Control-Human with the LLMCG groups shows divergent semantic orientations, with significantly larger distances than Random-selected LLMCG (p = 0.0036) and a trend toward a difference with Expert-selected LLMCG (p = 0.0687). Intriguingly, the two LLMCG groups (Random-selected and Expert-selected) exhibit similar semantic distances, as evidenced by a nonsignificant p value of 0.4362. Furthermore, the significant distinctions we observed, particularly between Control-Human and the other groups, align with the human evaluations of novelty. This coherence indicates that the BERT space representation, coupled with statistical analyses, can effectively mimic human judgment. Such results underscore the potential of this approach for automated hypothesis evaluation, paving the way for more efficient and streamlined semantic assessments in the future.
Note: ** denotes p < 0.01, **** denotes p < 0.0001.
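The semantic-distance comparison can be sketched as follows, continuing from the embedding example above. Treating within-group pairwise cosine distances as the unit of analysis is a simplifying assumption, since the paper's exact pairing scheme is not spelled out here.

```python
import numpy as np
from itertools import combinations
from scipy import stats
from scipy.spatial.distance import cosine

def within_group_distances(vectors):
    """Pairwise cosine distances among one group's hypothesis embeddings."""
    return np.array([cosine(vectors[i], vectors[j])
                     for i, j in combinations(range(len(vectors)), 2)])

# `df` and `emb` come from the embedding sketch above.
dists = {g: within_group_distances(emb[(df["group"] == g).to_numpy()])
         for g in df["group"].unique()}

f_stat, p_val = stats.f_oneway(*dists.values())
print(f"semantic-distance ANOVA: F = {f_stat:.2f}, p = {p_val:.2g}")
```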
In general, visual and statistical analyses reveal the nuanced semantic landscapes of each group. While the Ph.D. students’ shared background influences their clustering, the machine models exhibit a comprehensive grasp of topics, emphasizing the intricate interplay of individual experiences, academic influences, and algorithmic understanding in shaping semantic representations.
This investigation carried out a detailed evaluation of the various hypothesis contributors, blending quantitative and qualitative analyses. In the topic analysis, distinct variations were observed between Control-Human and LLMCG, with the latter presenting more expansive thematic coverage. In the human evaluation, hypotheses from the Ph.D. students paralleled the LLMCG in novelty, reinforcing AI's growing competence in mirroring human innovative thinking; when juxtaposed with AI models such as Control-Claude, the LLMCG exhibited greater novelty. Deep semantic analysis via t-SNE and BERT representations allowed us to grasp the semantic essence of the hypotheses intuitively, signaling the possibility of future automated hypothesis assessments. Interestingly, LLMCG appeared to cover broader complementary domains than the human input. Taken together, these findings highlight the emerging role of AI in hypothesis generation and provide key insights into hypothesis evaluation across diverse origins.
General discussion
This research delves into the synergistic relationship between LLMs and causal graphs in the hypothesis generation process. Our findings underscore the ability of LLMs, when integrated with causal graph techniques, to produce meaningful hypotheses with increased efficiency and quality. By centering our investigation on "well-being," we emphasize its pivotal role in psychological studies and highlight the potential convergence of technology and society. A multifaceted assessment approach, combining topic analysis, human evaluation, and deep semantic analysis, demonstrates that AI-augmented methods not only outshine LLM-only techniques in novelty and match human expertise in quality, but also achieve deeper conceptual incorporation and a broader semantic spectrum. Such a multifaceted lens of assessment introduces a novel perspective for the scholarly realm, equipping researchers with an enriched understanding and an innovative toolset for hypothesis generation. At its core, the melding of LLMs and causal graphs signals a promising frontier, especially for dissecting cornerstone psychological constructs such as "well-being". This marriage of methodologies, enriched by the comprehensive assessment angle, deepens our comprehension of both the immediate and broader ramifications of our research endeavors.
The prominence of causal graphs in psychology is profound: they offer researchers a unified platform for synthesizing and hypothesizing across diverse psychological realms (Borsboom et al., 2021; Uleman et al., 2021). Our study echoes this, producing novel hypotheses comparable in depth to early expert propositions. Deep semantic analysis bolstered these findings, emphasizing that our hypotheses have distinct cross-disciplinary merits, particularly when compared with those of individual doctoral scholars. The traditional use of causal graphs in psychology, however, presents challenges due to its demanding nature, often requiring insights from multiple experts (Crielaard et al., 2022). Our research harnesses the LLM's causal extraction to automate the derivation of causal pairs, thereby minimizing the need for extensive expert input. The union of the causal graphs' systematic approach with AI-driven creativity, as seen with LLMs, paves the way for the future of psychological inquiry: thanks to advances in AI, barriers once created by the intricate procedures of causal graph construction are being dismantled. Furthermore, as the era of big data dawns, the integration of AI and causal graphs in psychology not only augments research capabilities but also brings into focus the broader implications for society. This fusion provides a nuanced understanding of intricate sociopsychological dynamics, emphasizing the importance of adapting research methodologies in tandem with technological progress.
In research, LLMs often serve as the foundation or baseline against which newer methods and approaches are assessed. The productivity enhancements demonstrated for generative AI tools by Noy and Zhang (2023) indicate the potential of such LLMs. In our investigation, we pitted the hypotheses generated by such large models against our integrated LLMCG approach. Intriguingly, while these LLMs showed admirable practicality in their hypotheses, they lagged substantially behind the doctoral-student and LLMCG groups in terms of innovation. This divergence can be attributed to the causal network curated from 43k research papers, which funnels the vast knowledge reservoir of the LLM squarely into the realm of scientific psychology. The increased precision in hypothesis generation by these models fits well within the framework of generative networks: Tong et al. (2021) highlighted that, by integrating structured constraints, conventional neural networks can accurately generate semantically relevant content. One of the salient merits of the causal graph in this context is its ability to alleviate the inherent ambiguity and interpretability challenges posed by LLMs. By providing a systematic and structured framework, the causal graph helps to surface the underlying logic and rationale of the outputs generated by LLMs. Notably, this finding echoes the perspective of Pan et al. (2023), in which the integration of structured knowledge from knowledge graphs was shown to lend an invaluable layer of clarity and interpretability to LLMs, especially in complex reasoning tasks. Such structured approaches not only boost researchers' confidence in the derived hypotheses but also improve the transparency and understandability of LLM outputs. In essence, leveraging causal graphs may well herald a new era in model interpretability, serving as a conduit to unlock the black box that large models often represent in contemporary research.
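To make the role of the causal graph concrete, the sketch below assembles hypothetical LLM-extracted causal pairs into a directed graph and reads off the neighbors of a seed concept. The pairs, weights, and use of networkx are illustrative assumptions, not the authors' implementation (which the paper associates with a graph database such as Neo4j; Webber, 2012).

```python
import networkx as nx

# Hypothetical output of the LLM extraction step: (cause, effect, support count).
causal_pairs = [
    ("social support", "well-being", 12),
    ("gratitude", "well-being", 9),
    ("well-being", "job performance", 7),
]

G = nx.DiGraph()
for cause, effect, weight in causal_pairs:
    G.add_edge(cause, effect, weight=weight)

seed = "well-being"
upstream = list(G.predecessors(seed))    # candidate causes of the seed concept
downstream = list(G.successors(seed))    # candidate consequences of the seed concept
print("causes:", upstream, "| consequences:", downstream)
```

Candidate cause-effect paths around the seed concept can then be handed back to the LLM as structured prompts for hypothesis drafting, which is the general pattern the discussion above describes.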
In the ever-evolving tapestry of research, every advancement invariably comes with its own set of constraints, and our study is no exception. On the technical front, a pivotal challenge stems from the opaque inner workings of GPT. Determining the exact mechanisms within GPT that lead to the formation of specific causal pairs remains elusive, reintroducing the long-standing issue of AI's lack of transparency (Buruk, 2023; Cao and Yousefzadeh, 2023). This opacity is magnified in our sparse causal graph, which, while expansive, occasionally contains concepts that are nominally distinct yet convergent in meaning. In tangible applications, careful and meticulous algorithmic evaluation would be imperative to construct an accurate psychological conceptual landscape. Psychology, which bridges the humanities and the natural sciences, continually aims to unravel human cognition and behavior (Hergenhahn and Henley, 2013). Despite the dominance of traditional methodologies (Henrich et al., 2010; Shah et al., 2015), the present data-centric era amplifies the synergy of technology and the humanities, resonating with Hasok Chang's vision of enriched science (Chang, 2007). This symbiosis is evident when assessing structural holes in social networks (Burt, 2004) and when viewing novelty as a bridge across these divides (Foster et al., 2021). Such perspectives emphasize the importance of thorough algorithmic assessment and highlight potential avenues in humanities research, especially when incorporating large language models for innovative hypothesis crafting and verification.
However, there are some limitations to this research. First, constructing causal relationship graphs carries potential inaccuracies: approximately 13% of relationship pairs did not align with human expert estimations. Improving relationship extraction could increase the accuracy of the causal graph and, in turn, yield more robust hypotheses. Second, our validation process was limited to 130 hypotheses, whereas the vastness of our conceptual landscape suggests countless possibilities; the twenty pivotal psychological concepts highlighted in Table 3 alone could spawn an extensive array of hypotheses, and exhaustively validating this surrounding space was beyond the scope of this study. A striking observation during our validation was the inconsistency in the evaluations of the senior expert panels (as shown in Table B5). This inconsistency underscores a pivotal insight: our integration of AI has shifted the dependency on scarce expert resources from hypothesis generation to evaluation. In the future, rigorous evaluations ensuring both novelty and utility could become a focal point of exploration. The promising path forward requires a thoughtful integration of technological innovation and human expertise to fully realize the potential suggested by our study.
In conclusion, our research provides pioneering insight into the symbiotic fusion of LLMs, epitomized by GPT, and causal graphs for psychological hypothesis generation, with a particular emphasis on "well-being". Importantly, as highlighted by Cao and Yousefzadeh (2023), ensuring a synergistic alignment between domain knowledge and AI extrapolation is crucial; this synergy keeps AI models within their conceptual limits and thereby bolsters the validity and reliability of the generated hypotheses. Our approach interweaves the advanced capabilities of LLMs with the methodological strengths of causal graphs, refining both the depth and the precision of hypothesis generation. The causal graph, of paramount importance in psychology because of its cross-disciplinary potential, traditionally demands extensive expert involvement. Our approach addresses this by leveraging the LLM's causal extraction abilities, effectively shifting intensive expert engagement from hypothesis creation to evaluation. Our methodology therefore combines LLMs with causal graphs, propelling psychological research forward by improving hypothesis generation and offering tools that blend theoretical and data-centric approaches. This synergy particularly enriches our understanding of the complex dynamics of social psychology, such as happiness research, demonstrating the profound impact of integrating AI with traditional research frameworks.
Data availability
The data generated and analyzed in this study are partially available within the Supplementary materials . For additional data supporting the findings of this research, interested parties may contact the corresponding author, who will provide the information upon receiving a reasonable request.
Battleday RM, Peterson JC, Griffiths TL (2020) Capturing human categorization of natural images by combining deep networks and cognitive models. Nat Commun 11(1):5418
Bechmann A, Bowker GC (2019) Unsupervised by any other name: hidden layers of knowledge production in artificial intelligence on social media. Big Data Soc 6(1):2053951718819569
Binz M, Schulz E (2023) Using cognitive psychology to understand GPT-3. Proc Natl Acad Sci 120(6):e2218523120
Boden MA (2009) Computer models of creativity. AI Mag 30(3):23–23
Borsboom D, Deserno MK, Rhemtulla M, Epskamp S, Fried EI, McNally RJ (2021) Network analysis of multivariate data in psychological science. Nat Rev Methods Prim 1(1):58
Burt RS (2004) Structural holes and good ideas. Am J Sociol 110(2):349–399
Buruk O (2023) Academic writing with GPT-3.5: reflections on practices, efficacy and transparency. arXiv preprint arXiv:2304.11079
Cao X, Yousefzadeh R (2023) Extrapolation and AI transparency: why machine learning models should reveal when they make decisions beyond their training. Big Data Soc 10(1):20539517231169731
Chang H (2007) Scientific progress: beyond foundationalism and coherentism1. R Inst Philos Suppl 61:1–20
Cheng K, Guo Q, He Y, Lu Y, Gu S, Wu H (2023) Exploring the potential of GPT-4 in biomedical engineering: the dawn of a new era. Ann Biomed Eng 51:1645–1653
Cichy RM, Khosla A, Pantazis D, Torralba A, Oliva A (2016) Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Sci Rep 6(1):27755
Cohen BA (2017) How should novelty be valued in science? Elife 6:e28699
Crielaard L, Uleman JF, Châtel BD, Epskamp S, Sloot P, Quax R (2022) Refining the causal loop diagram: a tutorial for maximizing the contribution of domain expertise in computational system dynamics modeling. Psychol Methods 29(1):169–201
Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp 4171–4186
Diener E, Wirtz D, Tov W, Kim-Prieto C, Choi D-W, Oishi S, Biswas-Diener R (2010) New well-being measures: short scales to assess flourishing and positive and negative feelings. Soc Indic Res 97:143–156
Dowling M, Lucey B (2023) ChatGPT for (finance) research: the Bananarama conjecture. Financ Res Lett 53:103662
Forgeard MJ, Jayawickreme E, Kern ML, Seligman ME (2011) Doing the right thing: measuring wellbeing for public policy. Int J Wellbeing 1(1):79–106
Foster J G, Shi F & Evans J (2021) Surprise! Measuring novelty as expectation violation. SocArXiv
Fredrickson BL (2001) The role of positive emotions in positive psychology: The broaden-and-build theory of positive emotions. Am Psychol 56(3):218
Gu Q, Kuwajerwala A, Morin S, Jatavallabhula KM, Sen B, Agarwal A et al. (2024) ConceptGraphs: open-vocabulary 3D scene graphs for perception and planning. In: 2nd Workshop on Language and Robot Learning: Language as Grounding
Henrich J, Heine SJ, Norenzayan A (2010) Most people are not WEIRD. Nature 466(7302):29–29
Hergenhahn BR, Henley T (2013) An introduction to the history of psychology. Cengage Learning
Jaccard J, Jacoby J (2019) Theory construction and model-building skills: a practical guide for social scientists. Guilford Publications
Johnson DR, Kaufman JC, Baker BS, Patterson JD, Barbot B, Green AE (2023) Divergent semantic integration (DSI): Extracting creativity from narratives with distributional semantic modeling. Behav Res Methods 55(7):3726–3759
Kıcıman E, Ness R, Sharma A & Tan C (2023) Causal reasoning and large language models: opening a new frontier for causality. arXiv preprint arXiv:2305.00050
Koehler DJ (1994) Hypothesis generation and confidence in judgment. J Exp Psychol Learn Mem Cogn 20(2):461–469
Krenn M, Zeilinger A (2020) Predicting research trends with semantic and neural networks with an application in quantum physics. Proc Natl Acad Sci 117(4):1910–1916
Lee H, Zhou W, Bai H, Meng W, Zeng T, Peng K & Kumada T (2023) Natural language processing algorithms for divergent thinking assessment. In: Proc IEEE 6th Eurasian Conference on Educational Innovation (ECEI) p 198–202
Madill A, Shloim N, Brown B, Hugh-Jones S, Plastow J, Setiyawati D (2022) Mainstreaming global mental health: Is there potential to embed psychosocial well-being impact in all global challenges research? Appl Psychol Health Well-Being 14(4):1291–1313
McCarthy M, Chen CC, McNamee RC (2018) Novelty and usefulness trade-off: cultural cognitive differences and creative idea evaluation. J Cross-Cult Psychol 49(2):171–198
McGuire WJ (1973) The yin and yang of progress in social psychology: seven koan. J Personal Soc Psychol 26(3):446–456
Miron-Spektor E, Beenen G (2015) Motivating creativity: The effects of sequential and simultaneous learning and performance achievement goals on product novelty and usefulness. Organ Behav Hum Decis Process 127:53–65
Nisbett RE, Peng K, Choi I, Norenzayan A (2001) Culture and systems of thought: holistic versus analytic cognition. Psychol Rev 108(2):291–310
Noy S, Zhang W (2023) Experimental evidence on the productivity effects of generative artificial intelligence. Science 381:187–192
Oleinik A (2019) What are neural networks not good at? On artificial creativity. Big Data Soc 6(1):2053951719839433
Otu A, Charles CH, Yaya S (2020) Mental health and psychosocial well-being during the COVID-19 pandemic: the invisible elephant in the room. Int J Ment Health Syst 14:1–5
Pan S, Luo L, Wang Y, Chen C, Wang J & Wu X (2024) Unifying large language models and knowledge graphs: a roadmap. IEEE Transactions on Knowledge and Data Engineering 36(7):3580–3599
Rubin DB (2005) Causal inference using potential outcomes: design, modeling, decisions. J Am Stat Assoc 100(469):322–331
Sanderson K (2023) GPT-4 is here: what scientists think. Nature 615(7954):773
Seligman ME, Csikszentmihalyi M (2000) Positive psychology: an introduction. Am Psychol 55(1):5–14
Shah DV, Cappella JN, Neuman WR (2015) Big data, digital media, and computational social science: possibilities and perils. Ann Am Acad Political Soc Sci 659(1):6–13
Shardlow M, Batista-Navarro R, Thompson P, Nawaz R, McNaught J, Ananiadou S (2018) Identification of research hypotheses and new knowledge from scientific literature. BMC Med Inform Decis Mak 18(1):1–13
Shin H, Kim K, Kogler DF (2022) Scientific collaboration, research funding, and novelty in scientific knowledge. PLoS ONE 17(7):e0271678
Thomas RP, Dougherty MR, Sprenger AM, Harbison J (2008) Diagnostic hypothesis generation and human judgment. Psychol Rev 115(1):155–185
Thomer AK, Wickett KM (2020) Relational data paradigms: what do we learn by taking the materiality of databases seriously? Big Data Soc 7(1):2053951720934838
Thompson WH, Skau S (2023) On the scope of scientific hypotheses. R Soc Open Sci 10(8):230607
Tong S, Liang X, Kumada T, Iwaki S (2021) Putative ratios of facial attractiveness in a deep neural network. Vis Res 178:86–99
Uleman JF, Melis RJ, Quax R, van der Zee EA, Thijssen D, Dresler M (2021) Mapping the multicausality of Alzheimer’s disease through group model building. GeroScience 43:829–843
Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(11):2579–2605
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N & Polosukhin I (2017) Attention is all you need. In Advances in Neural Information Processing Systems
Wang H, Fu T, Du Y, Gao W, Huang K, Liu Z (2023) Scientific discovery in the age of artificial intelligence. Nature 620(7972):47–60
Webber J (2012) A programmatic introduction to neo4j. In Proceedings of the 3rd annual conference on systems, programming, and applications: software for humanity p 217–218
Williams K, Berman G, Michalska S (2023) Investigating hybridity in artificial intelligence research. Big Data Soc 10(2):20539517231180577
Wu S, Koo M, Blum L, Black A, Kao L, Scalzo F & Kurtz I (2023) A comparative study of open-source large language models, GPT-4 and Claude 2: multiple-choice test taking in nephrology. arXiv preprint arXiv:2308.04709
Yu F, Peng T, Peng K, Zheng SX, Liu Z (2016) The Semantic Network Model of creativity: analysis of online social media data. Creat Res J 28(3):268–274
Acknowledgements
The authors thank Dr. Honghong Bai (Radboud University), Dr. ChienTe Wu (The University of Tokyo), Dr. Peng Cheng (Tsinghua University), and Yusong Guo (Tsinghua University) for their great comments on the earlier version of this manuscript. This research has been generously funded by personal contributions, with special acknowledgment to K. Mao. Additionally, he conceived and developed the causality graph and AI hypothesis generation technology presented in this paper from scratch, generated all AI hypotheses, and covered the associated costs. The authors sincerely thank K. Mao for his support, which enabled this research. In addition, K. Peng and S. Tong were partly supported by the Tsinghua University Initiative Scientific Research Program (No. 20213080008), the Self-Funded Project of the Institute for Global Industry, Tsinghua University (202-296-001), the Shuimu Scholars program of Tsinghua University (No. 2021SM157), and the China Postdoctoral International Exchange Program (No. YJ20210266).
Author information
These authors contributed equally: Song Tong, Kai Mao.
Authors and Affiliations
Department of Psychological and Cognitive Sciences, Tsinghua University, Beijing, China
Song Tong & Kaiping Peng
Positive Psychology Research Center, School of Social Sciences, Tsinghua University, Beijing, China
Song Tong, Zhen Huang, Yukun Zhao & Kaiping Peng
AI for Wellbeing Lab, Tsinghua University, Beijing, China
Institute for Global Industry, Tsinghua University, Beijing, China
Kindom KK, Tokyo, Japan
Contributions
Song Tong: Data analysis, Experiments, Writing—original draft & review. Kai Mao: Designed the causality graph methodology, Generated AI hypotheses, Developed hypothesis generation techniques, Writing—review & editing. Zhen Huang: Statistical Analysis, Experiments, Writing—review & editing. Yukun Zhao: Conceptualization, Project administration, Supervision, Writing—review & editing. Kaiping Peng: Conceptualization, Writing—review & editing.
Corresponding authors
Correspondence to Yukun Zhao or Kaiping Peng .
Ethics declarations
Competing interests.
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical approval
In this study, ethical approval was granted by the Institutional Review Board (IRB) of the Department of Psychology at Tsinghua University, China. The Research Ethics Committee documented this approval under the number IRB202306, following an extensive review that concluded on March 12, 2023. This approval indicates the research’s strict compliance with the IRB’s guidelines and regulations, ensuring ethical integrity and adherence throughout the study.
Informed consent
Before participating, all study participants gave their informed consent. They received comprehensive details about the study’s goals, methods, potential risks and benefits, confidentiality safeguards, and their rights as participants. This process guaranteed that participants were fully informed about the study’s nature and voluntarily agreed to participate, free from coercion or undue influence.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplemental material

Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .
About this article
Cite this article.
Tong, S., Mao, K., Huang, Z. et al. Automating psychological hypothesis generation with AI: when large language models meet causal graph. Humanit Soc Sci Commun 11 , 896 (2024). https://doi.org/10.1057/s41599-024-03407-5
Received : 08 November 2023
Accepted : 25 June 2024
Published : 09 July 2024
DOI : https://doi.org/10.1057/s41599-024-03407-5
Machine Learning as a Tool for Hypothesis Generation
While hypothesis testing is a highly formalized activity, hypothesis generation remains largely informal. We propose a systematic procedure to generate novel hypotheses about human behavior, which uses the capacity of machine learning algorithms to notice patterns people might not. We illustrate the procedure with a concrete application: judge decisions about who to jail. We begin with a striking fact: The defendant's face alone matters greatly for the judge's jailing decision. In fact, an algorithm given only the pixels in the defendant's mugshot accounts for up to half of the predictable variation. We develop a procedure that allows human subjects to interact with this black-box algorithm to produce hypotheses about what in the face influences judge decisions. The procedure generates hypotheses that are both interpretable and novel: They are not explained by demographics (e.g. race) or existing psychology research; nor are they already known (even if tacitly) to people or even experts. Though these results are specific, our procedure is general. It provides a way to produce novel, interpretable hypotheses from any high-dimensional dataset (e.g. cell phones, satellites, online behavior, news headlines, corporate filings, and high-frequency time series). A central tenet of our paper is that hypothesis generation is in and of itself a valuable activity, and we hope this encourages future work in this largely "pre-scientific" stage of science.
This is a revised version of Chicago Booth working paper 22-15 “Algorithmic Behavioral Science: Machine Learning as a Tool for Scientific Discovery.” We gratefully acknowledge support from the Alfred P. Sloan Foundation, Emmanuel Roman, and the Center for Applied Artificial Intelligence at the University of Chicago. For valuable comments we thank Andrei Shliefer, Larry Katz and five anonymous referees, as well as Marianne Bertrand, Jesse Bruhn, Steven Durlauf, Joel Ferguson, Emma Harrington, Supreet Kaur, Matteo Magnaricotte, Dev Patel, Betsy Levy Paluck, Roberto Rocha, Evan Rose, Suproteem Sarkar, Josh Schwartzstein, Nick Swanson, Nadav Tadelis, Richard Thaler, Alex Todorov, Jenny Wang and Heather Yang, as well as seminar participants at Bocconi, Brown, Columbia, ETH Zurich, Harvard, MIT, Stanford, the University of California Berkeley, the University of Chicago, the University of Pennsylvania, the 2022 Behavioral Economics Annual Meetings and the 2022 NBER summer institute. For invaluable assistance with the data and analysis we thank Cecilia Cook, Logan Crowl, Arshia Elyaderani, and especially Jonas Knecht and James Ross. This research was reviewed by the University of Chicago Social and Behavioral Sciences Institutional Review Board (IRB20-0917) and deemed exempt because the project relies on secondary analysis of public data sources. All opinions and any errors are of course our own. The views expressed herein are those of the authors and do not necessarily reflect the views of the National Bureau of Economic Research.
Published Versions
Jens Ludwig & Sendhil Mullainathan, 2024. " Machine Learning as a Tool for Hypothesis Generation, " The Quarterly Journal of Economics, vol 139(2), pages 751-827.