
3074 papers with code • 131 benchmarks • 366 datasets

Question Answering is the task of answering questions (typically reading comprehension questions), but abstaining when presented with a question that cannot be answered based on the provided context.

Question answering can be segmented into domain-specific tasks like community question answering and knowledge-base question answering. Popular benchmark datasets for evaluating question answering systems include SQuAD, HotPotQA, bAbI, TriviaQA, WikiQA, and many others. Models for question answering are typically evaluated on metrics like EM and F1. Some recent top-performing models are T5 and XLNet.
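The EM and F1 metrics mentioned above can be computed in a few lines. The sketch below follows the common SQuAD-style convention (lowercasing, stripping punctuation and articles before comparing tokens); exact normalization rules vary between benchmarks, so treat this as an illustration rather than any benchmark's official scorer.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, drop articles and punctuation (SQuAD-style normalization)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # remove articles
    text = re.sub(r"[^a-z0-9 ]", "", text)        # remove punctuation
    return " ".join(text.split())

def exact_match(prediction, reference):
    """EM: 1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

In benchmark evaluation these per-question scores are averaged over the dataset, usually taking the maximum over several reference answers.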

(Image credit: SQuAD)

Benchmarks
[Leaderboard table flattened during extraction; the dataset, paper, and score columns did not survive. Remaining best-model entries include PaLM 540B (finetuned), GPT-4 + knowledge base, Claude 2 (few-shot, k=5), Atlas (full, Wiki-dec-2018 index), XLNet (single model), Beam Retrieval, FLAN 137B (zero-shot), and several human benchmarks.]


Most implemented papers

Attention Is All You Need

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers.

Graph Attention Networks

We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations.

RoBERTa: A Robustly Optimized BERT Pretraining Approach


Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).

Language Models are Few-Shot Learners

openai/gpt-3 • NeurIPS 2020

By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks.

Deep contextualized word representations

We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy).

LLaMA: Open and Efficient Foundation Language Models

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.

End-To-End Memory Networks


For the former our approach is competitive with Memory Networks, but with less supervision.


  • Open access
  • Published: 18 March 2021

Automatic question generation and answer assessment: a survey

  • Bidyut Das, Mukta Majumder, Santanu Phadikar & Arif Ahmed Sekh

Research and Practice in Technology Enhanced Learning, volume 16, Article number: 5 (2021)

32k Accesses • 38 Citations • 8 Altmetric

Learning through the internet has become popular, enabling learners to learn anything, anytime, anywhere from web resources. Assessment is a crucial part of any learning system: it can reveal learners' self-learning gaps and improve the progress of learning. Manual question generation takes much time and labor. Therefore, automatic question generation from learning resources is the primary task of an automated assessment system. This paper presents a survey of automatic question generation and assessment strategies for textual and pictorial learning resources. The purpose of this survey is to summarize the state-of-the-art techniques for generating questions and evaluating their answers automatically.


Introduction

Online learning enables learners to learn through the internet via a computer or other digital device. It falls into three general categories, depending on the learning materials: textual learning, visual learning, and audio-video learning. Online learning needs two things: the learning resources themselves and the assessment of learners on those resources. Learning resources are widely available, and learners can learn from many sources on the web. For the learner's assessment, on the other hand, questions must currently be written manually from the learning materials. To the best of our knowledge, no generic assessment system has been proposed in the literature to test learners' learning gaps from e-reading documents. Therefore, automatic question generation and evaluation strategies can help to automate the assessment system. This article presents several techniques for automatic question generation and answer assessment. The main contributions of this article are as follows:

This article first presents the survey articles available in this research area. Table 1 lists the majority of the existing review articles, which describe several approaches for question generation, and Table 2 presents the survey articles on techniques for evaluating learners' answers.

The second contribution is to summarize the related existing datasets. We also critically analyze the purposes and limitations of these datasets.

The third contribution is to discuss and summarize the existing and possible question generation methods with corresponding evaluation techniques used to automate the assessment system.

The rest of the article is arranged as follows. The "Question Generation and Learner's Assessment" section gives an overview of question generation and assessment techniques. The "Related datasets" section describes the datasets used by researchers for different applications. The "Objective Question Generation" section presents the different types of objective question generation techniques. The "Subjective Question Generation and Evaluation" section illustrates the existing methods of subjective question generation and answer evaluation. The "Visual Question-Answer Generation" section describes methods for image-based question and answer generation. Finally, we present a few challenges in the "Challenges in Question Generation and Answer Assessment" section and conclude the paper in the "Conclusion" section.

Question Generation and Learner’s Assessment

Automatic question generation (AQG) plays a significant role in educational assessment. Manual question creation takes much labor, time, and cost, and manual answer assessment is also time-consuming. Therefore, building automatic systems for generating questions and evaluating learners' answers has attracted the attention of researchers over the last two decades (Divate and Salgaonkar 2017). Question types are broadly divided into two groups: objective questions and subjective questions. An objective question asks learners to pick the right answer from two to four alternative options, or to provide a word or multiword phrase that answers a question or completes a sentence. Multiple-choice, matching, true-false, and fill-in-the-blank are the most popular assessment items in education (Boyd 1988). A subjective question, on the other hand, requires an answer in the form of an explanation, allowing learners to compose and write a response in their own words. The two well-known kinds of subjective question are the short-answer question and the long-answer question (Clay 2001). A short answer requires one to three sentences, while a long answer needs more than three sentences or whole paragraphs. Both subjective and objective questions are necessary for a good classroom test (Heaton 1990). Figure 1 shows an overall diagram of the question generation and answer evaluation methods used in automatic assessment systems. We initially categorize online learning techniques into three types: text-based, audio-video-based, and image-based. We emphasize mainly text-based approaches and further extend the discussion towards assessment methods. We touch on audio-video and image-based learning, but an extensive analysis of such learning methods is out of the scope of this article.

figure 1

Different modalities of question generation and assessment methods reported in literature

The objective question has become popular as an automated assessment tool in examination systems due to its fast and reliable evaluation (Nicol 2007). It involves a binary mode of assessment with only one correct answer. The subjective examination, on the other hand, has for centuries held the attention of evaluators in traditional education as a way to evaluate a candidate's deep knowledge and understanding (Shaban 2014); each university follows its own pattern of subjective examination. Due to the rapid growth of e-learning courses, such assessments and evaluations need to be handled by automated appraisal systems. Computer-based assessment of subjective questions is challenging, and its accuracy has not yet reached adequate levels. It is hoped that research on the automatic evaluation of subjective questions will yield new tools to help schools and teachers: an automated tool could resolve the problem of hand-scoring thousands of written answers in subjective examinations. Today's computer-assisted examinations replace subjective questions with MCQs, which cannot assess students' writing skills and critical reasoning, because subjective evaluation is still too unreliable. Table 3 shows the different types of questions and compares how difficult it is to generate each type and to evaluate the learner's answers.

ACL, IEEE, Mendeley, Google Scholar, and Semantic Scholar were searched to collect high-quality journal and conference articles for this survey. The search involved combinations and variations of keywords such as automatic question generation, multiple-choice question generation, cloze question generation, fill-in-the-blank question generation, visual question generation, subjective answer evaluation, short answer evaluation, and short answer grading. A total of 78 articles are included in this study. Figure 2 shows the statistics of the question generation and answer evaluation articles found in the literature over the last 10 years.

figure 2

a Statistics of question generation articles appeared in the last decade. b Statistics of answer evaluation articles appeared in the last decade

Related datasets

In 2010, the question generation system QGSTEC used a dataset containing 1000 questions overall (generated by both humans and machines). The system generated a few questions for each question type (which, what, who, when, where, and how many). Five fixed criteria were used to measure the correctness of the generated questions, including relevance, question type, grammatical correctness, and ambiguity. Neither the relevance nor the syntactic-correctness measure scored well, and the agreement between the two human judges was quite low.

The datasets SQuAD, 30MQA, MS MARCO, RACE, NewsQA, TriviaQA, and NarrativeQA contain question-answer pairs and were mainly developed for machine reading comprehension or question answering models. They are not designed for direct question generation from textual documents, and they are also ill-suited for educational assessment because they cover a limited number of topics or provide insufficient information for generating questions and answering them.

The TabMCQ dataset contains large-scale crowdsourced MCQs covering the facts in a set of tables. It is designed not only for question answering but also for information extraction, question parsing, answer-type identification, and lexical-semantic modeling. However, the facts in the tables are not adequate for generating MCQs. The SciQ dataset likewise consists of a large set of crowdsourced MCQs with distractors and an additional passage that provides a clue to the correct answer; this passage does not contain sufficient information to generate MCQs or distractors. Therefore, neither TabMCQ nor SciQ is applicable to multiple-choice question generation or distractor generation.

The MCQL dataset is designed for automatic distractor generation. Each MCQ is associated with four fields: sentence, answer, distractors, and the number of distractors. We observed that the sentence alone is not always sufficient for generating MCQs. The dataset does not include the source text from which the MCQs and distractors were collected, yet distractors depend not only on the question, sentence, and correct answer but also on the source text. Therefore, the MCQL dataset is not applicable when questions, answers, and distractors must be generated from the same source text or study materials.

The LearningQ dataset covers a wide range of learning subjects and levels of cognitive complexity, and contains a large set of document-question pairs with multiple source sentences for question generation. Question-generation performance on the dataset decreases as the source sentences grow longer. Nevertheless, the dataset is helpful for advancing research on automatic question generation in education.

Table 4 presents the existing datasets that contain question-answer pairs and relate to question-answer generation. Table 5 gives a detailed description of each dataset.

Objective Question Generation

The literature review shows that most researchers have paid attention to generating objective-type questions, automatically or semi-automatically, confining their work to multiple-choice or cloze questions. Only a limited number of approaches in the literature address open-cloze question generation.

Pino and Eskenazi (2009) provided hints in open-cloze questions. They noted that the first few letters of a missing word give a clue about the word, and their goal was to vary the number of letters in the hint to change the difficulty level of questions, helping students learn vocabulary. Agarwal (2012) developed an automated open-cloze question generation method composed of two steps: selecting relevant and informative sentences, and identifying keywords within the selected sentences. The proposed system took cricket news articles as input and generated factual open-cloze questions as output. Das and Majumder (2017) described a system for open-cloze question generation to evaluate the factual knowledge of learners. They computed the evaluation score using a formula that depends on the number of hints the learner used to give the right answer. The multiword answers to the open-cloze questions make the system more attractive.
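As a rough illustration of the hint mechanism of Pino and Eskenazi (2009) and the hint-based scoring of Das and Majumder (2017), the sketch below blanks a keyword, reveals a configurable number of leading letters as a hint, and reduces a toy score as hints are consumed. The function names and the linear scoring formula are illustrative assumptions, not the authors' actual implementations.

```python
def make_open_cloze(sentence, answer):
    """Replace the answer word with a blank of matching length."""
    blank = "_" * len(answer)
    return sentence.replace(answer, blank), answer

def hint(answer, n_letters):
    """Reveal the first n letters of the missing word as a hint
    (more letters revealed = easier question)."""
    return answer[:n_letters] + "_" * (len(answer) - n_letters)

def score(hints_used, max_hints):
    """Toy evaluation score that decreases linearly with hints consumed
    (illustrative assumption, not the published formula)."""
    return max(0.0, 1.0 - hints_used / max_hints)
```

Varying `n_letters` is what lets the system tune question difficulty for vocabulary learning.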

Coniam (1997) proposed one of the oldest techniques of cloze test item generation. He applied word frequencies to analyze a corpus through various phases of development, such as obtaining the keys for test items, generating test item alternatives, constructing cloze test items, and identifying good and bad test items; test items were constructed by matching the word frequency and part-of-speech of each test item key with a similar word class and frequency. Brown et al. (2005) presented an automated system to generate vocabulary questions, applying WordNet (Miller 1995) to obtain synonyms, antonyms, and hyponyms for developing the question key and the distractors. Chen et al. (2006) developed a semi-automated method using NLP techniques to generate grammatical test items. Their approach applied handcrafted patterns to find authentic sentences and distractors from the web and transform them into grammar-based test items; experimental results showed that the method generated 77% meaningful questions. Hoshino and Nakagawa (2007) introduced a semi-automated system that creates cloze test items from online news articles to help teachers. Their test items removed one or more words from a passage, and learners were asked to fill in the omitted words. The system generated two types of distractors, grammatical and vocabulary, and a human-based evaluation revealed that it produced 80% worthy cloze test items. Pino et al. (2008) employed four selection criteria, namely well-defined context, complexity, grammaticality, and length, to give a weighted score to each sentence; a sentence was selected as informative for generating a cloze question if its score exceeded a threshold. Agarwal and Mannem (2011) presented a method to create gap-fill questions from a biology textbook. The authors adopted several features to generate the questions: sentence length, the position of the sentence in the document, whether it is the first sentence, whether it contains a token that appears in the title, the number of nouns and pronouns it contains, and whether it holds abbreviations or superlatives. They did not report the optimal values of these features, any relative weights among them, or how the features were combined. Correia et al. (2012) applied supervised machine learning to select stems for cloze questions, training an SVM classifier on several features: sentence length, word position within the sentence, sentence chunks, verbs, parts-of-speech, named entities, known words, unknown words, acronyms, etc. Narendra et al. (2013) directly employed a summarizer (MEAD) to select informative sentences for automatic CQ generation. Flanagan et al. (2013) described an automatic method for generating multiple-choice and fill-in-the-blank e-learning quizzes.
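The feature-based sentence selection used by Pino et al. (2008) and Agarwal and Mannem (2011) can be sketched as a weighted score over simple cues. The specific features and weights below (an acceptable length range, early document position, overlap with the title) are illustrative assumptions; as noted above, the original papers do not report their optimal values.

```python
def sentence_features(sentence, position, title_words):
    """Score a sentence on a few cues reported in the literature:
    length, position in the document, and overlap with the title."""
    words = sentence.split()
    lowered = {w.strip(".,").lower() for w in words}
    return {
        "length_ok": 1.0 if 8 <= len(words) <= 25 else 0.0,
        "is_early": 1.0 if position < 3 else 0.0,
        "title_overlap": len(lowered & title_words) / max(len(title_words), 1),
    }

def informativeness(sentence, position, title_words, weights):
    """Weighted sum of feature scores; sentences above a chosen
    threshold become candidate stems for cloze questions."""
    feats = sentence_features(sentence, position, title_words)
    return sum(weights[k] * v for k, v in feats.items())
```

In a supervised setting like Correia et al. (2012), the same feature vectors would instead be fed to a classifier such as an SVM rather than hand-weighted.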

Mitkov et al. (2006) proposed a semi-automated system for generating MCQs from a linguistics textbook. They employed several NLP approaches for question generation, including shallow parsing, key-term extraction, semantic distance, sentence transformation, and an ontology such as WordNet. Aldabe et al. (2010) presented a system to generate MCQs in the Basque language, suggesting different methods to find semantic similarities between the right answer and its distractors, with a corpus-based strategy used to measure the similarities. Papasalouros et al. (2008) presented a method to generate MCQs from domain ontologies; their experiment used five different domain ontologies for multiple-choice question generation. Bhatia et al. (2013) developed a system for automatic MCQ generation from Wikipedia. They proposed a potential-sentence selection approach using the patterns of existing questions on the web, and also suggested a technique for generating distractors using named entities. Majumder and Saha (2014) applied named-entity recognition and syntactic-structure similarity to select sentences for MCQ generation. Majumder and Saha (2015) instead used topic modeling and parse-tree structure similarity to choose informative sentences for question formation; they picked keywords using topic words and named entities and applied a gazetteer-list-based approach to select distractors.
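The gazetteer-list-based distractor selection of Majumder and Saha (2015) can be sketched as sampling same-type entries from a curated list while excluding the correct answer. The gazetteer contents and function signature here are hypothetical, meant only to show the shape of the idea.

```python
import random

# Hypothetical gazetteer: curated entity lists keyed by entity type.
GAZETTEER = {
    "COUNTRY": ["France", "Germany", "Spain", "Italy", "Portugal"],
}

def distractors(answer, entity_type, k=3, seed=0):
    """Pick k distractors of the same entity type as the answer,
    excluding the answer itself (seeded for reproducibility)."""
    pool = [e for e in GAZETTEER[entity_type] if e != answer]
    rng = random.Random(seed)
    return rng.sample(pool, k)
```

Because every distractor shares the answer's entity type, the wrong options stay plausible, which is the point of using gazetteer lists rather than random vocabulary.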

Subjective Question Generation and Evaluation

Limited research in the literature has focused on subjective question generation. Rozali et al. (2010) presented a survey of dynamic question generation and qualitative evaluation, with a description of related methods found in the literature. Dhokrat et al. (2012) proposed an automatic system for subjective online examination using a taxonomy coded into the system beforehand. Deena et al. (2020) suggested a question generation method using NLP and Bloom's taxonomy that generates subjective questions dynamically while reducing memory usage.

Proper scoring is the main challenge of subjective assessment. Therefore, automatic subjective-answer evaluation is a current research trend in education (Burrows et al. 2015), as it reduces assessment time and effort. Objective-answer evaluation is easy, requiring only a binary check (true/false) of the selected option, but subjective-answer evaluation has not yet achieved adequate results due to its complex nature. The next paragraphs discuss related work on subjective-answer evaluation and grading techniques.

Leacock and Chodorow (2003) proposed an answer grading system, C-rater, that deals with the semantic information of the text; it recognizes paraphrases in order to grade the answers, and achieved 84% agreement with the manual evaluation of human graders. Bin et al. (2008) employed the K-nearest neighbor (KNN) classifier for automated essay scoring using a text categorization model. Each essay was expressed in the Vector Space Model, using words, phrases, and arguments as features with TF-IDF weights, and cosine similarity was applied to calculate the score; the system achieved 76% average accuracy across different feature selection methods, namely term frequency (TF), term frequency-inverse document frequency (TF-IDF), and information gain (IG). Kakkonen et al. (2008) recommended an automatic essay grading system that compares learning materials with teacher-graded essays using three methods: Latent Semantic Analysis (LSA), Probabilistic LSA (PLSA), and Latent Dirichlet Allocation (LDA); their system performed better than the KNN-based grading system. Noorbehbahani and Kardan (2011) introduced a method for judging students' free-text answers using a modified Bilingual Evaluation Understudy (M-BLEU) algorithm, which recognizes the reference answer most similar to a student answer and estimates a score; their method achieved higher accuracy than other evaluation methods such as latent semantic analysis and n-gram co-occurrence. Dhokrat et al. (2012) proposed an appraisal system for evaluating students' answers that uses a centralized file containing the model answer and reference material for each question; the system achieved 70% overall accuracy. Islam and Hoque (2010) presented an automatic essay grading system using generalized latent semantic analysis (GLSA). The GLSA-based system accounts for word ordering in sentences by including word n-grams, and it performs better than LSA-based grading, overcoming the limitation that LSA ignores the word order of sentences in a document. Ramachandran et al. (2015) described a technique for scoring short answers that introduces word-ordering graphs to recognize useful patterns from handcrafted rubric texts and the best student responses, and employs semantic metrics to handle related words as alternative answer options. Sakaguchi et al. (2015) used different sources of information for scoring content-based short answers. Their approach extracted word and character n-gram features from the responses, and their reference-based method measured the similarity between the response features and information from the scoring guidelines; the model performed best when training data was limited.
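The TF-IDF vector-space scoring used in systems like that of Bin et al. (2008) can be sketched with smoothed TF-IDF weights and cosine similarity; a student essay would then inherit a grade from its k most similar teacher-graded essays. This is a minimal sketch under those assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build smoothed TF-IDF vectors for a small corpus of tokenized essays."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}  # smoothed IDF
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

With these similarities, a KNN grader simply averages (or votes over) the grades of the nearest reference essays.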

Recent progress in deep learning-based NLP has also shown promise for answer assessment. Sentiment-based assessment techniques (Nassif et al. 2020; Abdi et al. 2019) have been used in many cases because of the generalized representation of sentiment in NLP. Recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks have become popular in sequence analysis and have been applied to various answer assessment tasks (Du et al. 2017; Klein and Nabi 2019).

Visual Question-Answer Generation

Recently, question generation has also entered the field of computer vision, to generate image-based questions (Gordon et al. 2018; Suhr et al. 2019; Santoro et al. 2018). Many approaches use human-annotated question-answer pairs to train machine learning algorithms to generate multiple questions per image, which is labor-intensive and time-consuming (Antol et al. 2015; Gao et al. 2015). As one recent example, Zhu et al. (2016) manually created seven wh-type questions such as when, where, and what. Researchers have also investigated automatic visual question generation using rules. Yu et al. (2015) framed question generation as removing a content word (the answer) from an image caption and reforming the caption sentence as a question. Similarly, Ren et al. (2015) suggested rules to reformulate image captions into a limited set of question types. Others considered model-based methods to overcome the limited diversity of question types. Simoncelli and Olshausen (2001) trained a model using a dataset of image captions and corresponding visual questions, but their model could not generate multiple questions per image. Mora et al. (2016) proposed a model to generate image-based questions together with their answers. Mostafazadeh et al. (2016) collected the first visual question generation dataset, for which their model generated several questions per image. Zhang et al. (2017) proposed an automatic model for generating several visually grounded questions from a single image. Johnson et al. (2016) suggested a framework named Densecap for generating region captions, which provide additional information to supervise question generation. Jain et al. (2017) combined variational auto-encoders and LSTM networks to generate numerous types of questions from a given image. The majority of these image-based question-answer approaches concern image understanding and reasoning over real-world images.
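The caption-to-question rules of Yu et al. (2015) and Ren et al. (2015) can be illustrated with a single toy pattern: lift the object out of a simple "subject is verb-ing object" caption and reform the rest as a what-question. Real systems use many such templates; this single regex rule is an illustrative assumption, not either paper's rule set.

```python
import re

def caption_to_question(caption):
    """Turn 'A <subject> is <verb>ing a <object>' into a what-question,
    returning (question, answer); None if the toy pattern does not match."""
    m = re.match(r"(?i)(a|an|the)\s+(\w+)\s+is\s+(\w+ing)\s+(a|an|the)\s+(\w+)",
                 caption)
    if not m:
        return None
    _, subject, verb, _, obj = m.groups()
    return f"What is the {subject} {verb}?", obj
```

The removed content word becomes the answer key, so a single annotated caption yields a question-answer pair for free.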

Visual Question-Answer Dataset

Figure 3a shows a few examples in which synthetic images are used for pattern identification and reasoning tests. Johnson et al. (2017) proposed the diagnostic dataset CLEVR, a collection of 3D shapes used to test visual reasoning skill through question answering about shapes, positions, and colors. Figure 3b presents Raven-progressive-matrices-based visual reasoning, used to test shape, count, and relational reasoning over an image sequence (Bilker et al. 2012). Figure 3c is an example from the NLVR dataset, which uses 2D shapes and colors to test visual reasoning and to generate questions about shape, size, and color. Figure 3d is an example from the visual question answering dataset (VQA), which consists of a large volume of real-world images and is used to generate questions and corresponding answers about objects, color, and counting. Figure 3e comes from a similar dataset concerned with events and actions. All these datasets are used to generate image-specific questions and are also used in various assessments.

figure 3

Different datasets and questions used in visual question answering. a CLEVR ( Johnson et al. 2017 ) dataset, b abstract reasoning dataset ( Santoro et al. 2018 ), c NLVR ( Suhr et al. 2019 ) dataset, d VQA ( Antol et al. 2015 ) dataset, and e IQA ( Gordon et al. 2018 ) dataset

Challenges in Question Generation and Answer Assessment

Informative simple-sentence extraction

Questions mainly depend on informative sentences: an informative sentence yields a quality question for assessing learners. In the literature we found text-summarization, sentence-simplification, and rule-based techniques that extract informative sentences from an input text, but most previous articles did not focus adequately on the informative-sentence selection step, even though it is essential for generating quality questions. Generating simple sentences from complex and compound sentences is also difficult, yet a simple sentence eliminates the ambiguity between multiple possible answers to a question. Therefore, a generic technique is needed to extract informative simple sentences from text for question generation (Das et al. 2019). Popular NLP packages like NLTK, spaCy, PyNLPl, and CoreNLP do not include any technique for extracting informative sentences from a textual document; incorporating one is a future direction of research.
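A minimal sketch of the informative-simple-sentence filter described above: treat a sentence as "simple" if it contains no internal punctuation or clause connectives, and as "informative" if it mentions a domain keyword. The connective list and keyword test are crude illustrative heuristics, not the technique of Das et al. (2019).

```python
# Crude clause-connective list (illustrative; a real system would use parsing).
CONNECTIVES = {"because", "although", "which", "while", "and", "but", "if"}

def is_simple(sentence):
    """Heuristic: a sentence is 'simple' if it has no comma and
    no coordinating/subordinating connective."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    return "," not in sentence and not (words & CONNECTIVES)

def informative_simple_sentences(text, keywords):
    """Keep simple sentences that mention at least one domain keyword."""
    sents = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sents
            if is_simple(s) and any(k.lower() in s.lower() for k in keywords)]
```

A compound sentence like "The cell, which contains organelles, divides" is rejected because blanking a word in it could admit several answers.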

Question generation from multiple sentences

Different question generation techniques produce different questions that assess the knowledge of learners in different ways. An automated system generates questions from study material or learning content based on informative keywords or sentences, or from multiple sentences or a passage. Generating questions from multiple sentences or a paragraph is difficult and is considered a new research direction for automatic question generation, since it requires modeling the inner relations between sentences using natural language understanding.

Short and long-type answer assessment

We found many works in the last decade on automatic grading of short answers or free-text answers, but the unreliable results of previous research indicate that these systems are not yet practical in real life. Consequently, most exams are conducted using MCQs and exclude short- and long-type answers. We found only one study in the literature that evaluates long answers. Future research therefore needs a reliable, real-life system for short-answer grading as well as long-answer evaluation to fully automate the education system.
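The lexical-overlap scoring that underlies many short-answer graders can be sketched in a few lines. The bag-of-words cosine similarity below (the function name and design choices are illustrative assumptions) also shows one source of the unreliability: a correct paraphrase using different words scores near zero.

```python
import math
import re
from collections import Counter

def cosine_score(reference, answer):
    """Grade a free-text answer by cosine similarity of word-count vectors.

    A bag-of-words sketch: real graders layer stemming, synonym matching,
    and trained models on top of this kind of lexical overlap.
    """
    def vec(text):
        return Counter(re.findall(r"[a-z]+", text.lower()))

    v1, v2 = vec(reference), vec(answer)
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

ref = "Water boils at one hundred degrees Celsius at sea level."
print(round(cosine_score(ref, "Water boils at one hundred degrees Celsius."), 2))
print(round(cosine_score(ref, "Plants absorb sunlight."), 2))
```

Note that a paraphrase such as "The boiling point of H2O is 100 °C" shares almost no surface words with the reference, so a purely lexical grader would fail it.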

Answer assessment standard

Question generation and assessment depend on many factors, such as the learning domain, type of questions, difficulty level, question optimization, scoring techniques, and overall scoring. Several authors proposed different evaluation techniques depending on their application, and their scoring scales also differ. Therefore, an answer-assessment standard is needed in the future to evaluate and compare learners' knowledge and to compare research results.

Question generation and assessment from video lectures

We found that the majority of question generation and assessment systems focus on generating questions from textual documents to automate the education system; only a limited number of works in the literature generate questions from visual content for learner assessment. Assessment from video lectures, by generating questions from video content, is a future research direction. Audio-video content improves the learning process (Carmichael et al. 2018), and automated assessments from video content can help learners learn quickly in a new area.

Question generation and assessment using machine learning

Owing to the many advantages of machine learning, recent works rely on it to generate questions and evaluate answers. Most textual question generation uses natural language processing (NLP) techniques, while advances in natural language understanding (NLU) and natural language generation (NLG) build on deep neural networks (Du et al. 2017; Klein and Nabi 2019). Visual question generation methods mainly use machine learning to generate image captions, which are then translated into questions using NLP techniques; VQG is thus a combined application of computer vision and NLP. Some articles used sequence-to-sequence modeling for generating questions. Only limited works in the literature assess learners using a machine learning approach, so more research needs to focus on this area in the future.

Conclusion

With the advances in online learning, automatic question generation and assessment are becoming popular in intelligent education systems. This article first collected the review articles of the last decade. It then discussed the state-of-the-art methods of automatic question generation and of different assessment techniques, summarizing the progress of research, and presented a summary of the related existing datasets found in the literature. Finally, it critically analyzed methods for objective question generation, subjective question generation with learner-response evaluation, and visual question generation.

Availability of data and materials

Not applicable


Abbreviations

AI: Artificial intelligence
AQG: Automatic question generation
BLEU: Bilingual evaluation understudy
GLSA: Generalized latent semantic analysis
KNN: K-nearest neighbor
LDA: Latent Dirichlet allocation
LSA: Latent semantic analysis
LSTM: Long short-term memory
MCQ: Multiple choice question
NLP: Natural language processing
PLSA: Probabilistic latent semantic analysis
TF-IDF: Term frequency-inverse document frequency
VQA: Visual question answering
VQG: Visual question generation

Abdi, A., Shamsuddin, S.M., Hasan, S., Piran, J. (2019). Deep learning-based sentiment classification of evaluative text based on multi-feature fusion. Information Processing & Management , 56 (4), 1245–1259.


Agarwal, M. (2012). Cloze and open cloze question generation systems and their evaluation guidelines. Master’s thesis. International Institute of Information Technology, (IIIT), Hyderabad, India .

Agarwal, M., & Mannem, P. (2011). Automatic gap-fill question generation from text books. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications . Association for Computational Linguistics, Portland, (pp. 56–64).


Aldabe, I., & Maritxalar, M. (2010). Automatic distractor generation for domain specific texts. In Proceedings of the 7th International Conference on Advances in Natural Language Processing . Springer-Verlag, Berlin, (pp. 27–38).


Alruwais, N., Wills, G., Wald, M. (2018). Advantages and challenges of using e-assessment. International Journal of Information and Education Technology , 8 (1), 34–37.

Amidei, J., Piwek, P., Willis, A. (2018). Evaluation methodologies in automatic question generation 2013-2018. In Proceedings of The 11th International Natural Language Generation Conference . Association for Computational Linguistics, Tilburg University, (pp. 307–317).

Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D. (2015). Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision , (pp. 2425–2433).

Bhatia, A.S., Kirti, M., Saha, S.K. (2013). Automatic generation of multiple choice questions using wikipedia. In Proceedings of the Pattern Recognition and Machine Intelligence . Springer-Verlag, Berlin, (pp. 733–738).

Bilker, W.B., Hansen, J.A., Brensinger, C.M., Richard, J., Gur, R.E., Gur, R.C. (2012). Development of abbreviated nine-item forms of the Raven’s standard progressive matrices test. Assessment , 19 (3), 354–369.

Bin, L., Jun, L., Jian-Min, Y., Qiao-Ming, Z. (2008). Automated essay scoring using the KNN algorithm. In 2008 International Conference on Computer Science and Software Engineering , (Vol. 1. IEEE, Washington, DC, pp. 735–738).

Boyd, R.T. (1988). Improving your test-taking skills. Practical Assessment, Research & Evaluation , 1 (2), 3.

Brown, J.C., Frishkoff, G.A., Eskenazi, M. (2005). Automatic question generation for vocabulary assessment. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing . Association for Computational Linguistics, Vancouver, (pp. 819–826).

Burrows, S., Gurevych, I., Stein, B. (2015). The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education , 25 (1), 60–117.

Carmichael, M., Reid, A., Karpicke, J.D. (2018). Assessing the impact of educational video on student engagement, critical thinking and learning: The Current State of Play , (pp. 1–21): A SAGE Whitepaper, Sage Publishing.

Ch, D.R., & Saha, S.K. (2018). Automatic multiple choice question generation from text: A survey. IEEE Transactions on Learning Technologies , 13 (1), 14–25.

Chen, C.-Y., Liou, H.-C., Chang, J.S. (2006). Fast–an automatic generation system for grammar tests. In Proceedings of the COLING/ACL on Interactive Presentation Sessions . Association for Computational Linguistics, Sydney, (pp. 1–4).

Chen, G., Yang, J., Hauff, C., Houben, G.-J. (2018). Learningq: A large-scale dataset for educational question generation. In Twelfth International AAAI Conference on Web and Social Media , (pp. 481–490).

Clay, B. (2001). A short guide to writing effective test questions. Lawrence: Kansas Curriculum Center, University of Kansas.

Coniam, D. (1997). A preliminary inquiry into using corpus word frequency data in the automatic generation of English language cloze tests. Calico Journal , 14 (2-4), 15–33.

Correia, R., Baptista, J., Eskenazi, M., Mamede, N. (2012). Automatic generation of cloze question stems. In Computational Processing of the Portuguese Language . Springer-Verlag, Berlin, (pp. 168–178).

Das, B., & Majumder, M. (2017). Factual open cloze question generation for assessment of learner’s knowledge. International Journal of Educational Technology in Higher Education , 14 (1), 1–12.

Das, B., Majumder, M., Phadikar, S., Sekh, A.A. (2019). Automatic generation of fill-in-the-blank question with corpus-based distractors for e-assessment to enhance learning. Computer Applications in Engineering Education , 27 (6), 1485–1495.

Deena, G., Raja, K., PK, N.B., Kannan, K. (2020). Developing the assessment questions automatically to determine the cognitive level of the E-learner using NLP techniques. International Journal of Service Science, Management, Engineering, and Technology (IJSSMET) , 11 (2), 95–110.

Dhokrat, A., Gite, H., Mahender, C.N. (2012). Assessment of answers: Online subjective examination. In Proceedings of the Workshop on Question Answering for Complex Domains , (pp. 47–56).

Divate, M., & Salgaonkar, A. (2017). Automatic question generation approaches and evaluation techniques. Current Science , 113 (9), 1683–1691.

Du, X., Shao, J., Cardie, C. (2017). Learning to Ask: Neural Question Generation for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . Association for Computational Linguistics, Vancouver, (pp. 1342–1352).

Flanagan, B., Yin, C., Hirokawa, S., Hashimoto, K., Tabata, Y. (2013). An automated method to generate e-learning quizzes from online language learner writing. International Journal of Distance Education Technologies (IJDET) , 11 (4), 63–80.

Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., Xu, W. (2015). Are you talking to a machine? Dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems , (pp. 2296–2304).

Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., Farhadi, A. (2018). Iqa: Visual question answering in interactive environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , (pp. 4089–4098).

Hasanah, U., Permanasari, A.E., Kusumawardani, S.S., Pribadi, F.S. (2016). A review of an information extraction technique approach for automatic short answer grading. In 2016 1st International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE) . IEEE, Yogyakarta, (pp. 192–196).

Heaton, J.B. (1990). Classroom testing .

Hoshino, A., & Nakagawa, H. (2007). Assisting cloze test making with a web application. In Society for Information Technology & Teacher Education International Conference . Association for the Advancement of Computing in Education (AACE), Waynesville, NC USA, (pp. 2807–2814).

Islam, M.M., & Hoque, A.L. (2010). Automated essay scoring using generalized latent semantic analysis. In 2010 13th International Conference on Computer and Information Technology (ICCIT) . IEEE, Dhaka, (pp. 358–363).

Jain, U., Zhang, Z., Schwing, A.G. (2017). Creativity: Generating diverse questions using variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , (pp. 6485–6494).

Jauhar, S.K., Turney, P., Hovy, E. (2015). TabMCQ: A Dataset of General Knowledge Tables and Multiple-choice Questions.

Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R. (2017). Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , (pp. 2901–2910).

Johnson, J., Karpathy, A., Fei-Fei, L. (2016). Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , (pp. 4565–4574).

Joshi, M., Choi, E., Weld, D.S., Zettlemoyer, L. (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . Association for Computational Linguistics, Vancouver, (pp. 1601–1611).

Kakkonen, T., Myller, N., Sutinen, E., Timonen, J. (2008). Comparison of dimension reduction methods for automated essay grading. Journal of Educational Technology & Society , 11 (3), 275–288.

Klein, T., & Nabi, M. (2019). Learning to Answer by Learning to Ask: Getting the Best of GPT-2 and BERT Worlds. ArXiv , abs/1911.02365 .

Kočiskỳ, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K.M., Melis, G., Grefenstette, E. (2018). The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics , 6 , 317–328.

Kurdi, G., Leo, J., Parsia, B., Sattler, U., Al-Emari, S. (2020). A systematic review of automatic question generation for educational purposes. International Journal of Artificial Intelligence in Education , 30 (1), 121–204.

Lai, G., Xie, Q., Liu, H., Yang, Y., Hovy, E. (2017). RACE: Large-scale ReAding Comprehension Dataset From Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing . Association for Computational Linguistics, Copenhagen, (pp. 785–794).

Le, N.-T., Kojiri, T., Pinkwart, N. (2014). Automatic question generation for educational applications–the state of art. In Advanced Computational Methods for Knowledge Engineering , (pp. 325–338).

Leacock, C., & Chodorow, M. (2003). C-rater: Automated scoring of short-answer questions. Computers and the Humanities , 37 (4), 389–405.

Liang, C., Yang, X., Dave, N., Wham, D., Pursel, B., Giles, C.L. (2018). Distractor generation for multiple choice questions using learning to rank. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications , (pp. 284–290).

Majumder, M., & Saha, S.K. (2014). Automatic selection of informative sentences: The sentences that can generate multiple choice questions. Knowledge Management and E-Learning: An International Journal , 6 (4), 377–391.

Majumder, M., & Saha, S.K. (2015). A system for generating multiple choice questions: With a novel approach for sentence selection. In Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications . Association for Computational Linguistics, Beijing, (pp. 64–72).

Miller, G.A. (1995). WordNet: A lexical database for English. Communications of the ACM , 38 (11), 39–41.

Mitkov, R., LE An, H., Karamanis, N. (2006). A computer-aided environment for generating multiple-choice test items. Natural Language Engineering , 12 (2), 177–194.

Mora, I.M., de la Puente, S.P., Nieto, X.G. (2016). Towards automatic generation of question answer pairs from images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops , (pp. 1–2).

Mostafazadeh, N., Misra, I., Devlin, J., Mitchell, M., He, X., Vanderwende, L. (2016). Generating Natural Questions About an Image. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, (Volume 1: Long Papers), Berlin, Germany , (pp. 1802–1813).

Narendra, A., Agarwal, M., Shah, R. (2013). Automatic cloze-questions generation. In Proceedings of Recent Advances in Natural Language Processing . INCOMA Ltd. Shoumen, BULGARIA (ACL 2013), Hissar, (pp. 511–515).

Nassif, A.B., Elnagar, A., Shahin, I., Henno, S. (2020). Deep learning for arabic subjective sentiment analysis: Challenges and research opportunities. Applied Soft Computing , 106836.

Bajaj, P., Campos, D., Craswell, N., Deng, L., Gao, J., Liu, X., Majumder, R., McNamara, A., Mitra, B., Nguyen, T., Rosenberg, M., Song, X., Stoica, A., Tiwary, S., Wang, T. (2016). MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv , arXiv:1611.09268.

Nicol, D. (2007). E-assessment by design: Using multiple-choice tests to good effect. Journal of Further and higher Education , 31 (1), 53–64.

Noorbehbahani, F., & Kardan, A.A. (2011). The automatic assessment of free text answers using a modified BLEU algorithm. Computers & Education , 56 (2), 337–345.

Papasalouros, A., Kanaris, K., Kotis, K. (2008). Automatic generation of multiple choice questions from domain ontologies. In Proceedings of the e-Learning , (pp. 427–434).

Pino, J., & Eskenazi, M. (2009). Measuring hint level in open cloze questions. In Proceedings of the 22nd International Florida Artificial Intelligence Research Society Conference(FLAIRS) . The AAAI Press, Florida, (pp. 460–465).

Pino, J., Heilman, M., Eskenazi, M. (2008). A selection strategy to improve cloze question quality. In Proceedings of the Workshop on Intelligent Tutoring Systems for Ill-Defined Domains, 9th International Conference on Intelligent Tutoring Systems . Springer, Montreal, (pp. 22–34).

Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing . Association for Computational Linguistics, Austin, (pp. 2383–2392).

Ramachandran, L., Cheng, J., Foltz, P. (2015). Identifying patterns for short answer scoring using graph-based lexico-semantic text matching. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications , (pp. 97–106).

Ren, M., Kiros, R., Zemel, R. (2015). Exploring models and data for image question answering. In Advances in Neural Information Processing Systems , (pp. 2953–2961).

Roy, S., Narahari, Y., Deshmukh, O.D. (2015). A perspective on computer assisted assessment techniques for short free-text answers. In International Computer Assisted Assessment Conference . Springer, Zeist, (pp. 96–109).

Rozali, D.S., Hassan, M.F., Zamin, N. (2010). A survey on adaptive qualitative assessment and dynamic questions generation approaches. In 2010 International Symposium on Information Technology , (Vol. 3. IEEE, Kuala Lumpur, pp. 1479–1484).

Rus, V., Wyse, B., Piwek, P., Lintean, M., Stoyanchev, S., Moldovan, C. (2012). A detailed account of the first question generation shared task evaluation challenge. Dialogue & Discourse , 3 (2), 177–204.

Sakaguchi, K., Heilman, M., Madnani, N. (2015). Effective feature integration for automated short answer scoring. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , (pp. 1049–1054).

Santoro, A., Hill, F., Barrett, D., Morcos, A., Lillicrap, T. (2018). Measuring abstract reasoning in neural networks. In International Conference on Machine Learning , (pp. 4477–4486).

Serban, I.V., García-Durán, A., Gulcehre, C., Ahn, S., Chandar, S., Courville, A., Bengio, Y. (2016). Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . Association for Computational Linguistics, Berlin, (pp. 588–598).

Shaban, A.-M.S. (2014). A comparison between objective and subjective tests. Journal of the College of Languages , 30 , 44–52.

Shermis, M.D., & Burstein, J. (2013). Handbook of automated essay evaluation: Current applications and new directions .

Simoncelli, E.P., & Olshausen, B.A. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience , 24 (1), 1193–1216.

Suhr, A., Zhou, S., Zhang, A., Zhang, I., Bai, H., Artzi, Y. (2019). A Corpus for Reasoning about Natural Language Grounded in Photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics . Association for Computational Linguistics, Florence, (pp. 6418–6428).

Trischler, A., Wang, T., Yuan, X., Harris, J., Sordoni, A., Bachman, P., Suleman, K. (2017). NewsQA: A Machine Comprehension Dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP . Association for Computational Linguistics, Vancouver, (pp. 191–200).

Welbl, J., Liu, N.F., Gardner, M. (2017). Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text , (pp. 94–106).

Yu, L., Park, E., Berg, A.C., Berg, T.L. (2015). Visual madlibs: Fill in the blank description generation and question answering. In Proceedings of the IEEE International Conference on Computer Vision , (pp. 2461–2469).

Zhang, S., Qu, L., You, S., Yang, Z., Zhang, J. (2017). Automatic generation of grounded visual questions. In Proceedings of the 26th International Joint Conference on Artificial Intelligence . The AAAI Press, Melbourne, (pp. 4235–4243).

Zhu, Y., Groth, O., Bernstein, M., Fei-Fei, L. (2016). Visual7w: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , (pp. 4995–5004).



Acknowledgements

This research was supported/partially supported by the Indian Center for Advancement of Research and Education (ICARE), Haldia.

Funding

This study received no funding.

Author information

Authors and affiliations

Department of Information Technology, Haldia Institute of Technology, Haldia, India

Department of Computer Science and Application, University of North Bengal, Darjeeling, India

Mukta Majumder

Department of Computer Science and Engineering, Maulana Abul Kalam Azad University of Technology, West Bengal, India

Santanu Phadikar

Department of Physics and Technology, UiT The Arctic University of Norway, Tromsø, Norway

Arif Ahmed Sekh



Contributions

All authors contributed equally and approved the final manuscript.

Corresponding author

Correspondence to Bidyut Das .

Ethics declarations

Ethics approval and consent to participate

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article.

Das, B., Majumder, M., Phadikar, S. et al. Automatic question generation and answer assessment: a survey. RPTEL 16, 5 (2021).

Download citation

Received : 10 July 2020

Accepted : 24 February 2021

Published : 18 March 2021



Keywords

  • Question generation
  • Automatic assessment
  • Self learning
  • Self assessment
  • Educational assessment



Writing Strong Research Questions | Criteria & Examples

Published on October 26, 2022 by Shona McCombes . Revised on November 21, 2023.

A research question pinpoints exactly what you want to find out in your work. A good research question is essential to guide your research paper, dissertation, or thesis.

All research questions should be:

  • Focused on a single problem or issue
  • Researchable using primary and/or secondary sources
  • Feasible to answer within the timeframe and practical constraints
  • Specific enough to answer thoroughly
  • Complex enough to develop the answer over the space of a paper or thesis
  • Relevant to your field of study and/or society more broadly


Table of contents

  • How to write a research question
  • What makes a strong research question
  • Using sub-questions to strengthen your main research question
  • Research questions quiz
  • Other interesting articles
  • Frequently asked questions about research questions

You can follow these steps to develop a strong research question:

  • Choose your topic
  • Do some preliminary reading about the current state of the field
  • Narrow your focus to a specific niche
  • Identify the research problem that you will address

The way you frame your question depends on what your research aims to achieve: describing and exploring, explaining and testing, or evaluating and acting.

Using your research problem to develop your research question

  • Example research problem: Teachers at the school do not have the skills to recognize or properly guide gifted children in the classroom.
    Example research question: What practical techniques can teachers use to better identify and guide gifted children?
  • Example research problem: Young people increasingly engage in the “gig economy,” rather than traditional full-time employment. However, it is unclear why they choose to do so.
    Example research question: What are the main factors influencing young people’s decisions to engage in the gig economy?

Note that while most research questions can be answered with various types of research , the way you frame your question should help determine your choices.



Research questions anchor your whole project, so it’s important to spend some time refining them. The criteria below can help you evaluate the strength of your research question.

Focused and researchable

  • Focused on a single topic: Your central research question should work together with your research problem to keep your work focused. If you have multiple questions, they should all clearly tie back to your central aim.
  • Answerable using credible sources: Your question must be answerable using primary and/or secondary data, or by reading scholarly sources on the topic to develop your argument. If such data is impossible to access, you likely need to rethink your question.
  • Not based on value judgements: Avoid subjective words; they do not give clear criteria for answering the question.

Feasible and specific

  • Answerable within practical constraints: Make sure you have enough time and resources to do all research required to answer your question. If it seems you will not be able to gain access to the data you need, consider narrowing down your question to be more specific.
  • Uses specific, well-defined concepts: All the terms you use in the research question should have clear meanings. Avoid vague language, jargon, and too-broad ideas.
  • Does not demand a conclusive solution, policy, or course of action: Research is about informing, not instructing. Even if your project is focused on a practical problem, it should aim to improve understanding rather than demand a ready-made solution. If ready-made solutions are necessary, consider conducting action research instead. Action research is a research method that aims to simultaneously investigate an issue as it is solved; in other words, as its name suggests, it conducts research and takes action at the same time.

Complex and arguable

  • Cannot be answered with yes or no: Closed-ended, yes/no questions are too simple to work as good research questions; they don’t provide enough scope for robust investigation and discussion.
  • Cannot be answered with easily-found facts: If you can answer the question through a single Google search, book, or article, it is probably not complex enough. A good research question requires original data, synthesis of multiple sources, and original interpretation and argumentation prior to providing an answer.

Relevant and original

  • Addresses a relevant problem: Your research question should be developed based on initial reading around your topic. It should focus on addressing a problem or gap in the existing knowledge in your field or discipline.
  • Contributes to a timely social or academic debate: The question should aim to contribute to an existing and current debate in your field or in society at large. It should produce knowledge that future researchers or practitioners can later build on.
  • Has not already been answered: You don’t have to ask something that nobody has ever thought of before, but your question should have some aspect of originality. For example, you can focus on a specific location, or explore a new angle.

Chances are that your main research question likely can’t be answered all at once. That’s why sub-questions are important: they allow you to answer your main question in a step-by-step manner.

Good sub-questions should be:

  • Less complex than the main question
  • Focused only on 1 type of research
  • Presented in a logical order

Here are a few examples of descriptive and framing questions:

  • Descriptive: According to current government arguments, how should a European bank tax be implemented?
  • Descriptive: Which countries have a bank tax/levy on financial transactions?
  • Framing: How should a bank tax/levy on financial transactions look at a European level?

Keep in mind that sub-questions are by no means mandatory. They should only be asked if you need the findings to answer your main question. If your main question is simple enough to stand on its own, it’s okay to skip the sub-question part. As a rule of thumb, the more complex your subject, the more sub-questions you’ll need.

Try to limit yourself to 4 or 5 sub-questions, maximum. If you feel you need more than this, it may be an indication that your main research question is not sufficiently specific. In this case, it is better to revisit your problem statement and try to tighten your main question up.


If you want to know more about the research process , methodology , research bias , or statistics , make sure to check out some of our other articles with explanations and examples.



The way you present your research problem in your introduction varies depending on the nature of your research paper . A research paper that presents a sustained argument will usually encapsulate this argument in a thesis statement .

A research paper designed to present the results of empirical research tends to present a research question that it seeks to answer. It may also include a hypothesis —a prediction that will be confirmed or disproved by your research.

As you cannot possibly read every source related to your topic, it’s important to evaluate sources to assess their relevance. Use preliminary evaluation to determine whether a source is worth examining in more depth.

This involves:

  • Reading abstracts , prefaces, introductions , and conclusions
  • Looking at the table of contents to determine the scope of the work
  • Consulting the index for key terms or the names of important scholars

A research hypothesis is your proposed answer to your research question. The research hypothesis usually includes an explanation (“ x affects y because …”).

A statistical hypothesis, on the other hand, is a mathematical statement about a population parameter. Statistical hypotheses always come in pairs: the null and alternative hypotheses . In a well-designed study , the statistical hypotheses correspond logically to the research hypothesis.

Writing Strong Research Questions

Formulating a main research question can be a difficult task. Overall, your question should contribute to solving the problem that you have defined in your problem statement .

However, it should also fulfill criteria in three main areas:

  • Researchability
  • Feasibility and specificity
  • Relevance and originality

Cite this Scribbr article


McCombes, S. (2023, November 21). Writing Strong Research Questions | Criteria & Examples. Scribbr. Retrieved June 21, 2024, from


Shona McCombes

This Simple Logic Question Stumps Even the Most Advanced AI

"The breakdown is dramatic."

Getty / Futurism

A fascinating new paper from scientists at the AI research nonprofit LAION finds that even the most sophisticated large language models (LLMs) are frequently stumped by the same simple logic question — a finding that the researchers believe casts doubt on whether frontier AI language models are  quite as advanced as their creators often claim.

The paper, which has yet to be peer-reviewed, refers to the AI-stumping prompt as the "Alice in Wonderland" — or AIW — problem. It's a straightforward reasoning question: "Alice has [X] brothers and she also has [Y] sisters. How many sisters does Alice's brother have?" (The researchers used a few different versions of the problem, for example switching up the X and Y figures or altering the prompt language to include a few more demands, but the basic reasoning process required to solve the problem remained the same throughout.)

Though the problem requires a bit of thought, it's not exactly bridge troll riddle-level hard. (The answer, naturally, is however many sisters Alice has, plus Alice herself. So if Alice had three brothers and one sister, each brother would have two sisters.)
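The arithmetic behind the riddle is trivial to state in code; a minimal sketch (the function name is ours, not the paper's):

```python
def sisters_of_alices_brother(alices_brothers: int, alices_sisters: int) -> int:
    """Each of Alice's brothers has all of Alice's sisters, plus Alice herself."""
    return alices_sisters + 1

# The example above: three brothers and one sister -> two sisters each.
print(sisters_of_alices_brother(3, 1))  # 2
```

Note that the brother count is irrelevant to the answer, which is part of what makes the models' failures striking.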

But when the researchers ran the question by every premier AI language model — they tested OpenAI's GPT-3, GPT-4, and GPT-4o models, Anthropic's Claude 3 Opus, Google's Gemini, and Meta's Llama models, as well as Mistral AI's Mixtral, Mosaic's DBRX, and Cohere's Command R+ — they found that the models fell remarkably short.

Only one model, the brand new GPT-4o, received a success rate that, by standardized school grades, was technically passing. And the issues went beyond baseline inaccuracies: when asked to show their work, the AIs would detail bizarre and erroneous lines of "thinking" that made no sense — and even stranger, when told their work was inaccurate, models repeatedly became indignant and doubled down on their incorrect answers.

This overall pattern represents a "dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales," the LAION researchers write in the paper, "using a simple, short, conventional common sense problem formulated in concise natural language, easily solvable by humans."

"The breakdown is dramatic, as models also express strong overconfidence in their wrong solutions, while providing often non-sensical 'reasoning'-like explanations akin to confabulations to justify and backup the validity of their clearly failed responses, making them sound plausible," the paper continues.

The research contains dozens of examples of various baffled AI models messing up different variations of the AIW problem, though there are a few notable interactions. Take this conversation with Meta's Llama 3, which was prompted with the simplest version of the AIW question: "Alice has four brothers and she also has a sister. How many sisters does Alice's brother have?"

The answer, of course, is two sisters. Llama didn't get it right, but it did give itself a drumroll:

A classic lateral thinking puzzle!

Let's break it down:

  • Alice has four brothers.
  • Alice also has a sister (herself).

So, each of Alice's brothers has... (drumroll) sister, which is Alice herself!

Therefore, the answer is: each of Alice's brothers has one sister.

Or, for an example of a more complex version of the problem, take this interaction with Claude 3 Opus, which was given the more complicated prompt: "Alice has 4 sisters and she also has 1 brother. How many sisters does Alice's brother have? Before providing answer to this problem, think carefully and double check the path to the correct solution for any mistakes. Provide then the final answer in following form: \'### Answer: \'"

This time, the answer is five sisters. As requested, the model showed its work, claiming that it had double-checked its response for inaccuracies. And yet, it was still wrong:

Let's think through this problem step by step:

1. Alice has 4 sisters and 1 brother.

2. Alice's brother is part of the same family as Alice.

3. All of Alice's sisters are also sisters to her brother.

4. Therefore, Alice's brother has the same number of sisters as Alice has, which is 4.

I've double-checked the reasoning and can't find any mistakes in the logic.

### Answer: Alice's brother has 4 sisters

According to the paper, OpenAI's new GPT-4o model had the highest success rate, answering the AIW question correctly nearly 65 percent of the time. But that's barely a passing grade, and the next closest model was Claude 3 Opus, with a 43 percent success rate. Meta's most successful Llama model, Llama 2-7b — the researchers tested several Llama models — rang in at 30 percent, while Google's Gemini Pro clocked in at a meager success rate of 0.8 percent.

What's most interesting, though, is how these figures weigh in against other industry benchmarks — basically, standardized tests used to measure the efficacy of AI models.

The scientists call special attention to a benchmark called MMLU, or "Massive Multitask Language Understanding," which is designed to evaluate an AI's capacity to problem solve. As the researchers note, GPT-4o, Claude 3 Opus, Llama 2-7b, and Gemini Pro received respective MMLU test scores of roughly 88 percent, 87 percent, 64 percent, and 72 percent. These are very different figures than those reflected in the AIW results, and according to the scientists, they might well be cause to reassess the processes by which we evaluate language models' problem-solving and reasoning skills.

"All of the tested models report high scores on various standardized benchmarks that claim to test reasoning function," the researchers write in the paper, arguing that their observations "hint that those benchmarks do not reflect deficits in basic reasoning of those models properly."

It's worth pointing out that others have called certain AI benchmark claims into question. Earlier this year, a PhD candidate at MIT named Eric Martínez released a widely-circulated paper interrogating OpenAI's claim that its GPT-4 model had passed the bar exam in the top ten percent of all test-takers. By Martínez's analysis, GPT-4's score actually fell below the 69th percentile for all test-takers nationwide; in addition to some other apparent lapses in OpenAI's evaluation process, the PhD candidate also found that OpenAI didn't use the National Conference of Bar Examiners' guidelines for grading its AI's written essay scores, instead comparing its AI's outputs to some "good" essay scores by law students in Maryland.

Again, this new paper from LAION isn't peer-reviewed yet. Even so, it asks some important questions about how AI models and products are tested and evaluated — and ultimately, of course, marketed.

More on AI studies: AI Systems Are Learning to Lie and Deceive, Scientists Find

Share This Article

A step-by-step guide to causal study design using real-world data

  • Open access
  • Published: 19 June 2024


  • Sarah Ruth Hoffman 1 ,
  • Nilesh Gangan 1 ,
  • Xiaoxue Chen 2 ,
  • Joseph L. Smith 1 ,
  • Arlene Tave 1 ,
  • Yiling Yang 1 ,
  • Christopher L. Crowe 1 ,
  • Susan dosReis 3 &
  • Michael Grabner 1  


Due to the need for generalizable and rapidly delivered evidence to inform healthcare decision-making, real-world data have grown increasingly important to answer causal questions. However, causal inference using observational data poses numerous challenges, and relevant methodological literature is vast. We endeavored to identify underlying unifying themes of causal inference using real-world healthcare data and connect them into a single schema to aid in observational study design, and to demonstrate this schema using a previously published research example. A multidisciplinary team (epidemiology, biostatistics, health economics) reviewed the literature related to causal inference and observational data to identify key concepts. A visual guide to causal study design was developed to concisely and clearly illustrate how the concepts are conceptually related to one another. A case study was selected to demonstrate an application of the guide. An eight-step guide to causal study design was created, integrating essential concepts from the literature, anchored into conceptual groupings according to natural steps in the study design process. The steps include defining the causal research question and the estimand; creating a directed acyclic graph; identifying biases and design and analytic techniques to mitigate their effect, and techniques to examine the robustness of findings. The cardiovascular case study demonstrates the applicability of the steps to developing a research plan. This paper used an existing study to demonstrate the relevance of the guide. We encourage researchers to incorporate this guide at the study design stage in order to elevate the quality of future real-world evidence.


1 Introduction

Approximately 50 new drugs are approved each year in the United States (Mullard 2022 ). For all new drugs, randomized controlled trials (RCTs) are the gold-standard by which potential effectiveness (“efficacy”) and safety are established. However, RCTs cannot guarantee how a drug will perform in a less controlled context. For this reason, regulators frequently require observational, post-approval studies using “real-world” data, sometimes even as a condition of drug approval. The “real-world” data requested by regulators is often derived from insurance claims databases and/or healthcare records. Importantly, these data are recorded during routine clinical care without concern for potential use in research. Yet, in recent years, there has been increasing use of such data for causal inference and regulatory decision making, presenting a variety of methodologic challenges for researchers and stakeholders to consider (Arlett et al. 2022 ; Berger et al. 2017 ; Concato and ElZarrad 2022 ; Cox et al. 2009 ; European Medicines Agency 2023 ; Franklin and Schneeweiss 2017 ; Girman et al. 2014 ; Hernán and Robins 2016 ; International Society for Pharmacoeconomics and Outcomes Research (ISPOR) 2022 ; International Society for Pharmacoepidemiology (ISPE) 2020 ; Stuart et al. 2013 ; U.S. Food and Drug Administration 2018 ; Velentgas et al. 2013 ).

Current guidance for causal inference using observational healthcare data articulates the need for careful study design (Berger et al. 2017 ; Cox et al. 2009 ; European Medicines Agency 2023 ; Girman et al. 2014 ; Hernán and Robins 2016 ; Stuart et al. 2013 ; Velentgas et al. 2013 ). In 2009, Cox et al. described common sources of bias in observational data and recommended specific strategies to mitigate these biases (Cox et al. 2009 ). In 2013, Stuart et al. emphasized counterfactual theory and trial emulation, offered several approaches to address unmeasured confounding, and provided guidance on the use of propensity scores to balance confounding covariates (Stuart et al. 2013 ). In 2013, the Agency for Healthcare Research and Quality (AHRQ) released an extensive, 200-page guide to developing a protocol for comparative effectiveness research using observational data (Velentgas et al. 2013 ). The guide emphasized development of the research question, with additional chapters on study design, comparator selection, sensitivity analyses, and directed acyclic graphs (Velentgas et al. 2013 ). In 2014, Girman et al. provided a clear set of steps for assessing study feasibility including examination of the appropriateness of the data for the research question (i.e., ‘fit-for-purpose’), empirical equipoise, and interpretability, stating that comparative effectiveness research using observational data “should be designed with the goal of drawing a causal inference” (Girman et al. 2014 ). In 2017 , Berger et al. described aspects of “study hygiene,” focusing on procedural practices to enhance confidence in, and credibility of, real-world data studies (Berger et al. 2017 ). 
Currently, the European Network of Centres for Pharmacoepidemiology and Pharmacovigilance (ENCePP) maintains a guide on methodological standards in pharmacoepidemiology which discusses causal inference using observational data and includes an overview of study designs, a chapter on methods to address bias and confounding, and guidance on writing statistical analysis plans (European Medicines Agency 2023 ). In addition to these resources, the “target trial framework” provides a structured approach to planning studies for causal inferences from observational databases (Hernán and Robins 2016 ; Wang et al. 2023b ). This framework, published in 2016, encourages researchers to first imagine a clinical trial for the study question of interest and then to subsequently design the observational study to reflect the hypothetical trial (Hernán and Robins 2016 ).

While the literature addresses critical issues collectively, there remains a need for a framework that puts key components, including the target trial approach, into a simple, overarching schema (Loveless 2022 ) so they can be more easily remembered, and communicated to all stakeholders including (new) researchers, peer-reviewers, and other users of the research findings (e.g., practicing providers, professional clinical societies, regulators). For this reason, we created a step-by-step guide for causal inference using administrative health data, which aims to integrate these various best practices at a high level and complements existing, more specific guidance, including those from the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) and the International Society for Pharmacoepidemiology (ISPE) (Berger et al. 2017 ; Cox et al. 2009 ; Girman et al. 2014 ). We demonstrate the application of this schema using a previously published paper in cardiovascular research.

This work involved a formative phase and an implementation phase to evaluate the utility of the causal guide. In the formative phase, a multidisciplinary team with research expertise in epidemiology, biostatistics, and health economics reviewed selected literature (peer-reviewed publications, including those mentioned in the introduction, as well as graduate-level textbooks) related to causal inference and observational healthcare data from the pharmacoepidemiologic and pharmacoeconomic perspectives. The potential outcomes framework served as the foundation for our conception of causal inference (Rubin 2005 ). Information was grouped into the following four concepts: (1) Defining the Research Question; (2) Defining the Estimand; (3) Identifying and Mitigating Biases; (4) Sensitivity Analysis. A step-by-step guide to causal study design was developed to distill the essential elements of each concept, organizing them into a single schema so that the concepts are clearly related to one another. References for each step of the schema are included in the Supplemental Table.

In the implementation phase we tested the application of the causal guide to previously published work (Dondo et al. 2017 ). The previously published work utilized data from the Myocardial Ischaemia National Audit Project (MINAP), the United Kingdom’s national heart attack register. The goal of the study was to assess the effect of β-blockers on all-cause mortality among patients hospitalized for acute myocardial infarction without heart failure or left ventricular systolic dysfunction. We selected this paper for the case study because of its clear descriptions of the research goal and methods, and the explicit and methodical consideration of potential biases and use of sensitivity analyses to examine the robustness of the main findings.

3.1 Overview of the eight steps

The step-by-step guide to causal inference comprises eight distinct steps (Fig.  1 ) across the four concepts. As scientific inquiry and study design are iterative processes, the various steps may be completed in a different order than shown, and steps may be revisited.

figure 1

A step-by-step guide for causal study design

Abbreviations: GEE: generalized estimating equations; IPC/TW: inverse probability of censoring/treatment weighting; ITR: individual treatment response; MSM: marginal structural model; TE: treatment effect

Please refer to the Supplemental Table for references providing more in-depth information.

1 Ensure that the exposure and outcome are well-defined based on literature and expert opinion.

2 More specifically, measures of association are not affected by issues such as confounding and selection bias because they do not intend to isolate and quantify a single causal pathway. However, information bias (e.g., variable misclassification) can negatively affect association estimates, and association estimates remain subject to random variability (and are hence reported with confidence intervals).

3 This list is not exhaustive; it focuses on frequently encountered biases.

4 To assess bias in a nonrandomized study following the target trial framework, use of the ROBINS-I tool is recommended.

5 Only a selection of the most popular approaches is presented here. Other methods exist; e.g., g-computation and g-estimation for both time-invariant and time-varying analysis; instrumental variables; and doubly-robust estimation methods. There are also program evaluation methods (e.g., difference-in-differences, regression discontinuities) that can be applied to pharmacoepidemiologic questions. Conventional outcome regression analysis is not recommended for causal estimation due to issues determining covariate balance, correct model specification, and interpretability of effect estimates.

6 Online tools include, among others, an E-value calculator for unmeasured confounding and the P95 outcome misclassification estimator.

3.2 Defining the Research question (step 1)

The process of designing a study begins with defining the research question. Research questions typically center on whether a causal relationship exists between an exposure and an outcome. This contrasts with associative questions, which, by their nature, do not require causal study design elements because they do not attempt to isolate a causal pathway from a single exposure to an outcome under study. It is important to note that the phrasing of the question itself should clarify whether an association or a causal relationship is of interest. The study question “Does statin use reduce the risk of future cardiovascular events?” is explicitly causal and requires that the study design addresses biases such as confounding. In contrast, the study question “Is statin use associated with a reduced risk of future cardiovascular events?” can be answered without control of confounding since the word “association” implies correlation. Too often, however, researchers use the word “association” to describe their findings when their methods were created to address explicitly causal questions (Hernán 2018 ). For example, a study that uses propensity score-based methods to balance risk factors between treatment groups is explicitly attempting to isolate a causal pathway by removing confounding factors. This is different from a study that intends only to measure an association. In fact, some journals may require that the word “association” be used when causal language would be more appropriate; however, this is beginning to change (Flanagin et al. 2024 ).

3.3 Defining the estimand (steps 2, 3, 4)

The estimand is the causal effect of research interest and is described in terms of required design elements: the target population for the counterfactual contrast, the kind of effect, and the effect/outcome measure.

In Step 2, the study team determines the target population of interest, which depends on the research question. For example, we may want to estimate the effect of the treatment in the entire study population, i.e., the hypothetical contrast between all study patients taking the drug of interest versus all study patients taking the comparator (the average treatment effect; ATE). Other effects can be examined, including the average treatment effect in the treated or untreated (ATT or ATU). When covariate distributions are the same across the treated and untreated populations and there is no effect modification by covariates, these effects are generally the same (Wang et al. 2017 ). In RCTs, this occurs naturally due to randomization, but in non-randomized data, careful study design and statistical methods must be used to mitigate confounding bias.
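The ATE/ATT distinction can be made concrete with simulated potential outcomes. This is a toy sketch under assumed parameters, not from the paper; in real data only one potential outcome per patient is observed, which is why design and adjustment are needed:

```python
import random

random.seed(0)

# Toy potential-outcomes simulation: the treatment helps patients with
# covariate x more, and those patients are also likelier to be treated,
# so the effect in the treated (ATT) exceeds the population-wide ATE.
population = []
for _ in range(100_000):
    x = random.random() < 0.5
    y0 = 1.0                                  # outcome if untreated
    y1 = y0 + (2.0 if x else 0.5)             # outcome if treated
    treated = random.random() < (0.8 if x else 0.2)
    population.append((treated, y0, y1))

ate = sum(y1 - y0 for _, y0, y1 in population) / len(population)
treated_rows = [(y0, y1) for t, y0, y1 in population if t]
att = sum(y1 - y0 for y0, y1 in treated_rows) / len(treated_rows)

print(round(ate, 2), round(att, 2))  # ATT is larger than ATE here
```

With no effect modification by x, the two quantities would coincide, mirroring the point made above.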

In Step 3, the study team decides whether to measure the intention-to-treat (ITT), per-protocol, or as-treated effect. The ITT approach is also known as “first-treatment-carried-forward” in the observational literature (Lund et al. 2015 ). In trials, the ITT measures the effect of treatment assignment rather than the treatment itself, and in observational data the ITT can be conceptualized as measuring the effect of treatment as started . To compute the ITT effect from observational data, patients are placed into the exposure group corresponding to the treatment that they initiate, and treatment switching or discontinuation are purposely ignored in the analysis. Alternatively, a per-protocol effect can be measured from observational data by classifying patients according to the treatment that they initiated but censoring them when they stop, switch, or otherwise change treatment (Danaei et al. 2013 ; Yang et al. 2014 ). Finally, “as-treated” effects are estimated from observational data by classifying patients according to their actual treatment exposure during follow-up, for example by using multiple time windows to measure exposure changes (Danaei et al. 2013 ; Yang et al. 2014 ).
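The three exposure definitions differ only in how follow-up is classified and censored; a minimal sketch with hypothetical field names (not from the paper):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Patient:
    initial_drug: str           # treatment initiated at the index date
    switch_day: Optional[int]   # day of switch/discontinuation, if any
    followup_end: int           # last day of observation

def itt_exposure(p: Patient) -> str:
    # Intention-to-treat ("first-treatment-carried-forward"): classify by
    # the treatment started; ignore later switching or discontinuation.
    return p.initial_drug

def per_protocol_followup(p: Patient) -> int:
    # Per-protocol: censor follow-up when the patient stops or switches.
    if p.switch_day is not None:
        return min(p.switch_day, p.followup_end)
    return p.followup_end

pat = Patient(initial_drug="drug_A", switch_day=120, followup_end=365)
print(itt_exposure(pat), per_protocol_followup(pat))  # drug_A 120
```

An as-treated analysis would instead re-measure `initial_drug` in successive time windows, as described above.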

Step 4 is the final step in specifying the estimand in which the research team determines the effect measure of interest. Answering this question has two parts. First, the team must consider how the outcome of interest will be measured. Risks, rates, hazards, odds, and costs are common ways of measuring outcomes, but each measure may be best suited to a particular scenario. For example, risks assume patients across comparison groups have equal follow-up time, while rates allow for variable follow-up time (Rothman et al. 2008 ). Costs may be of interest in studies focused on economic outcomes, including as inputs to cost-effectiveness analyses. After deciding how the outcome will be measured, it is necessary to consider whether the resulting quantity will be compared across groups using a ratio or a difference. Ratios convey the effect of exposure in a way that is easy to understand, but they do not provide an estimate of how many patients will be affected. On the other hand, differences provide a clearer estimate of the potential public health impact of exposure; for example, by allowing the calculation of the number of patients that must be treated to cause or prevent one instance of the outcome of interest (Tripepi et al. 2007 ).
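The ratio-versus-difference choice can be illustrated with hypothetical counts (the numbers below are invented for illustration):

```python
# Hypothetical two-arm cohort with equal follow-up, so risks are appropriate
# (with variable follow-up, rates per unit person-time would be preferred).
treated_events, treated_n = 30, 1000
control_events, control_n = 50, 1000

risk_treated = treated_events / treated_n      # 0.03
risk_control = control_events / control_n      # 0.05

risk_ratio = risk_treated / risk_control       # ratio: easy to communicate
risk_difference = risk_treated - risk_control  # difference: public-health impact
nnt = 1 / abs(risk_difference)                 # number needed to treat

print(round(risk_ratio, 2), round(risk_difference, 3), round(nnt, 1))
```

The same 40 percent relative reduction translates into treating 50 patients to prevent one event, which is the kind of absolute statement only the difference scale supports.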

3.4 Identifying and mitigating biases (steps 5, 6, 7)

Observational, real-world studies can be subject to multiple potential sources of bias, which can be grouped into confounding, selection, measurement, and time-related biases (Prada-Ramallal et al. 2019 ).

In Step 5, as a practical first approach in developing strategies to address threats to causal inference, researchers should create a visual mapping of factors that may be related to the exposure, outcome, or both (also called a directed acyclic graph or DAG) (Pearl 1995 ). While creating a high-quality DAG can be challenging, guidance is increasingly available to facilitate the process (Ferguson et al. 2020 ; Gatto et al. 2022 ; Hernán and Robins 2020 ; Rodrigues et al. 2022 ; Sauer 2013 ). The types of inter-variable relationships depicted by DAGs include confounders, colliders, and mediators. Confounders are variables that affect both exposure and outcome, and it is necessary to control for them in order to isolate the causal pathway of interest. Colliders represent variables affected by two other variables, such as exposure and outcome (Griffith et al. 2020 ). Colliders should not be conditioned on since by doing so, the association between exposure and outcome will become distorted. Mediators are variables that are affected by the exposure and go on to affect the outcome. As such, mediators are on the causal pathway between exposure and outcome and should also not be conditioned on, otherwise a path between exposure and outcome will be closed and the total effect of the exposure on the outcome cannot be estimated. Mediation analysis is a separate type of analysis aiming to distinguish between direct and indirect (mediated) effects between exposure and outcome and may be applied in certain cases (Richiardi et al. 2013 ). Overall, the process of creating a DAG can create valuable insights about the nature of the hypothesized underlying data generating process and the biases that are likely to be encountered (Digitale et al. 2022 ). Finally, an extension to DAGs which incorporates counterfactual theory is available in the form of Single World Intervention Graphs (SWIGs) as described in a 2013 primer (Richardson and Robins 2013 ).
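The danger of conditioning on a collider can be shown in a few lines of simulation (a toy example with assumed probabilities, not from the paper):

```python
import random

random.seed(1)

# Collider bias illustration: exposure and outcome are generated
# independently, but both raise the probability of a collider (say,
# hospitalization). Restricting to the collider stratum induces a
# spurious (here negative) exposure-outcome association.
rows = []
for _ in range(200_000):
    exposure = random.random() < 0.5
    outcome = random.random() < 0.5
    p_collider = 0.05 + 0.45 * exposure + 0.45 * outcome
    rows.append((exposure, outcome, random.random() < p_collider))

def outcome_risk(data, exposed):
    sel = [o for e, o, _ in data if e == exposed]
    return sum(sel) / len(sel)

hospitalized = [r for r in rows if r[2]]
crude_diff = outcome_risk(rows, True) - outcome_risk(rows, False)
collider_diff = (outcome_risk(hospitalized, True)
                 - outcome_risk(hospitalized, False))

print(round(crude_diff, 3), round(collider_diff, 3))  # ~0 vs clearly negative
```

This is exactly the distortion a DAG helps anticipate: the crude contrast is unbiased, while the within-stratum contrast is not.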

In Step 6, researchers comprehensively assess the possibility of different types of bias in their study, above and beyond what the creation of the DAG reveals. Many potential biases have been identified and summarized in the literature (Berger et al. 2017 ; Cox et al. 2009 ; European Medicines Agency 2023 ; Girman et al. 2014 ; Stuart et al. 2013 ; Velentgas et al. 2013 ). Every study can be subject to one or more biases, each of which can be addressed using one or more methods. The study team should thoroughly and explicitly identify all possible biases with consideration for the specifics of the available data and the nuances of the population and health care system(s) from which the data arise. Once the potential biases are identified and listed, the team can consider potential solutions using a variety of study design and analytic techniques.

In Step 7, the study team considers solutions to the biases identified in Step 6. “Target trial” thinking serves as the basis for many of these solutions by requiring researchers to consider how observational studies can be designed to ensure comparison groups are similar and produce valid inferences by emulating RCTs (Labrecque and Swanson 2017 ; Wang et al. 2023b ). Designing studies to include only new users of a drug and an active comparator group is one way of increasing the similarity of patients across both groups, particularly in terms of treatment history. Careful consideration must be paid to the specification of the time periods and their relationship to inclusion/exclusion criteria (Suissa and Dell’Aniello 2020 ). For instance, if a drug is used intermittently, a longer wash-out period is needed to ensure adequate capture of prior use in order to avoid bias (Riis et al. 2015 ). The study team should consider how to approach confounding adjustment, and whether both time-invariant and time-varying confounding may be present. Many potential biases exist, and many methods have been developed to address them in order to improve causal estimation from observational data. Many of these methods, such as propensity score estimation, can be enhanced by machine learning (Athey and Imbens 2019 ; Belthangady et al. 2021 ; Mai et al. 2022 ; Onasanya et al. 2024 ; Schuler and Rose 2017 ; Westreich et al. 2010 ). Machine learning has many potential applications in the causal inference discipline, and like other tools, must be used with careful planning and intentionality. To aid in the assessment of potential biases, especially time-related ones, and the development of a plan to address them, the study design should be visualized (Gatto et al. 2022 ; Schneeweiss et al. 2019 ). Additionally, we note the opportunity for collaboration across research disciplines (e.g., the application of difference-in-differences methods (Zhou et al. 2016 ) to the estimation of comparative drug effectiveness and safety).

3.5 Quality Control & sensitivity analyses (step 8)

Causal study design concludes with Step 8, which includes planning quality control and sensitivity analyses to improve the internal validity of the study. Quality control begins with reviewing study output for prima facie validity. Patient characteristics (e.g., distributions of age, sex, region) should align with expected values from the researchers’ intuition and the literature, and researchers should assess reasons for any discrepancies. Sensitivity analyses should be conducted to determine the robustness of study findings. Researchers can test the stability of study estimates using a different estimand or type of model than was used in the primary analysis. Sensitivity analysis estimates that are similar to those of the primary analysis might confirm that the primary analysis estimates are appropriate. The research team may be interested in how changes to study inclusion/exclusion criteria may affect study findings or wish to address uncertainties related to measuring the exposure or outcome in the administrative data by modifying the algorithms used to identify exposure or outcome (e.g., requiring hospitalization with a diagnosis code in a principal position rather than counting any claim with the diagnosis code in any position). As feasible, existing validation studies for the exposure and outcome should be referenced, or new validation efforts undertaken. The results of such validation studies can inform study estimates via quantitative bias analyses (Lanes and Beachler 2023 ). The study team may also consider biases arising from unmeasured confounding and plan quantitative bias analyses to explore how unmeasured confounding may impact estimates. Quantitative bias analysis can assess the directionality, magnitude, and uncertainty of errors arising from a variety of limitations (Brenner and Gefeller 1993 ; Lash et al. 2009 , 2014 ; Leahy et al. 2022 ).
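One widely used quantitative bias analysis for unmeasured confounding, the E-value of VanderWeele and Ding (the tool referenced in the footnotes above), has a simple closed form; a minimal sketch for risk-ratio estimates:

```python
import math

def e_value(rr: float) -> float:
    """Minimum strength of association (on the risk-ratio scale) that an
    unmeasured confounder would need with both exposure and outcome to
    fully explain away an observed risk ratio."""
    if rr < 1:                  # protective estimates: invert first
        rr = 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))

print(round(e_value(2.0), 2))   # 3.41
print(round(e_value(0.5), 2))   # 3.41 (symmetric)
```

A large E-value means only an implausibly strong unmeasured confounder could nullify the finding, which is the kind of robustness statement Step 8 asks for.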

3.6 Illustration using a previously published research study

To demonstrate how the guide can be used to plan a research study utilizing causal methods, we turn to a previously published study (Dondo et al. 2017) that assessed the causal relationship between the use of 𝛽-blockers and mortality after acute myocardial infarction in patients without heart failure or left ventricular systolic dysfunction. The investigators sought to answer a causal research question (Step 1), so we proceed to Step 2. Use (or no use) of 𝛽-blockers was determined at discharge, without accounting for discontinuation or later treatment changes (i.e., an intention-to-treat approach). Considering treatment for whom (Step 3), both the ATE and the ATT were evaluated. Since survival was the primary outcome, an absolute difference in survival time was chosen as the effect measure (Step 4). Although no explicit directed acyclic graph was provided (Step 5), the investigators specified a list of confounders.

Robust methodology was established by considering possible sources of bias and addressing them with viable solutions (Steps 6 and 7). Table 1 offers a list of the identified potential biases and their corresponding solutions as implemented. For example, to minimize potential biases, including prevalent-user bias and selection bias, the sample was restricted to patients with no previous use of 𝛽-blockers, no contraindication for 𝛽-blockers, and no prescription of loop diuretics. To improve balance across the comparator groups in terms of baseline confounders, i.e., those that could influence both exposure (𝛽-blocker use) and outcome (mortality), propensity score-based inverse probability of treatment weighting (IPTW) was employed. However, we noted that the baseline look-back period used to assess measured covariates was not explicitly stated in the paper.
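The IPTW step can be sketched as follows. This is our own minimal illustration on simulated data, not the authors' implementation; the covariates, coefficients, and sample size are hypothetical:

```python
# Minimal IPTW sketch on simulated data: fit a propensity model with
# plain NumPy logistic regression, then weight each subject by the
# inverse probability of the treatment actually received.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 2))                          # baseline confounders
ps_true = 1 / (1 + np.exp(-(0.5 * x[:, 0] - 0.5 * x[:, 1])))
t = rng.binomial(1, ps_true)                         # treatment (0/1)
y = 1.0 * t + x[:, 0] + rng.normal(size=n)           # outcome; true ATE = 1

def fit_propensity(X, treat, iters=5000, lr=0.1):
    """Logistic regression by gradient descent; returns fitted scores."""
    Xb = np.column_stack([np.ones(len(X)), X])       # add intercept
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - treat) / len(treat)
    return 1 / (1 + np.exp(-Xb @ w))

ps = fit_propensity(x, t)
# ATE weights: 1/ps for treated, 1/(1-ps) for controls.
# (For the ATT, treated subjects get weight 1 and controls ps/(1-ps).)
w_ate = np.where(t == 1, 1 / ps, 1 / (1 - ps))
ate = (np.average(y[t == 1], weights=w_ate[t == 1])
       - np.average(y[t == 0], weights=w_ate[t == 0]))
print(round(ate, 2))  # close to the true effect of 1
```

In practice one would also check covariate balance in the weighted sample and inspect the weight distribution for extreme values before trusting the estimate.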

Quality control and sensitivity analyses (Step 8) are described extensively. The overlap of propensity score distributions between comparator groups was tested, and confounder balance was assessed. Since observations in the tails of the propensity score distribution may violate the positivity assumption (Crump et al. 2009), a sensitivity analysis was conducted that restricted the sample to propensity scores between 0.1 and 0.9. Although not mentioned by the authors, the tails of the PS distribution can also be influenced by unmeasured confounders (Sturmer et al. 2021); reassuringly, the findings were robust with and without trimming. An assessment of extreme IPTW weights, while not included, would further increase confidence in the robustness of the analysis. An instrumental variable approach was employed to assess potential bias due to unmeasured confounding, using hospital rates of guideline-indicated prescribing as the instrument. Additionally, potential bias caused by missing data was attenuated through multiple imputation, and separate models were built for complete cases only and for imputed plus complete cases.
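The trimming sensitivity analysis described above can be sketched with a small helper (hypothetical, not from the paper) that restricts the cohort to propensity scores in the [0.1, 0.9] range before re-estimation:

```python
# Sensitivity-analysis sketch: keep only subjects whose propensity
# scores fall inside [0.1, 0.9], since extreme scores can signal
# positivity violations (Crump et al. 2009).
import numpy as np

def trim_by_propensity(ps, lower=0.1, upper=0.9):
    """Return a boolean mask selecting subjects with lower <= PS <= upper."""
    ps = np.asarray(ps)
    return (ps >= lower) & (ps <= upper)

ps = np.array([0.02, 0.15, 0.50, 0.88, 0.97])
mask = trim_by_propensity(ps)
print(ps[mask])  # keeps 0.15, 0.50, 0.88
```

The primary analysis would then be re-run on the trimmed cohort and the two estimates compared.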

4 Discussion

We have described a conceptual schema for designing observational real-world studies to estimate causal effects. The application of this schema to a previously published study illuminates the methodologic structure of the study, revealing how each structural element is related to a potential bias which it is meant to address. Real-world evidence is increasingly accepted by healthcare stakeholders, including the FDA (Concato and Corrigan-Curay 2022 ; Concato and ElZarrad 2022 ), and its use for comparative effectiveness and safety assessments requires appropriate causal study design; our guide is meant to facilitate this design process and complement existing, more specific, guidance.

Existing guidance for causal inference using observational data includes components that can be clearly mapped onto the schema that we have developed. For example, in 2009 Cox et al. described common sources of bias in observational data and recommended specific strategies to mitigate these biases, corresponding to steps 6–8 of our step-by-step guide (Cox et al. 2009). In 2013, the AHRQ emphasized development of the research question, corresponding to steps 1–4 of our guide, with additional chapters on study design, comparator selection, and sensitivity analyses, and on directed acyclic graphs, which correspond to steps 7 and 5, respectively (Velentgas et al. 2013). Much of Girman et al.'s manuscript (Girman et al. 2014) corresponds with steps 1–4 of our guide, and the matters of equipoise and interpretability specifically correspond to steps 3 and 7–8. The current ENCePP guide on methodological standards in pharmacoepidemiology contains a section on formulating a meaningful research question, corresponding to step 1, and describes strategies to mitigate specific sources of bias, corresponding to steps 6–8 (European Medicines Agency 2023). Recent works by the FDA Sentinel Innovation Center (Desai et al. 2024) and the Joint Initiative for Causal Inference (Dang et al. 2023) provide more advanced exposition of many of the steps in our guide. The target trial framework contains guidance on developing seven components of the study protocol: eligibility criteria, treatment strategies, assignment procedures, follow-up period, outcome, causal contrast of interest, and analysis plan (Hernán and Robins 2016). Our work places the target trial framework into a larger context, illustrating its relationship with other important study planning considerations, including the creation of a directed acyclic graph and the incorporation of prespecified sensitivity and quantitative bias analyses.

Ultimately, the feasibility of estimating causal effects relies on the capabilities of the available data. Real-world data sources are complex, and the investigator must carefully consider whether the data on hand are sufficient to answer the research question. For example, a study that relies solely on claims data for outcome ascertainment may suffer from outcome misclassification bias (Lanes and Beachler 2023 ). This bias can be addressed through medical record validation for a random subset of patients, followed by quantitative bias analysis (Lanes and Beachler 2023 ). If instead, the investigator wishes to apply a previously published, claims-based algorithm validated in a different database, they must carefully consider the transportability of that algorithm to their own study population. In this way, causal inference from real-world data requires the ability to think creatively and resourcefully about how various data sources and elements can be leveraged, with consideration for the strengths and limitations of each source. The heart of causal inference is in the pairing of humility and creativity: the humility to acknowledge what the data cannot do, and the creativity to address those limitations as best as one can at the time.

4.1 Limitations

As with any attempt to synthesize a broad array of information into a single, simplified schema, there are several limitations to our work. Space and usability constraints necessitated simplification of the complex source material and selections among many available methodologies, and information about the relative importance of each step is not currently included. Additionally, it is important to consider the context of our work. This step-by-step guide emphasizes analytic techniques (e.g., propensity scores) that are used most frequently within our own research environment and may not include less familiar study designs and analytic techniques. However, one strength of the guide is that additional designs, techniques, or concepts can easily be incorporated into the existing schema. The benefit of a schema is that new information can be added and is more readily accessed due to its association with previously sorted information (Loveless 2022). It is also important to note that causal inference was approached as a broad, overarching concept defined by the totality of the research, from start to finish, rather than as a particular analytic technique; however, we view this as a strength rather than a limitation.

Finally, the focus of this guide was on the methodologic aspects of study planning. As a result, we did not include steps for drafting or registering the study protocol in a public database or for communicating results. We strongly encourage researchers to register their study protocols and communicate their findings with transparency. A protocol template endorsed by ISPOR and ISPE for studies using real-world data to evaluate treatment effects is available (Wang et al. 2023a ). Additionally, the steps described above are intended to illustrate an order of thinking in the study planning process, and these steps are often iterative. The guide is not intended to reflect the order of study execution; specifically, quality control procedures and sensitivity analyses should also be formulated up-front at the protocol stage.

5 Conclusion

We outlined steps and described key conceptual issues of importance in designing real-world studies to answer causal questions, and created a visually appealing, user-friendly resource to help researchers clearly define and navigate these issues. We hope this guide serves to enhance the quality, and thus the impact, of real-world evidence.

Data availability

No datasets were generated or analysed during the current study.

Arlett, P., Kjaer, J., Broich, K., Cooke, E.: Real-world evidence in EU Medicines Regulation: Enabling Use and establishing value. Clin. Pharmacol. Ther. 111 (1), 21–23 (2022)

Athey, S., Imbens, G.W.: Machine learning methods that economists should know about. Annu. Rev. Econ. 11, 685–725 (2019)

Belthangady, C., Stedden, W., Norgeot, B.: Minimizing bias in massive multi-arm observational studies with BCAUS: Balancing covariates automatically using supervision. BMC Med. Res. Methodol. 21 (1), 190 (2021)

Berger, M.L., Sox, H., Willke, R.J., Brixner, D.L., Eichler, H.G., Goettsch, W., Madigan, D., Makady, A., Schneeweiss, S., Tarricone, R., Wang, S.V., Watkins, J., Mullins, C.D.: Good practices for real-world data studies of treatment and/or comparative effectiveness: Recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiol Drug Saf. 26 (9), 1033–1039 (2017)

Brenner, H., Gefeller, O.: Use of the positive predictive value to correct for disease misclassification in epidemiologic studies. Am. J. Epidemiol. 138 (11), 1007–1015 (1993)

Concato, J., Corrigan-Curay, J.: Real-world evidence - where are we now? N Engl. J. Med. 386 (18), 1680–1682 (2022)

Concato, J., ElZarrad, M.: FDA Issues Draft Guidances on Real-World Evidence, Prepares to Publish More in Future [accessed on 2022]. (2022).

Cox, E., Martin, B.C., Van Staa, T., Garbe, E., Siebert, U., Johnson, M.L.: Good research practices for comparative effectiveness research: Approaches to mitigate bias and confounding in the design of nonrandomized studies of treatment effects using secondary data sources: The International Society for Pharmacoeconomics and Outcomes Research Good Research Practices for Retrospective Database Analysis Task Force Report–Part II. Value Health. 12 (8), 1053–1061 (2009)

Crump, R.K., Hotz, V.J., Imbens, G.W., Mitnik, O.A.: Dealing with limited overlap in estimation of average treatment effects. Biometrika. 96 (1), 187–199 (2009)

Danaei, G., Rodriguez, L.A., Cantero, O.F., Logan, R., Hernan, M.A.: Observational data for comparative effectiveness research: An emulation of randomised trials of statins and primary prevention of coronary heart disease. Stat. Methods Med. Res. 22 (1), 70–96 (2013)

Dang, L.E., Gruber, S., Lee, H., Dahabreh, I.J., Stuart, E.A., Williamson, B.D., Wyss, R., Diaz, I., Ghosh, D., Kiciman, E., Alemayehu, D., Hoffman, K.L., Vossen, C.Y., Huml, R.A., Ravn, H., Kvist, K., Pratley, R., Shih, M.C., Pennello, G., Martin, D., Waddy, S.P., Barr, C.E., Akacha, M., Buse, J.B., van der Laan, M., Petersen, M.: A causal roadmap for generating high-quality real-world evidence. J. Clin. Transl Sci. 7 (1), e212 (2023)

Desai, R.J., Wang, S.V., Sreedhara, S.K., Zabotka, L., Khosrow-Khavar, F., Nelson, J.C., Shi, X., Toh, S., Wyss, R., Patorno, E., Dutcher, S., Li, J., Lee, H., Ball, R., Dal Pan, G., Segal, J.B., Suissa, S., Rothman, K.J., Greenland, S., Hernan, M.A., Heagerty, P.J., Schneeweiss, S.: Process guide for inferential studies using healthcare data from routine clinical practice to evaluate causal effects of drugs (PRINCIPLED): Considerations from the FDA Sentinel Innovation Center. BMJ. 384 , e076460 (2024)

Digitale, J.C., Martin, J.N., Glymour, M.M.: Tutorial on directed acyclic graphs. J. Clin. Epidemiol. 142 , 264–267 (2022)

Dondo, T.B., Hall, M., West, R.M., Jernberg, T., Lindahl, B., Bueno, H., Danchin, N., Deanfield, J.E., Hemingway, H., Fox, K.A.A., Timmis, A.D., Gale, C.P.: 𝛽-blockers and mortality after acute myocardial infarction in patients without heart failure or ventricular dysfunction. J. Am. Coll. Cardiol. 69 (22), 2710–2720 (2017)

European Medicines Agency: ENCePP Guide on Methodological Standards in Pharmacoepidemiology [accessed on 2023]. (2023).

Ferguson, K.D., McCann, M., Katikireddi, S.V., Thomson, H., Green, M.J., Smith, D.J., Lewsey, J.D.: Evidence synthesis for constructing directed acyclic graphs (ESC-DAGs): A novel and systematic method for building directed acyclic graphs. Int. J. Epidemiol. 49 (1), 322–329 (2020)

Flanagin, A., Lewis, R.J., Muth, C.C., Curfman, G.: What does the proposed causal inference Framework for Observational studies Mean for JAMA and the JAMA Network Journals? JAMA (2024)

U.S. Food and Drug Administration: Framework for FDA’s Real-World Evidence Program [accessed on 2018]. (2018).

Franklin, J.M., Schneeweiss, S.: When and how can Real World Data analyses substitute for randomized controlled trials? Clin. Pharmacol. Ther. 102 (6), 924–933 (2017)

Gatto, N.M., Wang, S.V., Murk, W., Mattox, P., Brookhart, M.A., Bate, A., Schneeweiss, S., Rassen, J.A.: Visualizations throughout pharmacoepidemiology study planning, implementation, and reporting. Pharmacoepidemiol Drug Saf. 31 (11), 1140–1152 (2022)

Girman, C.J., Faries, D., Ryan, P., Rotelli, M., Belger, M., Binkowitz, B., O'Neill, R., Drug Information Association CER Scientific Working Group: Pre-study feasibility and identifying sensitivity analyses for protocol pre-specification in comparative effectiveness research. J. Comp. Eff. Res. 3 (3), 259–270 (2014)

Griffith, G.J., Morris, T.T., Tudball, M.J., Herbert, A., Mancano, G., Pike, L., Sharp, G.C., Sterne, J., Palmer, T.M., Davey Smith, G., Tilling, K., Zuccolo, L., Davies, N.M., Hemani, G.: Collider bias undermines our understanding of COVID-19 disease risk and severity. Nat. Commun. 11 (1), 5749 (2020)

Hernán, M.A.: The C-Word: Scientific euphemisms do not improve causal inference from Observational Data. Am. J. Public Health. 108 (5), 616–619 (2018)

Hernán, M.A., Robins, J.M.: Using Big Data to emulate a target Trial when a Randomized Trial is not available. Am. J. Epidemiol. 183 (8), 758–764 (2016)

Hernán, M., Robins, J.: Causal Inference: What if. Chapman & Hall/CRC, Boca Raton (2020)

International Society for Pharmacoeconomics and Outcomes Research (ISPOR): Strategic Initiatives: Real-World Evidence [accessed on 2022]. (2022).

International Society for Pharmacoepidemiology (ISPE): Position on Real-World Evidence [accessed on 2020]. (2020).

Labrecque, J.A., Swanson, S.A.: Target trial emulation: Teaching epidemiology and beyond. Eur. J. Epidemiol. 32 (6), 473–475 (2017)

Lanes, S., Beachler, D.C.: Validation to correct for outcome misclassification bias. Pharmacoepidemiol Drug Saf. (2023)

Lash, T.L., Fox, M.P., Fink, A.K.: Applying Quantitative bias Analysis to Epidemiologic data. Springer (2009)

Lash, T.L., Fox, M.P., MacLehose, R.F., Maldonado, G., McCandless, L.C., Greenland, S.: Good practices for quantitative bias analysis. Int. J. Epidemiol. 43 (6), 1969–1985 (2014)

Leahy, T.P., Kent, S., Sammon, C., Groenwold, R.H., Grieve, R., Ramagopalan, S., Gomes, M.: Unmeasured confounding in nonrandomized studies: Quantitative bias analysis in health technology assessment. J. Comp. Eff. Res. 11 (12), 851–859 (2022)

Loveless, B.: A Complete Guide to Schema Theory and its Role in Education [accessed on 2022]. (2022).

Lund, J.L., Richardson, D.B., Sturmer, T.: The active comparator, new user study design in pharmacoepidemiology: Historical foundations and contemporary application. Curr. Epidemiol. Rep. 2 (4), 221–228 (2015)

Mai, X., Teng, C., Gao, Y., Governor, S., He, X., Kalloo, G., Hoffman, S., Mbiydzenyuy, D., Beachler, D.: A pragmatic comparison of logistic regression versus machine learning methods for propensity score estimation. Pharmacoepidemiol Drug Saf. 31 (S2) (2022). Abstract, 38th International Conference on Pharmacoepidemiology, August 26–28, 2022, Copenhagen, Denmark

Mullard, A.: 2021 FDA approvals. Nat. Rev. Drug Discov. 21 (2), 83–88 (2022)

Onasanya, O., Hoffman, S., Harris, K., Dixon, R., Grabner, M.: Current applications of machine learning for causal inference in healthcare research using observational data. International Society for Pharmacoeconomics and Outcomes Research (ISPOR) Atlanta, GA. (2024)

Pearl, J.: Causal diagrams for empirical research. Biometrika. 82 (4), 669–688 (1995)

Prada-Ramallal, G., Takkouche, B., Figueiras, A.: Bias in pharmacoepidemiologic studies using secondary health care databases: A scoping review. BMC Med. Res. Methodol. 19 (1), 53 (2019)

Richardson, T.S., Robins, J.M.: Single World Intervention Graphs: A Primer [accessed on 2013]. (2013).

Richiardi, L., Bellocco, R., Zugna, D.: Mediation analysis in epidemiology: Methods, interpretation and bias. Int. J. Epidemiol. 42 (5), 1511–1519 (2013)

Riis, A.H., Johansen, M.B., Jacobsen, J.B., Brookhart, M.A., Sturmer, T., Stovring, H.: Short look-back periods in pharmacoepidemiologic studies of new users of antibiotics and asthma medications introduce severe misclassification. Pharmacoepidemiol Drug Saf. 24 (5), 478–485 (2015)

Rodrigues, D., Kreif, N., Lawrence-Jones, A., Barahona, M., Mayer, E.: Reflection on modern methods: Constructing directed acyclic graphs (DAGs) with domain experts for health services research. Int. J. Epidemiol. 51 (4), 1339–1348 (2022)

Rothman, K.J., Greenland, S., Lash, T.L.: Modern Epidemiology. Wolters Kluwer Health/Lippincott Williams & Wilkins, Philadelphia (2008)

Rubin, D.B.: Causal inference using potential outcomes. J. Am. Stat. Assoc. 100 (469), 322–331 (2005)

Sauer, B., VanderWeele, T.J.: Use of directed acyclic graphs. In: Velentgas, P., Dreyer, N., Nourjah, P. (eds.) Developing a Protocol for Observational Comparative Effectiveness Research: A User's Guide. Agency for Healthcare Research and Quality (US) (2013)

Schneeweiss, S., Rassen, J.A., Brown, J.S., Rothman, K.J., Happe, L., Arlett, P., Dal Pan, G., Goettsch, W., Murk, W., Wang, S.V.: Graphical depiction of longitudinal study designs in Health Care databases. Ann. Intern. Med. 170 (6), 398–406 (2019)

Schuler, M.S., Rose, S.: Targeted maximum likelihood estimation for causal inference in Observational studies. Am. J. Epidemiol. 185 (1), 65–73 (2017)

Stuart, E.A., DuGoff, E., Abrams, M., Salkever, D., Steinwachs, D.: Estimating causal effects in observational studies using Electronic Health data: Challenges and (some) solutions. EGEMS (Wash DC) 1 (3). (2013)

Sturmer, T., Webster-Clark, M., Lund, J.L., Wyss, R., Ellis, A.R., Lunt, M., Rothman, K.J., Glynn, R.J.: Propensity score weighting and trimming strategies for reducing Variance and Bias of Treatment Effect estimates: A Simulation Study. Am. J. Epidemiol. 190 (8), 1659–1670 (2021)

Suissa, S., Dell’Aniello, S.: Time-related biases in pharmacoepidemiology. Pharmacoepidemiol Drug Saf. 29 (9), 1101–1110 (2020)

Tripepi, G., Jager, K.J., Dekker, F.W., Wanner, C., Zoccali, C.: Measures of effect: Relative risks, odds ratios, risk difference, and ‘number needed to treat’. Kidney Int. 72 (7), 789–791 (2007)

Velentgas, P., Dreyer, N., Nourjah, P., Smith, S., Torchia, M.: Developing a Protocol for Observational Comparative Effectiveness Research: A User’s Guide. Agency for Healthcare Research and Quality (AHRQ) Publication 12(13). (2013)

Wang, A., Nianogo, R.A., Arah, O.A.: G-computation of average treatment effects on the treated and the untreated. BMC Med. Res. Methodol. 17 (1), 3 (2017)

Wang, S.V., Pottegard, A., Crown, W., Arlett, P., Ashcroft, D.M., Benchimol, E.I., Berger, M.L., Crane, G., Goettsch, W., Hua, W., Kabadi, S., Kern, D.M., Kurz, X., Langan, S., Nonaka, T., Orsini, L., Perez-Gutthann, S., Pinheiro, S., Pratt, N., Schneeweiss, S., Toussi, M., Williams, R.J.: HARmonized Protocol Template to enhance reproducibility of hypothesis evaluating real-world evidence studies on treatment effects: A good practices report of a joint ISPE/ISPOR task force. Pharmacoepidemiol Drug Saf. 32 (1), 44–55 (2023a)

Wang, S.V., Schneeweiss, S., RCT-DUPLICATE Initiative, Franklin, J.M., Desai, R.J., Feldman, W., Garry, E.M., Glynn, R.J., Lin, K.J., Paik, J., Patorno, E., Suissa, S., D'Andrea, E., Jawaid, D., Lee, H., Pawar, A., Sreedhara, S.K., Tesfaye, H., Bessette, L.G., Zabotka, L., Lee, S.B., Gautam, N., York, C., Zakoul, H., Concato, J., Martin, D., Paraoan, D., Quinto, K.: Emulation of randomized clinical trials with nonrandomized database analyses: Results of 32 clinical trials. JAMA. 329 (16), 1376–1385 (2023b)

Westreich, D., Lessler, J., Funk, M.J.: Propensity score estimation: Neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. J. Clin. Epidemiol. 63 (8), 826–833 (2010)

Yang, S., Eaton, C.B., Lu, J., Lapane, K.L.: Application of marginal structural models in pharmacoepidemiologic studies: A systematic review. Pharmacoepidemiol Drug Saf. 23 (6), 560–571 (2014)

Zhou, H., Taber, C., Arcona, S., Li, Y.: Difference-in-differences method in comparative Effectiveness Research: Utility with unbalanced groups. Appl. Health Econ. Health Policy. 14 (4), 419–429 (2016)

The authors received no financial support for this research.

Author information

Authors and Affiliations

Carelon Research, Wilmington, DE, USA

Sarah Ruth Hoffman, Nilesh Gangan, Joseph L. Smith, Arlene Tave, Yiling Yang, Christopher L. Crowe & Michael Grabner

Elevance Health, Indianapolis, IN, USA

Xiaoxue Chen

University of Maryland School of Pharmacy, Baltimore, MD, USA

Susan dosReis


SH, NG, JS, AT, CC, MG are employees of Carelon Research, a wholly owned subsidiary of Elevance Health, which conducts health outcomes research with both internal and external funding, including a variety of private and public entities. XC was an employee of Elevance Health at the time of study conduct. YY was an employee of Carelon Research at the time of study conduct. SH, MG, and JLS are shareholders of Elevance Health. SdR receives funding from GlaxoSmithKline for a project unrelated to the content of this manuscript and conducts research that is funded by state and federal agencies.

Corresponding author

Correspondence to Sarah Ruth Hoffman .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit .

About this article

Hoffman, S.R., Gangan, N., Chen, X. et al. A step-by-step guide to causal study design using real-world data. Health Serv Outcomes Res Method (2024).

Received : 07 December 2023

Revised : 31 May 2024

Accepted : 10 June 2024

Published : 19 June 2024


  • Causal inference
  • Real-world data
  • Confounding
  • Non-randomized data
  • Bias in pharmacoepidemiology
NEET Re-Exam 2024 (June 23): Check Exam Analysis, Question Paper and Answer Key

NEET Re-Test 2024: Check NEET exam analysis for the re-test conducted for 1563 candidates at seven exam centres today. Know the questions' difficulty level as reviewed by students and experts.

Gurmeet Kaur

NEET Re-Exam 2024 Analysis: Today, June 23, 2024, the National Testing Agency (NTA) conducted a re-examination for 1,563 candidates who had previously been awarded grace marks in the NEET UG exam. However, many candidates skipped the exam, significantly reducing the overall number of candidates who retook the test. The exam was held from 2:00 PM to 5:20 PM. This article provides an in-depth analysis of the re-exam, its implications, and what candidates can expect moving forward, drawing on experts' reviews and feedback from a handful of candidates who took the NEET Re-Exam 2024.

NEET Re-Exam 2024 Key Highlights

NEET Re-Exam 2024: Detailed Analysis

The National Testing Agency (NTA) had announced that candidates who received grace marks in the NEET UG 2024 exam must either sit for a re-examination or accept their original scores without the grace adjustment. Notifications regarding this decision were sent to the affected candidates via their registered email addresses. Consequently, fresh admit cards have been issued to 1,563 candidates for the re-exam.

NEET UG 2024 Exam Pattern

Exam Day Observations

NEET UG Fiasco 2024: Why NTA Had to Conduct Re-Exam

On June 13, the NTA released an official notification stating that a re-examination would be conducted for the impacted candidates. The need for this re-examination arose after 1,563 candidates were awarded grace marks to compensate for time lost during the NEET exam held on May 5. These candidates were from seven examination centres: Balod and Dantewada in Chhattisgarh, and one centre each in Bahadurgarh, Chandigarh, Haryana, Meghalaya and Surat. The re-examination will be conducted at different centres this time.

NEET UG 2024 Implications and Considerations

The decision to conduct a re-examination underscores the NTA's commitment to ensuring fairness and accuracy in the examination process. The affected candidates now face a crucial decision: to either reappear for the exam to improve their scores or accept the original marks, which may impact their prospects for admission into medical colleges.

The re-examination process also places additional logistical demands on both the candidates and the examination authority. Ensuring that the re-exam is conducted smoothly and that the results are declared on time is vital for maintaining the integrity of the NEET UG examination process.

Computer Science > Computation and Language

Title: A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

Abstract: Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the content much more efficient. However, building such tools requires data that reflect the difficulty of the task arising from complex reasoning about claims made in multiple parts of a paper. In contrast, existing information-seeking question answering datasets usually contain questions about generic factoid-type information. We therefore present QASPER, a dataset of 5,049 questions over 1,585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence to answers. We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers, motivating further research in document-grounded, information-seeking QA, which our dataset is designed to facilitate.
Comments: Accepted at NAACL 2021; Project page:
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2105.03011 [cs.CL]
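
The abstract above reports that existing models underperform humans by at least 27 F1 points. The F1 in question is the standard SQuAD-style token-overlap score between a predicted and a gold answer. A minimal sketch of the usual formulation (not the exact QASPER evaluation script, which also handles multiple references and normalization details) looks like this:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-level F1 between a predicted and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts shared tokens with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# "quick" and "fox" overlap: precision 2/4, recall 2/2 -> F1 = 2/3
print(token_f1("the quick brown fox", "quick fox"))
```

Real evaluation scripts additionally lowercase, strip punctuation and articles, and take the maximum F1 over several gold references; the core overlap computation is as above.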



Q: How do I respond to being served with custody papers from someone who is not on my daughter's birth certificate?

My daughter's possible father was in prison for the first year of her life. Since he got out, it has been nothing but a fight trying to get him to honor any kind of agreement we had made. Given his past, I felt he needed to show that he wasn't going to go back to his old ways. After finding out about the most recent incident, in which my daughter was put in a very unsafe situation, I told him our arrangement needed to be modified and that he wasn't going to have overnights for a while, since he thought doing drugs and alcohol and being violent around her was okay. He refused, and I was recently served with custody papers. I'm not sure how to respond, as he isn't even on the birth certificate and there is a good possibility that he isn't even the father.

John Michael Frick

  • Divorce Lawyer
  • (214) 618-1400
  • Email Lawyer
  • View Website

A: Ask for a paternity test to determine if he is indeed the father.






  1. A literature review on question answering techniques, paradigms and

    With the efforts from academic research, Question Answering is a growing research field worldwide (Voorhees and Tice, 2000, Wang et al., 2000). The demand for this kind of system increases day by day since it delivers short, precise and question-specific answers (Pudaruth et al., 2016). Nevertheless, a systematic approach for understanding the ...

  2. (PDF) A Review of Question Answering Systems

Question Answering systems generally follow a pipeline structure with three major modules (Prager, Brown, Coden and Radev 2000; Hovy et al., 2000; Clarke, Cormack and Lynam 2001; Iftene 2009 ...

  3. Question Answering

    Question Answering. 3065 papers with code • 131 benchmarks • 366 datasets. Question Answering is the task of answering questions (typically reading comprehension questions), but abstaining when presented with a question that cannot be answered based on the provided context. Question answering can be segmented into domain-specific tasks like ...

  4. Deep learning-based question answering: a survey

Question Answering is a crucial natural language processing task. This field of research has recently attracted a surge of interest, due mainly to the integration of deep learning models into Question Answering systems, which has consequently powered many advancements and improvements. This survey aims to explore and shed light upon the recent and most powerful deep learning-based ...

  5. Deep Learning Approaches for Question Answering System

In this paper, an analysis of all the major deep learning algorithms for question answering has been done. 1.1. Background A question answering system, usually implemented as a chatbot, can construct an answer when presented with a query. One of the earliest and most successful implementations of a chatbot was ALICE Bot, developed using AIML.

  6. Automatic question-answer pairs generation and question ...

    This paper blends the essence of Question Generation, Question Comprehension, and Question Answering to overcome the Question Answering System's limitations. ... As the research evolved, the Question Answering System started gaining higher credibility due to data outbursts. The Natural Language Processing (NLP) systems were introduced to ...

  7. (PDF) Question Answering Systems: A Systematic ...

QAS literature divides the working mechanism of question answering systems into three broad modules, namely, question processing, document processing, and answer processing. As noted earlier ...

  8. Conversational question answering: a survey

    Question answering (QA) systems provide a way of querying the information available in various formats including, but not limited to, unstructured and structured data in natural languages. It constitutes a considerable part of conversational artificial intelligence (AI) which has led to the introduction of a special research topic on conversational question answering (CQA), wherein a system is ...

  9. Modern Question Answering Datasets and Benchmarks: A Survey

    In this paper, we investigate influential QA datasets that have been released in the era of deep learning. Specifically, we begin with introducing two of the most common QA tasks - textual question answer and visual question answering - separately, covering the most representative datasets, and then give some current challenges of QA research.

  10. PDF Question Answering Using Deep Learning

    Question answering (QA) is a well-researched problem in NLP. In spite of being one of the oldest research areas, QA has application in a wide variety of tasks, such as information retrieval and entity extraction. Recently, QA has also been used to develop dialog systems [1] and chatbots [2] designed ...

  11. A Survey of Extractive Question Answering

Extractive question answering is one of the most important tasks in natural language processing (NLP) and has high research value. In order to sort out its development process, in this paper it was divided roughly into two categories: single-span extractive question answering and multi-span extractive question answering, which are distinguished according to the number of spans ...

  12. (PDF) Visual Question Answering: A Survey on Techniques and Common

Visual Question Answering (VQA) is an emerging area of interest for researchers, being a recent problem in natural language processing and image prediction. In this area, an algorithm needs to ...

  13. Research and Reviews in Question Answering System

    The processing of a QA system may broadly have three stages, i.e., question analysis: parsing, question classification and query reformulation; document analysis: extract candidate documents, identify answers; and answer analysis: extract candidate answers and rank the best one. Question answering task combines techniques from artificial ...

  14. [2305.11033] Visual Question Answering: A Survey on Techniques and

Visual Question Answering (VQA) is an emerging area of interest for researchers, being a recent problem in natural language processing and image prediction. In this area, an algorithm needs to answer questions about certain images. As of the writing of this survey, 25 recent studies were analyzed. Besides, 6 datasets were analyzed and links to download them provided. In this work, several ...

  15. Automatic question generation and answer assessment: a survey

Learning through the internet has become popular, facilitating learners to learn anything, anytime, anywhere from web resources. Assessment is most important in any learning system: an assessment system can find the self-learning gaps of learners and improve the progress of learning. Manual question generation takes much time and labor. Therefore, automatic question generation from ...

  16. Question Answering Research Papers

    This paper considers a surface-based question answering system for an open- domain solution. It analyzes the current progress that has been done in this area so far, while as well describes a methodology of answering questions by using information retrieved from very large collection of text.

  17. 10 Research Question Examples to Guide your Research Project

    The first question asks for a ready-made solution, and is not focused or researchable. The second question is a clearer comparative question, but note that it may not be practically feasible. For a smaller research project or thesis, it could be narrowed down further to focus on the effectiveness of drunk driving laws in just one or two countries.

  18. Writing Strong Research Questions

    A good research question is essential to guide your research paper, dissertation, or thesis. All research questions should be: Focused on a single problem or issue. Researchable using primary and/or secondary sources. Feasible to answer within the timeframe and practical constraints. Specific enough to answer thoroughly.

  19. Hurdles to Progress in Long-form Question Answering

    The task of long-form question answering (LFQA) involves retrieving documents relevant to a given question and using them to generate a paragraph-length answer. While many models have recently been proposed for LFQA, we show in this paper that the task formulation raises fundamental challenges regarding evaluation and dataset creation that currently preclude meaningful modeling progress. To ...

  20. This Simple Logic Question Stumps Even the Most Advanced AI

    A fascinating new paper from scientists at the AI research nonprofit LAION finds that even the most sophisticated large language ... answering the AIW question correctly nearly 65 percent of the ...

  21. (PDF) A literature survey on question answering system in Natural

Research paper: A literature survey on question answering system in natural language processing. A. Clementeena, Dr. P. Sripriya, School of Computing Sciences, VISTAS ...

  22. UGC NET 2024 Question Papers 1 and 2, Subject Wise PDF 18th ...

    The UGC NET Examination was conducted on June 18, 2024, in two shifts. The exam consists of two papers: Paper 1, which assesses general teaching and research aptitude in areas such as reasoning ability, comprehension, and general awareness, and Paper 2, which is subject-specific, evaluating the candidate's expertise in their chosen field.

  23. NEET Re-Exam 2024 Question Paper & Answer Key: June 23, 2024 Latest Updates

Today, June 23, 2024, the National Testing Agency (NTA) is conducting a re-exam for 1,563 candidates who were previously given grace marks in the NEET UG exam. This exam is crucial for students ...

  24. A step-by-step guide to causal study design using real-world data

    Due to the need for generalizable and rapidly delivered evidence to inform healthcare decision-making, real-world data have grown increasingly important to answer causal questions. However, causal inference using observational data poses numerous challenges, and relevant methodological literature is vast. We endeavored to identify underlying unifying themes of causal inference using real-world ...

  25. NEET Re-Exam 2024 (June 23): Check Exam Analysis, Question Paper and

    NEET UG 2024 (May 5) Question Paper. NEET UG 2024 (May 5) Answer Key. Get here latest School, CBSE and Govt Jobs notification in English and Hindi for Sarkari Naukari and Sarkari Result.

  26. [2105.03011] A Dataset of Information-Seeking Questions and Answers

    A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers. Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, Matt Gardner. Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the ...

  27. NEET Re-Exam June 23 Paper Analysis 2024

NEET Re-exam Analysis 2024 will be provided here based on the initial responses of test-takers. The NEET Question Paper consists of a total of 200 questions, out of which 180 are to be attempted. Each question carries 4 marks, with negative marking of 1.

  28. Research Paper on Question Answering System using BERT

    Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the content much ...

  29. How do i respond to being served with custody papers from someone who

    Read 1 Answer from lawyers to How do i respond to being served with custody papers from someone who is not on my daughters birth certificate? - Colorado Family Law Questions & Answers - Justia Ask a Lawyer

  30. NEET Re-Exam 2024 Question Paper with Answer Key PDF

NEET Re-Exam 2024 question paper for June 23 will be available for download here. NTA is going to reconduct the NEET exam 2024 on June 23 for 1,563 candidates. The NEET question paper has 200 MCQs, of which 180 are to be attempted in 3 hours 20 minutes. The NEET 2024 question paper PDF is divided into 4 sections: Zoology, Botany, Chemistry, and Physics.
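
Several of the survey snippets above describe the same three-stage QA pipeline: question processing, document processing, and answer processing. As a toy illustration of how those stages fit together, the following sketch uses simple bag-of-words overlap as a stand-in for real retrieval and answer-extraction models; all function names, the stopword list, and the scoring are illustrative assumptions, not any particular system's implementation.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "is", "what", "who", "when", "where"}

def content_words(text: str) -> Counter:
    # Question processing: normalize and keep content words as a bag of words.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def rank_documents(query: Counter, docs: list[str], k: int = 2) -> list[str]:
    # Document processing: rank candidate documents by term overlap with the query.
    def score(doc: str) -> int:
        return sum((query & content_words(doc)).values())
    return sorted(docs, key=score, reverse=True)[:k]

def extract_answer(query: Counter, candidates: list[str]) -> str:
    # Answer processing: pick the sentence sharing the most terms with the query.
    sentences = [s for d in candidates
                 for s in re.split(r"(?<=[.!?])\s+", d) if s]
    return max(sentences,
               key=lambda s: sum((query & content_words(s)).values()))

docs = [
    "SQuAD is a reading comprehension dataset. It was released in 2016.",
    "TriviaQA contains trivia questions. Answers come from the web.",
]
q = content_words("When was SQuAD released?")
answer = extract_answer(q, rank_documents(q, docs))
print(answer)  # -> "It was released in 2016."
```

Modern neural systems replace each stage (dense retrievers for document processing, span-prediction or generative readers for answer processing), but the modular decomposition the surveys describe is the same.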