
Open access | Published: 12 April 2021

COVID-19 information retrieval with deep-learning based semantic search, question answering, and abstractive summarization

Andre Esteva, Anuprit Kale, Romain Paulus, Kazuma Hashimoto, Wenpeng Yin, Dragomir Radev & Richard Socher

npj Digital Medicine volume 4, Article number: 68 (2021)


Subjects: Health care, Medical research

Abstract

The COVID-19 global pandemic has resulted in international efforts to understand, track, and mitigate the disease, yielding a significant corpus of COVID-19 and SARS-CoV-2-related publications across scientific disciplines. Throughout 2020, over 400,000 coronavirus-related publications have been collected through the COVID-19 Open Research Dataset. Here, we present CO-Search, a semantic, multi-stage search engine designed to handle complex queries over the COVID-19 literature, potentially aiding overburdened health workers in finding scientific answers and avoiding misinformation during a time of crisis. CO-Search is built from two sequential parts: a hybrid semantic-keyword retriever, which takes an input query and returns a sorted list of the 1000 most relevant documents, and a re-ranker, which further orders them by relevance. The retriever is composed of a deep learning model (Siamese-BERT) that encodes query-level meaning, along with two keyword-based models (BM25, TF-IDF) that emphasize the most important words of a query. The re-ranker assigns a relevance score to each document, computed from the outputs of (1) a question–answering module, which gauges how much each document answers the query, and (2) an abstractive summarization module, which determines how well a query matches a generated summary of the document. To account for the relatively limited dataset, we develop a text augmentation technique which splits the documents into pairs of paragraphs and the citations contained in them, creating millions of (citation title, paragraph) tuples for training the retriever. We evaluate our system (http://einstein.ai/covid) on the data of the TREC-COVID information retrieval challenge, obtaining strong performance across multiple key information retrieval metrics.


Introduction

The evolution of the SARS-CoV-2 virus, with its unique balance of virulence and contagiousness, has resulted in the COVID-19 pandemic. Since December 2019, the disease has threatened to spread exponentially across society, catalyzed by modern air and road transportation systems and by dense urban centers where close contact among people yields hubs of viral spread.

Global efforts have arisen in an attempt to quell the spread of the virus. National governments have shut down entire economic sectors, enforcing stay-at-home orders for many people. Hospitals have restructured themselves to cope with an unprecedented influx of intensive care unit patients, sometimes growing organically to increase their number of beds 1. Institutions have adjusted their practices to support efforts—repurposing assembly lines to build mechanical ventilators 2, delaying delivery of non-COVID-related shipments 3, and creating contact-tracing mobile apps 4 and "digital swabs" 5 to track symptoms and potential spread. Pharmaceutical enterprises and academic institutions have invested significantly in developing vaccines and therapeutics 6, while deeply studying both COVID-19 and SARS-CoV-2.

The health impacts of this crisis have been matched only by the economic backlash to society. Hundreds of thousands of small businesses have shut down, entire industrial sectors have been negatively impacted 7, and tens of millions of workers have been laid off or furloughed 8. Even after our global society succeeds in controlling the virus's spread, we will face many challenges, including re-opening our societies, lifting stay-at-home orders, deploying better testing, developing vaccines and therapeutics, and aiding the unemployed and shuttered businesses.

The global response to COVID-19 has yielded a growing corpus of scientific publications—increasing at a rate of thousands per week—about COVID-19, SARS-CoV-2, other coronaviruses, and related topics 9 . The individuals on the front lines of the fight—healthcare practitioners, policy makers, medical researchers, etc.—will require specialized tools to keep up with the literature.

CO-Search is a cascaded retriever-ranker semantic search engine that takes complex search queries (e.g. natural language questions), and retrieves scientific articles strictly over the coronavirus-related literature. CO-Search indexes content from over 400,000 scientific papers made available through the COVID-19 Open Research Dataset Challenge (CORD-19) 9 —an initiative put forth by the US White House and other prominent institutions in early 2020. The goal of this line of work is to offer an alternative, scientific search engine, designed to limit misinformation in a time of crisis.

We evaluate CO-Search on data from the TREC-COVID challenge 10—a five-round information retrieval (IR) competition for COVID-19 search engines—using several standard IR metrics: normalized discounted cumulative gain (nDCG), precision with N documents (P@N), mean average precision (MAP), and binary preference (Bpref). For full details see the "Methods" section. TREC-COVID considers IR system submissions that are either manual—in which queries and retrieved documents may be manually adjusted by a human operator—or automatic (such as CO-Search), in which they may not. A third category, feedback, is accepted in Rounds 2–5, in which systems are trained with supervision from the annotations of prior rounds. Submissions compete on a predefined set of topics and are judged using a number of metrics, including those listed above. Expert human annotators provide relevance judgments on a small set of topic–document pairs, which are included, together with non-annotated pairs, in the evaluation.

The CORD-19 corpus 9 of coronavirus-related literature (drawn primarily from PubMed, and mostly published in 2020) has quickly spurred a number of data science and computing works 11. These cover topics from IR to natural language processing (NLP), including applications in question answering 12, text summarization, and document search 10.

In 2020, more than 20 organizations launched publicly accessible search engines using the CORD-19 corpus. For instance, Neural Covidex 13 was constructed from various open-source information-retrieval building blocks, as well as a deep learning transformer 14 finetuned on a machine-reading comprehension dataset (MS MARCO) 15 to predict query–document relevance for ranking. SLEDGE 16 extends this by using SciBERT 17—the scientific-text-trained version of the prominent BERT 18 NLP model—also finetuned on MS MARCO, to re-rank articles retrieved with BM25.

One of the first question–answering systems built on top of the CORD-19 corpus is CovidQA (http://covidqa.ai), which includes a small number of questions from the CORD-19 tasks 12. CAiRE is a multi-document summarization system 19 that works by first pre-training on both a general text corpus 20, 21 and a biomedical review dataset, then finetuning on the CORD-19 dataset.

One application of the corpus has been named entity recognition (NER). Wang et al. 22 introduce the COVID-NER corpus, which includes 75 fine-grained entity types, both conventional (e.g., genes, diseases, and chemicals) and corpus-specific (e.g., viral proteins, coronaviruses, substrates, and immune responses). Ahamed and Samad perform a network analysis of the corpus 23, in which they use word associations to identify the phrases that co-occur with the most medically relevant keywords. This allows them to identify information about different antiviral drugs, pathogens, and pathogen hosts, as well as proteins and medical therapies, and to examine how these connect to the central topic of "coronavirus".

Broader surveys 11 of the COVID-19-related literature have already arisen, covering a wider range of research perspectives including molecular, clinical, and societal factors. Roberts et al. (2020) 10 offer an in-depth analysis of the TREC-COVID competition structure, including the notable ways in which IR systems for pandemics deviate from typical IR systems. They address key questions around COVID-19-specific IR systems, including: How do topics differ from typical web-based search? What is the appropriate search content? How can systems be deployed quickly? What are the appropriate IR modalities? How can IR systems be customized for pandemics? Can existing data be leveraged? How should systems respond to the rapidly growing literature corpus, and how should they be evaluated? COVID search engines differ from more general neural IR engines 24, 25 because the collection of documents is relatively limited, focused, and rapidly changing. Another recent system paper from the challenge is ref. 26, in which the authors describe an ensemble system that combines more than 100 IR methods, including lexical rankers, embeddings, and relevance feedback. Our proposed method builds on these insights by selectively choosing three deep-learning methods and showing how each enhances COVID-specific scientific search.

Results

To quantitatively evaluate the effectiveness of our search engine, we combine the CORD-19 corpus with the TREC-COVID competition's evaluation dataset. The evaluation dataset consists of topics, along with relevance judgments that assign each topic–document pair to one of three categories: irrelevant, partially relevant, or relevant. See Table 1 for example topics. The relevance judgments are determined by human experts in related fields (biology, medicine, etc.).

The U.S. White House, along with the U.S. National Institutes of Health, the Allen Institute for AI, the Chan-Zuckerberg Initiative, Microsoft Research, and Georgetown University recently prepared the CORD-19 Challenge in response to the global crisis. As of February 2021, this resource consists of over 400,000 scientific publications (up from 29,000 at the challenge inception in February 2020) about COVID-19, SARS-CoV-2, and earlier coronaviruses 9 .

This challenge represents a call to action to the artificial intelligence (AI) and IR communities to "develop text and data mining tools that can help the medical community develop answers to high priority scientific questions". It is currently the most extensive coronavirus literature corpus publicly available.

To build on CORD-19, the Text Retrieval Conference (TREC) recently partnered with the National Institute of Standards and Technology (NIST), to define a structured and quantitative evaluation system for coronavirus IR systems. The TREC-COVID challenge 10 is composed of five successive rounds of evaluation on 30–50 topics. The first round includes 30 topics. Each subsequent round takes the prior round’s topics and adds five new ones.

Each topic is represented as a tuple consisting of a query, a question, and a narrative, with an increasing amount of detail in each. IR systems must retrieve up to 1000 ranked documents per topic from the CORD-19 publications, and are evaluated on many metrics. See the “Methods” section for further details.

System architecture

CO-Search consists of a retriever, which returns a sorted subset of documents from the general corpus, a re-ranker, which further sorts them, and an offline pre-processing step known as document indexing, which parses documents via a combination of deep learning and keyword-based techniques to make them both semantically and syntactically searchable at scale. This process converts pieces of raw text into high-dimensional vector representations, such that one vector’s proximity to another indicates similar content. The full system is shown in Fig. 2 .

The index is created by processing documents in three ways: a deep learning model called Siamese-BERT (SBERT 27) embeds single paragraphs and image captions, and two keyword-based models (TF-IDF, BM25 28) vectorize entire documents (see Fig. 2a). SBERT is an extension of the widely used BERT 18 language representation model that uses two BERT models with tied network parameters. It has been shown to be superior to BERT in semantic similarity search 27, being significantly more computationally efficient at learning correspondences between sentences. For instance, finding the most similar pair of sentences in a collection of n = 10,000 sentences using BERT would require feeding each possible pair into the network, yielding n(n − 1)/2 = 49,995,000 inference computations, or about 65 h on an NVIDIA V100 GPU. In contrast, SBERT reduces this to 10,000 inference computations plus the computation of cosine similarities between the resulting embeddings, yielding about 5 s of compute time. SBERT is trained to take a short text string and a longer text document and output the correspondence between the two (i.e. their similarity) as a real-valued number between 0 and 1. In this use case, semantic embeddings from the SBERT model face the challenge of working with a relatively small number of long documents. We account for this by pre-training SBERT on a large, synthetic dataset of millions of training examples, constructed as follows. We split documents into paragraphs, extract the titles of the citations of each paragraph, and form a bipartite graph of paragraphs and citations, with edges indicating that a citation c came from a paragraph p. We use the graph to form tuples ((p, c) s.t. c ∈ p) for training SBERT to predict whether a title was cited by a paragraph. Additionally, we generate an equivalent number of negative training samples of incorrect tuples ((p, c) s.t. c ∉ p). The full pipeline for this step is shown in Fig. 1a.

Figure 1. a Documents are split into paragraphs and the citations they contain, forming a bipartite graph that induces training tuples (p, c). These are fed to a Siamese-BERT (SBERT) model trained to discern whether a citation is contained in a given paragraph. This trains SBERT to match user search queries to scientific publication titles. b t-SNE visualization of the SBERT embeddings of entire documents, each denoted by a single point. Color represents the topic to which each document is most closely matched; notably, documents pertaining to the same topic tend to cluster together.
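To make this construction concrete, the following minimal Python sketch builds the positive (paragraph, citation-title) tuples and an equal number of negatives. It is a plausible implementation under an assumed data layout, not code from the CO-Search system.

```python
import random

def build_training_tuples(documents, seed=0):
    """Build (paragraph, citation_title, label) tuples from a parsed corpus.

    documents: assumed to be a list of dicts, each holding 'paragraphs',
    a list of {'text': str, 'citation_titles': [str, ...]} entries.
    """
    rng = random.Random(seed)
    pairs = []        # positive (paragraph, title) edges of the bipartite graph
    all_titles = []
    for doc in documents:
        for para in doc["paragraphs"]:
            for title in para["citation_titles"]:
                pairs.append((para["text"], title))
                all_titles.append(title)

    tuples = [(p, c, 1) for p, c in pairs]  # (p, c) s.t. c ∈ p

    cited = {}                              # paragraph text -> set of cited titles
    for p, c in pairs:
        cited.setdefault(p, set()).add(c)

    for p, _ in pairs:                      # equal number of negatives: c ∉ p
        c = rng.choice(all_titles)
        while c in cited[p]:
            c = rng.choice(all_titles)
        tuples.append((p, c, 0))

    rng.shuffle(tuples)
    return tuples
```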

The structure of the embedded space is such that proximal queries and documents share semantic meaning. Visualizing this reveals a human-understandable clustering of documents and topics. Figure 1b shows a two-dimensional t-SNE 29 plot—an effective method for visualizing high-dimensional data—of the embedded space, with colors representing TREC-COVID topics and points representing documents. Semantically similar documents cluster by topic.

Document retrieval (Fig. 2b, top row)—which returns a list of the top 1000 documents for a query—is accomplished by fusing the returned lists of the SBERT, TF-IDF, and BM25 models. SBERT allows variable-length queries and documents to be embedded into the same vector space (the model's multi-dimensional internal representation of the data), in order to model semantic proximity and enable k-nearest-neighbor (kNN) retrieval. We use approximate kNN retrieval with the Annoy framework (https://github.com/spotify/annoy) to account for the large number of paragraphs parsed by SBERT. TF-IDF and BM25 independently return two document lists (TF-IDF uses kNN with cosine distance; BM25 uses a Lucene inverted index 30, built with Anserini) that either share the most distinctive keywords of the query (TF-IDF) or share many of the same keywords as the query (BM25-Anserini) 28.
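To make the approximate kNN step concrete, here is a minimal sketch using Annoy. The embedding dimension, tree count, and random stand-in vectors are illustrative assumptions, not the system's actual settings.

```python
import numpy as np
from annoy import AnnoyIndex

EMBED_DIM = 768  # assumption: a BERT-base-sized SBERT output dimension

# Stand-ins for real SBERT embeddings of paragraphs and of the query.
paragraph_embeddings = np.random.rand(10_000, EMBED_DIM)
query_embedding = np.random.rand(EMBED_DIM)

# Build the index once, offline, over all paragraph embeddings.
index = AnnoyIndex(EMBED_DIM, "angular")  # angular distance tracks cosine similarity
for i, vec in enumerate(paragraph_embeddings):
    index.add_item(i, vec.tolist())
index.build(100)  # 100 trees: more trees give better recall but a larger index

# At query time: fetch the 1000 approximately nearest paragraphs.
ids, dists = index.get_nns_by_vector(query_embedding.tolist(), 1000,
                                     include_distances=True)
```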

Figure 2. a Indexing: raw documents are processed into a searchable format. Documents are split into paragraphs and image captions, embedded with an SBERT deep learning model, and stored in an index. The raw documents are also embedded with two keyword-based models (TF-IDF and BM25). b Retrieval and re-ranking: the system computes a linear combination of TF-IDF and SBERT retrieval scores, then combines them with the retrieval scores of BM25 using reciprocal rank fusion 31 to generate a sorted candidate list. k-nearest-neighbors is used for TF-IDF and SBERT, and the Lucene inverted index is used for BM25. The retrieved documents and the query are parsed by a question-answering model and an abstractive summarizer prior to re-ranking based on answer match, summarization match, and retrieval scores.

These three lists are then combined by first linearly fusing the SBERT list with the TF-IDF list, then using reciprocal rank fusion (RRF) 31 to merge this with the BM25 list. This retrieval process returns the top 1000 documents as a function of their semantic and syntactic distance to the query.
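A minimal sketch of this fusion step, using the standard RRF score from ref. 31; the input lists below are placeholders for the ranked lists described above.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of document ids.

    rankings: list of lists, each an ordering of doc ids (best first).
    Returns doc ids sorted by their fused RRF score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse the linearly combined SBERT+TF-IDF list with the BM25 list, e.g.:
# final = rrf_fuse([combined_sbert_tfidf_list, bm25_list])[:1000]
```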

Document re-ranking (Fig. 2b, bottom row) takes this set of documents, runs them through both a question–answering (QA) module and a summarizer, then ranks the documents by a weighted combination of their original retrieval scores, the QA output, and the summarizer output. Whereas standard question answering systems generate answers, our model extracts multiple answer candidates (text spans) from the paragraphs of the retrieved documents. This is accomplished by taking the query and the retrieved paragraphs and using a sequential paragraph selector 32 to filter for a set of paragraphs that, when combined, could answer the query. Specifically, the model uses multi-hop reasoning to model relationships between paragraphs, and selects sequentially ordered sets of them. It is pre-trained using a Wikipedia-derived dataset of 113k question–answer pairs and sentence-level supporting facts 33, and further finetuned on a QA dataset built from PubMed 34 for biomedical specificity. Once filtered, these sequential paragraph sets are fed into a reading comprehension model (trained on a standard question–answering dataset with topic structure similar to CORD-19 35) to extract answer candidates.

In a parallel fashion, the summarizer generates a single abstractive summary from the retrieved documents. It is built in an encoder–decoder fashion, in which an encoder (BERT 18 ) first embeds an entire document, and a decoder (a modified GPT-2 model 36 ) converts this embedding into raw text, outputting a summary. To increase the probability that a generated summary matches (and thus, helps re-rank) the contents of the retrieved paragraphs, we tuned the model to generate short summaries of fewer than 65 words 37 .

Finally, the system uses the generated answers and summary to compute two scores for each retrieved document. The first measures the relevance of a document, given the query, and the second measures the degree to which any single document summarizes the entire set of retrieved documents. These two scores are combined with the original relevance scores to output a final ranked list of documents.

We evaluate our system quantitatively using the CORD-19 document dataset and the topics and relevance judgments provided by TREC-COVID. The dataset contains five sets of topics, where each topic is represented as a (query, question, narrative) tuple. Relevance judgments—provided on a very small subset of all possible topic–document pairs—score each pair as irrelevant, partially relevant, or relevant. These judgments were gathered iteratively throughout the five-round TREC-COVID competition, in which search engines submitted up to 1000 ranked documents per query, and the organizers pooled the most common topic–document pairs for judging (i.e. depth-N pooling, in which the top N documents from each response provided by the set of contributing systems are judged for relevance by human assessors 38, with N ranging from 7 to 20 across rounds). These pool depths result in many relevant documents being missed. Though this labeling procedure is inherently sparse and somewhat biased, it is the best available method for evaluating IR systems, as obtaining relevance judgments on all possible topic–document pairs is infeasible.

In order to better evaluate our approach, we use a variety of IR metrics. Key amongst them are high-precision metrics such as nDCG, top-N precision, and MAP. The critical limitation of these is that their effectiveness relies on complete relevance judgments across all topic–document pairs. To account for this, we additionally consider Bpref, a metric that is robust to missing relevance judgments. For full details, see the "Methods" section.

Our results on these data are shown in Table 2. We compare the performance of our system in two contexts. The first context is within the general set of submissions. This includes metric evaluations on all documents—annotated and non-annotated—and ranking against the three possible system types in the competition: manual, automatic, and feedback systems. Manual submissions use human operators who can iteratively adjust the query or the retrieved documents to improve ranking. Feedback systems are trained using the relevance judgments of prior rounds. Automatic search engines may do neither. Strictly speaking, feedback systems are also automated (in that they do not use a human in the loop), though they have an inherent advantage over automatic systems and are thus considered separately. In the second context, we evaluate our system (and all others) strictly on relevance judgments, and we compare our automatic system strictly against other automatic systems. Specifically, we re-score every automatic system's runs after removing non-judged topic–document pairs. To determine team rankings, we account both for multiple submissions per team and for multiple submissions with the same score, assigning each the highest rank (i.e., if the top two scoring submissions for a metric have the same score, each would be ranked #1).

Each round builds on the previous rounds, adding five new topics, many documents, and new relevance judgments. As a result, Round 5 is the most complete round. In the first context (columns "All submissions, All pairs"), our system ranks in the top 21 (Table 2) across all rounds. Considering the rankings from Round 1 through Round 5, there is a pronounced improvement from Round 1 to Round 2, followed by a drop and then a plateau from Rounds 3 to 5. The improvement from Round 1 to 2 can be explained by the judgment fraction: as the percentage of relevance judgments goes up, performance across these metrics increases, because metrics such as precision penalize search engines for retrieving relevant but non-annotated documents for a topic. Rounds 3–5 have sufficient relevance judgments from prior rounds to improve feedback systems, leading to a drop in our ranking.

In the second context, our system ranks in the top 6 across all metrics and all rounds, in the top 4 across all but four, and as the top system on half of them. The stability in performance is largely due to the consistent judgment fraction (implicitly 100%) and the absence of feedback and manual systems, both of which improve with relevance judgments. This stability—evident also in the metrics—implies a system that is robust to increasing corpus size.

Of note, the availability of relevance judgments is quite sparse throughout all rounds, with Round 1 exhibiting a coverage of 0.57%, and Round 5 a coverage of 0.24%. This is precisely what motivates the use of the Bpref metric, which is robust to missing annotations, as evidenced by its consistency across contexts.

Discussion

Here we present CO-Search, a scientific search engine over the growing corpus of COVID-19 literature. We train the system using the scientific papers of the COVID-19 Open Research Dataset challenge, and evaluate its performance using the data of the TREC-COVID competition, achieving strong performance on a number of key metrics across competition rounds. The system uses a combination of semantic and keyword-based models to retrieve and score documents. It then re-ranks these documents by using a Wikipedia- and PubMed-trained question–answering system, together with an abstractive summarizer, to modulate the retrieval scores.

We perform an ablation study of our system using Round 5 data (first context) in order to examine the performance effects of its components (Table 3). This is done in two steps, first for the retriever, then for the re-ranker. For each, we analyze the metric performance of the various components individually and united. The retriever's components (TF-IDF, BM25, SBERT) each perform poorly on their own, but benefit from substantial synergy when united into the full retrieval pipeline (top half of Table 3). This occurs because keyword-based techniques, on their own, do not perform as well on queries posed in natural language; similarly, semantic techniques tend to underweight the most salient keywords of a natural language query. Combined, the two techniques work well for this unique dataset. The retrieval subsystem accounts for most of the performance of the overall system. The addition of the re-ranker, with its two other deep learning modules (QA, summarizer), serves to further boost this performance on the order of 1–2% across the various metrics employed.

We compare our system against three of the top-performing systems of Round 5, as shown in Table 4. As can be seen, no single system outperforms the rest across all metrics, indicating the possibility of forming hybrid systems that benefit from the strengths of each. The system covidex 13 uses a transformer fine-tuned on the MedMARCO machine-reading comprehension dataset 16 to predict query–document relevance. The system uogTr linearly combines a SciBERT model 17 trained on the medical queries of MS MARCO 15 with SciColBERT. The system unique_ptr leverages synthetic query generation 39 for training-data augmentation. RRF enables easy merging of ideas, and it would be straightforward to extend CO-Search to benefit from them: synthetic query generation could augment the SBERT training tuples shown in Fig. 1, and the outputs of a medically fine-tuned SciBERT model, or of a transformer fine-tuned on the MedMARCO data, could be joined with our own output via RRF.

From Round 5, the two topics on which CO-Search performs best, as ranked by Bpref, are "what kinds of complications related to COVID-19 are associated with diabetes" and "are patients taking Angiotensin-converting enzyme inhibitors (ACE) at increased risk for COVID-19?". Conversely, the system performs worst on "what are the guidelines for triaging patients infected with coronavirus?" and "what causes death from Covid-19?". This is likely due to the hybrid semantic-syntactic nature of the system. The keyword models allow the system to focus on important words like "diabetes" and "angiotensin", while the semantic SBERT model focuses on broader meanings inherent in pieces of the text such as "complications ... associated with ...". Note that the worst-performing topics lack the obvious keywords of the best-performing ones.

The semantic search capability of CO-Search allows it to disambiguate between subtle variations in word ordering that, in biological contexts, result in critically different meanings (e.g. “What regulates expression of the ACE2 protein?” vs. “What does the ACE2 protein regulate?”), maximizing its utility to the medical and scientific communities in a time of crisis. Key to the fair evaluation of the system is considering the general use case (all IR systems, all documents), and a specific use case (automatic systems, judged documents).

This work is intended as a tool to support the fight against COVID-19. In this time of crisis, tens of thousands of documents are being published, only some of which are scientific, rigorous, and peer-reviewed. This may lead to the inclusion of misinformation and the potential rapid spread of scientifically disprovable or otherwise false research and data. People on the front lines—medical practitioners, policy makers, etc.—are time-constrained in their ability to parse this corpus, which could impede their ability to approach the returned search results with the appropriate levels of skepticism and inquiry available in less exigent circumstances. Coronavirus-specialized search capabilities are key to making this wealth of knowledge both useful and actionable. The risks are not trivial, as decisions made based on returned, incorrect, or demonstrably false results might jeopardize trust or public health and safety. The authors acknowledge these risks, but believe that the overall benefits to researchers and to the broader COVID-19 research agenda outweigh the risks.

Methods

Evaluation metrics

Below we define the key evaluation metrics. Throughout this work we adopt the standard convention that m@N refers to an evaluation using metric m over the top N retrieved documents.

Precision (P): the fraction of the top N retrieved documents that are judged relevant:
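$$\mathrm{P@}N=\frac{\#\{\text{relevant documents in the top}\ N\ \text{retrieved}\}}{N}$$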

nDCG: For position i ∈ {0, 1, ..., N}, the nDCG of a retrieved set of documents over Q queries is given (in its standard form) by
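$$\mathrm{nDCG@}N=\frac{1}{Q}\sum _{q=1}^{Q}\frac{1}{{\mathrm{IDCG}}^{(q)}}\sum _{i=1}^{N}\frac{{\,\text{rel}\,}_{i}^{(q)}}{{\log }_{2}(i+1)}$$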

where \({\,\text{rel}\,}_{i}^{(q)}\) denotes the relevance of entry i, ranked according to query q, and IDCG denotes the ideal (highest possible) DCG. In the limit of perfect annotations, nDCG performs reliably in measuring search engine performance. Since it treats non-annotated documents as incorrect (\({\text{rel}}_{i}\) evaluates to zero), it is less reliable for datasets with incomplete annotations.

MAP: The average precision (AP) of a retrieved document set is defined as the integral over the normalized precision–recall curve of the set's query. MAP is defined as the mean AP over all queries:
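$$\mathrm{MAP}=\frac{1}{Q}\sum _{q=1}^{Q}\int_{0}^{1}{P}_{q}(R)\,{\mathrm{d}}R$$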

where R is recall and \({P}_{q}\) is precision as a function of recall for a particular query. Note that, as in the case of nDCG, MAP penalizes search engines that yield accurate but unique (i.e. non-annotated) results, since non-annotated documents are treated as irrelevant by P.

Bpref: Bpref strictly uses information from judged documents. It is a function of how frequently relevant documents are retrieved before non-relevant documents. In situations with incomplete relevance judgments (most IR datasets) it is more stable than other metrics, and it is designed to be robust to missing relevance judgments: it gives roughly the same results with incomplete judgments as MAP would give with complete judgments 38. In its standard form, it is defined as
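$$\mathrm{Bpref}=\frac{1}{R}\sum _{r}\left(1-\frac{|n\ \text{ranked higher than}\ r|}{R}\right)$$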

where R is the number of judged relevant documents, r is a relevant retrieved document, n ranges over the first R irrelevant retrieved documents, and non-judged documents are ignored.

Document indexing

We train the SBERT model of the indexing step with cross-entropy loss, Adam optimization 40 with a learning rate of 2e−5, a linear learning-rate warm-up over 10% of the training data, and the default MEAN pooling strategy (see Fig. 1a).
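As an illustration, a plausible training setup using the sentence-transformers library is sketched below. The BERT-base encoder and batch size are assumptions; the learning rate, warm-up, MEAN pooling, and cross-entropy (softmax) objective follow the text.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses, models
from torch.utils.data import DataLoader

# Assumed base encoder; MEAN pooling over token embeddings as stated.
word_model = models.Transformer("bert-base-uncased")
pooling = models.Pooling(word_model.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_model, pooling])

# train_tuples: (paragraph, citation_title, label) triples,
# e.g. produced by the build_training_tuples sketch earlier.
train_examples = [InputExample(texts=[p, c], label=y) for p, c, y in train_tuples]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)  # batch size assumed

# Softmax classification head over the pair embedding = cross-entropy objective.
loss = losses.SoftmaxLoss(model,
                          sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
                          num_labels=2)

warmup = int(0.1 * len(loader))  # linear warm-up over 10% of the training steps
model.fit(train_objectives=[(loader, loss)], epochs=1,
          warmup_steps=warmup, optimizer_params={"lr": 2e-5})
```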

Document retrieval

At runtime, the retrieval step takes an input query, embeds it using SBERT, computes approximate nearest neighbors over the SBERT paragraph embeddings, and returns a set of paragraphs, together with each paragraph's cosine similarity to the query. TF-IDF and BM25 take as input queries and documents, returning vectors \(t\in {{\mathbb{R}}}^{M}\) and \(b\in {{\mathbb{R}}}^{M}\) such that \({t}_{i}\) = TF-IDF(query, document i), \({b}_{i}\) = BM25(query, document i), and M is the size of the document corpus. We build a Lucene index with the BM25 retrieval function, using default parameters \({k}_{1}\) = 1.2 and b = 0.75, in the Anserini IR toolkit. The formula for TF-IDF, in its standard form, is given by
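$$\text{TF-IDF}(t,d)=\mathrm{tf}(t,d)\cdot \log \frac{M}{\mathrm{df}(t)}$$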

where tf(t, d) is the term frequency—the number of times term t appears in document d—and df(t) is the document frequency—the number of documents in the set that contain term t. We use the scikit-learn 41 implementation of TF-IDF, with a vocabulary size of 13,000, a maximum document frequency of 0.5, a minimum document frequency of 3, and L2 normalization 42 of the vectors computed from Eq. (5) above.
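These settings map directly onto scikit-learn's TfidfVectorizer; the corpus and query below are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["coronavirus transmission in dense urban centers",
          "ace2 receptor binding of sars-cov-2 spike protein"]  # placeholder documents

# Configuration mirroring the stated settings: 13,000-term vocabulary,
# max document frequency 0.5, min document frequency 3, L2-normalized vectors.
vectorizer = TfidfVectorizer(max_features=13_000, max_df=0.5, min_df=3, norm="l2")
doc_matrix = vectorizer.fit_transform(corpus)

query_vec = vectorizer.transform(["risk factors for severe covid-19"])  # illustrative query
# With L2-normalized rows, the dot product equals cosine similarity.
scores = (doc_matrix @ query_vec.T).toarray().ravel()
```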

The SBERT and TF-IDF scores are combined linearly. For document d (containing paragraphs p) and query q, with the subscript es denoting an SBERT embedding and \({t}_{d}\) the TF-IDF score of d from above, their combination C takes a form along the lines of
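$$C(d,q)=\mu \mathop{\max }\limits_{p\in d}\cos ({q}_{es},{p}_{es})+(1-\mu )\,{t}_{d}$$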

This induces a ranking \({R}_{{\mathrm {C}}}^{q}\) on the documents, which is then combined with the BM25-induced ranking \({R}_{{\mathrm {B}}}^{q}\) using reciprocal rank fusion 31, which scores each document by the reciprocal of its constant-shifted rank in each list, to obtain a final retrieved ordering:
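$$\mathrm{RRF}(d,q)=\sum _{R\in \{{R}_{{\mathrm{C}}}^{q},\,{R}_{{\mathrm{B}}}^{q}\}}\frac{1}{k+R(d)}$$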

In practice, we find that the constants μ = 0.7 and k = 60 yield good results. Future work could consider using a learned layer to attend over semantic embeddings and keyword vectors, given the query.

Document re-ranking

Re-ranking combines the RRF scores of the retrieved documents with the outputs of the QA engine and the summarizer. We define Q to measure the degree to which a document answers a query; one plausible form, consistent with the definitions below, is
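$$Q(d,q)=\frac{1}{|A(q)|}\sum _{a\in A(q)}{\bf{1}}(a\in d)$$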

where 1(x) is the indicator function (1 if x is true, 0 otherwise) and the set A(q) contains the text-span outputs of the QA model. We define S to measure the degree to which a document summarizes the set of documents retrieved for a query; a natural choice, given the embedded summary defined below, is
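$$S(d,q)=\cos ({d}_{es},M{(q)}_{e})$$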

where \(M{(q)}_{e}\) is the embedded abstractive summary of q, summarized across all retrieved documents. The final ranking score R(d, q) of a document, for a particular query, then combines these signals, e.g. as
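$$R(d,q)=\mathrm{RRF}(d,q)\cdot \left(1+Q(d,q)+S(d,q)\right)$$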

with higher scores indicating better matches. In essence, the rank score R is determined by letting S and Q modulate the retrieval score of a query–document pair.

Question answering: We follow the HotPotQA setup 32 and all model parameters contained therein. We use paragraphs with high TF-IDF scores for the given query as negative examples for the sequential paragraph selector. The original beam search is modified to include paragraph diversity and to avoid extracting the same answers from different paths.

Abstractive summarization: We extend the original GPT-2 model by adding a cross-attention function alongside every existing self-attention function. We constrain the cross-attention function to attend strictly to the final-layer outputs of the encoder. We use the base models and hyperparameters of Wolf et al. 43, with 12 layers, 768-dimensional activations in the hidden layers, and 12 attention heads. The model is pre-trained using self-supervision with a gap-sentence generation objective 44, in which we select a random source sentence per document, replace it with a special mask token in the input 80% of the time, and use that sentence as a prediction target in all cases. We then finetune the model with single-document supervised training, using the first 512 tokens of CORD-19 documents after the abstract as input, and the first 300 tokens of the abstract as target output.

Abstracts are split into five groups based on the number of tokens: <65, 65–124, 125–194, 195–294, >295. During training, a special token is provided to specify the summary length in these five categories. At inference time, the model is initialized to output summaries of token lengths <65 in order to generate more concise summaries.
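A small sketch of how such a length-control token might be assigned; the token names are assumptions, while the bucket boundaries follow the text.

```python
def length_bucket_token(n_tokens: int) -> str:
    """Map a target summary length (in tokens) to one of the five control buckets."""
    if n_tokens < 65:
        return "<len_0>"
    if n_tokens <= 124:
        return "<len_1>"
    if n_tokens <= 194:
        return "<len_2>"
    if n_tokens <= 294:
        return "<len_3>"
    return "<len_4>"

# At inference time the shortest bucket is requested,
# i.e. summaries of fewer than 65 tokens:
prefix = length_bucket_token(64)  # "<len_0>"
```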

To adapt the model to operate on multiple retrieved paragraphs from different documents, we concatenate the first four sentences of the retrieved paragraphs until they reach an input length of 512 tokens, then feed this into the summarization model.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

All data used in this study was taken from the COVID-19 Open Research Dataset Challenge 9 , and is publicly available. The aggregated data analyzed in this study will be made available upon reasonable request.

Code availability

The code used in this study will be made available upon reasonable request.

References

1. Thomala, L. L. Number of new hospital beds to be added in the designated hospitals after the coronavirus COVID-19 outbreak in Wuhan, China as of February 2, 2020. Statista. https://www.statista.com/statistics/1095434/china-changes-in-the-number-of-hospital-beds-in-designated-hospitals-after-coronavirus-outbreak-in-wuhan/

2. Bogage, J. Tesla unveils ventilator prototype made with car parts on YouTube. Wash. Post. https://www.washingtonpost.com/business/2020/04/06/tesla-coronavirus-ventilators-musk/ (2020).

3. Day, M. & Soper, S. Amazon is prioritizing essential products as online orders spike. Bloomberg. https://www.bloomberg.com/news/articles/2020-03-17/amazon-prioritizing-essentials-medical-goods-in-virus-response

4. Nicas, J. & Wakabayashi, D. Apple and Google team up to 'contact trace' the coronavirus. N. Y. Times. https://www.nytimes.com/2020/04/10/technology/apple-google-coronavirus-contact-tracing.html (2020).

5. Armitage, H. Stanford Medicine launches national daily health survey to predict COVID-19 surges, inform response efforts. Stanford Medicine News Center. http://med.stanford.edu/news/all-news/2020/04/daily-health-survey-for-covid-19-launched0.html

6. Liu, C. et al. Research and development on therapeutic agents and vaccines for COVID-19 and related human coronavirus diseases. ACS Cent. Sci. 6, 315–331 (2020).

7. McKibbin, W. & Fernando, R. The global macroeconomic impacts of COVID-19: seven scenarios. Centre for Applied Macroeconomic Analysis (CAMA) Working Paper 19/2020, Australian National University (2020).

8. Thomas, P., Chaney, S. & Cutter, C. New COVID-19 layoffs make job reductions permanent. Wall Street J. https://www.wsj.com/articles/new-covid-19-layoffs-make-job-reductions-permanent-11598654257 (2020).

9. Wang, L. L. et al. CORD-19: The COVID-19 Open Research Dataset. Preprint at arXiv:2004.10706 (2020).

10. Roberts, K. et al. TREC-COVID: rationale and structure of an information retrieval shared task for COVID-19. J. Am. Med. Inform. Assoc. 27, 1431–1436 (2020).

11. Bullock, J., Luccioni, A., Pham, K. H., Lam, C. S. N. & Luengo-Oroz, M. Mapping the landscape of artificial intelligence applications against COVID-19. J. Artif. Intell. Res. 69, 807–845 (2020).

12. Tang, R. et al. Rapidly bootstrapping a question answering dataset for COVID-19. Preprint at arXiv:2004.11339 (2020).

13. Zhang, E., Gupta, N., Nogueira, R., Cho, K. & Lin, J. Rapidly deploying a neural search engine for the COVID-19 Open Research Dataset: preliminary thoughts and lessons learned. Preprint at arXiv:2004.05125 (2020).

14. Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1–67 (2020).

15. Bajaj, P. et al. MS MARCO: a human generated machine reading comprehension dataset. Preprint at arXiv:1611.09268 (2016).

16. MacAvaney, S., Cohan, A. & Goharian, N. SLEDGE: a simple yet effective baseline for coronavirus scientific knowledge search. Preprint at https://arxiv.org/abs/2005.02365 (2020).

17. Beltagy, I., Lo, K. & Cohan, A. SciBERT: a pretrained language model for scientific text. In (eds Inui, K., Jiang, J., Ng, V. & Wan, X.) EMNLP-IJCNLP, 3613–3618 (Association for Computational Linguistics, 2019).

18. Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at arXiv:1810.04805 (2018).

19. Su, D. et al. CAiRE-COVID: a question answering and multi-document summarization system for COVID-19 research. Preprint at arXiv:2005.03975 (2020).

20. Dong, L. et al. Unified language model pre-training for natural language understanding and generation. In (eds Wallach, H. M. et al.) NeurIPS, 13042–13054 (Curran Associates, Inc., 2019).

21. Lewis, M. et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7871–7880 (Association for Computational Linguistics, Online, 2020).

22. Wang, X., Song, X., Guan, Y., Li, B. & Han, J. Comprehensive named entity recognition on CORD-19 with distant or weak supervision. Preprint at arXiv:2003.12218 (2020).

23. Ahamed, S. & Samad, M. Information mining for COVID-19 research from a large volume of scientific literature. Preprint at arXiv:2004.02085 (2020).

24. Mitra, B. & Craswell, N. Neural models for information retrieval. Preprint at arXiv:1705.01509 (2017).

25. Guo, J. et al. A deep look into neural ranking models for information retrieval. Inf. Process. Manag. 57, 102067 (2020).

26. Bendersky, M. et al. RRF102: meeting the TREC-COVID challenge with a 100+ runs ensemble. Preprint at arXiv:2010.00200 (2020).

27. Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3973–3983 (2019).

28. Yang, P., Fang, H. & Lin, J. Anserini: enabling the use of Lucene for information retrieval research. In Proc. 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1253–1256 (2017). https://doi.org/10.1145/3077136.3080721

29. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

30. Białecki, A., Muir, R., Ingersoll, G. & Imagination, L. Apache Lucene 4. In SIGIR 2012 Workshop on Open Source Information Retrieval, Vol. 17 (2012). https://www.semanticscholar.org/paper/Apache-Lucene-4-Bialecki-Muir/2795d9d165607b5ad6d8b9718373b82e55f41606

31. Cormack, G. V., Clarke, C. L. & Buettcher, S. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In SIGIR 2009, 758–759 (2009). https://www.semanticscholar.org/paper/Reciprocal-rank-fusion-outperforms-condorcet-and-Cormack-Clarke/9e698010f9d8fa374e7f49f776af301dd200c548

32. Asai, A., Hashimoto, K., Hajishirzi, H., Socher, R. & Xiong, C. Learning to retrieve reasoning paths over Wikipedia graph for question answering. In ICLR 2020. https://openreview.net/forum?id=SJgVHkrYDH (2020).

33. Yang, Z. et al. HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2369–2380 (Association for Computational Linguistics, Brussels, Belgium, 2018).

34. Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. Preprint at arXiv:1909.06146 (2019).

35. Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. Preprint at arXiv:1606.05250 (2016).

36. Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019).

37. Fan, A., Grangier, D. & Auli, M. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, 45–54 (Association for Computational Linguistics, Melbourne, Australia, 2018).

38. Lu, X., Moffat, A. & Culpepper, J. S. The effect of pooling and evaluation depth on IR metrics. Inf. Retr. J. 19, 416–445 (2016).

39. Ma, J., Korotkov, I., Yang, Y., Hall, K. & McDonald, R. Zero-shot neural retrieval via domain-targeted synthetic query generation. Preprint at arXiv:2004.14503 (2020).

40. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In (eds Bengio, Y. & LeCun, Y.) 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings (2015).

41. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

42. Yang, P., Fang, H. & Lin, J. Anserini: reproducible ranking baselines using Lucene. J. Data Inf. Qual. 10, 1–20 (2018).

43. Wolf, T. et al. Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45 (Association for Computational Linguistics, Online, 2020).

44. Zhang, J., Zhao, Y., Saleh, M. & Liu, P. PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. In (eds Daumé III, H. & Singh, A.) Proceedings of the 37th International Conference on Machine Learning, Vol. 119 of Proceedings of Machine Learning Research, 11328–11339 (PMLR, 2020).


Acknowledgements

Funding for this study was provided by Salesforce.com, Inc.

Author information

These authors contributed equally: Andre Esteva, Anuprit Kale.

Authors and Affiliations

Salesforce Research, Palo Alto, CA, USA

Andre Esteva, Anuprit Kale, Romain Paulus, Kazuma Hashimoto, Wenpeng Yin, Dragomir Radev & Richard Socher

Yale University, New Haven, CT, USA

Dragomir Radev


Contributions

A.E. led the work, managed the team, designed the experiments, and created the infrastructure for the system. A.K. built the retriever and the ranker. K.H. trained the QA module. R.P. trained the abstractive summarization module. W.Y. worked on various exploratory components. D.R. advised on general IR systems and contributed substantially to the writing. R.S. supervised the work.

Corresponding author

Correspondence to Andre Esteva .

Ethics declarations

Competing interests

All authors were employees of Salesforce.com, Inc., at the time of manuscript preparation.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Reporting summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Esteva, A., Kale, A., Paulus, R. et al. COVID-19 information retrieval with deep-learning based semantic search, question answering, and abstractive summarization. npj Digit. Med. 4 , 68 (2021). https://doi.org/10.1038/s41746-021-00437-0


Received: 03 September 2020

Accepted: 08 March 2021

Published: 12 April 2021

DOI: https://doi.org/10.1038/s41746-021-00437-0


semantic search research papers

semantic search engine Recently Published Documents

Total documents.

  • Latest Documents
  • Most Cited Documents
  • Contributed Authors
  • Related Sources
  • Related Keywords

Question Answering Systems for Covid-19

Abstract In the present scenario COVID-19 pandemic has ruined the entire world. This situation motivates the researchers to resolve the query raised by the people around the world in an efficient manner. However, less number of resources available in order to gain the information and knowledge about COVID-19 arises a need to evaluate the existing Question Answering (QA) systems on COVID-19. In this paper, we compare the various QA systems available in order to answer the questions raised by the people like doctors, medical researchers etc. related to corona virus. QA systems process the queries submitted in natural language to find the best relevant answer among all the candidate answers for the COVID-19 related questions. These systems utilize the text mining and information retrieval on COVID-19 literature. This paper describes the survey of QA systems-CovidQA, CAiRE (Center for Artificial Intelligence Research)-COVID system, CO-search semantic search engine, COVIDASK, RECORD (Research Engine for COVID Open Research Dataset) available for COVID-19. All these QA systems are also compared in terms of their significant parameters-like Precision at rank 1 (P@1), Recall at rank 3(R@3), Mean Reciprocal Rank(MRR), F1-Score, Exact Match(EM), Mean Average Precision, Score metric etc.; on which efficiency of these systems relies.

Dug: A Semantic Search Engine Leveraging Peer-Reviewed Literature to Span Biomedical Data Repositories

As the number of public data resources continues to proliferate, identifying relevant datasets across heterogenous repositories is becoming critical to answering scientific questions. To help researchers navigate this data landscape, we developed Dug: a semantic search tool for biomedical datasets that utilizes evidence-based relationships from curated knowledge graphs to find relevant datasets and explain why those results are returned. Developed through the National Heart, Lung, and Blood Institute's (NHLBI) BioData Catalyst ecosystem, Dug can index more than 15,911 study variables from public datasets in just over 39 minutes. On a manually curated search dataset, Dug's mean recall (total relevant results/total results) of 0.79 outperformed default Elasticsearch's mean recall of 0.76. When using synonyms or related concepts as search queries, Dug's (0.28) far outperforms Elasticsearch (0.1) in terms of mean recall. Dug is freely available at https://github.com/helxplatform/dug, and an example Dug deployment is also available for use at https://helx.renci.org/ui.

A Vertical Semantic Search Engine In Electric Power Metering Domain

Fenix: a semantic search engine based on an ontology and a model trained with machine learning to support research.

This paper presents the final results of the research project that aimed to build a Semantic Search Engine that uses an Ontology and a model trained with Machine Learning to support the semantic search of research projects of the System of Research from the University of Nariño. For the construction of FENIX, as this Engine is called, it was used a methodology that includes the stages: appropriation of knowledge, installation and configuration of tools, libraries and technologies, collection, extraction and preparation of research projects, design and development of the Semantic Search Engine. The main results of the work were three: a) the complete construction of the Ontology with classes, object properties (predicates), data properties (attributes) and individuals (instances) in Protegé, SPARQL queries with Apache Jena Fuseki and the respective coding with Owlready2 using Jupyter Notebook with Python within the virtual environment of anaconda; b) the successful training of the model for which Machine Learning algorithms and specifically Natural Language Processing algorithms were used such as: SpaCy, NLTK, Word2vec and Doc2vec, this was also done in Jupyter Notebook with Python within the virtual environment of anaconda and with Elasticsearch; and c) the creation of FENIX managing and unifying the queries for the Ontology and for the Machine Learning model. The tests showed that FENIX was successful in all the searches that were carried out because its results were satisfactory.

COVID-19 preVIEW: Semantic Search to Explore COVID-19 Research Preprints

During the current COVID-19 pandemic, the rapid availability of profound information is crucial in order to derive information about diagnosis, disease trajectory, treatment or to adapt the rules of conduct in public. The increased importance of preprints for COVID-19 research initiated the design of the preprint search engine preVIEW. Conceptually, it is a lightweight semantic search engine focusing on easy inclusion of specialized COVID-19 textual collections and provides a user friendly web interface for semantic information retrieval. In order to support semantic search functionality, we integrated a text mining workflow for indexing with relevant terminologies. Currently, diseases, human genes and SARS-CoV-2 proteins are annotated, and more will be added in future. The system integrates collections from several different preprint servers that are used in the biomedical domain to publish non-peer-reviewed work, thereby enabling one central access point for the users. In addition, our service offers facet searching, export functionality and an API access. COVID-19 preVIEW is publicly available at https://preview.zbmed.de.

A Proposed Framework for Building Semantic Search Engine with Map-Reduce

Strings and things: a semantic search engine for news quotes using named entity recognition, user queries for semantic search engine, the technique of different semantic search engines.

Semantic Search is a search technique that improves looking precision through perception the reason of the search and the contextual magnitude of phrases as they show up in the searchable statistics space, whether or not on the net to generate greater applicable result. We spotlight right here about Semantic Search, Semantic Web and talk about about exceptional kind of Semantic search engine and variations between key-word base search and Semantic Search and the benefit of Semantic Search. We additionally provide a short overview of the records of semantic search and its function scope in the world.

Optimization of Information Retrieval Algorithm for Digital Library Based on Semantic Search Engine

Export citation format, share document.

semantic search research papers

Insert an arXiv link to find similar papers or use natural language to describe what you are looking for.

Analyze research papers at superhuman speed

Search for research papers, get one sentence abstract summaries, select relevant papers and search for more like them, extract details from papers into an organized table.

semantic search research papers

Find themes and concepts across many papers

Don't just take our word for it.

semantic search research papers

Tons of features to speed up your research

Upload your own pdfs, orient with a quick summary, view sources for every answer, ask questions to papers, research for the machine intelligence age, pick a plan that's right for you, get in touch, enterprise and institutions, custom pricing, common questions. great answers., how do researchers use elicit.

Over 2 million researchers have used Elicit. Researchers commonly use Elicit to:

  • Speed up literature review
  • Find papers they couldn’t find elsewhere
  • Automate systematic reviews and meta-analyses
  • Learn about a new domain

Elicit tends to work best for empirical domains that involve experiments and concrete results. This type of research is common in biomedicine and machine learning.

What is Elicit not a good fit for?

Elicit does not currently answer questions or surface information that is not written about in an academic paper. It tends to work less well for identifying facts (e.g. "How many cars were sold in Malaysia last year?") and in theoretical or non-empirical domains.

What types of data can Elicit search over?

Elicit searches across 125 million academic papers from the Semantic Scholar corpus, which covers all academic disciplines. When you extract data from papers in Elicit, Elicit will use the full text if available or the abstract if not.

How accurate are the answers in Elicit?

A good rule of thumb is to assume that around 90% of the information you see in Elicit is accurate. While we do our best to increase accuracy without skyrocketing costs, it’s very important for you to check the work in Elicit closely. We try to make this easier for you by identifying all of the sources for information generated with language models.

What is Elicit Plus?

Elicit Plus is Elicit's subscription offering, which comes with a set of features, as well as monthly credits. On Elicit Plus, you may use up to 12,000 credits a month. Unused monthly credits do not carry forward into the next month. Plus subscriptions auto-renew every month.

What are credits?

Elicit uses a credit system to pay for the costs of running our app. When you run workflows and add columns to tables it will cost you credits. When you sign up you get 5,000 credits to use. Once those run out, you'll need to subscribe to Elicit Plus to get more. Credits are non-transferable.

How can you get in contact with the team?

You can email us at [email protected] or post in our Slack community! We log and incorporate all user comments, and will do our best to reply to every inquiry as soon as possible.

What happens to papers uploaded to Elicit?

When you upload papers to analyze in Elicit, those papers will remain private to you and will not be shared with anyone else.

How accurate is Elicit?

Elicit improves accuracy by training its models on specific tasks, searching only over academic papers, and making it easy to double-check answers.

Literature Reviews

  • Getting Started
  • Choosing a Type of Review
  • Developing a Research Question
  • Searching the Literature
  • Searching Tips

Literature Searching using Artificial Intelligence

  • Research Rabbit
  • Semantic Scholar
  • ChatGPT [beta]
  • Documenting your Search
  • Using Citation Managers
  • Concept Mapping
  • Writing the Review
  • Further Resources

Plug-ins for GenAI

Artificial Intelligence tools are fast-changing. Make sure to check each tool for features you are looking for. 

Click the tool name below to jump directly there.

  • ChatGPT
  • Elicit AI
  • Research Rabbit
  • Connected Papers
  • Semantic Scholar
  • tbd


www.researchrabbit.ai

Research Rabbit covers hundreds of millions of academic articles, including more than 90% of the materials found in major databases used by academic institutions (such as Scopus, Web of Science, and others).
  • See its FAQs page. Search algorithms were borrowed from NIH and Semantic Scholar.

The default “Untitled Collection” will collect your search histories, based on which Research Rabbit will send you recommendations for three types of related results: Similar Works / Earlier Works / Later Works, viewable in graphs such as Network, Timeline, First Authors, etc.

Zotero integration: importing and exporting between these two apps.

  • Example - SERVQUAL: A multiple-item scale for measuring consumer perceptions of service quality [Login required]. Try it to see its Similar Works, Earlier Works and Later Works, or other documents.
  • Export Results - Findings can be exported in BibTxt, RIS or CSV format.


MORE RESOURCES

Video Introduction to Research Rabbit


https://elicit.org

Elicit is a research assistant using language models like GPT-3 to automate parts of researchers’ workflows. Currently, the main workflow in Elicit is Literature Review. If you ask a question, Elicit will show relevant papers and summaries of key information about those papers in an easy-to-use table.  
  • Find answers from 175 million papers.  FAQS
  • Example - How do mental health interventions vary by age group? / Fish oil and depression. Results [Login required]: (1) summary of the top 4 papers, each with title, abstract, citations, DOI, and PDF; (2) table view: abstract / interventions / outcomes measured / number of participants; (3) relevant studies and citations; (4) click on Search for Paper Information to find metadata about sources (SJR etc.), population (age etc.), intervention (duration etc.), results (outcome, limitations etc.), and methodology (detailed study design etc.); (5) export as BIB or CSV.
  • How to Search - Enter a research question or multiple keywords about a research question, or enter the title of a paper. The starred or selected studies will lead to Semantic Scholar's site for detailed information for all citations.
  • Export Results -  Various ways to export results.
  • How to Cite  - Includes the elicit.org URL in the citation, for example: Ought; Elicit: The AI Research Assistant; https://elicit.org; accessed xxxx/xx/xx


www.semanticscholar.org

A free, AI-powered research tool for scientific literature.
  • Over 200 million papers from all fields of science.

The 4,000+ results for a sample search can be sorted by Fields of Study, Date Range, Author, Journals & Conferences.

Save the papers in your Library folder. The Research Feeds feature will recommend similar papers based on the items saved.

Example - SERVQUAL: A multiple-item scale for measuring consumer perceptions of service quality. Total Citations: 22,438 [Note: these numbers were gathered when this guide was created]; Highly Influential Citations: 2,001; Background Citations: 6,109; Methods Citations: 3,273; Results Citations: 385.

Semantic Reader - "Semantic Reader is an augmented reader with the potential to revolutionize scientific reading by making it more accessible and richly contextual." It "uses artificial intelligence to understand a document's structure and merge it with the Semantic Scholar's academic corpus, providing detailed information in context via tooltips and other overlays." <https://www.semanticscholar.org/product/semantic-reader>

Skim Papers Faster - "Find key points of this paper using automatically highlighted overlays. Available in beta on limited papers for desktop devices only." <https://www.semanticscholar.org/product/semantic-reader> Press the pen icon to activate the highlights.

TLDRs (Too Long; Didn't Read) - Try this example. Press the pen icon to reveal the highlighted key points. TLDRs "are super-short summaries of the main objective and results of a scientific paper generated using expert background knowledge and the latest GPT-3 style NLP techniques. This new feature is available in beta for nearly 60 million papers in computer science, biology, and medicine..." <https://www.semanticscholar.org/product/tldr>


Towards a Semantic Search Engine for Scientific Articles

  • Conference paper
  • First Online: 02 September 2017


  • Bastien Latard,
  • Jonathan Weber,
  • Germain Forestier &
  • Michel Hassenforder

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10450))

Included in the following conference series:

  • International Conference on Theory and Practice of Digital Libraries

2467 Accesses

9 Altmetric

Because of the data deluge in scientific publication, finding relevant information is getting harder and harder for researchers and readers. Building an enhanced scientific search engine by taking semantic relations into account poses a great challenge. As a starting point, semantic relations between keywords from scientific articles could be extracted in order to classify articles. This might help later in the process of browsing and searching for content in a meaningful scientific way. Indeed, by connecting keywords, the context of the article can be extracted. This paper aims to provide ideas to build such a smart search engine and describes the initial contributions towards achieving such an ambitious goal.
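As a toy illustration of the "connecting keywords" idea sketched in this abstract, the snippet below builds a keyword co-occurrence graph from author-supplied keywords. The papers and keywords are invented, and this is one reading of the approach rather than the authors' implementation.

    # Link keywords that appear together on a paper; edge weights count
    # how often each pair co-occurs across the collection.
    import itertools
    import networkx as nx

    papers = {  # hypothetical papers with author-supplied keywords
        "paper_a": ["semantic search", "ontology", "information retrieval"],
        "paper_b": ["ontology", "knowledge graph", "semantic search"],
        "paper_c": ["information retrieval", "ranking"],
    }

    g = nx.Graph()
    for keywords in papers.values():
        for k1, k2 in itertools.combinations(sorted(set(keywords)), 2):
            weight = g.get_edge_data(k1, k2, default={"weight": 0})["weight"]
            g.add_edge(k1, k2, weight=weight + 1)

    # Strongly connected keywords hint at the context of an article.
    for u, v, d in sorted(g.edges(data=True), key=lambda e: -e[2]["weight"]):
        print(f"{u} -- {v} (co-occurs {d['weight']}x)")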


Similar content being viewed by others

  • A Typology of Semantic Relations Dedicated to Scientific Literature Analysis
  • Semantic Facets for Scientific Information Retrieval
  • Semantic Annotation of Scientific Publications Based on Integration of Concept Knowledge


Author information

Authors and Affiliations

MIPS, University of Haute-Alsace, Mulhouse, France

Bastien Latard, Jonathan Weber, Germain Forestier & Michel Hassenforder

MDPI AG, Basel, Switzerland

Bastien Latard


Corresponding author

Correspondence to Bastien Latard .

Editor information

Editors and Affiliations

Faculteit der Geesteswetenschappen, Universiteit van Amsterdam, Amsterdam, The Netherlands

Jaap Kamps

Library & Information Center, University of Patras, Patras, Greece

Giannis Tsakonas

Aristotle University of Thessaloniki, Thessaloniki, Greece

Yannis Manolopoulos

Civil Engineering, University of Thrace, Kimmeria, Greece

Lazaros Iliadis

Informatics, Ionian University, Kerkyra, Greece

Ioannis Karydis


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Latard, B., Weber, J., Forestier, G., Hassenforder, M. (2017). Towards a Semantic Search Engine for Scientific Articles. In: Kamps, J., Tsakonas, G., Manolopoulos, Y., Iliadis, L., Karydis, I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science(), vol 10450. Springer, Cham. https://doi.org/10.1007/978-3-319-67008-9_54

Download citation

DOI : https://doi.org/10.1007/978-3-319-67008-9_54

Published : 02 September 2017

Publisher Name : Springer, Cham

Print ISBN : 978-3-319-67007-2

Online ISBN : 978-3-319-67008-9

eBook Packages : Computer Science Computer Science (R0)




The Semantic Reader Open Research Platform

Semantic Reader Project is a collaborative effort of NLP + HCI researchers from non-profit, industry, and academic institutions to create interactive, intelligent reading interfaces for scholarly papers. Our research led to the creation of Semantic Reader, an application used by tens of thousands of scholars each week.

The Semantic Reader Open Research Platform provides resources that enable the broader research community to explore exciting challenges around novel research support tools: PaperMage , a library for processing and analyzing scholarly PDFs, and PaperCraft , a React UI component for building augmented and interactive reading interfaces. Join us in designing the future of scholarly reading interfaces with our open source libraries!


Open Source Libraries

We provide PaperMage + PaperCraft for building intelligent and interactive paper readers. Below we showcase how to extract text from a PDF, prompt an LLM for term definitions, and then visually augment the paper with highlights and popups.
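A minimal sketch of that workflow follows, assuming PaperMage's documented CoreRecipe quickstart; the PDF path is a placeholder, the LLM call is left abstract, and PaperCraft's React rendering step is omitted.

    # Parse a scholarly PDF with PaperMage, then build an LLM prompt for
    # term definitions. Assumes: pip install papermage
    from papermage.recipes import CoreRecipe

    recipe = CoreRecipe()
    doc = recipe.run("paper.pdf")  # placeholder path to a scholarly PDF

    full_text = doc.symbols        # the document's text as one string
    passage = full_text[:1500]     # choose a passage to annotate

    prompt = (
        "List and define the technical terms in this passage "
        "from a research paper:\n\n" + passage
    )
    # Send `prompt` to an LLM of your choice; a PaperCraft-based reader would
    # then anchor the returned definitions to highlighted spans as popups.
    print(prompt[:300])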

Process and Analyze Scholarly PDF Documents

Create Visually Augmented Interactive Readers

Research Prototype Showcase

Here we present several interactive demos to showcase systems you can build with PaperMage and PaperCraft.


Augmenting Research Papers with Author Talk Videos

Demo Paper Presentation


Synergi & Threddy

Clipping Research Threads from Papers for Synthesis and Exploration

Paper Presentation


Paper Plain

Making Medical Research Papers Approachable to Healthcare Consumers

Demo Code Tutorial Paper


LLM Paper Q&A

A GPT-powered PDF QA system with attribution support.

Demo Code Tutorial


Augmenting Citations in Papers with Persistent and Personalized Context

In-Production Paper Presentation


Localizing Incoming Citations from Follow on Papers in the Margins


Automatic highlights for skimming support of scientific papers

In-Production Paper


Augmenting Papers with Just-in-Time Definitions of Terms and Symbols

Founding Project Demo Paper

Publications

Semantic Reader Project Overview

The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces Kyle Lo, Joseph Chee Chang, Andrew Head, Jonathan Bragg, Amy X. Zhang, Cassidy Trier, Chloe Anastasiades, Tal August, Russell Authur, Danielle Bragg, Erin Bransom, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Yen-Sung Chen, Evie (Yu-Yen) Cheng, Yvonne Chou, Doug Downey, Rob Evans, Raymond Fok, F.Q. Hu, Regan Huff, Dongyeop Kang, Tae Soo Kim, Rodney Michael Kinney, A. Kittur, Hyeonsu B Kang, Egor Klevak, Bailey Kuehl, Michael Langan, Matt Latzke, Jaron Lochner, Kelsey MacMillan, Eric Stuart Marsh, Tyler Murray, Aakanksha Naik, Ngoc-Uyen Nguyen, Srishti Palani, Soya Park, Caroline Paulic, Napol Rachatasumrit, Smita R Rao, P. Sayre, Zejiang Shen, Pao Siangliulue, Luca Soldaini, Huy Tran, Madeleine van Zuylen, Lucy Lu Wang, Christopher Wilhelm, Caroline M Wu, Jiangjiang Yang, Angele Zamarron, Marti A. Hearst, Daniel S. Weld . ArXiv. 2023 .

Interactive and Intelligent Reading Interfaces

Qlarify: Bridging Scholarly Abstracts and Papers with Recursively Expandable Summaries Raymond Fok, Joseph Chee Chang, Tal August, Amy X. Zhang, Daniel S. Weld . ArXiv. 2023 .

Papeos: Augmenting Research Papers with Talk Videos Tae Soo Kim, Matt Latzke, Jonathan Bragg, Amy X. Zhang, Joseph Chee Chang . The ACM Symposium on User Interface Software and Technology. 2023 .

Synergi: A Mixed-Initiative System for Scholarly Synthesis and Sensemaking Hyeonsu B Kang, Sherry Wu, Joseph Chee Chang, A. Kittur . The ACM Symposium on User Interface Software and Technology. 2023 .

🏆 Best Paper Award CiteSee: Augmenting Citations in Scientific Papers with Persistent and Personalized Historical Context Joseph Chee Chang, Amy X. Zhang, Jonathan Bragg, Andrew Head, Kyle Lo, Doug Downey, Daniel S. Weld . Proceedings of the CHI Conference on Human Factors in Computing Systems. 2023 .

Relatedly: Scaffolding Literature Reviews with Existing Related Work Sections Srishti Palani, Aakanksha Naik, Doug Downey, Amy X. Zhang, Jonathan Bragg, Joseph Chee Chang . Proceedings of the CHI Conference on Human Factors in Computing Systems. 2023 .

CiteRead: Integrating Localized Citation Contexts into Scientific Paper Reading Napol Rachatasumrit, Jonathan Bragg, Amy X. Zhang, Daniel S. Weld . 27th International Conference on Intelligent User Interfaces. 2022 .

🏆 Best Paper Award Math Augmentation: How Authors Enhance the Readability of Formulas using Novel Visual Design Practices Andrew Head, Amber Xie, Marti A. Hearst . Proceedings of the CHI Conference on Human Factors in Computing Systems. 2022 .

Scim: Intelligent Skimming Support for Scientific Papers Raymond Fok, Hita Kambhamettu, Luca Soldaini, Jonathan Bragg, Kyle Lo, Andrew Head, Marti A. Hearst, Daniel S. Weld . Proceedings of the 28th International Conference on Intelligent User Interfaces. 2022 .

Exploring Team-Sourced Hyperlinks to Address Navigation Challenges for Low-Vision Readers of Scientific Papers Soya Park, Jonathan Bragg, Michael Chang, K. Larson, Danielle Bragg . Proceedings of the ACM on Human-Computer Interaction. 2022 .

Paper Plain: Making Medical Research Papers Approachable to Healthcare Consumers with Natural Language Processing Tal August, Lucy Lu Wang, Jonathan Bragg, Marti A. Hearst, Andrew Head, Kyle Lo . ACM Transactions on Computer-Human Interaction. 2022 . Presentation at CHI 2024.

Threddy: An Interactive System for Personalized Thread-based Exploration and Organization of Scientific Literature Hyeonsu B Kang, Joseph Chee Chang, Yongsung Kim, A. Kittur . Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 2022 .

🏆 Best Paper Award SciA11y: Converting Scientific Papers to Accessible HTML Lucy Lu Wang, Isabel Cachola, Jonathan Bragg, Evie (Yu-Yen) Cheng, Chelsea Hess Haupt, Matt Latzke, Bailey Kuehl, Madeleine van Zuylen, Linda M. Wagner, Daniel S. Weld . Proceedings of the 23rd International ACM SIGACCESS Conference on Computers and Accessibility. 2021 .

Augmenting Scientific Papers with Just-in-Time, Position-Sensitive Definitions of Terms and Symbols Andrew Head, Kyle Lo, Dongyeop Kang, Raymond Fok, Sam Skjonsberg, Daniel S. Weld, Marti A. Hearst . Proceedings of the CHI Conference on Human Factors in Computing Systems. 2020 .

Open Research Resources: Libraries, Models, Datasets

🏆 Best Paper Award PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually-Rich Scientific Documents Kyle Lo, Zejiang Shen, Benjamin Newman, Joseph Chee Chang, Russell Authur, Erin Bransom, Stefan Candra, Yoganand Chandrasekhar, Regan Huff, Bailey Kuehl, Amanpreet Singh, Chris Wilhelm, Angele Zamarron, Marti A. Hearst, Daniel S. Weld, Doug Downey, Luca Soldaini. Conference on Empirical Methods in Natural Language Processing: Demos. 2023.

A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, Kyle Lo. 2023.

🏆 Best Paper Award LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Pradeep Dasigi, Arman Cohan, Kyle Lo . ArXiv. 2023 .

Are Layout-Infused Language Models Robust to Layout Distribution Shifts? A Case Study with Scientific Documents Catherine Chen, Zejiang Shen, D. Klein, G. Stanovsky, Doug Downey, Kyle Lo . ArXiv. 2023 .

The Semantic Scholar Open Data Platform Rodney Michael Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, D. Graham, F.Q. Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Bailey Kuehl, Michael Langan, Daniel Lin, Haokun Liu, Kyle Lo, Jaron Lochner, Kelsey MacMillan, Tyler Murray, Christopher Newell, Smita R Rao, Shaurya Rohatgi, P. Sayre, Zejiang Shen, Amanpreet Singh, Luca Soldaini, Shivashankar Subramanian, A. Tanaka, Alex D Wade, Linda M. Wagner, Lucy Lu Wang, Christopher Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Madeleine van Zuylen, Daniel S. Weld . ArXiv. 2023 .

VILA: Improving Structured Content Extraction from Scientific PDFs Using Visual Layout Groups Zejiang Shen, Kyle Lo, Lucy Lu Wang, Bailey Kuehl, Daniel S. Weld, Doug Downey . Transactions of the Association for Computational Linguistics. 2021 .

Document-Level Definition Detection in Scholarly Documents: Existing Models, Error Analyses, and Future Directions Dongyeop Kang, Andrew Head, Risham Sidhu, Kyle Lo, Daniel S. Weld, Marti A. Hearst . Proceedings of the First Workshop on Scholarly Document Processing @ ACL. 2020 .

See the Project Overview Paper for a full list of contributors. For questions and inquiries, please contact Joseph Chee Chang (PaperCraft & intelligent reading interfaces), or Kyle Lo and Luca Soldaini (PaperMage & scientific document processing).

Research Advisory Board

  • Intelligent reading interfaces research
  • Scientific document processing research
  • Research libraries and tooling


Search Results

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

14 code implementations • 20 Sep 2019

To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from CodeSearchNet Corpus.

Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus

1 code implementation • 27 Jan 2022

Different from typical information retrieval tasks, code search requires bridging the semantic gap between programming languages and natural language in order to better describe intrinsic concepts and semantics.
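A common way to bridge that gap is a bi-encoder: embed code snippets and natural-language queries into a shared vector space and retrieve by similarity. The sketch below is generic, using a general-purpose text encoder as a stand-in for a code-specific model; the model choice and snippets are illustrative, not the paper's setup.

    # Bi-encoder code search: embed code and query, retrieve by similarity.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for a code encoder

    snippets = [
        "def flatten(xs):\n    return [x for sub in xs for x in sub]",
        "def read_json(path):\n    import json\n    return json.load(open(path))",
    ]
    query = "load a json file from disk"

    code_emb = model.encode(snippets, convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)

    hits = util.semantic_search(query_emb, code_emb, top_k=1)[0]
    print(snippets[hits[0]["corpus_id"]])  # the json-loading snippet should win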


Template-Based Automatic Search of Compact Semantic Segmentation Architectures

1 code implementation • 4 Apr 2019

Automatic search of neural architectures for various vision and natural language tasks is becoming a prominent tool as it allows to discover high-performing structures on any dataset of interest.


Fast Neural Architecture Search of Compact Semantic Segmentation Models via Auxiliary Cells

4 code implementations • CVPR 2019

While most results in this domain have been achieved on image classification and language modelling problems, here we concentrate on dense per-pixel tasks, in particular, semantic image segmentation using fully convolutional networks.


PSCS: A Path-based Neural Model for Semantic Code Search

1 code implementation • 17 Aug 2020

Deep learning models have been proposed to address this challenge.

Software Engineering

AdANNS: A Framework for Adaptive Semantic Search

1 code implementation • NeurIPS 2023

Finally, we demonstrate that AdANNS can enable inference-time adaptivity for compute-aware search on ANNS indices built non-adaptively on matryoshka representations.

FoveaBox: Beyond Anchor-based Object Detector

7 code implementations • 8 Apr 2019

In FoveaBox, an instance is assigned to adjacent feature levels to make the model more accurate. We demonstrate its effectiveness on standard benchmarks and report extensive experimental analysis.


CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval

2 code implementations • 21 Apr 2023

We introduce CLaMP: Contrastive Language-Music Pre-training, which learns cross-modal representations between natural language and symbolic music using a music encoder and a text encoder trained jointly with a contrastive loss.
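The contrastive objective mentioned here is, in spirit, the symmetric InfoNCE loss popularized by CLIP. Below is a generic PyTorch sketch of that loss, our paraphrase of the standard recipe rather than CLaMP's actual code; the in-batch pairing and temperature value are assumptions.

    # Symmetric contrastive (InfoNCE) loss for paired text/music embeddings.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(text_emb, music_emb, temperature=0.07):
        # Normalize so that dot products are cosine similarities.
        text_emb = F.normalize(text_emb, dim=-1)
        music_emb = F.normalize(music_emb, dim=-1)
        logits = text_emb @ music_emb.T / temperature
        # Row i and column i hold the matching (text, music) pair.
        targets = torch.arange(len(text_emb))
        loss_text = F.cross_entropy(logits, targets)     # text -> music
        loss_music = F.cross_entropy(logits.T, targets)  # music -> text
        return (loss_text + loss_music) / 2

    loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))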


A Generative Approach for Wikipedia-Scale Visual Entity Recognition

2 code implementations • CVPR 2024

In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia.

Code Execution with Pre-trained Language Models

1 code implementation • 8 May 2023

Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code.



A Study on Semantic Searching, Semantic Search Engines and Technologies Used for Semantic Search Engines

Junaid Rashid

2016, International Journal of Information Technology and Computer Science

Related Papers

International Journal of Computer Applications

Neelam Duhan


https://www.ijrrjournal.com/IJRR_Vol.6_Issue.10_Oct2019/Abstract_IJRR0012.html

International Journal of Research & Review (IJRR)

Semantic Search is a search technique that improves searching precision by understanding the purpose of the search and the contextual significance of words as they appear in the searchable data space, whether on the web or elsewhere, to generate more relevant results. We highlight here Semantic Search and the Semantic Web, discuss different types of semantic search engines, the differences between keyword-based search and Semantic Search, and the advantages of Semantic Search. We also give a brief overview of the history of semantic search and its future scope.

gagan narula

G Anuradha Assoc. Prof., CSE, VRSEC

International Journal of Recent Contributions from Engineering, Science & IT (iJES)

Margret Anouncia

Advances in Science, Technology and Engineering Systems Journal

bzar hussan

Moushmee Kuri

With the rapid development of the World Wide Web, the search engine has become one of the main tools people use to find information on the network. However, search results are widely criticized for their lack of accuracy and their redundancy. The Semantic Web is a technology for storing data in a machine-readable format, making it possible for machines to intelligently match that data with related data based on its semantics. This paper starts from the traditional search engine, first introducing its classification, popular technologies, advantages and disadvantages, together with deeper knowledge of semantic technology, and then leads to the semantic search engine model.

Kanwalvir Singh Dhindsa

As we know, the WWW allows people to share huge amounts of information globally from big database repositories, and the amount of information grows across billions of databases. Hence, to search for particular information within these huge databases, we need a specialized mechanism that helps retrieve it efficiently. Nowadays, various types of search engines are available, which makes information retrieval difficult. To provide a better solution to this problem, semantic web search engines are playing a vital role. The main aim of this kind of search engine is to provide the required information in a short time with maximum accuracy. But the problem with semantic search engines is that they are vulnerable when answering intelligent queries. These kinds of search engines are not as efficient as end users expect, as most of the time they provide inaccurate information. Thus, in this paper we present a new approach for semantic ...

sudeepthi govathoti

International Journal of Scientific Research in Computer Science, Engineering and Information Technology

International Journal of Scientific Research in Computer Science, Engineering and Information Technology IJSRCSEIT

Search engines play an important role in the success of the Web, helping users find relevant information on the internet. The many problems of traditional search engines have led to the development of the Semantic Web. Semantic web technologies play a crucial role in enhancing traditional search, as they work to create machine-readable data and focus on metadata; however, they will not replace traditional search engines. In the semantic web environment, search engines should be more useful and efficient at finding relevant web information, offering a way to increase the accuracy of information retrieval systems. This is possible because the semantic web uses software agents, which collect information, perform relevant transactions and interact with physical devices. This paper includes a survey of the prevalent semantic search engines based on their advantages, operation and disadvantages, and presents a comparative study based on techniques, type of results, crawling, and indexing.


RELATED PAPERS

International Journal IJRITCC

2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation

Arooj Fatima

International Journal of Engineering Research and

Mahendra Salunke

IJRES Journal

Pranav Makkar

Anindya Basu , Mita Paul

International Journal of Web & Semantic Technology

DR. A. GOVARDHAN

International Journal of Web Engineering

Sumayah Hamad

Digital Ecosystems and …

Erneto CHang

Dr Pawan Singh

International Journal of Engineering Research and Technology (IJERT)

IJERT Journal

International Journal of Scientific Research in Science, Engineering and Technology IJSRSET

Gladys Gnana Kiruba

International Journal of Future Computer and Communication

Majid Qureshi

JACOTECH The Research Group

Usman Siddiqui

Konstantinos I Kotis



Computer Science > Computer Vision and Pattern Recognition

Title: Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather

Abstract: Existing LiDAR semantic segmentation methods often struggle with performance declines in adverse weather conditions. Previous research has addressed this issue by simulating adverse weather or employing universal data augmentation during training. However, these methods lack a detailed analysis and understanding of how adverse weather negatively affects LiDAR semantic segmentation performance. Motivated by this issue, we identified key factors of adverse weather and conducted a toy experiment to pinpoint the main causes of performance degradation: (1) geometric perturbation due to refraction caused by fog or droplets in the air, and (2) point drop due to energy absorption and occlusions. Based on these findings, we propose new strategic data augmentation techniques. First, we introduce Selective Jittering (SJ), which jitters points within a random range of depth (or angle) to mimic geometric perturbation. Additionally, we develop a Learnable Point Drop (LPD), which uses a Deep Q-Learning network to learn vulnerable erase patterns that approximate the point-drop phenomenon in adverse weather. Without precise weather simulation, these techniques strengthen the LiDAR semantic segmentation model by exposing it to the vulnerable conditions identified by our data-centric analysis. Experimental results confirm the suitability of the proposed data augmentation methods for enhancing robustness against adverse weather. Our method attains a remarkable 39.5 mIoU on the SemanticKITTI-to-SemanticSTF benchmark, surpassing the previous state of the art by over 5.4 percentage points and tripling the improvement over the baseline achieved by previous methods.
Comments: 19 pages, 6 figures, accepted at ECCV 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as: arXiv:2407.02286 [cs.CV]
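To make the Selective Jittering strategy from the abstract above tangible, here is a hedged NumPy sketch: perturb only points whose depth falls inside a randomly drawn band. The band width and noise scale are our assumptions, this is one reading of the abstract rather than the authors' released code, and the Learnable Point Drop component is omitted.

    # Sketch of Selective Jittering: perturb points in a random depth band.
    import numpy as np

    def selective_jitter(points, depth_range=(10.0, 30.0), sigma=0.05, rng=None):
        # points: (N, 3) array of LiDAR xyz coordinates.
        rng = rng if rng is not None else np.random.default_rng()
        depths = np.linalg.norm(points, axis=1)
        lo = rng.uniform(*depth_range)    # random band start (assumed range)
        hi = lo + rng.uniform(2.0, 10.0)  # random band width (assumed)
        mask = (depths >= lo) & (depths < hi)
        jittered = points.copy()
        # Gaussian noise mimics refraction-induced geometric perturbation.
        jittered[mask] += rng.normal(0.0, sigma, size=(mask.sum(), 3))
        return jittered

    cloud = np.random.default_rng(0).uniform(-50.0, 50.0, size=(1000, 3))
    augmented = selective_jitter(cloud)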


  • DOI: 10.1088/1742-6596/2788/1/012019
  • Corpus ID: 270760139

Research on optimization and improvement method of new energy access grid stability based on transient stability margin index

  • Kun Wang , Tianzhi Zhao , +4 authors Yuhang Wang
  • Published in Journal of Physics… 1 June 2024
  • Engineering, Environmental Science, Physics
  • Journal of Physics: Conference Series

9 References (excerpt)

  • Controllable Transient Power Sharing of Inverter-Based Droop-Controlled Microgrid
  • Adaptive Power Capability Control Scheme for Voltage Source Converter to Improve Transient Stability

COMMENTS

  1. Semantic Scholar

    Semantic Reader is an augmented reader with the potential to revolutionize scientific reading by making it more accessible and richly contextual. Try it for select papers. Semantic Scholar uses groundbreaking AI and engineering to understand the semantics of scientific literature to help Scholars discover relevant research.

  2. Semantic Scholar

    Experience a smarter way to search and discover scholarly research. Create Your Account. Semantic Scholar provides free, AI-driven research tools and open resources for all researchers. Search and cite any papers, manage your reading lists in your personal library, and get AI-powered paper recommendations just for you.

  3. [2202.08904] SGPT: GPT Sentence Embeddings for Semantic Search

    This prevents possibly new state-of-the-art results and forces organizations to train and maintain separate models. To this end, we propose SGPT to use decoders for sentence embeddings and semantic search via prompting or fine-tuning. At 5.8 billion parameters SGPT improves on the previously best sentence embeddings by a margin of 7% and ...

  4. Semantic Scholar

    Semantic Scholar's records for research papers published in all fields, provided as an easy-to-use JSON archive. ... Chief Scientist at Semantic Scholar, to discuss how search engines are helping scientists explore and innovate by making it easier to draw connections from a massive collection of scientific literature.

  5. COVID-19 information retrieval with deep-learning based semantic search

    CO-Search indexes content from over 400,000 scientific papers made available through the COVID-19 Open Research Dataset Challenge (CORD-19) 9 —an initiative put forth by the US White House and ...

  6. 19894 PDFs

    Explore the latest full-text research PDFs, articles, conference papers, preprints and more on SEMANTIC SEARCH. Find methods information, sources, references or conduct a literature review on ...

  7. semantic search engine Latest Research Papers

    Semantic Search Engine. This paper presents the final results of the research project that aimed to build a Semantic Search Engine that uses an Ontology and a model trained with Machine Learning to support the semantic search of research projects of the System of Research of the University of Nariño.

  8. search the arXiv

    A simple semantic search engine for ML papers on arXiv. Insert an arXiv link to find similar papers or use natural language to describe what you are looking for.

  9. Connected Papers

    Connected Papers is a visual tool to help researchers and applied scientists find academic papers relevant to ... With Connected Papers you can just search and visually discover important recent papers. ... we use the Semantic Scholar database which contains hundreds of millions of papers from all fields of science. We grow by word of mouth ...

  10. Elicit: The AI Research Assistant

    Search for research papers. Ask a research question and get back a list of relevant papers from our database of 125 million. Get one sentence abstract summaries. ... Elicit searches across 125 million academic papers from the Semantic Scholar corpus, which covers all academic disciplines. When you extract data from papers in Elicit, Elicit will ...

  11. (PDF) Survey of Semantic Search Research

    This paper surveys the research field of semantic search, i.e. search utilizing semantic techniques or search of formally annotated semantic content. The survey identifies and discusses various ...

  12. Research Guides: Literature Reviews: AI Lit Searching [beta]

    How to Search - Enter a research question or multiple keywords about a research question, or enter the title of a paper. The starred or selected studies will lead to Semantic Scholar's site for detailed information for all citations. Export Results - Various ways to export results. How to Cite - Includes the elicit.org URL in the citation, for example:

  13. Search for semantic search indexing

    The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain. 5 code implementations • 2 Sep 2016. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination ...

  14. Towards a Semantic Search Engine for Scientific Articles

    Towards a Semantic Search Engine for Scientific Articles. Conference paper. First Online: 02 September 2017. pp 608-611. Research and Advanced Technology for Digital Libraries (TPDL 2017). Bastien Latard,

  15. Semantic Scholar

    March 25, 2023. TLDR. This paper describes the Semantic Reader Project, a collaborative effort across multiple institutions to explore automatic creation of dynamic reading interfaces for research papers, and develops and releases a production reading interface that will incorporate the best features as they mature.

  16. SGPT: GPT Sentence Embeddings for Semantic Search

    This prevents possibly new state-of-the-art results and forces organizations to train and maintain separate models. To this end, we propose SGPT to use decoders for sentence embeddings and semantic search via prompting or fine-tuning. At 5.8 billion parameters SGPT improves on the previously best sentence embeddings by a margin of 7% and ...

  17. The Semantic Reader Project: Augmenting Scholarly Documents through AI

    Semantic Scholar helps scholars discover research papers. Automated summarization for research papers [11] helps scholars triage between research papers. But when it comes to actually reading research papers, the process, based on a static PDF format, has remained largely unchanged for many decades. This is a problem

  18. Semantic Reader Open Research Platform

    Semantic Reader Project is a broad collaborative effort across multiple non-profit, industry, and academic institutions to create interactive, intelligent reading interfaces for research papers. The Semantic Reader research papers summarizes our efforts in combining AI and HCI research to design novel, AI-powered interactive reading interfaces that address a variety of user challenges faced by ...

  19. Semantic search

    We strive to create an environment conducive to many different types of research across many different time scales and levels of risk. ... Semantic Search. Ramanathan V. Guha, Rob McCool, Eric Miller. WWW (2003), pp. 700-709.

  20. A survey and classification of semantic search approaches

    A classification scheme for semantic search engines is introduced to clarify terminology and identify not only common concepts and outstanding features, but also open issues. A broad range of approaches to semantic document retrieval has been developed in the context of the Semantic Web. This survey builds bridges among them. We introduce a classification scheme ...

  21. Search for Semantic Code Search

    Stay informed on the latest trending ML papers with code, research developments, libraries, methods, and datasets. ... CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. 14 code implementations • 20 Sep 2019. To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the ...

  22. (PDF) Semantic Search Engine

    ISSN (Print): 0974-6846. ISSN (Online): 0974-5645. Semantic Search Engine. Shilpa S. Laddha and Pradip M. Jawandhiya. Government College of Engineering, Aurangabad − 431005, Maharashtra ...

  23. (PDF) A Study on Semantic Searching, Semantic Search Engines and

    A Study on Semantic Searching, Semantic Search Engines and Technologies Used for Semantic Search Engines. I.J. Information Technology and Computer Science, 2016, 10, 82-89. The paper also explains some issues in semantic search engines, technologies used for semantic searching, analysis of current search engines, and a comparison of ...

  24. [2407.02286v1] Rethinking Data Augmentation for Robust LiDAR Semantic

    Existing LiDAR semantic segmentation methods often struggle with performance declines in adverse weather conditions. Previous research has addressed this issue by simulating adverse weather or employing universal data augmentation during training. However, these methods lack a detailed analysis and understanding of how adverse weather negatively affects LiDAR semantic segmentation performance ...

  25. Research on optimization and improvement method of ...

    This paper focuses on the potential local voltage stability risks and transient stability margin decline in the power grid after the large-scale integration of new energy. Based on the transient stability margin index of the power grid, a method for evaluating and optimizing the transient stability margin after the integration of wind power into the power grid is proposed.