Retrieval augmented generation for building datasets from scientific literature

Title: Retrieval augmented generation for building datasets from scientific literature
Publication Type: Journal Article
Year of Publication: 2025
Authors: Maharana, P Ranjan; Verma, A; Joshi, K
Journal: Journal of Physics-Materials
Volume: 8
Issue: 3
Pagination: 035006
Date Published: JUL
Type of Article: Article
Keywords: dataset building, Hydrogen storage, LLM, materials, RAG
Abstract

In this work, we show that employing retrieval augmented generation (RAG) with a large language model (LLM) enables us to extract accurate data from scientific literature and construct datasets. The rapid growth in publications necessitates automating the extraction of structured data, which is crucial for training machine learning (ML) models. The pipeline developed is simple and can be adjusted using natural language as input. Quantization enables us to run LLMs on consumer hardware and remove the reliance on closed-source models. Both Llama3-8B and Gemma2-9B with RAG give structured output consistently and with high accuracy as compared to direct prompting. Using the newly developed protocol, we created a dataset of metal hydrides for solid-state hydrogen storage from paper abstracts. The accuracy of the generated dataset was >88% in the cases tested. Further, we demonstrate that the generated dataset is ready to use for ML models by testing it with HYST to predict the H2 wt% at a given temperature. Thus, we demonstrate a pipeline to create datasets from scientific literature at minimal computational cost and high accuracy.

DOI10.1088/2515-7639/ade1fa
Type of Journal (Indian or Foreign)

Foreign

Impact Factor (IF)

4.3

Division category:
Physical and Materials Chemistry
Database: 
Web of Science (WoS)
