Retrieval augmented generation for building datasets from scientific literature
Title | Retrieval augmented generation for building datasets from scientific literature |
Publication Type | Journal Article |
Year of Publication | 2025 |
Authors | Maharana, P Ranjan, Verma, A, Joshi, K |
Journal | Journal of Physics-Materials |
Volume | 8 |
Issue | 3 |
Pagination | 035006 |
Date Published | JUL |
Type of Article | Article |
Keywords | dataset building, Hydrogen storage, LLM, materials, RAG |
Abstract | In this work, we show that employing retrieval augmented generation (RAG) with a large language model (LLM) enables us to extract accurate data from scientific literature and construct datasets. The rapid growth in publications necessitates automating the extraction of structured data, as it is crucial for training machine learning (ML) models. The pipeline developed is simple and can be adjusted using natural language as input. Quantization enables us to run LLMs on consumer hardware and removes the reliance on closed-source models. Both Llama3-8B and Gemma2-9B with RAG give structured output consistently and with high accuracy compared to direct prompting. Using the newly developed protocol, we created a dataset of metal hydrides for solid-state hydrogen storage from paper abstracts. The accuracy of the generated dataset was >88% in the cases tested. Further, we demonstrate that the generated dataset is ready to use for ML models by testing it with HYST to predict the H2 wt% at a given temperature. Thus, we demonstrate a pipeline to create datasets from scientific literature at minimal computational cost and with high accuracy. |
DOI | 10.1088/2515-7639/ade1fa |
Type of Journal (Indian or Foreign) | Foreign |
Impact Factor (IF) | 4.3 |