Optimal Hyperparameter Combination for Improving Table Data Recognition in Large Language Model-Based Retrieval-Augmented Generation Systems

Bibliographic Details
Published in	한국정보통신학회논문지 (Journal of the Korea Institute of Information and Communication Engineering) Vol. 28; no. 11; pp. 1282-1290
Main Authors	Min-Su Jung (정민수), Jung-Hun Lee (이정훈)
Format	Journal Article
Language	Korean
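Published	한국정보통신학회 (Korea Institute of Information and Communication Engineering) 01.11.2024
Subjects	Electronics/Information and Communication Engineering; Retrieval-Augmented Generation (RAG); Large Language Model (LLM); Table QA; Data Preprocessing; LangChain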
ISSN	2234-4772 2288-4165
DOI	10.6109/jkiice.2024.28.11.1282

Summary:	Large Language Models (LLMs) handle unstructured data such as natural language well, but their performance declines significantly on structured data such as tables. To address this limitation, this study proposes an optimal combination of hyperparameters for improving table data recognition in a question-answering system based on Retrieval-Augmented Generation (RAG). Preprocessing techniques are applied so that table data can be handled effectively, and the experiments use corpora built from the preprocessed tables. The main focus was on finding the best-performing hyperparameter combination by varying chunk size and overlap size. The experimental results show that the optimal hyperparameter combination differed from one language model to another. Chunk size had little effect on response quality, whereas introducing overlap consistently improved performance. Future research will extend these findings with further experiments on structured data from various domains. KCI Citation Count: 0
Bibliography:	http://jkiice.org
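
The summary describes varying chunk size and overlap size when splitting a preprocessed table corpus for retrieval. A minimal sketch of that idea follows, in plain Python. It is an illustration only, not the authors' implementation: the record lists LangChain as a subject but does not state which splitter or which values were used, so the function names `chunk_text` and `hyperparameter_grid`, the character-based windowing, and the sample values are all assumptions.

```python
from itertools import product


def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character windows of length chunk_size.

    Consecutive windows share `overlap` characters, so content that falls
    on a boundary (for example, one table row) appears in both chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def hyperparameter_grid(chunk_sizes, overlaps):
    """Return every (chunk_size, overlap) pair where overlap < chunk_size."""
    return [(c, o) for c, o in product(chunk_sizes, overlaps) if o < c]


if __name__ == "__main__":
    # Hypothetical preprocessed table: each row serialized as one line.
    corpus = "item: A, price: 10\nitem: B, price: 20\nitem: C, price: 30\n"

    # Hypothetical candidate values; the record does not list the real ones.
    for size, overlap in hyperparameter_grid([20, 40], [0, 10]):
        chunks = chunk_text(corpus, size, overlap)
        print(f"chunk_size={size} overlap={overlap} -> {len(chunks)} chunks")
```

In an actual evaluation, each pair from the grid would be used to build a retrieval index, and the question-answering quality would be compared across pairs and across language models. That scoring step is omitted here because the record does not describe the metric.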