Synthesizing Text-to-SQL Data from Weak and Strong LLMs

Bibliographic Details
Published in	arXiv.org
Main Authors	Yang, Jiaxi; Hui, Binyuan; Yang, Min; Yang, Jian; Lin, Junyang; Zhou, Chang
Format	Paper
Language	English
Published	Ithaca: Cornell University Library, arXiv.org, 06.08.2024
Subjects	Large language models; Query languages; Synthetic data

Summary:	The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to-SQL tasks. In this paper, we introduce a synthetic data approach that combines data produced by larger, more powerful models (strong models) with error information data generated by smaller, not well-aligned models (weak models). The method not only enhances the domain generalization of text-to-SQL models but also explores the potential of error data supervision through preference learning. Furthermore, we employ the synthetic data approach for instruction tuning on open-source LLMs, resulting in SENSE, a specialized text-to-SQL model. The effectiveness of SENSE is demonstrated through state-of-the-art results on the SPIDER and BIRD benchmarks, bridging the performance gap between open-source models and methods prompted by closed-source models.
ISSN:	2331-8422
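
Illustrative note: the summary mentions supervising a model with weak-model error data through preference learning, but it does not say how the preference pairs are built. The sketch below shows one plausible construction, offered as an assumption rather than the authors' procedure: each weak-model SQL candidate is executed against a database, and any candidate whose result differs from that of a reference query becomes the "rejected" side of a pair. The function names, the dictionary keys, and the toy singer table are all hypothetical.

```python
import sqlite3


def run_query(conn, sql):
    """Execute sql; return its rows in a stable order, or None if it fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return sorted(rows, key=repr)


def build_preference_pairs(conn, question, reference_sql, weak_candidates):
    """Pair the reference SQL (chosen) with each wrong candidate (rejected).

    Assumption: a candidate counts as wrong when it fails to execute or
    returns rows that differ from the reference query's rows.
    """
    expected = run_query(conn, reference_sql)
    if expected is None:
        return []  # an unusable reference gives no supervision signal
    pairs = []
    for candidate in weak_candidates:
        if run_query(conn, candidate) != expected:
            pairs.append(
                {"prompt": question, "chosen": reference_sql, "rejected": candidate}
            )
    return pairs


# Toy database standing in for a benchmark schema (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bob", 45)])

question = "How many singers are older than 40?"
reference = "SELECT COUNT(*) FROM singer WHERE age > 40"
candidates = [
    "SELECT COUNT(name) FROM singer WHERE age > 40",  # same result, not rejected
    "SELECT COUNT(*) FROM singer WHERE age >= 30",    # wrong filter
    "SELECT COUNT(*) FROM singers WHERE age > 40",    # wrong table name, fails
]

pairs = build_preference_pairs(conn, question, reference, candidates)
for pair in pairs:
    print(pair["rejected"])
```

Pairs in this prompt/chosen/rejected shape are what preference-learning objectives typically consume. Comparing execution results instead of SQL strings keeps semantically equivalent candidates out of the rejected set.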