The paper "Science Benchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems" by Yi Zhang, Jan Deriu, George Katsogiannis-Meimarakis, Catherine Kosten, Georgia Koutrika, and Kurt Stockinger has been accepted at VLDB 2024 (Very Large Data Bases), which is considered one of the most prestigious international database research conferences. The paper introduces Science Benchmark, a complex benchmark for evaluating systems that automatically translate natural language questions into the database query language SQL. Currently, most such approaches are based on large language models such as GPT-4 by OpenAI.
The paper is the result of a research collaboration between ZHAW and Athena Research as part of the INODE project funded by the European Commission. The benchmark combines highly challenging science domains such as astrophysics and cancer research and contains hundreds of natural language questions against scientific databases, curated by both computer scientists and domain experts. It also contains synthetically generated natural language/SQL pairs based on generative AI technology.
The work by the researchers from ZHAW and Athena enables the systematic evaluation of generative AI systems for querying complex scientific databases in natural language, a race in which major AI companies as well as universities and research labs compete. The newly developed benchmark shows that the problem of translating natural language into a database query language is far from solved, and it should encourage new research efforts to tackle this hard problem.