Harmonic LLMs are Trustworthy

We introduce an intuitive method to test the robustness (stability and explainability) of any black-box LLM in real-time via its local deviation from harmoniticity, denoted as \(\gamma\). To the best of our knowledge this is the first completely model-agnostic and unsupervised method of measuring th...

Full description

Saved in:
Bibliographic Details
Published inarXiv.org
Main Authors Kersting, Nicholas S, Rahman, Mohammad, Vedala, Suchismitha, Wang, Yang
Format Paper
LanguageEnglish
Published Ithaca Cornell University Library, arXiv.org 25.07.2024
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:We introduce an intuitive method to test the robustness (stability and explainability) of any black-box LLM in real-time via its local deviation from harmoniticity, denoted as \(\gamma\). To the best of our knowledge this is the first completely model-agnostic and unsupervised method of measuring the robustness of any given response from an LLM, based upon the model itself conforming to a purely mathematical standard. To show general application and immediacy of results, we measure \(\gamma\) in 10 popular LLMs (ChatGPT, Claude-2.1, Claude3.0, GPT-4, GPT-4o, Smaug-72B, Mixtral-8x7B, Llama2-7B, Mistral-7B and MPT-7B) across thousands of queries in three objective domains: WebQA, ProgrammingQA, and TruthfulQA. Across all models and domains tested, human annotation confirms that \(\gamma \to 0\) indicates trustworthiness, and conversely searching higher values of \(\gamma\) easily exposes examples of hallucination, a fact that enables efficient adversarial prompt generation through stochastic gradient ascent in \(\gamma\). The low-\(\gamma\) leaders among the models in the respective domains are GPT-4o, GPT-4, and Smaug-72B, providing evidence that mid-size open-source models can win out against large commercial models.
ISSN:2331-8422