June 6, 2024

A Gen AI Deep Dive: Marginal Differences in Large Language Models

By Ehab Naim

Large language models (LLMs) are advanced artificial intelligence (AI) models that can understand human language and generate responses much as an average person would. Examples of LLMs include ChatGPT, Mistral, Llama, and others.

As these models advance, their performance improves, which makes evaluating them more challenging. When assessed on standard benchmarks, the major models, and many rising ones, achieve very high accuracy rates, often in the high 90s. To reach that level of precision, developers and engineers solve complex language challenges throughout model development, training, testing, and deployment.

This raises a problem: Such a high level of accuracy makes it difficult to distinguish the capabilities of state-of-the-art models based on marginal differences in validation scores alone. In this article, our team explores the idea that minute variations in reported accuracy between top-performing LLMs could, in some cases, be attributable to random chance rather than inherent differences in language proficiency. The article is somewhat technical, but we will simplify the concepts as we go. Let’s dive in.

Understanding the Concepts

To support the point above, here’s a concrete example in which a basic statistical test failed to reject the null hypothesis. In simple terms, a null hypothesis (often denoted H0) is the statement that there is no significant difference or effect between the quantities being evaluated, or that they equal a specific value.

In other words, you assume there is no difference between the quantities before you examine the data, then perform statistical tests to either reject or fail to reject that assumption. In our case, the null hypothesis was that the two models are equally capable and that the observed 10-point discrepancy arose from randomness alone. Framing it this way lets our team highlight the inherent difficulty of proving the superiority of one model over another on the basis of validation metrics alone. Moreover, LLMs possess qualities reminiscent of chaotic dynamical systems, which introduces an additional source of stochasticity into evaluations.
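
To make this concrete, here is a minimal Monte Carlo sketch of that null hypothesis (an illustration of our own, using NumPy and the 164-question test with scores of 90 and 80 described in the next section). It pools the two scores into a single underlying accuracy and checks how often sampling noise alone produces a gap of 10 or more questions:

    # A rough sketch (our own illustration, not a prescribed evaluation method)
    # of the null hypothesis: both agents share one underlying accuracy, and any
    # gap comes purely from sampling a finite, 164-question test.
    import numpy as np

    rng = np.random.default_rng(42)

    n_questions = 164
    pooled_accuracy = (90 + 80) / (2 * n_questions)   # ~0.52, pooling both scores
    n_simulations = 100_000

    # Simulate two equally capable agents taking the same-length test many times.
    scores_a = rng.binomial(n_questions, pooled_accuracy, n_simulations)
    scores_b = rng.binomial(n_questions, pooled_accuracy, n_simulations)

    # Fraction of simulations in which chance alone produces a gap of 10+ questions.
    # It comes out at roughly 30%, in the same ballpark as the p-value reported below.
    print(np.mean(np.abs(scores_a - scores_b) >= 10))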

Exploring the Tests

We start with a recent evaluation involving two agents, A and B, which were selected to complete a 164-question test. Their responses were scored: Agent A answered 90 questions correctly, while Agent B answered 80 correctly. Viewed at face value, Agent A’s 10-point lead suggests a better grasp of the language being assessed. However, other explanations cannot be ruled out without more context about how the test was constructed and how the models were trained.

To get better insight, our team ran Fisher’s exact test, a statistical test used to determine whether the difference between two binomial groups (here, counts of correct and incorrect answers) could be attributed to chance. Under our previously mentioned null hypothesis, Agents A and B are equally likely to answer any given question correctly, and any discrepancy is due to randomness in sampling a finite number of questions. The results of our analysis were as follows:

  • The test consisted of 164 total queries that could be passed or failed
  • Agent A responded accurately to 90 of these questions
  • Agent B responded accurately to 80 questions
  • The p-value returned was 0.31995
  • The odds ratio was 1.277

Since the obtained p-value exceeds the typical 0.05 significance threshold, the null hypothesis could not be rejected. In other words, the 10-point difference on the 164-question test could plausibly be attributed to chance alone, and there is insufficient evidence to conclude that Agent A outperformed Agent B.
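
For readers who want to reproduce these numbers, here is a minimal sketch using SciPy’s fisher_exact function; the choice of library is ours, and any statistics package that implements Fisher’s exact test would do:

    # A minimal sketch reproducing the test above with SciPy.
    from scipy.stats import fisher_exact

    # 2x2 table of (correct, incorrect) counts: 164 - 90 = 74 and 164 - 80 = 84.
    table = [[90, 74],   # Agent A
             [80, 84]]   # Agent B

    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    print(odds_ratio, p_value)   # ~1.277 and ~0.32, matching the figures above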

Context and Insights

Additional information about the agents could certainly change this interpretation. For example, if Agent A was trained on linguistic data closely related to the 164 questions while Agent B was not, that could explain the discrepancy. Another possibility is that the prompt happened to suit one agent better than the other. Without such information, however, we cannot determine whether Agent A is genuinely more capable than Agent B. This shows how superficially divergent validation scores between leading LLMs do not necessarily reflect the superiority of one model over another: a degree of uncertainty stemming from random variation is inherently introduced whenever software approximates human language by sampling over finite datasets.

LLMs are sensitive to tiny changes, much as chaotic systems are sensitive to their starting conditions. Even a small change, like a slightly different prompt, can make an LLM behave differently. In other words, something as simple as rewording a query can steer the model toward a different region of its internal representation space (the knowledge and understanding encoded in the model), leading to subtle changes in its response.
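
The toy sketch below, which is purely illustrative and uses made-up tokens and scores rather than a real model, shows the mechanism numerically: when two next-token candidates are nearly tied, a nudge of a few hundredths to their scores is enough to flip which one is chosen, and that change compounds as generation continues:

    # A toy numerical illustration (not an actual LLM): a tiny shift in the
    # scores of two nearly tied next-token candidates changes the greedy choice.
    import numpy as np

    vocab = ["model", "system", "agent", "network"]    # hypothetical tokens
    logits = np.array([2.31, 2.30, 1.10, 0.50])        # hypothetical scores

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    # A slightly reworded prompt shifts one candidate's score by just 0.02.
    perturbed = logits + np.array([0.0, 0.02, 0.0, 0.0])

    print(vocab[int(np.argmax(logits))], softmax(logits).round(3))        # "model"
    print(vocab[int(np.argmax(perturbed))], softmax(perturbed).round(3))  # "system"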

Final Remarks

In light of our results, we strongly caution developers against claiming that one LLM surpasses another based solely on score differences amounting to fractions of a percent. A multi-faceted benchmarking approach is needed to determine whether a new LLM represents a genuine leap forward or whether the numbers are merely attributable to chance. Some degree of ambiguity is inevitable for any technique mimicking something as complex as human language, and with current evaluation tools, differences close to chance levels cannot be claimed conclusively.
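
One simple way to make that ambiguity visible, sketched below with a normal-approximation (Wald) confidence interval of our own choosing, is to report an uncertainty range alongside the headline accuracy gap rather than the gap alone:

    # A minimal sketch (normal-approximation interval, chosen for illustration)
    # of the uncertainty around the 10-point gap from the example above.
    import math

    n = 164
    p_a, p_b = 90 / n, 80 / n
    diff = p_a - p_b                                   # ~6.1 percentage points

    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    low, high = diff - 1.96 * se, diff + 1.96 * se
    print(round(low, 3), round(high, 3))               # roughly -0.047 to 0.169

    # The 95% interval comfortably includes zero, which is another way of seeing
    # that the gap is not conclusive on its own.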

Leverage Narrativa’s Knowledge and Experience for Your Content

Narrativa performs this technical work so that you don’t have to. It’s what makes the automated content produced using our platform extremely robust, and it’s a true market differentiator. We lead in the field of generative AI, with a diverse team spanning five continents and professionals with backgrounds in software engineering, finance, marketing, linguistics, business, and healthcare. Narrativa partners with leading organizations across the globe, including Microsoft, The Wall Street Journal, Dow Jones, TCS, and the Leukemia & Lymphoma Society. We help our clients automate and scale their content so that their teams’ productivity and efficiency significantly improve. In addition, our business partners see significant cost reductions and revenue increases. If you’ve been curious about LLMs, automated content, or generative AI in general, let’s talk today!

About Narrativa

Narrativa® is an internationally recognized generative AI content company that believes people and artificial intelligence are better together. Through its proprietary content automation platform, teams of all types and sizes are empowered to build and deploy smart composition, smart business intelligence reporting, and smart process optimization content solutions for internal and external audiences alike. Narrativa® helps teams produce content quickly and at scale while supporting growth in a variety of industries by saving businesses time and money. Accelerate the potential with Narrativa®.

For additional information, visit www.narrativa.com and follow on LinkedIn, Facebook, Instagram and X.

Book a demo to learn more about how our Generative AI content automation platform can transform your business.
