Arthur, the New York City-based developer of monitoring tools for large language models (LLMs), this week launched Arthur Bench, an open-source evaluation tool that it says compares LLMs, prompts, and hyperparameters for generative text models.
According to the company, the offering enables businesses to evaluate how different LLMs perform in real-world scenarios so they can make data-driven decisions when integrating the latest AI technologies into their operations.
“The AI landscape is rapidly evolving,” it said. “Keeping abreast of advancements and ensuring that a company’s LLM choice remains the best fit in terms of performance viability is crucial. Arthur Bench helps companies compare the different LLM options available using a consistent metric so they can determine the best fit for their application.”
Company co-founder and chief executive officer (CEO) Adam Wenchel said, “Understanding the differences in performance between LLMs can have an incredible amount of nuance. With Bench, we have created an open-source tool to help teams deeply understand the differences between LLM providers, different prompting and augmentation strategies, and custom training regimes.”
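The core idea the company describes, scoring each model's outputs against the same references with one consistent metric and comparing the aggregates, can be sketched in a few lines. This is a hypothetical illustration of the concept only, not Arthur Bench's actual API; all function names and data here are invented for the example.

```python
# Illustrative sketch of consistent-metric LLM comparison (not Arthur Bench's API).
# Each model answers the same prompts; we score every answer against a
# reference with one metric (token-overlap F1) and compare mean scores.

def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between a candidate answer and a reference answer."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    overlap = len(set(cand) & set(ref))
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def benchmark(outputs_by_model: dict[str, list[str]],
              references: list[str]) -> dict[str, float]:
    """Mean F1 per model over the same prompt set -- one consistent metric."""
    return {
        model: sum(token_f1(c, r) for c, r in zip(candidates, references))
               / len(references)
        for model, candidates in outputs_by_model.items()
    }

# Hypothetical outputs from two models on one shared prompt.
references = ["Paris is the capital of France"]
outputs = {
    "model_a": ["The capital of France is Paris"],
    "model_b": ["I am not sure"],
}
scores = benchmark(outputs, references)
best = max(scores, key=scores.get)  # the model with the highest mean score
```

In practice a tool like Bench would swap in richer scoring methods (exact match, embedding similarity, LLM-graded rubrics) and aggregate over many prompts, but the comparison structure stays the same.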