With the rise of large language models (LLMs), our exposure to benchmarks has surged, as have their number and variety. Given the opaque nature of LLMs and other AI systems, benchmarks have become the standard way to compare their performance. Benchmarks are standardized tests or data sets that evaluate […]
