Only smart LLMs can understand good essays

Teaser: For all LLMs, bad essays are hard to understand. But to smart LLMs, good essays make sense. This could be a way to measure an LLM's reasoning ability.


This plot shows that "smarter" LLMs have, on average, a stronger negative correlation between essay grade and essay entropy: the higher an essay's grade, the lower the average entropy such a model assigns to its tokens.
Even though gpt2 (second from left) has more neurons than opt-125M (first from left), it seems "dumber" according to the plot above. However, gpt2 was released in 2019, while opt-125M was released in 2022. One can thus argue that gpt2, being older, uses its neurons less effectively than opt-125M.

How to compute the correlation between essay grade and entropy

Take a large dataset of graded essays, e.g. BAWE (British Academic Written English), which contains 2092 essays written by university students on various subjects, each graded by human reviewers with one of two grades, "Merit" or "Distinction". Then take an LLM such as Mistral and let it predict each essay token by token: give it the 1st token, let it predict the 2nd, and compute the entropy of the output probability distribution over the 2nd token. Give it the 2nd token, let it predict the 3rd, compute the entropy of the distribution over the 3rd token, and so on. Finally, average the entropies over the entire essay to get a single entropy value per essay.
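
A minimal sketch of this procedure, assuming the Hugging Face transformers and scipy libraries. Here gpt2 stands in for Mistral to keep the example lightweight, and the two toy essays are hypothetical stand-ins for the BAWE data; since the grade is binary (Merit vs. Distinction), the grade-entropy correlation can be computed as a point-biserial correlation:

```python
import torch
import torch.nn.functional as F
from scipy.stats import pointbiserialr
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # swap in e.g. "mistralai/Mistral-7B-v0.1" for Mistral

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def mean_entropy(text: str) -> float:
    """Average next-token entropy (in nats) over one essay."""
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    logits = model(ids).logits  # shape: (1, seq_len, vocab_size)
    # The distribution over token i+1 is read off at position i,
    # so the last position predicts nothing and is dropped.
    log_p = F.log_softmax(logits[0, :-1], dim=-1)
    entropies = -(log_p.exp() * log_p).sum(dim=-1)  # one value per predicted token
    return entropies.mean().item()

# Hypothetical stand-ins for the BAWE essays and their grades.
essays = ["The industrial revolution reshaped labour ...",
          "Photosynthesis converts light into chemical energy ..."]
grades = [1, 0]  # 1 = "Distinction", 0 = "Merit"

entropies = [mean_entropy(e) for e in essays]
r, p = pointbiserialr(grades, entropies)  # grade is binary, hence point-biserial
print(f"grade-entropy correlation: r = {r:.3f}, p = {p:.3g}")
```

Under the hypothesis above, a more negative r for a given model would indicate a "smarter" model.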

Entropy by token position

An essay's entropy is highest at the beginning and decreases gradually as tokens are added. Intuitively, this means that with each additional token, the LLM gets a clearer picture of what the essay is about and can thus predict the next token more easily. Only the first 500 tokens are plotted.
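
To reproduce a plot like this, one can keep the per-position entropies instead of collapsing them into a single mean, then average across essays at each position. A sketch under the same assumptions as above (gpt2 as a lightweight stand-in, toy essays as hypothetical placeholders for BAWE), plus numpy and matplotlib for the plot:

```python
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_POS = 500  # only the first 500 token positions are plotted

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def entropy_per_position(text: str) -> np.ndarray:
    """Next-token entropy (in nats) at each position of one essay."""
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    log_p = F.log_softmax(model(ids).logits[0, :-1], dim=-1)
    return (-(log_p.exp() * log_p).sum(dim=-1)).numpy()

# Hypothetical placeholders for the BAWE essays.
essays = ["A first toy essay ...", "A second toy essay ..."]

# Essays have different lengths, so pad the grid with NaN and let
# nanmean ignore positions beyond an essay's end.
grid = np.full((len(essays), MAX_POS), np.nan)
for i, essay in enumerate(essays):
    ent = entropy_per_position(essay)[:MAX_POS]
    grid[i, :len(ent)] = ent

plt.plot(np.nanmean(grid, axis=0))
plt.xlabel("token position")
plt.ylabel("mean entropy (nats)")
plt.show()
```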

Code