Take a large dataset (e.g. BAWE) containing essays together with the grades assigned to them by human markers. The BAWE dataset contains 2092 essays written by university students on various subjects. Each essay carries one of two grades, "Merit" or "Distinction". With the dataset in hand, take an LLM like Mistral and, for each essay, let it predict the next token: give it the 1st token of the essay and let it predict the 2nd; compute the entropy of the output probability distribution over the 2nd token. Then give it the first two tokens, let it predict the 3rd, compute the entropy of that distribution, and so on. Finally, average the entropies over the entire essay.
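A minimal sketch of this per-token entropy computation, assuming a Hugging Face causal LM (the exact Mistral checkpoint, the variable `essay_text`, and the helper name `token_entropies` are illustrative, not from the original):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumption: the post does not name an exact checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

def token_entropies(text: str) -> torch.Tensor:
    """Entropy of the model's next-token distribution at every position of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                         # (1, seq_len, vocab_size)
    # The distribution at position t predicts token t+1, so the last position is dropped.
    log_probs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)      # one entropy per predicted token

# essay_text: one BAWE essay as a string (placeholder)
essay_score = token_entropies(essay_text).mean().item()    # essay-level average entropy
```

One forward pass yields the logits for every position at once, so the "feed the prefix, predict the next token" loop from the description collapses into a single call per essay.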
An essay's entropy is highest at the beginning and decreases gradually as more tokens are added. Intuitively, this means that with each additional token, the LLM understands more clearly what the essay is about and can thus predict the next token more easily. Only the first 500 tokens are plotted.
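A sketch of how such a plot could be produced, reusing the `token_entropies` helper from the sketch above (`essays`, the list of essay strings, is assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

max_pos = 500
curves = [token_entropies(e).numpy()[:max_pos] for e in essays]
curves = [c for c in curves if len(c) == max_pos]   # keep only essays with at least 500 tokens
mean_curve = np.stack(curves).mean(axis=0)          # average entropy per position across essays

plt.plot(mean_curve)
plt.xlabel("Token position")
plt.ylabel("Mean next-token entropy")
plt.show()
```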