Evaluating LLMs after training

Like any other machine learning model, an LLM needs to be evaluated after training to check whether the training was successful.

There are two main types of evaluation methods for LLMs: intrinsic and extrinsic.

In intrinsic methods, the language model is evaluated only on specific types of tasks that are directly related to its training objectives. Commonly used intrinsic evaluation metrics include:

Language fluency - evaluates how natural the language produced by the LLM is, checking grammatical correctness and syntactic diversity so that generated sentences read as if a human had written them.
Perplexity - measures how well the model predicts a sample of text. Lower perplexity means the model assigns higher probability to that text.
BLEU (Bilingual Evaluation Understudy) - a common measure in machine translation. It compares the translated text with one or more reference translations and scores how many word sequences (n-grams) they share.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) - measures how well LLMs summarize text by comparing a generated summary with one or more reference summaries.
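
To make these metrics more concrete, here is a minimal Python sketch of two of them. It assumes you already have per-token log-probabilities from a model (the values below are made up), and it uses a simplified n-gram overlap in place of full BLEU or ROUGE: precision over the candidate is the core idea behind BLEU, and recall over the reference is the core idea behind ROUGE-N. Real implementations add details such as brevity penalties, multiple n-gram orders, and proper tokenization, so treat the function names and whitespace tokenization here as illustrative assumptions.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity from the natural-log probabilities a model gave each token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_likelihood)


def ngrams(tokens, n):
    """Count the n-grams that occur in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, reference, n=1):
    """Return (precision, recall) of clipped n-gram overlap.

    Simplified stand-in: precision mirrors the idea behind BLEU,
    recall mirrors the idea behind ROUGE-N.
    """
    cand = ngrams(candidate.lower().split(), n)  # naive whitespace tokenization
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # each n-gram counted at most as often as in both
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


# Hypothetical log-probabilities: the model gave each of 4 tokens probability 0.25.
log_probs = [math.log(0.25)] * 4
print(perplexity(log_probs))  # approximately 4: as uncertain as a 4-way guess

precision, recall = ngram_overlap(
    "the cat sat on the mat",  # model output
    "the cat is on the mat",   # reference text
)
print(precision, recall)  # 5 of 6 unigrams match in each direction
```

A perplexity of about 4 here means the model was, on average, as uncertain as if it were choosing uniformly among four tokens at each step. For reported results it is usually better to rely on an established metric library than on a hand-rolled overlap score like this one.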

In extrinsic methods, the language model is evaluated on how well it performs in real-life scenarios that were not explicitly covered during training. Extrinsic methods can also incorporate human-in-the-loop testing. Commonly used extrinsic evaluation methods include:

Questionnaires - the LLMs are given questions that people would normally answer, and their responses are compared with those given by humans.
Common-sense tests - check whether the LLMs can make common-sense guesses the way people do.
Multitasking - checks how well LLMs handle tasks that require knowledge from different fields, such as math, law, history, and geography, all in a single evaluation.
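
As a rough illustration of the multitasking idea, the sketch below scores a model on multiple-choice questions from several subjects and reports accuracy overall and per subject. Everything in it is an assumption made for the example: the data layout, the `evaluate_multiple_choice` helper, and the `always_first` stand-in that plays the role of a real model call.

```python
from collections import defaultdict


def evaluate_multiple_choice(model_fn, examples):
    """Return (overall accuracy, per-subject accuracy) for a set of questions.

    model_fn is a hypothetical callable that takes a question and its answer
    choices and returns the index of the choice it picks.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for example in examples:
        prediction = model_fn(example["question"], example["choices"])
        total[example["subject"]] += 1
        if prediction == example["answer"]:
            correct[example["subject"]] += 1
    per_subject = {subject: correct[subject] / total[subject] for subject in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return overall, per_subject


# Tiny made-up test set; "answer" is the index of the correct choice.
examples = [
    {"subject": "math", "question": "2 + 2 = ?",
     "choices": ["4", "5", "3"], "answer": 0},
    {"subject": "history", "question": "In which year did World War II end?",
     "choices": ["1939", "1945", "1950"], "answer": 1},
    {"subject": "geography", "question": "What is the capital of France?",
     "choices": ["Paris", "Rome", "Madrid"], "answer": 0},
]


def always_first(question, choices):
    """Stand-in for a real model: always picks the first choice."""
    return 0


overall, per_subject = evaluate_multiple_choice(always_first, examples)
print(overall)      # 2 of 3 questions answered correctly
print(per_subject)  # math and geography correct, history wrong
```

Reporting accuracy per subject as well as overall helps show where a model is weak, which a single aggregate number can hide.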

Which one to choose? Both intrinsic and extrinsic evaluation methods have their pros and cons.

Intrinsic methods are well suited to assessing the effectiveness of individual NLP components, but they may not reflect the model’s performance in real-life applications.

Extrinsic methods, on the other hand, capture how well the model works across a wider range of situations, but they usually take more time and effort to run and can be quite subjective.

Which one to use depends on what needs to be tested and how the model will be used. Combining both methods usually gives the clearest picture of what the language model can and can’t do.