An overview of NLP evaluation metrics and distance/similarity measures
Evaluation in NLP is an open problem, especially when it comes to evaluating generated text. It's a genuinely hard problem, so there is no single go-to evaluation metric. What makes it even harder is that metrics often need to balance competing goals, e.g. correctness (quality) versus specificity (diversity).
How can you evaluate a model?
- Indirect Evaluation - incorporate your NLP model into some other downstream task that can be evaluated easily.
- Direct Evaluation
  - Human Evaluation
  - Automatic Evaluation - generate objective scores from ground truths and an evaluation metric (see the sketch after this list).
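As a concrete instance of the automatic route, here is a minimal sketch that scores a generated sentence against a ground-truth reference with sentence-level BLEU, assuming NLTK is available; the example sentences are made up for illustration.

```python
# A minimal sketch of automatic evaluation with a reference-based metric.
# Assumes NLTK is installed (pip install nltk).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()    # ground-truth text, tokenized
hypothesis = "the cat is on the mat".split()    # model-generated text, tokenized

# sentence_bleu expects a list of reference token lists and one hypothesis.
# Smoothing avoids zero scores when higher-order n-grams never match.
score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```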
Then the question is which metric to use in which case, and the best way to answer it is to first ask: what are we trying to measure?
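To make that concrete, the toy sketch below looks at the diversity side of the quality/diversity tension mentioned above: distinct-n (the ratio of unique n-grams to total n-grams across a set of outputs) rewards varied generations that a correctness-oriented metric like BLEU does not capture. The function and example sentences here are illustrative, not taken from any particular library.

```python
# Illustration only: different metrics measure different things.
# distinct_n is a simple diversity measure over a set of generated sentences.
def distinct_n(sentences, n=2):
    """Fraction of n-grams that are unique across the generated sentences."""
    ngrams = []
    for sent in sentences:
        tokens = sent.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

generic = ["i do not know", "i do not know", "i do not know"]
varied = ["i am not sure", "it could rain later", "ask me tomorrow"]

print(distinct_n(generic))  # low: repetitive, "safe" outputs
print(distinct_n(varied))   # high: more diverse outputs
```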