Human evaluation of LLM performance made simple
Upload
CSV with reference + generated
Score
Review each row and rate what matters
Export
Download the completed CSV anytime
Start a new evaluation
Upload a CSV with reference (ground-truth) and LLM-generated text.
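As a rough sketch of preparing such a file, the snippet below builds a two-column CSV with Python's standard csv module. The column names `reference` and `generated` and the helper `build_eval_csv` are assumptions for illustration; the upload form may expect different headers.

```python
import csv
import io

# Hypothetical column names; adjust to whatever headers the upload form expects.
FIELDNAMES = ["reference", "generated"]

def build_eval_csv(pairs):
    """Return CSV text with one row per (reference, generated) pair."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    for reference, generated in pairs:
        # The csv module quotes commas and newlines inside the text for us.
        writer.writerow({"reference": reference, "generated": generated})
    return buffer.getvalue()

sample = [
    ("Paris is the capital of France.", "The capital of France is Paris."),
]
print(build_eval_csv(sample))
```

Using the csv module rather than joining strings by hand keeps rows intact when the reference or generated text contains commas, quotes, or line breaks.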
The 5 HumanELY metrics
Likert 1–5
Each metric has sub-items scored on a 5-point scale. Higher is better except on Harm, where higher indicates more concern.
Relevance
How accurate the response is in content, reasoning, and helpfulness.
Coverage
How completely the response covers the key topics and content from the reference.
Coherence
Fluency, grammar, and organization of the generated content.
Harm
Bias, toxicity, privacy, and hallucinations in the generated response.
Comparison
How the generated response compares to human and alternate-LLM responses.
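One way to summarize a rated row, sketched under assumptions: average the 1–5 sub-item ratings for each metric, and flip Harm with `6 - score` when a single "higher is better" view is wanted. The sub-item names, the use of a plain mean, and the flipping step are illustrative choices, not a description of how the tool itself aggregates scores.

```python
from statistics import mean

# Harm is the one metric where a higher rating means more concern.
REVERSED_METRICS = {"Harm"}

def metric_means(ratings):
    """Average the 1-5 sub-item ratings for each metric."""
    result = {}
    for metric, sub_items in ratings.items():
        for value in sub_items.values():
            if not 1 <= value <= 5:
                raise ValueError(f"{metric}: rating {value} is outside 1-5")
        result[metric] = mean(sub_items.values())
    return result

def higher_is_better(means):
    """Flip reversed metrics (6 - score) so every value reads 'higher is better'."""
    return {m: (6 - s if m in REVERSED_METRICS else s) for m, s in means.items()}

# Hypothetical sub-item names, loosely following the metric descriptions.
ratings = {
    "Relevance": {"accuracy": 4, "reasoning": 5, "helpfulness": 3},
    "Harm": {"bias": 1, "toxicity": 1, "privacy": 2, "hallucination": 2},
}
means = metric_means(ratings)
print(means)
print(higher_is_better(means))
```

Keeping the raw Harm mean alongside the flipped value avoids confusion when comparing against the exported CSV, which stores the ratings as entered.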