A Human way to evaluate LLM output

Human evaluation of LLM performance made simple

Upload a CSV with reference and generated text. Score each row across the five HumanELY metrics on a 5-point Likert scale. Export publication-ready results in one click.

Start a new evaluation

Upload a CSV with reference (ground-truth) and LLM-generated text.
Upload
CSV with reference + generated
Score
Review each row and rate only what you need
Export
Download the completed CSV anytime
Drop your CSV here
or click to browse · max 10 MB
Private by design. We do not permanently store uploaded data. Files are used only for the current evaluation session and export workflow.
CSV format. Include two columns: reference_text and generated_text. Aliases (reference, ground_truth, prediction, response) are accepted. Extra columns are preserved on export.
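The column-name handling described above can be sketched as follows. This is a minimal illustration, not the app's actual code; the exact alias-to-canonical mapping (e.g. that ground_truth maps to reference_text and response to generated_text) is an assumption inferred from the names listed.

```python
import csv
import io

# Canonical column names plus accepted aliases. The alias→canonical
# mapping is an assumption based on the names listed above.
ALIASES = {
    "reference_text": "reference_text",
    "reference": "reference_text",
    "ground_truth": "reference_text",
    "generated_text": "generated_text",
    "prediction": "generated_text",
    "response": "generated_text",
}

def normalize_header(fieldnames):
    """Map accepted aliases to canonical names; extra columns pass through."""
    return [ALIASES.get(name.strip().lower(), name) for name in fieldnames]

sample = "ground_truth,response,source\nThe sky is blue.,Blue sky.,doc1\n"
reader = csv.DictReader(io.StringIO(sample))
header = normalize_header(reader.fieldnames)
# → ['reference_text', 'generated_text', 'source']
```

Note that the extra `source` column survives normalization untouched, matching the "extra columns are preserved on export" behavior.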

The 5 HumanELY metrics

Likert 1–5

Each metric has sub-items scored on a 5-point scale. Higher is better except on Harm, where higher indicates more concern.

Relevance
How well the response addresses the query in content, reasoning, and helpfulness.
Coverage
How completely the response covers the key topics and content from the reference.
Coherence
Fluency, grammar, and organization of the generated content.
Harm
Bias, toxicity, privacy, and hallucinations in the generated response.
Comparison
How the generated response compares to human-written and alternative LLM responses.
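Summarizing scored rows across the five metrics can be sketched as below. The metric names come from the list above, but the row schema (one 1–5 score per metric per row) and column names are hypothetical, not the app's actual export format.

```python
from statistics import mean

# Hypothetical scored rows: one Likert score (1-5) per metric per row.
# Column names here are illustrative only.
rows = [
    {"relevance": 5, "coverage": 4, "coherence": 5, "harm": 1, "comparison": 4},
    {"relevance": 4, "coverage": 3, "coherence": 4, "harm": 2, "comparison": 3},
]

METRICS = ["relevance", "coverage", "coherence", "harm", "comparison"]

# Mean score per metric across all rows. Higher is better for every
# metric except harm, where a higher score indicates more concern.
summary = {m: mean(r[m] for r in rows) for m in METRICS}
# → {'relevance': 4.5, 'coverage': 3.5, 'coherence': 4.5,
#    'harm': 1.5, 'comparison': 3.5}
```

Reporting Harm alongside the other metrics without inverting it keeps the exported numbers directly traceable to the raters' inputs.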