Human evaluation of LLM performance made simple

Upload

A CSV with reference and generated text

Score

Review each row and rate what matters

Export

Download the completed CSV anytime

Start a new evaluation

Upload a CSV with reference (ground-truth) and LLM-generated text.

Evaluator name or ID *

CSV format. Include two columns: reference_text and generated_text. Aliases (reference, ground_truth, prediction, response) are accepted. Extra columns are preserved on export.
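As a rough illustration of the column rules above, the following sketch maps accepted aliases onto the two required columns and leaves every other column untouched. It is not the tool's actual code. The alias-to-column mapping (reference and ground_truth for reference_text, prediction and response for generated_text) is an assumption based on the names, and `normalize_header` and `load_rows` are hypothetical helpers.

```python
import csv
import io

# Assumed mapping: the page lists the aliases but does not say which
# required column each one stands for.
ALIASES = {
    "reference_text": "reference_text",
    "reference": "reference_text",
    "ground_truth": "reference_text",
    "generated_text": "generated_text",
    "prediction": "generated_text",
    "response": "generated_text",
}
REQUIRED = {"reference_text", "generated_text"}


def normalize_header(fieldnames):
    """Map each original column name to its canonical name (hypothetical helper)."""
    mapping = {}
    found = set()
    for name in fieldnames:
        canonical = ALIASES.get(name.strip().lower())
        if canonical is not None and canonical not in found:
            mapping[name] = canonical
            found.add(canonical)
        else:
            # Extra columns keep their original name so they survive export.
            mapping[name] = name
    missing = REQUIRED - found
    if missing:
        raise ValueError("missing required column(s): " + ", ".join(sorted(missing)))
    return mapping


def load_rows(csv_text):
    """Parse CSV text and rename aliased columns; all other columns pass through."""
    reader = csv.DictReader(io.StringIO(csv_text))
    mapping = normalize_header(reader.fieldnames or [])
    return [{mapping.get(key, key): value for key, value in row.items()} for row in reader]


if __name__ == "__main__":
    sample = "id,ground_truth,response\n1,The cat sat.,A cat was sitting.\n"
    for row in load_rows(sample):
        print(row)
```

A file that lacks either required column (under any alias) raises `ValueError` in this sketch; how the real uploader reports that case is not stated on this page.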

CSV file

Drop your CSV here

or click to browse · max 10 MB

Private by design. We do not permanently store uploaded data. Files are used only for the current evaluation session and export workflow.

The 5 HumanELY metrics

Likert 1–5

Each metric has sub-items scored on a 5-point scale. A higher score is better on every metric except Harm, where a higher score indicates more concern.

Relevance

How accurate the response is in its content and reasoning, and how helpful it is.

Coverage

How completely the response covers the key topics and content from the reference.

Coherence

Fluency, grammar, and organization of the generated content.

Harm

Bias, toxicity, privacy, and hallucinations in the generated response.

Comparison

How the generated response compares to responses from humans and from other LLMs.
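For anyone analyzing an exported CSV, here is a minimal sketch of how sub-item ratings might be combined into one score per metric. The page does not say how sub-items are aggregated, so the plain average is an assumption, as are the function names `metric_score` and `higher_is_better`. The only rule taken from the page is that Harm runs in the opposite direction to the other four metrics.

```python
from statistics import mean

METRICS = ("Relevance", "Coverage", "Coherence", "Harm", "Comparison")


def metric_score(sub_item_ratings):
    """Average the 1-5 Likert ratings of one metric's sub-items.

    Averaging is an assumption; the page does not specify an aggregation rule.
    """
    if not sub_item_ratings:
        raise ValueError("at least one sub-item rating is required")
    for rating in sub_item_ratings:
        if rating not in (1, 2, 3, 4, 5):
            raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")
    return mean(sub_item_ratings)


def higher_is_better(metric, score):
    """Put every metric on a 'higher is better' scale.

    On Harm a higher score means more concern, so it is reflected
    around the scale midpoint (1 becomes 5, 5 becomes 1).
    """
    if metric not in METRICS:
        raise ValueError(f"unknown metric: {metric!r}")
    return 6 - score if metric == "Harm" else score


if __name__ == "__main__":
    # Hypothetical ratings for one row; the sub-item counts are made up.
    ratings = {
        "Relevance": [4, 5, 3],
        "Coverage": [4, 4],
        "Coherence": [5, 5, 4],
        "Harm": [1, 2, 1],
        "Comparison": [3, 4],
    }
    for metric, values in ratings.items():
        raw = metric_score(values)
        print(metric, raw, higher_is_better(metric, raw))
```

Reflecting Harm before comparing or averaging across metrics keeps a low-harm response from being read as a low-quality one.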