Metric Alignment uses your team’s human annotations to automatically improve a metric’s scoring prompt. When the LLM judge and your reviewers consistently disagree on a metric, run Alignment to distill those disagreements into a refined rubric — without writing the new prompt by hand. Alignment is available for both built-in and custom metrics. For built-in metrics, accepting an alignment result produces a new custom metric (a fork). For custom metrics, it bumps the metric to a new version.

Starting an alignment run

Open a metric’s detail page. At the bottom you will find the Alignment section. Click Start Alignment. In the Start Alignment dialog, configure:
Field | Description
Source evaluations | One or more completed evaluations that have resolved annotations for this metric. Only evaluations with resolved annotation data appear in this list.
Reflection LM | The provider and model used to analyze disagreements and distill guidelines. Defaults to the platform default if left blank.
Judge model | The model used to verify the distilled guidelines. Optional.
Delta threshold | Minimum score gap between the LLM judge and a human annotation for the pair to be counted as a disagreement.
Click Start. The run enters the queue immediately.
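To make the Delta threshold concrete, the following is a minimal sketch of how rows might be split into disagreements and agreeing rows. It is not the platform's implementation: the `ScoredRow` type, the `split_by_delta` function, and the choice to treat a gap exactly equal to the threshold as a disagreement are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredRow:
    """Hypothetical record: one row scored by both the LLM judge and a reviewer."""
    row_id: str
    judge_score: float
    human_score: float


def split_by_delta(rows, delta_threshold):
    """Partition rows into (disagreements, agreements) by absolute score gap.

    Assumption: a gap equal to the threshold counts as a disagreement.
    """
    disagreements, agreements = [], []
    for row in rows:
        gap = abs(row.judge_score - row.human_score)
        if gap >= delta_threshold:
            disagreements.append(row)
        else:
            agreements.append(row)
    return disagreements, agreements


rows = [
    ScoredRow("a", judge_score=4.0, human_score=4.0),  # gap 0
    ScoredRow("b", judge_score=5.0, human_score=2.0),  # gap 3
    ScoredRow("c", judge_score=3.0, human_score=2.0),  # gap 1
]
disagreements, agreements = split_by_delta(rows, delta_threshold=2.0)
print([r.row_id for r in disagreements])  # ['b']
print([r.row_id for r in agreements])     # ['a', 'c']
```

With a threshold of 2.0, only row `b` feeds the reflection model; a lower threshold would pull in more rows, including smaller and possibly noisier gaps.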

Alignment phases

An in-progress run moves through these phases:
  1. Loading annotations
  2. Building disagreement traces
  3. Distilling guidelines
  4. Finalizing
A run typically completes within a few minutes; the exact time depends on the number of conversations and the reflection model chosen.

Alignment statuses

Status | Meaning
Pending | Queued, not yet started
Running | In progress
Aligned | Completed successfully; guidelines are ready to review
Error | Run failed; view the error detail and retry
Cancelled | Run was cancelled before completion
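If you track runs in your own tooling, the statuses above can be modelled as a small enumeration. This is a sketch only: the `AlignmentStatus` enum and the `next_action` helper are hypothetical names, and the action strings simply paraphrase the table rather than reflect any documented API.

```python
from enum import Enum


class AlignmentStatus(Enum):
    """The five run statuses listed in the table (values mirror the UI labels)."""
    PENDING = "Pending"
    RUNNING = "Running"
    ALIGNED = "Aligned"
    ERROR = "Error"
    CANCELLED = "Cancelled"


# Statuses after which the run will not change again.
TERMINAL = {
    AlignmentStatus.ALIGNED,
    AlignmentStatus.ERROR,
    AlignmentStatus.CANCELLED,
}


def next_action(status):
    """Map a status to the follow-up suggested by the table (hypothetical helper)."""
    if status is AlignmentStatus.ALIGNED:
        return "review guidelines"
    if status is AlignmentStatus.ERROR:
        return "view error detail and retry"
    if status is AlignmentStatus.CANCELLED:
        return "none"
    return "wait"  # Pending or Running


print(next_action(AlignmentStatus("Running")))       # wait
print(AlignmentStatus("Aligned") in TERMINAL)        # True
```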

Reviewing and accepting results

When a run reaches Aligned status, expand it to view:
  • The full guidelines text (what the LLM judge should look for, in plain language)
  • The number of disagreements found across the source evaluations
  • How many agreeing rows were skipped
If the guidelines look good, click Accept:
  • For a built-in metric: a new custom metric is created containing the distilled guidelines as its scoring prompt. The original built-in metric is unchanged.
  • For a custom metric: the current metric is updated to a new version with the refined prompt. All future evaluations using this metric will apply the new rubric.
If the guidelines do not look useful (e.g. the run found zero disagreements or the output is too generic), discard the run and try again with different source evaluations or a different reflection model.
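The two Accept outcomes can be summarised in a short sketch. Everything here is illustrative: the `Metric` type, the `accept_alignment` function, the fork's name suffix, and the assumption that a fork starts at version 1 are not taken from the platform.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Metric:
    """Hypothetical metric record used only for this illustration."""
    name: str
    built_in: bool
    version: int
    scoring_prompt: str


def accept_alignment(metric, guidelines):
    """Return the metric that results from accepting an alignment run.

    Built-in metric: a new custom metric (a fork) is returned; the original
    object is left untouched. Custom metric: the version is bumped.
    """
    if metric.built_in:
        # Assumption: the fork gets a derived name and starts at version 1.
        return Metric(
            name=f"{metric.name} (aligned)",
            built_in=False,
            version=1,
            scoring_prompt=guidelines,
        )
    return replace(metric, version=metric.version + 1, scoring_prompt=guidelines)


builtin = Metric("Helpfulness", built_in=True, version=1, scoring_prompt="original")
custom = Metric("Tone check", built_in=False, version=3, scoring_prompt="old rubric")

fork = accept_alignment(builtin, "distilled guidelines")
bumped = accept_alignment(custom, "distilled guidelines")
print(fork.built_in, builtin.scoring_prompt)  # False original
print(bumped.version)                         # 4
```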

Cancelling a run

Click Cancel on any in-flight alignment run to stop it. Cancellation is idempotent — cancelling an already-completed run has no effect.
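The idempotent behaviour can be pictured as a function that only changes the status of an in-flight run. This is a sketch under stated assumptions: the `cancel` function is hypothetical, and treating both Pending and Running as "in-flight" is an interpretation of the text, not a documented rule.

```python
# Assumption: "in-flight" covers queued and running runs.
IN_FLIGHT = {"Pending", "Running"}


def cancel(status):
    """Return the run status after a cancel request.

    In-flight runs become Cancelled; any other status is returned unchanged,
    so repeating the request has no further effect.
    """
    return "Cancelled" if status in IN_FLIGHT else status


print(cancel("Running"))          # Cancelled
print(cancel(cancel("Running")))  # Cancelled
print(cancel("Aligned"))          # Aligned
```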

Alignment history

The Alignment section shows all past runs for the metric, newest first. Each row shows the status, creation time, and source evaluations used. Expand any completed run to view its guidelines, even after accepting.

FAQ

How is Alignment different from editing a prompt manually?
Editing a prompt manually requires you to write the new rubric yourself. Alignment derives the rubric automatically by analyzing disagreements between the LLM judge’s scores and your team’s resolved annotations. It is most useful when you have a pattern of disagreement but are not sure exactly how to express the correction in the prompt.

Can I run Alignment on a built-in metric?
Yes. Accepting an alignment result on a built-in metric creates a new custom metric (a fork) that contains the distilled guidelines. The original built-in metric is not modified.

How many annotations do I need?
There is no hard minimum, but more resolved annotations give the reflection model more signal. Runs based on fewer than roughly 5 to 10 resolved disagreements tend to produce generic or low-confidence guidelines. Aim to annotate and resolve at least one full evaluation before starting.