Starting an alignment run
Open a metric’s detail page. At the bottom you will find the Alignment section. Click Start Alignment. In the Start Alignment dialog, configure:

| Field | Description |
|---|---|
| Source evaluations | One or more completed evaluations that have resolved annotations for this metric. Only evaluations with resolved annotation data appear in this list. |
| Reflection LM | The provider and model used to analyze disagreements and distill guidelines. Defaults to the platform default if left blank. |
| Judge model | The model used to verify the distilled guidelines. Optional. |
| Delta threshold | Minimum score gap between the LLM judge and a human annotation for the pair to be counted as a disagreement. |
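
To make the Delta threshold field concrete, here is a minimal sketch of how rows might be split into disagreements and agreements. It is not the platform's implementation: `ScoredRow`, `split_by_delta`, the 1-5 score scale, and the inclusive `>=` comparison are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredRow:
    """Hypothetical row pairing the LLM judge's score with the resolved human score."""
    row_id: str
    judge_score: float
    human_score: float


def split_by_delta(rows, delta_threshold):
    """Split rows into (disagreements, agreements) by absolute score gap.

    Assumption: a gap equal to the threshold counts as a disagreement,
    since the threshold is described as a minimum gap.
    """
    disagreements, agreements = [], []
    for row in rows:
        gap = abs(row.judge_score - row.human_score)
        if gap >= delta_threshold:
            disagreements.append(row)
        else:
            agreements.append(row)
    return disagreements, agreements


# Illustrative data on an assumed 1-5 scale.
rows = [
    ScoredRow("a", judge_score=5, human_score=2),  # gap 3
    ScoredRow("b", judge_score=4, human_score=4),  # gap 0
    ScoredRow("c", judge_score=3, human_score=1),  # gap 2
]
disagreements, agreements = split_by_delta(rows, delta_threshold=2)
print([r.row_id for r in disagreements], [r.row_id for r in agreements])
```

Under these assumptions, raising the threshold shrinks the set of disagreements the reflection model sees, while lowering it treats smaller score gaps as worth correcting.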
Alignment phases
An in-progress run moves through these phases:

- Loading annotations
- Building disagreement traces
- Distilling guidelines
- Finalizing
Alignment statuses
| Status | Meaning |
|---|---|
| Pending | Queued, not yet started |
| Running | In progress |
| Aligned | Completed successfully — guidelines are ready to review |
| Error | Run failed — view the error detail and retry |
| Cancelled | Run was cancelled before completion |
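
If you track runs in your own tooling, the statuses above could be modeled roughly as follows. This is a sketch only: the `AlignmentStatus` enum, the `is_in_flight` and `next_action` helpers, and the choice to treat both Pending and Running as in-flight are assumptions, not part of any documented API.

```python
from enum import Enum


class AlignmentStatus(Enum):
    """Hypothetical mirror of the statuses listed in the table."""
    PENDING = "Pending"
    RUNNING = "Running"
    ALIGNED = "Aligned"
    ERROR = "Error"
    CANCELLED = "Cancelled"


# Assumption: a queued run counts as in-flight alongside a running one.
IN_FLIGHT = {AlignmentStatus.PENDING, AlignmentStatus.RUNNING}


def is_in_flight(status: AlignmentStatus) -> bool:
    """Return True while the run has not reached a final status."""
    return status in IN_FLIGHT


def next_action(status: AlignmentStatus) -> str:
    """Map each status to the follow-up the table suggests."""
    actions = {
        AlignmentStatus.PENDING: "wait: queued, not yet started",
        AlignmentStatus.RUNNING: "wait: in progress",
        AlignmentStatus.ALIGNED: "review the guidelines",
        AlignmentStatus.ERROR: "view the error detail and retry",
        AlignmentStatus.CANCELLED: "none: run was cancelled",
    }
    return actions[status]


print(is_in_flight(AlignmentStatus.RUNNING), next_action(AlignmentStatus.ERROR))
```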
Reviewing and accepting results
When a run reaches Aligned status, expand it to view:

- The full guidelines text (what the LLM judge should look for, in plain language)
- The number of disagreements found across the source evaluations
- How many agreeing rows were skipped

Accepting the result applies the guidelines. The effect depends on the metric type:

- For a built-in metric: a new custom metric is created containing the distilled guidelines as its scoring prompt. The original built-in metric is unchanged.
- For a custom metric: the current metric is updated to a new version with the refined prompt. All future evaluations using this metric will apply the new rubric.
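
The two outcomes for built-in and custom metrics can be pictured with a small sketch. Everything here is hypothetical: the `Metric` shape, the `accept_guidelines` function, the "(aligned)" suffix on the forked name, and the version numbering are illustrative assumptions, not the platform's data model.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Metric:
    """Hypothetical, simplified view of a metric."""
    name: str
    built_in: bool
    version: int
    scoring_prompt: str


def accept_guidelines(metric: Metric, guidelines: str) -> Metric:
    """Return the metric that results from accepting distilled guidelines.

    Built-in metric: a new custom metric (a fork) is returned and the
    original object is left untouched.
    Custom metric: a new version of the same metric is returned.
    """
    if metric.built_in:
        # Assumed naming and starting version for the fork.
        return Metric(
            name=f"{metric.name} (aligned)",
            built_in=False,
            version=1,
            scoring_prompt=guidelines,
        )
    return replace(metric, version=metric.version + 1, scoring_prompt=guidelines)


builtin = Metric("Helpfulness", built_in=True, version=1, scoring_prompt="original rubric")
custom = Metric("Tone check", built_in=False, version=3, scoring_prompt="old rubric")

fork = accept_guidelines(builtin, "distilled guidelines")
updated = accept_guidelines(custom, "distilled guidelines")
print(fork.name, fork.built_in, updated.version)
```

Because the sketch uses a frozen dataclass, the built-in metric cannot be mutated by accident, which mirrors the statement that the original built-in metric is unchanged.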
Cancelling a run
Click Cancel on any in-flight alignment run to stop it. Cancellation is idempotent: cancelling an already-completed run has no effect.
Alignment history
The Alignment section shows all past runs for the metric, newest first. Each row shows the status, creation time, and source evaluations used. Expand any completed run to view its guidelines, even after accepting.
FAQ
What is the difference between Alignment and editing the scoring prompt manually?
Editing a prompt manually requires you to write the new rubric yourself. Alignment derives the rubric automatically by analyzing disagreements between the LLM judge’s scores and your team’s resolved annotations. It is most useful when you have a pattern of disagreement but are not sure exactly how to express the correction in the prompt.
Can I run Alignment on a built-in metric?
Yes. Accepting an alignment result on a built-in metric creates a new custom metric (a fork) that contains the distilled guidelines. The original built-in metric is not modified.
How many resolved annotations do I need before running Alignment?
There is no hard minimum, but more resolved annotations give the reflection model more signal. Runs based on fewer than about 5 to 10 resolved disagreements tend to produce generic or low-confidence guidelines. Aim to annotate and resolve at least one full evaluation before starting.