Built-in metrics
| Metric | What it measures |
|---|---|
| Helpfulness | Whether the agent’s responses actually help the user accomplish their goal. |
| Coherence | Whether the agent’s responses are logically consistent and well-structured across the conversation. |
| Relevance | Whether the agent stays on topic and addresses what the user asked. |
| Verbosity | Whether the agent’s response length is appropriate: neither too terse nor unnecessarily long. |
| Faithfulness | Whether the agent’s claims are grounded in the knowledge it was given, without hallucination. |
| Goal Completion | Whether the simulated user’s stated goal was fully achieved by the end of the conversation. Scored 0-1 rather than 1-5. Always required — this metric is automatically included in every evaluation and cannot be deselected. |
| Agent Behavior Failure | Whether the agent exhibited any defined failure behaviors (e.g. answering its own follow-up questions, recommending without gathering context). |
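
Because Goal Completion is scored 0-1 while the table contrasts it with a 1-5 scale, raw scores from different metrics are not directly comparable. The sketch below shows one way to rescale exported scores to a common 0-1 range. The `normalize` function and the sample values are hypothetical and are not part of any Arkdock API.

```python
def normalize(score: float, low: float, high: float) -> float:
    """Rescale a score from the range [low, high] to [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return (score - low) / (high - low)


# Hypothetical raw scores: (value, scale minimum, scale maximum).
raw_scores = {
    "Helpfulness": (4, 1, 5),
    "Goal Completion": (1, 0, 1),
}

normalized = {name: normalize(*triple) for name, triple in raw_scores.items()}
print(normalized)
```

Rescaling like this only makes the numbers share a range. It does not make a 0.75 on one metric mean the same thing as a 0.75 on another.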
Custom metrics
Custom metrics let you define domain-specific behaviors to score. Navigate to Metrics in the sidebar to create and manage them.
Creating a custom metric
Click New Metric or the + icon on the Metrics page.
- Name: A short label for the metric (e.g. “Upsell Appropriateness”, “Policy Compliance”).
- Description: A one-sentence description that appears as a tooltip in evaluation results.
- Type: Whether the metric produces a numeric score (Quantitative) or a category label (Qualitative).
- Scope: Whether to score each assistant turn independently (Turn) or the full conversation once (Conversation).
- System Prompt: The fixed instruction the LLM judge receives describing its role and how to apply this metric.
- User Prompt Template: The per-turn or per-conversation prompt the judge uses to produce a score. Reference the conversation content here. For quantitative metrics, define the numeric scale and what each score level means.
- Type-specific config: For quantitative metrics, set the score range (e.g. 1–5 or 0–1). For qualitative metrics, define the label options the judge can assign.
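
To see how these fields fit together, it can help to model a metric definition as a small data structure. The sketch below is only a mental model: the `CustomMetric` class, its attribute names, and the `{assistant_turn}` placeholder are assumptions made for illustration, not Arkdock's actual schema or template syntax.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CustomMetric:
    """Hypothetical model of the fields on the New Metric form."""
    name: str
    description: str
    metric_type: str            # "quantitative" or "qualitative"
    scope: str                  # "turn" or "conversation"
    system_prompt: str
    user_prompt_template: str
    score_range: Optional[Tuple[float, float]] = None   # quantitative only
    labels: List[str] = field(default_factory=list)     # qualitative only

    def __post_init__(self) -> None:
        if self.metric_type not in ("quantitative", "qualitative"):
            raise ValueError("metric_type must be quantitative or qualitative")
        if self.scope not in ("turn", "conversation"):
            raise ValueError("scope must be turn or conversation")
        if self.metric_type == "quantitative":
            # A numeric metric needs a valid score range.
            if self.score_range is None or self.score_range[0] >= self.score_range[1]:
                raise ValueError("quantitative metrics need a score range with low < high")
        elif not self.labels:
            # A category metric needs label options for the judge to assign.
            raise ValueError("qualitative metrics need at least one label")


metric = CustomMetric(
    name="Policy Compliance",
    description="Whether the reply follows the stated refund policy.",
    metric_type="quantitative",
    scope="turn",
    system_prompt="You are a strict reviewer of refund-policy compliance.",
    user_prompt_template="Rate this assistant turn from 1 to 5:\n{assistant_turn}",
    score_range=(1, 5),
)
print(metric.name, metric.score_range)
```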
Example scoring rubric (quantitative, turn-level):
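
The following is an illustrative sketch, not Arkdock's own example. It drafts a turn-level rubric for a hypothetical "Policy Compliance" metric on a 1-5 scale and renders it into a judge prompt. The score descriptions, the `build_user_prompt` helper, and the use of Python to hold the text are all assumptions; in practice the rubric text would go in the User Prompt Template field.

```python
# Hypothetical descriptions of each score level for "Policy Compliance".
SCORE_LEVELS = {
    1: "The reply contradicts the policy or promises something the policy forbids.",
    2: "The reply mostly ignores the policy and leaves the user with a wrong impression.",
    3: "The reply is consistent with the policy but omits a condition the user needs.",
    4: "The reply follows the policy with only minor imprecision in wording.",
    5: "The reply follows the policy exactly and states every relevant condition.",
}


def build_user_prompt(assistant_turn: str) -> str:
    """Render a per-turn judge prompt with every score level spelled out."""
    scale = "\n".join(
        f"{score}: {meaning}" for score, meaning in sorted(SCORE_LEVELS.items())
    )
    return (
        "Rate the assistant turn below for policy compliance on a 1-5 scale.\n\n"
        f"Scale:\n{scale}\n\n"
        f"Assistant turn:\n{assistant_turn}\n\n"
        "Respond with a single integer from 1 to 5."
    )


print(build_user_prompt("You can return the item within 30 days for a full refund."))
```

Spelling out each level gives the judge concrete boundaries to apply, so it does not have to infer what separates a 2 from a 4.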
Editing a custom metric
Click a metric row to open its detail view. Edit the name, description, or prompt and click Save. The updated prompt is used for all future evaluations. Previous evaluation results are not retroactively rescored.
Metric versions
Each time you save changes to a custom metric, a new version is created. You can view version history from the metric detail page to see what changed between runs and understand score differences over time.
Deleting a custom metric
Open the metric detail page and use the delete action. Deleting a metric removes it from future evaluations. Historical evaluation results that used this metric remain unchanged.
Metric Alignment
When your team’s annotations consistently disagree with LLM judge scores, Metric Alignment can automatically distill those disagreements into a refined rubric. See Metric Alignment for the full workflow.
Selecting metrics for an evaluation
When creating an evaluation, the Metrics selector lists all available metrics grouped by type:
- Custom: your organization’s custom metrics, shown first.
- Built-in: the seven standard Arkdock metrics.
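
The selection rules described in this document (custom metrics listed first, Goal Completion always included) can be sketched as follows. The `resolve_selection` function and the custom metric names are hypothetical; this models the documented behavior and is not Arkdock code.

```python
BUILT_IN = [
    "Helpfulness",
    "Coherence",
    "Relevance",
    "Verbosity",
    "Faithfulness",
    "Goal Completion",
    "Agent Behavior Failure",
]
REQUIRED = "Goal Completion"  # always included and cannot be deselected


def resolve_selection(selected, custom_metrics):
    """Return the metrics to run, grouped with custom first, then built-in."""
    known = set(BUILT_IN) | set(custom_metrics)
    unknown = [name for name in selected if name not in known]
    if unknown:
        raise ValueError(f"unknown metrics: {unknown}")
    chosen = set(selected) | {REQUIRED}
    return {
        "custom": [name for name in custom_metrics if name in chosen],
        "built_in": [name for name in BUILT_IN if name in chosen],
    }


result = resolve_selection(
    ["Helpfulness", "Upsell Appropriateness"],
    custom_metrics=["Upsell Appropriateness", "Policy Compliance"],
)
print(result)
```

Even though the caller did not select Goal Completion, it appears in the built-in group of the result.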
FAQ
How do I choose which metrics to include in an evaluation?
Start with the built-in metrics that are most relevant to your agent’s purpose. For a customer support agent, Helpfulness, Goal Completion, and Faithfulness are usually the most informative. Add custom metrics for behaviors specific to your use case.
Can I use custom metrics alongside built-in ones?
Yes. Mix and match freely. Each metric is scored independently.
Does the choice of LLM judge affect scores?
Yes. Different models interpret scoring rubrics differently. For consistency across evaluation runs, use the same judge and model. If you change judges, treat the results as a separate baseline rather than a direct comparison to prior runs.
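
One way to follow this advice when analyzing exported results is to compute a separate baseline per judge, so that runs scored by different judges are never averaged together. The sketch below assumes a hypothetical list of run records. The field names, judge identifiers, and score values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exported results: one record per evaluation run.
runs = [
    {"run": "run-1", "judge": "judge-model-a", "helpfulness": 4.0},
    {"run": "run-2", "judge": "judge-model-a", "helpfulness": 4.5},
    {"run": "run-3", "judge": "judge-model-b", "helpfulness": 3.5},
]


def baselines_by_judge(records, metric):
    """Average a metric separately for each judge."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["judge"]].append(record[metric])
    return {judge: mean(scores) for judge, scores in grouped.items()}


baselines = baselines_by_judge(runs, "helpfulness")
print(baselines)
```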
How specific should a custom metric prompt be?
Be as specific as possible about the score boundaries. A vague prompt like “Rate how professional the agent sounds” produces inconsistent scores. A prompt that defines what a 1, 3, and 5 look like will produce much more reliable results.
How many custom metrics can I create?
There is no hard limit on custom metrics. Practically, evaluations with more than 10-12 metrics can become harder to interpret. Focus on the 3-5 metrics that best capture the behaviors you care about.