Metrics define what the LLM judge evaluates during an evaluation run. Arkdock includes seven built-in metrics and supports custom metrics that you define with a natural-language rubric.

Built-in metrics

| Metric | What it measures |
| --- | --- |
| Helpfulness | Whether the agent’s responses actually help the user accomplish their goal. |
| Coherence | Whether the agent’s responses are logically consistent and well-structured across the conversation. |
| Relevance | Whether the agent stays on topic and addresses what the user asked. |
| Verbosity | Whether the agent’s response length is appropriate: not too terse or unnecessarily long. |
| Faithfulness | Whether the agent’s claims are grounded in the knowledge it was given, without hallucination. |
| Goal Completion | Whether the simulated user’s stated goal was fully achieved by the end of the conversation. Scored 0-1 rather than 1-5. Always required: this metric is automatically included in every evaluation and cannot be deselected. |
| Agent Behavior Failure | Whether the agent exhibited any defined failure behaviors (e.g. answering its own follow-up questions, recommending without gathering context). |
Built-in metrics use Arkdock’s default scoring rubrics and are available in all evaluations without any configuration.
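Because Goal Completion is scored 0-1 while other metrics use a 1-5 scale, raw scores are not directly comparable side by side. The sketch below shows one way to rescale exported scores onto a common 0-1 range for your own reporting. It is an illustration only, not part of Arkdock: the `SCALES` mapping and both function names are hypothetical, and the 1-5 range for Helpfulness is inferred from the table above.

```python
# Hypothetical helper for post-processing exported scores; not an Arkdock API.

# Assumed score ranges per metric (low, high). Only two metrics are listed
# because the documentation states the scale explicitly only for Goal Completion.
SCALES = {
    "Helpfulness": (1, 5),
    "Goal Completion": (0, 1),
}


def normalize(score: float, low: float, high: float) -> float:
    """Map a score from the range [low, high] onto [0, 1] (min-max scaling)."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the range {low}-{high}")
    return (score - low) / (high - low)


def to_unit(metric: str, score: float) -> float:
    """Look up the metric's assumed scale and rescale the score to [0, 1]."""
    low, high = SCALES[metric]
    return normalize(score, low, high)


if __name__ == "__main__":
    print(to_unit("Helpfulness", 3))      # midpoint of a 1-5 scale
    print(to_unit("Goal Completion", 1))  # already on a 0-1 scale
```

Rescaling only helps with presentation. A 0.5 on one metric does not mean the same thing as a 0.5 on another, since each rubric defines its own score levels.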

Custom metrics

Custom metrics let you define domain-specific behaviors to score. Navigate to Metrics in the sidebar to create and manage them.

Creating a custom metric

Click New Metric or the + icon on the Metrics page, then fill in the following fields:
  • Name: A short label for the metric (e.g. “Upsell Appropriateness”, “Policy Compliance”).
  • Description: A one-sentence description that appears as a tooltip in evaluation results.
  • Type: Whether the metric produces a numeric score (Quantitative) or a category label (Qualitative).
  • Scope: Whether to score each assistant turn independently (Turn) or the full conversation once (Conversation).
  • System Prompt: The fixed instruction the LLM judge receives describing its role and how to apply this metric.
  • User Prompt Template: The per-turn or per-conversation prompt the judge uses to produce a score. Reference the conversation content here. For quantitative metrics, define the numeric scale and what each score level means.
  • Type-specific config: For quantitative metrics, set the score range (e.g. 1–5 or 0–1). For qualitative metrics, define the label options the judge can assign.

Example scoring rubric (quantitative, turn-level):
Evaluate whether the agent gathered sufficient information from the user
before making a product recommendation.

Score 5 if the agent asked at least two clarifying questions about budget,
use case, or preferences before recommending.
Score 3 if the agent asked one clarifying question.
Score 2 if the agent made a recommendation without asking any questions.
Score 1 if the agent's recommendation directly contradicted the user's
stated preferences.
Click Save to create the metric. Custom metrics are immediately available in the metric selector when creating an evaluation.
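Before entering a metric in the UI, it can help to draft the fields in a structured form and check them for gaps. The following is a minimal sketch of that idea, assuming nothing about Arkdock's internal storage or API: the `CustomMetricDraft` class, its field names, and the validation rules are all hypothetical and simply mirror the form fields described above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CustomMetricDraft:
    """Hypothetical local draft of a custom metric; mirrors the UI form fields."""
    name: str
    description: str
    metric_type: str            # "quantitative" or "qualitative"
    scope: str                  # "turn" or "conversation"
    system_prompt: str
    user_prompt_template: str
    score_range: Optional[tuple[int, int]] = None   # quantitative metrics only
    labels: list[str] = field(default_factory=list)  # qualitative metrics only

    def problems(self) -> list[str]:
        """Return a list of human-readable issues; an empty list means the draft looks complete."""
        issues = []
        if not self.name.strip():
            issues.append("name is empty")
        if self.metric_type not in ("quantitative", "qualitative"):
            issues.append("metric_type must be 'quantitative' or 'qualitative'")
        if self.scope not in ("turn", "conversation"):
            issues.append("scope must be 'turn' or 'conversation'")
        if self.metric_type == "quantitative":
            if self.score_range is None:
                issues.append("quantitative metrics need a score range")
            elif self.score_range[0] >= self.score_range[1]:
                issues.append("score range must run from low to high")
        if self.metric_type == "qualitative" and len(self.labels) < 2:
            issues.append("qualitative metrics need at least two labels")
        return issues


draft = CustomMetricDraft(
    name="Context Gathering",
    description="Did the agent ask clarifying questions before recommending?",
    metric_type="quantitative",
    scope="turn",
    system_prompt="You are a strict evaluator of sales conversations.",
    user_prompt_template="Score 5 if the agent asked at least two clarifying questions...",
    score_range=(1, 5),
)

if __name__ == "__main__":
    print(draft.problems())
```

The rule requiring at least two labels for qualitative metrics is a design choice of this sketch, not a documented Arkdock constraint.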

Editing a custom metric

Click a metric row to open its detail view. Edit the name, description, or prompt and click Save. The updated prompt is used for all future evaluations. Previous evaluation results are not retroactively rescored.

Metric versions

Each time you save changes to a custom metric, a new version is created. You can view version history from the metric detail page to see what changed between runs and understand score differences over time.
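If you keep copies of your rubric text outside Arkdock, a plain text diff is a quick way to see what changed between two versions when scores shift. The sketch below uses Python's standard `difflib` module. The two rubric strings and the `diff_versions` helper are made up for illustration and do not reflect how Arkdock stores or displays versions.

```python
import difflib


def diff_versions(old: str, new: str) -> list[str]:
    """Return a unified diff between two rubric texts, one entry per diff line."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="v1",
            tofile="v2",
            lineterm="",  # inputs have no trailing newlines, so add none to headers
        )
    )


# Illustrative rubric texts (hypothetical versions of the same metric).
v1 = (
    "Score 5 if the agent asked at least two clarifying questions.\n"
    "Score 3 if the agent asked one clarifying question."
)
v2 = (
    "Score 5 if the agent asked at least two clarifying questions.\n"
    "Score 3 if the agent asked exactly one clarifying question."
)

changes = diff_versions(v1, v2)

if __name__ == "__main__":
    print("\n".join(changes))
```

Lines starting with `-` come from the older version and lines starting with `+` from the newer one, which makes small wording changes in score boundaries easy to spot.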

Deleting a custom metric

Open the metric detail page and use the delete action. Deleting a metric removes it from future evaluations. Historical evaluation results that used this metric remain unchanged.

Metric Alignment

When your team’s annotations consistently disagree with LLM judge scores, Metric Alignment can automatically distill those disagreements into a refined rubric. See Metric Alignment for the full workflow.

Selecting metrics for an evaluation

When creating an evaluation, the Metrics selector lists all available metrics grouped by type:
  • Custom — your organization’s custom metrics, shown first.
  • Built-in — the seven standard Arkdock metrics.
Use the search input to find a specific metric by name. Check individual metrics or use Select All / Deselect All to quickly configure the set. At least one metric must be selected before an evaluation can run.
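The selection rules described in this page (custom metrics listed first, Goal Completion always included) can be summarized in a few lines of code. This is a rough sketch of the behavior as documented, not Arkdock's implementation. The `resolve_selection` function and the constant names are hypothetical.

```python
# The seven built-in metrics, in the order they appear in the table above.
BUILT_IN = [
    "Helpfulness",
    "Coherence",
    "Relevance",
    "Verbosity",
    "Faithfulness",
    "Goal Completion",
    "Agent Behavior Failure",
]

# Documented as always required and not deselectable.
REQUIRED = {"Goal Completion"}


def resolve_selection(checked: list[str], custom: list[str]) -> list[str]:
    """Return the metrics an evaluation would score, custom metrics first.

    `checked` is what the user ticked; `custom` is the organization's custom
    metric names. Required metrics are added even if they were not ticked.
    """
    available = list(custom) + BUILT_IN
    unknown = [m for m in checked if m not in available]
    if unknown:
        raise ValueError(f"unknown metrics: {unknown}")
    selected = set(checked) | REQUIRED
    # Preserve display order: custom metrics, then built-ins.
    return [m for m in available if m in selected]


if __name__ == "__main__":
    print(resolve_selection(["Helpfulness", "Policy Compliance"], ["Policy Compliance"]))
```

Under these assumptions, an empty selection still resolves to Goal Completion alone, because that metric is added automatically.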

FAQ

Start with the built-in metrics that are most relevant to your agent’s purpose. For a customer support agent, Helpfulness, Goal Completion, and Faithfulness are usually the most informative. Add custom metrics for behaviors specific to your use case.
Yes. Mix and match freely. Each metric is scored independently.
Yes. Different models interpret scoring rubrics differently. For consistency across evaluation runs, use the same judge and model. If you change judges, treat the results as a separate baseline rather than a direct comparison to prior runs.
Be as specific as possible about the score boundaries. A vague prompt like “Rate how professional the agent sounds” produces inconsistent scores. A prompt that defines what a 1, 3, and 5 look like will produce much more reliable results.
There is no hard limit on custom metrics. Practically, evaluations with more than 10-12 metrics can become harder to interpret. Focus on the 3-5 metrics that best capture the behaviors you care about.