Skip to main content

Getting started

Most teams connect an agent and run their first simulation within minutes. Point Arklex at your agent’s endpoint, define a scenario, and run it — no setup or scripting required.
No. Arklex calls your agent over HTTP. As long as your agent exposes a compatible endpoint, no changes are needed in your codebase.
Arklex supports two integration types: Chat Completions (any endpoint following the OpenAI /chat/completions schema) and A2A (Agent-to-Agent protocol). Most agents built on popular frameworks — LangChain, CrewAI, the OpenAI Agents SDK, or custom code — connect via the Chat Completions endpoint with no code changes.

Simulations and evaluations

A simulation is the execution layer: it runs scenarios against your agent and produces conversation transcripts. An evaluation is the scoring layer: it takes one or more completed simulations and scores the transcripts using an LLM judge.
Multi-turn. Each simulated user follows a persona and goal across a full conversation, producing transcripts that reflect realistic interaction patterns rather than one-shot prompts.
Manual testing is slow, subjective, and hard to repeat. Arklex runs the same scenarios consistently across every agent version, scores them with a defined rubric, and keeps a record you can compare over time — so you catch regressions instead of rediscovering them in production.

Metrics

Seven built-in metrics cover general agent quality, spanning both quantitative and qualitative dimensions. You can also define custom metrics for behaviors specific to your domain.
Write a scoring prompt in plain language describing what good and bad look like on a 1–5 scale. The LLM judge applies it from the next evaluation onward. Custom metrics are versioned, reusable across evaluations, and mix freely with built-ins.
The judge is a starting point, not the final word. When your team disagrees with a score, reviewers add their own in the Annotations tab. The Calibration tab then shows agreement rates per metric and surfaces common disagreements, giving you the evidence to refine a metric’s prompt and bring automated scoring in line with human judgment over time.

Team workflow and security

Yes. Multiple reviewers can annotate turns independently. Admins can toggle to a view showing all reviewers’ annotations alongside the auto-evaluation scores and the resolved values used for calibration.
Header values — typically API keys and auth tokens — are encrypted at rest. The platform displays masked values and never returns the raw secret after it’s saved.