Getting started
How long does it take to run my first simulation?
How long does it take to run my first simulation?
Most teams connect an agent and run their first simulation within minutes. Point Arklex at your agent’s endpoint, define a scenario, and run it — no setup or scripting required.
Does Arklex require changes to my agent code?
Does Arklex require changes to my agent code?
No. Arklex calls your agent over HTTP. As long as your agent exposes a compatible endpoint, no changes are needed in your codebase.
What agent types does Arklex support?
What agent types does Arklex support?
Arklex supports two integration types: Chat Completions (any endpoint following the OpenAI
/chat/completions schema) and A2A (Agent-to-Agent protocol). Most agents built on popular frameworks — LangChain, CrewAI, the OpenAI Agents SDK, or custom code — connect via the Chat Completions endpoint with no code changes.Simulations and evaluations
What's the difference between a simulation and an evaluation?
What's the difference between a simulation and an evaluation?
A simulation is the execution layer: it runs scenarios against your agent and produces conversation transcripts. An evaluation is the scoring layer: it takes one or more completed simulations and scores the transcripts using an LLM judge.
Are simulations single-turn or multi-turn?
Are simulations single-turn or multi-turn?
Multi-turn. Each simulated user follows a persona and goal across a full conversation, producing transcripts that reflect realistic interaction patterns rather than one-shot prompts.
How is this different from testing my agent manually?
How is this different from testing my agent manually?
Manual testing is slow, subjective, and hard to repeat. Arklex runs the same scenarios consistently across every agent version, scores them with a defined rubric, and keeps a record you can compare over time — so you catch regressions instead of rediscovering them in production.
Metrics
What metrics does Arklex measure?
What metrics does Arklex measure?
Seven built-in metrics cover general agent quality, spanning both quantitative and qualitative dimensions. You can also define custom metrics for behaviors specific to your domain.
How do custom metrics work?
How do custom metrics work?
Write a scoring prompt in plain language describing what good and bad look like on a 1–5 scale. The LLM judge applies it from the next evaluation onward. Custom metrics are versioned, reusable across evaluations, and mix freely with built-ins.
Can I trust the LLM judge's scores?
Can I trust the LLM judge's scores?
The judge is a starting point, not the final word. When your team disagrees with a score, reviewers add their own in the Annotations tab. The Calibration tab then shows agreement rates per metric and surfaces common disagreements, giving you the evidence to refine a metric’s prompt and bring automated scoring in line with human judgment over time.
Team workflow and security
Can multiple team members annotate the same evaluation?
Can multiple team members annotate the same evaluation?
Yes. Multiple reviewers can annotate turns independently. Admins can toggle to a view showing all reviewers’ annotations alongside the auto-evaluation scores and the resolved values used for calibration.
How are API keys and headers stored?
How are API keys and headers stored?
Header values — typically API keys and auth tokens — are encrypted at rest. The platform displays masked values and never returns the raw secret after it’s saved.