4. Comprehensive Guide to Testing
While tracing is powerful on its own, the testing suite is what turns that data into actionable insights.
The Three-Phase Testing Approach
The framework runs tests in three distinct, optional phases. This allows you to, for example, run a massive batch of questions overnight (Phase 1) and then run the expensive LLM evaluations the next day (Phase 2).
Phase 1: Send Questions (--phase1)
- What it does: Reads the dataset_path CSV, iterates through each question, and calls your chatbot’s API via the configured client.
- Result: Your Recorder (e.g., DynamoDB) is populated with trace data for all test runs. A run_map.csv is created to link questions to session_ids.

Phase 2: Evaluate Performance (--phase2)
- What it does: Reads the run_map.csv, retrieves the trace data for each run from the Recorder, and uses the configured llm_provider to evaluate the quality of each step and the final answer.
- Result: Generates step_performance.json, final_answer_performance.json, and the AI-generated performance_summary.txt.

Phase 3: Analyze Latency (--phase3)
- What it does: Reads the run_map.csv, retrieves the trace data, and calculates the duration of each step and the total run time.
- Result: Generates latency_per_run.json and average_latencies.json.
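The Phase 3 calculation can be pictured with a small sketch. The trace layout used here (a list of steps, each with name, start, and end fields holding ISO 8601 timestamps) and the function name step_durations are assumptions for illustration only; this guide does not show the framework's actual trace schema.

```python
from datetime import datetime


def step_durations(steps):
    """Return ({step name: seconds}, total seconds) for one traced run.

    `steps` uses a hypothetical layout: dicts with 'name', 'start', and
    'end' keys, where the timestamps are ISO 8601 strings.
    """
    durations = {}
    for step in steps:
        start = datetime.fromisoformat(step["start"])
        end = datetime.fromisoformat(step["end"])
        durations[step["name"]] = (end - start).total_seconds()

    # Total run time spans the earliest start to the latest end.
    first = min(datetime.fromisoformat(s["start"]) for s in steps)
    last = max(datetime.fromisoformat(s["end"]) for s in steps)
    return durations, (last - first).total_seconds()


# Made-up trace data for a single run.
example_run = [
    {"name": "authorize",
     "start": "2024-01-01T12:00:00.000000",
     "end": "2024-01-01T12:00:01.500000"},
    {"name": "route_request",
     "start": "2024-01-01T12:00:01.500000",
     "end": "2024-01-01T12:00:02.000000"},
]
per_step, total = step_durations(example_run)
```

Measuring the total from the earliest start to the latest end, rather than summing the steps, keeps the figure meaningful if steps overlap or have gaps between them.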
The Command-Line Interface (CLI) In-Depth
The chatbot-tester command is your primary tool for running tests.
- chatbot-tester init <path>: Initializes a new project structure at the given path.
- chatbot-tester run [options]: Executes the test runs.
Common run Options:
| Flag | Description | Example |
|---|---|---|
| | Path to your main configuration file. | |
| --full-run | Executes all three phases sequentially. The easiest way to run a test. | |
| --phase1 | Runs only Phase 1 (sending questions). | |
| --phase2 | Runs only Phase 2 (performance evaluation). | |
| --phase3 | Runs only Phase 3 (latency analysis). | |
| | Specifies a custom ID for the run folder. Defaults to a timestamp. | |
Configuration (test_config.yaml) In-Depth
This file is the heart of your test setup.
General Settings
dataset_path: "data/test_questions.csv"
results_dir: "results"
- dataset_path: Path to your CSV file of test questions. It must contain a model_question column and an optional model_answer column for quality comparison.
- results_dir: The root directory where all run reports will be saved.
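Before starting a long Phase 1 batch, it can help to confirm that the dataset has the expected columns. The following is a minimal sketch using only the standard library; load_questions is a hypothetical helper, not part of the framework, and the sample rows are made up.

```python
import csv
import io

REQUIRED_COLUMN = "model_question"
OPTIONAL_COLUMN = "model_answer"


def load_questions(csv_file):
    """Read test rows from an open CSV file, checking the required column."""
    reader = csv.DictReader(csv_file)
    if REQUIRED_COLUMN not in (reader.fieldnames or []):
        raise ValueError(f"dataset is missing the '{REQUIRED_COLUMN}' column")
    rows = []
    for row in reader:
        rows.append({
            "question": row[REQUIRED_COLUMN],
            # The model answer is optional, so an absent or empty cell
            # becomes None.
            "model_answer": row.get(OPTIONAL_COLUMN) or None,
        })
    return rows


# In-memory stand-in for the file at dataset_path.
sample = io.StringIO(
    "model_question,model_answer\n"
    "What does my policy cover?,It covers fire and theft.\n"
    "How do I file a claim?,\n"
)
questions = load_questions(sample)
```

For a real dataset, the same function could be called with open("data/test_questions.csv", newline="") in place of the in-memory sample.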
Client Configuration
This section defines how the framework communicates with your chatbot.
client:
  type: "api"
  settings:
    url: "http://127.0.0.1:5000/invoke"
    method: "POST"
    headers:
      "Content-Type": "application/json"
      "x-api-key": "YOUR_API_KEY"
    body_template: '{ "question": "{question}", "session_id": "{session_id}", "trace_config": {trace_config} }'
- type: Currently, only api (for HTTP requests) is supported.
- settings:
  - url, method, headers: Standard HTTP request parameters.
  - body_template: A string that defines the JSON payload. The framework replaces these placeholders:
    - {question}: The question from the CSV.
    - {session_id}: A unique UUID generated for the run.
    - {trace_config}: A JSON object containing the tracing.recorder configuration. This is how the framework tells your app how to record the trace for this specific run.
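To see what the rendered payload looks like, the placeholder substitution can be sketched as follows. This is an illustration under stated assumptions, not the framework's actual implementation: render_body is a hypothetical helper, and the shape of trace_config (the recorder's type and settings) is inferred from the Tracing Configuration section. Plain string replacement is used because the template contains literal JSON braces, which str.format would try to interpret as fields.

```python
import json
import uuid

BODY_TEMPLATE = (
    '{ "question": "{question}", "session_id": "{session_id}", '
    '"trace_config": {trace_config} }'
)


def render_body(template, question, session_id, trace_config):
    """Fill the three placeholders and return the JSON payload as a string."""
    # json.dumps(...)[1:-1] escapes quotes and backslashes in the question
    # and drops the outer quotes, which the template already supplies.
    escaped_question = json.dumps(question)[1:-1]
    return (
        template
        # {trace_config} is unquoted in the template, so it receives a
        # serialized JSON object rather than a string.
        .replace("{trace_config}", json.dumps(trace_config))
        .replace("{session_id}", session_id)
        # The question is substituted last so that placeholder-like text
        # inside it is never expanded.
        .replace("{question}", escaped_question)
    )


# Assumed shape of the recorder configuration sent to the application.
trace_config = {
    "type": "local_json",
    "settings": {"filepath": "results/traces.json"},
}
body = render_body(
    BODY_TEMPLATE,
    'What is a "deductible"?',
    str(uuid.uuid4()),
    trace_config,
)
payload = json.loads(body)  # parses cleanly, so the body is valid JSON
```

On the application side, the trace_config field then arrives as an ordinary nested object that can be used to select and configure a recorder.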
Tracing Configuration
This section is passed to your application inside the {trace_config} placeholder.
tracing:
  recorder:
    type: "local_json" # or "dynamodb"
    settings:
      filepath: "results/traces.json"
      # For dynamodb:
      # table_name: "my-traces"
      # region: "us-east-1"
Evaluation Configuration
This section controls the performance evaluation phase (Phase 2).
evaluation:
  prompts_path: "configs/prompts.py"
  workflow_description: >
    A multi-agent chatbot for an insurance company. It first authorizes the user,
    then routes their question to either a Commercial or Property insurance agent.
  llm_provider:
    type: "bedrock" # Options: 'claude', 'openai', 'gemini', 'bedrock'
    settings:
      # Settings vary by provider
      region: "us-east-1"
      model: "anthropic.claude-3-sonnet-20240229-v1:0"
- prompts_path: Path to your Python file containing custom evaluation logic.
- workflow_description: A high-level description of your chatbot’s purpose. This is given to the evaluator LLM as context for its judgments.
- llm_provider: Defines which LLM to use for evaluation.
  - type: claude, openai, gemini, or bedrock.
  - settings:
    - For claude/openai/gemini: requires model and an API key (set in config or as an environment variable like ANTHROPIC_API_KEY).
    - For bedrock: requires model and region. IAM credentials are used automatically.
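The per-provider requirements above can be checked before launching Phase 2. The sketch below is a hypothetical pre-flight check, not a framework feature; it covers only the model and region settings, because this guide does not name the config key used for API keys.

```python
# Required settings per provider type, as described in this section.
REQUIRED_SETTINGS = {
    "claude": ("model",),
    "openai": ("model",),
    "gemini": ("model",),
    "bedrock": ("model", "region"),
}


def missing_settings(llm_provider):
    """Return the required setting names absent from an llm_provider block."""
    provider_type = llm_provider.get("type")
    if provider_type not in REQUIRED_SETTINGS:
        raise ValueError(f"unsupported llm_provider type: {provider_type!r}")
    settings = llm_provider.get("settings") or {}
    return [
        key for key in REQUIRED_SETTINGS[provider_type]
        if not settings.get(key)
    ]


# The llm_provider block as it would look after parsing the YAML above.
bedrock_provider = {
    "type": "bedrock",
    "settings": {
        "region": "us-east-1",
        "model": "anthropic.claude-3-sonnet-20240229-v1:0",
    },
}
incomplete_provider = {"type": "bedrock", "settings": {"model": "some-model"}}

ok = missing_settings(bedrock_provider)
problems = missing_settings(incomplete_provider)
```

An empty list means the block has everything this section lists for that provider; any names returned are the settings still to be filled in.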
Customizing Evaluations (prompts.py)
This file gives you direct control over the LLM’s evaluation criteria.
- CUSTOM_POLICIES: A list of strings defining your chatbot’s rules. The LLM will check if the final answer violates any of these.

  CUSTOM_POLICIES = [
      "The response must be polite and professional at all times.",
      "The response must not suggest any medical, legal, or financial advice.",
      "If the chatbot cannot find an answer, it should explicitly state that.",
  ]

- FINAL_ANSWER_EVALUATION_PROMPT: The master prompt for judging the final user-facing answer. It instructs the LLM to score the answer on Coherence, Safety, Policy Adherence, and Quality vs. a model answer.
- STEP_EVALUATION_PROMPT: The prompt used to evaluate each individual traced step.
- DEEP_DIVE_SUMMARY_PROMPT: The prompt used to generate the final qualitative summary report.
You can edit these prompts to tailor the evaluation to your specific needs.
Understanding the Reports
After a --full-run, your results/<run_id> folder will contain:
- run_map.csv: Maps each question to its unique session_id.
- traces.json (if using local_json recorder): The raw trace data.
- Performance Reports:
  - step_performance.json: The detailed LLM evaluation for every single traced step across all runs.
  - final_answer_performance.json: The detailed LLM evaluation of the final answer for each run.
  - performance_summary.txt: This is often the most valuable report. An AI-generated qualitative summary that includes:
    - An executive summary.
    - Key findings (e.g., “The ‘route_request’ step consistently fails on ambiguous inputs.”).
    - A step-by-step analysis of common failure patterns.
    - Actionable recommendations for your development team.
- Latency Reports:
  - latency_per_run.json: A step-by-step latency breakdown for each individual test run.
  - average_latencies.json: The average latency for each step across all runs, helping you identify systemic bottlenecks.
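The relationship between the two latency reports can be sketched in a few lines. The per-run layout assumed here (a mapping from session_id to step name to seconds) is hypothetical, since this guide does not document the exact JSON schema of latency_per_run.json, and the numbers are made up.

```python
from collections import defaultdict
from statistics import mean


def average_latencies(runs):
    """Average each step's latency across runs.

    `runs` is assumed to map session_id -> {step name: seconds}.
    """
    by_step = defaultdict(list)
    for steps in runs.values():
        for name, seconds in steps.items():
            by_step[name].append(seconds)
    return {name: mean(values) for name, values in by_step.items()}


# Made-up per-run latencies for two test runs.
latency_per_run = {
    "session-a": {"authorize": 1.0, "route_request": 0.5},
    "session-b": {"authorize": 2.0, "route_request": 1.5},
}
averages = average_latencies(latency_per_run)

# The step with the highest average is the first place to look for a
# systemic bottleneck.
slowest_step = max(averages, key=averages.get)
```

A step that is slow in only one run points to an outlier in latency_per_run.json, while a step that is slow on average points to a systemic cost.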