## 1. Introduction ### What is the Chatbot Test Framework? The Chatbot Test Framework is a powerful, open-source tool designed for end-to-end testing of conversational AI applications. It provides a structured way to measure and improve your chatbot's **quality**, **safety**, **performance**, and **latency**. At its core, the framework helps you answer critical questions: * Does my chatbot give correct and relevant answers? * Does it follow my company's safety and style policies? * Which parts of my chatbot's internal logic are slow? * How does performance change after I deploy new code? It achieves this by separating the **test runner** from your **chatbot application**, allowing you to test any Python-based chatbot with an API endpoint. ### Why Use It? * **Ensure Reliability:** Automate testing to catch regressions and ensure consistent quality. * **Improve Safety:** Use LLM-powered evaluation to check for harmful content and policy violations. * **Optimize Performance:** Pinpoint bottlenecks in your chatbot's workflow with detailed latency reports. * **Deep Insights:** Go beyond simple input/output tests. The tracing mechanism gives you a step-by-step view of your chatbot's internal decision-making process. * **Flexible & Pluggable:** Works with any chatbot architecture and allows you to use different LLMs (Claude, GPT, Gemini, Bedrock) for evaluation and store data where you want (local files, DynamoDB). ### Core Concepts * **Tracer:** An object you integrate into your chatbot's code. Its `@trace` decorator wraps key functions to capture their inputs, outputs, status, and timings. * **Recorder:** The storage backend for trace data. The `Tracer` sends its data to a `Recorder` (e.g., `DynamoDBRecorder` or `LocalJsonRecorder`). * **Test Runner:** The command-line tool (`chatbot-tester`) that orchestrates the entire testing process. It sends questions, triggers evaluations, and generates reports. ### High-Level Workflow The framework operates in a clear, decoupled cycle: 1. **You Instrument Your App:** You add the `@trace` decorator to key functions in your chatbot's code. 2. **Phase 1: Send Questions:** The `Test Runner` reads a CSV of questions and sends them one by one to your chatbot's API endpoint. 3. **Tracing in Action:** As your chatbot processes a request, the `@trace` decorators capture data and send it to the configured `Recorder` (e.g., DynamoDB). 4. **Phase 2: Evaluate Performance:** The `Test Runner` retrieves the trace data from the `Recorder` and uses a powerful LLM to evaluate each step for quality, relevance, and policy adherence. 5. **Phase 3: Analyze Latency:** The `Test Runner` uses the same trace data to calculate the duration of each step and the total end-to-end latency. 6. **Reporting:** The framework generates a folder of detailed reports summarizing all findings. --- ## 2. Getting Started: A Quick Tour Let's get a test running in under 5 minutes. ### Prerequisites * Python 3.9+ * A running chatbot application with an HTTP API endpoint. We will create a mock one for this guide. ### Installation Install the framework from PyPI in your terminal: ```bash pip install chatbot-test-framework ``` ### Step 1: Initialize Your Project Create a directory for your tests and run the `init` command. ```bash mkdir my-first-tests cd my-first-tests chatbot-tester init . ``` This creates the essential project structure: ``` . ├── configs/ │ ├── prompts.py # Your custom evaluation policies │ └── test_config.yaml # Main test configuration ├── data/ │ └── test_questions.csv # Your test questions └── results/ └── (Reports will be generated here) ``` ### Step 2: Instrument a Simple Chatbot Create a file named `mock_app.py` and paste the following Flask application code. This simulates a simple, multi-step chatbot. ```python # mock_app.py import time from flask import Flask, request, jsonify from chatbot_test_framework import Tracer from chatbot_test_framework.recorders import LocalJsonRecorder app = Flask(__name__) class MockBot: def __init__(self, tracer): self.tracer = tracer @property def route_request(self): @self.tracer.trace(step_name="route_request") def _route(question: str): time.sleep(0.2) if "bill" in question.lower(): return "billing_agent" return "general_agent" return _route @property def execute_agent(self): @self.tracer.trace(step_name="execute_agent") def _execute(agent: str): time.sleep(0.5) if agent == "billing_agent": return {"response": "Your last bill was $50."} return {"response": "I can help with general questions."} return _execute @app.route("/invoke", methods=['POST']) def invoke(): data = request.get_json() question, session_id = data['question'], data['session_id'] trace_config = data.get('trace_config', {}) # The framework tells the app how to trace this run recorder = LocalJsonRecorder(trace_config.get('settings', {})) tracer = Tracer(recorder, run_id=session_id) bot = MockBot(tracer) agent = bot.route_request(question=question) result = bot.execute_agent(agent=agent) return jsonify({"final_answer": result['response']}) if __name__ == '__main__': app.run(port=5000) ``` ### Step 3: Configure Your Test Edit `configs/test_config.yaml` to point to our mock app and use a local recorder. ```yaml # configs/test_config.yaml dataset_path: "data/test_questions.csv" results_dir: "results" client: type: "api" settings: url: "http://127.0.0.1:5000/invoke" method: "POST" headers: "Content-Type": "application/json" body_template: '{ "question": "{question}", "session_id": "{session_id}", "trace_config": {trace_config} }' tracing: recorder: type: "local_json" settings: filepath: "results/traces.json" evaluation: prompts_path: "configs/prompts.py" workflow_description: "A simple mock chatbot that routes to a billing or general agent." llm_provider: type: "claude" # Or "openai", "gemini" settings: model: "claude-3-sonnet-20240229" # API key should be set as an environment variable (e.g., ANTHROPIC_API_KEY) ``` ### Step 4: Run Your First Test 1. **Start your chatbot app:** ```bash python mock_app.py ``` 2. **In a new terminal**, run the framework's test command: ```bash # Make sure your LLM provider API key is set as an environment variable! # export ANTHROPIC_API_KEY="sk-..." chatbot-tester run --full-run ``` ### Step 5: Check the Results Look inside the `results/` directory. You'll find a new folder named with a timestamp (e.g., `run_20231027_103000`). Inside, you'll find: * `traces.json`: The raw data captured by the `LocalJsonRecorder`. * `performance_summary.txt`: An AI-generated analysis of your bot's performance. * `average_latencies.json`: A breakdown of how long each step took on average. * ...and other detailed JSON reports. Congratulations! You've completed your first end-to-end test.