1. Introduction

What is the Chatbot Test Framework?

The Chatbot Test Framework is an open-source tool for end-to-end testing of conversational AI applications. It provides a structured way to measure your chatbot’s answer quality, safety, and latency, and to track how they change as the bot evolves.

At its core, the framework helps you answer critical questions:

  • Does my chatbot give correct and relevant answers?

  • Does it follow my company’s safety and style policies?

  • Which parts of my chatbot’s internal logic are slow?

  • How does performance change after I deploy new code?

It achieves this by separating the test runner from your chatbot application, allowing you to test any Python-based chatbot with an API endpoint.

Why Use It?

  • Ensure Reliability: Automate testing to catch regressions and ensure consistent quality.

  • Improve Safety: Use LLM-powered evaluation to check for harmful content and policy violations.

  • Optimize Performance: Pinpoint bottlenecks in your chatbot’s workflow with detailed latency reports.

  • Deep Insights: Go beyond simple input/output tests. The tracing mechanism gives you a step-by-step view of your chatbot’s internal decision-making process.

  • Flexible & Pluggable: Works with any chatbot that exposes an API endpoint, lets you choose the LLM provider used for evaluation (Claude, GPT, Gemini, Bedrock), and stores trace data where you want (local files, DynamoDB).

Core Concepts

  • Tracer: An object you integrate into your chatbot’s code. Its @trace decorator wraps key functions to capture their inputs, outputs, status, and timings.

  • Recorder: The storage backend for trace data. The Tracer sends its data to a Recorder (e.g., DynamoDBRecorder or LocalJsonRecorder).

  • Test Runner: The command-line tool (chatbot-tester) that orchestrates the entire testing process. It sends questions, triggers evaluations, and generates reports.
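
To make the Tracer and Recorder roles concrete, here is a minimal, self-contained sketch of how a trace decorator could capture a step and hand it to a storage backend. The names SketchTracer and InMemoryRecorder, the record() method, and the fields of each record are hypothetical and chosen only for illustration; this is not the framework's actual implementation or API.

```python
import functools
import time


class InMemoryRecorder:
    """Hypothetical recorder that keeps trace records in a list."""

    def __init__(self):
        self.records = []

    def record(self, entry: dict) -> None:
        self.records.append(entry)


class SketchTracer:
    """Hypothetical tracer, for illustration only."""

    def __init__(self, recorder, run_id: str):
        self.recorder = recorder
        self.run_id = run_id

    def trace(self, step_name: str):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                entry = {
                    "run_id": self.run_id,
                    "step_name": step_name,
                    "inputs": {"args": args, "kwargs": kwargs},
                    "start_time": time.time(),
                }
                try:
                    result = func(*args, **kwargs)
                    entry.update(status="success", outputs=result)
                    return result
                except Exception as exc:
                    # Record the failure, then let the caller see the error.
                    entry.update(status="error", outputs=repr(exc))
                    raise
                finally:
                    # Runs on both success and failure, so every call is recorded.
                    entry["end_time"] = time.time()
                    self.recorder.record(entry)
            return wrapper
        return decorator


recorder = InMemoryRecorder()
tracer = SketchTracer(recorder, run_id="demo-session")


@tracer.trace(step_name="route_request")
def route_request(question: str) -> str:
    return "billing_agent" if "bill" in question.lower() else "general_agent"


agent = route_request(question="Where is my bill?")
print(agent, recorder.records[0]["status"])
```

The point of the split is that the decorator only knows how to describe a call, while the recorder only knows how to store a record, so either side can be swapped without touching the other.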

High-Level Workflow

The framework operates in a clear, decoupled cycle:

  1. You Instrument Your App: You add the @trace decorator to key functions in your chatbot’s code.

  2. Phase 1: Send Questions: The Test Runner reads a CSV of questions and sends them one by one to your chatbot’s API endpoint.

  3. Tracing in Action: As your chatbot processes a request, the @trace decorators capture data and send it to the configured Recorder (e.g., DynamoDB).

  4. Phase 2: Evaluate Performance: The Test Runner retrieves the trace data from the Recorder and uses the LLM provider you configured to evaluate each step for quality, relevance, and policy adherence.

  5. Phase 3: Analyze Latency: The Test Runner uses the same trace data to calculate the duration of each step and the total end-to-end latency.

  6. Reporting: The framework generates a folder of detailed reports summarizing all findings.
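
The latency phase of the workflow above amounts to simple arithmetic over trace records. The sketch below shows one way it could be done, assuming each record carries run_id, step_name, start_time, and end_time fields. That record shape is an assumption made for this example; the framework's real trace schema may differ.

```python
from collections import defaultdict
from statistics import mean


def average_step_latencies(traces):
    """Average duration per step name, in seconds."""
    durations = defaultdict(list)
    for t in traces:
        durations[t["step_name"]].append(t["end_time"] - t["start_time"])
    return {step: mean(values) for step, values in durations.items()}


def end_to_end_latencies(traces):
    """Time from the first step's start to the last step's end, per run."""
    bounds = {}
    for t in traces:
        start, end = bounds.get(t["run_id"], (t["start_time"], t["end_time"]))
        bounds[t["run_id"]] = (min(start, t["start_time"]), max(end, t["end_time"]))
    return {run_id: end - start for run_id, (start, end) in bounds.items()}


# Illustrative sample records (hypothetical values, not real measurements).
traces = [
    {"run_id": "s1", "step_name": "route_request", "start_time": 0.0, "end_time": 0.25},
    {"run_id": "s1", "step_name": "execute_agent", "start_time": 0.25, "end_time": 0.75},
    {"run_id": "s2", "step_name": "route_request", "start_time": 10.0, "end_time": 10.25},
    {"run_id": "s2", "step_name": "execute_agent", "start_time": 10.25, "end_time": 10.75},
]

step_averages = average_step_latencies(traces)
run_totals = end_to_end_latencies(traces)
print(step_averages)
print(run_totals)
```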


2. Getting Started: A Quick Tour

Let’s get a test running in under 5 minutes.

Prerequisites

  • Python 3.9+

  • A running chatbot application with an HTTP API endpoint. We will create a mock one for this guide.

Installation

Install the framework from PyPI in your terminal:

pip install chatbot-test-framework

Step 1: Initialize Your Project

Create a directory for your tests and run the init command.

mkdir my-first-tests
cd my-first-tests
chatbot-tester init .

This creates the essential project structure:

.
├── configs/
│   ├── prompts.py             # Your custom evaluation policies
│   └── test_config.yaml       # Main test configuration
├── data/
│   └── test_questions.csv     # Your test questions
└── results/
    └── (Reports will be generated here)

Step 2: Instrument a Simple Chatbot

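Create a file named mock_app.py and paste the following Flask application code. It simulates a simple, multi-step chatbot. The app depends on Flask, so install it first (pip install flask) if it is not already in your environment.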

# mock_app.py
import time
from flask import Flask, request, jsonify
from chatbot_test_framework import Tracer
from chatbot_test_framework.recorders import LocalJsonRecorder

app = Flask(__name__)

class MockBot:
    def __init__(self, tracer):
        # The tracer is created per request (see invoke below) and passed in.
        self.tracer = tracer

    # Each step is a property that returns a freshly decorated function,
    # because the tracer instance is only known at request time.
    @property
    def route_request(self):
        @self.tracer.trace(step_name="route_request")
        def _route(question: str):
            time.sleep(0.2)  # Simulate routing latency
            if "bill" in question.lower():
                return "billing_agent"
            return "general_agent"
        return _route

    @property
    def execute_agent(self):
        @self.tracer.trace(step_name="execute_agent")
        def _execute(agent: str):
            time.sleep(0.5)  # Simulate the agent doing its work
            if agent == "billing_agent":
                return {"response": "Your last bill was $50."}
            return {"response": "I can help with general questions."}
        return _execute

@app.route("/invoke", methods=['POST'])
def invoke():
    data = request.get_json()
    question, session_id = data['question'], data['session_id']
    trace_config = data.get('trace_config', {})
    
    # The framework tells the app how to trace this run
    recorder = LocalJsonRecorder(trace_config.get('settings', {}))
    tracer = Tracer(recorder, run_id=session_id)
    
    bot = MockBot(tracer)
    agent = bot.route_request(question=question)
    result = bot.execute_agent(agent=agent)
    
    return jsonify({"final_answer": result['response']})

if __name__ == '__main__':
    app.run(port=5000)
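
Before involving the test runner, it can help to smoke-test the endpoint by hand. The sketch below builds a request body with the three fields the mock app reads (question, session_id, trace_config) and posts it using only the standard library. The exact trace_config the Test Runner sends is not shown in this guide, so the shape used here (a settings dict with a filepath) is an assumption, and build_payload and ask are hypothetical helper names.

```python
import json
import urllib.request
import uuid

INVOKE_URL = "http://127.0.0.1:5000/invoke"


def build_payload(question: str, traces_path: str = "results/traces.json") -> dict:
    """Build the JSON body that mock_app.py's /invoke handler expects."""
    return {
        "question": question,
        "session_id": str(uuid.uuid4()),
        # Assumed shape: the mock app only reads trace_config["settings"].
        "trace_config": {"settings": {"filepath": traces_path}},
    }


def ask(question: str) -> dict:
    """POST one question to the running mock app and return its JSON reply."""
    body = json.dumps(build_payload(question)).encode("utf-8")
    request = urllib.request.Request(
        INVOKE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


payload = build_payload("What was my last bill?")
print(json.dumps(payload, indent=2))

# With mock_app.py running in another terminal, uncomment to send the request:
# print(ask("What was my last bill?"))
```

If the app is wired correctly, a question containing "bill" should come back with the billing agent's answer in the final_answer field.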

Step 3: Configure Your Test

Edit configs/test_config.yaml to point to our mock app and use a local recorder.

# configs/test_config.yaml
dataset_path: "data/test_questions.csv"
results_dir: "results"

client:
  type: "api"
  settings:
    url: "http://127.0.0.1:5000/invoke"
    method: "POST"
    headers:
      "Content-Type": "application/json"
    body_template: '{ "question": "{question}", "session_id": "{session_id}", "trace_config": {trace_config} }'

tracing:
  recorder:
    type: "local_json"
    settings:
      filepath: "results/traces.json"

evaluation:
  prompts_path: "configs/prompts.py"
  workflow_description: "A simple mock chatbot that routes to a billing or general agent."
  llm_provider:
    type: "claude" # Or "openai", "gemini"
    settings:
      model: "claude-3-sonnet-20240229"
      # API key should be set as an environment variable (e.g., ANTHROPIC_API_KEY)
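
The body_template above mixes literal JSON braces with {placeholder} fields, so it cannot simply be passed to str.format. How the framework renders it internally is not described in this guide; the sketch below is one plausible reading, shown only to clarify what the placeholders mean. The render_body function and the trace_config shape are assumptions.

```python
import json
import re

BODY_TEMPLATE = (
    '{ "question": "{question}", "session_id": "{session_id}", '
    '"trace_config": {trace_config} }'
)


def render_body(template: str, question: str, session_id: str, trace_config: dict) -> str:
    values = {
        # json.dumps escapes quotes and backslashes; [1:-1] drops the outer
        # quotes because the template already supplies them.
        "question": json.dumps(question)[1:-1],
        "session_id": json.dumps(session_id)[1:-1],
        # trace_config is unquoted in the template, so it is inserted as a JSON object.
        "trace_config": json.dumps(trace_config),
    }
    # A single regex pass means placeholder-like text inside a question
    # is never expanded a second time.
    return re.sub(r"\{(\w+)\}", lambda m: values.get(m.group(1), m.group(0)), template)


body = render_body(
    BODY_TEMPLATE,
    question='What does "billing cycle" mean?',
    session_id="demo-session",
    trace_config={"type": "local_json", "settings": {"filepath": "results/traces.json"}},
)
parsed = json.loads(body)
print(parsed)
```

Note that {question} and {session_id} sit inside quotes in the template while {trace_config} does not, which suggests the first two are substituted as strings and the last as a nested JSON object.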

Step 4: Run Your First Test

  1. Start your chatbot app:

    python mock_app.py
    
  2. In a new terminal, run the framework’s test command:

    # Make sure your LLM provider API key is set as an environment variable!
    # export ANTHROPIC_API_KEY="sk-..."
    
    chatbot-tester run --full-run
    

Step 5: Check the Results

Look inside the results/ directory for a new folder named with a timestamp (e.g., run_20231027_103000). It contains:

  • traces.json: The raw data captured by the LocalJsonRecorder.

  • performance_summary.txt: An AI-generated analysis of your bot’s performance.

  • average_latencies.json: A breakdown of how long each step took on average.

  • …and other detailed JSON reports.
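
Because the reports are plain files, they are easy to post-process. As a small sketch, the snippet below ranks steps from slowest to fastest, assuming average_latencies.json is a flat mapping of step name to average seconds. That schema is an assumption, so inspect your own file and adjust. The sample values are made up for the demo and are not real results.

```python
import json
import tempfile
from pathlib import Path


def rank_steps_by_latency(path):
    """Return (step, seconds) pairs sorted from slowest to fastest."""
    latencies = json.loads(Path(path).read_text(encoding="utf-8"))
    return sorted(latencies.items(), key=lambda item: item[1], reverse=True)


# Demo against a temporary file with illustrative numbers.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "average_latencies.json"
    sample.write_text(
        json.dumps({"route_request": 0.2, "execute_agent": 0.5}),
        encoding="utf-8",
    )
    ranked = rank_steps_by_latency(sample)

for step, seconds in ranked:
    print(f"{step}: {seconds:.3f}s")
```

To use it on a real run, pass the path to the average_latencies.json file inside your timestamped results folder.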

Congratulations! You’ve completed your first end-to-end test.