Tests
Max’s Test Run page is designed to help users perform regression testing to evaluate how changes impact the product or assistant over time. This feature allows users to assess performance, validate skill accuracy, and gather essential feedback to drive continuous improvement.
Purpose of the Test Run Page
The Test Run page is primarily used for regression testing. This helps users understand how updates or modifications to the environment affect the assistant's behavior, accuracy, and overall performance.
Running a New Test
- Building a Test: Use the questions feature to build a list of questions that represent the use cases or scenarios you want to evaluate.
- Starting a Test: Click the Run New Test button and select the desired collection of questions.
- Model Override: When running a test, users can override the model used in both the chat pipeline and the evaluation, providing flexibility to test different configurations or updates.
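As a rough illustration of these inputs, the sketch below models a test run as plain data: a collection of questions, optional expected skills, and optional model overrides. The class and field names (TestQuestion, TestRunConfig, chat_model_override, and so on) are assumptions made for the example; they are not Max's actual configuration format, which is managed through the Test Run page itself.

```python
# Hypothetical sketch of the inputs to a test run: a question collection,
# optional expected skills, and optional model overrides. All class and field
# names here are illustrative assumptions, not Max's configuration format.
from dataclasses import dataclass, field


@dataclass
class TestQuestion:
    text: str                           # question sent through the chat pipeline
    expected_skill: str | None = None   # used by the Expected Skill Assertion, if set


@dataclass
class TestRunConfig:
    collection_name: str                         # which collection of questions to run
    questions: list[TestQuestion] = field(default_factory=list)
    chat_model_override: str | None = None       # optional override for the chat pipeline model
    eval_model_override: str | None = None       # optional override for the evaluation model


config = TestRunConfig(
    collection_name="monthly-regression",
    questions=[
        TestQuestion("What were sales by region last quarter?", expected_skill="sales_breakdown"),
        TestQuestion("Summarize churn drivers for the last quarter."),
    ],
    chat_model_override="example-chat-model",    # illustrative values only
    eval_model_override="example-eval-model",
)
```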
Evaluations Performed in Each Test Run
Each test run in Max evaluates various aspects of the assistant’s performance, including skill accuracy, parameter selection, and response quality. The following evaluations are performed:
- Expected Skill Assertion: For each question, if an expected skill is provided in its configuration, a simple check validates whether that skill was used.
- Skill Choice: The language model performs a second-pass review of the selected skill and returns a Pass/Fail result based on whether the correct skill was chosen.
- Parameter Choice: The language model assesses whether the correct parameters were selected for the skill.
- Does Response Answer Query: The language model evaluates whether the assistant’s response directly answers the user’s original question.
- Faithfulness: The model checks whether the response is consistent with the facts provided, ensuring that the answer is accurate and trustworthy.
- Context Contains Enough Information: The language model reviews whether the context provided in the chat conversation was sufficient to properly answer the question.
For each of these evaluations, users can dive deeper into the diagnostic view of the answer (the underlying evaluation pattern is sketched after the list below). This view allows them to review:
- Result: Whether the evaluation passed or failed.
- Reasoning: The justification for the result.
- Original Prompt: The prompt sent to the language model that produced the result and reasoning.
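The model-based evaluations above follow a judge-style pattern: a review prompt is sent to a language model, and its reply is turned into a Pass/Fail result plus reasoning, which is what the diagnostic view surfaces. The sketch below shows that general pattern for a faithfulness-style check; the prompt wording, the call_llm placeholder, and the EvaluationResult fields are illustrative assumptions, not Max's internal implementation.

```python
# General LLM-as-judge pattern: send a review prompt to a language model and
# turn its reply into a Pass/Fail result plus reasoning. The prompt wording,
# the call_llm placeholder, and the EvaluationResult fields are illustrative
# assumptions, not Max internals.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvaluationResult:
    name: str        # which evaluation produced this result (e.g. "Faithfulness")
    result: str      # "Pass" or "Fail"
    reasoning: str   # the justification shown in the diagnostic view
    prompt: str      # the original prompt sent to the language model


def evaluate_faithfulness(answer: str, context: str,
                          call_llm: Callable[[str], str]) -> EvaluationResult:
    """Ask a judge model whether the answer is consistent with the given context."""
    prompt = (
        "You are reviewing an assistant's answer.\n"
        f"Context:\n{context}\n\n"
        f"Answer:\n{answer}\n\n"
        "Is the answer fully supported by the context? "
        "Reply with 'Pass' or 'Fail' on the first line, then a short justification."
    )
    reply = call_llm(prompt)  # call_llm stands in for any chat-completion call
    verdict, _, reasoning = reply.partition("\n")
    return EvaluationResult("Faithfulness", verdict.strip(), reasoning.strip(), prompt)
```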
Feedback and Evaluation
Feedback is a crucial part of the evaluation process:
- User Feedback: Users can provide quick feedback on the response using thumbs up or thumbs down icons.
- Administrator Feedback: In addition to user feedback, administrators can review and evaluate the assistant’s responses for further analysis.
- Automated Testing: Max supports automated nightly test runs, allowing the system to continually gather feedback and provide ongoing insights into the assistant’s performance.
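As a loose illustration only, the sketch below shows one way these three feedback channels could be tracked side by side for a single response. The record structure is an assumption made for this example, not Max's data model.

```python
# Loose illustration of tracking the three feedback channels for one response.
# The record structure is an assumption for this example, not Max's data model.
from dataclasses import dataclass, field


@dataclass
class ResponseFeedback:
    question: str
    user_thumbs_up: bool | None = None                           # True = thumbs up, False = thumbs down, None = no feedback
    admin_notes: list[str] = field(default_factory=list)         # administrator review comments
    automated_results: list[str] = field(default_factory=list)   # e.g. "Faithfulness: Pass" from nightly runs


feedback = ResponseFeedback(
    question="What were sales by region last quarter?",
    user_thumbs_up=True,
    admin_notes=["Chart choice looks right; double-check the date filter."],
    automated_results=["Skill Choice: Pass", "Faithfulness: Pass"],
)
```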
Diagnostics and Analysis
The Test Run page provides detailed diagnostics, helping users identify potential issues and areas for improvement:
- Question and Answer Review: Users can view the original question, the assistant’s answer, and the diagnostic information for each interaction.
- Failure Diagnostics: Max provides insights into why a particular step may have failed, helping users understand where the language model might have gone wrong and guiding corrective actions.
Cost and Performance Metrics
For each test run, users can access key performance metrics:
- Cost Information: The language model’s cost for executing the test is displayed, giving users transparency into resource usage.
- Pass Rate: Max provides an overall pass rate for the test, along with detailed pass rates for each individual evaluation (the calculation is sketched below).
- Time Taken: The time taken for each test to complete is displayed, giving users visibility into how long each run took.
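The pass-rate figures are simple ratios: passes divided by total evaluations, computed overall and per evaluation. The sketch below shows that arithmetic over the illustrative EvaluationResult records from the earlier sketch; it describes the calculation the displayed metrics imply, not Max's internals.

```python
# The arithmetic implied by the displayed metrics: passes divided by total,
# overall and per evaluation. Uses the illustrative EvaluationResult records
# from the earlier sketch; this is not a description of Max's internals.
from collections import defaultdict


def pass_rates(results: list) -> tuple[float, dict[str, float]]:
    """Return the overall pass rate and a per-evaluation breakdown."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # name -> [passes, total]
    for r in results:
        counts[r.name][0] += 1 if r.result == "Pass" else 0
        counts[r.name][1] += 1

    per_evaluation = {name: passes / total for name, (passes, total) in counts.items()}
    total_evals = sum(total for _, total in counts.values())
    total_passes = sum(passes for passes, _ in counts.values())
    overall = total_passes / total_evals if total_evals else 0.0
    return overall, per_evaluation
```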
Driving Improvement
The Test Run page is a critical tool for continuous improvement:
- Validating Changes: The data collected from tests can be used to validate whether changes to the assistant or environment are leading to improvements.
- Filtering Results: Users can filter evaluations to focus on specific issues, such as faithfulness or accuracy, allowing for more targeted diagnostics and troubleshooting.
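As a small illustration of that kind of filtering, the sketch below narrows a run's results to failures of a single evaluation (Faithfulness is used as the example), again using the illustrative EvaluationResult records from the earlier sketch rather than any actual Max API.

```python
# Small illustration of result filtering: narrow a run's results to failures of
# a single evaluation so diagnostics can focus on one class of problem. Uses
# the illustrative EvaluationResult records from the earlier sketch.
def failures_for(results: list, evaluation_name: str) -> list:
    """Keep only failed results for the named evaluation."""
    return [r for r in results if r.name == evaluation_name and r.result == "Fail"]


# usage (assuming `all_results` holds the EvaluationResult records from a run):
# faithfulness_failures = failures_for(all_results, "Faithfulness")
```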
Conclusion
The Test Run page in Max plays a vital role in monitoring and improving the performance of assistants. It enables users to run regression tests, evaluate skills, gather feedback, and analyze performance metrics over time. By leveraging these insights, teams can ensure that the assistant continues to meet user expectations and functions reliably as the environment evolves.