Problem

It can be hard for engineers to know when and how to evaluate AI features. The new AI feature development playbook (https://gitlab.com/gitlab-org/gitlab/-/merge_requests/193250) will provide a high-level overview, but it leaves the details to be filled in. Engineers need to know which tools to use, and why, when, and how to use them.

Tasks

This is part of https://gitlab.com/groups/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/-/epics/53

References

Consider these and linked/related resources for inclusion or as reference material:

Also consider general GitLab docs within https://gitlab.com/gitlab-org/gitlab/-/tree/master/doc/development

Assistant

I'll help you research the GitLab AI evaluation tools and documentation to create comprehensive guidelines. Let me gather information from the relevant repositories and issues.

Research agent completed


Executive Summary

Research completed successfully! I've analyzed 15 GitLab items and generated a comprehensive report.

GitLab AI Feature Evaluation: A Comprehensive Guide for Engineers

1. Introduction

Engineers at GitLab face a significant challenge in understanding when, why, and how to effectively evaluate AI-powered features. While high-level playbooks exist, a clear, consolidated, and actionable set of guidelines detailing specific tools, methodologies, and best practices has been lacking. This report synthesizes extensive research into GitLab's ongoing efforts to address this gap. It aims to provide engineers with a comprehensive understanding of the evolving AI evaluation landscape at GitLab, the tools available, and the processes to follow for various evaluation types, including feature, prompt, tool, model, and latency evaluation.

The core of GitLab's strategy involves a two-pronged approach:

  1. Consolidation of AI Evaluation Tooling: Migrating from disparate tools like the original Prompt Library and ELI5 towards a unified Centralized Evaluation Framework (CEF).
  2. Consolidation of AI Developer Documentation: Centralizing all AI-related documentation to provide a Single Source of Truth (SSoT) for developers.

This report will delve into these initiatives, outlining the current state, key tools, and the practical steps engineers can take to evaluate AI features effectively.

2. The Evolving AI Evaluation Landscape at GitLab

2.1. The Need for Clear Guidelines

The problem statement is clear: engineers find it difficult to navigate the AI evaluation process. As highlighted in [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#674+], the existing AI feature development playbook, while being rewritten ([gitlab-org/gitlab!193250+]), provides a high-level overview but leaves the detailed "how-to" to be filled in.

Problem to solve (from [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#674+]): "It can be hard for engineers to know when and how to evaluate AI features. The AI feature development playbook has some information, but it's incomplete and does not necessarily provide the relevant information in an easy-to-follow, practical style. At minimum, engineers need to know what tools to use, but also when and how it's appropriate to use which tools. This requires at least some understanding of how evaluation fits within the overall software development workflow."

2.2. Strategic Shift: Centralized Evaluation Framework (CEF)

GitLab is strategically moving towards a Centralized Evaluation Framework (CEF). This initiative, tracked under Epic [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&37+] (🎯 Consolidate AI Evaluation Tooling), aims to unify existing tools like ELI5 and the Prompt Library.

Goal (from [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&37+]): "Create a unified, powerful Centralized Evaluation Framework (CEF) by consolidating ELI5 and Prompt Library."

The objectives of this consolidation include:

  1. Creating a unified evaluation solution.
  2. Enhancing flexibility and user-friendliness.
  3. Improving documentation and guidance.
  4. Streamlining processes for AI feature evaluation.

A critical step in this consolidation is Merge Request [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!1424+] (chore: consolidate CEF project structure). This MR renames ELI5 to CEF and deprecates the old promptlib code.

MR Description (from [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!1424+]): "chore: consolidate CEF project structure

  • Remove promptlib code and dependencies.
  • Move ELI5 to root and rename to CEF.
  • Consolidate dependencies and toolings."

Out of scope (from [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!1424+]): "The following work items will be in a separate MR.

  • Fix documentation"

This means that while the tooling is being unified, the comprehensive documentation on how to use CEF is a subsequent effort, primarily tracked under Epic [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&53+] ([Scope adjustment] Phase 4: documentation and clean up).

The architectural blueprint for this consolidation is detailed in Issue [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#467+] (Draft blueprint to consolidate evaluation tooling (CEF, ELI5, Langsmith)) and the associated handbook MR ([gitlab-com/content-sites/handbook!8216+]). This blueprint clarifies the roles of CEF, LangSmith, and ELI5 (now part of CEF):

  • CEF: For large-scale, production-representative feature evaluation, used towards the end of the development cycle.
  • LangSmith: For rapid prompt experimentation, dataset initiation, and capturing failure examples, used from Day 1 of development.
  • ELI5 (as part of CEF): Automation layer for LangSmith, streamlining dataset creation, evaluation scripts, and CI/CD pipelines for mature features.

2.3. Documentation Consolidation

Parallel to tooling consolidation, there's a major effort to centralize all AI-related developer documentation, tracked under Issue [gitlab-org/gitlab#514510+] (Consolidate GitLab AI Developer/Contributor Documentation).

Background (from [gitlab-org/gitlab#514510+]): "We currently have AI-related documentation spread across multiple locations, making it difficult for developers and users to find relevant information. This initiative aims to consolidate all AI documentation into a single, organized location."

Proposed Solution (from [gitlab-org/gitlab#514510+]): The plan is to consolidate all AI documentation under https://docs.gitlab.com/ee/development/ai_features/ with clear organization by topic.

This ensures that the new evaluation guidelines will be part of a unified, easily discoverable documentation set.

3. Key Initiatives and Components for Evaluation Guidelines

3.1. AI Feature Development Playbook Rewrite

The AI Feature Development Playbook is being rewritten to serve as the SSoT for the high-level AI feature development workflow. This is tracked by Issue [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#743+] (Update the AI feature development playbook) and implemented in Merge Request [gitlab-org/gitlab!193250+] (Rewrite the AI feature development playbook).

Purpose of Playbook Rewrite (from [gitlab-org/gitlab!193250+]):

  • improve our guidelines for engineers for evaluating AI features
  • create an SSoT for the AI feature development workflow...
  • provide an overview, with links to more detailed information

The new playbook outlines a 5-phase iterative structure: Plan, Develop, Test & Evaluate, Deploy, Monitor. The "Test & Evaluate" phase explicitly mentions various evaluation types and links to key resources, though it acknowledges the need for more detailed follow-up documentation.

The full playbook content lives in doc/development/ai_features/ai_feature_development_playbook.md (via [gitlab-org/gitlab!193250+]).

Relevant Comments on Playbook Rewrite ([gitlab-org/gitlab!193250+]):

  • @achueshev (2025-06-03 10:04:47 UTC):

    "This is a great start! I like the structure and the clear separation of phases. ...

    1. Evaluation details: The "Test & Evaluate" section is a good high-level overview. For engineers, it would be extremely helpful to have more concrete examples or links to specific tools/processes for each type of evaluation (model, feature, prompt, latency, safety). For instance, what does "conduct model evaluation" actually entail? Are there specific metrics, frameworks, or internal tools we recommend?"
  • @mlapierre (2025-06-03 14:00:00 UTC) (Author of MR):

    "Thanks for the feedback!

    1. Evaluation details: Yes, this is the plan. The next issue in the epic is to update the evaluation documentation. This MR is meant to be the high-level overview, and then we'll link to the detailed guides."

3.2. Detailed Evaluation Workflow Documentation

The task of creating these detailed "how-to" guides is captured in Issue [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#674+] (Document developer workflow to enable efficient evaluations).

Proposal (from [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#674+]): "Collect the documentation on evaluation tools and processes, and organize them into a set of guidelines covering which tools to use, and when and how it's appropriate to use which tools. This will include:

  • A high-level overview of the evaluation process within the AI feature development workflow.
  • Detailed guidance on specific evaluation types (feature, prompt, tool, model, latency).
  • Practical examples and best practices."

Relevant Comments on Workflow Documentation ([gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#674+]):

  • @mlapierre (2025-06-05):

    "This issue is about documenting the how of evaluation, and it's a dependency for #743 (which is about the what and when)."

  • @achueshev (2025-06-05):

    "Yes, that's the idea. This issue is about documenting the developer workflow to enable efficient evaluations. It should cover how to use the CEF, when to use it, and why."

  • @achueshev (2025-06-05):

    "I think we should create a new ai_evaluation directory under doc/development/ai_features and put all the new docs there. This will make it easier to find all the evaluation-related docs in one place."

4. Guidelines for Specific Evaluation Types

The following sections outline the tools, processes, and considerations for each type of AI evaluation, based on the consolidated research. All new documentation and practices should align with the Centralized Evaluation Framework (CEF).

4.1. Feature Evaluation

  • What: End-to-end assessment of an AI feature's performance, user experience, and impact in a real-world or production-representative context.
  • Why: To validate the overall value proposition, identify regressions, ensure user satisfaction, and measure business impact.
  • When: Throughout the development lifecycle: during development (e.g., A/B testing, dogfooding), pre-release, and for continuous post-deployment monitoring.
  • How (Tools & Process):
    • CEF: The primary framework for large-scale, production-representative evaluations.
    • Evaluation Runner: For automating scheduled evaluation runs (daily, etc.) using CEF.
    • A/B Testing: Comparing the AI feature against a control or alternative versions.
    • User Feedback: Collecting qualitative (interviews, surveys) and quantitative (analytics) data.
    • Business Metrics: Tracking Key Performance Indicators (KPIs) relevant to the feature.
    • The AI Feature Development Playbook provides the high-level workflow.
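
To make the end-to-end idea concrete, the sketch below runs a feature over a set of production-representative cases and reports an aggregate pass rate. It is a minimal illustration only: `EvalCase`, `run_feature`, and `meets_expectation` are hypothetical stand-ins, and in practice CEF and the Evaluation Runner orchestrate this at scale.

```python
# Minimal sketch of an offline, end-to-end feature evaluation run; `run_feature`,
# `meets_expectation`, and `EvalCase` are hypothetical stand-ins, and in practice
# CEF and the Evaluation Runner orchestrate this at scale.
from dataclasses import dataclass


@dataclass
class EvalCase:
    inputs: dict     # production-representative input, e.g. an issue or MR context
    expected: dict   # reference outcome or rubric for judging the response


def run_feature(inputs: dict) -> dict:
    """Placeholder for invoking the AI feature end to end (API call, agent run, ...)."""
    raise NotImplementedError


def meets_expectation(result: dict, expected: dict) -> bool:
    """Placeholder success check: exact match, an LLM judge, or a human-defined rubric."""
    raise NotImplementedError


def evaluate_feature(cases: list[EvalCase]) -> float:
    """Returns the overall pass rate, a number worth tracking across runs and releases."""
    passed = sum(meets_expectation(run_feature(c.inputs), c.expected) for c in cases)
    return passed / len(cases)
```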

4.2. Prompt Evaluation

Prompt evaluation is crucial for features leveraging Large Language Models (LLMs). GitLab has significantly enhanced capabilities in this area.

  • What: Assessing the quality, effectiveness, and safety of prompts used to interact with LLMs.
  • Why: To optimize LLM outputs, ensure consistency, reduce hallucinations, improve relevance, and align with desired behavior.
  • When:
    • During initial prompt engineering and iterative development.
    • Whenever a prompt is modified.
    • When the underlying LLM is updated.
    • As part of regular regression testing in CI/CD.
  • How (Tools & Process):
    • CEF (ELI5 component) & AI Gateway (AIGW): ELI5 is integrated with AIGW, allowing prompt evaluations to run directly from AIGW merge requests via CI jobs. This is a core part of Epic [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&43+] ([Scope adjustment] AIGW setup to evaluate prompts) and Epic [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&49+] (Prompt Evaluation Orchestration).
    • Evaluators within CEF:
      • ExactMatchEvaluator: For direct comparison of actual vs. expected outputs (see [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#664+] and the sketch after this list).
      • LLMJudgeEvaluator: Uses an LLM to assess the correctness and quality of prompt outputs for more nuanced evaluations (see [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#665+] and the sketch after this list).
    • Dataset Generation: Initial datasets can be auto-generated using LLMs ([gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#708+]), then refined.
    • LangSmith: Used for logging, tracking evaluation experiments, and managing datasets.
    • Documentation: Enhanced documentation for prompt evaluation processes is available, stemming from Issue [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#721+] (Enhance documentation about the process of prompt evaluation), which was resolved by Merge Request [gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!2456+].
    • Guidelines for Efficient Prompts: To be developed under Issue [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#676+] (Draft: Re-visit guidelines about creating efficient prompts).
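
The evaluator classes named above live in the CEF codebase (see the linked issues); the snippet below is only a minimal, self-contained sketch of the two ideas, exact match and LLM-as-judge, using hypothetical names and a stand-in judge rather than the real CEF or LangSmith APIs.

```python
# Illustrative sketch only: the real ExactMatchEvaluator / LLMJudgeEvaluator live in
# CEF (see the issues referenced above); the names, shapes, and judge below are
# assumptions, not the CEF or LangSmith APIs.
from dataclasses import dataclass


@dataclass
class Example:
    inputs: dict    # e.g. {"question": "..."}
    outputs: dict   # reference/expected output, e.g. {"answer": "..."}


def exact_match(expected: str, actual: str) -> dict:
    """Exact-match style check: full score only if the strings are identical after trimming."""
    score = 1.0 if expected.strip() == actual.strip() else 0.0
    return {"key": "exact_match", "score": score}


def llm_judge(question: str, expected: str, actual: str, judge) -> dict:
    """LLM-as-judge style check: asks a judge model to grade correctness on a 0-1 scale.

    `judge` is any callable taking a prompt string and returning the model's text;
    in CEF/LangSmith this would be a configured judge LLM.
    """
    prompt = (
        "You are grading an AI answer.\n"
        f"Question: {question}\nReference answer: {expected}\nCandidate answer: {actual}\n"
        "Reply with a single number between 0 and 1 indicating correctness."
    )
    return {"key": "llm_judge_correctness", "score": float(judge(prompt))}


if __name__ == "__main__":
    example = Example({"question": "What does CEF stand for?"},
                      {"answer": "Centralized Evaluation Framework"})

    def prompt_under_test(question: str) -> str:
        # Stand-in for the real prompt/LLM call being evaluated.
        return "Centralized Evaluation Framework"

    def fake_judge(prompt: str) -> str:
        return "1.0"   # stand-in for a real judge model

    actual = prompt_under_test(example.inputs["question"])
    print(exact_match(example.outputs["answer"], actual))
    print(llm_judge(example.inputs["question"], example.outputs["answer"], actual, fake_judge))
```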

4.3. Tool Evaluation

  • What: Assessing the performance, suitability, and reliability of specific AI tools or components (e.g., a particular LLM, vector database, ReAct agent capabilities).
  • Why: To make informed decisions during architectural design, tool selection, and upgrades.
  • When: During technology selection, feature development (especially for agentic systems), and when considering tool updates.
  • How (Tools & Process):
    • CEF (ELI5 component) & Prompt Registry: Evaluate prompts against specific tools, especially for ReAct tool selection accuracy ([gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#622+]).
    • LangSmith Evaluation Framework: Can be integrated with pytest/vitest for unit-testing prompts and tool interactions ([gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#647+]).
    • Benchmarking: Comparing different tools against defined criteria (accuracy, speed, cost).
    • Documenting Existing Evaluators: An ongoing task ([gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#671+]) will help identify available tools within CEF.
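
To illustrate the pytest-based approach, here is a hedged sketch that unit-tests ReAct-style tool selection against a small labeled set. `select_tool` and the cases are hypothetical; LangSmith's evaluation tooling could additionally record such runs as experiments.

```python
# Hypothetical sketch: `select_tool` stands in for whatever ReAct-style tool-selection
# step the feature exposes, and the labeled cases are illustrative only.
import pytest

LABELED_CASES = [
    ("Summarize this merge request", "summarize_mr"),
    ("What changed in app.py?", "read_file"),
    ("Run the pipeline for this branch", "run_pipeline"),
]


def select_tool(user_message: str) -> str:
    """Placeholder for the real agent call that returns the chosen tool name."""
    raise NotImplementedError


@pytest.mark.parametrize("message,expected_tool", LABELED_CASES)
def test_tool_selection(message, expected_tool):
    # Per-case assertion keeps a failure attributable to a specific input.
    assert select_tool(message) == expected_tool


def test_tool_selection_accuracy_threshold():
    # Aggregate gate: individual cases may be flaky with LLMs, but overall
    # selection accuracy should not regress below the chosen threshold.
    correct = sum(select_tool(m) == t for m, t in LABELED_CASES)
    assert correct / len(LABELED_CASES) >= 0.9
```

Run in CI, a threshold test like this acts as a regression gate, similar in spirit to the AIGW prompt-evaluation CI jobs described above.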

4.4. Model Evaluation

  • What: Assessing the performance of underlying AI models (e.g., LLMs, fine-tuned models) on specific tasks or benchmarks.
  • Why: To understand model capabilities, identify biases, track performance improvements, and ensure model safety and reliability.
  • When: During model selection, fine-tuning, before deployment, and periodically to monitor for drift.
  • How (Tools & Process):
    • CEF: Use CEF with appropriate datasets and evaluators (e.g., LLMJudgeEvaluator) to assess model outputs.
    • Datasets: Crucial for comprehensive model evaluation. See Dataset Management section below.
    • Metrics: Define and track relevant metrics (e.g., accuracy, F1-score, BLEU, ROUGE, perplexity, fairness metrics).
    • LangSmith: For experiment tracking and comparing performance across different models or versions.
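
As a small illustration, two of the simpler metrics above (exact-match accuracy and token-level F1) can be computed directly in Python. These helpers are a sketch only; real model evaluations would normally rely on shared CEF/LangSmith evaluators and established metric libraries.

```python
# Minimal metric helpers for illustration; real model evaluations would normally use
# shared CEF/LangSmith evaluators and established metric libraries instead.
from collections import Counter


def exact_match_accuracy(expected: list[str], predicted: list[str]) -> float:
    hits = sum(e.strip() == p.strip() for e, p in zip(expected, predicted))
    return hits / len(expected)


def token_f1(expected: str, predicted: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over overlapping tokens."""
    exp_tokens, pred_tokens = expected.split(), predicted.split()
    common = Counter(exp_tokens) & Counter(pred_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match_accuracy(["a b c", "x y"], ["a b c", "x z"]))        # 0.5
print(round(token_f1("the quick brown fox", "the brown fox"), 3))      # 0.857
```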

4.5. Latency Evaluation

  • What: Measuring the response time of AI features and their underlying components.
  • Why: To ensure a responsive user experience, meet performance Service Level Agreements (SLAs), and optimize resource usage.
  • When: Throughout the development lifecycle, especially during integration testing, performance testing, and post-deployment monitoring.
  • How (Tools & Process):
    • ai-model-latency-tester: GitLab has a dedicated tool for this, documented at [gitlab-org/quality/ai-model-latency-tester/-/tree/main/docs?ref_type=heads+].
    • CEF: May include components for latency measurement (e.g., cef/codesuggestions/summarize_latency.py found in [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!1424+]).
    • Benchmarking: Measuring response times under various load conditions.
    • Profiling: Identifying bottlenecks in the AI inference pipeline.
    • Integration into CI/CD for continuous performance monitoring.
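
The ai-model-latency-tester is the supported tool for this work. Purely to illustrate what a latency probe measures, the sketch below times a hypothetical request function and reports p50/p95/p99:

```python
# Rough latency probe for illustration only; ai-model-latency-tester is the supported
# tool, and `call_ai_feature` is a hypothetical stand-in for the real request.
import statistics
import time


def call_ai_feature(prompt: str) -> str:
    """Placeholder for the real request to the AI feature / AI Gateway."""
    time.sleep(0.05)  # simulate work
    return "response"


def measure_latency(n_requests: int = 50) -> dict:
    samples = []
    for _ in range(n_requests):
        start = time.perf_counter()
        call_ai_feature("example prompt")
        samples.append(time.perf_counter() - start)
    percentiles = statistics.quantiles(samples, n=100)  # 99 cut points
    return {
        "p50_s": statistics.median(samples),
        "p95_s": percentiles[94],
        "p99_s": percentiles[98],
        "mean_s": statistics.fmean(samples),
    }


print(measure_latency())
```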

4.6. Dataset Management for Evaluation

High-quality datasets are fundamental to all AI evaluation efforts.

  • What: Creating, managing, versioning, and curating datasets used for training and evaluating AI models and features.
  • Why: To ensure evaluations are robust, reliable, representative of real-world scenarios, and can detect regressions or biases.
  • When: Continuously, as features evolve and new data becomes available.
  • How (Tools & Process):
    • LangSmith: The emerging Single Source of Truth (SSoT) for evaluation datasets and results.
    • CEF: Includes capabilities for dataset generation and management.
    • Migration from Legacy Systems: Dataset creation pipelines are being migrated from the old Prompt Library (which used BigQuery/Apache Beam) to ELI5/CEF, leveraging LangSmith. This is tracked in Issue [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#661+] (List PL dataset creation pipelines not covered in ELI5) and Epic [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&52+] ([Scope adjustment] Phase 2.5: Move PL dataset creation pipeline to ELI5).

      Problem Statement (from [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#661+]): "The Prompt Library contains logic for creating datasets to run evaluations. This logic uses BigQuery and Apache Beam. Since we rely on LangSmith and given our evaluation consolidation efforts, this dataset logic is no longer maintained well and needs to be moved to ELI5."

    • Guidelines for Building Datasets: To be developed under Issue [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#672+] (Draft: Re-visit guidelines about building evaluation datasets).
    • Existing Datasets: A list of available datasets can be found at https://datasets-gitlab-org-modelops-ai-model-validation-b35d3d2afe403e.gitlab.io/#coverage (referenced in [gitlab-org/gitlab#547712+]).
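
As a hedged sketch of seeding an evaluation dataset with the LangSmith Python SDK (the dataset name and examples are illustrative, and the client assumes a LangSmith API key is configured in the environment):

```python
# Hedged sketch of seeding an evaluation dataset in LangSmith with the Python SDK.
# The dataset name and examples are illustrative, and a LangSmith API key must be
# configured in the environment (e.g. LANGSMITH_API_KEY) for the client to work.
from langsmith import Client

client = Client()

dataset = client.create_dataset(
    dataset_name="duo-chat-smoke-examples",  # hypothetical name
    description="Small seed dataset for prompt/feature evaluation.",
)

client.create_examples(
    inputs=[
        {"question": "How do I create a merge request?"},
        {"question": "What does this pipeline failure mean?"},
    ],
    outputs=[
        {"answer": "Use the 'New merge request' button or the glab CLI."},
        {"answer": "Check the failing job log and address the reported error."},
    ],
    dataset_id=dataset.id,
)
```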

5. Current Status, Gaps, and Future Work

GitLab's AI evaluation framework is actively evolving.

  • Progress:
    • The foundational consolidation of CEF is well underway ([gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!1424+]).
    • Phase 1 of CEF consolidation (ELI5 moved to Prompt Library repo) is complete.
    • Phase 2 (Core Integration) and 2.5 (Dataset Migration) are in progress, migrating evaluators and dataset pipelines.
    • Documentation for prompt evaluation has been enhanced ([gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#721+]).
    • The AI Feature Development Playbook rewrite is in progress ([gitlab-org/gitlab!193250+]).
  • Identified Gaps (example from [gitlab-org/gitlab#547712+] for Duo Workflow / Agentic Duo Chat):
    • Evaluation of disambiguation steps (human-AI interaction).
    • Evaluation for non-Python languages.

    These specific gaps highlight areas where new evaluation methodologies or datasets might be needed.
  • Future Work (Phase 4: Documentation and Clean Up - [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&53+]):
    • Finalizing comprehensive documentation and guidelines (the focus of this report).
    • Documenting existing evaluators ([gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#671+]).
    • Revisiting guidelines for building evaluation datasets ([gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#672+]).
    • Revisiting guidelines for creating efficient prompts ([gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#676+]).
    • Archiving old prompt-library code.
    • Estimating the completion date for the overall evaluation consolidation ([gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#728+]).

6. Conclusion and Recommendations for Engineers

GitLab is committed to providing a robust and well-documented AI evaluation ecosystem. The transition to the Centralized Evaluation Framework (CEF), coupled with comprehensive documentation efforts, will empower engineers to evaluate AI features more effectively and consistently.

Key Recommendations for Engineers:

  1. Embrace the Centralized Evaluation Framework (CEF): Familiarize yourself with CEF as it becomes the standard for AI evaluation at GitLab. Stay updated on its development and documentation.
  2. Leverage LangSmith: Utilize LangSmith for dataset management, experiment tracking, and analyzing evaluation results.
  3. Follow the AI Feature Development Playbook: Use the updated playbook as the primary guide for the overall AI feature development lifecycle, including high-level evaluation strategies.
  4. Consult Detailed Evaluation Guidelines: As they become available (under doc/development/ai_features/ai_evaluation), refer to these detailed guides for specific instructions on "when, why, and how" to conduct different types of evaluations using CEF.
  5. Contribute to Datasets: Actively contribute to and refine evaluation datasets in LangSmith to improve the quality and coverage of evaluations.
  6. Stay Informed: Monitor the progress of relevant epics and issues, particularly those under Epic [gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&53+] (Phase 4: documentation and clean up) and Issue [gitlab-org/gitlab#514510+] (Consolidate GitLab AI Developer/Contributor Documentation).
  7. Provide Feedback: Engage with the teams developing these frameworks and documentation to share your experiences and help refine the guidelines.

By adopting these practices and leveraging the evolving tools and documentation, GitLab engineers can significantly enhance the quality, reliability, and performance of AI-powered features.