GitLab Research Agent

Final Report: Integrating Knowledge Graph Replicas with GitLab's Zoekt Architecture

This report synthesizes research on how GitLab's Knowledge Graph functionality integrates with the existing Zoekt search architecture, as introduced in merge request gitlab-org/gitlab!191162. The analysis focuses on how the AddKnowledgeGraphReplicas database migration and its associated models reuse and extend the Zoekt infrastructure.

The core strategy is to embed the Knowledge Graph service directly into the Zoekt ecosystem, reusing its robust infrastructure for node management, task scheduling, and deployment. This avoids the overhead of creating a new, parallel infrastructure, thereby accelerating development and ensuring operational consistency.

1. Core Integration Strategy: Infrastructure Reuse

The Knowledge Graph server is integrated by reusing and extending the existing Zoekt infrastructure rather than building a separate stack. The decision puts a mature, scalable system to work for a second workload.

As stated in the Graph node management epic (gitlab-org&17767):

"We will use existing Zoekt nodes to index, store and serve also graph DBs. The major benefit is that we can re-use existing Zoekt logic (nodes management on Rails side) and infrastructure (deployment of Zoekt nodes) and node logic itself (Zoekt Webservice and Indexer)."

This strategy is realized through several key architectural decisions:

1.1. Shared Node Infrastructure

The most significant aspect of this integration is the use of Search::Zoekt::Node instances as the physical hosts for both Zoekt search indices and Knowledge Graph databases (which use KuzuDB). Instead of provisioning a new class of servers, Zoekt nodes become multi-purpose hosts.

This is formally established in the data model by extending the Search::Zoekt::Node model to have a direct relationship with Knowledge Graph replicas.

Relevant code: ee/app/models/search/zoekt/node.rb (MR !191162).
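
As a rough illustration of that relationship (not the code from the MR), the plain-Ruby sketch below shows one node object holding both Zoekt indices and Knowledge Graph replicas. The class names `Node` and `Replica` and the method `add_knowledge_graph_replica` are hypothetical stand-ins; in Rails the same link would normally be expressed with an association on the model.

```ruby
# Plain-Ruby stand-ins for the Rails models; all names here are illustrative.
Replica = Struct.new(:namespace_id, :node)

class Node
  attr_reader :name, :zoekt_indices, :knowledge_graph_replicas

  def initialize(name)
    @name = name
    @zoekt_indices = []             # existing Zoekt search indices
    @knowledge_graph_replicas = []  # graph databases hosted on the same node
  end

  # Mimics the effect of a has_many association: the new replica
  # points back at the node that hosts it.
  def add_knowledge_graph_replica(namespace_id)
    replica = Replica.new(namespace_id, self)
    @knowledge_graph_replicas << replica
    replica
  end
end

node = Node.new("zoekt-1")
node.zoekt_indices << "index-for-namespace-7"
node.add_knowledge_graph_replica(7)
```

The point of the sketch is only that the node is the shared parent: search indices and graph replicas sit side by side under the same record.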

1.2. Flexible Node Service Assignment

To manage the dual-purpose nature of these nodes, the architecture includes a mechanism to designate which services a node can provide. This allows for operational flexibility, enabling phased rollouts and dedicated resource allocation if needed.

As proposed in issue gitlab-org/gitlab#540786:

"We should also add a setting to our “Zoekt node” models to mark them as “zoekt only”, “kuzu only”, or “zoekt and kuzu”. A “kuzu only” node will not be allocated new zoekt indexes and vice versa. This will give our operators the most flexibility to roll out changes while keeping as much infrastructure shared as possible."

This is planned to be implemented by adding a services column to the zoekt_nodes table, allowing administrators to control task allocation for both Zoekt and Knowledge Graph workloads.
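
A minimal sketch of how such a setting could gate allocation, assuming the services value is available as a list of symbols. The storage format of the planned column is not specified in the source, and `NodeRecord` and `nodes_for` are hypothetical names.

```ruby
# Hypothetical stand-in for a zoekt_nodes row; `services` models the planned column.
NodeRecord = Struct.new(:name, :services)

# Returns the nodes that may be allocated new work for the given service.
def nodes_for(service, nodes)
  nodes.select { |node| node.services.include?(service) }
end

NODES = [
  NodeRecord.new("node-a", [:zoekt]),        # "zoekt only"
  NodeRecord.new("node-b", [:kuzu]),         # "kuzu only"
  NodeRecord.new("node-c", [:zoekt, :kuzu])  # "zoekt and kuzu"
].freeze

puts nodes_for(:kuzu, NODES).map(&:name).inspect  # prints ["node-b", "node-c"]
```

With this shape, a "kuzu only" node never appears in the candidate list for new Zoekt indexes, and a "zoekt only" node never appears for graph work.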

1.3. Unified Task Processing

A crucial refactoring, detailed in the proof-of-concept MR gitlab-org/gitlab!189941, introduced a shared base class for tasks. The Search::Zoekt::BaseTask abstract model encapsulates common logic for task state management, scheduling, and processing. Both Search::Zoekt::Task and the new KnowledgeGraph::Task inherit from this base class.

This allows both systems to use the same background workers and processing pipeline, ensuring consistent behavior for retries, failures, and state transitions.

Relevant code: ee/app/models/search/zoekt/base_task.rb (MR !189941).
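
The shared-base-class idea can be sketched in plain Ruby as follows. This is an illustration, not the code from MR !189941: the state names other than done, the retry limit, and the class names `BaseTask`, `ZoektTask`, and `KnowledgeGraphTask` are all assumptions.

```ruby
# Illustrative base class: state handling and retries live here,
# subclasses only supply the work in #perform.
class BaseTask
  MAX_RETRIES = 3 # assumed value

  attr_reader :state, :retries_left

  def initialize
    @state = :pending
    @retries_left = MAX_RETRIES
  end

  # Shared pipeline: run the work, then record success or consume a retry.
  def process
    @state = :processing
    perform
    @state = :done
  rescue StandardError
    @retries_left -= 1
    @state = @retries_left.positive? ? :pending : :failed
  end

  def perform
    raise NotImplementedError, "#{self.class} must implement #perform"
  end
end

class ZoektTask < BaseTask
  def perform
    # would index a repository for code search
  end
end

class KnowledgeGraphTask < BaseTask
  def perform
    # would build the graph database for a repository
  end
end

[ZoektTask.new, KnowledgeGraphTask.new].each do |task|
  task.process
  puts "#{task.class}: #{task.state}"
end
```

Because both task types go through the same `process` method, retry and failure handling cannot drift apart between the two systems.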

2. The `AddKnowledgeGraphReplicas` Migration and Data Models

The AddKnowledgeGraphReplicas migration is the cornerstone of the database-level integration. It creates the p_knowledge_graph_replicas table, which acts as the bridge between the Knowledge Graph entities and the Zoekt node infrastructure.

2.1. The `p_knowledge_graph_replicas` Table

This table is designed for scalability using PostgreSQL partitioning, a pattern common across GitLab's database.

Relevant code: db/migrate/20250512113325_add_knowledge_graph_replicas.rb (MR !191162).

Key columns and partitioning:

- zoekt_node_id: The foreign key that links a Knowledge Graph replica to a specific zoekt_nodes record. This is the most critical link for infrastructure reuse.
- knowledge_graph_enabled_namespace_id: Links the replica to the namespace for which the graph is being generated.
- PARTITION BY RANGE (namespace_id): The partitioning clause rather than a column. It is intended to let the table scale as more projects adopt the feature.

2.2. The `Ai::KnowledgeGraph::Replica` Model

This model represents an instance of a Knowledge Graph for a specific namespace, deployed on a specific Zoekt node. It effectively serves as a join table, creating a many-to-many relationship between namespaces and Zoekt nodes. This design is key to achieving high availability, as a single namespace's graph can be replicated across multiple nodes.

Relevant code: ee/app/models/ai/knowledge_graph/replica.rb (MR !191162).
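
To make the join-record role concrete, here is a small plain-Ruby sketch under stated assumptions: `GraphReplica`, `serving_nodes`, and the `:ready` / `:pending` state symbols are hypothetical, chosen to show how one namespace can be served from several nodes while one node hosts several namespaces.

```ruby
# Hypothetical join record: one row per (namespace, node) pair.
GraphReplica = Struct.new(:namespace_id, :node_name, :state)

# Returns the names of nodes that can currently serve the namespace's graph.
def serving_nodes(replicas, namespace_id)
  replicas
    .select { |r| r.namespace_id == namespace_id && r.state == :ready }
    .map(&:node_name)
end

REPLICAS = [
  GraphReplica.new(1, "node-a", :ready),
  GraphReplica.new(1, "node-b", :pending),
  GraphReplica.new(2, "node-a", :ready),   # node-a also hosts namespace 2
  GraphReplica.new(1, "node-c", :ready)
].freeze

puts serving_nodes(REPLICAS, 1).inspect  # prints ["node-a", "node-c"]
```

If node-a went away, namespace 1 would still be served from node-c, which is the availability property the replica design aims for.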

2.3. Mirrored Architectural Pattern

The new Knowledge Graph models are intentionally designed to mirror the existing Zoekt data models, creating a parallel but integrated structure. The diagram from the proof-of-concept MR illustrates this relationship, with Node being the central, shared component.

Model Relationship Diagram (from MR !189941):

classDiagram
    namespace ZoektModels {
        class Node
        class Index
        class Repository
        class Task
        class EnabledNamespace
        class Replica
    }
    namespace KnowledgeGraphModels {
        class KnowledgeGraphEnabledRepository
        class KnowledgeGraphReplica
        class KnowledgeGraphTask
    }

    Node "1" --> "*" Task : has_many tasks
    Node "1" --> "*" Index : has_many indices
    Node "1" --> "*" KnowledgeGraphTask : has_many graph tasks
    Node "1" --> "*" KnowledgeGraphReplica : has_many graph replicas

    KnowledgeGraphEnabledRepository "1" --> "*" KnowledgeGraphReplica : has_many replicas
    KnowledgeGraphReplica "1" --> "*" KnowledgeGraphTask : has_many tasks

3. The End-to-End Workflow

Together, these components give the following workflow for creating and managing Knowledge Graphs:

1. Enablement: A project is marked for Knowledge Graph creation, resulting in a record in the p_knowledge_graph_enabled_namespaces table.
2. Replica Creation: A service (e.g., KnowledgeGraph::IndexingTaskService) is triggered. It queries for available Search::Zoekt::Node records that are configured to handle Knowledge Graph tasks.
3. Assignment: The service creates one or more Ai::KnowledgeGraph::Replica records, linking the EnabledNamespace to the selected Zoekt nodes.
4. Task Scheduling: For each new replica, a KnowledgeGraph::Task record is created with task_type: :graph_index_repo. This task is associated with the replica and its assigned zoekt_node_id.
5. Processing: The gitlab-zoekt-indexer service, running on the assigned Zoekt node, polls the shared task queue. It picks up the graph_index_repo task and executes the logic to generate and store the KuzuDB file on its local disk.
6. State Update: Upon completion, the indexer reports the task status back to GitLab Rails, which updates the KnowledgeGraph::Task and Ai::KnowledgeGraph::Replica states to done or ready.
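
The workflow above can be sketched end to end with in-memory objects. Everything in this block is a hypothetical stand-in (the `Workflow*` structs, `schedule_graph_indexing`, `run_node`, the `:kuzu` service symbol, and the default of two replicas); only the task type `graph_index_repo` and the done/ready end states come from the description.

```ruby
# Hypothetical in-memory stand-ins; none of these are the real GitLab classes.
WorkflowNode    = Struct.new(:id, :services)
WorkflowReplica = Struct.new(:namespace_id, :node_id, :state)
WorkflowTask    = Struct.new(:task_type, :replica, :node_id, :state)

# Replica Creation, Assignment, Task Scheduling:
# pick eligible nodes, create a replica on each, and queue one task per replica.
def schedule_graph_indexing(namespace_id, nodes, replica_count: 2)
  eligible = nodes.select { |n| n.services.include?(:kuzu) }.first(replica_count)
  eligible.map do |node|
    replica = WorkflowReplica.new(namespace_id, node.id, :pending)
    WorkflowTask.new(:graph_index_repo, replica, node.id, :pending)
  end
end

# Processing and State Update:
# a node handles only its own pending tasks, then marks task and replica finished.
def run_node(node, tasks)
  tasks.select { |t| t.node_id == node.id && t.state == :pending }.each do |task|
    # The real indexer would build the KuzuDB file here.
    task.state = :done
    task.replica.state = :ready
  end
end

nodes = [
  WorkflowNode.new(1, [:zoekt]),
  WorkflowNode.new(2, [:zoekt, :kuzu]),
  WorkflowNode.new(3, [:kuzu])
]
tasks = schedule_graph_indexing(42, nodes)
run_node(nodes[1], tasks)
tasks.each { |t| puts "node #{t.node_id}: task=#{t.state} replica=#{t.replica.state}" }
# prints:
#   node 2: task=done replica=ready
#   node 3: task=pending replica=pending
```

Only node 2 has polled in this run, so the replica on node 3 stays pending until that node processes its own task.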

4. Conclusion

The AddKnowledgeGraphReplicas migration is a foundational part of the strategy to deliver Knowledge Graph functionality by integrating it into the existing Zoekt architecture rather than building a separate service stack.

By leveraging Zoekt's proven capabilities for node management, task distribution, and scalability, GitLab can:

- Maximize Resource Efficiency: Use existing compute nodes for a new service.
- Ensure Scalability and High Availability: Distribute replicas across multiple nodes using a familiar pattern.
- Streamline Operations: Extend existing monitoring, logging, and deployment processes rather than creating new ones.
- Accelerate Development: Build upon a mature and robust task-processing framework.

In summary, the Knowledge Graph is not a standalone service but a "first-class citizen" within the Zoekt ecosystem, made possible by the database relationships and shared components established in this body of work.
