Skip to content

Conversation

@eric-tramel
Copy link
Contributor

@eric-tramel eric-tramel commented Jan 27, 2026

Summary

This PR introduces Model Context Protocol (MCP) integration into Data Designer, enabling LLM generation columns to call external tools via MCP servers. This allows models to dynamically retrieve information, perform calculations, or interact with external systems during data generation.

TODO

  • End-too-end test demonstrating usage.
  • Example workflows.
  • Job scale-up tests & throughput impacts.
    • Fix possibility of multi-thread tool call request deadlocking.
  • Fleshed out module/comment docstrings.
  • Documentation content

Key Features

  • Global MCP Server Configuration: Define MCP servers at the DataDesigner level, supporting both stdio (command-based) and SSE (URL-based) transports
  • Per-Column Tool Configuration: Enable specific tools on individual LLM columns with fine-grained control
  • Automatic Tool Loop: The ModelFacade handles iterative tool calling until the model produces a final response
  • Safety Limits: Configurable max_tool_calls prevents runaway tool loops

New Configuration Types

MCPServerConfig

Defines how to connect to an MCP server:

from data_designer.config import MCPServerConfig

# Stdio transport (subprocess)
server = MCPServerConfig(
    name="my-mcp-server",
    command="python",
    args=["-m", "my_mcp_server"],
    env={"API_KEY": "..."},
)

# SSE transport (HTTP)
server = MCPServerConfig(
    name="remote-mcp",
    url="http://localhost:8080/sse",
)

MCPToolConfig

Controls which tools an LLM column may use:

from data_designer.config import MCPToolConfig

tool_config = MCPToolConfig(
    server_name="my-mcp-server",  # Must match an MCPServerConfig.name
    tool_names=["get_weather", "lookup_data"],  # None = allow all tools
    max_tool_calls=5,  # Default: 5
)

Usage Example

from data_designer.config import (
    CategorySamplerParams,
    DataDesignerConfigBuilder,
    LLMTextColumnConfig,
    MCPServerConfig,
    MCPToolConfig,
    SamplerColumnConfig,
    SamplerType,
)
from data_designer.interface import DataDesigner

# 1. Define MCP server(s)
mcp_server = MCPServerConfig(
    name="fact-server",
    command="python",
    args=["-m", "my_fact_server"],
)

# 2. Create DataDesigner with MCP servers
data_designer = DataDesigner(mcp_servers=[mcp_server])

# 3. Build config with tool-enabled column
config = DataDesignerConfigBuilder()
config.add_column(
    SamplerColumnConfig(
        name="topic",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=["Python", "Rust", "Go"]),
    )
)
config.add_column(
    LLMTextColumnConfig(
        name="fact",
        prompt="Use the get_fact tool to retrieve a fact about {{ topic }}.",
        model_alias="nvidia-text",
        tool_config=MCPToolConfig(
            server_name="fact-server",
            tool_names=["get_fact"],
        ),
    )
)

# 4. Generate data - model will automatically call tools
result = data_designer.preview(config, num_records=10)

Full Message Traces (*__trace)

This PR replaces the previous {column_name}__reasoning_trace side-effect column with a more powerful {column_name}__trace column.

  • Old: {column_name}__reasoning_trace (reasoning only, when the provider exposed it)
  • New: {column_name}__trace (full ordered message history)

What is stored

When enabled, {column_name}__trace is a list[dict] containing the complete message history for the final generation attempt, including (in order):

  • system + user prompts
  • assistant tool-call messages (tool_calls)
  • tool responses (role="tool", tool_call_id, content)
  • final assistant response

If the model/provider exposes it, assistant messages may also include a reasoning_content field.

How to enable

Traces are controlled via a new runtime option:

from data_designer.config import RunConfig

# Enable full trace capture for LLM columns
# (produces `{column_name}__trace` alongside the normal output column)
data_designer.set_run_config(RunConfig(include_full_traces=True))

Architecture

DataDesigner
    └── mcp_servers: list[MCPServerConfig]
            │
            ▼
    ResourceProvider
        └── mcp_manager: MCPClientManager
                │
                ▼
        ModelRegistry
            └── ModelFacade (receives mcp_manager)
                    │
                    ▼
            LLMCompletionGenerator
                └── generate() passes tool_config to ModelFacade

New Files

File Purpose
packages/data-designer-config/src/data_designer/config/mcp.py MCPServerConfig and MCPToolConfig Pydantic models
packages/data-designer-engine/src/data_designer/engine/mcp/manager.py MCPClientManager - connects to servers, caches tools, executes calls
packages/data-designer-engine/src/data_designer/engine/mcp/errors.py MCP-specific exception hierarchy

Modified Files

File Changes
packages/data-designer-config/src/data_designer/config/column_configs.py Added tool_config field to LLMTextColumnConfig
packages/data-designer-config/src/data_designer/config/run_config.py Added include_full_traces to enable {column_name}__trace
packages/data-designer-engine/src/data_designer/engine/models/facade.py Tool calling loop + trace capture
packages/data-designer-engine/src/data_designer/engine/models/factory.py Passes mcp_manager to ModelFacade
packages/data-designer-engine/src/data_designer/engine/resources/resource_provider.py Initializes MCPClientManager
packages/data-designer/src/data_designer/interface/data_designer.py Accepts mcp_servers parameter

Test Plan

  • Unit tests for MCPServerConfig validation (packages/data-designer-config/tests/config/test_mcp.py)
  • Unit tests for MCPClientManager (packages/data-designer-engine/tests/engine/mcp/test_manager.py)
  • Unit tests for ModelFacade tool calling (packages/data-designer-engine/tests/engine/models/test_facade.py)
  • E2E demonstration with real MCP server (tests_e2e/tests/test_mcp_demo.py)

Running the E2E Demo

# Requires NVIDIA_API_KEY environment variable
NVIDIA_API_KEY=your-key uv run pytest tests_e2e/tests/test_mcp_demo.py -v

Dependencies

  • Added mcp package to packages/data-designer-engine/pyproject.toml (provides MCP client functionality)

@johnnygreco johnnygreco requested a review from a team January 27, 2026 03:14
@eric-tramel eric-tramel self-assigned this Jan 27, 2026
@eric-tramel eric-tramel added the enhancement New feature or request label Jan 27, 2026
@nabinchha nabinchha changed the base branch from main to johnny/refactor/207-repackaging-with-subpackages January 27, 2026 16:04
@dhruvnathawani dhruvnathawani self-requested a review January 27, 2026 16:35
@johnnygreco johnnygreco force-pushed the johnny/refactor/207-repackaging-with-subpackages branch from 824d8ed to 177d38d Compare January 27, 2026 18:32
@eric-tramel eric-tramel changed the base branch from johnny/refactor/207-repackaging-with-subpackages to main January 27, 2026 19:54
@eric-tramel eric-tramel marked this pull request as ready for review January 27, 2026 23:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants