docs: Add llms.txt and llms-full.txt for AI discoverability #389

Open
mvansegbroeck wants to merge 1 commit into main from feat/maarten-llms-txt

Conversation

@mvansegbroeck
Contributor

Summary

Adds two new files — llms.txt and llms-full.txt — to improve how AI models and coding assistants discover and reference Data Designer.

What are these files?

  • llms.txt is an emerging standard (llmstxt.org) that provides a structured, machine-readable summary of a project. It's the AI equivalent of robots.txt: a concise overview that helps models like ChatGPT, Claude, and Perplexity understand what a project does and when to recommend it. Ours covers capabilities, column types, use cases, tutorials, recipes, and links.
  • llms-full.txt is the companion file with complete inline documentation: quick start, architecture, code patterns, column type reference, and model/provider details. Coding assistants (Cursor, Copilot, Claude Code) load this for deeper context when generating Data Designer code.
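
For readers unfamiliar with the layout, here is a rough sketch of how a tool might read the structure that llmstxt.org describes: an H1 title, a blockquote summary, then H2 sections holding markdown link lists. The sample text, the `example.com` URLs, and the `parse_llms_txt` helper are hypothetical and are not the contents of the files in this PR.

```python
import re

# Hypothetical sample for illustration only; not the real llms.txt from this PR.
SAMPLE = """\
# Data Designer

> Short summary of the project goes here.

## Docs

- [Quick start](https://example.com/quickstart): Install and first run
- [Column types](https://example.com/columns)
"""

# Matches "- [title](url)" with an optional ": note" suffix.
LINK_RE = re.compile(
    r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<note>.*))?$"
)


def parse_llms_txt(text):
    """Split an llms.txt document into title, summary, and per-section links."""
    doc = {"title": None, "summary": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and doc["title"] is None:
            doc["title"] = line[2:].strip()
        elif line.startswith("> ") and doc["summary"] is None:
            doc["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc["sections"][current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                doc["sections"][current].append(
                    (match["title"], match["url"], match["note"] or "")
                )
    return doc


if __name__ == "__main__":
    print(parse_llms_txt(SAMPLE))
```

Real consumers vary a lot in how strictly they parse this, so the sketch is best read as a description of the shape rather than of any particular tool's behavior.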

Why both locations?

  • Repo root (llms.txt, llms-full.txt): This is where coding assistants and GitHub-based tools look. They read from the repo root via raw.githubusercontent.com.
  • docs/ (docs/llms.txt, docs/llms-full.txt): So the docs site at nvidia-nemo.github.io/DataDesigner can serve them at the site root, where web-based AI crawlers and agents expect to find them.
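
To make the two lookup paths concrete, a small sketch of the URLs each kind of consumer would end up requesting. The repository slug passed in below is an assumption inferred from the docs-site hostname, and `candidate_urls` is a hypothetical helper; only the `main` branch name and the docs-site address come from this PR.

```python
from urllib.parse import urljoin

RAW_BASE = "https://raw.githubusercontent.com/"
# Docs-site address as given in the PR description.
DOCS_SITE = "https://nvidia-nemo.github.io/DataDesigner/"


def candidate_urls(repo_slug, branch="main", docs_site=DOCS_SITE):
    """Return the repo-root and docs-site URL for each llms file.

    repo_slug is "owner/repo"; the value used below is an assumption.
    """
    urls = {}
    for name in ("llms.txt", "llms-full.txt"):
        urls[name] = {
            "repo_root": f"{RAW_BASE}{repo_slug}/{branch}/{name}",
            "docs_site": urljoin(docs_site, name),
        }
    return urls


if __name__ == "__main__":
    # Assumed slug; adjust to the actual repository.
    for name, locations in candidate_urls("NVIDIA-NeMo/DataDesigner").items():
        print(name, locations)
```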

@mvansegbroeck mvansegbroeck requested a review from a team as a code owner March 10, 2026 01:17
@github-actions
Contributor

github-actions bot commented Mar 10, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@greptile-apps
Contributor

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR introduces four new documentation files — llms.txt and llms-full.txt at both the repo root and under docs/ — following the llmstxt.org emerging standard to improve how AI coding assistants and web-based AI crawlers discover and understand Data Designer. The dual-location strategy (repo root for GitHub-based tools, docs/ for the GitHub Pages site) is well-reasoned and explained in the PR description.

Key observations:

  • The content is accurate, well-structured, and covers installation, core concepts, code patterns, architecture, and model/provider details.
  • The files are purely static documentation with no code logic, making this a safe documentation-only change.
  • Both llms.txt and llms-full.txt are present at repo root and docs/, enabling discovery by both GitHub-based tools (via raw.githubusercontent.com) and web-based AI crawlers (via the GitHub Pages site root).

Confidence Score: 5/5

  • Documentation-only PR with no functional or runtime impact; purely static files following an emerging AI discoverability standard.
  • This PR adds four documentation files to improve AI tool discoverability. All files are static content with no code logic, no dependencies, and no runtime behavior. The dual-location strategy (repo root and docs/) is well-explained and intentional. No functional issues identified.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| llms.txt | New file adding machine-readable project summary for AI tools at the repo root following llmstxt.org standard. Content is accurate and well-structured with no code logic concerns. |
| llms-full.txt | New file with comprehensive inline documentation for AI coding assistants. Covers installation, architecture, column types, models, providers, and common patterns. Well-structured and accurate. |
| docs/llms.txt | Companion copy of root llms.txt placed under docs/ for GitHub Pages site discovery. Content is identical and placed appropriately for web-based AI crawler access. |
| docs/llms-full.txt | Companion copy of root llms-full.txt placed under docs/ for GitHub Pages site discovery. Provides full documentation for web-based AI tools and supports the dual-location discovery strategy. |
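
Since the docs/ files are described as identical copies of the root files, one way to keep that true is a small check that compares each pair by hash. This is only a sketch: `mismatched_pairs` is a hypothetical helper, not something included in this PR, and the pair list simply mirrors the four filenames above.

```python
import hashlib
from pathlib import Path

# (root copy, docs copy) pairs, taken from the filenames listed above.
PAIRS = [
    ("llms.txt", "docs/llms.txt"),
    ("llms-full.txt", "docs/llms-full.txt"),
]


def digest(path):
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def mismatched_pairs(repo_root, pairs=PAIRS):
    """Return the pairs whose two copies are missing or differ."""
    root = Path(repo_root)
    bad = []
    for root_copy, docs_copy in pairs:
        a, b = root / root_copy, root / docs_copy
        if not (a.is_file() and b.is_file()) or digest(a) != digest(b):
            bad.append((root_copy, docs_copy))
    return bad


if __name__ == "__main__":
    drifted = mismatched_pairs(".")
    for pair in drifted:
        print("copies differ or are missing:", pair)
    raise SystemExit(1 if drifted else 0)
```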

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[AI Tool / Coding Assistant] --> B{Where does it look?}
    B -->|GitHub-based tools & Cursor/Copilot/Claude Code| C[Repo Root\nraw.githubusercontent.com]
    B -->|Web crawlers & browser-based AI| D[Docs Site Root\nnvidia-nemo.github.io/DataDesigner]

    C --> E[llms.txt\nConcise overview]
    C --> F[llms-full.txt\nFull inline docs]

    D --> G[docs/llms.txt\nIdentical copy]
    D --> H[docs/llms-full.txt\nIdentical copy]

    E --> I[AI understands: what Data Designer does,\nwhen to recommend it, links to resources]
    F --> J[AI generates: correct SDK code,\ncolumn configs, CLI usage, architecture context]
    G --> I
    H --> J

    style C fill:#76b900,color:#fff
    style D fill:#76b900,color:#fff
    style E fill:#e8f5e9
    style F fill:#e8f5e9
    style G fill:#e8f5e9
    style H fill:#e8f5e9

Last reviewed commit: d41ae77

@mvansegbroeck mvansegbroeck changed the title Add llms.txt and llms-full.txt for AI discoverability Docs: Add llms.txt and llms-full.txt for AI discoverability Mar 10, 2026
@mvansegbroeck mvansegbroeck changed the title Docs: Add llms.txt and llms-full.txt for AI discoverability docs: Add llms.txt and llms-full.txt for AI discoverability Mar 10, 2026
@mvansegbroeck
Contributor Author

I have read the DCO document and I hereby sign the DCO.


---

## Common use cases
Contributor

What do you all think about keeping only general information about Data Designer that won't go stale in here, with links branching out to docs + tutorials? Everything from here onwards could probably be replaced with links.

@andreatgretel
Contributor

Great idea - this should help AI tools discover and recommend Data Designer.

One concern: this content will get stale pretty quickly as the codebase evolves - version numbers, column types, API patterns, etc. Some ideas on keeping it fresh:

  1. Claude Code skill - a /regenerate-llms-txt skill that reads the actual codebase and regenerates both files. Run it before releases or whenever the API surface changes.
  2. Skill + CI gate - same thing but with a CI check that fails if the files are out of date.
  3. CI-only generation - a GitHub Action that regenerates on release. Would need to be a template + script that pulls in version numbers, column types, etc. programmatically - simpler but the prose quality would probably be worse.
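
To make options 2 and 3 a bit more concrete, here is a rough sketch of a template-plus-script generator with a staleness check that a CI job could run. Everything in it is hypothetical: the template text, the `render`, `staleness_diff`, and `main` helpers, and the placeholder facts are assumptions for illustration, and a real version would pull the version number and column types from the codebase.

```python
import difflib
from pathlib import Path
from string import Template

# Hypothetical template; the real file would carry much more prose.
TEMPLATE = Template(
    "# $name\n\n> Version $version\n\n## Column types\n\n$column_types\n"
)


def render(name, version, column_types):
    """Fill the template from facts gathered programmatically."""
    bullets = "\n".join(f"- {c}" for c in sorted(column_types))
    return TEMPLATE.substitute(name=name, version=version, column_types=bullets)


def staleness_diff(expected, path):
    """Unified diff between the committed file and the regenerated text."""
    path = Path(path)
    actual = path.read_text(encoding="utf-8") if path.is_file() else ""
    return list(
        difflib.unified_diff(
            actual.splitlines(),
            expected.splitlines(),
            fromfile=str(path),
            tofile="regenerated",
            lineterm="",
        )
    )


def main(path, **facts):
    """Return 0 if the committed file is current, 1 (and print a diff) if not."""
    diff = staleness_diff(render(**facts), path)
    if diff:
        print("\n".join(diff))
        return 1
    return 0
```

A CI step would call `main("llms.txt", ...)` with the extracted facts and fail the job on a non-zero return. As noted in option 3, templated prose will likely read worse than a hand-written or skill-generated file, so this may fit best as the gate rather than the generator.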

Wdyt? Any other suggestions?

Also fwiw, llms.txt is still pretty early as a standard and how agents actually parse these files varies a lot. Most coding assistants just dump the content into context as-is, so what matters most is that it's accurate and concise rather than following a specific structure. Seems like a solid starting point we can refine over time.

@mvansegbroeck
Contributor Author

mvansegbroeck commented Mar 12, 2026

Great suggestions @andreatgretel - having some kind of "implement and forget" solution looks better indeed.

@johnnygreco @nabinchha @eric-tramel Any other thoughts/suggestions here?
