| 1 | +# Generating LLM Files |
| 2 | + |
| 3 | +This directory contains scripts to automatically generate `llms.txt` and `llms-full.txt` files for LLM consumption. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The LLM files provide structured documentation references that help AI assistants: |
| 8 | +- Find the correct documentation pages |
| 9 | +- Understand the documentation structure |
| 10 | +- Reduce hallucinations by providing accurate URLs |
| 11 | +- Discover all available integration options |
| 12 | + |
| 13 | +## Files |
| 14 | + |
| 15 | +- `generate-llm-files.js` - Node.js script that generates the LLM files |
| 16 | +- `generate-llm-files.sh` - Shell wrapper script for easier execution |
| 17 | + |
| 18 | +## Usage |
| 19 | + |
| 20 | +### Option 1: After Local Build |
| 21 | + |
| 22 | +1. Build the documentation site: |
| 23 | + ```bash |
| 24 | + yarn antora ./antora-playbook.yml |
| 25 | + ``` |
| 26 | + |
| 27 | +2. Generate the LLM files from the local sitemap: |
| 28 | + ```bash |
| 29 | + yarn generate-llm-files |
| 30 | + # or |
| 31 | + ./-scripts/generate-llm-files.sh |
| 32 | + ``` |
| 33 | + |
| 34 | +### Option 2: From Remote Sitemap |
| 35 | + |
| 36 | +Generate directly from the published sitemap (useful for syncing with production): |
| 37 | + |
| 38 | +```bash |
| 39 | +yarn generate-llm-files-from-url |
| 40 | +# or |
| 41 | +node ./-scripts/generate-llm-files.js https://www.tiny.cloud/docs/antora-sitemap.xml |
| 42 | +``` |
| 43 | + |
| 44 | +### Option 3: Custom Sitemap Source |
| 45 | + |
| 46 | +```bash |
| 47 | +node ./-scripts/generate-llm-files.js /path/to/sitemap.xml |
| 48 | +# or |
| 49 | +node ./-scripts/generate-llm-files.js https://example.com/sitemap.xml |
| 50 | +``` |
| 51 | + |
| 52 | +## Workflow |
| 53 | + |
| 54 | +### Manual Regeneration (Current Approach) |
| 55 | + |
| 56 | +**After major/minor/patch releases:** |
| 57 | +1. Run the script to regenerate files from production sitemap: |
| 58 | + ```bash |
| 59 | + yarn generate-llm-files-from-url |
| 60 | + ``` |
| 61 | + This ensures the LLM files match what's actually published on the live site. |
| 62 | + |
| 63 | + Alternatively, if you need to generate from a local build: |
| 64 | + ```bash |
| 65 | + yarn generate-llm-files |
| 66 | + ``` |
| 67 | +2. Review the generated files in a PR |
| 68 | +3. Commit and merge |
| 69 | + |
| 70 | +**Why not automated in CI/CD?** |
| 71 | +- The script makes 400+ HTTP requests to fetch H1 titles, which takes about 4-5 minutes |
| 72 | +- That is too slow and resource-intensive to run on every build |
| 73 | +- Manual review ensures quality before committing |
| 74 | +- Review also confirms that no 404 pages are listed and that titles match the actual page content |
| 75 | + |
| 76 | +### File Locations |
| 77 | + |
| 78 | +The files are generated in `modules/ROOT/attachments/`: |
| 79 | +- `llms.txt` - Simplified, curated documentation index (~105 lines) |
| 80 | +- `llms-full.txt` - Complete documentation index with all pages (~700 lines) |
| 81 | + |
| 82 | +**Post-build:** The files are moved to the root directory (handled in a separate PR) and are accessible at: |
| 83 | +- `https://www.tiny.cloud/docs/tinymce/latest/llms.txt` |
| 84 | +- `https://www.tiny.cloud/docs/tinymce/latest/llms-full.txt` |
| 85 | + |
| 86 | +## How It Works |
| 87 | + |
| 88 | +1. **Reads sitemap.xml** - Extracts all documentation URLs from the sitemap (only `/latest/` URLs) |
| 89 | +2. **Fetches H1 titles** - Makes HTTP requests to each page to get the actual H1 title (validates no 404s) |
| 90 | +3. **Generates titles** - Uses fetched H1 titles, falls back to URL-based titles if fetch fails |
| 91 | +4. **Categorizes pages** - Groups by topic (integrations, plugins, API, etc.) based on URL patterns |
| 92 | +5. **Deduplicates** - Removes duplicate URLs and makes titles unique within categories |
| 93 | +6. **Generates structured markdown** - Creates both simplified (`llms.txt`) and complete (`llms-full.txt`) indexes |
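
The first and third steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `generate-llm-files.js`: the helper names `extractLatestUrls` and `titleFromUrl`, the regex-based `<loc>` parsing, and the sample sitemap are all assumptions made for the example.

```javascript
// Hypothetical helpers; names are illustrative, not taken from generate-llm-files.js.

// Pull every <loc> value out of a sitemap and keep only /latest/ pages.
function extractLatestUrls(sitemapXml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match;
  while ((match = locPattern.exec(sitemapXml)) !== null) {
    if (match[1].includes('/latest/')) {
      urls.push(match[1]);
    }
  }
  // A Set drops duplicate URLs while preserving first-seen order.
  return [...new Set(urls)];
}

// Build a readable title from the last path segment,
// for use when the H1 fetch fails.
function titleFromUrl(url) {
  const segments = new URL(url).pathname.split('/').filter(Boolean);
  const last = segments[segments.length - 1] || 'index';
  return last
    .replace(/\.html?$/, '')
    .split('-')
    .filter(Boolean)
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(' ');
}

// Made-up sitemap with one versioned page and one duplicate entry.
const sampleSitemap = `
<urlset>
  <url><loc>https://www.tiny.cloud/docs/tinymce/latest/react-cloud/</loc></url>
  <url><loc>https://www.tiny.cloud/docs/tinymce/6/react-cloud/</loc></url>
  <url><loc>https://www.tiny.cloud/docs/tinymce/latest/react-cloud/</loc></url>
</urlset>`;

const latestUrls = extractLatestUrls(sampleSitemap);
console.log(latestUrls.map((url) => ({ url, title: titleFromUrl(url) })));
```

In this sketch the versioned `/6/` URL and the duplicate entry are both dropped, leaving a single `/latest/` page whose fallback title is derived from its slug.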
| 94 | + |
| 95 | +## Customization |
| 96 | + |
| 97 | +The script uses hardcoded categorization logic. To customize: |
| 98 | + |
| 99 | +1. Edit `generate-llm-files.js` |
| 100 | +2. Modify the `categorizeUrl()` function to adjust categorization |
| 101 | +3. Update `generateLLMsTxt()` and `generateLLMsFullTxt()` to change output format |
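
As a rough idea of what pattern-based categorization can look like, the sketch below maps URL paths to categories with an ordered rule table. The rule list, the category names, and the `Other` fallback are assumptions for illustration; the real `categorizeUrl()` in `generate-llm-files.js` may be structured differently.

```javascript
// Illustrative only: these patterns and category names are assumptions,
// not the rules in generate-llm-files.js.
const CATEGORY_RULES = [
  { category: 'API Reference', pattern: /\/apis\// },
  { category: 'Integrations', pattern: /\/(react|angular|vue)-/ },
  { category: 'Plugins', pattern: /\/plugins?\// },
];

// Return the first category whose pattern matches the URL path.
function categorizeUrl(url) {
  const { pathname } = new URL(url);
  const rule = CATEGORY_RULES.find(({ pattern }) => pattern.test(pathname));
  return rule ? rule.category : 'Other';
}

console.log(categorizeUrl('https://www.tiny.cloud/docs/tinymce/latest/react-cloud/'));
console.log(categorizeUrl('https://www.tiny.cloud/docs/tinymce/latest/apis/tinymce.editor/'));
console.log(categorizeUrl('https://www.tiny.cloud/docs/tinymce/latest/release-notes/'));
```

With a table like this, supporting a new page type means adding one rule. Because the first match wins, more specific patterns should be listed before broader ones.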
| 102 | + |
| 103 | +## Notes |
| 104 | + |
| 105 | +- The script requires Node.js and the `sanitize-html` package (installed via `yarn install`) |
| 106 | +- Generated files are written to `modules/ROOT/attachments/` |
| 107 | +- Uses only the sitemap (no dependency on `nav.adoc`) |
| 108 | +- Fetches actual H1 titles from pages (validates no 404s) |
| 109 | +- Rate-limited fetching: 10 concurrent requests with 100ms delay between batches |
| 110 | +- Request timeout: 10 seconds per page |
| 111 | +- Security: Validates URLs to prevent SSRF attacks (only allows tiny.cloud domains) |
| 112 | +- Handles HTML entity decoding (`’` → `'`) |
| 113 | +- Filters out error pages and duplicate URLs |
| 114 | +- Makes titles unique within categories (e.g., "ES6 and npm (Webpack)", "ES6 and npm (Rollup)") |
| 115 | +- Falls back to URL-based title generation if H1 fetch fails |
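
The URL validation and rate limiting described in these notes could be approximated as below. This is a sketch under stated assumptions: the function names are hypothetical, the constants simply mirror the numbers quoted above, and restricting requests to `https:` is an extra assumption not stated in the notes.

```javascript
// Assumed constants, taken from the numbers quoted in the notes above.
const BATCH_SIZE = 10;
const BATCH_DELAY_MS = 100;

// Allow only https URLs on tiny.cloud or one of its subdomains.
// (The https-only check is an assumption of this sketch.)
function isAllowedUrl(rawUrl) {
  try {
    const { protocol, hostname } = new URL(rawUrl);
    return protocol === 'https:' &&
      (hostname === 'tiny.cloud' || hostname.endsWith('.tiny.cloud'));
  } catch {
    return false; // not a parseable URL
  }
}

// Split a list into consecutive batches of at most `size` items.
function toBatches(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `worker` over every allowed URL, one batch at a time,
// pausing between batches but not after the last one.
async function processInBatches(urls, worker) {
  const results = [];
  const batches = toBatches(urls.filter(isAllowedUrl), BATCH_SIZE);
  for (const [index, batch] of batches.entries()) {
    results.push(...(await Promise.all(batch.map(worker))));
    if (index < batches.length - 1) await sleep(BATCH_DELAY_MS);
  }
  return results;
}
```

Here `worker` stands in for whatever function fetches a page and extracts its H1. Matching the hostname exactly or by a `.tiny.cloud` suffix rejects look-alike hosts such as `tiny.cloud.evil.example`, which a plain substring check would accept.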
| 116 | + |
| 117 | +## Troubleshooting |
| 118 | + |
| 119 | +**Error: "Source not found"** |
| 120 | +- Make sure the sitemap path is correct |
| 121 | +- For remote URLs, check your internet connection |
| 122 | +- For local files, ensure Antora has generated the site first |
| 123 | + |
| 124 | +**Missing page titles** |
| 125 | +- If H1 fetch fails, the script uses URL-based title generation as fallback |
| 126 | +- Check that pages return valid HTML with H1 tags |
| 127 | +- 404 pages are automatically filtered out |
| 128 | + |
| 129 | +**Incorrect categorization** |
| 130 | +- Review the `categorizeUrl()` function (note: function name is singular, not plural) |
| 131 | +- Add custom patterns for new page types |