llms-generator
Crawl any website and generate llms.txt with one command.
Install with pip install llms-generator, run llms-gen https://example.com, and upload the generated llms.txt to your domain root. No account required.
Quickstart
pip install llms-generator
llms-gen https://example.com
What Is llms.txt?
The AI-ready site map standard
llms.txt is a proposed standard from Jeremy Howard (AnswerDotAI, September 2024). Place a Markdown file at /llms.txt on your domain and list your important pages in it with short descriptions. AI assistants such as ChatGPT and Claude can read this file when answering user questions, instead of parsing your HTML, navigation menus, and ads.
The format is precise Markdown. An H1 heading with your project name. A blockquote summary. Sections split by H2 headings. Each entry is a link with a description after the colon.
llms.txt coexists with robots.txt and sitemap.xml. Robots.txt controls crawler access. Sitemap.xml lists every indexable page. llms.txt provides a curated overview. It is for answering user questions, not for training models.
Why Does llms.txt Matter Now?
Context windows are limited
ChatGPT, Claude, and Gemini have limited context windows. They cannot take in your entire website along with its navigation, JavaScript, and ads. llms.txt gives them your essential pages in one request.
The Optional section has a special meaning. Entries marked Optional can be skipped when the LLM needs shorter context. Use it for secondary resources, tutorials, or examples. Primary docs go in named sections.
Key Takeaways
- LLMs have limited context windows and cannot read your entire website in one request.
- llms.txt gives AI assistants your essential pages in a single Markdown file.
- The Optional section lets you mark secondary content that LLMs can skip when context is tight.
What Is the llms.txt Specification?
Standard file structure
The llms.txt specification defines a Markdown file structure with an H1 heading, an optional blockquote summary, H2 section headings, and link lists with colon-separated descriptions. The file goes at the root of your domain at /llms.txt.
Standard file structure, in order:
- H1 heading. Project or site name (required).
- Blockquote. Short summary of the site.
- Optional paragraphs. Additional details about the project.
- H2 sections. Categories like "Docs", "Guides", "API".
- Link lists. [Title](URL): Description inside each section.
# Project Name
> Short description of the project or site.
Additional context for the LLM goes here.
## Docs
- [Getting Started](https://example.com/docs/getting-started): Step-by-step setup for new users.
- [API Reference](https://example.com/docs/api): Complete REST endpoint documentation.
## Optional
- [Tutorial Videos](https://example.com/tutorials): Walkthroughs for common tasks.
Use ## Optional as a special section. LLMs can skip these links when context is tight. Everything else is considered essential.
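To make the structure above concrete, here is a minimal sketch of how a consumer might read an llms.txt file and drop the Optional section when context is tight. The function names parse_llms_txt and essential_sections are hypothetical and are not part of llms-generator. The parser handles only the elements shown in the example: H1, blockquote, H2 sections, and link lines.

```python
import re

# Matches "- [Title](URL): Description"; the description part is optional.
LINK_RE = re.compile(
    r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<desc>.*))?$"
)


def parse_llms_txt(text):
    """Return (title, summary, sections); sections maps H2 name -> list of link dicts."""
    title, summary, sections, current = None, None, {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and title is None:
            title = line[2:].strip()
        elif line.startswith("> ") and summary is None:
            summary = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                sections[current].append(match.groupdict())
    return title, summary, sections


def essential_sections(sections):
    """Drop the Optional section, as a consumer might when context is tight."""
    return {name: links for name, links in sections.items() if name != "Optional"}


SAMPLE = """# Project Name
> Short description of the project or site.
Additional context for the LLM goes here.
## Docs
- [Getting Started](https://example.com/docs/getting-started): Step-by-step setup for new users.
- [API Reference](https://example.com/docs/api): Complete REST endpoint documentation.
## Optional
- [Tutorial Videos](https://example.com/tutorials): Walkthroughs for common tasks.
"""

title, summary, sections = parse_llms_txt(SAMPLE)
print(title, list(essential_sections(sections)))
```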
| File | Role | Typical Size |
|---|---|---|
| llms.txt | Curated directory of essential pages | 2 to 50 KB |
| llms-full.txt | Full page content concatenated into one document | 50 KB to 5+ MB |
Key Takeaways
- llms.txt uses a standard Markdown structure: H1 title, blockquote summary, H2 sections, link lists with descriptions.
- Two file types exist: llms.txt (2-50 KB curated directory) and llms-full.txt (50 KB-5+ MB full content).
- The Optional section lets LLMs skip secondary content when working with limited context.
What Features Does llms-generator Include?
What the tool does automatically
llms-generator includes six built-in features that handle crawling, filtering, grouping, and output generation automatically. You only need to point it at a URL.
Robot Checks
Respects robots.txt, X-Robots-Tag headers, and meta robots tags on every page. Pages marked noindex are excluded.
Auto Grouping
Pages are grouped by directory path. /docs/* becomes "Docs", /blog/* becomes "Blog". Sections are ordered by priority.
JS Fallback
Playwright headless browser is used as fallback for JavaScript-rendered sites. Browser is launched once and reused.
Dual Output
Generates both llms.txt (curated directory) and llms-full.txt (full page content) with a single --full flag.
URL Normalization
Deduplicates http/https variants and trailing slash differences. Follows HTTP redirects and records final URLs.
Spec Compliant
Output follows the llmstxt.org specification. H1 title, blockquote summary, H2 sections, and link lists with descriptions.
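The URL Normalization feature above can be pictured with a small sketch. This is an illustration of the general technique, not the tool's actual code: normalize_url and dedupe are hypothetical names, and the choices to prefer https, drop fragments, and keep query strings are assumptions. Redirect handling is left out because it needs network access.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Map http/https and trailing-slash variants of a URL to one canonical form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    # Assumption: prefer https, drop the fragment, keep the query string.
    return urlunsplit(("https", host, path, parts.query, ""))


def dedupe(urls):
    """Return canonical URLs in first-seen order, without duplicates."""
    seen, unique = set(), []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


print(dedupe([
    "http://example.com/docs/",
    "https://example.com/docs",
    "https://example.com/docs#intro",
]))
```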
Key Takeaways
- llms-generator respects robots.txt, X-Robots-Tag, and meta robots on every page automatically.
- Pages are grouped by directory path into named sections like Docs, Blog, and API.
- Dual output mode generates both llms.txt and llms-full.txt with a single --full flag.
What CLI Options Are Available?
All available flags and defaults
| Flag | Default | Description |
|---|---|---|
| URL | required | Target website URL |
| --depth | 2 | Maximum crawl depth |
| --output | llms.txt | Output file path |
| --full | false | Also generate llms-full.txt |
| --delay | 1.0 | Seconds between requests |
| --no-js | false | Skip Playwright JS fallback |
How Does llms-generator Work?
Step-by-step crawl process
Parse robots.txt
Respects Disallow and Crawl-Delay rules. Gracefully handles missing or restricted files.
BFS Crawl
Starts from your URL and follows internal links breadth-first up to the configured depth.
Per-Page Analysis
Extracts title, h1, meta description, first paragraph, and directory path. Falls back to Playwright for JS-rendered content.
Section Grouping
Groups pages by top-level directory path. /docs/* becomes "Docs", /blog/* becomes "Blog".
Spec Output
Writes valid llms.txt per the llmstxt.org specification with proper H1, blockquote, H2 sections, and link lists.
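The BFS Crawl step above follows a standard breadth-first pattern. The sketch below shows that pattern on an in-memory link graph so it runs without network access. The names bfs_crawl, get_links, and SITE are hypothetical, and counting the start URL as depth 0 is an assumption about how the depth limit is applied.

```python
from collections import deque


def bfs_crawl(start, get_links, max_depth=2):
    """Visit pages breadth-first and return {url: depth} for pages within max_depth."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if depths[url] >= max_depth:
            continue  # do not expand links beyond the configured depth
        for link in get_links(url):
            if link not in depths:  # skip pages that were already discovered
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths


# A toy site: each key is a page, each value is the internal links found on it.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api", "/"],
    "/docs/api": ["/docs/api/auth"],
    "/blog": [],
}

result = bfs_crawl("/", lambda url: SITE.get(url, []), max_depth=2)
print(result)
```

With max_depth=2, /docs/api is recorded but its own links are not followed, so /docs/api/auth never appears in the result.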
Key Takeaways
- llms-generator starts by parsing robots.txt, then runs a breadth-first crawl respecting all robots directives.
- Each page is analyzed for title, h1, meta description, and falls back to Playwright for JS-rendered content.
- Output is a spec-compliant llms.txt with proper H1, blockquote, H2 sections, and link lists.
How Do I Install and Run llms-generator?
Install and run in minutes
You need Python installed on your computer. Python is a programming language. Most Mac and Linux computers have it already. On Windows, download it from python.org. Get version 3.10 or newer.
Open your terminal (Command Prompt on Windows, Terminal on Mac/Linux). Type this and press Enter:
pip install llms-generator
Wait for the installation to finish. Then type this, replacing the URL with your own website:
llms-gen https://example.com
That creates llms.txt in the current folder. Open it with any text editor. You will see your pages grouped into sections with descriptions.
To also generate the full content file, add --full:
llms-gen https://example.com --full
What Do You Need to Run llms-generator?
Requirements
Python 3.10 or higher. Upload access to your server root folder (usually called public_html or www).
How Do I Crawl and Audit My Site?
Start the crawl
llms-gen https://example.com --depth 3 --delay 1.0
The tool starts at your URL. It follows internal links up to the depth you set. Every page goes through three checks before it is added to the output:
- robots.txt. Skips disallowed paths automatically.
- X-Robots-Tag. Respects noindex and nofollow from HTTP headers.
- <meta name="robots">. Respects page-level directives in the HTML.
Set --delay 1.0 to wait one second between requests. For sites under 50 pages, 0.5 seconds is safe.
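The header and meta tag checks listed above can be sketched with the standard library. This is an illustration under stated assumptions, not the tool's implementation: robots_directives and is_indexable are hypothetical names, the headers argument is assumed to be a plain dict keyed by "X-Robots-Tag", and user-agent-scoped header values are not handled.

```python
from html.parser import HTMLParser


class _MetaRobots(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",") if d.strip()}


def robots_directives(headers, html):
    """Merge directives from the X-Robots-Tag header and the meta robots tag."""
    header = headers.get("X-Robots-Tag", "")
    directives = {d.strip().lower() for d in header.split(",") if d.strip()}
    parser = _MetaRobots()
    parser.feed(html)
    return directives | parser.directives


def is_indexable(headers, html):
    """A page is excluded if either source says noindex."""
    return "noindex" not in robots_directives(headers, html)


page = "<html><head><meta name='robots' content='noindex, nofollow'></head></html>"
print(is_indexable({}, page))
```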
What Gets Filtered Automatically When I Crawl?
Automatic filtering
You do not need to build a URL list by hand. The tool filters while it crawls:
| Kept in llms.txt | Filtered Out |
|---|---|
| HTML pages with real content | Login, signup, admin, and account pages |
| Docs, guides, tutorials, blog posts | 404s, 500s, and empty responses |
| API references and changelogs | PDFs, images, CSS, JS files |
| About, contact, FAQ, privacy pages | Pages with noindex meta tag |
To skip a directory like /tag/ or /author/, add a Disallow rule in your robots.txt. The tool reads it automatically.
User-agent: llms-generator/0.1
Disallow: /tag/
Disallow: /author/
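To preview which URLs a set of Disallow rules would exclude, you can test them locally with Python's urllib.robotparser. This is a sketch of the general check, not the tool's own code. It uses a wildcard User-agent group for simplicity, because how llms-generator matches its own user-agent token is not shown here. The helper name allowed_urls is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumption: a wildcard group, so the rules apply to any crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tag/
Disallow: /author/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt text lets user_agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


candidates = [
    "https://example.com/docs/getting-started",
    "https://example.com/tag/python",
    "https://example.com/author/jane",
    "https://example.com/blog/hello",
]

allowed = allowed_urls(ROBOTS_TXT, "llms-generator", candidates)
print(allowed)
```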
How Do I Generate llms.txt?
Auto-grouped output
Pages are grouped by directory path. /docs/getting-started goes under ## Docs. /blog/hello-world goes under ## Blog.
llms-gen https://example.com --depth 3 --output llms.txt
Sections appear in priority order: Home, About, Docs, API, Blog, then the rest alphabetically. Each entry shows the page title, URL, and a one-line description from the page's meta description or h1 tag.
# Example Site
> Example documentation and blog content.
## Docs
- [Getting Started](https://example.com/docs/getting-started): How to get started on the platform.
- [API Reference](https://example.com/docs/api): Full endpoint documentation.
## Blog
- [Hello World](https://example.com/blog/hello): Announcing the public launch.
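The grouping and ordering rules described above can be sketched as follows. The names section_for and group_pages are hypothetical, and the way a path segment is turned into a section title (title case, with "api" shown as "API") is an assumption. The priority list comes from the order stated in this section.

```python
from urllib.parse import urlsplit

# Priority order from the text; every other section is sorted alphabetically.
PRIORITY = ["Home", "About", "Docs", "API", "Blog"]


def section_for(url):
    """Derive a section name from the first path segment of a URL."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments:
        return "Home"
    name = segments[0]
    # Assumption: "api" is displayed as "API"; other names use title case.
    return "API" if name.lower() == "api" else name.replace("-", " ").title()


def group_pages(urls):
    """Group URLs by section and return the sections in priority order."""
    groups = {}
    for url in urls:
        groups.setdefault(section_for(url), []).append(url)

    def sort_key(name):
        if name in PRIORITY:
            return (PRIORITY.index(name), "")
        return (len(PRIORITY), name.lower())

    return {name: groups[name] for name in sorted(groups, key=sort_key)}


grouped = group_pages([
    "https://example.com/blog/hello",
    "https://example.com/docs/getting-started",
    "https://example.com/",
    "https://example.com/changelog/v1",
    "https://example.com/docs/api",
    "https://example.com/about",
])
print(list(grouped))
```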
How Do I Generate llms-full.txt?
Full content version
Pass --full once. The tool writes both files in one run. llms.txt stays as a structured directory. llms-full.txt stacks every page's full text under section headings.
llms-gen https://example.com --full
ChatGPT or Claude can read llms-full.txt in one request without fetching each page separately. Keep the file under 100,000 tokens, about 75,000 words. Smaller models drop content past their limit.
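A rough way to check a generated file against that budget is to count words and apply the same ratio of 75,000 words per 100,000 tokens. This is only an approximation, since real tokenizers vary by model. The names estimate_tokens and within_budget are hypothetical.

```python
# Ratio taken from the rule of thumb above: 100,000 tokens is about 75,000 words.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text):
    """Approximate the token count from the whitespace-separated word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)


def within_budget(text, max_tokens=100_000):
    """True if the estimated token count fits in the budget."""
    return estimate_tokens(text) <= max_tokens


sample = "word " * 75_000
print(estimate_tokens(sample), within_budget(sample))
```

To check a real file, read it first, for example with open("llms-full.txt", encoding="utf-8").read(), and pass the text to within_budget.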
How Do I Write Descriptions That Get Cited by AI?
Optimize for AI
The tool pulls descriptions from your page's meta description or h1 tag. You can edit them after generation. Open the llms.txt file and rewrite any line. The format is plain Markdown.
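To see what description a page is likely to yield before you run the crawl, the fallback from meta description to h1 can be sketched with the standard library. This illustrates the idea only. The describe function is a hypothetical name, and the exact extraction rules of llms-generator may differ.

```python
from html.parser import HTMLParser


class _PageSummary(HTMLParser):
    """Captures the meta description and the text of the first h1."""

    def __init__(self):
        super().__init__()
        self.meta_description = ""
        self.h1 = ""
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "description":
            self.meta_description = (attr.get("content") or "").strip()
        elif tag == "h1" and not self.h1:
            self._in_h1 = True  # only the first h1 is captured

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.h1 += data


def describe(html):
    """Prefer the meta description; fall back to the first h1 text."""
    parser = _PageSummary()
    parser.feed(html)
    parser.close()
    return parser.meta_description or " ".join(parser.h1.split())


html = '<head><meta name="description" content="Install llms-generator with pip."></head><h1>Install</h1>'
print(describe(html))
```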
Use your actual brand name in every description. Do not write "our platform" or "we offer". Write the real name so the AI can connect your content to your business.
Put the main point first in each entry. Write each description so it works as a standalone sentence. If the AI reads only that one line, it should still be useful. For example, instead of "How to install the package", write "Install llms-generator with pip in under one minute."
Add Article, Person, and Organization structured data to your pages. Schema markup tells AI systems what a page is, who wrote it, and who published it. The crawler finds the page through llms.txt, then reads the schema to decide if it should cite you. Use a schema generator to create this code without writing it by hand.
Use semantic HTML tags like <article> and <section> instead of nested div containers. Clean HTML helps the crawler extract your text accurately for the full content file.
How Do I Deploy llms.txt to My Server?
Upload to your server
Open your FTP client or hosting file manager. Upload llms.txt to the same folder that holds index.html and robots.txt. That folder is usually called public_html, www, or htdocs.
Open this URL in your browser to confirm it works:
https://yourdomain.com/llms.txt
You should see a plain text file starting with # Your Site Name. If you get a 404, move the file up one folder.
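The same check can be scripted. The sketch below is a minimal example with hypothetical function names. looks_like_llms_txt tests only that the first non-empty line is an H1 heading, and check_deployment performs a real network request, so it is defined here but not called. Note that urlopen raises an HTTPError for a 404 rather than returning a response.

```python
from urllib.request import urlopen


def looks_like_llms_txt(body):
    """True if the first non-empty line is an H1 heading such as '# Your Site Name'."""
    for line in body.splitlines():
        if line.strip():
            return line.startswith("# ")
    return False


def check_deployment(url):
    """Fetch the deployed file and apply the check above (network access required)."""
    with urlopen(url, timeout=10) as response:
        body = response.read().decode("utf-8")
        return response.status == 200 and looks_like_llms_txt(body)


print(looks_like_llms_txt("# Example Site\n> Example documentation and blog content."))
```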
How Do I Automate llms.txt Updates?
Keep it fresh
Most users can skip this. Re-run llms-gen when your site changes and upload the new file. If you use GitHub, you can set it to regenerate on every push. That is optional.
Review your llms.txt every three to six months. Remove old pages. Add new ones. Refresh descriptions if your content has changed. A stale directory points AI assistants at outdated or missing pages.
Frequently Asked Questions
Common questions about llms-generator
Does llms.txt affect search rankings?
No. It is not a ranking signal. It only affects how AI crawlers discover your content.
Can a site have more than one llms.txt?
No. Only one llms.txt goes at the root of your domain.
Does it work on JavaScript-rendered sites?
Yes, if you install Playwright. Run pip install llms-generator[js]. The tool falls back to JS rendering when HTTP fetch returns empty content.
How do I exclude paths from the crawl?
Add Disallow rules to your robots.txt. The tool respects them on every crawl. There is no CLI flag for path exclusion.
How long does a crawl take?
Roughly the --delay value multiplied by the number of pages. A 50-page site at a 1-second delay takes about one minute.
Can I edit the generated descriptions?
Yes. The tool auto-generates descriptions from your page's meta description or h1 tag. Open llms.txt after generation and edit any line. The format is plain Markdown.
What's the Final Checklist Before Shipping?
Before you ship
- ☐ Installed llms-generator with pip
- ☐ Ran llms-gen https://yoursite.com --depth 3
- ☐ Generated llms.txt (add --full for the full version)
- ☐ Checked that all links work
- ☐ Uploaded the file to your domain root
- ☐ Opened https://yoursite.com/llms.txt in a browser to verify
Install the package, run one command, upload the file. That is all it takes to create an llms.txt for any website.
Once it is live, open ChatGPT or Claude and ask what your site does. If the answer is accurate, your descriptions are working. If not, tighten them and regenerate.