veda.ng
Open-Source Project

AI Discovery Standards

Every file, protocol, and technique used to make websites discoverable by AI systems, search engines, and autonomous agents. One command to set up everything.

View on GitHub
npx ai-discovery-standards

Why AI discovery matters now

The way people find information is changing. Google Search is no longer the only gateway to content. Perplexity, ChatGPT Search, Google AI Overviews, and Claude are answering questions directly, pulling from websites and citing sources inline. If your content is not structured for these systems, you are invisible to a growing share of how people discover information.

Traditional SEO optimizes for one system: Google's ranking algorithm. AI discovery optimizes for three simultaneously: traditional search engines (SEO), AI answer engines that cite sources (AEO), and generative models that recommend content (GEO). Each requires different signals, different file formats, and different content structures.

Most websites today have zero AI discovery infrastructure. They have a robots.txt that was last updated in 2019, no llms.txt, no structured AI permissions, and no agent-readable metadata. This is the equivalent of not having a sitemap in 2010. The gap between sites with AI discovery files and sites without will widen as AI search traffic grows.

This project provides every file you need to close that gap. It is not a framework, a library, or a SaaS product. It is a set of static files that any website can deploy in under five minutes.

What it does

Run one command and generate 13 AI discovery files for any web project. The CLI tool auto-detects your public/ or static/ directory, asks for your site details, and creates every file you need. Existing files are never overwritten.

One-command setup

npx ai-discovery-standards generates all 13 files

25+ AI crawlers

Complete robots.txt with every known AI bot

AEO & GEO guides

Answer Engine and Generative Engine optimization

Claude Code skill

Slash command for AI-assisted setup

Discovery Files

Static files you place on your web server to communicate with AI crawlers and agents. Each file serves a specific purpose in the discovery stack.

robots.txt: Crawler access policies for 25+ AI bots
llms.txt: Curated content summary for LLMs
llms-full.txt: Full-text content for AI ingestion
ai.txt: AI usage permissions (training, citation, indexing)
ai.json: Structured content map for AI agents
brand.txt: Brand governance rules for AI systems
ai-plugin.json: ChatGPT plugin manifest
agents.json: A2A agent capability advertisement
security.txt: Vulnerability reporting (RFC 9116)
humans.txt: Team credits and technologies
sitemap.xml: URL index with metadata
manifest.json: PWA metadata and icons
browserconfig.xml: Windows tile configuration

AEO vs GEO: how to optimize for both

Answer Engine Optimization (AEO) is about getting your content selected as the direct answer when someone asks Perplexity, ChatGPT Search, or Google AI Overviews a question. The key is structure: use H2 headings that are literal questions, follow each with a concise 2-3 sentence answer, then provide supporting detail below. AI answer engines preferentially extract from this question-answer pattern because it maps cleanly to user queries.
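A sketch of that pattern in HTML (the heading and answer text here are invented for illustration):

```html
<!-- H2 is a literal question a user might type -->
<h2>What is llms.txt?</h2>

<!-- Concise 2-3 sentence answer directly below the heading -->
<p>
  llms.txt is a Markdown file served at /llms.txt that gives language models
  a curated summary of a site. It lists the site's purpose and links to its
  most important pages.
</p>

<!-- Supporting detail follows, under subordinate headings -->
<h3>How answer engines use it</h3>
<p>Answer engines can quote the paragraph above verbatim as the answer.</p>
```

The question heading plus self-contained answer paragraph gives an answer engine a clean, quotable span to extract.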

Generative Engine Optimization (GEO) targets a different outcome: being cited as a source across AI platforms. When Claude or ChatGPT recommends a tool, a framework, or a company, what determines which ones get mentioned? The answer is authority signals: structured data (JSON-LD), consistent terminology across pages, clear authorship attribution, and machine-readable content summaries like llms.txt.

Practical implementation:

  • Restructure your top 10 pages with question-format H2 headings and concise answer paragraphs
  • Add FAQ schema (JSON-LD) to every page that answers common questions
  • Publish an llms.txt with a clear, factual description of your site and its content
  • Add Organization and Person schema to establish entity authority
  • Use consistent, specific terminology rather than vague descriptions across all pages
  • Ensure every page has a clear, quotable summary in the first paragraph
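The FAQ schema item above can be implemented with a schema.org FAQPage block embedded in the page. A minimal JSON-LD sketch, with placeholder question and answer text:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "What is llms.txt?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "A Markdown file at /llms.txt that gives LLMs a curated summary of your site."
    }
  }]
}
</script>
```

Each question on the page gets one entry in `mainEntity`; the visible page copy should match the schema text.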

The companies that implement both AEO and GEO now will compound their visibility as AI search traffic grows. Sites without these signals will not lose Google traffic immediately, but they will miss the fastest-growing discovery channel of 2026.

AI Crawler Registry

All known AI crawler user-agent strings as of April 2026, organized by company. Your robots.txt should address each of these explicitly.

OpenAI

GPTBot, OAI-SearchBot, ChatGPT-User

Anthropic

ClaudeBot, Claude-SearchBot, Claude-User

Google

Googlebot, Google-Extended, GoogleOther

Perplexity

PerplexityBot, Perplexity-User

Meta

meta-externalagent, meta-externalfetcher

Apple

Applebot, Applebot-Extended

Amazon

Amazonbot

ByteDance

Bytespider, TikTokSpider

Others

CCBot, cohere-ai, CopilotBot, YouBot, Diffbot

robots.txt strategy for AI crawlers

The critical distinction in AI crawler management is between search bots and training bots. These serve fundamentally different purposes, and your robots.txt policy should treat them differently.

Search bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) crawl your site to include your content in AI-generated answers. When someone asks "what is the best tool for X?" and your site has the answer, these bots are what make your content citable. Blocking them removes you from AI search results entirely.

Training bots (GPTBot, ClaudeBot, Google-Extended) crawl your site to ingest content into model training data. Your content becomes part of the model's knowledge but is not attributed to you. Some publishers block these to retain control over their content. Others allow them for broader influence.

Recommended strategy for most businesses: allow all search bots (you want citations), selectively allow or block training bots based on your content strategy, and always allow Googlebot (traditional search remains the largest traffic source for most sites).
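That strategy might translate into a robots.txt like the following sketch, which blocks the training bots; adjust those rules to your own content policy:

```txt
# Search bots: allow (these make your content citable in AI answers)
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Training bots: block or allow per your content strategy
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Traditional search: always allow
User-agent: Googlebot
Allow: /
```

Note that robots.txt is advisory: compliant crawlers honor it, but it is not an access control mechanism.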

FAQ

What is llms.txt?
A Markdown file at /llms.txt that gives LLMs a curated summary of your site. It includes a title, a one-paragraph description, and organized links to your key pages. Created by Jeremy Howard (Answer.AI) in 2024. Adopted by Anthropic, Stripe, Vercel, and Cloudflare.
What is the difference between AEO and GEO?
AEO (Answer Engine Optimization) targets question-answer extraction by AI systems like ChatGPT and Perplexity. GEO (Generative Engine Optimization) targets citation rate and "Share of AI Voice" across all AI platforms. AEO is about being the answer. GEO is about being the cited source.
Which AI crawlers should I allow?
Separate training bots (GPTBot, ClaudeBot, Google-Extended) from search bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot). Blocking training bots prevents your content from being absorbed into model weights. Blocking search bots removes you from AI-generated answers entirely.
What is brand.txt?
A plain-text file that tells AI systems how to represent your brand: correct name capitalization, preferred terminology, prohibited terms, tone guidance, and competitor disambiguation. Reduces hallucinations about your brand identity.
What is ai.txt?
A plain-text file declaring what AI systems may do with your content: training, indexing, citation, or summarization. Works alongside robots.txt but with AI-specific granularity. Not yet standardized but gaining adoption.
Do these files replace structured data (JSON-LD)?
No. Discovery files and structured data serve different purposes. JSON-LD tells search engines and AI systems what type of content a page contains (Article, FAQ, Product). Discovery files tell AI systems what your site is about overall and how they may use it. You need both.
How often should I update llms.txt?
Update it whenever you add or remove major content sections, launch new products, or change your site structure. For most sites, a monthly review is sufficient. The file should reflect the current state of your site, not a historical archive.
Does this work with any web framework?
Yes. The CLI auto-detects public/ (Next.js, React, Vue), static/ (Hugo, Gatsby), and root directories. Files are plain text and JSON, framework-agnostic. They work with any web server that serves static files.

Get started

Run a single command to generate all 13 discovery files. The CLI auto-detects your project structure and walks you through the setup interactively. Existing files are never overwritten.

$ npx ai-discovery-standards

Works with Next.js, React, Vue, Hugo, Gatsby, and any static site. No dependencies to install.

Full documentation on GitHub