Key Insight: Training vs Retrieval
Sites treat training crawlers (ClaudeBot, GPTBot) differently from retrieval agents (Claude-User, ChatGPT-User). Over 5.8 million sites block ClaudeBot, but far fewer block retrieval agents. Cloudflare now blocks AI crawlers by default for new domains. The emerging business model is Cloudflare's Pay Per Crawl ($0.001–$0.01/request); Stack Overflow, Quora, and BuzzFeed are among its early supporters.
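The training-versus-retrieval split is usually expressed per user agent in robots.txt. A minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content and the example.com URL are hypothetical, not taken from any real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block training crawlers, allow retrieval agents.
SAMPLE_ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Claude-User
Allow: /

User-agent: ChatGPT-User
Allow: /
"""


def access_by_agent(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return {agent: allowed} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


AGENTS = ["ClaudeBot", "GPTBot", "Claude-User", "ChatGPT-User"]
result = access_by_agent(SAMPLE_ROBOTS_TXT, "https://example.com/article", AGENTS)
for agent, allowed in result.items():
    print(f"{agent:13} {'allowed' if allowed else 'blocked'}")
```

robots.txt is advisory only. Network-level blocks, like the one described for Reddit, do not show up in this check.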
Access Matrix
| Platform | Access | WebFetch | Search | Read | Comments | Images | Post | AI Train | Best AI Provider |
|---|---|---|---|---|---|---|---|---|---|
| ▶ YouTube | API Only | - | - | - | - | - | - | Blocked | Google Gemini |
| 🔶 Reddit | Paid API | - | - | - | - | - | - | Blocked | OpenAI / ChatGPT |
| Hacker News | Full Access | Y | Y | Y | Y | - | - | Allowed | Any (all equal) |
| ⬡ Stack Overflow | Paid API | - | - | - | - | - | - | Allowed | OpenAI / ChatGPT |
| 𝕏 Twitter / X | Paid API | - | - | - | - | - | - | Blocked | xAI / Grok |
| 📷 Instagram | Blocked | - | - | - | - | - | - | Blocked | Meta AI |
| Facebook | Blocked | - | - | - | - | - | - | Blocked | Meta AI |
| ♪ TikTok | Blocked | - | - | - | - | - | - | Blocked | None have good access |
| 💬 Discord | Blocked | - | - | - | - | - | - | Blocked | None (invite model) |
| Wikipedia | API Only | - | Y | Y | - | Y | - | Allowed | Any (via REST API) |
| 🐘 Mastodon | Full Access | Y | Y | Y | Y | Y | - | Allowed | Any (all equal) |
| 🦋 Bluesky | Full Access | Y | Y | Y | Y | Y | - | Allowed | Any (all equal) |
| ⌥ GitHub | API Only | Y | Y | Y | Y | - | - | Allowed | GitHub Copilot / Any |
| Medium | Partial | Y | - | Y | - | - | - | Allowed | Any (all equal) |
| 📰 News Sites (NYT, WaPo, etc.) | Partial | Y | - | Y | - | - | - | Blocked | OpenAI (most deals) |
| Quora | Blocked | - | - | - | - | - | - | Blocked | None have access |
| 💭 Discourse Forums | Partial | Y | Y | Y | Y | - | - | Allowed | Any (all equal) |
Open Platforms
Full agent access, no auth needed
- Hacker News: Firebase API + Algolia Search (free, no auth required)
- 🐘 Mastodon: Mastodon API, per-instance (free)
- 🦋 Bluesky: AT Protocol API + Firehose (free, no auth for reads)
API / Partial Access
Accessible via official API or partially via WebFetch
- ▶ YouTube: YouTube Data API v3 (free, 10K units/day)
- Wikipedia: Wikimedia REST API (free, rate limited)
- ⌥ GitHub: REST API + GraphQL API (free; 60 requests/hr unauthenticated, 5K/hr authenticated)
- Medium: no official API, RSS only
- 📰 News Sites (NYT, WaPo, etc.): varies by publisher (licensing deals)
- 💭 Discourse Forums: Discourse API, per-instance (free)
Restricted / Blocked
Paid API required or fully blocked
- 🔶 Reddit: Reddit API with OAuth ($0.24/1K calls, free under 100/min)
- ⬡ Stack Overflow: OverflowAPI, commercial (paid, licensing deal)
- 𝕏 Twitter / X: X API v2 ($100/mo Basic, $5K/mo Pro)
- 📷 Instagram: Meta Graph API, restricted (free but requires app approval)
- Facebook: Meta Graph API, restricted (free but requires app approval)
- ♪ TikTok: Research API, application required (free for approved researchers)
- 💬 Discord: Discord Bot API (free, bot must be invited)
- Quora: no API
Content Licensing Deals
OpenAI
- Reddit — real-time content firehose (2024)
- Stack Overflow — OverflowAPI, 59M+ Q&A posts (May 2024)
- Associated Press — news content licensing
- Axel Springer — Politico, Business Insider, Bild
- Future Publishing — TechRadar, Tom's Guide, etc.
Google / DeepMind
- Reddit — $60M/year data licensing (early 2024)
- YouTube — native access (Google owns YouTube)
- Web index — Google Search infrastructure
xAI / Grok
- Twitter/X — de facto exclusive access via Elon's ownership
- Real-time tweet search and analysis built in
- No known third-party content deals
Anthropic / Claude
- No publicly known content licensing deals
- Relies on WebSearch/WebFetch for retrieval
- GitHub Agent HQ integration (Universe 2025)
Meta AI
- Instagram/Facebook — native access (Meta owns both)
- Trains LLaMA on own platform data
- No known external content deals
Perplexity AI
- Washington Post — revenue-sharing deal
- Controversy over aggressive scraping practices
- Multiple publisher complaints and lawsuits
Platform Details
▶ YouTube (API Only)
AI Policy
Automated access prohibited by ToS. robots.txt blocks /results, /watch, /api. Uses browser fingerprinting, WebGL fingerprinting, behavioral analysis, TLS fingerprinting, and CAPTCHAs.
Official API
YouTube Data API v3 — Free (10K units/day)
Best AI Provider
Google Gemini
Gemini has native YouTube understanding — can process video content, transcripts, and metadata directly as a Google product.
Content Deals
No known third-party AI licensing deals. YouTube Data API v3 is the only legitimate access path; the free 10K units/day works out to about 100 searches/day, since each search request costs 100 units.
Notes
Hardest major site to scrape. Datacenter IPs fail immediately. Even residential proxies get caught. MCP servers wrapping the Data API v3 are the best workaround for agents.
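The quota figures above can be turned into a small budget check. This is a sketch: the 10K daily units come from the section above, while the per-call costs (100 units per search, 1 unit per video lookup) are assumptions to check against the current quota documentation, and `searches_left` is a hypothetical helper name:

```python
DAILY_QUOTA_UNITS = 10_000   # free daily quota cited above
SEARCH_COST_UNITS = 100      # assumption: cost of one search request
VIDEO_LOOKUP_COST_UNITS = 1  # assumption: cost of one video metadata lookup


def searches_left(planned_lookups: int, quota: int = DAILY_QUOTA_UNITS) -> int:
    """Searches still affordable today after reserving units for video lookups."""
    remaining = quota - planned_lookups * VIDEO_LOOKUP_COST_UNITS
    return max(remaining, 0) // SEARCH_COST_UNITS


print(searches_left(0))      # 100
print(searches_left(2_500))  # 75
```

An agent that searches once and then looks up individual videos by ID stretches the quota much further than one that re-searches on every step.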
🔶 Reddit (Paid API)
AI Policy
Ended free API access for high-volume use in July 2023 (pricing is $0.24 per 1K calls). Blocks all major AI crawlers at the network level, not just in robots.txt. Pursues legal action against scrapers under the CFAA.
Official API
Reddit API (OAuth) — $0.24/1K calls (free under 100/min)
Best AI Provider
OpenAI / ChatGPT
OpenAI signed a content licensing deal with Reddit (2024) giving access to Reddit's real-time content firehose for training and retrieval.
Content Deals
OpenAI: content licensing deal (2024). Google: $60M/year data licensing deal (early 2024). Both get structured API access to Reddit's full corpus.
Notes
Connection-level block: WebFetch can't even reach Reddit. The API pricing change explicitly cited AI companies as the reason. The same pricing forced the Apollo client to shut down, which catalyzed the 2023 blackout protests.
Hacker News (Full Access)
AI Policy
Open by design. robots.txt allows all user agents. Only blocks interactive paths (/vote, /reply, /login). Requests 30-second crawl delay.
Official API
Firebase API + Algolia Search — Free, no auth required
Best AI Provider
Any — all equal
No restrictions. Any AI agent can access HN via WebFetch or the free APIs. Algolia provides full-text search with no auth.
Content Deals
No deals needed — content is publicly accessible. Firebase API and Algolia Search API are both free, unauthenticated, and real-time.
Notes
Gold standard for AI agent accessibility. Simple HTML, no JavaScript rendering required, two free public APIs. The Algolia API is particularly powerful for searching historical posts.
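Both Hacker News APIs are plain unauthenticated GET endpoints, so an agent only needs to build the right URL. A minimal sketch; the helper names are hypothetical, and the endpoint paths should be checked against the current API documentation:

```python
from urllib.parse import urlencode

FIREBASE_BASE = "https://hacker-news.firebaseio.com/v0"
ALGOLIA_BASE = "https://hn.algolia.com/api/v1"


def hn_item_url(item_id: int) -> str:
    """Firebase API: one story, comment, or job as JSON."""
    return f"{FIREBASE_BASE}/item/{item_id}.json"


def hn_top_stories_url() -> str:
    """Firebase API: list of IDs for the current top stories."""
    return f"{FIREBASE_BASE}/topstories.json"


def hn_search_url(query: str, tags: str = "story", hits_per_page: int = 20) -> str:
    """Algolia API: full-text search over historical posts."""
    params = {"query": query, "tags": tags, "hitsPerPage": hits_per_page}
    return f"{ALGOLIA_BASE}/search?{urlencode(params)}"


print(hn_item_url(1))
print(hn_search_url("pay per crawl"))
```

Any HTTP client can fetch these URLs. The 30-second crawl delay in robots.txt applies to crawling the HTML site.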
⬡ Stack Overflow (Paid API)
AI Policy
Initially blocked GPTBot (2023), reversed after OpenAI deal. Uses Cloudflare bot protection. Bans AI-generated answers. Content is CC BY-SA licensed but scraping is blocked in practice.
Official API
OverflowAPI (commercial) — Paid (licensing deal)
Best AI Provider
OpenAI / ChatGPT
OpenAI signed the OverflowAPI deal (May 2024) — licensed access to 59M+ Q&A posts. Copilot also has deep SO integration.
Content Deals
OpenAI: OverflowAPI partnership (May 2024), licensed 59M+ Q&A posts. Community backlash — many users deleted answers in protest.
Notes
Cloudflare blocks WebFetch. Google search results often surface SO pages that WebFetch can read via cached URLs as a workaround. Content is CC BY-SA but platform enforces access control.
𝕏 Twitter / X (Paid API)
AI Policy
Login wall since late 2023. Free API tier is write-only (500 posts/mo). ToS prohibits using API data to train foundation models. Pay-as-you-go launched Feb 2026.
Official API
X API v2 — $100/mo Basic, $5K/mo Pro
Best AI Provider
xAI / Grok
Grok has privileged access to X/Twitter data through Elon Musk's ownership of both companies. Can search and analyze tweets in real-time.
Content Deals
xAI/Grok: de facto exclusive access via shared ownership (Elon Musk). No known third-party licensing deals. X actively blocks other AI training crawlers.
Notes
Free tier is essentially useless for reading. Basic ($100/mo) is minimum for 10K tweet reads. Pro ($5K/mo) for serious usage. Enterprise starts at $42K/mo.
📷 Instagram (Blocked)
AI Policy
Requires authenticated sessions. Uses IP quality scoring, TLS fingerprinting, behavioral analysis. ToS explicitly prohibits automated data collection. Meta Graph API only exposes data from authorized pages/accounts.
Official API
Meta Graph API (restricted) — Free but requires app approval
Best AI Provider
Meta AI
Meta AI has native access to Instagram/Facebook content through Meta's infrastructure. No third-party AI has comparable access.
Content Deals
No known third-party AI licensing deals. Meta trains its own AI (LLaMA) on its platform data. Meta Graph API requires per-app approval.
Notes
Datacenter IPs blocked immediately. Even mobile proxies face aggressive detection. Jan 2026 breach exposed 17.5M records via API scraping vulnerabilities.
Facebook (Blocked)
AI Policy
Same infrastructure as Instagram (Meta). Logged-out access to most content removed years ago. Meta Graph API heavily permissioned — only authorized apps can read.
Official API
Meta Graph API (restricted) — Free but requires app approval
Best AI Provider
Meta AI
Meta AI has native platform access. No third-party AI provider has meaningful Facebook content access.
Content Deals
No third-party deals. Meta uses its own platform data for LLaMA training. Cambridge Analytica scandal (2018) led to severe lockdown of API access.
Notes
Public page content is technically crawlable but Meta blocks it in practice. The platform is effectively a walled garden.
♪ TikTok (Blocked)
AI Policy
Research API requires application + institutional affiliation. 1,000 requests/day, up to 100K records/day for approved researchers. Uses encrypted headers, behavioral detection, real-time fraud scoring.
Official API
Research API (apply required) — Free for approved researchers
Best AI Provider
None have good access
No known AI provider has licensed TikTok content. The Research API is the only path and requires manual approval.
Content Deals
No known AI content licensing deals. TikTok's Research API is the only legitimate access path. 1 in 8 videos returns no metadata even through the Research API.
Notes
Even the Research API has significant gaps. Major accounts (Taylor Swift, news outlets) often return no data. ByteDance has not signed deals with Western AI companies.
💬 Discord (Blocked)
AI Policy
No publicly readable web pages — everything requires login. Bot API requires server admin invitation. ToS explicitly prohibits 'any robot, spider, crawler, scraper.' Developer Policy bans data mining.
Official API
Discord Bot API — Free (bot must be invited)
Best AI Provider
None — invite model
Access requires a bot invited to specific servers. No AI provider has broad Discord access. Must be set up per-server.
Content Deals
No content licensing deals. Invite-only model means no broad access is possible. Researchers scraped 2B+ messages in May 2025 via bots (against ToS).
Notes
Fundamentally different from other platforms — invite-based access model. Even with a valid bot, you can only read servers where the bot has been explicitly added by an admin.
Wikipedia (API Only)
AI Policy
Content is CC BY-SA (open license). Philosophically open but practically strained — AI crawlers account for 65% of resource-intensive traffic. Wikimedia urges AI companies to use Wikimedia Enterprise (paid) or the free REST API.
Official API
Wikimedia REST API — Free (rate limited)
Best AI Provider
Any — via REST API
The free Wikimedia REST API works for all agents. Example: en.wikipedia.org/api/rest_v1/page/summary/{title}. No auth required for basic access.
Content Deals
Wikimedia Enterprise is the paid commercial tier for heavy usage. Free REST API available for lighter use. Content is open-licensed (CC BY-SA) so training is legally permissible.
Notes
WebFetch returns 403 on Wikimedia domains (confirmed Claude Code issue #22846). The REST API is the correct path — clean JSON, no auth needed. Bandwidth for multimedia grew 50% since Jan 2024 due to AI scraping.
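A sketch of building a page-summary request with the standard library. It assumes the English Wikipedia host and a descriptive User-Agent with contact details; the agent name and the `summary_request` helper are hypothetical:

```python
from urllib.parse import quote
from urllib.request import Request


def summary_request(title: str, contact: str) -> Request:
    """Build (but do not send) a page-summary request for English Wikipedia."""
    # Page titles use underscores for spaces; quote() escapes the rest, including "/".
    path_title = quote(title.replace(" ", "_"), safe="")
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{path_title}"
    headers = {
        # Hypothetical agent name; include real contact details in practice.
        "User-Agent": f"example-research-agent/0.1 ({contact})",
        "Accept": "application/json",
    }
    return Request(url, headers=headers)


req = summary_request("Hacker News", "ops@example.com")
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the call. A generic or missing User-Agent is one plausible cause of 403 responses, though that is not confirmed for the WebFetch issue noted above.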
🐘 Mastodon (Full Access)
AI Policy
Decentralized — each instance sets its own policy. Most instances expose public timelines and posts without authentication. API (/api/v1/) allows reading public content without auth on most instances.
Official API
Mastodon API (per-instance) — Free, per-instance
Best AI Provider
Any — all equal
Open federation protocol (ActivityPub) means any agent can access public content. No special deals needed.
Content Deals
No licensing deals — open by design. ActivityPub federation protocol. Individual instance admins can restrict access but most don't.
Notes
Accessibility varies by instance. The decentralized nature means there's no single entity to sign deals with. Generally AI-friendly for public content.
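Because access is per-instance, the instance hostname is the main parameter. A minimal sketch of building a public-timeline URL; `mastodon.example` is a placeholder host and `public_timeline_url` is a hypothetical helper:

```python
from urllib.parse import urlencode


def public_timeline_url(instance: str, limit: int = 20, local_only: bool = True) -> str:
    """URL for an instance's public timeline (read-only, often no auth needed)."""
    params: dict[str, object] = {"limit": limit}
    if local_only:
        params["local"] = "true"  # only posts that originate on this instance
    return f"https://{instance}/api/v1/timelines/public?{urlencode(params)}"


print(public_timeline_url("mastodon.example", limit=5))
```

Some instances disable unauthenticated timeline reads, so an agent should treat a 401 or 403 as that instance's policy and not as a transient error.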
🦋 Bluesky (Full Access)
AI Policy
Most AI-friendly major social platform. Public posts readable without auth. Firehose (real-time stream of ALL public posts) accessible via WebSocket, no auth required. Jetstream provides simplified firehose. Explicit bot-building documentation.
Official API
AT Protocol API + Firehose — Free, no auth for reads
Best AI Provider
Any — all equal
The AT Protocol is fully open. Firehose provides real-time access to all public posts. Any AI agent can access everything with zero authentication.
Content Deals
No deals needed — the AT Protocol is open by design. The Firehose is the most remarkable open data stream from any major social platform.
Notes
Gold standard for social media AI accessibility alongside Hacker News. No login wall, no API keys for reads, real-time firehose, explicit support for bots and agents.
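Unauthenticated reads go through two kinds of endpoint: HTTP queries against a public AppView and a WebSocket stream. A sketch that only builds the URLs; the two hostnames are assumptions about the public instances and should be checked against current Bluesky documentation:

```python
from urllib.parse import urlencode

# Assumed public hosts; verify before relying on them.
APPVIEW_BASE = "https://public.api.bsky.app/xrpc"
JETSTREAM_BASE = "wss://jetstream2.us-east.bsky.network/subscribe"


def author_feed_url(actor: str, limit: int = 10) -> str:
    """HTTP GET URL for an account's recent public posts."""
    params = {"actor": actor, "limit": limit}
    return f"{APPVIEW_BASE}/app.bsky.feed.getAuthorFeed?{urlencode(params)}"


def jetstream_posts_url() -> str:
    """WebSocket URL for a Jetstream subscription limited to post records."""
    params = {"wantedCollections": "app.bsky.feed.post"}
    return f"{JETSTREAM_BASE}?{urlencode(params)}"


print(author_feed_url("bsky.app", limit=5))
print(jetstream_posts_url())
```

The stream URL needs a WebSocket client, which the Python standard library does not include. Filtering by collection keeps the volume manageable compared with the full firehose.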
⌥ GitHub (API Only)
AI Policy
robots.txt blocks most web paths (tree views, raw files, blame). But GPTBot and ClaudeBot are NOT specifically blocked. The REST/GraphQL API is the intended automated access path. 'Agentic Workflows' launched Feb 2026.
Official API
REST API + GraphQL API — Free (60/hr unauth, 5K/hr auth)
Best AI Provider
GitHub Copilot / Any
GitHub Copilot has native repo access. For Claude Code, the gh CLI tool provides excellent authenticated API access. 5,000 requests/hour.
Content Deals
GitHub Copilot (Microsoft/OpenAI) has deep native integration. GitHub launched 'Agent HQ' (Universe 2025) with agents from Anthropic, OpenAI, Google. Public API is free and generous.
Notes
For Claude Code specifically, the gh CLI is the best path — already integrated. Agentic Workflows (Feb 2026) enable AI agents to run within GitHub Actions with repo access.
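Outside the gh CLI, the same access is a plain REST call, and the rate limits above translate directly into a pacing rule. A minimal sketch; `repo_request` and `min_seconds_between_calls` are hypothetical helpers, and the API version header value is an assumption to verify:

```python
from typing import Optional
from urllib.request import Request

# Hourly limits cited in the section above.
RATE_LIMITS_PER_HOUR = {"unauthenticated": 60, "authenticated": 5_000}


def repo_request(owner: str, repo: str, token: Optional[str] = None) -> Request:
    """Build (but do not send) a REST request for repository metadata."""
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",  # assumed version string
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return Request(f"https://api.github.com/repos/{owner}/{repo}", headers=headers)


def min_seconds_between_calls(authenticated: bool) -> float:
    """Average spacing that keeps a steady stream of calls under the hourly limit."""
    key = "authenticated" if authenticated else "unauthenticated"
    return 3600 / RATE_LIMITS_PER_HOUR[key]


print(min_seconds_between_calls(False))  # 60.0
print(min_seconds_between_calls(True))   # 0.72
```

At one call per minute, unauthenticated access is only practical for occasional lookups, which is why authenticated access is the sensible default for agents.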
Medium (Partial)
AI Policy
Metered paywall — public posts accessible, member-only posts blocked. No specific AI crawler blocking beyond the paywall. No official API for reading content.
Official API
None (RSS only) — N/A
Best AI Provider
Any — all equal
Public posts are readable by any agent via WebFetch. Member-only content is behind the paywall for everyone.
Content Deals
No known AI content licensing deals. Medium's business model is reader subscriptions, not AI licensing.
Notes
One of the more accessible publishing platforms for agents. Public posts work with WebFetch. The metered paywall is the only barrier.
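With no read API, RSS is the structured option. A sketch of building a profile feed URL and parsing items with the standard library; the feed URL pattern is an assumption to verify, and the sample XML is invented for illustration:

```python
import xml.etree.ElementTree as ET


def medium_feed_url(handle: str) -> str:
    """RSS feed URL for a Medium profile (assumed pattern: /feed/@handle)."""
    return f"https://medium.com/feed/@{handle.lstrip('@')}"


def parse_feed_items(rss_xml: str) -> list[dict[str, str]]:
    """Extract title and link from each <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


# Invented sample feed, shaped like RSS 2.0.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example</title>
<item><title>First post</title><link>https://medium.com/@someone/first-post</link></item>
</channel></rss>"""

print(medium_feed_url("someone"))
print(parse_feed_items(SAMPLE_RSS))
```

For feeds from untrusted sources, a hardened XML parser is safer than `xml.etree.ElementTree`.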
📰 News Sites (NYT, WaPo, etc.) (Partial)
AI Policy
48% of major news sites block AI training crawlers in robots.txt. NYT sued OpenAI. Financial Times blocks GPTBot, ClaudeBot, Google-Extended explicitly. Cloudflare 'Pay Per Crawl' ($0.001–$0.01/request) is the emerging model.
Official API
Varies (licensing deals) — Varies by publisher
Best AI Provider
OpenAI (most deals)
OpenAI has licensing deals with AP, Axel Springer, Future Publishing, and others. Perplexity has a deal with Washington Post (revenue-sharing).
Content Deals
OpenAI: AP, Axel Springer, Future Publishing. Perplexity: Washington Post (revenue-sharing). NYT, PCMag, Mashable: suing OpenAI. Internet Archive being blocked as 'AI back door.'
Notes
Public (non-paywalled) articles generally readable via WebFetch. Training crawlers blocked on ~48% of sites. TollBit blocked 26M scrapes in March 2025. A 2025 CJR study found AI browsers can bypass paywalls.
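To get a feel for what the Pay Per Crawl price range means at scale, here is a rough cost estimate. It uses only the $0.001 to $0.01 per-request range quoted above; the request volume is an arbitrary illustration:

```python
from decimal import Decimal

PRICE_LOW = Decimal("0.001")  # low end of the quoted per-request range (USD)
PRICE_HIGH = Decimal("0.01")  # high end of the quoted per-request range (USD)


def crawl_cost_range(requests: int) -> tuple[Decimal, Decimal]:
    """Return (lowest, highest) total cost in USD for a number of paid requests."""
    return (requests * PRICE_LOW, requests * PRICE_HIGH)


low, high = crawl_cost_range(1_000_000)
print(f"${low:,.2f} to ${high:,.2f}")  # $1,000.00 to $10,000.00
```

`Decimal` avoids the rounding drift that binary floats introduce when multiplying small per-request prices by large counts.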
Quora (Blocked)
AI Policy
Uses Cloudflare protection + login wall. Supports Cloudflare's 'Pay Per Crawl' initiative. Operates Poe (AI assistant platform) — wants payment for content used in AI training/retrieval.
Official API
None — N/A
Best AI Provider
None have access
No known AI provider has licensed Quora content. Quora wants to monetize via Pay Per Crawl rather than direct deals.
Content Deals
No content licensing deals. Quora is a named supporter of Cloudflare's Pay Per Crawl initiative alongside Stack Overflow, BuzzFeed, Fortune.
Notes
Ironic position — Quora runs Poe (an AI platform) but blocks AI agents from accessing Quora content. They want to be paid for content, not give it away.
💭 Discourse Forums (Partial)
AI Policy
Self-hosted open-source platform — each instance sets its own policy. Most public Discourse forums are accessible. Some use Cloudflare or other bot protection. The Discourse API provides structured access.
Official API
Discourse API (per-instance) — Free (per-instance)
Best AI Provider
Any — all equal
Public Discourse instances are generally accessible to all agents. API access varies by instance configuration.
Content Deals
No centralized deals possible — Discourse is open-source and self-hosted. Access depends entirely on individual forum administrators.
Notes
Many tech communities run on Discourse (e.g., Meta community forums, Rust users, etc.). Generally more accessible than Reddit for AI agents. Public topics usually readable via WebFetch.
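Public Discourse forums typically serve JSON when `.json` is appended to a route. A minimal sketch under that assumption; `forum.example.org` is a placeholder host, the helper names are hypothetical, and the sample payload is invented but shaped like a latest-topics response:

```python
def latest_topics_url(base_url: str) -> str:
    """JSON listing of the most recently active topics."""
    return f"{base_url.rstrip('/')}/latest.json"


def topic_url(base_url: str, topic_id: int) -> str:
    """JSON for one topic, including its first batch of posts."""
    return f"{base_url.rstrip('/')}/t/{topic_id}.json"


def topic_titles(latest_payload: dict) -> list[str]:
    """Pull topic titles out of a latest-topics payload."""
    topics = latest_payload.get("topic_list", {}).get("topics", [])
    return [topic["title"] for topic in topics]


# Invented payload for illustration only.
SAMPLE_PAYLOAD = {"topic_list": {"topics": [{"id": 42, "title": "Welcome"}]}}

print(latest_topics_url("https://forum.example.org/"))
print(topic_titles(SAMPLE_PAYLOAD))
```

Since each forum is configured independently, an agent should expect some instances to require an API key or to sit behind bot protection.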
Claude Code Researcher Agents
Four specialized research agents are available as subagents. Each uses different models, strategies, and trade-offs between speed and depth.
| Agent | Model | Speed | Depth | Best For |
|---|---|---|---|---|
| researcher | Claude (inherited) | Medium | Medium-Deep | Targeted research, fact-checking, combining web + local info |
| gemini-researcher | Google Gemini | Slow (5-60+ min) | Very Deep | Comprehensive research, YouTube content, Google ecosystem, multi-angle analysis |
| perplexity-researcher | Perplexity AI | Fast | Medium | Quick factual research, current events, finding specific data points |
| claude-researcher | Claude (with WebSearch) | Medium | Deep | Research requiring strong reasoning, connecting information across domains |
researcher
General-purpose web research with WebSearch + WebFetch. Single-threaded, sequential approach.
Strengths
- Most flexible: handles any research task
- Can combine web search with file reads and code execution
- Good at synthesizing information from multiple sources
- Can write files and create deliverables from research
Weaknesses
- Single-threaded: searches one query at a time
- No built-in query decomposition
- Can go down rabbit holes without structure
- Slower on broad topics
gemini-researcher
Breaks queries into 3-10 variations, launches parallel Gemini sub-agents. Multi-perspective deep investigation.
Strengths
- Parallel execution: 3-10 sub-agents simultaneously
- Best YouTube/Google ecosystem access (Google product)
- Multi-perspective decomposition catches angles you'd miss
- Deep investigation with high token budgets per sub-agent
Weaknesses
- Slow: can take 5-60+ minutes for complex queries
- High token usage (100K+ tokens typical)
- Overkill for simple questions
- Sub-agents can overlap and duplicate work
perplexity-researcher
Leverages Perplexity's search-optimized AI for web research. Built for finding answers.
Strengths
- Search-native: built specifically for web research
- Good at finding current/recent information
- Provides source citations naturally
- Faster than Gemini for most queries
Weaknesses
- Less control over search strategy
- Can't decompose into parallel sub-queries like Gemini
- May hit rate limits on heavy usage
- Less capable at multi-step reasoning
claude-researcher
Multi-query decomposition with parallel search execution using Claude's built-in WebSearch.
Strengths
- Intelligent query decomposition
- Parallel search execution
- Strong reasoning about search results
- Good at connecting dots across sources
Weaknesses
- Limited by Claude's WebSearch tool capabilities
- Same WebFetch limitations as other Claude agents
- No special access to walled-garden platforms
- May be slower than Perplexity for simple queries
Which Researcher Should I Use?
Need a quick answer? Use perplexity-researcher: fastest for simple factual lookups.
Need YouTube data or Google ecosystem info? Use gemini-researcher: the only agent with native Google/YouTube understanding.
Need comprehensive multi-angle research? Use gemini-researcher: parallel sub-agents cover more ground, but budget 5-60+ minutes.
Need research + reasoning + code? Use claude-researcher: best at connecting dots and synthesizing across domains.
Need research combined with file edits? Use researcher: general-purpose, can write deliverables from findings.
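The guidance above can be read as a small decision function. A sketch: the agent names come from the table, while the `Task` fields and the priority order when several needs apply are assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical description of what a research task needs."""
    needs_youtube_or_google: bool = False
    needs_multi_angle_depth: bool = False
    needs_file_edits: bool = False
    needs_reasoning: bool = False


def pick_researcher(task: Task) -> str:
    """Map task needs to an agent name; earlier checks win (assumed priority)."""
    if task.needs_youtube_or_google or task.needs_multi_angle_depth:
        return "gemini-researcher"
    if task.needs_file_edits:
        return "researcher"
    if task.needs_reasoning:
        return "claude-researcher"
    return "perplexity-researcher"  # default: quick factual lookups


print(pick_researcher(Task()))
print(pick_researcher(Task(needs_youtube_or_google=True)))
```

Putting the Gemini branch first reflects that it is the only option with native YouTube understanding. A time budget could reasonably override it for simple questions.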
The YouTube Lesson (What Triggered This Dashboard)
YouTube is the single hardest major website for AI agents to access. It uses browser fingerprinting, WebGL fingerprinting, behavioral analysis, TLS fingerprinting, CAPTCHAs, and frequent HTML structure changes. Its robots.txt blocks all useful paths (/results, /watch, /api). Datacenter IPs fail immediately, and even residential proxies get caught.
When Claude Code tried to find YouTube video IDs for the agentic-marketing course, it failed across 5 different approaches over 2+ hours because no web scraping approach works on YouTube.
The fix: Use the YouTube Data API v3 (free, 10K units/day, ~100 searches/day), set up an MCP server wrapping the API, or use gemini-researcher which has native Google/YouTube understanding. For transcripts, yt-dlp is the most reliable CLI tool.
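A sketch of the API-based fix: build a Data API v3 search URL and pull video IDs out of the response. The API key and sample response are placeholders, and the helper names are hypothetical; the sample's shape is an assumption based on the search response format:

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def search_url(query: str, api_key: str, max_results: int = 5) -> str:
    """Build a video search URL; each call is expected to consume search quota."""
    params = {
        "part": "snippet",
        "type": "video",
        "q": query,
        "maxResults": max_results,
        "key": api_key,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def video_ids(search_response: dict) -> list[str]:
    """Collect video IDs, skipping channel or playlist results."""
    return [
        item["id"]["videoId"]
        for item in search_response.get("items", [])
        if "videoId" in item.get("id", {})
    ]


# Placeholder response for illustration only.
SAMPLE_RESPONSE = {
    "items": [
        {"id": {"kind": "youtube#video", "videoId": "VIDEO_ID_1"}},
        {"id": {"kind": "youtube#channel", "channelId": "CHANNEL_ID"}},
    ]
}

print(search_url("agentic marketing", "YOUR_API_KEY"))
print(video_ids(SAMPLE_RESPONSE))
```

Caching the IDs returned by each search avoids spending the limited daily quota on repeated queries.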