AI Agent Web Access Dashboard

Key Insight: Training vs Retrieval

Sites treat training crawlers (ClaudeBot, GPTBot) differently from retrieval agents (Claude-User, ChatGPT-User). Over 5.8 million sites block ClaudeBot, but far fewer block retrieval agents. Cloudflare now blocks AI crawlers by default for new domains. The emerging business model is Pay Per Crawl ($0.001–$0.01/request), a Cloudflare program whose early supporters include Stack Overflow, Quora, and BuzzFeed.
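The training/retrieval split usually shows up first in a site's robots.txt. Below is a minimal sketch of checking which user agents may fetch a URL, using Python's standard urllib.robotparser. The SAMPLE_ROBOTS_TXT content and the example.com URL are invented for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks two training crawlers, allows everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user agent to whether robots.txt permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    SAMPLE_ROBOTS_TXT,
    ["ClaudeBot", "GPTBot", "Claude-User", "ChatGPT-User"],
    "https://example.com/article",
)
```

robots.txt is advisory only. Sites that block at the network level can refuse a request that robots.txt would permit, so a True result here does not guarantee access.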

Access Matrix

Platform	Access	WebFetch	Search	Read	Comments	Images	Post	AI Train	Best AI Provider
▶ YouTube	API Only	-	-	-	-	-	-	Blocked	Google Gemini
🔶 Reddit	Paid API	-	-	-	-	-	-	Blocked	OpenAI / ChatGPT
Y Hacker News	Full Access	Y	Y	Y	Y	-	-	Allowed	Any — all equal
⬡ Stack Overflow	Paid API	-	-	-	-	-	-	Allowed	OpenAI / ChatGPT
𝕏 Twitter / X	Paid API	-	-	-	-	-	-	Blocked	xAI / Grok
📷 Instagram	Blocked	-	-	-	-	-	-	Blocked	Meta AI
f Facebook	Blocked	-	-	-	-	-	-	Blocked	Meta AI
♪ TikTok	Blocked	-	-	-	-	-	-	Blocked	None have good access
💬 Discord	Blocked	-	-	-	-	-	-	Blocked	None — invite model
W Wikipedia	API Only	-	Y	Y	-	Y	-	Allowed	Any — via REST API
🐘 Mastodon	Full Access	Y	Y	Y	Y	Y	-	Allowed	Any — all equal
🦋 Bluesky	Full Access	Y	Y	Y	Y	Y	-	Allowed	Any — all equal
⌥ GitHub	API Only	Y	Y	Y	Y	-	-	Allowed	GitHub Copilot / Any
M Medium	Partial	Y	-	Y	-	-	-	Allowed	Any — all equal
📰 News Sites (NYT, WaPo, etc.)	Partial	Y	-	Y	-	-	-	Blocked	OpenAI (most deals)
Q Quora	Blocked	-	-	-	-	-	-	Blocked	None have access
💭 Discourse Forums	Partial	Y	Y	Y	Y	-	-	Allowed	Any — all equal

Open Platforms

Full agent access, no auth needed

Y Hacker News
Firebase API + Algolia Search — Free, no auth required
🐘 Mastodon
Mastodon API (per-instance) — Free, per-instance
🦋 Bluesky
AT Protocol API + Firehose — Free, no auth for reads

API / Partial Access

Accessible via official API or partially via WebFetch

▶ YouTube
YouTube Data API v3 — Free (10K units/day)
W Wikipedia
Wikimedia REST API — Free (rate limited)
⌥ GitHub
REST API + GraphQL API — Free (60/hr unauth, 5K/hr auth)
M Medium
None (RSS only) — N/A
📰 News Sites (NYT, WaPo, etc.)
Varies (licensing deals) — Varies by publisher
💭 Discourse Forums
Discourse API (per-instance) — Free (per-instance)

Restricted / Blocked

Paid API required or fully blocked

🔶 Reddit
Reddit API (OAuth) — $0.24/1K calls (free under 100/min)
⬡ Stack Overflow
OverflowAPI (commercial) — Paid (licensing deal)
𝕏 Twitter / X
X API v2 — $100/mo Basic, $5K/mo Pro
📷 Instagram
Meta Graph API (restricted) — Free but requires app approval
f Facebook
Meta Graph API (restricted) — Free but requires app approval
♪ TikTok
Research API (apply required) — Free for approved researchers
💬 Discord
Discord Bot API — Free (bot must be invited)
Q Quora
None — N/A

Content Licensing Deals

OpenAI

Reddit — real-time content firehose (2024)
Stack Overflow — OverflowAPI, 59M+ Q&A posts (May 2024)
Associated Press — news content licensing
Axel Springer — Politico, Business Insider, Bild
Future Publishing — TechRadar, Tom's Guide, etc.

Google / DeepMind

Reddit — $60M/year data licensing (early 2024)
YouTube — native access (Google owns YouTube)
Web index — Google Search infrastructure

xAI / Grok

Twitter/X — de facto exclusive access via Elon's ownership
Real-time tweet search and analysis built in
No known third-party content deals

Anthropic / Claude

No publicly known content licensing deals
Relies on WebSearch/WebFetch for retrieval
GitHub Agent HQ integration (Universe 2025)

Meta AI

Instagram/Facebook — native access (Meta owns both)
Trains LLaMA on own platform data
No known external content deals

Perplexity AI

Washington Post — revenue-sharing deal
Controversy over aggressive scraping practices
Multiple publisher complaints and lawsuits

Platform Details

▶ YouTube: API Only

AI Policy

Automated access prohibited by ToS. robots.txt blocks /results, /watch, /api. Uses browser fingerprinting, WebGL fingerprinting, behavioral analysis, TLS fingerprinting, and CAPTCHAs.

Official API

YouTube Data API v3 — Free (10K units/day)

Best AI Provider

Google Gemini

Gemini has native YouTube understanding — can process video content, transcripts, and metadata directly as a Google product.

Content Deals

No known third-party AI licensing deals. YouTube Data API v3 is the only legitimate access path (about 100 searches/day on the free 10K-unit quota).

Notes

Hardest major site to scrape. Datacenter IPs fail immediately. Even residential proxies get caught. MCP servers wrapping the Data API v3 are the best workaround for agents.
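The "100 searches/day" figure follows from the quota arithmetic: a search.list call costs 100 units against the 10K-unit daily quota. The sketch below shows that calculation and how a search URL for the Data API v3 could be built. The api_key argument stands in for a key you would create yourself, and the helper names are illustrative.

```python
from urllib.parse import urlencode

DAILY_QUOTA_UNITS = 10_000  # free daily quota quoted above
SEARCH_COST_UNITS = 100     # quota cost of one search.list call


def searches_per_day(quota: int = DAILY_QUOTA_UNITS, cost: int = SEARCH_COST_UNITS) -> int:
    """Number of search.list calls that fit in one day's quota."""
    return quota // cost


def build_search_url(query: str, api_key: str, max_results: int = 5) -> str:
    """Build a search.list URL that returns video results for a query."""
    params = {
        "part": "snippet",
        "q": query,
        "type": "video",
        "maxResults": max_results,
        "key": api_key,
    }
    return "https://www.googleapis.com/youtube/v3/search?" + urlencode(params)
```

Cheaper calls draw from the same quota, so an agent that needs many lookups may want to cache search results rather than repeat queries.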

🔶 Reddit: Paid API

AI Policy

Ended free API access for high-volume use in July 2023 ($0.24/1K calls above the free tier). Blocks all major AI crawlers at network level, not just robots.txt. Pursues legal action via CFAA against scrapers.

Official API

Reddit API (OAuth) — $0.24/1K calls (free under 100/min)

Best AI Provider

OpenAI / ChatGPT

OpenAI signed a content licensing deal with Reddit (2024) giving access to Reddit's real-time content firehose for training and retrieval.

Content Deals

OpenAI: content licensing deal (2024). Google: $60M/year data licensing deal (early 2024). Both get structured API access to Reddit's full corpus.

Notes

Connection-level block — WebFetch can't even reach Reddit. The API pricing change explicitly cited AI companies as the reason. Apollo shutdown was the catalyst for the 2023 blackout protests.

Y Hacker News: Full Access

AI Policy

Open by design. robots.txt allows all user agents. Only blocks interactive paths (/vote, /reply, /login). Requests 30-second crawl delay.

Official API

Firebase API + Algolia Search — Free, no auth required

Best AI Provider

Any — all equal

No restrictions. Any AI agent can access HN via WebFetch or the free APIs. Algolia provides full-text search with no auth.

Content Deals

No deals needed — content is publicly accessible. Firebase API and Algolia Search API are both free, unauthenticated, and real-time.

Notes

Gold standard for AI agent accessibility. Simple HTML, no JavaScript rendering required, two free public APIs. The Algolia API is particularly powerful for searching historical posts.
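A minimal sketch of how an agent might address the two public APIs: the Firebase API for fetching an item by ID and the Algolia API for full-text search. The helper names are illustrative, and fetch_json is a plain standard-library fetch with no retry or rate-limit handling.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

FIREBASE_BASE = "https://hacker-news.firebaseio.com/v0"
ALGOLIA_BASE = "https://hn.algolia.com/api/v1"


def item_url(item_id: int) -> str:
    """URL of a single story or comment in the Firebase API."""
    return f"{FIREBASE_BASE}/item/{item_id}.json"


def search_url(query: str, tags: str = "story") -> str:
    """URL of a full-text Algolia search, restricted to stories by default."""
    return f"{ALGOLIA_BASE}/search?" + urlencode({"query": query, "tags": tags})


def fetch_json(url: str, timeout: float = 10.0):
    """Fetch a URL and decode its JSON body (makes a network call)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, fetch_json(search_url("robots.txt")) would return the Algolia response as a dictionary, with matching stories under its "hits" key.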

⬡ Stack Overflow: Paid API

AI Policy

Initially blocked GPTBot (2023), reversed after OpenAI deal. Uses Cloudflare bot protection. Bans AI-generated answers. Content is CC BY-SA licensed but scraping is blocked in practice.

Official API

OverflowAPI (commercial) — Paid (licensing deal)

Best AI Provider

OpenAI / ChatGPT

OpenAI signed the OverflowAPI deal (May 2024) — licensed access to 59M+ Q&A posts. Copilot also has deep SO integration.

Content Deals

OpenAI: OverflowAPI partnership (May 2024), licensed 59M+ Q&A posts. Community backlash — many users deleted answers in protest.

Notes

Cloudflare blocks WebFetch. Google search results often surface SO pages that WebFetch can read via cached URLs as a workaround. Content is CC BY-SA but platform enforces access control.

𝕏 Twitter / X: Paid API

AI Policy

Login wall since late 2023. Free API tier is write-only (500 posts/mo). ToS prohibits using API data to train foundation models. Pay-as-you-go launched Feb 2026.

Official API

X API v2 — $100/mo Basic, $5K/mo Pro

Best AI Provider

xAI / Grok

Grok has privileged access to X/Twitter data through Elon Musk's ownership of both companies. Can search and analyze tweets in real-time.

Content Deals

xAI/Grok: de facto exclusive access via shared ownership (Elon Musk). No known third-party licensing deals. X actively blocks other AI training crawlers.

Notes

Free tier is essentially useless for reading. Basic ($100/mo) is minimum for 10K tweet reads. Pro ($5K/mo) for serious usage. Enterprise starts at $42K/mo.

📷 Instagram: Blocked

AI Policy

Requires authenticated sessions. Uses IP quality scoring, TLS fingerprinting, behavioral analysis. ToS explicitly prohibits automated data collection. Meta Graph API only exposes data from authorized pages/accounts.

Official API

Meta Graph API (restricted) — Free but requires app approval

Best AI Provider

Meta AI

Meta AI has native access to Instagram/Facebook content through Meta's infrastructure. No third-party AI has comparable access.

Content Deals

No known third-party AI licensing deals. Meta trains its own AI (LLaMA) on its platform data. Meta Graph API requires per-app approval.

Notes

Datacenter IPs blocked immediately. Even mobile proxies face aggressive detection. Jan 2026 breach exposed 17.5M records via API scraping vulnerabilities.

f Facebook: Blocked

AI Policy

Same infrastructure as Instagram (Meta). Logged-out access to most content removed years ago. Meta Graph API heavily permissioned — only authorized apps can read.

Official API

Meta Graph API (restricted) — Free but requires app approval

Best AI Provider

Meta AI

Meta AI has native platform access. No third-party AI provider has meaningful Facebook content access.

Content Deals

No third-party deals. Meta uses its own platform data for LLaMA training. Cambridge Analytica scandal (2018) led to severe lockdown of API access.

Notes

Public page content is technically crawlable but Meta blocks it in practice. The platform is effectively a walled garden.

♪ TikTok: Blocked

AI Policy

Research API requires application + institutional affiliation. 1,000 requests/day, up to 100K records/day for approved researchers. Uses encrypted headers, behavioral detection, real-time fraud scoring.

Official API

Research API (apply required) — Free for approved researchers

Best AI Provider

None have good access

No known AI provider has licensed TikTok content. The Research API is the only path and requires manual approval.

Content Deals

No known AI content licensing deals. TikTok's Research API is the only legitimate access path. 1 in 8 videos returns no metadata even through the Research API.

Notes

Even the Research API has significant gaps. Major accounts (Taylor Swift, news outlets) often return no data. ByteDance has not signed deals with Western AI companies.

💬 Discord: Blocked

AI Policy

No publicly readable web pages — everything requires login. Bot API requires server admin invitation. ToS explicitly prohibits 'any robot, spider, crawler, scraper.' Developer Policy bans data mining.

Official API

Discord Bot API — Free (bot must be invited)

Best AI Provider

None — invite model

Access requires a bot invited to specific servers. No AI provider has broad Discord access. Must be set up per-server.

Content Deals

No content licensing deals. Invite-only model means no broad access is possible. Researchers scraped 2B+ messages in May 2025 via bots (against ToS).

Notes

Fundamentally different from other platforms — invite-based access model. Even with a valid bot, you can only read servers where the bot has been explicitly added by an admin.

W Wikipedia: API Only

AI Policy

Content is CC BY-SA (open license). Philosophically open but practically strained — AI crawlers account for 65% of resource-intensive traffic. Wikimedia urges AI companies to use Wikimedia Enterprise (paid) or the free REST API.

Official API

Wikimedia REST API — Free (rate limited)

Best AI Provider

Any — via REST API

The free Wikimedia REST API works for all agents. Example: en.wikipedia.org/api/rest_v1/page/summary/{title}. No auth required for basic access.

Content Deals

Wikimedia Enterprise is the paid commercial tier for heavy usage. Free REST API available for lighter use. Content is open-licensed (CC BY-SA) so training is legally permissible.

Notes

WebFetch returns 403 on Wikimedia domains (confirmed Claude Code issue #22846). The REST API is the correct path — clean JSON, no auth needed. Bandwidth for multimedia grew 50% since Jan 2024 due to AI scraping.
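A minimal sketch of the REST API path, assuming the page summary endpoint served from the per-language Wikipedia host (en.wikipedia.org here). The default user_agent string is a placeholder; a real agent should send its own descriptive identifier.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def summary_url(title: str, lang: str = "en") -> str:
    """URL of the page summary endpoint for a title."""
    # Titles use underscores for spaces; everything else is percent-encoded.
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{encoded}"


def fetch_summary(title: str, user_agent: str = "example-agent/0.1 (contact@example.com)") -> str:
    """Return the plain-text extract for a page (makes a network call)."""
    request = Request(summary_url(title), headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return json.load(response)["extract"]
```

The response is JSON, so no HTML parsing is needed; the "extract" field holds the plain-text lead summary.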

🐘 Mastodon: Full Access

AI Policy

Decentralized — each instance sets its own policy. Most instances expose public timelines and posts without authentication. API (/api/v1/) allows reading public content without auth on most instances.

Official API

Mastodon API (per-instance) — Free, per-instance

Best AI Provider

Any — all equal

Open federation protocol (ActivityPub) means any agent can access public content. No special deals needed.

Content Deals

No licensing deals — open by design. ActivityPub federation protocol. Individual instance admins can restrict access but most don't.

Notes

Accessibility varies by instance. The decentralized nature means there's no single entity to sign deals with. Generally AI-friendly for public content.
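Because access is per-instance, an agent has to build its request against a specific host. Here is a minimal sketch of constructing a public timeline URL under /api/v1/. The host mastodon.example is a placeholder, and some instances may require authentication for this endpoint.

```python
from urllib.parse import urlencode


def public_timeline_url(instance: str, limit: int = 20, local_only: bool = False) -> str:
    """URL of an instance's public timeline."""
    params: dict[str, object] = {"limit": limit}
    if local_only:
        # Restrict results to posts that originate on this instance.
        params["local"] = "true"
    return f"https://{instance}/api/v1/timelines/public?" + urlencode(params)
```

The same function works for any instance host, which matches the federation model: there is no central endpoint, only per-instance ones.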

🦋 Bluesky: Full Access

AI Policy

Most AI-friendly major social platform. Public posts readable without auth. Firehose (real-time stream of ALL public posts) accessible via WebSocket, no auth required. Jetstream provides simplified firehose. Explicit bot-building documentation.

Official API

AT Protocol API + Firehose — Free, no auth for reads

Best AI Provider

Any — all equal

The AT Protocol is fully open. Firehose provides real-time access to all public posts. Any AI agent can access everything with zero authentication.

Content Deals

No deals needed — the AT Protocol is open by design. The Firehose is the most remarkable open data stream from any major social platform.

Notes

Gold standard for social media AI accessibility alongside Hacker News. No login wall, no API keys for reads, real-time firehose, explicit support for bots and agents.
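As an illustration of unauthenticated reads, the sketch below builds a request for an account's recent posts. It assumes the public AppView host public.api.bsky.app and the app.bsky.feed.getAuthorFeed method; the helper name is illustrative, and the firehose itself is not covered here.

```python
from urllib.parse import urlencode

PUBLIC_APPVIEW = "https://public.api.bsky.app/xrpc"


def author_feed_url(actor: str, limit: int = 10) -> str:
    """URL that lists an account's recent public posts (no auth header needed)."""
    query = urlencode({"actor": actor, "limit": limit})
    return f"{PUBLIC_APPVIEW}/app.bsky.feed.getAuthorFeed?{query}"
```

The actor parameter accepts a handle such as bsky.app, so an agent can go from a profile name to post data in one unauthenticated GET.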

⌥ GitHub: API Only

AI Policy

robots.txt blocks most web paths (tree views, raw files, blame). But GPTBot and ClaudeBot are NOT specifically blocked. The REST/GraphQL API is the intended automated access path. 'Agentic Workflows' launched Feb 2026.

Official API

REST API + GraphQL API — Free (60/hr unauth, 5K/hr auth)

Best AI Provider

GitHub Copilot / Any

GitHub Copilot has native repo access. For Claude Code, the gh CLI tool provides excellent authenticated API access, with a limit of 5,000 requests/hour.

Content Deals

GitHub Copilot (Microsoft/OpenAI) has deep native integration. GitHub launched 'Agent HQ' (Universe 2025) with agents from Anthropic, OpenAI, Google. Public API is free and generous.

Notes

For Claude Code specifically, the gh CLI is the best path — already integrated. Agentic Workflows (Feb 2026) enable AI agents to run within GitHub Actions with repo access.
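An agent calling the REST API directly has to stay within the 60/hr or 5K/hr budget. The sketch below reads the rate-limit headers GitHub returns on each response and decides how long to wait. The function names are illustrative, and the headers are passed in as a plain dictionary.

```python
def hourly_limit(authenticated: bool) -> int:
    """Hourly request budget quoted above for the REST API."""
    return 5000 if authenticated else 60


def seconds_until_reset(headers: dict[str, str], now: float) -> float:
    """Return how long to pause before the next call, or 0.0 if quota remains."""
    # Header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0
    # x-ratelimit-reset is the UTC epoch second at which the window resets.
    reset_at = float(lowered.get("x-ratelimit-reset", now))
    return max(0.0, reset_at - now)
```

A caller would pass time.time() as now and sleep for the returned number of seconds before retrying.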

M Medium: Partial

AI Policy

Metered paywall — public posts accessible, member-only posts blocked. No specific AI crawler blocking beyond the paywall. No official API for reading content.

Official API

None (RSS only) — N/A

Best AI Provider

Any — all equal

Public posts are readable by any agent via WebFetch. Member-only content is behind the paywall for everyone.

Content Deals

No known AI content licensing deals. Medium's business model is reader subscriptions, not AI licensing.

Notes

One of the more accessible publishing platforms for agents. Public posts work with WebFetch. The metered paywall is the only barrier.

📰 News Sites (NYT, WaPo, etc.): Partial

AI Policy

48% of major news sites block AI training crawlers in robots.txt. NYT sued OpenAI. Financial Times blocks GPTBot, ClaudeBot, Google-Extended explicitly. Cloudflare 'Pay Per Crawl' ($0.001–$0.01/request) is the emerging model.

Official API

Varies (licensing deals) — Varies by publisher

Best AI Provider

OpenAI (most deals)

OpenAI has licensing deals with AP, Axel Springer, Future Publishing, and others. Perplexity has a deal with Washington Post (revenue-sharing).

Content Deals

OpenAI: AP, Axel Springer, Future Publishing. Perplexity: Washington Post (revenue-sharing). NYT, PCMag, Mashable: suing OpenAI. Some publishers are also blocking the Internet Archive as an 'AI back door.'

Notes

Public (non-paywalled) articles generally readable via WebFetch. Training crawlers blocked on ~48% of sites. TollBit blocked 26M scrapes in March 2025. A 2025 CJR study found AI browsers can bypass paywalls.
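To put the Pay Per Crawl range in context, here is a small worked calculation of what a given request volume would cost at the low and high ends of the quoted $0.001 to $0.01 per request. The request volume in the example is arbitrary.

```python
PRICE_LOW_USD = 0.001   # low end of the quoted per-request price
PRICE_HIGH_USD = 0.01   # high end of the quoted per-request price


def crawl_cost_range(request_count: int) -> tuple[float, float]:
    """Return the (low, high) cost in USD for a number of paid requests."""
    low = round(request_count * PRICE_LOW_USD, 2)
    high = round(request_count * PRICE_HIGH_USD, 2)
    return (low, high)


# 100,000 requests would cost between $100 and $1,000 at the quoted prices.
example = crawl_cost_range(100_000)
```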

Q Quora: Blocked

AI Policy

Uses Cloudflare protection + login wall. Supports Cloudflare's 'Pay Per Crawl' initiative. Operates Poe (AI assistant platform) — wants payment for content used in AI training/retrieval.

Official API

None — N/A

Best AI Provider

None have access

No known AI provider has licensed Quora content. Quora wants to monetize via Pay Per Crawl rather than direct deals.

Content Deals

No content licensing deals. Quora is a named supporter of Cloudflare's Pay Per Crawl initiative alongside Stack Overflow, BuzzFeed, Fortune.

Notes

Ironic position — Quora runs Poe (an AI platform) but blocks AI agents from accessing Quora content. They want to be paid for content, not give it away.

💭 Discourse Forums: Partial

AI Policy

Self-hosted open-source platform — each instance sets its own policy. Most public Discourse forums are accessible. Some use Cloudflare or other bot protection. The Discourse API provides structured access.

Official API

Discourse API (per-instance) — Free (per-instance)

Best AI Provider

Any — all equal

Public Discourse instances are generally accessible to all agents. API access varies by instance configuration.

Content Deals

No centralized deals possible — Discourse is open-source and self-hosted. Access depends entirely on individual forum administrators.

Notes

Many tech communities run on Discourse (e.g., Meta community forums, Rust users). Generally more accessible than Reddit for AI agents. Public topics usually readable via WebFetch.
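For structured access, public Discourse routes generally return JSON when .json is appended. The sketch below builds two such URLs. The forum.example.org host is a placeholder, and an individual forum may disable or protect these routes.

```python
def latest_topics_url(base_url: str) -> str:
    """JSON listing of the forum's most recently active topics."""
    return base_url.rstrip("/") + "/latest.json"


def topic_url(base_url: str, topic_id: int) -> str:
    """JSON representation of a single topic and its posts."""
    return f"{base_url.rstrip('/')}/t/{topic_id}.json"
```

Because each forum is self-hosted, the same helpers work across instances by changing only base_url.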

Claude Code Researcher Agents

Four specialized research agents are available as subagents. Each uses different models, strategies, and trade-offs between speed and depth.

Agent	Model	Speed	Depth	Best For
researcher	Claude (inherited)	Medium	Medium-Deep	Targeted research, fact-checking, combining web + local info
gemini-researcher	Google Gemini	Slow (5-60+ min)	Very Deep	Comprehensive research, YouTube content, Google ecosystem, multi-angle analysis
perplexity-researcher	Perplexity AI	Fast	Medium	Quick factual research, current events, finding specific data points
claude-researcher	Claude (with WebSearch)	Medium	Deep	Research requiring strong reasoning, connecting information across domains

researcher

Speed: Medium · Depth: Medium-Deep

General-purpose web research with WebSearch + WebFetch. Single-threaded, sequential approach.

Strengths

+ Most flexible: handles any research task
+ Can combine web search with file reads and code execution
+ Good at synthesizing information from multiple sources
+ Can write files and create deliverables from research

Weaknesses

- Single-threaded: searches one query at a time
- No built-in query decomposition
- Can go down rabbit holes without structure
- Slower on broad topics

Tool Access

WebSearch, WebFetch, Read, Write, Edit, Bash, Grep, Glob

gemini-researcher

Speed: Slow (5-60+ min) · Depth: Very Deep

Breaks queries into 3-10 variations, launches parallel Gemini sub-agents. Multi-perspective deep investigation.

Strengths

+ Parallel execution: 3-10 sub-agents simultaneously
+ Best YouTube/Google ecosystem access (Google product)
+ Multi-perspective decomposition catches angles you'd miss
+ Deep investigation with high token budgets per sub-agent

Weaknesses

- SLOW: can take 5-60+ minutes for complex queries
- High token usage (100K+ tokens typical)
- Overkill for simple questions
- Sub-agents can overlap and duplicate work

Tool Access

WebSearch, WebFetch, Read, Write, Bash, Grep, Glob

perplexity-researcher

Speed: Fast · Depth: Medium

Leverages Perplexity's search-optimized AI for web research. Built for finding answers.

Strengths

+ Search-native: built specifically for web research
+ Good at finding current/recent information
+ Provides source citations naturally
+ Faster than Gemini for most queries

Weaknesses

- Less control over search strategy
- Can't decompose into parallel sub-queries like Gemini
- May hit rate limits on heavy usage
- Less capable at multi-step reasoning

Tool Access

WebSearch, WebFetch, Read, Write, Bash, Grep, Glob

claude-researcher

Speed: Medium · Depth: Deep

Multi-query decomposition with parallel search execution using Claude's built-in WebSearch.

Strengths

+ Intelligent query decomposition
+ Parallel search execution
+ Strong reasoning about search results
+ Good at connecting dots across sources

Weaknesses

- Limited by Claude's WebSearch tool capabilities
- Same WebFetch limitations as other Claude agents
- No special access to walled-garden platforms
- May be slower than Perplexity for simple queries

Tool Access

WebSearch, WebFetch, Read, Write, Bash, Grep, Glob

Which Researcher Should I Use?

Need a quick answer? Use perplexity-researcher — fastest for simple factual lookups.

Need YouTube data or Google ecosystem info? Use gemini-researcher — only agent with native Google/YouTube understanding.

Need comprehensive multi-angle research? Use gemini-researcher — parallel sub-agents cover more ground, but budget 10-60 minutes.

Need research + reasoning + code? Use claude-researcher — best at connecting dots and synthesizing across domains.

Need research combined with file edits? Use researcher — general-purpose, can write deliverables from findings.
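The guidance above amounts to a lookup from the kind of need to an agent name. A minimal sketch of that routing follows; the Need enum and the pick_researcher function are hypothetical names for illustration and are not part of Claude Code.

```python
from enum import Enum


class Need(Enum):
    QUICK_ANSWER = "quick answer"
    YOUTUBE_OR_GOOGLE = "YouTube data or Google ecosystem info"
    MULTI_ANGLE = "comprehensive multi-angle research"
    REASONING_AND_CODE = "research + reasoning + code"
    FILE_EDITS = "research combined with file edits"


# Mirrors the recommendations listed above, one entry per question.
ROUTES = {
    Need.QUICK_ANSWER: "perplexity-researcher",
    Need.YOUTUBE_OR_GOOGLE: "gemini-researcher",
    Need.MULTI_ANGLE: "gemini-researcher",
    Need.REASONING_AND_CODE: "claude-researcher",
    Need.FILE_EDITS: "researcher",
}


def pick_researcher(need: Need) -> str:
    """Return the name of the subagent recommended for a given need."""
    return ROUTES[need]
```

Keeping the mapping in one table makes it easy to adjust if the agents' trade-offs change.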

The YouTube Lesson (What Triggered This Dashboard)

YouTube is the single hardest major website for AI agents to access. It uses browser fingerprinting, WebGL fingerprinting, behavioral analysis, TLS fingerprinting, CAPTCHAs, and frequent HTML structure changes. Its robots.txt blocks all useful paths (/results, /watch, /api). Datacenter IPs fail immediately, and even residential proxies get caught.

When Claude Code tried to find YouTube video IDs for the agentic-marketing course, it failed across 5 different approaches over 2+ hours because no web scraping approach works on YouTube.

The fix: Use the YouTube Data API v3 (free, 10K units/day, ~100 searches/day), set up an MCP server wrapping the API, or use gemini-researcher which has native Google/YouTube understanding. For transcripts, yt-dlp is the most reliable CLI tool.