AI report card
Common Crawl
Live infrastructure simulation · commoncrawl.org
Baseline
Common Crawl AI-Readiness Index: 42/100
Core Web Vitals: 40
Schema Coverage: 12%
Agent Crawlability: 32%
// current <head> on commoncrawl.org
<title>Common Crawl - Home</title>
<meta name="description" content="Welcome to Common Crawl.">
<!-- no JSON-LD -->
<!-- no /llms.txt -->
<!-- no agent manifest -->
<link rel="icon" href="/favicon.ico">
<script src="/legacy-analytics.js"></script>
AI crawlers parse this and bounce — no structured signal.
AI-readiness score: 42 / 100
Common Crawl is a non-profit organization, established in 2007, that provides a free, open repository of web crawl data. The repository contains over 300 billion web pages and grows by 3-5 billion new pages each month. The data is widely used in research and has been cited in over 10,000 papers. Key resources include access to various indexes, crawl statistics, and community collaboration tools. Recent updates include the release of the June 2026 crawl archive, which contains 2.10 billion pages. The organization's mission is to make web data accessible for research and analysis.
Breakdown
Schema.org Coverage: 0
Crawler Accessibility: 100
Content Structure: 100
Metadata Quality: 10
AI Directives (llms.txt): 0
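
The report does not say how the five sub-scores combine into the overall figure. As a rough check, the sketch below assumes an unweighted mean, which is only a guess at the aggregation and not a documented formula. Under that assumption the breakdown values reproduce the 42 / 100 score shown above.

```python
# Sub-scores copied from the Breakdown list (each out of 100).
breakdown = {
    "Schema.org Coverage": 0,
    "Crawler Accessibility": 100,
    "Content Structure": 100,
    "Metadata Quality": 10,
    "AI Directives (llms.txt)": 0,
}


def unweighted_mean(scores: dict) -> float:
    """Assumed aggregation: a plain average of the sub-scores.

    The report does not publish its weighting; equal weights are a guess.
    """
    return sum(scores.values()) / len(scores)


overall = unweighted_mean(breakdown)
print(overall)  # 42.0
```

If the real index uses weights, equal weighting happens to give the same result for these particular values, so this check cannot rule other formulas out.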
What's holding Common Crawl back
- Missing /llms.txt manifest: commoncrawl.org does not expose an /llms.txt file, so AI crawlers cannot quickly discover canonical content.
- No Schema.org JSON-LD detected: no structured data was found on the homepage of commoncrawl.org. LLMs rely on JSON-LD for entity grounding.
- Open Graph metadata incomplete: og:description and og:image are missing.
- Sitemap lacks <lastmod> timestamps: AI crawlers cannot prioritize freshness without lastmod hints.
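
Three of the findings above can be checked offline with the Python standard library. The sketch below is a minimal illustration and not the tool that produced this report. The names `HeadAudit`, `audit_head`, and `urls_without_lastmod` are hypothetical, the sample `<head>` is the snippet quoted earlier, and the sample sitemap is invented for the example. The /llms.txt check is left out because it needs a live HTTP request.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Namespace used by the sitemaps.org 0.9 protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


class HeadAudit(HTMLParser):
    """Collects JSON-LD presence and Open Graph properties from HTML."""

    def __init__(self):
        super().__init__()
        self.has_json_ld = False
        self.og_properties = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        script_type = (attr.get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self.has_json_ld = True
        if tag == "meta":
            prop = attr.get("property") or ""
            if prop.startswith("og:"):
                self.og_properties.add(prop)


def audit_head(html: str) -> list:
    """Return findings for missing JSON-LD and missing Open Graph tags."""
    parser = HeadAudit()
    parser.feed(html)
    findings = []
    if not parser.has_json_ld:
        findings.append("no JSON-LD")
    for required in ("og:description", "og:image"):
        if required not in parser.og_properties:
            findings.append(f"missing {required}")
    return findings


def urls_without_lastmod(sitemap_xml: str) -> list:
    """Return the <loc> of every sitemap <url> entry that has no <lastmod>."""
    root = ET.fromstring(sitemap_xml)
    missing = []
    for url in root.iter(f"{SITEMAP_NS}url"):
        if url.find(f"{SITEMAP_NS}lastmod") is None:
            loc = url.find(f"{SITEMAP_NS}loc")
            missing.append(loc.text.strip() if loc is not None and loc.text else "")
    return missing


# The <head> snippet quoted earlier in this report.
sample_head = """
<title>Common Crawl - Home</title>
<meta name="description" content="Welcome to Common Crawl.">
<link rel="icon" href="/favicon.ico">
<script src="/legacy-analytics.js"></script>
"""

# Invented sitemap for illustration only; not the site's real sitemap.
sample_sitemap = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/about</loc><lastmod>2024-01-01</lastmod></url>
</urlset>
"""

head_findings = audit_head(sample_head)
stale_urls = urls_without_lastmod(sample_sitemap)
print(head_findings)
print(stale_urls)
```

Adding a `<script type="application/ld+json">` block and the two `og:` meta tags to the head, and a `<lastmod>` element to each sitemap entry, would make both functions return empty lists.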