Squarespace, Wix, and other hosted sites
If your site is built on Squarespace, Wix, Webflow, GoDaddy, Weebly or a similar hosted builder, you can't run code on the server or read your own access logs - so there's nothing on the site itself that can see an AI crawler arrive. The fix is to route your domain through a free Cloudflare account, which sits in front of your site, spots AI crawlers, and reports them to OpenAttribution. No terminal, no developer needed.
Why you need a layer in front
AI crawlers - GPTBot, ClaudeBot, PerplexityBot, ChatGPT-User and the rest - fetch the raw HTML of your pages. They don't run JavaScript, so a tracking snippet pasted into your site's footer never sees them; it only ever fires for real people in browsers. And hosted builders don't give you access to your server logs (their built-in analytics are summaries, not the raw requests).
Cloudflare solves both problems at once: every request to your site passes through Cloudflare
first, so it can see the crawler, classify it, and send a content_retrieved event to your OpenAttribution dashboard. Your site keeps working exactly as before - Cloudflare
just passes traffic through to it.
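To make that concrete, here is a minimal sketch of the kind of check a layer in front of your site can run on every request. It looks only at the User-Agent header, which arrives with the request whether or not the visitor runs JavaScript. The function name and the two sample User-Agent strings are illustrative assumptions; the full pattern list is in the Worker in Step 6.

```javascript
// Illustrative subset of the crawler names mentioned on this page.
const SAMPLE_AI_PATTERNS = [/GPTBot/i, /ClaudeBot/i, /PerplexityBot/i, /ChatGPT-User/i];

// Returns true when the User-Agent header names one of the sample AI crawlers.
function looksLikeAiCrawler(userAgent) {
  return SAMPLE_AI_PATTERNS.some((pattern) => pattern.test(userAgent || ''));
}

// Hypothetical User-Agent strings, shortened for readability.
const crawlerUa = 'Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)';
const browserUa = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15';

console.log(looksLikeAiCrawler(crawlerUa)); // true
console.log(looksLikeAiCrawler(browserUa)); // false
```

A footer snippet can't make this check, because it only runs once a browser has loaded the page and executed it.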
Before you start, you'll need two things: your OpenAttribution API key - it starts with oat_pub_, created when you register your domain (or under Domains if you've already
verified it) - and your login for whoever you bought the domain name from (your registrar -
GoDaddy, Namecheap, Squarespace Domains, Google Domains, etc.), because one step changes a setting
there. Your site stays online throughout.
Set it up
Step 1 - Create a free Cloudflare account
Go to dash.cloudflare.com/sign-up and sign up. It's free; you don't need a paid plan for any of this.
Step 2 - Add your site
In the Cloudflare dashboard, click Add a site and type your domain (for example yoursite.com,
without www and without https://).
Cloudflare scans your existing DNS records and copies them over - this is what keeps your site
pointing at Squarespace/Wix/etc. while Cloudflare sits in front. When it shows you the imported
records, just continue; you don't need to change anything here.
Step 3 - Choose the Free plan
When Cloudflare asks which plan you want, pick Free.
Step 4 - Point your domain at Cloudflare
Cloudflare gives you two nameservers (they look like xena.ns.cloudflare.com). You set these at your registrar -
the company you bought the domain from. Cloudflare shows step-by-step instructions for the common
registrars on that screen; the gist is: log in to your registrar, find the domain's
"nameservers" or "DNS" setting, replace what's there with Cloudflare's two, and save.
This change can take anywhere from a few minutes to a few hours to take effect. Your site stays up the whole time - traffic just gradually switches to going via Cloudflare. Cloudflare emails you when it's active.
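Waiting for Cloudflare's email is all you need to do. If you or someone helping you has Node.js to hand and would rather check directly, the sketch below shows one way to do it. The function names and the second nameserver (hal.ns.cloudflare.com) are made up for illustration; your two nameservers are the ones Cloudflare shows you.

```javascript
// True when every nameserver in the list is a Cloudflare one.
function isOnCloudflare(nameservers) {
  return nameservers.length > 0 &&
    nameservers.every((ns) => ns.toLowerCase().replace(/\.$/, '').endsWith('.ns.cloudflare.com'));
}

// Looks up the live NS records for a domain (needs network access).
async function checkDomain(domain) {
  const { resolveNs } = await import('node:dns/promises');
  const nameservers = await resolveNs(domain);
  return { nameservers, active: isOnCloudflare(nameservers) };
}

// Examples with hard-coded, hypothetical values (no network needed):
console.log(isOnCloudflare(['xena.ns.cloudflare.com', 'hal.ns.cloudflare.com'])); // true
console.log(isOnCloudflare(['ns1.example-registrar.com', 'ns2.example-registrar.com'])); // false

// Usage: checkDomain('yoursite.com').then((result) => console.log(result));
```

Because DNS answers are cached, a lookup can keep returning the old nameservers for a while after you save the change.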
Step 5 - Set SSL to "Full"
Once the domain is active on Cloudflare, open SSL/TLS in the left menu and set the encryption mode to Full (not "Full (strict)", not "Flexible"). This keeps HTTPS working between Cloudflare and your hosted builder. It's the step people most often skip, and skipping it causes a "redirect loop" or a certificate warning on the site.
Step 6 - Create the Worker
In the left menu go to Workers & Pages → Create application → Create Worker. Give it a name like oa-telemetry and click Deploy (that just creates a placeholder). Then click Edit code,
delete everything in the editor, paste the code below, and click Save and deploy.
```js
// OpenAttribution Cloudflare Worker - single file, no build step.
// Source: github.com/openattribution-org/cloudflare-worker
//
// Detects AI crawlers / assistants and reports content_retrieved telemetry
// events to the OA API. Requires two settings on the Worker:
//   OA_TELEMETRY_ENDPOINT (text)   e.g. https://telemetry.openattribution.org/events
//   OA_API_KEY            (secret) a content-owner key, oat_pub_..., telemetry:write scope

const AI_BOT_PATTERNS = [
  // Training crawlers
  [/GPTBot/i, 'GPTBot', 'training'],
  [/ClaudeBot/i, 'ClaudeBot', 'training'],
  [/CCBot/i, 'CCBot', 'training'],
  [/GoogleOther/i, 'GoogleOther', 'training'],
  [/Bytespider/i, 'Bytespider', 'training'],
  [/Diffbot/i, 'Diffbot', 'training'],
  [/Applebot-Extended/i, 'Applebot-Extended', 'training'],
  [/cohere-ai/i, 'cohere-ai', 'training'],
  [/FacebookBot/i, 'FacebookBot', 'training'],
  [/meta-externalagent/i, 'meta-externalagent', 'training'],
  [/Amazonbot/i, 'Amazonbot', 'training'],
  [/DeepSeekBot/i, 'DeepSeekBot', 'training'],
  [/AI2Bot/i, 'AI2Bot', 'training'],
  [/PanguBot/i, 'PanguBot', 'training'],
  [/ChatGLM-Spider/i, 'ChatGLM-Spider', 'training'],
  [/Timpibot/i, 'Timpibot', 'training'],
  [/omgili/i, 'omgili', 'training'],
  [/ImagesiftBot/i, 'ImagesiftBot', 'training'],
  [/FirecrawlAgent/i, 'FirecrawlAgent', 'training'],
  [/xAI-Bot/i, 'xAI-Bot', 'training'],
  [/Google-CloudVertexBot/i, 'Google-CloudVertexBot', 'training'],
  [/HuggingFace-Bot/i, 'HuggingFace-Bot', 'training'],
  [/Brightbot/i, 'Brightbot', 'training'],
  [/Webzio-Extended/i, 'Webzio-Extended', 'training'],
  [/TerraCotta/i, 'TerraCotta', 'training'],
  // Inference fetchers (user-triggered, real time)
  [/ChatGPT-User/i, 'ChatGPT-User', 'inference'],
  [/ChatGPT-Browser/i, 'ChatGPT-Browser', 'inference'],
  [/Claude-User/i, 'Claude-User', 'inference'],
  [/Perplexity-User/i, 'Perplexity-User', 'inference'],
  [/MistralAI-User/i, 'MistralAI-User', 'inference'],
  [/Amzn-User/i, 'Amzn-User', 'inference'],
  [/meta-externalfetcher/i, 'meta-externalfetcher', 'inference'],
  [/Google-Agent/i, 'Google-Agent', 'inference'],
  [/GoogleAgent-Mariner/i, 'GoogleAgent-Mariner', 'inference'],
  [/Gemini-Deep-Research/i, 'Gemini-Deep-Research', 'inference'],
  [/Google-NotebookLM/i, 'Google-NotebookLM', 'inference'],
  [/DuckAssistBot/i, 'DuckAssistBot', 'inference'],
  [/PhindBot/i, 'PhindBot', 'inference'],
  [/Cohere-Command/i, 'Cohere-Command', 'inference'],
  [/Devin\/[\d.]+/i, 'Devin', 'inference'],
  // AI search indexers
  [/OAI-SearchBot/i, 'OAI-SearchBot', 'search'],
  [/Claude-SearchBot/i, 'Claude-SearchBot', 'search'],
  [/PerplexityBot/i, 'PerplexityBot', 'search'],
  [/YouBot/i, 'YouBot', 'search'],
  [/PetalBot/i, 'PetalBot', 'search'],
  [/Bravebot/i, 'Bravebot', 'search'],
  [/AzureAI-SearchBot/i, 'AzureAI-SearchBot', 'search'],
  [/meta-webindexer/i, 'meta-webindexer', 'search'],
  [/ExaBot/i, 'ExaBot', 'search'],
  [/Andibot/i, 'Andibot', 'search'],
];

// Cloudflare's verifiedBotCategory -> OA bot_category. Available on every plan.
const CF_CATEGORY = {
  'AI Crawler': 'training',
  'AI Assistant': 'inference',
  'AI Search': 'search',
};

const STATIC_EXT = /\.(css|js|jpg|jpeg|png|gif|svg|ico|woff2?|ttf|eot|map|webp|avif|mp4|webm)$/i;

function matchUserAgent(ua) {
  for (const [pattern, name, category] of AI_BOT_PATTERNS) {
    if (pattern.test(ua)) return { name, category };
  }
  return null;
}

function classify(request) {
  const cf = request.cf || {};
  const bm = cf.botManagement;
  const uaMatch = matchUserAgent(request.headers.get('user-agent') || '');

  // verifiedBotCategory is available on all plans. If Cloudflare has
  // categorised this as an AI bot, trust it for the category, but still pull
  // the bot name from the UA when we recognise it.
  const aiCategory = CF_CATEGORY[cf.verifiedBotCategory];
  if (aiCategory) {
    return {
      name: uaMatch ? uaMatch.name : null,
      category: aiCategory,
      verified: bm && typeof bm.verifiedBot === 'boolean' ? bm.verifiedBot : true,
      detection: 'bot_management',
      ja4: bm && bm.ja4,
    };
  }

  // Enterprise Bot Management: skip verified non-AI bots (Googlebot, Bingbot,
  // Pingdom, etc.) and high-score requests (likely human).
  if (bm && typeof bm.score === 'number') {
    if (bm.verifiedBot || bm.score >= 30) return null;
  }

  // UA pattern matching - Free/Pro fallback, or low-score unverified on Enterprise.
  if (uaMatch) {
    return {
      name: uaMatch.name,
      category: uaMatch.category,
      verified: false,
      detection: bm ? 'bot_management' : 'user_agent',
      ja4: bm && bm.ja4,
    };
  }

  return null;
}

export default {
  async fetch(request, env, ctx) {
    // Static assets: pass straight through, don't classify.
    if (STATIC_EXT.test(new URL(request.url).pathname)) {
      return fetch(request);
    }

    const response = await fetch(request);
    const hit = classify(request);

    if (hit) {
      const cf = request.cf || {};
      const contentLength = response.headers.get('content-length');
      const cacheStatus = response.headers.get('cf-cache-status');

      const event = {
        id: crypto.randomUUID(),
        type: 'content_retrieved',
        timestamp: new Date().toISOString(),
        content_url: request.url,
        source_role: 'edge',
        content_telemetry_id: request.headers.get('Content-Telemetry-ID') || undefined,
        data: {
          user_agent: request.headers.get('user-agent'),
          ...(hit.name ? { bot_name: hit.name } : {}),
          bot_category: hit.category,
          verified: hit.verified,
          detection: hit.detection,
          response_status: response.status,
          ...(contentLength ? { response_bytes: parseInt(contentLength, 10) } : {}),
          ...(cacheStatus ? { cache_status: cacheStatus.toLowerCase() } : {}),
          asn: cf.asn,
          asn_org: cf.asOrganization,
          country: cf.country,
          ...(hit.ja4 ? { ja4: hit.ja4 } : {}),
        },
      };

      // Fire and forget - runs after the response is sent, never blocks or
      // breaks the page; telemetry failures are swallowed on purpose.
      ctx.waitUntil(
        fetch(env.OA_TELEMETRY_ENDPOINT, {
          method: 'POST',
          headers: {
            'Content-Type': 'application/json',
            'X-API-Key': env.OA_API_KEY,
          },
          body: JSON.stringify({ events: [event] }),
        }).catch(() => {}),
      );
    }

    return response;
  },
};
```
For each request, if it's from a known AI bot the Worker sends one small event to OpenAttribution after the page has already been served - it never slows down or breaks your site. It's open source - the full version lives in openattribution-org/cloudflare-worker.
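For reference, here is a sketch of roughly what one of those events looks like on the wire, based on the fields the Worker sets. Every value below is invented for illustration, and a real event also carries the network fields (asn, asn_org, country) that Cloudflare fills in.

```javascript
// Builds an event shaped like the one the Worker sends.
// All values here are made up for illustration.
function buildSampleEvent(telemetryId) {
  return {
    id: '00000000-0000-4000-8000-000000000000', // the Worker uses crypto.randomUUID()
    type: 'content_retrieved',
    timestamp: '2025-01-01T00:00:00.000Z',
    content_url: 'https://yoursite.com/blog/example-post',
    source_role: 'edge',
    // Same approach as the Worker: a key set to undefined is dropped by JSON.stringify.
    content_telemetry_id: telemetryId || undefined,
    data: {
      user_agent: 'Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)',
      bot_name: 'GPTBot',
      bot_category: 'training',
      verified: false,
      detection: 'user_agent',
      response_status: 200,
    },
  };
}

// The Worker wraps events in an { events: [...] } envelope before posting.
const body = JSON.stringify({ events: [buildSampleEvent(null)] }, null, 2);
console.log(body);
```

Note that content_telemetry_id does not appear in the printed JSON when the crawler sent no Content-Telemetry-ID header, because keys with an undefined value are omitted during serialisation.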
Step 7 - Add your two settings
Still in the Worker, go to its Settings tab → Variables and Secrets (older dashboards call this "Environment Variables"). Add these two:
| Name | Value | Type |
|---|---|---|
| OA_TELEMETRY_ENDPOINT | https://telemetry.openattribution.org/events | Text / plaintext |
| OA_API_KEY | Your key starting oat_pub_ | Secret / encrypted |
Mark OA_API_KEY as a secret (the "Encrypt" option) so it
isn't shown in plain text afterwards. Save, and redeploy the Worker if Cloudflare prompts you to.
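Because the Worker swallows telemetry failures on purpose, a typo in either setting won't show up as an error on your site. The sketch below is one way to sanity-check the two values before relying on them. The function name and the sample key are hypothetical, and the checks only cover what this page states: an https:// endpoint and a key starting oat_pub_.

```javascript
// Returns a list of human-readable problems; an empty list means the settings look plausible.
function checkSettings(env) {
  const problems = [];
  const endpoint = env.OA_TELEMETRY_ENDPOINT || '';
  const apiKey = env.OA_API_KEY || '';

  if (!endpoint.startsWith('https://')) {
    problems.push('OA_TELEMETRY_ENDPOINT should be an https:// URL');
  }
  if (!apiKey.startsWith('oat_pub_')) {
    problems.push('OA_API_KEY should start with oat_pub_');
  }
  if (endpoint !== endpoint.trim() || apiKey !== apiKey.trim()) {
    problems.push('a value has leading or trailing spaces');
  }
  return problems;
}

// Example with a made-up key:
console.log(checkSettings({
  OA_TELEMETRY_ENDPOINT: 'https://telemetry.openattribution.org/events',
  OA_API_KEY: 'oat_pub_example',
})); // []
```

The most likely slip this catches is a stray space picked up when copying the key.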
Step 8 - Run the Worker on your site
The Worker exists but isn't attached to your site yet. In the Worker's Settings → Domains & Routes (or Triggers → Routes), click Add route and enter yoursite.com/*, then add a
second route *.yoursite.com/* so it covers the www version too. Pick your domain from the zone dropdown and
save.
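If it isn't obvious why two routes are needed, the sketch below models how the wildcards behave. It is a deliberately simplified approximation in which * stands for any run of characters; Cloudflare's real route matching has more rules than this, and the function name is made up for illustration.

```javascript
// Simplified model of route matching: "*" stands for any run of characters.
// Real Cloudflare route matching has more rules; this is only an approximation.
function routeMatches(pattern, hostAndPath) {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters except "*"
    .replace(/\*/g, '.*');                 // turn each "*" into "match anything"
  return new RegExp('^' + escaped + '$').test(hostAndPath);
}

console.log(routeMatches('yoursite.com/*', 'yoursite.com/about'));       // true
console.log(routeMatches('yoursite.com/*', 'www.yoursite.com/about'));   // false
console.log(routeMatches('*.yoursite.com/*', 'www.yoursite.com/about')); // true
console.log(routeMatches('*.yoursite.com/*', 'yoursite.com/about'));     // false
```

Neither pattern covers both forms of the address on its own, which is why the step asks for both.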
Step 9 - Check it's working
That's it. AI crawlers visit on their own schedule, so data fills in over hours and days rather than
instantly - check your OpenAttribution
dashboard over the next day or two and you should see content_retrieved events appear. If you've been verified for a while and nothing shows up after a couple of days, the troubleshooting page has
the usual suspects.
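If you'd rather not wait for a real crawler, one option is to send a request that identifies itself with a crawler-style User-Agent. Going by the Worker code, the User-Agent fallback classifies on that header alone and marks the event verified: false. How the dashboard displays unverified events is an assumption here, and the function names and the User-Agent string are made up for illustration.

```javascript
// Builds the URL and options for a test request that identifies itself as GPTBot.
function buildProbe(siteUrl) {
  return {
    url: new URL('/', siteUrl).toString(),
    options: {
      method: 'GET',
      headers: { 'User-Agent': 'oa-setup-check (compatible; GPTBot/1.0)' },
    },
  };
}

// Sends the probe (needs network access and Node.js 18+ for the built-in fetch).
async function sendProbe(siteUrl) {
  const { url, options } = buildProbe(siteUrl);
  const response = await fetch(url, options);
  return response.status;
}

// Usage: sendProbe('https://yoursite.com').then((status) => console.log(status));
```

A 200 status only tells you the page was served; the thing to look for is a matching content_retrieved event in the dashboard shortly afterwards.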
Optional: serve the manifest from Cloudflare too
If you verified your domain with the HTML meta tag (because hosted builders rarely let you put a
file at /.well-known/openattribution.json), you can now have
Cloudflare serve that file instead - which also tells AI agents and other tools where to send
telemetry. Add this near the top of the same Worker, before the part that fetches your site:
```js
export default {
  async fetch(request, env, ctx) {
    const url = new URL(request.url);
    if (url.pathname === '/.well-known/openattribution.json') {
      return new Response(JSON.stringify({
        schema_version: '0.1',
        id: 'https://' + url.hostname + '/.well-known/openattribution.json',
        roles: ['content_owner'],
        operator: { name: url.hostname },
        telemetry: { endpoint: 'https://telemetry.openattribution.org/events', conformance_level: 'retrieval' },
        domains: [url.hostname],
      }), { headers: { 'content-type': 'application/json' } });
    }
    // ...rest of the worker from Step 6 goes here...
  },
};
```
Once that's deployed and reachable at https://yoursite.com/.well-known/openattribution.json, you can
remove the meta tag from your site if you like - the manifest is enough on its own.
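A Worker file can only have one export default, so the manifest check has to be merged into the Step 6 code rather than pasted in as a second block. One way to keep that merge small, sketched here with hypothetical helper names, is to move the manifest logic into its own function and call it on the first lines of the existing fetch handler.

```javascript
const MANIFEST_PATH = '/.well-known/openattribution.json';

// Builds the manifest object with the same fields as the snippet above.
function buildManifest(hostname) {
  return {
    schema_version: '0.1',
    id: 'https://' + hostname + MANIFEST_PATH,
    roles: ['content_owner'],
    operator: { name: hostname },
    telemetry: {
      endpoint: 'https://telemetry.openattribution.org/events',
      conformance_level: 'retrieval',
    },
    domains: [hostname],
  };
}

// Returns a Response for the manifest path, or null for every other path.
function maybeServeManifest(requestUrl) {
  const url = new URL(requestUrl);
  if (url.pathname !== MANIFEST_PATH) return null;
  return new Response(JSON.stringify(buildManifest(url.hostname)), {
    headers: { 'content-type': 'application/json' },
  });
}

// Inside the Step 6 fetch handler, the first lines would then be:
//   const manifest = maybeServeManifest(request.url);
//   if (manifest) return manifest;
```

With this shape the rest of the Step 6 handler stays exactly as it was, and every other path falls through to it.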
Already on Fastly or another CDN?
If your site already sits behind Fastly, Akamai, CloudFront or another CDN, the same idea applies - the detection runs at that layer instead of Cloudflare. Those integrations are in progress; if you're on one and want it sooner, tell us. We'd suggest the Cloudflare route above only if you're not already on a CDN - it's the least technical way to add one.
What this captures (and what it doesn't)
This gives you content_retrieved - the fact that an AI crawler or assistant fetched a page - which is the part you can see today
without anyone else's cooperation. The rest of the picture (when your content was actually cited in an answer, whether the reader engaged, the full session) depends on AI platforms
adopting the standard on their side. The reporting
paths overview walks through which events come from where.