
Squarespace, Wix, and other hosted sites

If your site is built on Squarespace, Wix, Webflow, GoDaddy, Weebly or a similar hosted builder, you can't run code on the server or read your own access logs - so there's nothing on the site itself that can see an AI crawler arrive. The fix is to route your domain through a free Cloudflare account, which sits in front of your site, spots AI crawlers, and reports them to OpenAttribution. No terminal, no developer needed.

No code on your site · Free Cloudflare plan · ~20 minutes

Why you need a layer in front

AI crawlers - GPTBot, ClaudeBot, PerplexityBot, ChatGPT-User and the rest - fetch the raw HTML of your pages. They don't run JavaScript, so a tracking snippet pasted into your site's footer never sees them; it only ever fires for real people in browsers. And hosted builders don't give you access to your server logs (their built-in analytics are summaries, not the raw requests).

Cloudflare solves both problems at once: every request to your site passes through Cloudflare first, so it can see the crawler, classify it, and send a content_retrieved event to your OpenAttribution dashboard. Your site keeps working exactly as before - Cloudflare just passes traffic through to it.

Before you start
You'll need two things. First, your OpenAttribution API key: the one starting oat_pub_, created when you register your domain (or found under Domains if you've already verified it). Second, your login for your registrar, the company you bought the domain name from (GoDaddy, Namecheap, Squarespace Domains, Google Domains, etc.), because one step changes a setting there. Your site stays online throughout.

Set it up

Step 1 - Create a free Cloudflare account

Go to dash.cloudflare.com/sign-up and sign up. It's free; you don't need a paid plan for any of this.

Step 2 - Add your site

In the Cloudflare dashboard, click Add a site and type your domain (for example yoursite.com, without www and without https://). Cloudflare scans your existing DNS records and copies them over - this is what keeps your site pointing at Squarespace/Wix/etc. while Cloudflare sits in front. When it shows you the imported records, just continue; you don't need to change anything here.

Step 3 - Choose the Free plan

When Cloudflare asks which plan you want, pick Free.

Step 4 - Point your domain at Cloudflare

Cloudflare gives you two nameservers (they look like xena.ns.cloudflare.com). You set these at your registrar - the company you bought the domain from. Cloudflare shows step-by-step instructions for the common registrars on that screen; the gist is: log in to your registrar, find the domain's "nameservers" or "DNS" setting, replace what's there with Cloudflare's two, and save.

This change can take anywhere from a few minutes to a few hours to take effect. Your site stays up the whole time - traffic just gradually switches to going via Cloudflare. Cloudflare emails you when it's active.
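
If you'd rather check the switch yourself than wait for the email, you can compare the nameservers your domain currently reports against Cloudflare's. The sketch below assumes Node.js 18 or later. usesCloudflareNameservers and checkDomain are hypothetical helper names, and the .ns.cloudflare.com suffix is taken from the example nameserver above.

```javascript
// Hypothetical helper: true when every nameserver in the list is a
// Cloudflare one (hostname ending in ".ns.cloudflare.com").
function usesCloudflareNameservers(nameservers) {
  if (!Array.isArray(nameservers) || nameservers.length === 0) return false;
  return nameservers.every((ns) =>
    String(ns).toLowerCase().replace(/\.$/, '').endsWith('.ns.cloudflare.com'),
  );
}

// Looks up the live NS records for a domain. Needs network access.
async function checkDomain(domain) {
  const { resolveNs } = await import('node:dns/promises');
  const nameservers = await resolveNs(domain);
  return { nameservers, active: usesCloudflareNameservers(nameservers) };
}

// Usage (replace yoursite.com with your own domain):
// checkDomain('yoursite.com').then(console.log);
```

While the change is still spreading you may see the old nameservers, or a mix of old and new; the helper only reports true once every entry is a Cloudflare one.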

If your domain is registered with Squarespace itself
Squarespace-managed domains don't always let you change nameservers from inside Squarespace. If the option is greyed out, you have two routes: use Cloudflare's CNAME (partial) setup instead, which Cloudflare limits to its paid Business and Enterprise plans, or move the domain's registration elsewhere. Get in touch via the contact form if you're stuck on this step and we'll talk you through it.

Step 5 - Set SSL to "Full"

Once the domain is active on Cloudflare, open SSL/TLS in the left menu and set the encryption mode to Full (not "Full (strict)", not "Flexible"). This keeps HTTPS working between Cloudflare and your hosted builder. It's the step people most often skip, and skipping it causes a "redirect loop" or a certificate warning on the site.

Step 6 - Create the Worker

In the left menu go to Workers & Pages → Create application → Create Worker. Give it a name like oa-telemetry and click Deploy (that just creates a placeholder). Then click Edit code, delete everything in the editor, paste the code below, and click Save and deploy.

worker.js - paste this into the Cloudflare editor
// OpenAttribution Cloudflare Worker - single file, no build step.
// Source: github.com/openattribution-org/cloudflare-worker
//
// Detects AI crawlers / assistants and reports content_retrieved telemetry
// events to the OA API. Requires two settings on the Worker:
//   OA_TELEMETRY_ENDPOINT  (text)    e.g. https://telemetry.openattribution.org/events
//   OA_API_KEY             (secret)  a content-owner key, oat_pub_..., telemetry:write scope

const AI_BOT_PATTERNS = [
  // Training crawlers
  [/GPTBot/i, 'GPTBot', 'training'],
  [/ClaudeBot/i, 'ClaudeBot', 'training'],
  [/CCBot/i, 'CCBot', 'training'],
  [/GoogleOther/i, 'GoogleOther', 'training'],
  [/Bytespider/i, 'Bytespider', 'training'],
  [/Diffbot/i, 'Diffbot', 'training'],
  [/Applebot-Extended/i, 'Applebot-Extended', 'training'],
  [/cohere-ai/i, 'cohere-ai', 'training'],
  [/FacebookBot/i, 'FacebookBot', 'training'],
  [/meta-externalagent/i, 'meta-externalagent', 'training'],
  [/Amazonbot/i, 'Amazonbot', 'training'],
  [/DeepSeekBot/i, 'DeepSeekBot', 'training'],
  [/AI2Bot/i, 'AI2Bot', 'training'],
  [/PanguBot/i, 'PanguBot', 'training'],
  [/ChatGLM-Spider/i, 'ChatGLM-Spider', 'training'],
  [/Timpibot/i, 'Timpibot', 'training'],
  [/omgili/i, 'omgili', 'training'],
  [/ImagesiftBot/i, 'ImagesiftBot', 'training'],
  [/FirecrawlAgent/i, 'FirecrawlAgent', 'training'],
  [/xAI-Bot/i, 'xAI-Bot', 'training'],
  [/Google-CloudVertexBot/i, 'Google-CloudVertexBot', 'training'],
  [/HuggingFace-Bot/i, 'HuggingFace-Bot', 'training'],
  [/Brightbot/i, 'Brightbot', 'training'],
  [/Webzio-Extended/i, 'Webzio-Extended', 'training'],
  [/TerraCotta/i, 'TerraCotta', 'training'],
  // Inference fetchers (user-triggered, real time)
  [/ChatGPT-User/i, 'ChatGPT-User', 'inference'],
  [/ChatGPT-Browser/i, 'ChatGPT-Browser', 'inference'],
  [/Claude-User/i, 'Claude-User', 'inference'],
  [/Perplexity-User/i, 'Perplexity-User', 'inference'],
  [/MistralAI-User/i, 'MistralAI-User', 'inference'],
  [/Amzn-User/i, 'Amzn-User', 'inference'],
  [/meta-externalfetcher/i, 'meta-externalfetcher', 'inference'],
  [/Google-Agent/i, 'Google-Agent', 'inference'],
  [/GoogleAgent-Mariner/i, 'GoogleAgent-Mariner', 'inference'],
  [/Gemini-Deep-Research/i, 'Gemini-Deep-Research', 'inference'],
  [/Google-NotebookLM/i, 'Google-NotebookLM', 'inference'],
  [/DuckAssistBot/i, 'DuckAssistBot', 'inference'],
  [/PhindBot/i, 'PhindBot', 'inference'],
  [/Cohere-Command/i, 'Cohere-Command', 'inference'],
  [/Devin\/[\d.]+/i, 'Devin', 'inference'],
  // AI search indexers
  [/OAI-SearchBot/i, 'OAI-SearchBot', 'search'],
  [/Claude-SearchBot/i, 'Claude-SearchBot', 'search'],
  [/PerplexityBot/i, 'PerplexityBot', 'search'],
  [/YouBot/i, 'YouBot', 'search'],
  [/PetalBot/i, 'PetalBot', 'search'],
  [/Bravebot/i, 'Bravebot', 'search'],
  [/AzureAI-SearchBot/i, 'AzureAI-SearchBot', 'search'],
  [/meta-webindexer/i, 'meta-webindexer', 'search'],
  [/ExaBot/i, 'ExaBot', 'search'],
  [/Andibot/i, 'Andibot', 'search'],
];

// Cloudflare's verifiedBotCategory -> OA bot_category. Available on every plan.
const CF_CATEGORY = {
  'AI Crawler': 'training',
  'AI Assistant': 'inference',
  'AI Search': 'search',
};

const STATIC_EXT = /\.(css|js|jpg|jpeg|png|gif|svg|ico|woff2?|ttf|eot|map|webp|avif|mp4|webm)$/i;

function matchUserAgent(ua) {
  for (const [pattern, name, category] of AI_BOT_PATTERNS) {
    if (pattern.test(ua)) return { name, category };
  }
  return null;
}

function classify(request) {
  const cf = request.cf || {};
  const bm = cf.botManagement;
  const uaMatch = matchUserAgent(request.headers.get('user-agent') || '');

  // verifiedBotCategory is available on all plans. If Cloudflare has
  // categorised this as an AI bot, trust it for the category, but still pull
  // the bot name from the UA when we recognise it.
  const aiCategory = CF_CATEGORY[cf.verifiedBotCategory];
  if (aiCategory) {
    return {
      name: uaMatch ? uaMatch.name : null,
      category: aiCategory,
      verified: bm && typeof bm.verifiedBot === 'boolean' ? bm.verifiedBot : true,
      detection: 'bot_management',
      ja4: bm && bm.ja4,
    };
  }

  // Enterprise Bot Management: skip verified non-AI bots (Googlebot, Bingbot,
  // Pingdom, etc.) and high-score requests (likely human).
  if (bm && typeof bm.score === 'number') {
    if (bm.verifiedBot || bm.score >= 30) return null;
  }

  // UA pattern matching - Free/Pro fallback, or low-score unverified on Enterprise.
  if (uaMatch) {
    return {
      name: uaMatch.name,
      category: uaMatch.category,
      verified: false,
      detection: bm ? 'bot_management' : 'user_agent',
      ja4: bm && bm.ja4,
    };
  }

  return null;
}

export default {
  async fetch(request, env, ctx) {
    // Static assets: pass straight through, don't classify.
    if (STATIC_EXT.test(new URL(request.url).pathname)) {
      return fetch(request);
    }

    const response = await fetch(request);

    const hit = classify(request);
    if (hit) {
      const cf = request.cf || {};
      const contentLength = response.headers.get('content-length');
      const cacheStatus = response.headers.get('cf-cache-status');

      const event = {
        id: crypto.randomUUID(),
        type: 'content_retrieved',
        timestamp: new Date().toISOString(),
        content_url: request.url,
        source_role: 'edge',
        content_telemetry_id: request.headers.get('Content-Telemetry-ID') || undefined,
        data: {
          user_agent: request.headers.get('user-agent'),
          ...(hit.name ? { bot_name: hit.name } : {}),
          bot_category: hit.category,
          verified: hit.verified,
          detection: hit.detection,
          response_status: response.status,
          ...(contentLength ? { response_bytes: parseInt(contentLength, 10) } : {}),
          ...(cacheStatus ? { cache_status: cacheStatus.toLowerCase() } : {}),
          asn: cf.asn,
          asn_org: cf.asOrganization,
          country: cf.country,
          ...(hit.ja4 ? { ja4: hit.ja4 } : {}),
        },
      };

      // Fire and forget - runs after the response is sent, never blocks or
      // breaks the page; telemetry failures are swallowed on purpose.
      ctx.waitUntil(
        fetch(env.OA_TELEMETRY_ENDPOINT, {
          method: 'POST',
          headers: {
            'Content-Type': 'application/json',
            'X-API-Key': env.OA_API_KEY,
          },
          body: JSON.stringify({ events: [event] }),
        }).catch(() => {}),
      );
    }

    return response;
  },
};

For each page request from a known AI bot, the Worker sends one small event to OpenAttribution after the page has already been served, so the reporting itself never delays or breaks the page. The Worker is open source; the full version lives in openattribution-org/cloudflare-worker.
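
To see how the user-agent fallback behaves, you can run the matching loop on its own against a few sample strings. This is a cut-down sketch that copies only four of the patterns from the Worker; SAMPLE_PATTERNS and matchSample are names made up for the example, and the sample user-agent strings are illustrative rather than captured traffic.

```javascript
// A cut-down copy of the Worker's pattern table (four entries only).
const SAMPLE_PATTERNS = [
  [/GPTBot/i, 'GPTBot', 'training'],
  [/ChatGPT-User/i, 'ChatGPT-User', 'inference'],
  [/PerplexityBot/i, 'PerplexityBot', 'search'],
  [/Perplexity-User/i, 'Perplexity-User', 'inference'],
];

// Same loop as matchUserAgent in the Worker: the first matching pattern wins.
function matchSample(ua) {
  for (const [pattern, name, category] of SAMPLE_PATTERNS) {
    if (pattern.test(ua)) return { name, category };
  }
  return null;
}

const samples = [
  'Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)',
  'Mozilla/5.0 (compatible; Perplexity-User/1.0)',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36',
];

// Bots come back as { name, category }; an ordinary browser comes back as null.
for (const ua of samples) console.log(matchSample(ua), '<-', ua);
```

A null result is what keeps ordinary visitors out of your telemetry: the Worker only builds an event when the classification returns a match.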

Step 7 - Add your two settings

Still in the Worker, go to its Settings tab → Variables and Secrets (older dashboards call this "Environment Variables"). Add these two:

Name                   Value                                          Type
OA_TELEMETRY_ENDPOINT  https://telemetry.openattribution.org/events   Text / plaintext
OA_API_KEY             Your key starting oat_pub_                     Secret / encrypted

Mark OA_API_KEY as a secret (the "Encrypt" option) so it isn't shown in plain text afterwards. Save, and redeploy the Worker if Cloudflare prompts you to.
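
A typo in either setting fails quietly, because the Worker swallows telemetry errors on purpose. While you're setting up, a small check along these lines can catch the obvious mistakes. checkSettings is a hypothetical helper, not part of the published Worker, and the only thing it assumes about the key is the oat_pub_ prefix mentioned above.

```javascript
// Hypothetical setup check: returns a list of human-readable problems,
// or an empty list when both settings look plausible.
function checkSettings(env) {
  const problems = [];
  const endpoint = env && env.OA_TELEMETRY_ENDPOINT;
  const key = env && env.OA_API_KEY;

  if (typeof endpoint !== 'string' || !endpoint.startsWith('https://')) {
    problems.push('OA_TELEMETRY_ENDPOINT should be an https:// URL');
  }
  if (typeof key !== 'string' || !key.startsWith('oat_pub_')) {
    problems.push('OA_API_KEY should start with oat_pub_');
  }
  return problems;
}

// Possible use inside the Worker's fetch handler while testing:
//   const problems = checkSettings(env);
//   if (problems.length) console.log('OA settings:', problems.join('; '));
```

The check only looks at the shape of the values; it can't tell you whether the key is valid for your domain.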

Step 8 - Run the Worker on your site

The Worker exists but isn't attached to your site yet. In the Worker's Settings → Domains & Routes (or Triggers → Routes), click Add route and enter yoursite.com/*, then add a second route *.yoursite.com/* so it covers the www version too. Pick your domain from the zone dropdown and save.
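
Why two routes: the pattern *.yoursite.com/* covers subdomains such as www but not the bare domain, so each needs its own entry. The sketch below is a simplified illustration of that behaviour, not Cloudflare's actual matcher. routeMatches is a hypothetical function that handles only these two pattern shapes.

```javascript
// Simplified illustration: supports "host/*" and "*.host/*" patterns only.
function routeMatches(pattern, url) {
  const { hostname } = new URL(url);
  const hostPattern = pattern.replace(/\/\*$/, ''); // drop the trailing "/*"

  if (hostPattern.startsWith('*.')) {
    // Wildcard subdomain: the hostname must end with ".host", so the bare
    // domain itself does not qualify.
    return hostname.endsWith(hostPattern.slice(1));
  }
  return hostname === hostPattern;
}

const routes = ['yoursite.com/*', '*.yoursite.com/*'];
for (const url of ['https://yoursite.com/about', 'https://www.yoursite.com/blog']) {
  console.log(url, '->', routes.filter((r) => routeMatches(r, url)));
}
```

Each URL is picked up by exactly one of the two routes, which is why leaving either one out leaves part of your traffic unmonitored.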

Step 9 - Check it's working

That's it. AI crawlers visit on their own schedule, so data fills in over hours and days rather than instantly - check your OpenAttribution dashboard over the next day or two and you should see content_retrieved events appear. If you've been verified for a while and nothing shows up after a couple of days, the troubleshooting page has the usual suspects.
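
If you'd rather not wait for a real crawler, one option is to request a page yourself with a crawler-style user agent. On the Free plan the Worker falls back to user-agent matching, so a request like this should be reported with verified set to false; on plans with Bot Management it may be filtered out as likely human. The sketch assumes Node.js 18 or later for the built-in fetch. probeOptions is a hypothetical helper and the user-agent string is made up for the test.

```javascript
// Hypothetical helper: builds fetch options that carry a crawler-style
// user agent. The token defaults to GPTBot, one of the Worker's patterns.
function probeOptions(botToken = 'GPTBot') {
  return {
    method: 'GET',
    headers: {
      'user-agent': `Mozilla/5.0 (compatible; ${botToken}/1.0; setup-test)`,
    },
    redirect: 'follow',
  };
}

// Usage (needs network access; replace the URL with a page on your site):
// fetch('https://yoursite.com/', probeOptions()).then((r) => console.log(r.status));
```

Request an ordinary page rather than an image or stylesheet, since the Worker passes static assets straight through without classifying them.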


Optional: serve the manifest from Cloudflare too

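If you verified your domain with the HTML meta tag (because hosted builders rarely let you put a file at /.well-known/openattribution.json), you can now have Cloudflare serve that file instead - which also tells AI agents and other tools where to send telemetry. Add the manifest check to the top of the existing fetch handler in the same Worker, before the part that fetches your site. The snippet shows the whole handler for context; a Worker can only have one export default, so merge the new lines into the one you already have rather than pasting a second copy: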

add to worker.js
export default {
  async fetch(request, env, ctx) {
    const url = new URL(request.url);
    if (url.pathname === '/.well-known/openattribution.json') {
      return new Response(JSON.stringify({
        schema_version: '0.1',
        id: 'https://' + url.hostname + '/.well-known/openattribution.json',
        roles: ['content_owner'],
        operator: { name: url.hostname },
        telemetry: { endpoint: 'https://telemetry.openattribution.org/events', conformance_level: 'retrieval' },
        domains: [url.hostname],
      }), { headers: { 'content-type': 'application/json' } });
    }
    // ...rest of the worker from Step 6 goes here...
  },
};

Once that's deployed and reachable at https://yoursite.com/.well-known/openattribution.json, you can remove the meta tag from your site if you like - the manifest is enough on its own.
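
One way to keep the combined Worker tidy is to pull the manifest into its own function and call it from the top of the existing fetch handler. This is a sketch, with buildManifest and MANIFEST_PATH as hypothetical names and the field values copied from the snippet above.

```javascript
const MANIFEST_PATH = '/.well-known/openattribution.json';

// Builds the manifest object for whichever hostname the request came in on.
function buildManifest(hostname) {
  return {
    schema_version: '0.1',
    id: 'https://' + hostname + MANIFEST_PATH,
    roles: ['content_owner'],
    operator: { name: hostname },
    telemetry: {
      endpoint: 'https://telemetry.openattribution.org/events',
      conformance_level: 'retrieval',
    },
    domains: [hostname],
  };
}

// Inside the Worker's fetch handler, before anything else:
//   const url = new URL(request.url);
//   if (url.pathname === MANIFEST_PATH) {
//     return new Response(JSON.stringify(buildManifest(url.hostname)), {
//       headers: { 'content-type': 'application/json' },
//     });
//   }

console.log(JSON.stringify(buildManifest('yoursite.com'), null, 2));
```

Because the manifest is built from the request's hostname, a visit to www.yoursite.com gets a manifest that names www.yoursite.com, while the bare domain gets one that names yoursite.com.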


Already on Fastly or another CDN?

If your site already sits behind Fastly, Akamai, CloudFront or another CDN, the same idea applies - the detection runs at that layer instead of Cloudflare. Those integrations are in progress; if you're on one and want it sooner, tell us. We'd suggest the Cloudflare route above only if you're not already on a CDN - it's the least technical way to add one.


What this captures (and what it doesn't)

This gives you content_retrieved - the fact that an AI crawler or assistant fetched a page - which is the part you can see today without anyone else's cooperation. The rest of the picture (when your content was actually cited in an answer, whether the reader engaged, the full session) depends on AI platforms adopting the standard on their side. The reporting paths overview walks through which events come from where.