
AWS CloudFront

CloudFront integration via Real-Time Logs piped through Kinesis to your OA telemetry endpoint. Uses AWS WAF Bot Control's CategoryAI rule for classification. More infrastructure than other CDN options, but covers the largest cloud market share.

Requires AWS infrastructure · Zero latency impact · source_role: edge

The challenge

CloudFront is the weakest CDN for edge telemetry because of a fundamental limitation:

| Edge runtime | Outbound HTTP | Async (non-blocking) |
| --- | --- | --- |
| CloudFront Functions | Not supported | N/A |
| Lambda@Edge | Supported | No - blocks response until complete |

Unlike Cloudflare Workers (ctx.waitUntil) or Netlify Edge Functions (context.waitUntil), Lambda@Edge has no fire-and-forget mechanism. Making an outbound HTTP call adds its latency directly to the user's page load.
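The difference can be sketched in a few lines of JavaScript. Here `postTelemetry` is a stand-in for the outbound HTTPS call, simulated with a 100ms delay:

```javascript
// Stand-in for the outbound telemetry POST (~100ms of simulated latency).
const postTelemetry = () => new Promise((resolve) => setTimeout(resolve, 100));

// Cloudflare Workers / Netlify style: fire-and-forget. The promise is handed
// to the runtime and the response returns immediately.
async function handleWithWaitUntil(ctx) {
  ctx.waitUntil(postTelemetry());
  return 'response';
}

// Lambda@Edge style: no waitUntil exists, so the handler must await the call,
// adding its full latency to the user's request.
async function handleLambdaAtEdge() {
  await postTelemetry();
  return 'response';
}
```

In the first case the telemetry latency is absorbed by the runtime after the response is sent; in the second it sits directly on the request path.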


Recommended: Real-Time Logs pipeline

The best path is CloudFront Real-Time Logs streamed through Kinesis, with AWS WAF Bot Control handling AI bot classification:

```
AI Bot ──GET──> CloudFront + WAF Bot Control
                    │
                    ├── Bot Control applies labels
                    │   awswaf:managed:aws:bot-control:bot:category:ai
                    │   awswaf:managed:aws:bot-control:bot:name:<name>
                    │   awswaf:managed:aws:bot-control:bot:verified
                    │
                    └── Real-Time Log entry (includes WAF labels)
                        └── Kinesis Data Stream
                            └── Kinesis Data Firehose
                                └── HTTPS delivery to OA endpoint
```

This is fully async and adds zero latency to responses. The trade-off is infrastructure complexity - you need Kinesis Data Streams and Firehose configured in your AWS account.

Infrastructure required
This approach requires provisioning Kinesis Data Streams and Kinesis Data Firehose in your AWS account. We provide a CloudFormation template that sets up the full pipeline with one deploy.

CloudFormation template

Deploy the full pipeline with a single CloudFormation stack:

```bash
aws cloudformation deploy \
  --template-file oa-cloudfront-telemetry.yaml \
  --stack-name oa-telemetry \
  --parameter-overrides \
      OAOrgId=your-org-id \
      CloudFrontDistributionId=E1234567890 \
  --capabilities CAPABILITY_IAM
```

The template creates:

  • Kinesis Data Stream for CloudFront Real-Time Logs
  • Kinesis Data Firehose delivery stream to the OA telemetry endpoint
  • IAM roles with minimal permissions
  • CloudFront Real-Time Log configuration with WAF label filtering

Bot detection

AWS WAF Bot Control (AWSManagedRulesBotControlRuleSet) has a dedicated CategoryAI rule that labels all AI bot traffic:

| Label | Meaning |
| --- | --- |
| bot:category:ai | Request is from an AI bot |
| bot:name:<name> | Specific bot identity (e.g. bot:name:gptbot) |
| bot:verified | Bot identity cryptographically verified |
| bot:web_bot_auth:verified | Verified via Web Bot Authentication (RFC 9421) |
| signal:cloud_service_provider:<csp> | Origin infrastructure (aws, gcp, azure, oracle) |

Set the CategoryAI rule action to Count (instead of the default Block) so that AI traffic is labelled but not blocked. The Kinesis pipeline then filters log entries by the bot:category:ai label.

CategoryAI blocks by default
Unlike other Bot Control categories, CategoryAI blocks all AI bots by default - including verified ones. Override the action to Count for telemetry-only use. Bots verified via Web Bot Authentication are the one exception and pass through even on Block.
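In web ACL JSON, the Count override is a RuleActionOverrides entry on the managed rule group. A sketch - the rule name, priority, metric name, and inspection level here are illustrative and should match your existing Bot Control configuration:

```json
{
  "Name": "AWSBotControl",
  "Priority": 0,
  "Statement": {
    "ManagedRuleGroupStatement": {
      "VendorName": "AWS",
      "Name": "AWSManagedRulesBotControlRuleSet",
      "ManagedRuleGroupConfigs": [
        { "AWSManagedRulesBotControlRuleSet": { "InspectionLevel": "COMMON" } }
      ],
      "RuleActionOverrides": [
        { "Name": "CategoryAI", "ActionToUse": { "Count": {} } }
      ]
    }
  },
  "OverrideAction": { "None": {} },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "AWSBotControl"
  }
}
```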

Alternative: Lambda@Edge (with latency trade-off)

If the Kinesis pipeline is too heavy, a Lambda@Edge function works but adds latency to AI bot requests. WAF Bot Control labels are not directly accessible in Lambda@Edge - you need a WAF custom rule to forward them as a header:

WAF custom rule (forward AI label as header):

```json
{
  "Name": "ForwardAIBotLabel",
  "Priority": 10,
  "Statement": {
    "LabelMatchStatement": {
      "Scope": "LABEL",
      "Key": "awswaf:managed:aws:bot-control:bot:category:ai"
    }
  },
  "Action": {
    "Count": {
      "CustomRequestHandling": {
        "InsertHeaders": [
          { "Name": "x-oa-bot-category", "Value": "ai" }
        ]
      }
    }
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "ForwardAIBotLabel"
  }
}
```
```js
// Replace with your OA telemetry endpoint URL. Lambda@Edge does not support
// environment variables, so the URL must be bundled with the function.
const OA_TELEMETRY_ENDPOINT = 'https://example.invalid/telemetry';

// Minimal POST helper; Node 18+ Lambda runtimes provide a global fetch.
const postTelemetryEvent = (body) =>
  fetch(OA_TELEMETRY_ENDPOINT, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(body),
  });

exports.handler = async (event) => {
  const request = event.Records[0].cf.request;

  // Check for the WAF-forwarded AI bot header
  const botCategory = request.headers['x-oa-bot-category']?.[0]?.value;

  if (botCategory === 'ai') {
    const ua = request.headers['user-agent']?.[0]?.value || '';

    // This blocks until complete - adds ~50-200ms to bot requests
    await postTelemetryEvent({
      type: 'content_retrieved',
      timestamp: new Date().toISOString(),
      content_url: `https://${request.headers.host[0].value}${request.uri}`,
      source_role: 'edge',
      oa_telemetry_id: request.headers['oa-telemetry-id']?.[0]?.value || undefined,
      data: { user_agent: ua },
    });
  }

  return request;
};
```
Latency impact
Lambda@Edge blocks the response during the outbound HTTP call. This adds 50-200ms to AI bot requests only. For bots, this latency is usually acceptable - they're not human users waiting for a page to render. But the Real-Time Logs pipeline is the cleaner solution.