Unknown data source, October 2, 2024
How to optimize content for search engines with AWS WAF Bot Control and Amazon CloudFront

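This article describes how to use AWS WAF Bot Control and Lambda@Edge to optimize how a website handles search engine crawlers, with the aim of improving its ranking in search results. It notes that search engine crawlers often struggle with dynamic content generated by JavaScript, which can cause indexing delays and lower search rankings. To address this, the article recommends dynamic rendering and presents an implementation based on AWS WAF Bot Control and Lambda@Edge that reliably identifies search engine crawlers and serves them an optimized static HTML version.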

🤔 **Problem**: Search engine crawlers often struggle with dynamic content generated by JavaScript, which can cause indexing delays and lower search rankings.

💡 **Solution**: Use dynamic rendering to serve search engine crawlers an optimized static HTML version.

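🚀 **Implementation**: Implement dynamic rendering with AWS WAF Bot Control and Lambda@Edge, in three steps: 1. Use AWS WAF Bot Control to identify search engine crawlers and label their requests; 2. Use a custom WAF rule to add a custom HTTP request header based on the labels that Bot Control applied; 3. Use Lambda@Edge to check for the custom request header and route search engine crawler requests to a dedicated, crawler-optimized origin that serves a static HTML version.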

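📊 **Results**: Dynamic rendering can improve the site's ranking in search results and reduce the latency and resource usage caused by search engine crawlers.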

🎯 **Additional benefit**: Configuring a cache-control policy can raise the cache hit ratio for static HTML content, further improving site performance.

<section class="blog-post-content"><p>Search engine crawlers – a special bot type used to index your site – are very important visitors. They make sure that your content is searchable by end users. If a crawler can’t easily read your content, then any updates you make might not be immediately reflected in the search results. Depending on the algorithms that the search engine uses, it could also affect where you appear in search results – your search ranking.</p><p>Therefore, it’s important to make sure that search engine crawlers can read your content without additional processing, and that they can access your content as quickly as possible.</p><p>In this post, I will outline the content types that search engines can have difficulty with, and the methods that you can use to work around this. I will discuss how you can identify search engines, as well as the impact on your ability to cache content in a Content Delivery Network (CDN). Then, I will walk you through how to use <a href="https://aws.amazon.com/waf/">AWS Web Application Firewall (WAF)</a> Bot Control to reliably identify search engines, and how to use <a href="https://aws.amazon.com/cloudfront/">Amazon CloudFront</a> with <a href="https://aws.amazon.com/lambda/edge/">Lambda@Edge</a> to direct them to an optimized version of your content, all while maximizing use of the CDN cache.</p><p>According to W3Techs, approximately 97% of all websites today use JavaScript. This enables otherwise static websites to transform into responsive web applications by running code inside of the user’s browser. JavaScript is commonly used to generate HTML content and display it in the browser, also known as client-side rendering. Single Page Applications also make extensive use of JavaScript to render content as the user interacts with the application.</p><p>JavaScript is great for human visitors – but what about search engine crawlers?
Most search engines can deal with JavaScript, but usually they must perform additional processing to render the content before they can parse it. However, not all of them can do it successfully all of the time. Crawlers usually have limited time and compute resources, so they will often queue the page for rendering first, and then parse it when rendering is complete. This can result in a delay between publishing your content and having it appear in search results. Therefore, search engines actually recommend not serving them JavaScript and using <a href="https://developers.google.com/search/docs/advanced/JavaScript/dynamic-rendering">dynamic rendering</a> instead.</p><p>To use dynamic rendering for search engines, you must first identify them. The most common way of doing this is by inspecting the <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent">user-agent header</a>. If the header value indicates that the visitor is a search engine crawler, then you can route it to a version of the page which can serve a suitable version of the content – a static HTML version, for example.</p><p>If you’re using a CDN, then using the user-agent header to identify search engines will impact caching. You would need the CDN to serve a different version of each object for each different value of the user-agent header. In other words, the user-agent header must form part of the <a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/understanding-the-cache-key.html">cache key</a>. The problem here is that there are numerous potential values for the user-agent header – millions, in fact. The CDN would need to make a request to the origin for every value of user-agent for every object. This would increase the cost of running your origin servers, as they must respond to a larger number of requests.
Moreover, it would increase the response times for users, as it would be less likely that their requests could be served from the cache, and they would need to wait while the object is fetched from the origin. In other words, your cache hit ratio would be lower.</p><p>By using <a href="https://aws.amazon.com/waf/features/bot-control/">AWS WAF Bot Control</a>, you can accurately identify search engine crawlers without relying on inspecting the user-agent header, or needing to include it in the cache key for the CDN.</p><p>AWS WAF Bot Control analyzes HTTP requests to identify the source and purpose of a bot. It can identify bots and categorize them based on their type – for example: scraper, SEO, crawler, or site monitor. When it identifies a bot, Bot Control adds a <a href="https://docs.aws.amazon.com/waf/latest/developerguide/waf-rule-labels.html">label</a> to the request that you can utilize later on in a custom WAF rule.</p><p>Now I will walk you through how to enable AWS WAF Bot Control to label your traffic. You will learn how to add a custom WAF rule that will evaluate the labels and add a new, custom request header if Bot Control identified a search engine bot.
And you will also learn how to add your custom header to the cache key for a CloudFront distribution.</p><p>Furthermore, you’ll learn how to send bot traffic to an alternate origin (one that is configured to serve static HTML, for example) by using Lambda@Edge to inspect the custom header.</p><p>The resulting configuration will look like this:</p><div id="attachment_13428" class="wp-caption aligncenter c4"><img aria-describedby="caption-attachment-13428" class="size-full wp-image-13428" src="https://d2908q01vomqb2.cloudfront.net/5b384ce32d8cdef02bc3a139d4cac0a22bb029e8/2022/09/22/Untitled-Diagram.png" alt="Diagram showing requests being made to Amazon CloudFront, and processed with AWS WAF and Lambda@Edge before being sent to the default origin, or a bot-optimized origin" width="628" height="297" /><p id="caption-attachment-13428" class="wp-caption-text">Figure 1 – Diagram showing request flow from human and search engine visitors</p></div><ol><li>Both bots and human users make requests to a CloudFront distribution.</li><li>AWS WAF inspects each request, using Bot Control to identify bots and add labels to the requests.</li><li>AWS WAF inspects the labels and adds a custom HTTP header to the request if the labels indicate that a search engine made the request.</li><li>When CloudFront requests objects from the origin, Lambda@Edge inspects the request, looking for the custom header. If it’s present, Lambda@Edge modifies the origin, instructing CloudFront to send the request to the bot-optimized origin. CloudFront sends all other requests to the default origin.</li></ol><p>Note that while you are changing the origin for the purposes of this example, it’s also possible to use a single origin and differentiate between the website versions (human optimized, bot optimized) in other ways. For example, you can do this by modifying the URI path or by using an HTTP request header.
This is discussed in more detail in the <a href="https://www.youtube.com/watch?v=9npcOZ1PP_c&amp;t=2442s">Building Low Latency Websites breakout session from re:Invent 2021</a>.</p><p>Additionally, because search engines don’t typically receive dynamic content, you can also <a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Expiration.html">maximize the Time-to-Live (TTL) for your static HTML content via the cache-control header</a>. This increases the opportunity for search engines to receive your content from the CDN cache, which in turn provides faster response times. This may improve your SEO rankings, as search engines tend to favor fast responses.</p><h2>Walkthrough</h2><h3>Creating a WAF WebACL</h3><p>In this section, you’ll create a new WAF WebACL and configure AWS WAF Bot Control to identify bots. You’ll also create a custom rule to insert a new request header when Bot Control identifies a search engine crawler.</p><h4>To create a new WebACL and configure AWS WAF Bot Control:</h4><ol><li>Sign in to the <a href="https://aws.amazon.com/console/">AWS Management Console</a> and open the AWS WAF console <a href="https://console.aws.amazon.com/wafv2/homev2">here</a>.</li><li>In the navigation pane, choose <strong>WebACLs</strong>, then choose <strong>Create web ACL.</strong></li><li>In the <strong>Web ACL details</strong> dialog box, do the following:<ol type="a"><li>For <strong>Resource type</strong>, choose <strong>CloudFront distributions.</strong></li><li>Choose a <strong>name</strong> (e.g.,
<strong>SearchBot-ACL</strong>) and a <strong>CloudWatch metric name.</strong></li></ol></li><li>Choose <strong>Next.</strong></li><li>In the <strong>Rules</strong> dialog box, choose <strong>Add rules</strong>, <strong>Add managed rule groups.</strong></li><li>Expand the <strong>AWS managed rule groups</strong> dialog box and turn on <strong>Add to web ACL</strong> next to <strong>Bot Control.</strong></li></ol><p>You’ll notice that Bot Control is a paid rule group – for pricing information, refer <a href="https://aws.amazon.com/waf/pricing">here</a>.</p><p>Note that, by default, this rule group will block all bot traffic. You’ll change this behavior so that you can evaluate Bot Control without blocking any requests.</p><ol start="7"><li>Choose the <strong>Edit</strong> button underneath <strong>Add to web ACL</strong>.</li><li>Turn on <strong>Set all rule actions to count.</strong></li></ol><div id="attachment_13429" class="wp-caption alignnone c6"><img aria-describedby="caption-attachment-13429" class="wp-image-13429 c5" src="https://d2908q01vomqb2.cloudfront.net/5b384ce32d8cdef02bc3a139d4cac0a22bb029e8/2022/09/22/Picture-2-3.png" alt="Screenshot showing all rule actions set to count" width="790" height="199" /><p id="caption-attachment-13429" class="wp-caption-text">Figure 2: Screenshot showing all rule actions set to count</p></div><ol start="9"><li>At the bottom of the page, choose <strong>Save Rule.</strong></li><li>At the bottom of the page, choose <strong>Add Rules.</strong></li></ol><p>Bot Control identifies various bots and bot categories, and assigns <a href="https://docs.aws.amazon.com/waf/latest/developerguide/waf-rule-labels.html">labels</a> to requests based on what was identified. You can use the labels in other WAF rules for fine-grained control over bot traffic to your applications.
Now you’ll configure AWS WAF to evaluate the labels in a custom rule, and add a new request header if Bot Control identifies a search engine crawler.</p><h4>To configure a custom rule to evaluate labels and add a header (visual rule builder):</h4><ol><li>On the <strong>Add rules and rule groups</strong> page, choose <strong>Add rules, Add my own rules and rule groups.</strong></li><li>Use the rule builder visual editor to construct your rule:<ol type="a"><li>For <strong>Name</strong>, choose a name (e.g., <strong>Add-Bot-Header</strong>).</li><li>For <strong>Type</strong>, choose <strong>Regular Rule.</strong></li><li>For <strong>If a request</strong>, choose <strong>Matches at least one of the statements (OR).</strong></li><li>In the <strong>Statement 1</strong> dialog box, do the following:<ol type="i"><li>For <strong>Inspect</strong>, choose <strong>Has a label.</strong></li><li>For <strong>Match scope</strong>, choose <strong>Label.</strong></li><li>For <strong>Match key</strong>, choose <strong>awswaf:managed:aws:bot-control:bot:category:search_engine.</strong></li></ol></li><li>In the <strong>Statement 2</strong> dialog box, do the following:<ol type="i"><li>For <strong>Inspect</strong>, choose <strong>Has a label.</strong></li><li>For <strong>Match scope</strong>, choose <strong>Label.</strong></li><li>For <strong>Match key</strong>, choose <strong>awswaf:managed:aws:bot-control:bot:category:seo.</strong></li></ol></li><li>In the <strong>Action</strong> dialog box, do the following:<ol type="i"><li>For <strong>Action</strong>, choose <strong>Allow.</strong></li><li>Expand <strong>Custom request</strong> and choose <strong>Add new custom header.</strong></li><li>For <strong>Key</strong>, enter a header name (e.g.,
<strong>SearchBot</strong>).</li></ol></li></ol></li></ol><p class="c7"><em>Note that AWS WAF automatically prefixes your custom header name with <strong>x-amzn-waf-</strong>.</em></p><ol start="2" type="1"><li class="c8"><ol start="4" type="i"><li>For <strong>Value,</strong> enter <strong>True.</strong></li></ol></li><li>At the bottom of the page, choose <strong>Add rule.</strong></li></ol><h4>To configure a custom rule to evaluate labels and add a header (rule JSON editor):</h4><ol><li>On the <strong>Add rules and rule groups</strong> page, choose <strong>Add rules, Add my own rules and rule groups.</strong></li><li>For <strong>Rule builder</strong>, choose <strong>Rule JSON Editor.</strong></li><li>Copy and paste the following JSON snippet into the editor in the console:</li></ol><pre class="lang-json">{
  "Name": "Add-Bot-Header",
  "Priority": 0,
  "Action": {
    "Allow": {
      "CustomRequestHandling": {
        "InsertHeaders": [
          { "Name": "SearchBot", "Value": "True" }
        ]
      }
    }
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "Add-Bot-Header"
  },
  "Statement": {
    "OrStatement": {
      "Statements": [
        { "LabelMatchStatement": { "Scope": "LABEL", "Key": "awswaf:managed:aws:bot-control:bot:category:search_engine" } },
        { "LabelMatchStatement": { "Scope": "LABEL", "Key": "awswaf:managed:aws:bot-control:bot:category:seo" } }
      ]
    }
  }
}</pre><ol start="4"><li>At the bottom of the page, choose <strong>Add rule.</strong></li></ol><h4>To finalize your new WebACL:</h4><ol><li>In the <strong>Default web ACL action for requests that don’t match any rules</strong> dialog box, for <strong>Default action</strong>, choose <strong>Allow.</strong></li></ol><div id="attachment_13431" class="wp-caption alignnone c10"><img aria-describedby="caption-attachment-13431" class="wp-image-13431 size-full c5" src="https://d2908q01vomqb2.cloudfront.net/5b384ce32d8cdef02bc3a139d4cac0a22bb029e8/2022/09/22/Picture-3-2.png" alt="Screenshot showing two WAF rules (one
managed, one custom), with the default action set to Allow" width="941" height="654" /><p id="caption-attachment-13431" class="wp-caption-text">Figure 3: Managed &amp; Custom WAF rules with default action set to Allow</p></div><ol start="2"><li>Choose <strong>Next</strong> three times to accept the default values on the remaining pages.</li><li>Choose <strong>Create web ACL.</strong></li></ol><p>Now you have AWS WAF configured to identify and report on various bots, and to insert a new request header <strong>x-amzn-waf-searchbot: true</strong> on requests from search engine bots.</p><p>Next, you’ll create a new Lambda@Edge function to look for the presence of the x-amzn-waf-searchbot header, and send the request to a new custom origin if it’s present.</p><h3>Creating a Lambda@Edge function</h3><h4>To create the Lambda@Edge function:</h4><ol><li>Open the Lambda console <a href="https://console.aws.amazon.com/lambda/home?region=us-east-1">here</a>.</li></ol><p class="c11"><em>Note that you must use the <strong>N. Virginia (us-east-1)</strong> region to author Lambda functions for use with Lambda@Edge. See <a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/lambda-at-the-edge.html">here</a></em> <em>for further details.</em></p><ol start="2"><li>Choose <strong>Create Function.</strong></li><li>Choose <strong>Author from scratch.</strong></li><li>In the <strong>Basic Information</strong> dialog box, do the following:<ol type="a"><li>For <strong>Function name</strong>, enter a name for your function (e.g.,
<strong>Edge-SearchBots</strong>).</li><li>For <strong>Runtime</strong>, choose <strong>Node.js 16.x.</strong></li><li>For <strong>Architecture</strong>, choose <strong>x86_64.</strong></li><li>Under Permissions, expand <strong>Change default execution role.</strong></li><li>For <strong>Execution role</strong>, choose <strong>Create a new role from AWS policy templates.</strong></li><li>For <strong>Role name</strong>, enter a name (e.g., <strong>EdgeLambda-role).</strong></li><li>For <strong>Policy templates</strong>, choose <strong>Basic Lambda@Edge permissions (for CloudFront trigger).</strong></li><li>Choose <strong>Create function.</strong></li></ol></li></ol><p class="c12"><em>Refer <a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/lambda-edge-permissions.html">here</a></em> <em>for more information on the permissions required for Lambda@Edge.</em></p><div id="attachment_13432" class="wp-caption alignnone c14"><img aria-describedby="caption-attachment-13432" class="wp-image-13432 size-full c13" src="https://d2908q01vomqb2.cloudfront.net/5b384ce32d8cdef02bc3a139d4cac0a22bb029e8/2022/09/22/Picture-4-1.png" alt="Screenshot showing the create function page with example values inserted" width="843" height="822" /><p id="caption-attachment-13432" class="wp-caption-text">Figure 4: Lambda function configuration</p></div><ol start="5"><li>In the <strong>Code source</strong> dialog box, replace the sample code in the editor with the code that follows:</li></ol><pre class="lang-js">exports.handler = (event, context, callback) =&gt; {
  // DNS name of the origin that serves static HTML to crawlers.
  const SEO_ORIGIN = 'seo-origin.example.com';
  // Header inserted by the custom WAF rule.
  const CRAWLER_HEADER = 'x-amzn-waf-searchbot';
  const {request} = event.Records[0].cf;

  if (request.headers[CRAWLER_HEADER]) {
    // Crawler request: send it to the bot-optimized origin.
    request.origin = {
      custom: {
        domainName: SEO_ORIGIN,
        port: 80,
        protocol: 'http',
        path: '',
        sslProtocols: ['TLSv1.2'],
        readTimeout: 5,
        keepaliveTimeout: 60,
        customHeaders: {}
      }
    };
    request.headers.host = [{ key: 'host', value: SEO_ORIGIN }];
  }

  callback(null, request);
};</pre><ol start="6"><li>Inside the code block, replace <strong>seo-origin.example.com</strong> with the DNS name of an origin which should receive requests from search engine crawlers. Make sure that the other settings such as <strong>port</strong>, <strong>protocol,</strong> and <strong>timeouts</strong> are appropriate for your origin.</li><li>Choose <strong>Deploy</strong> to deploy your code.</li><li>Choose <strong>Actions, Publish New Version.</strong></li><li>For <strong>Version description</strong>, enter a suitable description (e.g., <strong>v1</strong>).</li><li>Choose <strong>Publish.</strong></li></ol><p class="c11"><em>Note that this step is necessary to allow Lambda@Edge to replicate your function code across multiple AWS regions so that it can be used by CloudFront.</em></p><ol start="11"><li>Once the new version of your function has been published, copy the <strong>Function ARN</strong> and keep it handy, as this will be needed later on when configuring your CloudFront behavior.</li></ol><div id="attachment_13433" class="wp-caption alignnone c10"><img aria-describedby="caption-attachment-13433" class="size-full wp-image-13433 c5" src="https://d2908q01vomqb2.cloudfront.net/5b384ce32d8cdef02bc3a139d4cac0a22bb029e8/2022/09/22/Picture-5-1.png" alt="Screenshot showing the function overview with version selected. The copy button for the function ARN is circled in red." width="941" height="283" /><p id="caption-attachment-13433" class="wp-caption-text">Figure 5: Published function and function ARN</p></div><p>Now you have a Lambda@Edge function and a WebACL created. However, neither has been associated with a CloudFront distribution just yet, so neither is currently in use.</p><p>Before you do that, you must add your new header, x-amzn-waf-searchbot, to the cache policy for the appropriate CloudFront behavior(s).
This will instruct CloudFront to store a different copy of each object, depending on the value of the header. Because the custom WAF rule either adds the header with a single fixed value or leaves it off, CloudFront will store at most two copies of each object – one for requests that carry the header, and another for requests that don’t. This is much more efficient than using the user-agent header, where there are numerous potential values.</p><p>For this example, you’ll walk through creating a new, custom cache policy that includes the x-amzn-waf-searchbot header. Depending on your requirements, you may need to adjust the other settings, or you might add the custom header to an existing custom policy.</p><h3>Creating a custom cache policy</h3><h4>To create a custom cache policy:</h4><ol><li>Open the CloudFront console <a href="https://console.aws.amazon.com/cloudfront/">here</a>.</li><li>In the navigation pane, choose <strong>Policies</strong>.</li><li>In the <strong>Custom policies</strong> dialog box, choose <strong>Create cache policy.</strong></li><li>For <strong>Name</strong>, enter a name (e.g.,
<strong>Cache-Searchbot-Header</strong>).</li><li>In the <strong>Cache key settings</strong> dialog box, do the following:<ol type="a"><li>For <strong>Headers</strong>, choose <strong>Include the following headers</strong>.</li><li>For <strong>Add Header</strong>, choose <strong>Add custom.</strong></li><li>For <strong>Custom header,</strong> enter <strong>x-amzn-waf-searchbot.</strong></li><li>Choose <strong>Add</strong>.</li></ol></li></ol><div id="attachment_13434" class="wp-caption alignnone c10"><img aria-describedby="caption-attachment-13434" class="wp-image-13434 size-full c13" src="https://d2908q01vomqb2.cloudfront.net/5b384ce32d8cdef02bc3a139d4cac0a22bb029e8/2022/09/22/Picture-6-1.png" alt="Screenshot showing cache key settings with sample values inserted" width="941" height="589" /><p id="caption-attachment-13434" class="wp-caption-text">Figure 6: Cache key configuration</p></div><ol start="6"><li>Choose <strong>Create.</strong></li></ol><p>Now you have all of the components ready to add the new functionality to your CloudFront distribution.</p><h3>Adding the new functionality to your CloudFront distribution</h3><h4>To associate your WebACL, Lambda@Edge function, and cache policy with an existing CloudFront distribution:</h4><ol><li>Open the CloudFront console <a href="https://console.aws.amazon.com/cloudfront/">here</a>.</li><li>Choose the <strong>ID</strong> of the CloudFront distribution to which you wish to apply this functionality.</li><li>In the <strong>Settings</strong> dialog box, choose <strong>Edit.</strong></li><li>For <strong>AWS WAF web ACL</strong>, choose the Web ACL you created earlier (e.g., <strong>SearchBot-ACL</strong>).</li><li>At the bottom of the page, choose <strong>Save changes.</strong></li><li>Choose the <strong>Behaviors</strong> tab.</li><li>Choose the behavior that you want to apply this functionality to, then choose <strong>Edit.</strong></li><li>For <strong>Cache key and origin
request</strong>, do the following:<ol type="a"><li>Choose <strong>Cache policy and origin request policy (recommended).</strong></li><li>For <strong>Cache policy</strong>, choose the policy that you created earlier (e.g., <strong>Cache-Searchbot-Header</strong>).</li></ol></li></ol><div id="attachment_13435" class="wp-caption alignnone c15"><img aria-describedby="caption-attachment-13435" class="wp-image-13435 c13" src="https://d2908q01vomqb2.cloudfront.net/5b384ce32d8cdef02bc3a139d4cac0a22bb029e8/2022/09/22/Picture-7-1.png" alt="Screenshot showing the custom cache policy selected in CloudFront" width="709" height="373" /><p id="caption-attachment-13435" class="wp-caption-text">Figure 7: Custom cache policy selected</p></div><ol start="9"><li>For <strong>Function Associations</strong>, do the following:<ol type="a"><li>For <strong>Origin request</strong>, choose <strong>Lambda@Edge.</strong></li><li>For <strong>Function ARN / Name</strong>, paste the ARN of your function’s version that you copied earlier.</li></ol></li></ol><div id="attachment_13437" class="wp-caption alignnone c10"><img aria-describedby="caption-attachment-13437" class="size-full wp-image-13437 c13" src="https://d2908q01vomqb2.cloudfront.net/5b384ce32d8cdef02bc3a139d4cac0a22bb029e8/2022/09/22/Picture-8-1.png" alt="Screenshot showing the sample Lambda@Edge function associated with Origin Request" width="941" height="456" /><p id="caption-attachment-13437" class="wp-caption-text">Figure 8: Lambda@Edge function associated with Origin Request</p></div><ol start="10"><li>Choose <strong>Save changes.</strong></li></ol><p>After a few minutes, your changes will have propagated to all of the CloudFront edge locations, and search engine bots will be sent to your custom origin.</p><h3>Testing</h3><p>To test this accurately, you must instruct a search engine to make a legitimate request to your application. This is because AWS WAF Bot Control performs validation on search engine bots.
This means that you can’t simply spoof the user-agent header to test. You also can’t attempt to include the x-amzn-waf-searchbot header in your request, as CloudFront will remove it.</p><p>Both Google and Microsoft offer consoles that can perform a live test against a URL on your website.</p><h4>To test using Google:</h4><ol><li>Navigate to the <a href="https://search.google.com/search-console">search console</a>.</li><li>Choose <strong>URL Inspection.</strong></li><li>Enter a URL which will match the CloudFront Behavior that you modified above.</li><li>Choose <strong>Test Live URL</strong> and <strong>View tested page.</strong></li></ol><h4>To test using Microsoft:</h4><ol><li>Navigate to the URL Inspection function in <a href="https://www.bing.com/webmasters/urlinspection">Webmaster Tools</a>.</li><li>Enter a URL which will match the CloudFront Behavior that you modified above.</li><li>Choose <strong>Inspect</strong>, then <strong>Live URL</strong>, then <strong>View Tested Page.</strong></li></ol><p>Make sure that the content of the page received by the search engine matches your expectations. The HTML view should display the static HTML version of the page that you expect, rather than the JavaScript-enabled version that you would serve to human visitors.</p><h2>Conclusion</h2><p>By using <a href="https://aws.amazon.com/waf/">AWS WAF</a> <a href="https://aws.amazon.com/waf/features/bot-control/">Bot Control</a> to identify search engine crawlers, you now have a robust method for identifying bots, rather than relying on the user-agent header alone.</p><p>By injecting a new, custom request header in AWS WAF based on the labels applied to traffic by AWS WAF Bot Control, you have been able to optimize your cache key in <a href="https://aws.amazon.com/cloudfront/">CloudFront</a> (as compared to including user-agent in the cache key).
This increases the likelihood of your content being served from cache, and reduces the number of requests that must be forwarded to your origin.</p><p>Rather than inspecting the header in your application code, you have moved this processing into <a href="https://aws.amazon.com/lambda/edge/">Lambda@Edge</a>, and you can now forward search engine traffic to a different instance of your application, configured specifically to render HTML content. Furthermore, you can increase the TTL of your HTML content in the CloudFront cache through the use of the cache-control header. If the origin can’t add or modify the cache-control header, then you can still optimize this through Lambda@Edge if required.</p><p>All of this serves to make sure that search engine crawlers receive your content in the format that works best for them, with the lowest possible response time, while continuing to use JavaScript-based page frameworks for human visitors.</p><div class="blog-author-box c17"><p class="Paul-Le-Page.jpg"><img class="alignleft wp-image-1288 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/5b384ce32d8cdef02bc3a139d4cac0a22bb029e8/2022/09/22/Paul-Le-Page.jpg" alt="Paul Le Page" width="94" height="125" /></p><h3 class="lb-h4">Paul Le Page</h3><p class="c16">Paul Le Page is a Senior Solutions Architect at AWS. He works with enterprise retail customers, helping them with migrations to the cloud and adoption of cloud native services. In his free time, Paul enjoys travel, food and the great outdoors.</p></div></section>
