Cloudflare says Perplexity’s AI bots are ‘stealth crawling’ blocked sites

The Verge - Artificial Intelligences, two days ago at 02:16


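According to a report from Cloudflare, the AI search startup Perplexity is allegedly evading restrictions meant to stop its AI web crawlers from accessing certain websites. The report says that when Perplexity encounters a block, it tries to conceal its crawler identity to circumvent the website’s preferences. This adds to concerns about Perplexity collecting content without permission, since the company was previously found bypassing paywalls and ignoring sites’ robots.txt files. Cloudflare found that Perplexity impersonates Google Chrome and changes its IP addresses and originating networks to evade blocks. Perplexity, for its part, called the report a “publicity stunt.”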

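🌐 **Perplexity allegedly evades site access restrictions:** Cloudflare’s report says that when the AI search company Perplexity runs into a site’s restrictions, it tries to hide the identity of its AI crawlers to get around the site’s access preferences, a behavior Cloudflare describes as “stealth crawling.”

🚫 **Bypassing robots.txt and paywalls:** Earlier reports had already accused Perplexity of bypassing paywalls and ignoring sites’ robots.txt files. The new report details the methods it allegedly uses to evade access restrictions, adding to concerns about how it collects data.

🕵️ **Disguised identity and changing IP addresses:** Cloudflare’s tests found that Perplexity first tries to access a site as “PerplexityBot” or “Perplexity-User.” If it is restricted, it impersonates “Google Chrome on macOS” and uses rotating IP addresses and different autonomous system numbers (ASNs) to evade blocks.

📈 **Evasion at scale:** Cloudflare claims this behavior occurred across tens of thousands of domains and millions of requests per day, which indicates how large and widespread the scraping is.

📢 **Perplexity’s response and Cloudflare’s actions:** A Perplexity spokesperson called Cloudflare’s report a “publicity stunt” containing many misunderstandings. Cloudflare has removed Perplexity from its list of verified bots and rolled out methods to block its “stealth crawling.”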

The AI search startup Perplexity is allegedly skirting restrictions meant to stop its AI web crawlers from accessing certain websites, according to a report from Cloudflare. In the report, Cloudflare claims that when Perplexity encounters a block, the startup will conceal its crawling identity “in an attempt to circumvent the website’s preferences.”

The report only adds to concerns about Perplexity vacuuming up content without permission, as the company got caught barging past paywalls and ignoring sites’ robots.txt files last year. At the time, Perplexity CEO Aravind Srinivas blamed the activity on third-party crawlers used by the site.

Now Cloudflare, one of the world’s biggest internet architecture providers, says it received complaints from customers who claimed that Perplexity’s bots could still access their websites even after they had stated their preferences in their robots.txt files and created Web Application Firewall (WAF) rules to block the startup’s AI bots.

To test this, Cloudflare says it created new domains with similar restrictions against Perplexity’s AI scrapers. It found that the startup first attempts to access the sites while identifying itself by the names of its crawlers: “PerplexityBot” or “Perplexity-User.”
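The kind of robots.txt restriction described here can be sketched in a few lines. The article does not give the exact rules Cloudflare used on its test domains, so the file contents, the `is_allowed` helper, and the `example.com` URL below are all assumptions for illustration. The sketch uses Python’s standard `urllib.robotparser` to show that such rules are keyed on the name a crawler reports for itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a test domain. The two crawler names come from
# the article; the rules themselves are an assumption, not Cloudflare's file.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str = "https://example.com/article") -> bool:
    """Return True if the robots.txt above permits this user agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for name in ("PerplexityBot", "Perplexity-User", "SomeOtherBot"):
        print(name, is_allowed(name))
```

Because robots.txt is a voluntary convention, a crawler that reports a different name, or that never reads the file, is not stopped by these rules.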

But if the website has restrictions against AI scraping, Cloudflare claims Perplexity will change its user agent (the bit of information that tells a website what kind of browser and device you’re using, or whether the visitor is a bot) to “impersonate Google Chrome on macOS.” Cloudflare says this “undeclared crawler” uses “rotating” IP addresses that Perplexity doesn’t include on the list of IP addresses used by its bots.
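To see why a changed user agent and unlisted IP addresses matter, consider a minimal sketch of a filter that relies only on declared identity. It is not Cloudflare’s detection method, which the article does not describe. The `classify` function, the user-agent strings, and the IP range are hypothetical; the range is a reserved documentation block (TEST-NET-1), not Perplexity’s published list.

```python
import ipaddress

# Crawler names taken from the article, lowercased for matching.
DECLARED_TOKENS = ("perplexitybot", "perplexity-user")

# Placeholder range reserved for documentation (TEST-NET-1). This is an
# assumption standing in for an operator's published crawler IP list.
PUBLISHED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def classify(user_agent: str, remote_ip: str) -> str:
    """Classify a request using only its self-reported name and source IP."""
    ua = user_agent.lower()
    declared = any(token in ua for token in DECLARED_TOKENS)
    ip = ipaddress.ip_address(remote_ip)
    in_published = any(ip in network for network in PUBLISHED_RANGES)

    if declared and in_published:
        return "declared-crawler"
    if declared:
        return "claims-crawler-unverified-ip"
    # No crawler name in the user agent: a name-based rule cannot tell this
    # request apart from an ordinary browser visit.
    return "looks-like-browser"


if __name__ == "__main__":
    chrome_ua = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    )
    print(classify("Mozilla/5.0 (compatible; PerplexityBot/1.0)", "192.0.2.10"))
    print(classify(chrome_ua, "198.51.100.7"))
```

A request carrying a generic Chrome string from an address outside the published range falls into the last branch, which is why blocking it requires signals beyond the user agent and a static IP list.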

Additionally, Cloudflare claims that Perplexity switches between autonomous system numbers (ASNs), the numbers used to identify groups of IP networks controlled by a single operator, to get around blocks as well. “This activity was observed across tens of thousands of domains and millions of requests per day,” Cloudflare writes.

In a statement to The Verge, Perplexity spokesperson Jesse Dwyer called Cloudflare’s report a “publicity stunt,” adding that “there are a lot of misunderstandings in the blog post.” Cloudflare has since de-listed Perplexity as a verified bot and has rolled out methods to block Perplexity’s “stealth crawling.”

Cloudflare CEO Matthew Prince has been outspoken about AI’s “existential threat” to publishers. Last month, the company started letting websites ask AI companies to pay to crawl their content, and began blocking AI crawlers by default.
