TechCrunch News, January 11
How OpenAI’s bot crushed this seven-person company’s web site ‘like a DDoS attack’

Triplegangers' ecommerce site was knocked offline by OpenAI's crawler. The bot, operating from hundreds of IP addresses, tried to download hundreds of thousands of images and their detailed descriptions from the site's more than 65,000 product pages, with an effect resembling a DDoS attack. The company maintains a large database of 3D scans of human models and explicitly forbids scraping without permission. Even so, because the site's robots.txt file was not properly configured, OpenAI's crawler scraped it at scale, overloading the servers and driving up the hosting bill. The incident exposes weaknesses in how AI companies gather data and the challenges small businesses face in protecting their own.

⚠️ OpenAI's crawler: OpenAI's bot, GPTBot, used hundreds of IP addresses to scrape the Triplegangers site at scale, attempting to download the images and descriptions on every product page.

⚙️ robots.txt configuration: the site's robots.txt file was not properly configured, leaving OpenAI's crawler free to take data. A robots.txt file is supposed to tell crawlers what not to fetch, but the oversight left the bot unchecked.

💸 Financial cost: the flood of crawler requests took the Triplegangers site down, and the CPU and download activity sharply increased the company's AWS bill.

🛡️ Data protection: although Triplegangers explicitly forbids scraping without permission, the misconfigured robots.txt file and uneven compliance by AI companies left its data exposed, underscoring the challenges small businesses face in protecting their data.

🔍 Industry warning: the incident highlights problems with how AI companies scrape data, and shows how businesses can spot and block AI crawlers by monitoring server logs. It is a reminder for site owners to protect their data proactively rather than wait for AI companies to follow the rules.

On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s ecommerce site was down. It looked to be some kind of distributed denial-of-service attack. 

He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site. 

“We have over 65,000 products, each product has a page,” Tomchuk told TechCrunch. “Each page has at least three photos.” 

OpenAI was sending “tens of thousands” of server requests trying to download all of it, hundreds of thousands of photos, along with their detailed descriptions. 

“OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it’s way more,” he said of the IP addresses the bot used to attempt to consume his site. 

“Their crawlers were crushing our site,” he said. “It was basically a DDoS attack.”

Triplegangers’ website is its business. The seven-employee company has spent over a decade assembling what it calls the largest database of “human digital doubles” on the web, meaning 3D image files scanned from actual human models. 

It sells the 3D object files, as well as photos – everything from hands to hair, skin, and full bodies – to 3D artists, video game makers, anyone who needs to digitally recreate authentic human characteristics.

Tomchuk’s team, based in Ukraine but also licensed in the U.S. out of Tampa, Florida, has a terms of service page on its site that forbids bots from taking its images without permission. But that alone did nothing. Websites must use a properly configured robots.txt file with tags specifically telling OpenAI’s bot, GPTBot, to leave the site alone. (OpenAI also has a couple of other bots, ChatGPT-User and OAI-SearchBot, that have their own tags, according to its information page on its crawlers.)

The robots.txt file, which implements the Robots Exclusion Protocol, was created to tell search engines what not to crawl as they index the web. OpenAI says on its informational page that it honors such files when they are configured with its own set of do-not-crawl tags, though it also warns that it can take its bots up to 24 hours to recognize an updated robots.txt file.

As Tomchuk experienced, if a site isn’t properly using robots.txt, OpenAI and others take that to mean they can scrape to their hearts’ content. It’s not an opt-in system.
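As a sketch, a robots.txt file that tells all three of OpenAI's documented crawlers to stay away from an entire site uses one `User-agent` block per bot (the user-agent tokens are the ones OpenAI publishes on its crawler information page):

```text
# robots.txt — served at https://example.com/robots.txt
# Disallow OpenAI's crawlers site-wide
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /
```

Note that this only works for crawlers that choose to honor it; the protocol has no enforcement mechanism.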

To add insult to injury, not only was Triplegangers knocked offline by OpenAI’s bot during US business hours, but Tomchuk expects a jacked-up AWS bill thanks to all of the CPU and downloading activity from the bot.

Robots.txt also isn’t a failsafe, because compliance by AI companies is voluntary. Another AI startup, Perplexity, pretty famously got called out last summer by a Wired investigation when some evidence implied Perplexity wasn’t honoring it.

Each of these is a product, with a product page that includes multiple more photos. Used by permission. Image credits: Triplegangers

By Wednesday, after days of OpenAI’s bot returning, Triplegangers had a properly configured robots.txt file in place, and also a Cloudflare account set up to block OpenAI’s GPTBot and several other bots he discovered, like Barkrowler (an SEO crawler) and Bytespider (TikTok’s crawler). Tomchuk is also hopeful he’s blocked crawlers from other AI model companies. On Thursday morning, the site didn’t crash, he said.
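Cloudflare exposes this kind of blocking through its firewall rules; the same idea can be sketched directly at the web-server level. Assuming an nginx server, a fragment like the following rejects requests whose user agent matches any of the bots named above (the bot list is illustrative, not exhaustive, and a determined scraper can spoof its user agent):

```nginx
# Inside a server block: return 403 to known AI crawler user agents.
# Case-insensitive match; extend the alternation as new bots appear.
if ($http_user_agent ~* (GPTBot|ChatGPT-User|OAI-SearchBot|Bytespider|Barkrowler)) {
    return 403;
}
```

Unlike robots.txt, a server-side block does not depend on the crawler's cooperation, though it still relies on the bot identifying itself honestly.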

But Tomchuk still has no reasonable way to find out exactly what OpenAI successfully took or to get that material removed. He’s found no way to contact OpenAI and ask. OpenAI did not respond to TechCrunch’s request for comment. And OpenAI has so far failed to deliver its long-promised opt-out tool, as TechCrunch recently reported.

This is an especially tricky issue for Triplegangers. “We’re in a business where the rights are kind of a serious issue, because we scan actual people,” he said. With laws like Europe’s GDPR, “they cannot just take a photo of anyone on the web and use it.”

Triplegangers’ website was also an especially delicious find for AI crawlers. Multibillion-dollar-valued startups, like Scale AI, have been created where humans painstakingly tag images to train AI. Triplegangers’ site contains photos tagged in detail: ethnicity, age, tattoos vs scars, all body types, and so on.

The irony is that the OpenAI bot’s greediness is what alerted Triplegangers to how exposed it was. Had it scraped more gently, Tomchuk never would have known, he said.

“It’s scary because there seems to be a loophole that these companies are using to crawl data by saying ‘you can opt out if you update your robots.txt with our tags,’” says Tomchuk, but that puts the onus on the business owner to understand how to block them.

Triplegangers’ server logs showed how ruthlessly an OpenAI bot was accessing the site, from hundreds of IP addresses. Used by permission.

He wants other small online businesses to know that the only way to discover if an AI bot is taking a website’s copyrighted belongings is to actively look. He’s certainly not alone in being terrorized by them. Owners of other websites recently told Business Insider how OpenAI bots crashed their sites and ran up their AWS bills.

The problem grew by orders of magnitude in 2024. New research from digital advertising company DoubleVerify found that AI crawlers and scrapers caused an 86% increase in “general invalid traffic” in 2024 — that is, traffic that doesn’t come from a real user.

Still, “most sites remain clueless that they were scraped by these bots,” warns Tomchuk. “Now we have to daily monitor log activity to spot these bots.”
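The log monitoring Tomchuk describes can be sketched with a short script. The example below, a minimal illustration assuming nginx/Apache combined-format access logs, counts requests per source IP for user agents matching known AI crawlers; the sample log lines and the `crawler_ips` helper are hypothetical, not part of any real tooling:

```python
import re
from collections import Counter

# Hypothetical sample of combined-format access log lines; in practice
# these would be read from a file such as /var/log/nginx/access.log.
SAMPLE_LOG = """\
20.171.206.4 - - [11/Jan/2025:09:14:02 +0000] "GET /model/123 HTTP/1.1" 200 5120 "-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"
52.14.88.1 - - [11/Jan/2025:09:14:03 +0000] "GET /model/124 HTTP/1.1" 200 4801 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"
20.171.206.9 - - [11/Jan/2025:09:14:05 +0000] "GET /model/125 HTTP/1.1" 200 6230 "-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"
"""

# User-agent substrings of the AI crawlers mentioned in the article.
AI_BOTS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "Bytespider", "Barkrowler")

# Combined log format: the source IP is the first field and the
# user agent is the final quoted field on the line.
LOG_RE = re.compile(r'^(?P<ip>\S+) .*?"(?P<ua>[^"]*)"$')

def crawler_ips(log_text: str) -> Counter:
    """Count requests per source IP whose user agent matches a known AI bot."""
    hits: Counter = Counter()
    for line in log_text.splitlines():
        m = LOG_RE.match(line)
        if m and any(bot in m.group("ua") for bot in AI_BOTS):
            hits[m.group("ip")] += 1
    return hits

if __name__ == "__main__":
    for ip, n in crawler_ips(SAMPLE_LOG).most_common():
        print(f"{ip}\t{n} request(s)")
```

A spike of many distinct IPs all presenting the same crawler user agent, as in Triplegangers' logs, is the pattern this kind of script surfaces.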

When you think about it, the whole model operates a bit like a mafia shakedown: the AI bots will take what they want unless you have protection.

“They should be asking permission, not just scraping data,” Tomchuk says.
