When “Do Not Crawl” Falls on Deaf Bots: Cloudflare Exposes Perplexity’s Stealth Moves
In a revealing investigation, internet infrastructure leader Cloudflare has accused the AI-powered search engine Perplexity of crawling websites that explicitly forbade it via their robots.txt files or firewall rules. Cloudflare’s findings shed light on how content ownership and digital trust are being tested in the AI era.
The Deceptive Crawl
Cloudflare’s researchers set up hidden domains with strict no-crawl policies, even deploying Web Application Firewall (WAF) rules to block Perplexity’s known bots: PerplexityBot and Perplexity-User. Yet, astonishingly, Perplexity still scraped detailed information from these sites.
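The no-crawl policy itself is simple to state. A minimal robots.txt of the kind Cloudflare deployed might look like this (the bot names come from the article; the file contents are an illustrative sketch, not Cloudflare’s actual test configuration):

```
# Disallow Perplexity's declared crawlers site-wide
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
```

A WAF rule adds a second, enforced layer on top of this advisory file, blocking requests from those user agents outright.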
Further analysis revealed stealth tactics in full swing:
- User-Agent Spoofing: When blocked, Perplexity masked its identity by impersonating a generic Chrome browser on macOS.
- IP and ASN Rotation: The bots switched between undeclared IP addresses and Autonomous System Numbers to evade detection.
These tactics allegedly let Perplexity reach tens of thousands of domains, generating millions of scraping requests per day.
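Cloudflare’s actual heuristics are proprietary, but the basic mismatch they exploit can be sketched: a request whose User-Agent claims to be a desktop browser, arriving from an IP range associated with automated infrastructure rather than residential users. The network ranges and thresholds below are purely illustrative assumptions, not Cloudflare’s or Perplexity’s real data:

```python
import ipaddress

# Illustrative only: ranges a defender believes belong to datacenter
# or crawler infrastructure (real systems map IPs to ASNs instead).
SUSPECT_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

# Substrings that indicate the UA string is claiming a browser identity.
BROWSER_MARKERS = ("Chrome/", "Safari/", "Firefox/")

def looks_like_stealth_crawl(user_agent: str, source_ip: str) -> bool:
    """Flag requests that present a desktop-browser User-Agent while
    originating from a network flagged as automated infrastructure."""
    ip = ipaddress.ip_address(source_ip)
    claims_browser = any(m in user_agent for m in BROWSER_MARKERS)
    from_suspect_net = any(ip in net for net in SUSPECT_NETWORKS)
    return claims_browser and from_suspect_net

chrome_ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
             "Chrome/124.0 Safari/537.36")

# A "Chrome on macOS" UA from a flagged range is suspicious...
print(looks_like_stealth_crawl(chrome_ua, "192.0.2.10"))    # True
# ...while the same UA from an ordinary address is not.
print(looks_like_stealth_crawl(chrome_ua, "198.51.100.5"))  # False
```

Real deployments combine many more signals (TLS fingerprints, request timing, ASN reputation), but the principle is the same: the claimed identity and the observed network origin must be consistent.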
Cloudflare’s Retaliation
In light of these findings, Cloudflare took decisive action:
- Delisted Perplexity as a “verified bot” within its ecosystem.
- Deployed new heuristics in its managed rules to specifically block stealth crawling attempts.
Cloudflare is also championing AI transparency and standards, urging developers to separate user-driven agents from crawling bots and to use clear bot identities that respect website preferences.
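The behavior Cloudflare is asking for can be sketched in a few lines: a crawler that announces itself under a stable, honest name and checks robots.txt before fetching anything. The bot name and policy below are illustrative assumptions; Python’s standard-library robots.txt parser does the checking:

```python
from urllib import robotparser

# Illustrative bot identity: declared openly, never spoofed.
BOT_USER_AGENT = "ExampleAIBot"

# A site's (hypothetical) robots.txt, parsed from a string here;
# a real crawler would fetch it from https://<site>/robots.txt.
robots_txt = """\
User-agent: ExampleAIBot
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Honor the site's stated preferences before issuing any request.
print(rp.can_fetch(BOT_USER_AGENT, "https://example.com/articles/1"))  # True
print(rp.can_fetch(BOT_USER_AGENT, "https://example.com/private/x"))   # False
```

The point of the standard Cloudflare is urging is exactly this check-before-fetch discipline, applied under a bot identity that site owners can recognize and block.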
Perplexity Fires Back
Unsurprisingly, Perplexity pushed back, branding the report a “publicity stunt” and accusing Cloudflare of basic technical misunderstandings. The company argued that the blocked requests came from third-party services such as BrowserBase, not from Perplexity’s own crawlers.
On discussion forums, many users questioned whether user-driven agents and crawlers should be treated as the same thing, arguing that fetching a page in real time on a user’s behalf is not the same as scraping at scale. Nonetheless, evading clearly stated directives remains deeply controversial.
The High Stakes of Stealth Scraping
Cloudflare sees this as more than a crawling violation – it’s a threat to the web’s underlying trust framework. When bots ignore robots.txt, they erode webmasters’ control over how their content is accessed and used.
As AI platforms grow more sophisticated, the clash between content protection and data hunger intensifies. Legal scholars point to uncharted territory: robots.txt may lack binding legal force, but ignoring it can expose companies to reputational damage and possibly even legal disputes.
Final Thoughts: A Call for Ethical AI Crawling
At its core, this confrontation between Cloudflare and Perplexity underscores a pressing need: AI tools must respect the web’s rules. Ethical data gathering, clear bot identification, and meaningful opt-in frameworks – not stealthy backdoors – should define future norms.
If the internet is to remain balanced, content creators must retain autonomy over their web spaces. And AI companies, in turn, must uphold transparency – because trust cannot be crawled under the radar and still survive.
