
The author has misunderstood when the Perplexity user agent applies.

Website owners shouldn't dictate which browser users access their site with - whether that's Chrome, Firefox, or something totally different like Perplexity.

When retrieving a web page _for the user_ it’s appropriate to use a UA string that looks like a browser client.

If Perplexity is collecting training data in bulk without using their UA, that's a different thing, and they should stop. But this article doesn't show that.




Just to go a little bit more into detail on this, because the article and most of the conversation here is based on a big misunderstanding:

robots.txt governs crawlers. Fetching a single user-specified URL is not crawling. Crawling is when you automatically follow links to continue fetching subsequent pages.

Perplexity’s documentation that the article links to describes how their crawler works. That is not the piece of software that fetches individual web pages when a user asks for them. That’s just a regular user-agent, because it’s acting as an agent for the user.

The distinction between crawling and not crawling has been very firmly established for decades. You can see it in action with wget. If you fetch a specific URL with `wget https://www.example.com` then wget will just fetch that URL. It will not fetch robots.txt at all.

If you tell wget to act recursively with `wget --recursive https://www.example.com` to crawl that website, then wget will fetch `https://www.example.com`, look for links on the page, then if it finds any links to other pages, it will fetch `https://www.example.com/robots.txt` to check if it is permitted to fetch any subsequent links.

This is the difference between fetching a web page and crawling a website. Perplexity is following the very well established norms here.
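
To make the distinction concrete, here's a minimal Python sketch of the same behavior (the URLs and the "ExampleCrawler" user-agent string are placeholders, not anything Perplexity actually uses):

    import urllib.request
    from urllib.robotparser import RobotFileParser

    # Fetching one user-specified URL, like `wget https://www.example.com`:
    # no robots.txt check happens or is expected.
    page = urllib.request.urlopen("https://www.example.com").read()

    # A crawler, before recursively following links it discovers,
    # fetches robots.txt and checks whether it may proceed.
    rp = RobotFileParser("https://www.example.com/robots.txt")
    rp.read()
    if rp.can_fetch("ExampleCrawler", "https://www.example.com/linked-page"):
        linked = urllib.request.urlopen("https://www.example.com/linked-page").read()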


It's fairly logical to assume that robots.txt governs robots (emphasis on "bots"), not just crawlers. If it's only intended to block crawlers, why isn't it called crawlers.txt instead, removing all ambiguity?


That's a historical question. At the time, most if not all bots were either search engine or archival crawlers. The file was even called "RobotsNotWanted.txt" at the beginning but was renamed "robots.txt" for simplicity. As a related example, the Internet Archive stopped respecting it a couple of years ago, and they discuss this point (crawlers vs. other bots) here [1].

[1] https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


You mean search bots vs. other bots? The Internet Archive's bot is a crawler.

Their post shows no real difference between search bots and archive bots. robots.txt was never for SEO alone: sites exclude print versions so people land on the full pages, with their ads and links to other pages; sites exclude search result pages to conserve resources; and, as they themselves said, sites exclude large files because of costs. And they can't seriously think sites want sensitive areas like administrative pages archived.

In reality, the Internet Archive stopped respecting robots.txt because they wanted to archive what sites didn't want them to archive. Many sites disallowed the Internet Archive specifically. Many sites allowed only specific bots. Many sites disallowed all bots and meant all bots. And hiding old snapshots when a new domain owner changed robots.txt was a self-inflicted problem: robots.txt says what may or may not be crawled now, not retroactively. They knew all of this.


If it were purely a historical question, then another text file to handle AI requests would exist by now, e.g. ai-bots.txt. But it doesn't, and likely never will: they don't want to even have to pretend to comply with creators' requests about forbidding (or not) the use of their sites.


There's more than one way to define what a bot is.

You can make a request by typing the url in chrome, or by asking an AI tool to do so. Both start from user intent, both heavily rely on complicated software to work.

It's fairly logical to assume that bots don't have an intent and users do. It's not the only available interpretation though.


It’s not logical to assume anything about a standard merely from a 30-year-old filename when you can just read the documentation.

> WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.

http://www.robotstxt.org/orig.html


> It's fairly logical to assume that robots.txt governs robots (emphasis on "bots"), not just crawlers

It's plenty logical. That doesn't make it correct.

> if it's only intended to block crawlers, why isn't it called crawlers.txt instead, removing all ambiguity?

Ha. Ask HTTP Referer.

A million standards have quirks in them that we're stuck with.


It's not retrieving a web page though, is it? It's retrieving the content and then manipulating it. Perplexity isn't a web browser.


> It’s retrieving the content then manipulating it. Perplexity isn’t a web browser.

So a browser with an ad-blocker that's removing / manipulating elements on the page isn't a browser? What about reader mode?


How a user views a page isn't the same as a startup scraping the internet wholesale for financial gain.


But it's not scraping, it's retrieving the page on request from the user.


> it's not scraping, it's retrieving the page on request from the user

Search engines already tried it. It's not retrieving on request, because the user didn't request the page; they requested that a bot find specific content on any page.


But that's not what happened here. It WAS retrieving on request.

> I went into Perplexity and asked "What's on this page rknight.me/PerplexityBot?". Immediately I could see the log and just like Lewis, the user agent didn't include their custom user agent


That was to test the user-agent hiding. The broader problem—Perplexity laundering attribution—is where the scraping vs retrieval question comes into play.


Well the example in the post doesn't show any laundering. Do you have an example of it?

Unless you mean the entire concept of training launders attribution, but that's basically unrelated to this post and the complaints inside it.


In this case you are 100% correct, but I think it's reasonable to assume that the "read me this web page" use case constitutes a small minority of Perplexity's fetches. I find it useful because of the attribution - more specifically its references - which I almost always navigate to because its summaries are frequently crap.


That's not how it works in this case. The author asked the AI for information about a specific page.


The only way available to immediately test whether Perplexity pretends not to be Perplexity is by actively requesting a page. The fact that they mask their UA in that scenario makes it fairly obvious that they are not above bending rules and "working around" public conventions that are inconvenient for them. It seems safe to assume, until proven otherwise, that they would fake their bots' user agents in every other case, such as when acquiring training data.
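
Anyone can reproduce that test. A minimal sketch using Python's standard library (assuming a publicly reachable host; the port is arbitrary): serve a page, ask Perplexity what's on it, and see what User-Agent actually arrives in the log.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class UALogger(BaseHTTPRequestHandler):
        def do_GET(self):
            # log whatever user-agent string the client actually sent
            print(self.path, "->", self.headers.get("User-Agent"))
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"test page for UA inspection")

    HTTPServer(("", 8000), UALogger).serve_forever()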


This is why this conversation is making me insane. How can people say with a straight face that the user is requesting a specific page? They aren't; they're doing a search of the web.

That's not at all the same as a browser visiting a page.


Because that's literally what the author does in TFA and then complains about when Perplexity complies.

> What is this post about https://rknight.me/blog/blocking-bots-with-nginx/


Am I the only one who sees a difference between “show me page X” and “what is page X about”?

The first is how browsers work. The second is what perplexity is doing.

Those two are clearly different imo.


You are not.

Perplexity should always respect robots.txt, even for summarization requests. If I say that I don't want Perplexity crawling my site, I mean at all, and I explicitly would not want them "summarizing" my page.

The response from Perplexity to such a request should be "The owner of this page/site does not permit Perplexity to process any data from this site." Period.

LLMs can't summarize in any case: https://ea.rna.nl/2024/05/27/when-chatgpt-summarises-it-actu...


> Perplexity should always respect robots.txt, even for summarization requests. If I say that I don't want Perplexity crawling my site, I mean at all

Issuing a single HTTP request is definitionally not crawling, and the robots.txt spec is specifically for crawlers, which this is not.

If you want a specific tool to exclude you from their web request feature you have to talk to them about it. The web was designed to maximize interop between tools, it correctly doesn't have a mechanism for blacklisting specific tools from your site.


You are definitionally incorrect. From Wikipedia:

> robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.

robotstxt.org/orig.html (the original proposed specification) does have a bit about "recursive" behaviour, but the last paragraph speaks of "which parts of their server should not be accessed".

> WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. For more information see the robots page.

> In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

> These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

In the draft RFC at robotstxt.org/norobots-rfc.txt, the definition is a little stricter about "recursive", but it indicates that using heuristics and/or spacing requests out in time does not make a program any less a robot.

On robotstxt.org/faq/what.html, there is a paragraph:

> Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

One might argue that the misbehaviour of Perplexity on this matter is "at the instruction" of a human, but Perplexity presents itself not as a web browser but as a data-processing entity, so it's clearly not a web browser.

Here's what would be permitted unequivocally, even on a site that blocks bad actors like Perplexity: a browser extension that used Perplexity's LLM to "summarize" (really, shorten: https://ea.rna.nl/2024/05/27/when-chatgpt-summarises-it-actu...) the content when you visit the page, as long as that summary was not saved in Perplexity's data.


Every paragraph that you've included up there just reinforces my point.

The recursive behavior isn't incidental, it's literally part of the definition of a crawler. You can't just skip past that and pretend that the people who specifically included the word recursive (or the phrase "many pages") didn't really mean it.

The first paragraph of the two about access controls is the context for what "should not be accessed" means. It refers to "very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting)", which are pages that should not be indexed by search engines but for the most part shouldn't be a problem for something like perplexity. As I said in my comment, it's about search engine crawlers and indexers.

I'm glad that you at least cherry-picked a paragraph from that second page, because I was starting to worry that you weren't even reading your sources to check if they support your argument. That said, that paragraph means very little in support of your argument (it just gives one example of what isn't a robot, which doesn't imply that everything else is) and you're deliberately ignoring that that page is also very specific about the recursive nature of the robots that are being protected against.

Again, this is the definition that you just cited, which can't possibly include a single request from Perplexity's server (emphasis added):

> WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.

The only way you can possibly apply that definition to the behavior in TFA is if you delete most of it and just end up with "programs ... that traverse ... the WWW", at which point you've also included normal web browsers in your new definition.

It honestly just feels like you really have a lot of beef with LLM tech, which is fair, but there are much better arguments to be made against LLMs than "Perplexity's ad hoc requests are made by a crawler and should respect robots.txt". Your sources do not back up what you claim—on the contrary, they support my claim in every respect—so you should either find better sources or try a different argument.


Perplexity's ad hoc requests are still made by a crawler — whether you believe it or not. A web browser presents the content directly to the user. There may be extensions or features (reader mode) which modify the retrieved content in browser, but Perplexity's summarization feature does not present the content directly to the user in any way.

It honestly just feels like you have no critical thinking when it comes to LLM tech and want to pretend that an autonomous crawler that only retrieves a single page to process it isn't a crawler.

I have used, with the permission of the site owner, a crawler to retrieve data from a single URL on a scheduled basis. It is fully automated data retrieval not intended for direct user consumption. THAT is what makes it a crawler. If the page from which I was retrieving the data were included in `/robots.txt`, the site owner would expect that an automated program would not pull the data. Recursiveness is not what makes a web robot; unattended and/or disconnected requests are.


You are inventing your own definition for a term that is widely understood and clearly and unambiguously defined in sources that you yourself cited. Since you can't engage honestly with your own sources I see no value in continuing this conversation.


With no benefit provided to the creator — they're not directing users out, they're pulling data in.


They are directing users __in__ in some cases though, no? I'm a Perplexity user, and their summaries are often way off, which drives me to the references (attribution). The ratio of fetches to clickthroughs is what matters now though; this new model (which we've not negotiated or really asked for) is driving that ratio upward from 1, and not only are you paying more as a provider, but your consumer is paying more ($ to Perplexity and/or via the ad backend) and you aren't seeing any of it. And you pay those extra costs to indirectly finance the competitor who put you in this situation, who intends to drive that ratio as high as it can in order to get more money from more of your customers tomorrow. Yay.


That's not a relevant factor in most legal regimes. At best it's a moral argument.


Yes, that's literally why "user agent" is called "user agent". It's a program that acts in place and in the interest of its user, and this in particular always included allowing the user to choose what will or won't be rendered, and how. It's not up to the server what the client does with the response they get.


Retrieving the content of a web page then manipulating it is basically the definition of a web browser.


So if you have a browser with Greasemonkey-like scripts running in it, is it not a browser? What about the AI summary feature available in Edge now?


I’d consider it a web browser but that’s a vague enough term that I can understand seeing it differently.

I'd be disappointed if it became common to block clients like this though. To me this feels like blocking Google Chrome because you don't want to show up in Google Search (which is totally fine to want, for the record). Unnecessarily user-hostile because you don't approve of the company behind the client.


UA is just a signature a client sends. It's up to the client to use the signature they want to use.
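
Concretely, a client sets that signature with an ordinary request header; a Python sketch (the UA string here is made up):

    import urllib.request

    req = urllib.request.Request(
        "https://www.example.com",
        headers={"User-Agent": "SomeClient/1.0"},  # whatever the client chooses
    )
    body = urllib.request.urlopen(req).read()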


And it's up to the client to send as many requests as it sees fit; it's still called a DDoS attack when overdone, regardless of the freedom the client has to do it.


Setting a correct user agent isn't required anyway; you just do it to not be an asshole. robots.txt is an optional standard.

The article is just calling Perplexity out for some asshole behavior; it's not that complicated.

It's clear they know they're engaging in poor behavior too: they could've documented some alternative UA for user-initiated requests instead of spoofing Chrome. Folks who trust them could've then blocked the training UA but allowed the alternative, as in the sketch below.
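
Had they documented one, filtering would be trivial. A sketch in Python/WSGI ("Perplexity-User" is a hypothetical user-request UA invented for illustration; only "PerplexityBot" is actually documented):

    from wsgiref.simple_server import make_server

    TRAINING_UAS = ["PerplexityBot"]   # documented crawler UA
    USER_UAS = ["Perplexity-User"]     # hypothetical user-initiated UA

    def app(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        blocked = any(b in ua for b in TRAINING_UAS)
        allowed = any(a in ua for a in USER_UAS)
        if blocked and not allowed:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"crawler not permitted\n"]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"hello\n"]

    make_server("", 8000, app).serve_forever()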



