Does your tool respect the site’s robots.txt?
Unfortunately, robots.txt cannot express rate limits, so it would be an overly blunt instrument for cases like the one GP describes. HTTP 429 (Too Many Requests) would be a better fit.
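For what it's worth, here is a minimal sketch of what honoring 429 looks like on the client side: back off for whatever the Retry-After header says, or fall back to exponential delay when the header is missing. The helper name and retry policy are illustrative, not anything from this thread.

```python
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url, max_retries=3):
    """Fetch a URL, sleeping out any HTTP 429 responses."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise
            # Retry-After may be absent or an HTTP-date rather than
            # seconds; fall back to a simple exponential delay.
            retry_after = e.headers.get("Retry-After")
            try:
                delay = float(retry_after)
            except (TypeError, ValueError):
                delay = 2 ** attempt
            time.sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_retries} tries: {url}")
```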
> Crawl-delay is just that, a simple directive to add to robots.txt to set the maximum crawl frequency. It used to be widely followed by all but the worst crawlers …

It's a nonstandard extension without consistent semantics or wide support, but I suppose it's good to know about anyway. Thanks for mentioning it.
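For anyone curious: the directive is a plain `Crawl-delay: 10` line under a `User-agent` group, and despite being nonstandard, Python's stdlib robots.txt parser does expose it. A quick sketch, with a hypothetical domain and user-agent string standing in for the real ones:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # hypothetical site
rp.read()

agent = "mybot"  # illustrative user-agent string
if rp.can_fetch(agent, "https://example.com/page.html"):
    # crawl_delay() returns None when no Crawl-delay line applies,
    # which is common given the directive's patchy support.
    delay = rp.crawl_delay(agent) or 86400  # default: once a day
    print(f"allowed; wait at least {delay}s between fetches")
```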
I was responding to their question of whether scraping the site is considered harmful. I would say that as long as they are not ignoring robots.txt, they shouldn't be contributing a significant amount of traffic if they're really only pulling data once a day.
Yes, it just downloads the HTML of one page and formats the data into RSS, keeping only the information I need.
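In case it helps anyone building something similar, a minimal sketch of that once-a-day approach: fetch one page, pull out the fields you care about, and emit RSS 2.0. The URL, the heading/link selectors, and the `requests`/`BeautifulSoup` stack are all assumptions; the thread doesn't say which site or libraries are actually involved.

```python
import requests
from bs4 import BeautifulSoup
from xml.etree import ElementTree as ET

PAGE_URL = "https://example.com/news"  # hypothetical page

def page_to_rss(url: str) -> str:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = soup.title.string if soup.title else url
    ET.SubElement(channel, "link").text = url
    ET.SubElement(channel, "description").text = "Scraped feed"

    # Assume each article link sits inside an <h2>/<h3>; adjust per site.
    for heading in soup.find_all(["h2", "h3"]):
        link = heading.find("a")
        if link is None or not link.get("href"):
            continue
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = link.get_text(strip=True)
        ET.SubElement(item, "link").text = link["href"]

    return ET.tostring(rss, encoding="unicode")

if __name__ == "__main__":
    print(page_to_rss(PAGE_URL))
```

Run once a day from cron and point your feed reader at the output file, and the site sees one request per day.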