• ℍ𝕂-𝟞𝟝@sopuli.xyz · 8 days ago

    Websites were under constant noise from malicious requests even before AI, but AI scraping of Lemmy instances now typically triples traffic. While some sites can cope with that, it still means a three-fold increase in hosting costs in order to, essentially, fuel investment portfolios.

    AI scrapers already use as much energy as is available to them, so making them spend more per site means fewer sites get scraped, not more total energy used.

    And this is not a DDoS: the objective of scrapers is to get the data, not to bring the site down, so while the server must still reply to every request, the clients can’t get the data out without doing more work than the server.
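
    As a concrete illustration of that asymmetry, here is a minimal sketch of the proof-of-work idea tools like Anubis build on (the hashing scheme and difficulty below are invented for the example, not Anubis’s actual algorithm): the server issues and verifies a challenge with a single hash each, while the client has to grind through thousands of attempts before it gets the page.

    ```python
    import hashlib
    import secrets

    DIFFICULTY = 4  # leading zero hex digits required; purely illustrative

    def make_challenge() -> str:
        """Server side: hand out a random challenge (cheap)."""
        return secrets.token_hex(16)

    def solve(challenge: str) -> int:
        """Client side: grind nonces until the hash meets the difficulty (expensive)."""
        nonce = 0
        while True:
            digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
            if digest.startswith("0" * DIFFICULTY):
                return nonce
            nonce += 1

    def verify(challenge: str, nonce: int) -> bool:
        """Server side: one hash to check the client's work (cheap)."""
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        return digest.startswith("0" * DIFFICULTY)

    challenge = make_challenge()
    nonce = solve(challenge)         # ~16**DIFFICULTY hashes on average for the client
    assert verify(challenge, nonce)  # exactly one hash for the server
    ```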

    • daniskarma@lemmy.dbzer0.com · 8 days ago

      AI does not triple traffic. It’s a completely irrational statement to make.

      There’s a very limited number of companies training big LLM models, and those companies only train a model a few times per year. I would bet that the number of requests per year for a resource by an AI scraper is in the dozens at most.

      Using as much energy as available per scrapping doesn’t even make physical sense. What does that sentence even mean?

      • grysbok@lemmy.sdf.org · 8 days ago

        You’re right. AI didn’t just triple the traffic to my tiny archive’s site. It way more than tripled it. After implementing Anubis, we went from 3000 ‘unique’ visitors down to 20 in a half-day. Twenty is a much more expected number for a small college archive in the summer. That’s before I did any fine-tuning to Anubis, just the default settings.

        I was getting constant outage reports. Now I’m not.

        For us, it’s not about protecting our IP. We want folks to be able to find our information. That’s why we write finding aids, scan material, and accession it. But allowing bots to siphon it all up inefficiently was denying everyone access to it.

        And if you think bots aren’t inefficient, explain why Facebook requests my robots.txt 10 times a second.

          • grysbok@lemmy.sdf.org · 8 days ago (edited)

            Timing and request patterns. The increase in traffic coincided with the increase of AI in the marketplace. Before, we’d get hit by bots in waves and we’d just suck it up for a day. Now it’s constant. The request patterns are deep, deep Solr requests, with far more filters than any human would ever use. These are expensive requests, and the results aren’t any more informative than just scooping up the nicely formatted EAD/XML finding aids we provide.
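
            If anyone wants to check their own logs for the same pattern, this is roughly the test I mean. It’s only a sketch: the path shape, the ‘fq’ filter-query parameter, and the threshold are assumptions that may not match your Solr setup.

            ```python
            from urllib.parse import urlparse, parse_qs

            # Threshold is a guess: humans rarely stack this many facets in one search.
            MAX_HUMAN_FILTERS = 5

            def looks_like_deep_crawl(request_path: str) -> bool:
                """Flag Solr-style search URLs carrying an implausible pile of filter params."""
                query = parse_qs(urlparse(request_path).query)
                # 'fq' is Solr's filter-query parameter; repeated values each count.
                filters = sum(len(v) for k, v in query.items() if k in ("fq", "facet.field"))
                return filters > MAX_HUMAN_FILTERS

            example = ("/search?q=*:*&fq=year:1923&fq=box:12&fq=folder:3"
                       "&fq=format:letter&fq=subject:mining&fq=creator:unknown")
            print(looks_like_deep_crawl(example))  # True: no human clicks six facets at once
            ```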

            And, TBH, I don’t care if it’s AI. I care that it’s rude. If the bots respected robots.txt then I’d be fine with them. They don’t and they break stuff for actual researchers.

            • daniskarma@lemmy.dbzer0.com · 8 days ago (edited)

              I mean, the number of pirates correlates with global temperature. That doesn’t mean causation.

              The rest of those indicators would also match any archiving bot, or any bot in search of big data. We must remember that big data is used for much more than AI. At the end of the day scraping is cheap, but very few companies in the world have access to the processing power to train on that amount of data. That’s why it seems so illogical to me.

              How many LLM models that are the result of a full training run do we actually see per year? Ten? Twenty? Even if they update and retrain often, that’s not compatible with the volume of requests people are attributing to AI scraping, the kind that would put services at DoS risk. Especially since I would think that no AI company would try to scrape the same data twice.

              I have also experienced an increase in bot requests on my host. But I think it’s just a result of the internet getting bigger: more people using the internet with more diverse intentions, some ill, some not. I’ve also experienced a big increase in probing and attack attempts in general, and I don’t think it’s OpenAI trying some outdated Apache vulnerability on my server. The internet is just a bigger sea with more fish in it.

              • grysbok@lemmy.sdf.org · 8 days ago

                I just looked at my logs for this morning: 23% of my total requests were from the user agent GoogleOther. Other visitors include GPTBot, SemanticScholarBot, and Turnitin. Those are the crawlers that are still trying after I’ve had Anubis on the site for over a month. It was much, much worse before, when they could actually crawl the site instead of being blocked.

                That doesn’t include the bots that lie about what they are. Looking back at an older screenshot of my monitoring (I don’t have the logs themselves anymore), I seriously doubt I had 43,000 unique visitors using Windows per day in March.
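
                For anyone who wants the same numbers from their own server, this is roughly the one-off pass I mean: a sketch that assumes a stock Apache/nginx “combined” log format, which may not match your setup.

                ```python
                import re
                from collections import Counter

                # Assumed "combined" format: ip - - [time] "request" status size "referer" "user-agent"
                LINE_RE = re.compile(
                    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
                )

                ua_counts = Counter()
                windows_ips = set()
                total = 0

                with open("access.log", encoding="utf-8", errors="replace") as log:
                    for line in log:
                        match = LINE_RE.match(line)
                        if not match:
                            continue
                        total += 1
                        ua_counts[match.group("ua")] += 1
                        if "Windows NT" in match.group("ua"):
                            windows_ips.add(match.group("ip"))

                # Share of traffic per user agent, plus 'unique' visitors claiming Windows.
                for ua, count in ua_counts.most_common(10):
                    print(f"{100 * count / total:5.1f}%  {count:7d}  {ua[:70]}")
                print(f"'unique' Windows visitors: {len(windows_ips)}")
                ```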

                • daniskarma@lemmy.dbzer0.com · 8 days ago (edited)

                  Why would they request the same data so many times a day if the objective was AI model training? It makes zero sense.

                  Also, Google’s bots obey robots.txt, so they are easy to manage.
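
                  Managing them through robots.txt is also trivial to test: a quick sketch with Python’s standard robotparser (the site and user agents below are placeholders, not anyone’s real setup).

                  ```python
                  from urllib.robotparser import RobotFileParser

                  # Placeholder site; point this at the robots.txt you actually care about.
                  parser = RobotFileParser("https://example.org/robots.txt")
                  parser.read()

                  for agent in ("Googlebot", "GPTBot", "*"):
                      ok = parser.can_fetch(agent, "https://example.org/search?q=test")
                      print(f"{agent}: {'allowed' if ok else 'disallowed'}")
                  ```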

                  There may be tons of reasons Google is crawling your website, from ad research to any other kind of research. The only AI-related use I can think of is RAG. But that would take some user requests away, because if the user gets the info through Google’s AI response they won’t visit the website. I suppose that would suck for the website owner, but it won’t drastically increase the number of requests.

                  But for training I don’t see it; there’s no need at all to keep constantly scraping the same site for model training.

                    • grysbok@lemmy.sdf.org · 8 days ago (edited)

                    Like I said, [edit: at one point] Facebook requested my robots.txt multiple times a second. You’ve not convinced me that bot writers care about efficiency.

                    [edit: they’ve since stopped, possibly because now I give a 404 to anything claiming to be from facebook]
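
                    For the curious, the real rule is a one-liner in our reverse proxy; the Flask-flavored sketch below is just an illustration of the effect, not what the site actually runs.

                    ```python
                    from flask import Flask, abort, request

                    app = Flask(__name__)

                    # User-agent substrings Facebook/Meta crawlers are known to send.
                    FACEBOOK_HINTS = ("facebookexternalhit", "meta-externalagent")

                    @app.before_request
                    def drop_facebook_claims():
                        """Anything claiming to be a Facebook crawler gets a plain 404."""
                        ua = request.headers.get("User-Agent", "").lower()
                        if any(hint in ua for hint in FACEBOOK_HINTS):
                            abort(404)

                    @app.route("/")
                    def index():
                        return "archive home page"
                    ```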

      • ℍ𝕂-𝟞𝟝@sopuli.xyz · 8 days ago

        AI does not triple traffic. It’s a completely irrational statement to make.

        Multiple testimonials from people who host sites say it does. Multiple Lemmy instances have reported the same.

        I would bet that the number of requests per year for a resource by an AI scraper is in the dozens at most.

        You obviously don’t know much about hosting a public server. Try dozens per second.

        There is a booming startup industry all over the world training AI and scraping data to sell to companies training AI. It’s not just Microsoft, Facebook, and Twitter doing it, but also Chinese companies trying to compete, as well as companies building models for internal use rather than public ones. They all use public cloud IPs, so the traffic comes in from everywhere, incessantly.

        Using as much energy as available per scrapping doesn’t even make physical sense. What does that sentence even mean?

        It means that when Microsoft buys a server for scraping, they are going to run it 24/7 with the CPU and network maxed out, at maximum power use, to get as much data as they can. If the server can scrape 100 sites per minute, it will scrape 100 sites; if it can scrape 1,000, it will scrape 1,000; and if it can only do 10, it will do 10.

        It will never stop scraping, because stopping it would be the equivalent of shutting down a production line. Everyone always runs their scrapers as hard as they can. Ironically, increasing the cost of scraping would result in less energy consumed in total, since it would force companies to work more “smart” and less “hard” at scraping and training AI.
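
        To put toy numbers on the “fewer sites, not more energy” point (every figure below is invented): the fleet’s energy budget is fixed because it runs flat out either way, so raising the per-page cost only shrinks how many sites fit inside that budget.

        ```python
        # Toy model of a scraping fleet that always runs flat out; all numbers invented.
        daily_energy_budget_j = 5e9   # joules the fleet burns per day no matter what
        pages_per_site = 1_000        # pages fetched per scraped site

        for energy_per_page_j in (1, 10, 100):  # e.g. a proof-of-work wall raises this
            sites_per_day = int(daily_energy_budget_j / energy_per_page_j // pages_per_site)
            print(f"{energy_per_page_j:>4} J/page -> {sites_per_day:>9,} sites/day, "
                  f"total energy still {daily_energy_budget_j / 1e9:.0f} GJ/day")
        ```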

        Oh, and it’s S-C-R-A-P-I-N-G, not scrapping. It comes from the word “scrape”, meaning to remove the surface from an object using a sharp instrument, not “scrap”, which means to take something apart for its components.

        • daniskarma@lemmy.dbzer0.com · 8 days ago

          I’m not a native English speaker, so I apologize if there’s bad English in my response, and I’d be thankful for any corrections.

          That being said, I do host public services, and did so before and after AI became a thing. I have asked many of the people who claim “we are under AI bot attacks” how they are able to tell whether a request comes from an AI scraper or just any other scraper, and there was no satisfying answer.
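
          The closest thing I’ve seen to an answer is matching declared user agents against the strings the big AI crawlers publish, which obviously only catches the ones that identify themselves. A rough sketch, where the substring list is partial and from memory and the log parsing assumes a combined-style format:

          ```python
          from collections import Counter

          # Partial, from-memory list of user-agent substrings used by self-declared AI crawlers.
          AI_UA_HINTS = ("gptbot", "claudebot", "ccbot", "bytespider", "perplexitybot")

          def classify(user_agent: str) -> str:
              ua = user_agent.lower()
              if any(hint in ua for hint in AI_UA_HINTS):
                  return "declared AI crawler"
              if "bot" in ua or "spider" in ua or "crawler" in ua:
                  return "other declared bot"
              return "browser-like (may still be a bot pretending not to be)"

          counts = Counter()
          with open("access.log", encoding="utf-8", errors="replace") as log:
              for line in log:
                  # In a combined-format log the user agent is the last quoted field.
                  parts = line.rsplit('"', 2)
                  counts[classify(parts[-2] if len(parts) == 3 else "")] += 1

          print(counts)
          ```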