• who@feddit.org · 3 days ago (edited)

      Unfortunately, robots.txt cannot express rate limits, so it would be an overly blunt instrument for things like GP describes. HTTP 429 would be a better fit.
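      To illustrate the HTTP 429 approach: a server can reject requests that arrive too soon after the previous one and use the standard Retry-After header to tell the client how long to wait. This is a minimal sketch with hypothetical names (`MIN_INTERVAL`, `check_rate_limit`), not any particular server's implementation:

      ```python
      import time

      MIN_INTERVAL = 5.0  # seconds a client must wait between accepted requests
      _last_seen = {}     # client id -> timestamp of last accepted request

      def check_rate_limit(client, now=None):
          """Return (status, headers) for a request from `client`.

          Answers 429 Too Many Requests with a Retry-After header (in
          seconds, rounded up) when the client comes back too soon.
          """
          now = time.monotonic() if now is None else now
          last = _last_seen.get(client)
          if last is not None and now - last < MIN_INTERVAL:
              wait = MIN_INTERVAL - (now - last)
              return 429, {"Retry-After": str(int(wait) + 1)}
          _last_seen[client] = now
          return 200, {}
      ```

      A well-behaved crawler that receives 429 should back off for at least the Retry-After interval before retrying.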

      • redjard@lemmy.dbzer0.com · 3 days ago

        Crawl-delay is just that, a simple directive to add to robots.txt to set the maximum crawl frequency. It used to be widely followed by all but the worst crawlers …
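        For reference, the directive looks like this in robots.txt (crawlers that honor it, such as Bing's and Yandex's, generally read the value as a minimum number of seconds between successive requests; Google ignores it):

        ```
        User-agent: *
        Crawl-delay: 10
        ```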

        • who@feddit.org · 2 days ago (edited)

          Crawl-delay

          It’s a nonstandard extension without consistent semantics or wide support, but I suppose it’s good to know about anyway. Thanks for mentioning it.

      • S7rauss@discuss.tchncs.de · 3 days ago

        I was responding to their question of whether scraping the site is considered harmful. I would say that as long as they are not ignoring robots.txt, they shouldn't be contributing a significant amount of traffic if they're really only pulling data once a day.