• Aatube@kbin.social
    9 months ago

    robots.txt is purely textual; you can’t run JavaScript or log anything. Plus, one who doesn’t intend to follow robots.txt wouldn’t query it.

    • BrianTheeBiscuiteer@lemmy.world
      9 months ago

      If it doesn’t get queried, that’s the fault of the web scraper. You don’t need JS built into the robots.txt file either. Just add some line like:

      Disallow: /here-there-be-dragons.html
      

      Any client that hits that page (and maybe doesn’t pass a captcha check) gets banned. Or even better, they get a long stream of nonsense.
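A minimal sketch of this trap, assuming the robots.txt entry is written as a `Disallow` rule and the ban is enforced in application code. The names `TRAP_PATH`, `BANNED_IPS`, `nonsense` and `handle_request` are hypothetical, the ban list lives only in memory, and the optional captcha check is not modeled.

```python
import random
import string

# Hypothetical robots.txt body: polite crawlers are told to stay out of the trap.
ROBOTS_TXT = "User-agent: *\nDisallow: /here-there-be-dragons.html\n"
TRAP_PATH = "/here-there-be-dragons.html"

BANNED_IPS = set()  # in-memory ban list; a real setup would persist this


def nonsense(words=50):
    """Return filler text made of random lowercase 'words'."""
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 9)))
        for _ in range(words)
    )


def handle_request(ip, path):
    """Return (status, body) for a request, banning clients that hit the trap."""
    if ip in BANNED_IPS:
        return 403, "Forbidden"
    if path == "/robots.txt":
        return 200, ROBOTS_TXT
    if path == TRAP_PATH:
        BANNED_IPS.add(ip)      # the client ignored the Disallow rule
        return 200, nonsense()  # feed it junk instead of real content
    return 200, "real content"
```

In this sketch the first hit on the trap page gets the nonsense, and every later request from the same IP gets a 403. Whether to ban, poison, or do both is a policy choice rather than something the technique dictates.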

          • Aniki 🌱🌿
            9 months ago

            That was really interesting. I’ve always used urandom out of habit and wondered what the difference was.

        • Aniki 🌱🌿
          9 months ago

          I wonder if Nginx would just keep loading /dev/random into memory until the kernel’s OOM killer kills it.

      • gravitas_deficiency@sh.itjust.works
        9 months ago

        I actually love the data-poisoning approach. I think that sort of strategy is going to be an unfortunately necessary part of the future of the web.

    • ShitpostCentral@lemmy.world
      9 months ago

      Your second point is a good one, but you absolutely can log the IP that requested robots.txt. That’s just a standard part of any HTTP server ever, no JavaScript needed.

      • GenderNeutralBro@lemmy.sdf.org
        9 months ago

        You’d probably have to go out of your way to avoid logging this. I’ve always seen such logs enabled by default when setting up web servers.
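As a rough illustration, assuming the server writes its access log in the Common or Combined Log Format, the IPs that requested robots.txt can be pulled out with a few lines of Python. The regex, the function name `robots_txt_requesters`, and the sample log lines are made up for this sketch and are not taken from any particular server's configuration.

```python
import re

# Matches the start of a Common/Combined Log Format line:
# ip ident user [timestamp] "METHOD path PROTOCOL" status ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3})')


def robots_txt_requesters(lines):
    """Return the set of client IPs that requested /robots.txt."""
    ips = set()
    for line in lines:
        m = LOG_LINE.match(line)
        # Ignore any query string when comparing the requested path.
        if m and m.group(3).split("?")[0] == "/robots.txt":
            ips.add(m.group(1))
    return ips


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [10/Oct/2023:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]
print(robots_txt_requesters(SAMPLE_LOG))
```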

    • ricecake@sh.itjust.works
      9 months ago

      People not intending to follow it is the real reason not to bother, but it’s trivial to track who downloaded the file and then hit something they were asked not to.

      Like, 10 minutes’ work to do right. You don’t need JS to do it at all.
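One way the tracking could look, as a sketch: given requests as `(ip, path)` pairs in chronological order (for example, extracted from an access log), flag every IP that fetched robots.txt and afterwards requested a path the file disallows. The standard library's `urllib.robotparser` does the rule matching; the robots.txt content, the function name `find_violators`, and the input shape are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /here-there-be-dragons.html
"""


def find_violators(requests, robots_txt=ROBOTS_TXT, user_agent="*"):
    """Return IPs that fetched /robots.txt and later requested a disallowed path.

    `requests` is an iterable of (ip, path) pairs in chronological order.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    saw_robots = set()
    violators = set()
    for ip, path in requests:
        if path == "/robots.txt":
            saw_robots.add(ip)
        elif ip in saw_robots and not rules.can_fetch(user_agent, path):
            violators.add(ip)
    return violators


requests = [
    ("203.0.113.5", "/robots.txt"),
    ("203.0.113.5", "/here-there-be-dragons.html"),   # read the rules, ignored them
    ("198.51.100.7", "/here-there-be-dragons.html"),  # never fetched robots.txt
    ("192.0.2.9", "/robots.txt"),
    ("192.0.2.9", "/index.html"),                     # well behaved
]
print(find_violators(requests))
```

Note that this only catches clients that bothered to download the file first. A scraper that never queries robots.txt, as in the second IP above, would have to be caught by the trap path alone.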