Remember when tumblr announced the sale of your posts to OpenAI?
In the staff post that avoided talking about it directly by announcing the new ‘no-ai’ setting, they mentioned that if your tumblr blog was already hidden from search engines, they had automatically turned that setting on for you by default. I remember thinking “well, in this sea of shit, turning it on by default for those who already cared about privacy is at least a decent gesture”.
I had just been sniffing around a bit (of course I didn’t want to copy tumblr’s robots.txt to use it in goblin, what are you saying, that’s outrageous!) and I just realized something. This is tumblr’s robots.txt (the file that tells crawlers what they can and can’t access) when you have the ‘discourage external search’ setting on.
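In essence it boils down to this (a sketch of the relevant part; the real file may carry a few extra lines):

    # What tumblr serves when ‘discourage external search’ is on:
    # every crawler is asked to stay out of every page
    User-agent: *
    Disallow: /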
Basically, any bot is discouraged from crawling any page on your blog. Any bot. Including the AI ones. So if you had the “don’t show up in search engines” box checked, you were already as protected from AI trainers reading your posts as tumblr can possibly protect you. So yeah, that decent gesture was basically smoke and mirrors to distract people from the fact that they were already selling their data.
But what happens if you want to allow google & bing & the rest to index your blog, but turn on the ‘no-ai’ setting? Then the robots.txt file of your blog changes:
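It looks roughly like this (an illustrative reconstruction; the exact list of user agents tumblr ships may differ):

    # Known AI crawlers, each politely asked to go away
    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: anthropic-ai
    Disallow: /

    User-agent: FacebookBot
    Disallow: /

    # ...and so on down the list...

    # Everyone else (google, bing & friends) can index normally
    User-agent: *
    Disallow: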
Yeah, that’s a list of known AI crawlers you gently request to not read your blog. Can you spot what’s missing?
Oh yeah, it seems they have forgotten to disallow GPTBot, OpenAI’s crawler. You know, the company Automattic signed a deal with to sell tumblr’s data. So even if you turn on the infamous “no-ai” setting, tumblr won’t block OpenAI’s bots from reading your site. It will block all of OpenAI’s competitors, but not OpenAI itself.
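For reference, blocking it would take exactly two more lines, in the same format as every other entry on that list (GPTBot is the user agent OpenAI itself documents for its training crawler):

    User-agent: GPTBot
    Disallow: /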