Google does not respect robots.txt


2023 Jun 30, 6:10pm   308 views  3 comments

by Patrick

There is a standard for informing search engines that you do not want them to index your site. It's to include a file called robots.txt at the top level of your site, e.g.:

https://patrick.net/robots.txt

Note that the first thing I do is to tell Google to fuck off:


User-Agent:GoogleBot
Disallow: /
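
For anyone who wants to check what that rule means to a well-behaved crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. It assumes the two lines above are the whole rule set under test; `OtherBot` is a made-up agent name used only for contrast.

```python
from urllib.robotparser import RobotFileParser

# The two lines quoted above, fed to the parser directly (no network fetch).
ROBOTS_LINES = [
    "User-Agent:GoogleBot",
    "Disallow: /",
]


def build_parser(lines):
    """Return a RobotFileParser loaded from an in-memory list of lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


parser = build_parser(ROBOTS_LINES)
URL = "https://patrick.net/housing/rss.xhtml"

# Agent matching is case-insensitive, so "Googlebot" matches "GoogleBot".
googlebot_allowed = parser.can_fetch("Googlebot", URL)

# "OtherBot" is a hypothetical agent with no matching rule, so it is allowed.
otherbot_allowed = parser.can_fetch("OtherBot", URL)

print(googlebot_allowed, otherbot_allowed)
```

Pass the short product token (`Googlebot`) rather than the full User-Agent header. As far as I can tell, the standard-library parser only compares the part of the agent string before the first `/`, so the full `Mozilla/5.0 (compatible; ...)` string would not match the rule.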


But Google disrespects the wishes of site owners and crawls anyway! Proof from my web server log:


34.32.251.230 - - [30/Jun/2023:23:57:41 +0000] "GET /housing/rss.xhtml HTTP/1.1" 403 27 "http://patrick.net/housing/rss.xhtml" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-" 0.002 0.000 127.0.0.1:8083 "-" 403 - .
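
To find every request like this in a log, something along these lines would work. It is a rough sketch, assuming the log lines start with the usual combined-format fields (client IP, timestamp, request, status, size, referer, user agent) as the line above does. The extra trailing fields are ignored, and the function names are mine.

```python
import re

# Matches only the leading combined-log fields; anything after the
# user-agent string (upstream timings etc.) is deliberately ignored.
LOG_PREFIX = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of the leading log fields, or None if the line does not match."""
    match = LOG_PREFIX.match(line)
    return match.groupdict() if match else None


def claims_googlebot(entry):
    """True if the User-Agent string says it is Googlebot (a claim, not proof)."""
    return "googlebot" in entry["agent"].lower()


SAMPLE = (
    '34.32.251.230 - - [30/Jun/2023:23:57:41 +0000] '
    '"GET /housing/rss.xhtml HTTP/1.1" 403 27 '
    '"http://patrick.net/housing/rss.xhtml" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" '
    '"-" 0.002 0.000 127.0.0.1:8083 "-" 403 - .'
)

entry = parse_line(SAMPLE)
if entry and claims_googlebot(entry):
    print(entry["ip"], entry["status"], entry["request"])
```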


And that is not a spoof of Google's bot, because 34.32.251.230 is really a Google IP:

% whois 34.32.251.230

NetRange: 34.4.5.0 - 34.63.255.255
CIDR: 34.4.64.0/18, 34.4.32.0/19, 34.16.0.0/12, 34.8.0.0/13, 34.32.0.0/11, 34.4.16.0/20, 34.4.128.0/17, 34.5.0.0/16, 34.4.8.0/21, 34.4.6.0/23, 34.6.0.0/15, 34.4.5.0/24
NetName: GOOGL-2
NetHandle: NET-34-4-5-0-1
Parent: NET34 (NET-34-0-0-0-0)
NetType: Direct Allocation
OriginAS:
Organization: Google LLC (GOOGL-2)
RegDate: 2022-05-09
Updated: 2022-05-09
Ref: https://rdap.arin.net/registry/ip/34.4.5.0
...
Comment: Complaints can also be sent to the GC Abuse desk
Comment: (google-cloud-compliance@google.com )
Comment: but may have longer turnaround times.
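
A whois match only shows who the address block is allocated to. A stricter check, and as far as I know the one Google itself documents for verifying its crawlers, is a reverse DNS lookup on the IP followed by a forward lookup on the returned hostname to confirm it maps back to the same IP. Below is a minimal sketch of that check. The suffix list is my reading of Google's guidance for Googlebot, and the function names and injectable resolver arguments are my own, added so the logic can be tested without the network.

```python
import socket

# Assumption: hostname suffixes treated as genuine Googlebot hosts.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the set of addresses hostname resolves to (empty on failure)."""
    try:
        return {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except OSError:
        return set()


def is_verified_googlebot(ip, reverse=reverse_lookup, forward=forward_lookup):
    """Reverse-then-forward DNS check; both steps must agree."""
    hostname = reverse(ip)
    if not hostname:
        return False
    # The leading dot in each suffix prevents matches like "evilgooglebot.com".
    if not hostname.lower().rstrip(".").endswith(GOOGLEBOT_SUFFIXES):
        return False
    # Forward-confirm: the hostname must resolve back to the original IP.
    return ip in forward(hostname)
```

Calling `is_verified_googlebot("34.32.251.230")` performs live DNS lookups, so the answer depends on what DNS returns at the time you run it.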


So I wrote to google-cloud-compliance@google.com to ask them to stop that, but they have neither replied nor stopped.

Summary: Google is evil, and will try to index your site whether you ask them to stop or not.

PS I have taken measures on the server side now to block Google by IP address, returning a 403 Forbidden to them.
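
For anyone who wants to do the same, here is a rough sketch of IP-based blocking as a small Python WSGI wrapper. It is an illustration only, not necessarily how this site does it. The blocked range is just the 34.32.0.0/11 block from the whois output above, and `hello_app` is a placeholder application.

```python
import ipaddress

# Assumption: block the /11 from the whois output that contains 34.32.251.230.
BLOCKED_NETWORKS = [ipaddress.ip_network("34.32.0.0/11")]


def is_blocked(remote_addr):
    """True if remote_addr parses as an IP inside any blocked network."""
    try:
        ip = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False
    return any(ip in net for net in BLOCKED_NETWORKS)


def block_by_ip(app):
    """WSGI middleware: answer 403 Forbidden for blocked client addresses."""
    def wrapped(environ, start_response):
        if is_blocked(environ.get("REMOTE_ADDR", "")):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return wrapped


def hello_app(environ, start_response):
    """Placeholder application standing in for the real site."""
    body = b"ok\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


application = block_by_ip(hello_app)
```

If the application sits behind a reverse proxy, `REMOTE_ADDR` will be the proxy's address rather than the client's, so the block would have to happen at the proxy itself or use a forwarded-address header that you trust.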

Comments 1 - 3 of 3

1   RWSGFY   2023 Jun 30, 6:43pm  

They don't respect robots.txt but they don't search worth shit on this site either. Why do they even bother to crawl it?

2   MolotovCocktail   2023 Jun 30, 8:00pm

RWSGFY says

They don't respect robots.txt but they don't search worth shit on this site either. Why do they even bother to crawl it?


They don't give search results to you worth shit.

Doesn't mean they don't suck everything they can from this site, tho.

I do think it is safe to say that they don't care for much of it, because there are no ads on Patrick.net at all.

But if certain keywords meet Deep State domestic terrorism criteria, they will hand it over.

3   richwicks   2023 Jun 30, 8:33pm

Patrick says

PS I have taken measures on the server side now to block Google by IP address, returning a 403 Forbidden to them.


I would recommend you return something useless to them like a webpage that says:

"Don't use Google, use any of these other search engines instead"

And list them.
