
Google does not respect robots.txt


               
2023 Jun 30, 6:10pm   357 views  3 comments

by Patrick

There is a standard for informing search engines that you do not want them to index your site: include a file called robots.txt at the top level of your site, e.g.:

https://patrick.net/robots.txt

Note that the first thing I do is to tell Google to fuck off:


User-Agent:GoogleBot
Disallow: /
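
For anyone who wants to check how a compliant crawler should read those two lines, here is a minimal sketch using Python's standard urllib.robotparser. The rules are copied from the robots.txt excerpt above and the URL is the one that shows up in the server log; "Bingbot" is only an illustrative second crawler name and does not come from the post.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted above, exactly as they appear in robots.txt.
RULES = [
    "User-Agent:GoogleBot",
    "Disallow: /",
]


def allowed(agent: str, url: str) -> bool:
    """Return True if a crawler that honors RULES may fetch url."""
    parser = RobotFileParser()
    parser.parse(RULES)  # parse from memory, no network access needed
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "https://patrick.net/housing/rss.xhtml"
    print("Googlebot:", allowed("Googlebot", url))
    # Illustrative other crawler: no rule names it, so nothing forbids it.
    print("Bingbot:", allowed("Bingbot", url))
```

Agent matching in this parser is case-insensitive, so the "GoogleBot" spelling in the file still applies to "Googlebot".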


But Google disrespects the wishes of site owners and indexes anyway! Proof from my web server log:


34.32.251.230 - - [30/Jun/2023:23:57:41 +0000] "GET /housing/rss.xhtml HTTP/1.1" 403 27 "http://patrick.net/housing/rss.xhtml" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-" 0.002 0.000 127.0.0.1:8083 "-" 403 - .


And that is not a spoof of Google's bot, because 34.32.251.230 is really a Google IP:

% whois 34.32.251.230

NetRange: 34.4.5.0 - 34.63.255.255
CIDR: 34.4.64.0/18, 34.4.32.0/19, 34.16.0.0/12, 34.8.0.0/13, 34.32.0.0/11, 34.4.16.0/20, 34.4.128.0/17, 34.5.0.0/16, 34.4.8.0/21, 34.4.6.0/23, 34.6.0.0/15, 34.4.5.0/24
NetName: GOOGL-2
NetHandle: NET-34-4-5-0-1
Parent: NET34 (NET-34-0-0-0-0)
NetType: Direct Allocation
OriginAS:
Organization: Google LLC (GOOGL-2)
RegDate: 2022-05-09
Updated: 2022-05-09
Ref: https://rdap.arin.net/registry/ip/34.4.5.0
...
Comment: Complaints can also be sent to the GC Abuse desk
Comment: (google-cloud-compliance@google.com )
Comment: but may have longer turnaround times.
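
To double-check that the logged address really falls inside the allocation whois reports, here is a small sketch with Python's standard ipaddress module. The CIDR list is copied from the whois output above; the function name in_allocation is just a name made up for this example.

```python
import ipaddress

# CIDR blocks copied from the whois output above (NetName GOOGL-2).
GOOGL_2_CIDRS = [
    "34.4.64.0/18", "34.4.32.0/19", "34.16.0.0/12", "34.8.0.0/13",
    "34.32.0.0/11", "34.4.16.0/20", "34.4.128.0/17", "34.5.0.0/16",
    "34.4.8.0/21", "34.4.6.0/23", "34.6.0.0/15", "34.4.5.0/24",
]
NETWORKS = [ipaddress.ip_network(cidr) for cidr in GOOGL_2_CIDRS]


def in_allocation(ip: str) -> bool:
    """Return True if ip lies in any of the CIDR blocks listed above."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in NETWORKS)


if __name__ == "__main__":
    # The address from the server log entry.
    print(in_allocation("34.32.251.230"))
```

The log address lands in 34.32.0.0/11, which spans 34.32.0.0 through 34.63.255.255.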


So I wrote to google-cloud-compliance@google.com asking them to stop, but they have neither replied nor stopped.

Summary: Google is evil, and will try to index your site whether you ask them to stop or not.

PS I have taken measures on the server side now to block Google by IP address, returning a 403 Forbidden to them.
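
The post does not say how the block is implemented, so the following is only a sketch of the general idea, written as a tiny WSGI app in Python. It assumes the blocked range is the 34.32.0.0/11 block from the whois output, and that REMOTE_ADDR holds the real client address (behind a reverse proxy it would not, and the address would have to come from a header the proxy sets). The port number and the names BLOCKED, is_blocked and app are all made up for illustration.

```python
import ipaddress

# Assumption: block the 34.32.0.0/11 range seen in the whois output.
# The actual block list used on the site is not given in the post.
BLOCKED = [ipaddress.ip_network("34.32.0.0/11")]


def is_blocked(ip: str) -> bool:
    """Return True if ip falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED)


def app(environ, start_response):
    """WSGI app: 403 Forbidden for blocked clients, 200 OK otherwise."""
    # Assumption: REMOTE_ADDR is the real client, not a local proxy.
    client = environ.get("REMOTE_ADDR", "")
    if client and is_blocked(client):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    # Hypothetical local port, for trying the sketch out by hand.
    make_server("127.0.0.1", 8000, app).serve_forever()
```

Blocking a whole /11 turns away every client in that range, not only crawlers, which is the trade-off of blocking by IP rather than by user agent.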


3   richwicks   @   2023 Jun 30, 8:33pm  

Patrick says

PS I have taken measures on the server side now to block Google by IP address, returning a 403 Forbidden to them.


I would recommend you return something useless to them like a webpage that says:

"Don't use Google, use any of these other search engines instead"

And list them.
