Google ignores robots.txt entirely and still indexes the site's titles and URLs

    nikoo · 2016-12-01 20:56:23 +08:00 · 3667 views
    I set up a wiki on a VPS under a domain I had never used before.
    From the moment the site became reachable, /robots.txt has always been:
    User-agent: *
    Disallow: /

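    I ran it this way for over two months and imported a lot of articles. Today I ran a Google site: query and got a shock: Google had indexed two pages of results in spite of robots.txt. Every result shows the title and URL, and the description on the third line always reads: "A description for this result is not available because of this site's robots.txt."
    Learn more

    I requested removal with the Google Remove URLs Tool, but got no response at all.

    Isn't Google's motto "don't be evil"? Why does it still index the titles and URLs of pages that are explicitly disallowed?
    How do I completely stop Google from indexing anything on the site?

    btw: other search engines such as Yahoo, Bing, and Baidu all behaved and indexed nothing from the site.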
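For reference, here is a minimal sketch of what the robots.txt in this post tells a compliant crawler, using Python's standard `urllib.robotparser`. The host `wiki.example.com` and the helper name `is_crawl_allowed` are assumptions made for illustration. Note that the check only answers whether a URL may be fetched; it says nothing about whether a URL may be listed in search results.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the post, parsed from a string instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# wiki.example.com is a placeholder host, not the poster's real domain.
crawl_allowed = is_crawl_allowed(
    ROBOTS_TXT, "Googlebot", "https://wiki.example.com/SomePage"
)
```

With these rules every path is off limits for fetching, which matches the "no description available" snippets the poster saw: the listing is built from the URL alone.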
    14 replies    2016-12-02 10:13:11 +08:00
    auzeonfung
        1
    auzeonfung  
       2016-12-01 20:59:22 +08:00 via Android
    Ban Google's IPs on the server.
    stamaimer
        2
    stamaimer  
       2016-12-01 21:03:05 +08:00 via iPhone
    @auzeonfung Do you know how many IPs Google has?
    xmoiduts
        3
    xmoiduts  
       2016-12-01 21:07:11 +08:00
    OP, try searching for taobao?
    imcocc
        4
    imcocc  
       2016-12-01 21:07:27 +08:00 via iPhone
    Search for "block junk crawlers".
    Block them by matching the User-Agent.
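A rough sketch of the User-Agent matching idea, written as a plain Python helper that a request handler could call before serving a page. The pattern and the name `should_block` are hypothetical, and the token list is illustrative only. Keep in mind that the User-Agent header is supplied by the client, so this only stops crawlers that identify themselves honestly.

```python
import re
from typing import Optional

# Illustrative crawler tokens (an assumption, not an exhaustive list).
BLOCKED_UA = re.compile(r"googlebot|bingbot|baiduspider", re.IGNORECASE)


def should_block(user_agent: Optional[str]) -> bool:
    """Return True when the User-Agent header contains a blocked crawler token."""
    if not user_agent:
        return False
    return BLOCKED_UA.search(user_agent) is not None
```

A handler would typically answer with 403 when `should_block(...)` returns True.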
    gogohigh
        5
    gogohigh  
       2016-12-01 21:09:43 +08:00
    @stamaimer
    gfwlist
    nikoo
        6
    nikoo  
    OP
       2016-12-01 21:16:31 +08:00   ❤️ 1
    Some findings from my research:
    Why do Google search results include pages disallowed in robots.txt?
    http://webmasters.stackexchange.com/questions/24569/why-do-google-search-results-include-pages-disallowed-in-robots-txt
    Does Google ignore robots.txt
    http://webmasters.stackexchange.com/questions/54879/does-google-ignore-robots-txt

    Summary of the conclusions in the two threads above:
    Google really does index pages that robots.txt disallows. The fix is to add this to every page:
    <meta name="robots" content="noindex, nofollow">
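    Google's explanation is that a page gets indexed, regardless of robots.txt, as long as some other indexed page links to it.

    That doesn't seem right to me: the articles imported into my wiki are not linked from any other site, and cannot be, so how did they end up indexed with title and URL?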
    caiych
        7
    caiych  
       2016-12-01 21:19:57 +08:00   ❤️ 1
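    While looking up the details of robots.txt I found Google's documentation, which says:
    > If you want to block your page from search results, use another method such as password protection or a noindex tag or directive.
    Not sure whether the OP has set this up…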

    https://support.google.com/webmasters/answer/6062608?visit_id=1-636161949805851671-2329679117&hl=zh-Hans&rd=2
    nikoo
        8
    nikoo  
    OP
       2016-12-01 21:34:16 +08:00
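    @caiych Thank you very much, that document was very helpful. I think Google's approach here is flawed:

    robots.txt directives can't prevent references to your URLs from other sites
    While Google won't crawl or index content that robots.txt blocks, we might still find a disallowed URL elsewhere on the web and index it. As a result, the URL and other publicly available information, such as anchor text in links to the site, can still appear in Google search results. You can stop your URL from appearing in Google search results completely by using other URL blocking methods, such as password-protecting the files on your server or using the noindex meta tag or response header.

    So here is the problem. Going by
    "Using meta tags to block search engines from indexing your pages" https://support.google.com/webmasters/answer/93710
    the Google crawler cannot read the "noindex meta tag" because robots.txt blocks access to the page, so in theory setting the "noindex meta tag" on my own pages has no effect (because of the robots.txt restriction).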
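The conflict described here matches Google's documentation: a noindex rule can only be honored if the crawler is allowed to fetch the page and see it. One way out, sketched below under the assumption that robots.txt is relaxed so pages can be crawled, is to send the noindex rule as an `X-Robots-Tag` response header on every response. The WSGI middleware, `demo_app`, and `collect_headers` are hypothetical names for illustration; the poster's wiki software may offer its own setting for this.

```python
from wsgiref.util import setup_testing_defaults


def noindex_middleware(app):
    """Wrap a WSGI app so every response carries X-Robots-Tag: noindex, nofollow."""
    def wrapped(environ, start_response):
        def start_with_header(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so the header is not duplicated.
            headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", "noindex, nofollow"))
            return start_response(status, headers, exc_info)
        return app(environ, start_with_header)
    return wrapped


def demo_app(environ, start_response):
    """Stand-in for the wiki application (hypothetical)."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<h1>wiki page</h1>"]


def collect_headers(app):
    """Call a WSGI app once with a fake request and capture status and headers."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {}
    setup_testing_defaults(environ)
    body = b"".join(app(environ, start_response))
    return captured, body


captured, body = collect_headers(noindex_middleware(demo_app))
```

A response header has the advantage of covering non-HTML files as well, which a meta tag cannot do.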
    khaki
        9
    khaki  
       2016-12-01 21:44:24 +08:00
    The documentation here is more detailed: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt . Could it be a subdomain issue?
    auzeonfung
        10
    auzeonfung  
       2016-12-01 21:59:55 +08:00
    @stamaimer
    deny from 104.132.0.0/21
    deny from 104.132.12.0/24
    deny from 104.132.128.0/24
    deny from 104.132.129.0/24
    deny from 104.132.13.0/26
    deny from 104.132.13.112/28
    deny from 104.132.13.128/25
    deny from 104.132.13.64/27
    deny from 104.132.13.96/28
    deny from 104.132.130.0/24
    deny from 104.132.131.0/24
    deny from 104.132.132.0/24
    deny from 104.132.133.0/24
    deny from 104.132.134.0/24
    deny from 104.132.135.0/24
    deny from 104.132.136.0/23
    deny from 104.132.138.0/24
    deny from 104.132.139.0/24
    deny from 104.132.14.0/23
    deny from 104.132.140.0/24
    deny from 104.132.141.0/26
    deny from 104.132.141.112/28
    deny from 104.132.141.128/25
    deny from 104.132.141.64/27
    deny from 104.132.141.96/28
    deny from 104.132.142.0/24
    deny from 104.132.143.0/24
    deny from 104.132.144.0/24
    deny from 104.132.145.0/24
    deny from 104.132.146.0/24
    deny from 104.132.147.0/24
    deny from 104.132.148.0/23
    deny from 104.132.150.0/24
    deny from 104.132.151.0/24
    deny from 104.132.152.0/24
    deny from 104.132.153.0/24
    deny from 104.132.154.0/23
    deny from 104.132.156.0/24
    deny from 104.132.157.0/24
    deny from 104.132.158.0/24
    deny from 104.132.159.0/24
    deny from 104.132.16.0/24
    deny from 104.132.160.0/24
    deny from 104.132.161.0/24
    deny from 104.132.162.0/24
    deny from 104.132.163.0/24
    deny from 104.132.164.0/23
    deny from 104.132.166.0/24
    deny from 104.132.167.0/24
    deny from 104.132.168.0/24
    deny from 104.132.169.0/24
    deny from 104.132.17.0/26
    deny from 104.132.17.112/28
    deny from 104.132.17.128/25
    deny from 104.132.17.64/27
    deny from 104.132.17.96/28
    deny from 104.132.170.0/24
    deny from 104.132.171.0/24
    deny from 104.132.172.0/22
    deny from 104.132.176.0/23
    deny from 104.132.178.0/24
    deny from 104.132.179.0/24
    deny from 104.132.18.0/24
    deny from 104.132.180.0/24
    deny from 104.132.181.0/24
    deny from 104.132.182.0/24
    deny from 104.132.183.0/24
    deny from 104.132.184.0/24
    deny from 104.132.185.0/24
    deny from 104.132.186.0/24
    deny from 104.132.187.0/24
    deny from 104.132.188.0/24
    deny from 104.132.189.0/24
    deny from 104.132.19.0/24
    deny from 104.132.190.0/23
    deny from 104.132.192.0/22
    deny from 104.132.196.0/24
    deny from 104.132.197.0/24
    deny from 104.132.198.0/23
    deny from 104.132.20.0/24
    deny from 104.132.200.0/23
    deny from 104.132.202.0/24
    deny from 104.132.203.0/24
    deny from 104.132.204.0/24
    deny from 104.132.205.0/24
    deny from 104.132.206.0/23
    deny from 104.132.208.0/24
    deny from 104.132.209.0/24
    deny from 104.132.21.0/26
    deny from 104.132.21.112/28
    deny from 104.132.21.128/25
    deny from 104.132.21.64/27
    deny from 104.132.21.96/28
    deny from 104.132.210.0/23
    deny from 104.132.212.0/22
    deny from 104.132.216.0/21
    deny from 104.132.22.0/24
    deny from 104.132.224.0/19
    deny from 104.132.23.0/24
    deny from 104.132.24.0/26
    deny from 104.132.24.128/25
    deny from 104.132.24.64/26
    deny from 104.132.25.0/24
    deny from 104.132.26.0/24
    deny from 104.132.27.0/24
    deny from 104.132.28.0/24
    deny from 104.132.29.0/24
    deny from 104.132.30.0/23
    deny from 104.132.32.0/24
    deny from 104.132.33.0/24
    deny from 104.132.34.0/24
    deny from 104.132.35.0/24
    deny from 104.132.36.0/22
    deny from 104.132.40.0/21
    deny from 104.132.48.0/22
    deny from 104.132.52.0/23
    deny from 104.132.54.0/24
    deny from 104.132.55.0/24
    deny from 104.132.56.0/21
    deny from 104.132.64.0/18
    deny from 104.132.8.0/22
    deny from 104.133.0.0/17
    deny from 104.133.128.0/18
    deny from 104.133.192.0/19
    deny from 104.133.224.0/20
    deny from 104.133.240.0/21
    deny from 104.133.248.0/24
    deny from 104.133.249.0/24
    deny from 104.133.250.0/23
    deny from 104.133.252.0/22
    deny from 104.134.0.0/16
    deny from 104.135.0.0/17
    deny from 104.135.128.0/18
    deny from 104.135.192.0/19
    deny from 104.135.224.0/19
    deny from 104.154.0.0/15
    deny from 104.196.0.0/15
    deny from 104.198.0.0/16
    deny from 104.199.0.0/17
    deny from 104.199.128.0/20
    deny from 104.199.144.0/23
    deny from 104.199.146.0/24
    deny from 104.199.147.0/24
    deny from 104.199.148.0/22
    deny from 104.199.152.0/21
    deny from 104.199.160.0/19
    deny from 104.199.192.0/18
    deny from 107.167.160.0/19
    deny from 107.178.192.0/18
    deny from 108.170.192.0/20
    deny from 108.170.208.0/21
    deny from 108.170.216.0/24
    deny from 108.170.217.0/25
    deny from 108.170.217.128/28
    deny from 108.170.217.160/27
    deny from 108.170.217.192/26
    deny from 108.170.218.0/23
    deny from 108.170.220.0/22
    deny from 108.170.224.0/19
    deny from 108.177.0.0/17
    deny from 108.59.80.0/24
    deny from 108.59.81.0/27
    deny from 108.59.82.0/23
    deny from 108.59.84.0/22
    deny from 108.59.88.0/22
    deny from 108.59.92.0/27
    deny from 108.59.92.128/26
    deny from 108.59.92.192/27
    deny from 108.59.92.96/27
    deny from 108.59.93.0/27
    deny from 108.59.93.192/26
    deny from 108.59.93.32/29
    deny from 108.59.93.40/31
    deny from 108.59.93.43/32
    deny from 108.59.93.44/30
    deny from 108.59.93.48/28
    deny from 108.59.93.64/26
    deny from 108.59.94.0/28
    deny from 108.59.94.128/26
    deny from 108.59.94.16/29
    deny from 108.59.94.192/28
    deny from 108.59.94.208/29
    deny from 108.59.94.240/28
    deny from 108.59.94.32/27
    deny from 108.59.94.64/26
    deny from 108.59.95.0/24
    deny from 12.216.80.0/24
    deny from 12.234.149.240/29
    deny from 125.16.7.72/30
    deny from 125.17.82.112/30
    deny from 128.177.109.0/26
    deny from 128.177.119.128/25
    deny from 128.177.163.0/25
    deny from 130.211.0.0/16
    deny from 142.250.0.0/15
    deny from 146.148.0.0/17
    deny from 162.216.148.0/22
    deny from 162.222.176.0/21
    deny from 172.102.8.0/21
    deny from 172.217.0.0/16
    deny from 172.253.0.0/16
    deny from 173.194.0.0/18
    deny from 173.194.100.0/22
    deny from 173.194.104.0/21
    deny from 173.194.112.0/20
    deny from 173.194.128.0/17
    deny from 173.194.64.0/19
    deny from 173.194.96.0/24
    deny from 173.194.97.0/24
    deny from 173.194.98.0/24
    deny from 173.194.99.0/24
    deny from 173.255.112.0/22
    deny from 173.255.116.0/25
    deny from 173.255.116.128/26
    deny from 173.255.116.192/27
    deny from 173.255.117.128/25
    deny from 173.255.117.32/27
    deny from 173.255.117.64/26
    deny from 173.255.118.0/23
    deny from 173.255.120.0/24
    deny from 173.255.121.0/25
    deny from 173.255.121.128/26
    deny from 173.255.122.128/26
    deny from 173.255.122.64/26
    deny from 173.255.123.0/24
    deny from 173.255.124.0/27
    deny from 173.255.124.128/29
    deny from 173.255.124.144/28
    deny from 173.255.124.160/27
    deny from 173.255.124.192/27
    deny from 173.255.124.232/29
    deny from 173.255.124.240/29
    deny from 173.255.124.32/28
    deny from 173.255.124.48/29
    deny from 173.255.124.64/26
    deny from 173.255.125.0/27
    deny from 173.255.125.128/25
    deny from 173.255.125.72/29
    deny from 173.255.125.80/28
    deny from 173.255.125.96/27
    deny from 173.255.126.0/23
    deny from 180.87.33.64/26
    deny from 192.104.160.0/23
    deny from 192.158.28.0/22
    deny from 192.178.0.0/15
    deny from 195.16.45.144/29
    deny from 198.108.100.192/28
    deny from 199.192.112.0/25
    deny from 199.192.112.128/26
    deny from 199.192.112.192/27
    deny from 199.192.112.224/29
    deny from 199.192.113.0/25
    deny from 199.192.113.128/27
    deny from 199.192.113.176/28
    deny from 199.192.113.192/26
    deny from 199.192.114.0/25
    deny from 199.192.114.192/26
    deny from 199.192.115.0/28
    deny from 199.192.115.128/25
    deny from 199.192.115.80/28
    deny from 199.192.115.96/27
    deny from 199.223.232.0/21
    deny from 203.222.167.144/28
    deny from 206.160.135.240/28
    deny from 207.223.160.0/20
    deny from 208.184.125.240/28
    deny from 208.21.209.0/28
    deny from 208.44.48.240/29
    deny from 208.46.199.160/29
    deny from 209.185.108.128/25
    deny from 209.85.128.0/17
    deny from 213.155.151.128/26
    deny from 213.200.103.128/26
    deny from 213.200.99.192/26
    deny from 216.109.75.80/28
    deny from 216.136.145.128/27
    deny from 216.239.32.0/24
    deny from 216.239.33.0/29
    deny from 216.239.33.104/29
    deny from 216.239.33.112/28
    deny from 216.239.33.128/25
    deny from 216.239.33.16/28
    deny from 216.239.33.32/29
    deny from 216.239.33.40/29
    deny from 216.239.33.48/28
    deny from 216.239.33.64/27
    deny from 216.239.33.8/29
    deny from 216.239.33.96/29
    deny from 216.239.34.0/24
    deny from 216.239.35.0/24
    deny from 216.239.36.0/23
    deny from 216.239.38.0/24
    deny from 216.239.39.0/24
    deny from 216.239.40.0/22
    deny from 216.239.44.0/23
    deny from 216.239.46.0/23
    deny from 216.239.48.0/22
    deny from 216.239.52.0/23
    deny from 216.239.54.0/24
    deny from 216.239.55.0/28
    deny from 216.239.55.128/27
    deny from 216.239.55.16/29
    deny from 216.239.55.160/29
    deny from 216.239.55.168/29
    deny from 216.239.55.176/28
    deny from 216.239.55.192/26
    deny from 216.239.55.24/29
    deny from 216.239.55.32/27
    deny from 216.239.55.64/26
    deny from 216.239.56.0/21
    deny from 216.252.220.0/22
    deny from 216.33.229.144/29
    deny from 216.33.229.160/29
    deny from 216.34.7.176/28
    deny from 216.58.192.0/19
    deny from 216.74.130.48/28
    deny from 216.74.153.0/27
    deny from 217.118.234.96/28
    deny from 23.236.48.0/20
    deny from 23.251.128.0/19
    deny from 4.3.2.0/24
    deny from 41.206.188.128/26
    deny from 61.246.190.124/30
    deny from 61.246.224.136/30
    deny from 63.158.137.224/29
    deny from 63.161.156.0/24
    deny from 63.166.17.128/25
    deny from 63.226.245.56/29
    deny from 63.237.119.112/29
    deny from 63.88.22.0/23
    deny from 64.124.98.104/29
    deny from 64.233.160.0/23
    deny from 64.233.162.0/24
    deny from 64.233.163.0/24
    deny from 64.233.164.0/22
    deny from 64.233.168.0/21
    deny from 64.233.176.0/20
    deny from 64.41.146.208/28
    deny from 64.41.221.192/28
    deny from 64.68.64.64/26
    deny from 64.68.80.0/20
    deny from 64.71.148.240/29
    deny from 64.9.224.0/19
    deny from 65.167.144.64/28
    deny from 65.170.13.0/28
    deny from 65.171.1.144/28
    deny from 65.216.183.0/24
    deny from 65.220.13.0/24
    deny from 66.102.0.0/21
    deny from 66.102.12.0/23
    deny from 66.102.14.0/25
    deny from 66.102.14.128/30
    deny from 66.102.14.132/31
    deny from 66.102.14.134/31
    deny from 66.102.14.136/29
    deny from 66.102.14.144/28
    deny from 66.102.14.160/27
    deny from 66.102.14.192/26
    deny from 66.102.15.0/24
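For anyone applying a list like the one above in application code instead of server config, here is a minimal sketch using Python's standard `ipaddress` module. Only four ranges are copied from the list as a sample, the helper name `is_denied` is hypothetical, and the ranges are as posted in 2016, so they may no longer reflect Google's current allocations.

```python
import ipaddress

# A small sample of CIDR ranges copied from the list above (not the full list).
DENY_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "104.132.0.0/21",
        "64.233.160.0/23",
        "66.102.0.0/21",
        "216.239.32.0/24",
    )
]


def is_denied(client_ip: str) -> bool:
    """Return True if client_ip falls inside any of the denied networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in DENY_RANGES)
```

With several hundred ranges a linear scan like this is still cheap; a prefix tree would only matter at much larger scale.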
    xiaoz
        11
    xiaoz  
       2016-12-01 22:01:01 +08:00   ❤️ 1
    Check your site's robots.txt with Google Webmaster Tools. I once had a robots.txt that contained a BOM header, and Google reported an error for it.
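A quick way to rule out the BOM problem mentioned here is to inspect the first bytes of the file. The sketch below assumes the robots.txt content has already been read as bytes; the helper name `strip_utf8_bom` is hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Simulated file content that starts with a BOM, as some editors save it.
with_bom = codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /\n"
cleaned = strip_utf8_bom(with_bom)
```

Writing `cleaned` back to disk gives a file whose first line starts directly with `User-agent`.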
    Vicer
        12
    Vicer  
       2016-12-01 23:48:02 +08:00 via Android
    Learning from this.
    Showfom
        13
    Showfom  
       2016-12-02 09:08:13 +08:00
    @stamaimer Google's crawler IPs are basically all listed here:

    http://bgp.he.net/AS15169#_prefixes

    Block all of them and the problem goes away; I have tested it myself.
    stamaimer
        14
    stamaimer  
       2016-12-02 10:13:11 +08:00 via iPhone
    Learned something, comrades.