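I've been building a crawler with the Scrapy framework. The data looked fine for the first few days, but over the last few days the amount of scraped data has dropped noticeably. I don't think the code is the cause; looking at the log, some URLs were crawled but never scraped. What is going on?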
2019-04-01 00:00:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yellowpages.com/search?search_terms=Tires&geo_location_terms=Dundalk%2C+MD&page=4> (referer: https://www.yellowpages.com/search?search_terms=Tires&geo_location_terms=Dundalk%2C+MD)
2019-04-01 00:00:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yellowpages.com/search?search_terms=Tires&geo_location_terms=Dundalk%2C+MD&page=3> (referer: https://www.yellowpages.com/search?search_terms=Tires&geo_location_terms=Dundalk%2C+MD)
2019-04-01 00:00:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yellowpages.com/search?search_terms=Tires&geo_location_terms=Dundalk%2C+MD&page=3>
As shown above, page=3 was scraped, while page=4 was only crawled (200) and never scraped. Why is that? There are many cases like this.
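To see how widespread this is, the log can be scanned for URLs that have a "Crawled" line but no matching "Scraped from" line. The following is only a minimal sketch, assuming the log uses exactly the two DEBUG line formats quoted above; the function name `crawled_but_not_scraped` and the file name `scrapy.log` are hypothetical. As far as I know, Scrapy writes "Scraped from" only when a callback yields an item, so a URL in the result most likely means the callback produced nothing for that response (for example, the selectors matched nothing in the returned HTML).

```python
import re

# Patterns for the two Scrapy DEBUG lines quoted in the post.
CRAWLED = re.compile(r"\[scrapy\.core\.engine\] DEBUG: Crawled \((\d+)\) <GET (\S+)>")
SCRAPED = re.compile(r"\[scrapy\.core\.scraper\] DEBUG: Scraped from <\d+ (\S+)>")


def crawled_but_not_scraped(lines):
    """Return crawled URLs (in log order) that have no 'Scraped from' line."""
    crawled = []
    scraped = set()
    for line in lines:
        match = CRAWLED.search(line)
        if match:
            crawled.append(match.group(2))
            continue
        match = SCRAPED.search(line)
        if match:
            scraped.add(match.group(1))
    return [url for url in crawled if url not in scraped]


if __name__ == "__main__":
    # "scrapy.log" is a hypothetical path; point it at your own log file.
    try:
        with open("scrapy.log", encoding="utf-8") as log_file:
            for url in crawled_but_not_scraped(log_file):
                print(url)
    except FileNotFoundError:
        print("scrapy.log not found; pass your own log lines instead")
```

For the URLs it reports, comparing the saved response body with what a browser shows would tell you whether the site started returning a different page with status 200.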
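dylanhu OP: The key point is that this almost never happened a few days ago; the data only started dropping sharply in the last two days.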