PHP scraper for Taobao and Tmall shop products

    hpxl · 2014-04-29 22:07:06 +08:00 · 15011 views
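    It gets around Taobao's anti-scraping measures and scrapes a shop's products quickly through proxies. Product data and images are saved to the ./data directory by default.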

    https://github.com/hpxl/fetch-taobao-goods
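    If you find it useful, stars are welcome.
    Postscript 1  ·  2014-04-30 22:28:30 +08:00
    1. Fixed product scraping failing when a Taobao shop has no shop categories.
    2. The script requires the curl extension to be enabled.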
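A script that depends on curl can check for the extension up front and print a readable message instead of dying later with `Call to undefined function curl_init()`. A minimal sketch, assuming a hypothetical helper `missingExtensions()` that is not part of the fetch-taobao-goods repository:

```php
<?php
// Hypothetical helper (not from the repository): returns the names of the
// required extensions that are NOT loaded in the current PHP runtime.
function missingExtensions(array $required): array
{
    return array_values(array_filter($required, function ($name) {
        return !extension_loaded($name);
    }));
}
```

One way to use it is to call `missingExtensions(['curl'])` at the top of the entry script and exit with a message when the result is non-empty. On Windows builds of that era, enabling the extension typically meant uncommenting the `extension=php_curl.dll` line in php.ini.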
    18 replies    2014-09-03 14:30:13 +08:00
    #1 · sadara · 2014-04-29 22:49:51 +08:00 via iPhone
    I remember there is a Taobao affiliate (淘宝客) program called 单店宝.
    #2 · mahone3297 · 2014-04-29 23:35:37 +08:00
    Forked...
    #3 · leyle · 2014-04-29 23:55:55 +08:00 via Android
    This is interesting. Following for now; I'll take a look on my computer during the day.
    #4 · bigshan · 2014-04-30 01:49:46 +08:00 via iPhone
    I'll take a look on my computer tomorrow.
    #5 · huangsong · 2014-04-30 10:35:31 +08:00
    Forking this.
    #6 · aWangami · 2014-04-30 12:40:28 +08:00
    C:\Users\Administrator\Desktop\Fetch-Taobao>php fetch.php 'http://shop65262430.taobao.com'
    PHP Warning: file_put_contents(/tmp/fetchgoods.pid): failed to open stream: No such file or directory in C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php on line 13
    PHP Stack trace:
    PHP 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    PHP 2. file_put_contents() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:13

    Warning: file_put_contents(/tmp/fetchgoods.pid): failed to open stream: No such file or directory in C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php on line 13

    Call Stack:
    0.0010 127528 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    0.0010 128008 2. file_put_contents() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:13

    PHP Notice: Undefined index: scheme in C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php on line 59
    PHP Stack trace:
    PHP 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    PHP 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
    PHP 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50

    Notice: Undefined index: scheme in C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php on line 59

    Call Stack:
    0.0010 127528 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    0.0068 192584 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
    0.0068 193152 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50

    PHP Notice: Undefined index: host in C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php on line 59
    PHP Stack trace:
    PHP 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    PHP 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
    PHP 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50

    Notice: Undefined index: host in C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php on line 59

    Call Stack:
    0.0010 127528 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    0.0068 192584 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
    0.0068 193152 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50

    shop_url:'http://shop65262430.taobao.com' ... start_time:04-29 15:19:11 ... start!
    PHP Fatal error: Call to undefined function curl_init() in C:\Users\Administrator\Desktop\Fetch-Taobao\HttpFetch.class.php on line 127
    PHP Stack trace:
    PHP 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    PHP 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
    PHP 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50
    PHP 4. HttpFetch->get() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:74
    PHP 5. HttpFetch->disguise_curl() C:\Users\Administrator\Desktop\Fetch-Taobao\HttpFetch.class.php:29

    Fatal error: Call to undefined function curl_init() in C:\Users\Administrator\Desktop\Fetch-Taobao\HttpFetch.class.php on line 127

    Call Stack:
    0.0010 127528 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    0.0068 192584 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
    0.0068 193152 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50
    0.0215 197880 4. HttpFetch->get() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:74
    0.0215 197896 5. HttpFetch->disguise_curl() C:\Users\Administrator\Desktop\Fetch-Taobao\HttpFetch.class.php:29

    PHP Warning: unlink(/tmp/fetchgoods.pid): No such file or directory in C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php on line 15
    PHP Stack trace:
    PHP 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    PHP 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
    PHP 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50
    PHP 4. HttpFetch->get() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:74
    PHP 5. HttpFetch->disguise_curl() C:\Users\Administrator\Desktop\Fetch-Taobao\HttpFetch.class.php:29
    PHP 6. removePidFile() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    PHP 7. unlink() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:15

    Warning: unlink(/tmp/fetchgoods.pid): No such file or directory in C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php on line 15

    Call Stack:
    0.0010 127528 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    0.0068 192584 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
    0.0068 193152 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50
    0.0215 197880 4. HttpFetch->get() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:74
    0.0215 197896 5. HttpFetch->disguise_curl() C:\Users\Administrator\Desktop\Fetch-Taobao\HttpFetch.class.php:29
    0.0342 194016 6. removePidFile() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
    0.0342 194128 7. unlink() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:15


    C:\Users\Administrator\Desktop\Fetch-Taobao>
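Three separate problems are visible in this output. `/tmp` does not exist on this Windows machine, so the PID file cannot be written. cmd.exe passes the single quotes through as part of the argument (the `shop_url:'http://...'` line shows them), which would explain why `parse_url()` yields no `scheme` or `host`. And the curl extension is not loaded. The first two can be handled in code. A minimal sketch, assuming PHP 7.1 or later and two hypothetical helpers, `pidFilePath()` and `normalizeShopUrl()`, neither of which comes from the repository:

```php
<?php
// Hypothetical helper: build the PID-file path from the OS temp directory
// instead of a hard-coded /tmp, which a typical Windows setup does not have.
function pidFilePath(string $name = 'fetchgoods.pid'): string
{
    return rtrim(sys_get_temp_dir(), '/\\') . DIRECTORY_SEPARATOR . $name;
}

// Hypothetical helper: strip whitespace and stray quote characters from a
// command-line argument, then accept it only if it has a scheme and a host.
function normalizeShopUrl(string $arg): ?string
{
    $url = trim($arg, " \t\n\r\"'");
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // not an absolute URL
    }
    return $url;
}
```

On Windows, running `php fetch.php http://shop65262430.taobao.com` without quotes, or with double quotes, also avoids the quoting problem.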
    #7 · andyhu · 2014-05-01 04:45:49 +08:00
    Bookmarking to follow this, though doing scraping in PHP is a bit too painful.
    #8 · ptsa · 2014-05-02 22:28:53 +08:00
    @sadara $1199. The Taobao affiliate (淘宝客) business is hard to make work these days, isn't it?
    #9 · ptsa · 2014-05-02 22:30:23 +08:00
    @sadara And it's still last year's version. No idea whether it works well.
    #10 · hanchengluo · 2014-05-03 10:22:41 +08:00
    @andyhu I scrape with PHP too. 2 GB of data took almost a month. Do you have a better recommendation?
    #11 · andyhu · 2014-05-03 10:43:43 +08:00
    @hanchengluo Try node.js + request + cheerio. I actually use PHP at work, but for jobs that involve fetching remote pages, going back to PHP after using that combination feels very painful.
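    #12 · andyhu · 2014-05-03 10:45:02 +08:00
    @ptsa What is the main difficulty with the Taobao affiliate business? I heard that 蘑菇街 (Mogujie) and 美丽说 (Meilishuo) have both pivoted. What exactly is the situation?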
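    #13 · hanchengluo · 2014-05-03 10:52:18 +08:00
    @andyhu It's mainly extracting tags and then storing them in a database, so the main pressure should be fetch speed and database I/O. I don't think it has much to do with the program used.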
    www.smartweb.cn
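For the tag-extraction step, PHP's bundled DOM extension can pull elements out of a fetched page without regular expressions. A minimal sketch, assuming the goal is to collect links; the function name `extractLinks()` and the choice of `<a>` elements are illustrative only:

```php
<?php
// Hypothetical helper: return every link in an HTML string as
// ['href' => ..., 'text' => ...], skipping anchors without an href.
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = ['href' => $href, 'text' => trim($anchor->textContent)];
        }
    }
    return $links;
}
```

The returned rows are plain arrays, so they can be batched into whatever database layer is in use rather than inserted one at a time, which is usually where the I/O cost goes.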
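    #14 · andyhu · 2014-05-03 10:57:42 +08:00
    HTML parsing also takes time. On top of that, PHP doesn't support multithreading, so every request has to wait, which is slow. For the database I use MongoDB, and it is still very fast.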
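Even without threads, requests in PHP do not have to run one after another: the curl extension's `curl_multi_*` functions drive several transfers concurrently from a single process. A minimal sketch, assuming a hypothetical `fetchAll()` helper and that the curl extension is loaded:

```php
<?php
// Hypothetical helper: fetch several URLs concurrently with curl_multi.
// Returns [url => response body, or null if that transfer failed].
function fetchAll(array $urls, int $timeoutSeconds = 10): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach (array_unique($urls) as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(1000); // select failed; back off briefly instead of spinning
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Per-transfer result codes are reported through the multi handle.
    $failed = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        if ($info['result'] !== CURLE_OK) {
            $failed[] = $info['handle'];
        }
    }

    $results = [];
    foreach ($handles as $url => $ch) {
        $body = curl_multi_getcontent($ch);
        $ok = !in_array($ch, $failed, true) && is_string($body);
        $results[$url] = $ok ? $body : null;
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);
    return $results;
}
```

Passing the URLs in batches of a few dozen keeps the number of simultaneous connections bounded; the batch size is a tuning choice, not something the thread specifies.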
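    #15 · andyhu · 2014-05-03 11:01:49 +08:00
    @hanchengluo I just looked at your website. What do you use for the page snapshots? Is it done with phantomjs? Node has thumbbot, which is quite powerful and handles thumbnail previews for web pages, images, and video alike. It is based on phantomjs, though, so if you need to capture pages with Flash you would probably still need a specially customized build; the older phantomjs no longer supports Flash. Overall, for scraping I feel PHP and node.js are simply not comparable. Even Python is much nicer to use than PHP, and it has quite a few specialized crawler modules.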
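    #16 · hanchengluo · 2014-05-03 11:14:40 +08:00
    @andyhu Thanks for stopping by. I only use CI on PHP and I'm not familiar with JS. I once wanted to build a crawler and learn GoLang for it, but I didn't stick with it and went back to PHP. I'm getting old and can't pick up new things anymore. I plan to turn the site into a small portal and am still working out the idea: without scraping there is no content, but I'm afraid scraped content will get the site banned.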
    #17 · laodao · 2014-05-03 12:14:27 +08:00
    #18 · ym1623 · 2014-09-03 14:30:13 +08:00
    I found that this project of yours doesn't work; it still gets blocked by Tmall...