
Crawler pagination problem: every page request triggers Set-Cookie with a new session ID

    bestehen · 2018-07-23 01:04:24 +08:00 · 2290 views

    #!/usr/bin/env python3
    import json
    import os
    import time

    import execjs
    import requests

    class qimingpian(object):

     def __init__(self):
         self.s=requests.session()
         self.js_file=os.getcwd()+"/"+"js_decrypt.js"
     def get_content(self):
         cookies={}
         cookies_str='Hm_lvt_d1cdd45a1d449d32c7b4dbab4915de60=1532161260; Hm_lpvt_d1cdd45a1d449d32c7b4dbab4915de60=1532161260; gr_user_id=0ac1c623-6d25-4c89-b0eb-beaccb4ed35c; time_token=1532254367533; unionid=ETXncbCRyisjw/hr0zeTaonhpvkz/81ntwbBWAKYE4wdmhbtHCwxkjwb+0gjVdRzeJWqqIs6kiQsM8IbOYgM5A==; Hm_lvt_1e712c5331439bcf163b46f3d208f00b=1532161262,1532252857,1532254027,1532254368; Hm_lpvt_1e712c5331439bcf163b46f3d208f00b=1532254368; userinfo={%22nickname%22:%22Wing%E3%80%82%22%2C%22headimgurl%22:%22http://thirdwx.qlogo.cn/mmopen/vi_32/Q0j4TwGTfTJzmBzIeVHkjp6IVAl3uWAgB4FYIC96KygBjBvY2qAHycK1OctdAcODsWMh8zJia3j9GCBOzR5Truw/132%22%2C%22coin%22:%2250%22%2C%22applySubmit%22:%220%22%2C%22team_flag%22:%220%22%2C%22team_uuid%22:%22%22%2C%22vip_out_date%22:%22%22%2C%22usernum%22:%22226256331%22%2C%22team_enterprise%22:%220%22%2C%22enterprise_coin%22:%220%22%2C%22is_admin%22:%220%22%2C%22is_manager%22:%220%22%2C%22first_shenqing%22:%220%22%2C%22phone%22:%2213161346498%22%2C%22apply_phone%22:%2213161346498%22%2C%22scope%22:%22qmp%22%2C%22apply_state%22:3%2C%22liyou%22:%22%22%2C%22is_certify%22:1%2C%22ip%22:%22106.37.197.194%22%2C%22person_role%22:%22%22%2C%22claim_type%22:0%2C%22expireinfo%22:false%2C%22inneruser%22:false%2C%22apply_pro_state%22:3%2C%22person_id%22:%22%22}'
         for line in cookies_str.split(';'):  # split the cookie string on ';'
             # maxsplit=1 splits each pair into at most 2 parts
             name, value = line.strip().split('=', 1)
             cookies[name] = value  # add the entry to the cookies dict
         url='http://pdf.api.qimingpian.com/t/getFileByPage1'
         headers={"Referer": "http://vip.qimingpian.com/","User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36","Host": "pdf.api.qimingpian.com","Accept": "application/json, text/plain, */*","Accept-Encoding": "gzip, deflate","Accept-Language": "en-US,en;q=0.9","Connection": "keep-alive","Content-Length": "183","Content-Type": "application/x-www-form-urlencoded","Origin": "http://vip.qimingpian.com"}
         for i in range(1,101):
           form_data={"page":"i","num":"40","w":"" ,"ptype": "qmp_pc","version": "2.0","unionid": "ETXncbCRyisjw/hr0zeTaonhpvkz/81ntwbBWAKYE4wdmhbtHCwxkjwb+0gjVdRzeJWqqIs6kiQsM8IbOYgM5A==","jtype": "vip","time_token": "1532254367533"}
           response=self.s.post(url=url,data=form_data,headers=headers,cookies=cookies)
           print(self.s.cookies)
           print(response.headers)
           print(response.text)
           json_data=json.loads(response.text)
           _js=open(self.js_file,'r').read()
           data=execjs.compile(_js).call('n',json_data['data1'])
           print(data)
           for j in range(0,len(data['items'])):
               name=data['items'][j]['name']
               report_source=data['items'][j]['report_source']
               update_time=data['items'][j]['update_time']
               url=data['items'][j]['url']
               print(name)
               print(report_source)
               print(update_time)
               print(url)
               print('\n')
    

    if __name__ == '__main__':
        qimingpian().get_content()

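    The code is short, and the JS decryption part is already solved. The current problem: this site sends Set-Cookie with a new session ID on every page request. Since I am using a requests session, the cookie should be updated dynamically, so why does the request still fail? Right now only the first page can be fetched; the request for the second page returns the following error: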

    <RequestsCookieJar[<Cookie PHPSESSID=3khddv90nbg11lu1ia8eld8ol3 for pdf.api.qimingpian.com/>]>
    {'Content-Type': 'text/html', 'Connection': 'keep-alive', 'Content-Length': '254', 'Via': 'kunlun6.cn24[,0]', 'Timing-Allow-Origin': '*', 'Date': 'Sun, 22 Jul 2018 16:49:17 GMT', 'EagleId': '7ae1224615322781579372751e', 'Server': 'Tengine', 'X-Tengine-Error': 'non-existent domain'}

    <html>
    <head><title>403 Forbidden</title></head>
    <body bgcolor="white">
    403 Forbidden
    You don't have permission to access the URL on this server.
    Powered by Tengine</body>

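    The data comes from http://vip.qimingpian.com/#/finos/investment/ireport: after opening the page, go to "创投数据" (venture investment data) and then "报告库" (report library). I can't figure out what I am doing wrong.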
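
Two details in the posted script may be worth checking before blaming the session cookie. This is a guess from the output shown, not a confirmed diagnosis. First, the inner loop assigns each report's `url` to the same `url` variable used for the POST, so the second request would go to the last report's address while still sending `Host: pdf.api.qimingpian.com`; a Tengine "non-existent domain" 403 is consistent with that. Second, `"page": "i"` sends the literal string `i` instead of the page number. A minimal sketch that keeps these apart follows; `API_URL`, `parse_cookie_string`, `build_form_data`, and `extract_reports` are hypothetical names introduced here for illustration.

```python
# Endpoint taken from the post; kept in a constant so nothing in the loop can overwrite it.
API_URL = "http://pdf.api.qimingpian.com/t/getFileByPage1"


def parse_cookie_string(cookie_str):
    """Turn 'k1=v1; k2=v2' into a dict, splitting each pair on the first '=' only."""
    cookies = {}
    for part in cookie_str.split(";"):
        part = part.strip()
        if not part:
            continue  # skip empty fragments such as a trailing ';'
        name, _, value = part.partition("=")
        cookies[name] = value
    return cookies


def build_form_data(page, unionid, time_token):
    """Build the POST body for one page, sending the actual page number."""
    return {
        "page": str(page),  # the original sent the literal string "i"
        "num": "40",
        "w": "",
        "ptype": "qmp_pc",
        "version": "2.0",
        "unionid": unionid,
        "jtype": "vip",
        "time_token": time_token,
    }


def extract_reports(data):
    """Collect the per-report fields; the report link is stored as 'report_url'."""
    return [
        {
            "name": item["name"],
            "report_source": item["report_source"],
            "update_time": item["update_time"],
            "report_url": item["url"],  # distinct name, so API_URL stays intact
        }
        for item in data["items"]
    ]
```

Inside the page loop, the request would then look like `self.s.post(API_URL, data=build_form_data(i, unionid, time_token), headers=headers, cookies=cookies)`. The hard-coded `Content-Length` header is unnecessary, since requests computes it from the body.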

    #1 · bestehen (OP) · 2018-07-23 01:25:14 +08:00
    Access-Control-Allow-Origin: *
    Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
    Connection: keep-alive
    Content-Type: application/json;charset=UTF-8
    Date: Sun, 22 Jul 2018 17:24:14 GMT
    Expires: Thu, 19 Nov 1981 08:52:00 GMT
    Pragma: no-cache
    Server: nginx/1.4.4
    Set-Cookie: PHPSESSID=g9o7ulf9399oldina7h8jvkef3; path=/
    Transfer-Encoding: chunked
    X-Powered-By: PHP/5.5.7
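
A `Set-Cookie` header like the one in these response headers is what a `requests.Session` normally stores and resends by itself, so a rotating `PHPSESSID` should not need manual handling. For inspecting such a header by hand while debugging, a small standard-library sketch follows; `session_id_from_set_cookie` is a hypothetical helper name, not part of requests.

```python
from http.cookies import SimpleCookie


def session_id_from_set_cookie(header_value, name="PHPSESSID"):
    """Return the value of the named cookie from a Set-Cookie header, or None if absent."""
    jar = SimpleCookie()
    jar.load(header_value)  # parses 'NAME=value; attr=...' into morsels
    morsel = jar.get(name)
    return morsel.value if morsel is not None else None


# Header value copied from the response headers in this reply.
sid = session_id_from_set_cookie("PHPSESSID=g9o7ulf9399oldina7h8jvkef3; path=/")
print(sid)
```

Comparing this value with what `self.s.cookies` holds after each request is one way to confirm whether the session is tracking the new ID.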