I wrote (copied) a few small tools (toys). V2EX friends, care to take a look?

This topic was created 2651 days ago; the information in it may have since evolved or changed.

Repository: https://github.com/tonyxyl/tools

web_scanner.py

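This script brute-forces a website's routes with a wordlist. The goal is to find hidden links that have been associated with known vulnerabilities, such as admin login pages and backed-up configuration files.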
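For readers who want a concrete picture, a wordlist scan of this kind might look roughly like the sketch below. It is not the code from the repository: the function names `http_status` and `scan`, the use of HEAD requests, and the rule "anything other than 404 counts as a hit" are all assumptions made for illustration. The demo uses a fake status lookup and the placeholder host `example.test`, so it does not touch the network.

```python
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def http_status(url, timeout=5):
    """Return the HTTP status code for url, or None if the host is unreachable."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is still useful.
        return err.code
    except OSError:
        # DNS failure, refused connection, timeout, and similar.
        return None


def scan(base_url, paths, get_status=http_status):
    """Try every path from the wordlist and keep those that do not return 404."""
    base = base_url.rstrip("/") + "/"
    found = []
    for path in paths:
        url = urljoin(base, path.lstrip("/"))
        status = get_status(url)
        if status is not None and status != 404:
            found.append((url, status))
    return found


# Demo with a fake status table (hypothetical host and paths).
fake = {
    "http://example.test/admin/": 200,
    "http://example.test/config.bak": 403,
}
hits = scan(
    "http://example.test",
    ["admin/", "config.bak", "nothing"],
    get_status=lambda url: fake.get(url, 404),
)
print(hits)
```

Passing the status lookup in as a parameter keeps the scanning loop testable without a live server; dropping the `get_status` argument makes `scan` use real HEAD requests.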

web_runtime.py

Once this script is running, you can execute Python code snippets from a browser page.

mail_client.py

A simple mail-sending client written with the tkinter module.

restart_windows.py

Schedules a timed shutdown task on Windows, and cancels a scheduled shutdown.
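One way to schedule and cancel a shutdown on Windows is the built-in `shutdown` command (`/s` to shut down, `/t` for the delay in seconds, `/a` to abort). The sketch below only illustrates that idea; whether the repository's script works this way is an assumption, and the helper names `shutdown_command`, `abort_command`, and `run` are made up here. By default it only prints the command instead of executing it.

```python
import subprocess


def shutdown_command(delay_seconds):
    """Build the Windows command that shuts the machine down after a delay."""
    if delay_seconds < 0:
        raise ValueError("delay_seconds must be non-negative")
    return ["shutdown", "/s", "/t", str(int(delay_seconds))]


def abort_command():
    """Build the Windows command that cancels a pending shutdown."""
    return ["shutdown", "/a"]


def run(command, dry_run=True):
    """Print the command (dry run) or actually execute it."""
    if dry_run:
        print(" ".join(command))
        return None
    return subprocess.run(command, check=True)


run(shutdown_command(3600))  # prints: shutdown /s /t 3600
run(abort_command())         # prints: shutdown /a
```

Calling `run(..., dry_run=False)` on a Windows machine would really schedule or cancel the shutdown.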

douban_comment.ipynb

Scrapes short reviews of Douban movies, segments them with jieba, and generates a word cloud.

Also: I'll keep updating this whenever I have new ideas.


19 replies • 2018-01-24 23:02:47 +08:00

jiezhi

2018-01-22 16:25:51 +08:00

Bookmarked.

owenliang

2018-01-22 16:28:13 +08:00

Simple and fun.

twotiger

2018-01-22 17:33:07 +08:00

web_runtime.py only works when opened via localhost; with 127.0.0.1 it returns an error. Still looking at the others.

Applenice

2018-01-22 17:42:52 +08:00

Starred. I'll take a look.

panpanpan

2018-01-22 21:50:32 +08:00

The word cloud one probably doesn't work anymore, does it? Douban Movies now returns at most 500 short reviews, even after you log in.

xuyl

2018-01-22 22:06:41 +08:00

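@panpanpan It works; I tested it on both Mac and Windows. As for how many pages it can return: without logging in, you easily trigger the anti-scraping measures. When logged in, I never crawled all the pages, because single-threaded recursive crawling is too slow, but clicking through manually does let you page through all of them.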

panpanpan

2018-01-22 22:12:33 +08:00

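@xuyl #6 Can you try it now? I noticed this a while ago, and I checked again just before replying to you: at start=480 there is no next page to click, and that's while logged in.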
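The two numbers in this exchange agree with each other: with 20 reviews per page (the `limit=20` in the script's comment URL), the page at start=480 is the 25th page, and 25 × 20 = 500. A small sketch of that arithmetic, using the query parameters visible in the script's URL; the helper name `comment_page_urls` is made up here, the empty `percent_type=` parameter is left out for brevity, and the subject URL is the test link posted later in this thread.

```python
from urllib.parse import urlencode, urljoin


def comment_page_urls(subject_url, limit=20, max_start=480):
    """Yield comment-page URLs for start = 0, limit, 2*limit, ... up to max_start."""
    for start in range(0, max_start + 1, limit):
        query = urlencode({
            "start": start,
            "limit": limit,
            "sort": "new_score",
            "status": "P",
        })
        yield urljoin(subject_url, "comments?" + query)


urls = list(comment_page_urls("https://movie.douban.com/subject/26942674/"))
print(len(urls))       # 25 pages
print(len(urls) * 20)  # 500 reviews at most
print(urls[-1])
```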

nuanyang

2018-01-22 22:12:58 +08:00 via iPhone

Mark. Thanks for sharing.

aice114

2018-01-22 23:10:03 +08:00 via Android

Marking this. I'll look when I have time.

bob1994

2018-01-23 07:44:59 +08:00

The Douban one fails with OSError: cannot open resource
What does this mean?

xuyl

2018-01-23 09:41:25 +08:00

@bob1994 It has to be run in Jupyter Notebook.

bob1994

2018-01-23 11:51:48 +08:00

@xuyl I did run it in Jupyter Notebook, and that's when the problem appeared.
I'm a beginner and can't make sense of the error.

OSError Traceback (most recent call last)
<ipython-input-2-8e2fa2e30a58> in <module>()
101 comment_url = urljoin(url, 'comments?start=0&limit=20&sort=new_score&status=P&percent_type=')
102 parse_comment(comment_url, headers)
--> 103 generate_cloud()

<ipython-input-2-8e2fa2e30a58> in generate_cloud()
72 word_frequence = {x[0]:x[1] for x in words_stat.head(1000).values}
73
---> 74 wordcloud = wordcloud.fit_words(word_frequence)
75 plt.imshow(wordcloud)
76

/usr/local/lib/python3.6/site-packages/wordcloud/wordcloud.py in fit_words(self, frequencies)
329 self
330 """
--> 331 return self.generate_from_frequencies(frequencies)
332
333 def generate_from_frequencies(self, frequencies, max_font_size=None):

/usr/local/lib/python3.6/site-packages/wordcloud/wordcloud.py in generate_from_frequencies(self, frequencies, max_font_size)
430 while True:
431 # try to find a position
--> 432 font = ImageFont.truetype(self.font_path, font_size)
433 # transpose font optionally
434 transposed_font = ImageFont.TransposedFont(

/usr/local/lib/python3.6/site-packages/PIL/ImageFont.py in truetype(font, size, index, encoding, layout_engine)
258
259 try:
--> 260 return FreeTypeFont(font, size, index, encoding, layout_engine)
261 except IOError:
262 ttf_filename = os.path.basename(font)

/usr/local/lib/python3.6/site-packages/PIL/ImageFont.py in __init__(self, font, size, index, encoding, layout_engine)
141
142 if isPath(font):
--> 143 self.font = core.getfont(font, size, index, encoding, layout_engine=layout_engine)
144 else:
145 self.font_bytes = font.read()

OSError: cannot open resource

bob1994

2018-01-23 12:03:20 +08:00

@xuyl OK, solved. It was caused by the missing simhei.ttf font.
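The traceback above ends in `ImageFont.truetype(self.font_path, font_size)`, which is where a missing font file surfaces as the rather opaque "cannot open resource". A minimal sketch of checking the font up front so the failure is easier to read; the helper name `pick_font` and the candidate paths are assumptions, not part of the notebook.

```python
import os


def pick_font(candidates):
    """Return the first existing font file from candidates, or raise a clear error."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(
        "no usable font found; tried: " + ", ".join(candidates)
    )


try:
    # Hypothetical locations: the working directory, then the Windows font folder.
    font_path = pick_font(["simhei.ttf", "C:/Windows/Fonts/simhei.ttf"])
except FileNotFoundError as err:
    print(err)
else:
    # With the third-party wordcloud package installed, the path would be
    # passed to the constructor, for example:
    #   from wordcloud import WordCloud
    #   cloud = WordCloud(font_path=font_path).fit_words(word_frequence)
    print("using font:", font_path)
```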

xuyl

2018-01-23 12:07:05 +08:00

@bob1994 Nice. My oversight: the font does need to be loaded, and I forgot to mention it in the docs.

bob1994

2018-01-23 22:39:05 +08:00

@xuyl A new problem has come up: some movies scrape fine, others fail.
AttributeError Traceback (most recent call last)
<ipython-input-5-8e2fa2e30a58> in <module>()
100 print('\n================================\n\n')
101 comment_url = urljoin(url, 'comments?start=0&limit=20&sort=new_score&status=P&percent_type=')
--> 102 parse_comment(comment_url, headers)
103 generate_cloud()

<ipython-input-5-8e2fa2e30a58> in parse_comment(url, headers)
43 comment_divs = soup.find_all('div', class_='comment')
44 if comment_divs is not None:
---> 45 comments += ''.join([item.find('p').string.strip() for item in comment_divs])
46 next_page = soup.find('a', class_='next')
47 if next_page is not None and page < 30:

<ipython-input-5-8e2fa2e30a58> in <listcomp>(.0)
43 comment_divs = soup.find_all('div', class_='comment')
44 if comment_divs is not None:
---> 45 comments += ''.join([item.find('p').string.strip() for item in comment_divs])
46 next_page = soup.find('a', class_='next')
47 if next_page is not None and page < 30:

AttributeError: 'NoneType' object has no attribute 'strip'

Could you take a look at what this error is?
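One common cause of this error with Beautiful Soup: a tag's `.string` is `None` whenever the tag has more than one child (for example text plus a link or an emoji image), and `find('p')` itself returns `None` when there is no `<p>` at all. Whether that is what happened on the failing pages is an assumption; the thread does not say. A defensive version of line 45 could look like the sketch below, where `comment_text` and `join_comments` are hypothetical helper names and the HTML in the demo is invented.

```python
def comment_text(item):
    """Return the stripped text of the first <p> under item, or '' if there is none."""
    p = item.find('p')
    if p is None:
        return ''
    # get_text() concatenates all nested strings, so it never returns None,
    # unlike .string, which is None when the tag has more than one child.
    return p.get_text(strip=True)


def join_comments(comment_divs):
    """Concatenate the text of every comment block, skipping empty ones safely."""
    return ''.join(comment_text(item) for item in comment_divs)


try:
    from bs4 import BeautifulSoup  # third-party package: beautifulsoup4
except ImportError:
    BeautifulSoup = None

if BeautifulSoup is not None:
    html = (
        '<div class="comment"><p>great <a href="#">link</a> film</p></div>'
        '<div class="comment"></div>'
    )
    soup = BeautifulSoup(html, 'html.parser')
    divs = soup.find_all('div', class_='comment')
    print(divs[0].find('p').string)  # None: this <p> has three children
    print(join_comments(divs))
```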

xuyl

2018-01-23 23:05:11 +08:00

@bob1994 Thanks for the feedback. It's fixed now; please check the latest code.

bob1994

2018-01-24 20:56:20 +08:00

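@xuyl It seems there's still a problem. I'm not sure whether I'm being blocked by anti-scraping,
but it's the same error as before: 'NoneType' object has no attribute 'strip'.
My test URL: https://movie.douban.com/subject/26942674/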

xuyl

2018-01-24 22:20:33 +08:00

@bob1994 I just improved it a bit more and tested the link you gave; it works now.

bob1994

2018-01-24 23:02:47 +08:00

@xuyl Awesome~