Parsing web pages with BeautifulSoup is very slow. Is there a comparable library I could use instead?


Also, how efficient would it be to rely purely on regex matching?

beautifulsoup

regex

alternatives

26 replies • 2014-05-06 16:11:27 +08:00

for4

Apr 28, 2014

I've always used lxml.

You could take a look at this:
http://www.cnblogs.com/rzhang/archive/2011/12/29/python-html-parsing.html

halfcrazy

Apr 28, 2014

@for4 If the performance difference really is tenfold, that's very tempting. I'll go try it now.

qonco

Apr 28, 2014

jsoup

qonco

Apr 28, 2014

Regex isn't meant for matching HTML.

Ever

Apr 28, 2014

@halfcrazy With Beautiful Soup, pass "lxml" as the second argument and it will use the lxml parser. No rewrite needed.

halfcrazy

Apr 28, 2014

@qonco jsoup is Java, though. Also, what I meant was using only regex to parse the page and extract content.

halfcrazy

Apr 28, 2014

@Ever Like this? soup = BeautifulSoup(page, "lxml")

halfcrazy

Apr 28, 2014

@Ever I switched to lxml's HTML parser, but the improvement doesn't seem very noticeable.

bilipan

Apr 28, 2014

You could try pyquery. Its syntax is similar to jQuery's.

binux

Apr 28, 2014

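Regex is far faster than building an XML tree. Using XPath directly is faster than soup or pyquery.
Even so, lxml handles 30 pages per second in a single process without trouble. Just increase the concurrency.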

flyer103

Apr 28, 2014 via Android

@binux How did you measure "lxml handles 30 pages per second in a single process", and how many data items are extracted from a single page on average?

binux

Apr 28, 2014

@flyer103 timeit, with 80 XPath rules per page.
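A measurement like the one described here (timeit, 80 rules per page) might be set up roughly as follows. This is only a sketch under several assumptions: the page is a made-up, well-formed table, the 80 rules are two stand-in expressions repeated, and the stdlib `xml.etree.ElementTree` (which supports only a limited XPath subset) is swapped in for lxml so the block runs without extra packages. The numbers it prints say nothing about lxml itself.

```python
import timeit
import xml.etree.ElementTree as ET

# Hypothetical page: 80 rows, well-formed so the stdlib parser accepts it.
ROWS = "".join(
    '<tr><td class="name">item %d</td><td class="price">%d</td></tr>' % (i, i)
    for i in range(80)
)
PAGE = "<html><body><table>%s</table></body></html>" % ROWS

# Stand-in rules: two expressions repeated to reach 80 rules per page.
RULES = [".//td[@class='name']", ".//td[@class='price']"] * 40


def parse_and_extract(page=PAGE):
    """Parse one page and apply every rule; return the total match count."""
    root = ET.fromstring(page)
    return sum(len(root.findall(rule)) for rule in RULES)


def pages_per_second(repeat=20):
    """Time `repeat` parse-and-extract passes and convert to pages per second."""
    elapsed = timeit.timeit(parse_and_extract, number=repeat)
    return repeat / elapsed


print("%.1f pages/s" % pages_per_second())
```

To compare real parsers, `parse_and_extract` would be replaced with the lxml or Beautiful Soup code under test and fed real pages, since synthetic markup tends to flatter every parser.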

andyhu

Apr 28, 2014

Does it have to be Python? nodejs + cheerio is great: you parse with full jQuery syntax, and it's fast too.

kxxoling

Apr 28, 2014 via iPad

bs has pitfalls! Use lxml!

187j3x1

Apr 28, 2014

For matching a bunch of identically structured content, regex is much more comfortable. Use regex whenever you can.
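For the repeated-structure case mentioned here, a regex-only extraction might look like the sketch below. The markup, the `ITEM_RE` pattern, and the `extract_items` helper are all hypothetical, and the pattern assumes every row has exactly this attribute order and no nested tags. As noted earlier in the thread, regex is not a general HTML parser, so this kind of pattern stops matching as soon as the markup shifts.

```python
import re

# Hypothetical markup: a run of identically structured rows.
PAGE = """
<ul>
  <li class="item"><a href="/t/1">First topic</a></li>
  <li class="item"><a href="/t/2">Second topic</a></li>
  <li class="item"><a href="/t/3">Third topic</a></li>
</ul>
"""

# Compile once; the non-greedy group stops at the first closing </a>.
ITEM_RE = re.compile(r'<li class="item"><a href="([^"]+)">(.*?)</a></li>')


def extract_items(html):
    """Return (href, title) pairs for every row matching the fixed pattern."""
    return ITEM_RE.findall(html)


for href, title in extract_items(PAGE):
    print(href, title)
```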

dreasky

Apr 28, 2014

From my own testing, regex is the fastest and the most flexible.

a2z

Apr 28, 2014

bs4

tomnee

Apr 28, 2014

pyquery wraps lxml. It performs better than bs and is fairly simple to use.

daiv

Apr 28, 2014

pyquery is quite pleasant to use.

walleL

Apr 28, 2014

I wonder whether everyone has noticed this feature. It's really nice.

okidogi

Apr 28, 2014

beautifulsoup4 uses the lxml library, so it should be somewhat faster.

pip install beautifulsoup4

halfcrazy

Apr 28, 2014

@okidogi I think the default is html.parser. It seems lxml has to be specified manually.
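One way to remove the guesswork about which parser is in use is to name it explicitly on every call. The sketch below assumes `beautifulsoup4` is installed and uses two hypothetical helpers, `pick_parser` and `make_soup`. It asks for "lxml" when that package can be imported and otherwise falls back to the stdlib "html.parser".

```python
import importlib.util


def pick_parser():
    """Return "lxml" if the lxml package is importable, else "html.parser"."""
    return "lxml" if importlib.util.find_spec("lxml") else "html.parser"


def make_soup(markup):
    """Build a soup with an explicitly named parser (hypothetical helper)."""
    # Imported here so pick_parser() stays usable without bs4 installed.
    from bs4 import BeautifulSoup

    return BeautifulSoup(markup, pick_parser())


print("parser in use:", pick_parser())
```

Passing the parser name as the second argument is the same call shape as `BeautifulSoup(page, "lxml")` quoted earlier in the thread. The helper only makes the fallback visible.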

chevalier

Apr 29, 2014

@walleL I use it all the time. It's a must-have for writing the XPath that parses pages in a crawler.

orancho

Apr 29, 2014 via Android

nokogiri

Ever

Apr 29, 2014

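@halfcrazy Is each individual document too large? If it's the number of documents that is large, try a thread pool. lxml releases the GIL, so it can make effective use of multiple cores.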
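The thread-pool suggestion could be sketched like this. `parse_all` and `count_links` are hypothetical names, and the parse step uses the stdlib `xml.etree.ElementTree` on made-up, well-formed pages purely so the block runs anywhere. The stdlib parser is unlikely to show the multi-core gain described here, so in practice an lxml-based function would be passed as `parse`.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET


def count_links(page):
    """Stand-in parse step: count <a> elements in one well-formed page."""
    return len(ET.fromstring(page).findall(".//a"))


def parse_all(pages, parse=count_links, max_workers=4):
    """Parse pages concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse, pages))


# Hypothetical input: page i contains i links.
PAGES = [
    "<html><body>%s</body></html>" % ('<a href="#">x</a>' * i)
    for i in range(1, 6)
]
print(parse_all(PAGES))
```

`pool.map` keeps results aligned with the input list, which makes it easy to pair each page with its extracted data afterwards.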

remnet

May 6, 2014

I've used beautifulsoup, and it does feel quite slow.