This topic was created 3007 days ago; the information mentioned may have changed since then.

HTML Parsing

A clean HTML parsing library, intended to replace the more complex beautifulsoup4, pyquery, and lxml.

github: https://github.com/gaojiuli/htmlparsing

Installation

pip install htmlparsing

# or

pip install git+https://github.com/gaojiuli/htmlparsing

Usage

import requests

from htmlparsing import Element

url = 'https://python.org'
r = requests.get(url)

Initialization

e = Element(text=r.text, base_url=url)

Get the links in the page

e.links
"""
{...'/users/membership/', '/events/python-events', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'}
"""


e.absolute_links
"""
{...'https://python.org/download/alternatives',  'https://python.org/about/success/#software-development', 'https://python.org/download/other/', 'https://python.org/community/irc/'}
"""
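The difference between links and absolute_links above is that relative paths are resolved against base_url. As a rough illustration only (this is a standard-library sketch, not necessarily how htmlparsing does it internally), the same resolution can be reproduced with urllib.parse.urljoin. The sample links are copied from the output above.

```python
from urllib.parse import urljoin

# Base URL as passed to Element(...) in the post.
base_url = 'https://python.org'

# A few relative links taken from the e.links output above.
relative_links = [
    '/users/membership/',
    '/events/python-events',
    '//docs.python.org/3/tutorial/controlflow.html#defining-functions',
]

# Resolve each link against the base URL; protocol-relative links
# (starting with //) inherit only the scheme from the base.
absolute_links = {urljoin(base_url, link) for link in relative_links}

for link in sorted(absolute_links):
    print(link)
```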

Selectors and attribute selection

e.xpath('//a')[0].attrs
"""{'href': '#content', 'title': 'Skip to content'}"""

e.xpath('//a')[0].attrs.title
"""Skip to content"""

e.css('a')[0].attrs
"""{'href': '#content', 'title': 'Skip to content'}"""

e.parse('<a href="#content" title="Skip to content">{}</a>')
"""<Result ('Skip to content',) {}>"""
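The parse selector above matches a template in which {} marks the part to capture. To make that idea concrete, here is a minimal sketch using only the re module. The helper template_to_regex and the sample HTML string are hypothetical, written for this illustration; this is not the library's implementation.

```python
import re

def template_to_regex(template):
    """Turn a template with {} placeholders into a compiled regex.

    Hypothetical helper for illustration: literal parts are escaped,
    and each {} becomes a non-greedy capture group.
    """
    literal_parts = template.split('{}')
    return re.compile('(.+?)'.join(re.escape(part) for part in literal_parts))

# Sample HTML made up for this sketch.
html = '<p><a href="#content" title="Skip to content">Skip to content</a></p>'

pattern = template_to_regex('<a href="#content" title="Skip to content">{}</a>')
match = pattern.search(html)
captured = match.groups() if match else ()
print(captured)
```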

Get the text content and the full HTML

e.xpath('//a')[5].text
"""PyPI"""

e.xpath('//a')[5].html
"""<a href="https://pypi.python.org/" title="Python Package Index">PyPI</a>"""

e.xpath('//a')[5].markdown
"""[PyPI]( https://pypi.python.org/ "Python Package Index")"""
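For comparison, the standard library can do something similar on well-formed markup. The sketch below uses xml.etree.ElementTree, which supports only a limited XPath subset and needs well-formed XML, so it is not a drop-in replacement for a real HTML parser. The sample document is made up for the example.

```python
import xml.etree.ElementTree as ET

# Small, well-formed sample document (an assumption for this sketch;
# real-world HTML is often not valid XML).
doc = (
    '<div>'
    '<a href="#content" title="Skip to content">Skip to content</a>'
    '<a href="https://pypi.python.org/" title="Python Package Index">PyPI</a>'
    '</div>'
)

root = ET.fromstring(doc)

# './/a' is the ElementTree spelling of "all <a> descendants".
anchors = root.findall('.//a')

first_attrs = anchors[0].attrib      # attributes as a dict
pypi_text = anchors[1].text          # text content of the element
pypi_html = ET.tostring(anchors[1], encoding='unicode')  # serialized element

print(first_attrs)
print(pypi_text)
print(pypi_html)
```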

Currently supported selectors: xpath, css, parse
