Folks, are there any good open-source libraries for extracting the main content of web pages?

This topic was created 823 days ago; the information mentioned may have changed since then.

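I need to extract the main content from web pages. Are there any open-source libraries you have found really good for this?

I would also appreciate recommendations for open-source knowledge base products.

The plan is a pipeline: crawl web pages, extract the main content, load it into a knowledge base, and finally expose it through an API.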
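For anyone picturing how those four stages fit together, here is a rough sketch using only Python's standard library. Everything in it is an assumption for illustration: the function names (`fetch`, `extract_text`, `open_store`, `save`, `search_json`) are made up, the "knowledge base" is just a SQLite table, the "API" is a function returning JSON, and `extract_text` is a naive tag stripper standing in for a real extraction library.

```python
import json
import re
import sqlite3
import urllib.request
from html import unescape


def fetch(url, timeout=10.0):
    """Stage 1: download a page (assumes the page is UTF-8 encoded)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_text(page):
    """Stage 2 placeholder: drop script/style blocks and tags, collapse whitespace."""
    page = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", page)
    text = re.sub(r"(?s)<[^>]+>", " ", page)
    return " ".join(unescape(text).split())


def open_store(path=":memory:"):
    """Stage 3: a single SQLite table standing in for a knowledge base."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (url TEXT PRIMARY KEY, body TEXT)"
    )
    return conn


def save(conn, url, body):
    """Insert a document, overwriting any earlier copy of the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO docs (url, body) VALUES (?, ?)", (url, body)
    )
    conn.commit()


def search_json(conn, keyword):
    """Stage 4: return documents whose body contains the keyword, as JSON."""
    rows = conn.execute(
        "SELECT url, body FROM docs WHERE body LIKE ? ORDER BY url",
        (f"%{keyword}%",),
    ).fetchall()
    return json.dumps([{"url": u, "body": b} for u, b in rows], ensure_ascii=False)
```

A typical run would be `save(conn, url, extract_text(fetch(url)))` for each crawled URL, then `search_json(conn, "keyword")` from whatever serves the API. Swapping `extract_text` for one of the libraries suggested in the replies is the part that matters most.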

Thanks, everyone.

17 replies • 2024-02-06 19:47:57 +08:00

zuoyouTU

Feb 6, 2024

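If the target pages have a clear, consistent format, a bit of customization with selenium or pytesseract should be enough.
The former grabs the plain text, and the latter uses OCR to get everything else.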

zqjilove

Feb 6, 2024

GNE. Search for it on GitHub or V2EX; I think it was even developed by a V2EX member.

wbrobot

Feb 6, 2024

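The good ones abroad are all paid APIs.
There used to be one in China, but it is gone now.
The open-source ones need too much tweaking on your own; AI-based ones will probably be much better in the future.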

Cloud200

Feb 6, 2024

https://github.com/Unstructured-IO

Cloud200

Feb 6, 2024

https://github.com/labring/FastGPT

rizon

Feb 6, 2024

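I assumed there would be plenty of libraries for main-content extraction, but after looking around it seems this path has not been well trodden yet. The simplest method I have seen so far is based on tag density.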
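One way to read "tag density" is text characters per tag inside a container: article bodies tend to have lots of text and few tags, while menus and footers are the opposite. Here is a minimal sketch of that heuristic using only Python's standard library. The names `DensityExtractor` and `extract_main_text`, the `CANDIDATES` set, and the `min_chars` threshold are all made up for illustration, and real pages will need more tuning than this.

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "noscript"}  # text inside these is ignored
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}  # elements without end tags
CANDIDATES = {"div", "article", "section", "main", "td"}  # assumed containers


class DensityExtractor(HTMLParser):
    def __init__(self, min_chars=25):
        super().__init__(convert_charrefs=True)
        self.min_chars = min_chars
        self.stack = [self._node("#root")]
        self.skip_depth = 0
        self.best_score = 0.0
        self.best_text = ""

    @staticmethod
    def _node(tag):
        return {"tag": tag, "chars": 0, "tags": 0, "parts": []}

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            self.stack[-1]["tags"] += 1
            return
        if tag in SKIP:
            self.skip_depth += 1
        self.stack.append(self._node(tag))

    def handle_endtag(self, tag):
        # Ignore stray end tags that match nothing currently open.
        if tag in VOID or all(n["tag"] != tag for n in self.stack[1:]):
            return
        # Pop until the matching element, closing any unclosed children.
        while True:
            node = self.stack.pop()
            self._close(node)
            if node["tag"] == tag:
                break

    def handle_data(self, data):
        if self.skip_depth:
            return
        text = " ".join(data.split())
        if text:
            self.stack[-1]["chars"] += len(text)
            self.stack[-1]["parts"].append(text)

    def _close(self, node):
        # Fold the closed element's counters into its parent.
        parent = self.stack[-1]
        parent["tags"] += node["tags"] + 1
        if node["tag"] in SKIP:
            self.skip_depth -= 1
            return
        parent["chars"] += node["chars"]
        parent["parts"].extend(node["parts"])
        if node["tag"] in CANDIDATES and node["chars"] >= self.min_chars:
            score = node["chars"] / (node["tags"] + 1)  # text chars per tag
            if score > self.best_score:
                self.best_score = score
                self.best_text = " ".join(node["parts"])

    def finish(self):
        self.close()
        while len(self.stack) > 1:
            self._close(self.stack.pop())
        # Fall back to all visible text when no container qualifies.
        return self.best_text or " ".join(self.stack[0]["parts"])


def extract_main_text(html, min_chars=25):
    parser = DensityExtractor(min_chars)
    parser.feed(html)
    return parser.finish()
```

Calling `extract_main_text(html)` returns the text of the densest candidate container. A pure ratio like this can be fooled by a short but tag-free block, which is why the sketch skips containers under `min_chars`.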

FrankAdler

Feb 6, 2024

https://github.com/mozilla/readability works quite well.

itskingname

Feb 6, 2024

@zqjilove GNE: https://github.com/GeneralNewsExtractor/GeneralNewsExtractor

itskingname

Feb 6, 2024

GNE open-source edition: https://github.com/GeneralNewsExtractor/GeneralNewsExtractor

GNE premium edition: https://www.kingname.info/2023/12/06/GnePro/

DTCPSS

Feb 6, 2024

Mozilla's Readability
https://github.com/mozilla/readability

rizon

Feb 6, 2024

@FrankAdler #7 Yes, exactly, that is the idea: the same approach the various web page reader tools use. I will give this one a try.

rizon

Feb 6, 2024

@DTCPSS #10 This one looks really handy, thanks mate. Firefox is awesome, haha.

oaa

Feb 6, 2024

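1) Readability, https://github.com/mozilla/readability, is a rule-based approach used by the reader mode of Mozilla Firefox. It extracts the main content by examining HTML elements' tag names, amount of text, link density, and text patterns that match main-content criteria.

2) DOM Distiller, https://github.com/chromium/dom-distiller, is the reader mode of Google Chrome. It is a hybrid approach that uses the Boilerpipe classifier plus some rules, somewhat similar to Readability.

3) Web2Text, https://github.com/dalab/web2text, is a classifier based on deep neural networks. It uses a CNN model and 128 structural and textual features, including word count, presence of punctuation, and number of stop words, to decide whether each text block belongs to the main content.

4) Boilernet, https://github.com/mrjleo/boilernet, is a classifier based on deep neural networks. It uses an LSTM and treats a page's text nodes as a sequence of text blocks made up of words and DOM-tree root paths.
I think there was also some paper on this...
via https://twitter.com/Barret_China/status/1729889136520335606?s=20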
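Of the signals listed for Readability in item 1, link density is the easiest to try in isolation: the share of a block's text that sits inside `<a>` tags. The sketch below is only my reading of that idea, not Readability's actual code. The names `LinkDensityCounter` and `link_density` are made up, and any cutoff you apply to the result is something you would have to tune yourself.

```python
from html.parser import HTMLParser


class LinkDensityCounter(HTMLParser):
    """Counts non-whitespace characters overall and inside <a> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.total_chars = 0
        self.link_chars = 0
        self._a_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._a_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._a_depth:
            self._a_depth -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))  # ignore whitespace
        self.total_chars += n
        if self._a_depth:
            self.link_chars += n


def link_density(fragment):
    """Return link text / all text for an HTML fragment (0.0 if no text)."""
    counter = LinkDensityCounter()
    counter.feed(fragment)
    counter.close()
    if counter.total_chars == 0:
        return 0.0
    return counter.link_chars / counter.total_chars
```

A navigation list scores close to 1.0 and an ordinary paragraph with one inline link scores much lower, so blocks above some cutoff can be treated as boilerplate candidates.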

Immortal

Feb 6, 2024

rod

chingyat

Feb 6, 2024

1. Mozilla's readability https://github.com/mozilla/readability
2. Postlight/parser https://github.com/postlight/parser

dyllen

Feb 6, 2024

I cannot remember where I read it, but those aggregator sites apparently do this with a density-analysis method.

zqjilove

Feb 6, 2024

Right now the most reliable option is to use GPT.