Extracting readable text from HTML using Python?

I know of utilities like html2text and BeautifulSoup, but the problem is that they also extract the JavaScript and add it to the text, which makes the two hard to separate.

htmlDom = BeautifulSoup(webPage) 

htmlDom.findAll(text=True)

Or,

from stripogram import html2text 
extract = html2text(webPage)

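Both of these also extract all of the JavaScript on the page, which is not wanted.

I just want the readable text to be extracted, the text you could copy from your browser.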


2010-07-03 demos

If you want to avoid extracting the contents of any script tags with BeautifulSoup,

nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False)

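will do that for you, getting the root's immediate children that are non-script tags (and a separate htmlDom.findAll(recursive=False, text=True) will get the strings that are immediate children of the root). You need to do this recursively; for example, as a generator: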

def nonScript(tag):
    return tag.name != 'script'

def getStrings(root):
    for s in root.childGenerator():
        if hasattr(s, 'name'):      # then it's a tag
            if s.name == 'script':  # skip it!
                continue
            for x in getStrings(s): yield x
        else:                       # it's a string!
            yield s

I'm using childGenerator (instead of findAll) so that I can get all the children, in order, and do my own filtering.
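For anyone reading this on Python 3 without BeautifulSoup installed, here is a rough sketch of the same "walk everything in order, skip what sits inside script" idea. It swaps in the standard library's html.parser instead of BeautifulSoup, so it is not the code above. The names VisibleTextParser, visible_text and SKIPPED_TAGS are made up for this example, and skipping style as well as script is my own assumption.

```python
from html.parser import HTMLParser

# Tags whose contents are not shown to the reader (an assumption of this sketch).
SKIPPED_TAGS = {"script", "style"}


class VisibleTextParser(HTMLParser):
    """Collect text nodes in document order, ignoring text inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside script/style.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text nodes of `html`, one per line, without script/style content."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling visible_text(webPage) on the page source should give one text node per line. It does not try to hide comments-in-markup edge cases or elements hidden by CSS, so treat it as a starting point only.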


2010-07-03 18:39:25

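Thanks! This does the job perfectly. – demos 2010-07-04 01:10:54

@demos, you're welcome, glad to hear it! By the way, why the accept (and thanks for that!) without an upvote? Seems peculiar! -) – 2010-07-04 02:55:00

@Alex Martelli The first upvote came from me. It's a shame that in 19 months this answer hadn't received a single one! – eyquem 2012-02-07 18:50:50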

Using BeautifulSoup, something along these lines:

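# Python 2 code (note `unicode` and the print statement)
def _extract_text(t):
    if not t:
        return ""
    if isinstance(t, (unicode, str)):
        # a string node: collapse newlines and runs of spaces into single spaces
        return " ".join(filter(None, t.replace("\n", " ").split(" ")))
    if t.name.lower() == "br": return "\n"
    if t.name.lower() == "script": return "\n"  # drop the script's contents
    return "".join(_extract_text(c) for c in t)

def extract_text(t):
    # strip leading/trailing whitespace from every output line
    return '\n'.join(x.strip() for x in _extract_text(t).split('\n'))

print extract_text(htmlDom)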
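The whitespace handling in that snippet can be tried on its own, without any HTML involved. Below is a small Python 3 sketch of the two cleanup steps. The helper names collapse_whitespace and tidy_lines are invented here, and the behaviour differs slightly from the code above: str.split() with no argument also collapses tabs, and tidy_lines drops empty lines, which the original does not do.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return " ".join(text.split())


def tidy_lines(text):
    """Strip each line and drop the lines that end up empty."""
    lines = (line.strip() for line in text.split("\n"))
    return "\n".join(line for line in lines if line)
```

The first helper would be applied to each string node, the second to the joined result.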


2010-07-03 18:32:10

You can remove the script tags in Beautiful Soup, like this:

for script in soup("script"): 
    script.extract()

Removing Elements
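With the newer bs4 package the same approach might look like the sketch below. It assumes bs4 is installed and uses the built-in "html.parser" backend. The function name text_without_scripts is made up for this example, and removing style tags too is my addition, not part of the snippet above.

```python
from bs4 import BeautifulSoup


def text_without_scripts(html):
    """Remove script/style elements, then return the remaining text."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup like a function is shorthand for find_all().
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove the tag and its contents from the tree
    # One text node per line, each stripped of surrounding whitespace.
    return soup.get_text(separator="\n", strip=True)
```

decompose() destroys the removed tag, while extract() as used above returns it; either works if you only care about the text that is left.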


2010-07-03 18:35:37 jkyle

Looks like a quick solution, but what is the performance penalty of extracting the tags? – demos 2010-07-04 01:11:16

Give these a try:

http://code.google.com/p/boilerpipe/

http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/


2012-02-07 18:38:50 saravanan
