Extracting only words from HTML pages

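I am using Python 2.7, and I have a folder containing a list of HTML pages from which I want to extract only the words. My current process is to open each HTML file, run it through the Beautiful Soup library, get the text, and write it to a new file. The problem is that the output still contains JavaScript, CSS (body, color, #000000, etc.), symbols (|, `, ~, [], etc.) and random numbers.

How do I get rid of the unwanted output and keep only the text?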

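import os
from bs4 import BeautifulSoup

path = "*folder path*"  # placeholder: folder containing the HTML files
raw = open(os.path.join(path, "raw.txt"), "w")
for name in os.listdir(path):
    if name == "raw.txt":
        continue  # do not read the output file back in
    fname = os.path.join(path, name)
    try:
        with open(fname) as f:
            b = f.read()
        soup = BeautifulSoup(b, "html.parser")
        txt = soup.body.getText().encode("UTF-8")
        raw.write(txt)
    except (IOError, AttributeError):
        # skip unreadable entries and pages that have no <body>
        continue
raw.close()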


2014-12-29 user3702643

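What do you mean by "word"? To extract words from a string you need a workable definition of "word", one that can be turned into an algorithm. For example, is a hyphenated compound one word, or two words joined by a separator? And what about "F1", "i18n" and "α"?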

In this case, a word is defined as anything that can be found in an English dictionary. – user3702643

So you need a dictionary lookup? (Using whichever dictionary you consider to be "the dictionary".)
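The dictionary lookup discussed in these comments could be sketched roughly as follows. This is only an illustration, not something posted in the thread: `ENGLISH_WORDS` is a tiny hypothetical stand-in for whatever word list you treat as "the dictionary", and `dictionary_words` is a made-up helper name. Tokens are taken to be runs of ASCII letters, so symbols and numbers never become candidates in the first place.

```python
import re

# Hypothetical stand-in for a real dictionary; in practice you would load
# a full word list (one lowercase word per entry) from a file of your choice.
ENGLISH_WORDS = {"extract", "only", "words", "from", "html", "pages"}


def dictionary_words(text, vocabulary):
    """Return the tokens of `text` that appear in `vocabulary`.

    A token is a run of ASCII letters, so digits and symbols such as
    '|', '~' or '#000000' are dropped before the lookup happens.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    # Compare case-insensitively, but keep the original spelling.
    return [t for t in tokens if t.lower() in vocabulary]


sample = "Extract only words | from #000000 pages 42 ~"
print(dictionary_words(sample, ENGLISH_WORDS))
```

Under this definition, anything the chosen dictionary does not list (such as "i18n" or "α" from the comment above) is discarded, which may or may not be what you want.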

You can strip out the script and style tags:

import requests
from bs4 import BeautifulSoup

session = requests.session()

url = 'http://stackoverflow.com/questions/27684020/extracting-only-words-from-html-pages'
soup = BeautifulSoup(session.get(url).text, "html.parser")

# This part strips out the script and style tags, contents included.
for script in soup(["script", "style"]):
    script.extract()

print(soup.get_text())
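For readers who cannot install Beautiful Soup, the same idea (drop everything inside script and style elements, keep the remaining text) can be sketched with the standard library alone. This swaps in Python 3's `html.parser` module in place of Beautiful Soup, so it is an alternative to the answer's code rather than a copy of it; on Python 2.7 the module is named `HTMLParser` instead. The names `VisibleTextParser` and `visible_text` are invented for this example.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        HTMLParser.__init__(self, convert_charrefs=True)
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []     # stripped, non-empty pieces of visible text

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text of `html` with script and style contents removed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ("<html><head><style>body { color: #000000; }</style>"
        "<script>var x = 1;</script></head>"
        "<body><p>Hello <b>world</b></p></body></html>")
print(visible_text(page))
```

Because each text node is stripped and then joined with a single space, a word split by inline markup would come out as two pieces; that simplification keeps the sketch short.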


2014-12-29 06:14:54 mnjeremiah

Works perfectly. Thanks! – user3702643
