Extract only meaningful text from a web page

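I am given a list of URLs and scrape them using nltk. My end result is a list of all the words on the web page. The trouble is that I am only looking for keywords and phrases that are not the usual English "sugar" words such as "as, and, like, to, am, for", etc. I know I can build a file of all the common English words and simply remove them from my list of scraped tokens, but is there a built-in feature that does this automatically?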

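I am essentially looking for the useful words on a page: words that are not fluff and that give some context about what the page covers. Almost like the tags on stackoverflow, or the tags Google uses for search engine optimization.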

Source

2014-04-03 John Baum

Possible duplicate of [How to remove stop words using nltk or python](http://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python) – tripleee

I think what you are looking for is stopwords.words from nltk.corpus:

>>> from nltk.corpus import stopwords 
>>> sw = set(stopwords.words('english')) 
>>> sentence = "a long sentence that contains a for instance" 
>>> [w for w in sentence.split() if w not in sw] 
['long', 'sentence', 'contains', 'instance']
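
Scraped page text usually carries capital letters and punctuation, which a plain `sentence.split()` leaves attached to the words, so "The" or "for," would slip past the filter. Below is a minimal sketch of one way to handle that and to rank what remains by frequency, which is a rough stand-in for the "tags" the question asks about. The function name `top_keywords` and the small `DEMO_STOP_WORDS` set are hypothetical and only there so the example runs without any corpus download; this is not something NLTK provides.

```python
import re
from collections import Counter

# Tiny stand-in stopword set (an assumption for this sketch only).
# In practice you would pass set(stopwords.words('english')) instead.
DEMO_STOP_WORDS = {"a", "and", "as", "for", "in", "is", "of", "the", "to"}

def top_keywords(text, stop_words, n=5):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    # Lowercase first so "The" and "the" match the same stopword entry,
    # then pull out runs of letters (optionally with one apostrophe part,
    # e.g. "don't"), which drops surrounding punctuation.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

page_text = "The cat and the dog chased the cat. A dog is a dog!"
print(top_keywords(page_text, DEMO_STOP_WORDS, n=2))
# [('dog', 3), ('cat', 2)]
```

With NLTK installed and the stopwords corpus downloaded, the same function should work with `sw` from the session above passed in as `stop_words`. Raw frequency is only one possible ranking; it is used here because it needs nothing beyond the standard library.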

Edit: searching for "stopword" turns up possible duplicates: Stopword removal with NLTK, and How to remove stop words using nltk or python. See the answers to those questions. Also consider Effects of Stemming on the term frequency?

Source

2014-04-03 21:07:59 fredtantini