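In natural language processing, the common words you exclude are called "stop words".
Do you want to keep the order and count of each word, or do you just want the set of words that appear on the page?
If you just want the set of words that appear on the page, a set is probably the way to go. Something like the following might work: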
# It's probably more common to define your STOP_WORDS in a file and then read
# them into your data structure to keep things simple for large numbers of those
# words.
STOP_WORDS = {
    'het',
    'om',
}

# Collect every distinct word from all items, then drop the stop words.
all_words = set()
for item in g_data:
    all_words |= set(item.text.split())
all_words -= STOP_WORDS
print(all_words)
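The comment at the top of that snippet mentions keeping the stop words in a file. Here is a minimal sketch of what that could look like, assuming one word per line and `#` for comment lines. The `load_stop_words` helper, the file format, and the lowercasing are my own assumptions for illustration, not an established convention:

```python
import os
import tempfile


def load_stop_words(path):
    """Read one stop word per line, skipping blank lines and '#' comments."""
    stop_words = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word = line.strip().lower()
            if word and not word.startswith("#"):
                stop_words.add(word)
    return stop_words


# Demo only: write a small hypothetical stop-word file, then load it back.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("# Dutch stop words\nhet\nom\n\nDe\n")
    demo_path = tmp.name

STOP_WORDS = load_stop_words(demo_path)
os.remove(demo_path)
print(sorted(STOP_WORDS))
```

In real use you would point `load_stop_words` at your own file instead of the temporary one created here.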
If, on the other hand, you care about order, you can simply skip the stop words when adding words to a list.
words_in_order = []
for item in g_data:
    words_from_span = item.text.split()
    # You might want to break this out into its own function for modularity.
    for word in words_from_span:
        if word not in STOP_WORDS:
            words_in_order.append(word)
print(words_in_order)
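The comment in the loop suggests breaking the filtering out into its own function. One possible shape for that is sketched below. The `content_words` name is hypothetical, and the case-insensitive comparison and punctuation stripping are additions of mine that you may or may not want. The `g_data` here is a stand-in built from `SimpleNamespace`, since all the code needs is objects with a `.text` attribute:

```python
import string
from types import SimpleNamespace

STOP_WORDS = {"het", "om"}


def content_words(text, stop_words=STOP_WORDS):
    """Yield the words of `text` in order, skipping stop words."""
    for raw in text.split():
        # Strip surrounding punctuation so "Het," is compared as "het".
        word = raw.strip(string.punctuation)
        if word and word.lower() not in stop_words:
            yield word


# Hypothetical stand-in for g_data: any objects with a .text attribute.
g_data = [
    SimpleNamespace(text="Het nieuws om Franciscus,"),
    SimpleNamespace(text="Google om het hoekje"),
]

words_in_order = [word for item in g_data for word in content_words(item.text)]
print(words_in_order)
```

Because `content_words` is a generator, the same function can feed a list, a set, or a counter without building intermediate lists.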
If you don't care about order but you do want frequencies, you can create a dict (or, more conveniently, a defaultdict) to count the words.
from collections import defaultdict

word_counts = defaultdict(int)
for item in g_data:
    # You might want to break this out into its own function for modularity.
    for word in item.text.split():
        if word not in STOP_WORDS:
            word_counts[word] += 1

for word, count in word_counts.items():
    print('%s: %d' % (word, count))
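As an alternative to the defaultdict, `collections.Counter` from the standard library does the same counting and can also sort by frequency. This is a sketch of that variant rather than the approach above; the `g_data` sample is a hypothetical stand-in for the real elements:

```python
from collections import Counter
from types import SimpleNamespace

STOP_WORDS = {"het", "om"}

# Hypothetical stand-in for g_data: any objects with a .text attribute.
g_data = [
    SimpleNamespace(text="het Google om Franciscus"),
    SimpleNamespace(text="Google om het Google"),
]

# Count every word that is not a stop word, across all items.
word_counts = Counter(
    word
    for item in g_data
    for word in item.text.split()
    if word not in STOP_WORDS
)

# most_common() yields (word, count) pairs, highest count first.
for word, count in word_counts.most_common():
    print('%s: %d' % (word, count))
```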
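What do you mean by "het" and "om"? What kind of words are they? – 2015-04-05 19:05:01
Hey Padraic, they are very common words, like "it" and "because". Ultimately, my goal is to keep only the important and distinctive words, such as "Franciscus" and "Google". – Arthurrrrrr 2015-04-05 19:08:32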
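You can use a set of words to replace with an empty string, but you cannot do what you want with BeautifulSoup itself. This goes beyond what BeautifulSoup is for. – 2015-04-05 19:11:21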