Removing words from a Python list?

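I'm a complete noob at Python and web scraping, and I ran into a problem early on. I've managed to scrape the headlines of a Dutch news site and split them into words. Now my goal is to remove certain words from the results. For example, I don't want words like "het" and "om" in the list. Does anyone know how I can do this? (I'm using Python with requests and BeautifulSoup.)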

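import requests
from bs4 import BeautifulSoup

url = "http://www.nu.nl"
r = requests.get(url)

# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(r.content, "html.parser")

# Each headline on the page sits in a <span class="title"> element.
g_data = soup.find_all("span", {"class": "title"})

for item in g_data:
    # Split each headline into a list of words.
    print(item.text.split())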

2015-04-05 Arthurrrrrr

What do you mean by "het" and "om"? What kind of words are they? – 2015-04-05 19:05:01

Hey Padraic, they are very common words, like "it" and "because". Ultimately, my goal is to keep only the important and distinctive words, such as "Franciscus" and "Google". – Arthurrrrrr 2015-04-05 19:08:32

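You can use a set of words that you replace with an empty string, but BeautifulSoup itself can't do what you want. This is more of a text-processing problem than a BeautifulSoup one. – 2015-04-05 19:11:21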

In natural language processing, the common words you exclude are called "stop words".

Do you want to preserve the order and count of each word, or do you just want the set of words that appear on the page?

If you just want the set of words that appear on the page, a set is probably the way to go. Something like the following might work:

# It's probably more common to define your STOP_WORDS in a file and then read 
# them into your data structure to keep things simple for large numbers of those 
# words. 
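STOP_WORDS = {
    'het',
    'om',
}

all_words = set()
for item in g_data:
    # Add every word from this headline to the running set.
    all_words |= set(item.text.split())
# Drop the stop words in a single set difference.
all_words -= STOP_WORDS
print(all_words)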
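
The comment in the code above mentions keeping the stop words in a file. Here is a minimal sketch of what that could look like, assuming a plain-text file with one word per line. The `load_stop_words` helper, the temporary file, and the sample headline are made up for illustration and are not part of the original question.

```python
import os
import tempfile


def load_stop_words(path):
    """Return the stop words in `path` as a set, one word per line."""
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines and lower-case each word so "Het" and "het" match.
        return {line.strip().lower() for line in handle if line.strip()}


# Demo only: write a tiny stop-word file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("het\nom\n\nde\n")
    stop_words_path = tmp.name

stop_words = load_stop_words(stop_words_path)
os.remove(stop_words_path)

# Hypothetical headline standing in for item.text from the scraper.
headline = "Google opent het kantoor om de hoek"
remaining = set(headline.lower().split()) - stop_words
print(sorted(remaining))
```

In the real script you would point `load_stop_words` at your own file and subtract the result from `all_words` instead of the hard-coded `STOP_WORDS`.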

If, on the other hand, you care about order, you can simply skip the stop words when building your list.

words_in_order = []
for item in g_data:
    words_from_span = item.text.split()
    # You might want to break this out into its own function for modularity.
    for word in words_from_span:
        if word not in STOP_WORDS:
            words_in_order.append(word)
print(words_in_order)
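
As the comment above suggests, the filtering could be broken out into its own function. The sketch below is one possible version, not the only way to do it. The name `remove_stop_words` and the sample sentence are hypothetical, and the case-insensitive matching and punctuation stripping are my own additions, on the assumption that headlines contain capitalised words and trailing commas.

```python
import string

STOP_WORDS = {"het", "om"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the words of `text` in their original order, minus stop words."""
    kept = []
    for word in text.split():
        # Strip surrounding punctuation so "privacy," becomes "privacy".
        cleaned = word.strip(string.punctuation)
        # Compare in lower case so "Het" is recognised as the stop word "het".
        if cleaned and cleaned.lower() not in stop_words:
            kept.append(cleaned)
    return kept


# Hypothetical headline standing in for item.text from the scraper.
filtered = remove_stop_words("Het bedrijf Google vecht om privacy, zegt Franciscus")
print(filtered)
```

With the scraped data you would call it once per headline, for example `words_in_order.extend(remove_stop_words(item.text))`.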

If you don't care about order but you do want frequencies, you can create a dict (or, more conveniently, a defaultdict) to count the words.

from collections import defaultdict

# Missing keys start at 0, so each word can be incremented directly.
word_counts = defaultdict(int)
for item in g_data:
    # You might want to break this out into its own function for modularity.
    for word in item.text.split():
        if word not in STOP_WORDS:
            word_counts[word] += 1
for word, count in word_counts.items():
    print('%s: %d' % (word, count))

2015-04-05 20:25:50

Glad to help! FYI, if an answer on StackOverflow helps you, you can accept it and/or upvote it. – 2015-04-07 17:26:55
