Find the most commonly used words on a website

I am new to Python. I have a simple program that counts how many times each word is used on a web page.

import re
import urllib2
from collections import Counter

from bs4 import BeautifulSoup  # assumed import; the post does not show its imports

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

url = 'http://en.wikipedia.org/wiki/Albert_Einstein'
ourUrl = opener.open(url).read()
soup = BeautifulSoup(ourUrl)
dem = soup.findAll('p')  # find paragraphs
for i in dem:  # loop over each paragraph
    words = re.findall(r'\w+', i.text)
    cap_words = [word.upper() for word in words]
    word_counts = Counter(cap_words)  # a new Counter is built for every paragraph
    print word_counts

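This gives me a separate count for each paragraph rather than the total word counts for the whole page. What do I need to change? Also, if I want to filter out common articles such as "a" and "an", what code do I need to add?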

2013-07-28 user2626758

Related, but not what you asked: I would use nltk to find the words.

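Assuming you really do want only the words inside paragraphs, and you are happy with your regular expression, this is the smallest change that gives you total word counts for the retrieved document: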

soup = BeautifulSoup(ourUrl) 
dem = soup.findAll('p') #find paragraphs 
word_counts = Counter() 
for i in dem: # loop for each para 
    words = re.findall(r'\w+', i.text) 
    cap_words = [word.upper() for word in words] 
    word_counts.update(cap_words) 

print word_counts
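
To see why one Counter updated inside the loop gives page-wide totals, here is a minimal, self-contained sketch. It is written in Python 3 syntax (so `print()` is a function), unlike the Python 2 code above. The `paragraphs` list is made-up sample text standing in for the `.text` of each `<p>` tag, and `count_words` is a hypothetical helper name, not something from the question.

```python
import re
from collections import Counter

# Made-up sample text standing in for the .text of each <p> tag.
paragraphs = [
    "Einstein developed the theory of relativity.",
    "The theory changed physics.",
]


def count_words(paragraphs):
    """Return one Counter of upper-cased word counts across all paragraphs."""
    totals = Counter()
    for text in paragraphs:
        words = re.findall(r"\w+", text)
        # update() adds to the existing counts instead of replacing them.
        totals.update(word.upper() for word in words)
    return totals


word_counts = count_words(paragraphs)
# most_common(n) returns the n highest counts, largest first.
print(word_counts.most_common(2))
```

Calling `most_common()` on the accumulated Counter is one way to get the most frequently used words directly, which is what the title asks for.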

To ignore common words, one approach is to define a frozenset of words to skip:

word_counts = Counter()
stopwords = frozenset(('A', 'AN', 'THE'))
for i in dem:  # loop over each paragraph
    words = re.findall(r'\w+', i.text)
    cap_words = [word.upper() for word in words if word.upper() not in stopwords]
    word_counts.update(cap_words)
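
As a rough illustration of the stop-word filter combined with a "top N" lookup, here is a standalone sketch in Python 3 syntax. The helper name `top_words`, the `sample` text, and the extra entry `"OF"` in the stop-word set are all assumptions made for the example; a real list of words to ignore would likely be longer.

```python
import re
from collections import Counter

# Illustrative stop-word set; "OF" is added here only for the example.
STOPWORDS = frozenset(("A", "AN", "THE", "OF"))


def top_words(paragraphs, n, stopwords=STOPWORDS):
    """Return the n most common upper-cased words, skipping stop words."""
    counts = Counter()
    for text in paragraphs:
        for word in re.findall(r"\w+", text):
            upper = word.upper()  # normalise once, then test and count
            if upper not in stopwords:
                counts[upper] += 1
    return counts.most_common(n)


# Made-up sample paragraphs standing in for the page text.
sample = ["The cat saw a dog.", "A dog saw the cat and an owl."]
print(top_words(sample, 3))
```

Upper-casing each word once and reusing the result avoids calling `upper()` twice per word, and a frozenset keeps each membership test constant-time on average.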

2013-07-28 02:19:59
