Python - keyword-reading program, unable to strip punctuation

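I've been playing with a simple program that reads a text and identifies the keywords that start with a capital letter. The problem I'm having is that the program doesn't strip punctuation from the words: Frodo, Frodo. and Frodo, show up as separate entries instead of being counted as the same word. I tried using import string and playing around with punctuation, but it didn't work.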

Below is my code. The text I'm using is from http://www.angelfire.com/rings/theroaddownloads/fotr.pdf (copied into a txt document named novel.txt). Thanks again.

by_word = {}
with open('novel.txt') as f:
    for line in f:
        for word in line.strip().split():
            if word[0].isupper():
                if word in by_word:
                    by_word[word] += 1
                else:
                    by_word[word] = 1

by_count = [] 
for word in by_word: 
    by_count.append((by_word[word], word)) 

by_count.sort() 
by_count.reverse() 

for count, word in by_count[:100]: 
    print(count, word)

Source

2017-04-24 Joshua Robertson

Possible duplicate of [Best way to strip punctuation from a string in Python](http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python) – elethan

I tried the solution above first, but it didn't seem to work with my implementation. I may have been doing something wrong.

Hopefully the following will work for you as expected:

import string 
exclude = set(string.punctuation) 

by_word = {}
with open('novel.txt') as f:
    for line in f:
        for word in line.strip().split():
            if word[0].isupper():
                word = ''.join(char for char in word if char not in exclude)
                if word in by_word:
                    by_word[word] += 1
                else:
                    by_word[word] = 1

by_count = [] 
for word in by_word: 
    by_count.append((by_word[word], word)) 

by_count.sort() 
by_count.reverse() 

for count, word in by_count[:100]: 
    print(count, word)

It removes all of the following characters from word:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
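A slightly different way to get a similar result, offered as a sketch rather than as the answer's method: strip punctuation from both ends of each token before testing the first letter, so a token with a leading quotation mark is counted too. The helper name count_capitalized, the sample sentences, and the use of collections.Counter are assumptions made for illustration; none of them appear in the original post.

```python
import string
from collections import Counter


def count_capitalized(lines):
    """Count words that start with an uppercase letter, ignoring
    punctuation attached to either end of a word."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            # Remove leading/trailing punctuation only; inner characters
            # such as the apostrophe in "Frodo's" are left alone.
            word = token.strip(string.punctuation)
            # Skip tokens that were nothing but punctuation.
            if word and word[0].isupper():
                counts[word] += 1
    return counts


# Made-up sample text (an assumption, not taken from the novel).
sample = ['Frodo looked up. "Frodo, wait!" said Sam.', "It was Frodo."]
counts = count_capitalized(sample)
for word, count in counts.most_common(3):
    print(count, word)
```

An open file is an iterable of lines, so the same function should also accept the file object from open('novel.txt') in place of the sample list.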

Source

2017-04-24 02:24:40 Claudio

Perfect, thank you!

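Your code is fine. To strip the punctuation, split with a regular expression:

for word in line.strip().split():

can be changed to

for word in re.split('[,.;]',line.strip()):

where the character class inside [] in the first argument lists the punctuation characters to split on. This uses the re module (it needs import re at the top of the script); see https://docs.python.org/2/library/re.html#re.split.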
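To see what this split produces, here is a small check on a made-up sentence (the sample string and the second pattern are assumptions, not part of the answer). Once the pattern replaces the default split(), whitespace is no longer a separator unless it is included in the character class, and a separator at the end of the line yields an empty string.

```python
import re

line = "Frodo, Sam. Gandalf;"

# Pattern from the answer: splits on commas, periods, and semicolons only,
# so the spaces stay attached to the following word.
parts = re.split('[,.;]', line.strip())
print(parts)  # ['Frodo', ' Sam', ' Gandalf', '']

# Adding \s and + treats any run of whitespace or listed punctuation
# as a single separator.
parts_ws = re.split(r'[\s,.;]+', line.strip())
print(parts_ws)  # ['Frodo', 'Sam', 'Gandalf', '']
```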

Source

2017-04-24 02:15:35 Pbd

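Thanks, that seems to have removed the punctuation, but now I'm getting: Traceback (most recent call last): File "C:\Users\joshr\Desktop\Key-word reader.py", line 7, in if word[0].isupper(): IndexError: string index out of range. I understand what the error is trying to say, but since each list consists of only one object, there should be no problem with index 0.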
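A likely cause of this IndexError, offered as a guess since the modified script is not shown: re.split returns an empty string whenever a separator sits at the start or end of a line or two separators are adjacent, and indexing an empty string with [0] raises IndexError. Below is a minimal sketch of one way around it, using re.findall to pick out the words instead of splitting on separators. The pattern WORD_RE and the helper name capitalized_words are assumptions for illustration.

```python
import re

# Runs of ASCII letters, optionally with inner apostrophes ("Frodo's").
# ASCII-only is an assumption; accented names would need a wider pattern.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def capitalized_words(line):
    """Return the words in `line` that start with an uppercase letter."""
    # Every match contains at least one letter, so word[0] is always safe.
    return [word for word in WORD_RE.findall(line) if word[0].isupper()]


# Reproduce the problem: the trailing '.' leaves an empty string behind.
pieces = re.split('[,.;]', 'Frodo.')
print(pieces)  # ['Frodo', '']

found = capitalized_words('Frodo, Frodo. "Frodo," said Sam.')
print(found)  # ['Frodo', 'Frodo', 'Frodo', 'Sam']
```

Keeping re.split and changing the test to if word and word[0].isupper(): should also avoid the error, because an empty string is falsy and the index is then never evaluated.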
