2016-11-26 182 views

Find duplicates in a list and add their values

I am trying to find the 50 most common words across three of Shakespeare's plays, along with the ratio at which each word occurs in macbeth.txt, allswell.txt, and othello.txt. This is the code I have so far:

def byFreq(pair):
    return pair[1]

def shakespeare():
    counts = {}
    A = []
    for words in ['macbeth.txt', 'allswell.txt', 'othello.txt']:
        text = open(words, 'r').read()
        test = text.lower()

        for ch in '!"$%&()*+,-./:;<=>?@[\\]^_`{|}~':
            text = text.replace(ch, ' ')
        words = text.split()

        for w in words:
            counts[w] = counts.get(w, 0) + 1

        items = list(counts.items())
        items.sort()
        items.sort(key=byFreq, reverse=True)

        for i in range(50):
            word, count = items[i]
            count = count / float(len(counts))
            A += [[word, count]]
    print A

which outputs:

 >>> shakespeare() 
[['the', 0.12929982922664066], ['and', 0.09148572822639668], ['I', 0.08075140278116613], ['of', 0.07684801171017322], ['to', 0.07562820200048792], ['a', 0.05220785557453037], ['you', 0.04415711149060746], ['in', 0.041717492071236886], ['And', 0.04147353012929983], ['my', 0.04147353012929983], ['is', 0.03927787265186631], ['not', 0.03781410100024396], ['that', 0.0358624054647475], ['it', 0.03366674798731398], ['Macb', 0.03342278604537692], ['with', 0.03269090021956575], ['his', 0.03147109050988046], ['be', 0.03025128080019517], ['The', 0.028787509148572824], ['haue', 0.028543547206635766], ['me', 0.027079775555013418], ['your', 0.02683581361307636], ['our', 0.025128080019516955], ['him', 0.021956574774335203], ['Enter', 0.019516955354964626], ['That', 0.019516955354964626], ['for', 0.01927299341302757], ['this', 0.01927299341302757], ['he', 0.018541107587216395], ['To', 0.01780922176140522], ['so', 0.017077335935594046], ['all', 0.0156135642839717], ['What', 0.015369602342034643], ['are', 0.015369602342034643], ['thou', 0.015369602342034643], ['will', 0.015125640400097584], ['Macbeth', 0.014881678458160527], ['thee', 0.014881678458160527], ['But', 0.014637716516223469], ['but', 0.014637716516223469], ['Macd', 0.014149792632349353], ['they', 0.014149792632349353], ['their', 0.013905830690412296], ['we', 0.013905830690412296], ['as', 0.01341790680653818], ['vs', 0.01341790680653818], ['King', 0.013173944864601122], ['on', 0.013173944864601122], ['yet', 0.012198097096852892], ['Rosse', 0.011954135154915833], ['the', 0.15813168261114238], ['I', 0.14279684862127182], ['and', 0.1231007315700619], ['to', 0.10875070343275182], ['of', 0.10481148002250985], ['a', 0.08581879572312887], ['you', 0.08581879572312887], ['my', 0.06992121553179516], ['in', 0.061902082160945414], ['is', 0.05852560495216657], ['not', 0.05486775464265616], ['it', 0.05472706809229038], ['that', 0.05472706809229038], ['his', 0.04727068092290377], ['your', 0.04389420371412493], ['me', 
0.043753517163759144], ['be', 0.04305008441193022], ['And', 0.04037703995498031], ['with', 0.038266741699493526], ['him', 0.037703995498030385], ['for', 0.03601575689364097], ['he', 0.03404614518851998], ['The', 0.03137310073157006], ['this', 0.030810354530106922], ['her', 0.029262802476083285], ['will', 0.0291221159257175], ['so', 0.027011817670230726], ['have', 0.02687113111986494], ['our', 0.02687113111986494], ['but', 0.024760832864378166], ['That', 0.02293190770962296], ['PAROLLES', 0.022791221159257174], ['To', 0.021384355655599326], ['all', 0.021384355655599326], ['shall', 0.021102982554867755], ['are', 0.02096229600450197], ['as', 0.02096229600450197], ['thou', 0.02039954980303883], ['Macb', 0.019274057400112548], ['thee', 0.019274057400112548], ['no', 0.01871131119864941], ['But', 0.01842993809791784], ['Enter', 0.01814856499718627], ['BERTRAM', 0.01758581879572313], ['HELENA', 0.01730444569499156], ['we', 0.01730444569499156], ['do', 0.017163759144625774], ['thy', 0.017163759144625774], ['was', 0.01674169949352842], ['haue', 0.016460326392796848], ['I', 0.19463784682531435], ['the', 0.17894627455055595], ['and', 0.1472513769094877], ['to', 0.12989712147978802], ['of', 0.12002494024732412], ['you', 0.1079704873739998], ['a', 0.10339810869791126], ['my', 0.0909279850358516], ['in', 0.07627558973293151], ['not', 0.07159929335965914], ['is', 0.0697287748103502], ['it', 0.0676504208666736], ['that', 0.06733866777512211], ['me', 0.06099968824690845], ['your', 0.0543489556271433], ['And', 0.053205860958121166], ['be', 0.05310194326093734], ['his', 0.05154317780317988], ['with', 0.04769822300737816], ['him', 0.04665904603553985], ['her', 0.04364543281720877], ['for', 0.04322976202847345], ['he', 0.042190585056635144], ['this', 0.04187883196508366], ['will', 0.035332017042502335], ['Iago', 0.03522809934531851], ['so', 0.03356541619037722], ['The', 0.03325366309882573], ['haue', 0.031902733035435935], ['do', 0.03138314454951678], ['but', 0.030240049880494647], 
['That', 0.02857736672555336], ['thou', 0.027642107450898887], ['as', 0.027434272056531227], ['To', 0.026810765873428243], ['our', 0.02504416502130313], ['are', 0.024628494232567806], ['But', 0.024420658838200146], ['all', 0.024316741141016316], ['What', 0.024212823443832486], ['shall', 0.024004988049464823], ['on', 0.02265405798607503], ['thee', 0.022134469500155875], ['Enter', 0.021822716408604385], ['thy', 0.021199210225501402], ['no', 0.020783539436766082], ['she', 0.02026395095084693], ['am', 0.02005611555647927], ['by', 0.019848280162111608], ['have', 0.019848280162111608]] 

Instead of outputting the top 50 words across the three texts, it outputs the top 50 words of each text, 150 words in all. I am trying to remove the duplicates while adding their ratios together. For example, the word "the" has a ratio of 0.12929982922664066 in macbeth.txt, 0.15813168261114238 in allswell.txt, and 0.17894627455055595 in othello.txt. I want to combine the ratios from all three. I'm fairly sure I need a for loop, but I'm struggling to loop over a list of lists. I'm more of a Java person, so any help would be appreciated!


Are you looking for the rate at which each word occurs in each file, or the rate across all 3 files combined? In other words, should the ratio for "the" be the frequency at which it occurs in each work individually (so it has 3 different ratios), or the frequency at which "the" occurs across all three texts (one value)? – TheF1rstPancake


It should be the frequency at which "the" occurs across all three texts combined (one value). Sorry for the confusion. –


OK, then that changes your logic. You can't just add the three ratios together. You have to take the number of times a word occurs across all three files and divide it by the sum of the total word counts of the files. You need to treat the three separate files as one big file and then do the math. Did @phynfo's solution work for you? @zmbq's solution works too; all you need to do is move everything after 'items = list(counts.items())' out of the 'for' loop. – TheF1rstPancake
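As the comment says, the combined ratio has to come from combined counts, not from adding per-file ratios. A minimal sketch of that idea, with inline strings standing in for the three files' contents:

```python
def combined_ratios(texts):
    """Count words across all texts, then divide each count by the
    total number of words in all texts combined."""
    counts = {}
    total = 0
    for text in texts:
        for w in text.lower().split():
            counts[w] = counts.get(w, 0) + 1
            total += 1
    return {w: n / float(total) for w, n in counts.items()}

# 3 occurrences of "the" out of 6 words total -> 0.5
ratios = combined_ratios(["the cat", "the dog", "the owl"])
```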

Answers


You are summarizing the per-file counts inside the loop. Move the summarizing code outside the for loop.
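Concretely, a sketch of that fix, restructured to take the file contents as a dict so it runs without the .txt files (it also divides by the combined word count rather than by the number of distinct words, as suggested in the comments):

```python
def shakespeare(texts):
    """texts: mapping of filename -> file contents (read beforehand).
    Counts accumulate across all texts; the sort and the top-50
    report happen once, after the loop."""
    counts = {}
    for text in texts.values():
        text = text.lower()
        for ch in '!"$%&()*+,-./:;<=>?@[\\]^_`{|}~':
            text = text.replace(ch, ' ')
        for w in text.split():
            counts[w] = counts.get(w, 0) + 1

    total = sum(counts.values())  # total words across all files
    items = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return [[word, count / float(total)] for word, count in items[:50]]
```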


You can use a list comprehension and the Counter class:

from collections import Counter 

c = Counter([word for file in ['macbeth.txt','allswell.txt','othello.txt'] 
        for word in open(file).read().split()]) 

That gives you a dictionary mapping words to their counts. You can sort them like this:

sorted([(i,v) for v,i in c.items()]) 
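For the top 50 specifically, Counter.most_common does the descending sort for you (shown here on an inline string rather than the question's files):

```python
from collections import Counter

c = Counter("the cat and the dog and the owl".split())
top_two = c.most_common(2)  # highest-count (word, count) pairs first
```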

If you want relative counts instead, you can compute the total number of words:

numWords = sum([i for (v,i) in c.items()]) 

and adapt the dictionary c with a dict comprehension:

c = { v:(i/float(numWords)) for (v,i) in c.items()} 
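Putting the pieces together (inline strings stand in for the three files, and float() guards against Python 2 integer division):

```python
from collections import Counter

texts = ["the cat sat", "the dog ran", "the owl flew"]  # stand-ins for the files
c = Counter(word for t in texts for word in t.lower().split())
num_words = sum(c.values())
ratios = {w: n / float(num_words) for w, n in c.items()}
top = sorted(ratios.items(), key=lambda p: p[1], reverse=True)[:50]
```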