Finding words that appear only once

I want to retrieve only the unique words in a file. Here is what I have so far, but is there a better way to do this in terms of big-O complexity? Right now it is O(n^2).

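def retHapax():
    # Python 2 code, as posted (note the print statement at the end)
    file = open("myfile.txt")
    myMap = {}       # every word seen so far
    uniqueMap = {}   # words seen exactly once so far
    for i in file:
        myList = i.split(' ')
        for j in myList:
            j = j.rstrip()
            if j in myMap:
                # pop() with a default avoids a KeyError on a third occurrence
                uniqueMap.pop(j, None)
            else:
                myMap[j] = 1
                uniqueMap[j] = 1
    file.close()
    print uniqueMap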

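Source

2015-04-02 godzilla

Do you mean unique in the sense that they appear only once? – 2015-04-02 12:13:16

Yes, words that appear only once – godzilla 2015-04-02 12:16:04

This is O(n), not O(n^2), because Python dict/set lookups are O(1) on average, unless you have weird keys that cause _lots_ of hash collisions. If your code used sets instead of dicts it would be slightly more memory-efficient, but both are implemented as hash tables. However, using Counter is a better plan: it makes the code easier to read, and it delegates more of the work to code that runs at C speed, rather than doing the membership tests at Python speed. – 2015-04-02 12:31:26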

Try this approach to get the unique words of the file, using Counter:

from collections import Counter

with open("myfile.txt") as input_file:
    # Count every whitespace-separated word in the file
    word_counts = Counter(word for line in input_file for word in line.split())

# Words that appear exactly once (Python 2; use .items() on Python 3)
print([word for (word, count) in word_counts.iteritems() if count == 1])

Source

2015-04-02 12:13:40 itzMEonTV

Could this be done with a set? – godzilla 2015-04-02 12:16:18

How would 'set(f)' find the unique words? – 2015-04-02 12:18:48

Updated, I think it works now :) – itzMEonTV 2015-04-02 12:19:47

If you want to find all the unique words and treat foo the same as foo., you need to strip the punctuation.

from collections import Counter 
from string import punctuation 

with open("myfile.txt") as f: 
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.split()) 

# Python 2; use .items() instead of .iteritems() on Python 3
print([word for word, count in word_counts.iteritems() if count == 1])

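If you want to ignore case, you also need to use line.lower(). If you want to get the unique words accurately, there is more involved than just splitting the lines on whitespace.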
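A minimal sketch of what "more involved" could look like, assuming Python 3 and assuming that a word is a run of letters, digits, underscores, or apostrophes. That definition, the `WORD_RE` pattern, and the helper name `hapax_words` are illustrative choices, not part of the answer above.

```python
import re
from collections import Counter

# Assumption: a word is a run of letters, digits, underscores or apostrophes.
WORD_RE = re.compile(r"[\w']+")

def hapax_words(lines):
    """Return the words that occur exactly once, ignoring case and punctuation."""
    counts = Counter(
        word
        for line in lines
        for word in WORD_RE.findall(line.lower())
    )
    return sorted(word for word, count in counts.items() if count == 1)

sample = ["Foo bar, foo.", "Baz bar qux!"]
print(hapax_words(sample))  # ['baz', 'qux']
```

Because `hapax_words` only iterates over lines, an open file object can be passed in directly in place of the sample list.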

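Source

2015-04-02 12:16:15

Using 'print([k for k, v in c.items() if v == 1])' rather than the '__getitem__' calls would be more efficient... – 2015-04-02 12:19:28

@JonClements, yes, it just took less time to write it the other way ;) – 2015-04-02 12:22:02

Using '.iteritems()' would be more efficient: it has a smaller memory footprint. – EOL 2015-04-02 12:25:59

You can modify your logic slightly (using sets rather than dicts, for example) so that a word is moved out of the unique set on its second occurrence: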

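words = set()         # words seen at least twice
unique_words = set()  # words seen exactly once so far
with open("myfile.txt") as f:
    for w in (word.strip() for line in f for word in line.split(' ')):
        if w in words:
            continue
        if w in unique_words:
            # second occurrence: no longer unique
            unique_words.remove(w)
            words.add(w)
        else:
            unique_words.add(w)
print(unique_words)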

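Source

2015-04-02 12:18:32 AChampion

I think the OP is trying to find the words that appear only once in the file. – hitzg 2015-04-02 12:20:31

@hitzg; the edit also makes this answer correct. – 2015-04-02 13:49:16

If you simply call 'line.split()' (with no argument), there is no need for 'word.strip()'. – EOL 2015-04-03 03:56:03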
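To illustrate the difference this comment points at, here is a small Python 3 sketch; the sample string is made up for the demonstration.

```python
line = "one  two three \n"

# split(' ') cuts on every single space, so an empty string and the newline survive
print(line.split(' '))  # ['one', '', 'two', 'three', '\n']

# split() with no argument cuts on any run of whitespace and drops empty strings
print(line.split())     # ['one', 'two', 'three']
```

With `split(' ')`, the later `strip()` call turns the `'\n'` token into an empty string, which is then counted as a word like any other.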

I would go with the collections.Counter approach, but if you only want to use sets, you can do it like this:

with open('myfile.txt') as input_file:
    all_words = set()
    dupes = set()
    for word in (word for line in input_file for word in line.split()):
        if word in all_words:
            dupes.add(word)
        all_words.add(word)

    unique = all_words - dupes

Given the input:

one two three 
two three four 
four five six

unique contains:

{'five', 'one', 'six'}

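Source

2015-04-02 12:36:04

This is the most efficient solution – 2015-04-02 12:58:05

@Padraic, unless you have done some 'timeit's, I doubt it is... The Counter approach is more intuitive and more efficient – 2015-04-02 13:01:29

I just timed it: 1.16 ms versus 1.68 ms for 2000 words – 2015-04-02 13:01:43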
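For anyone who wants to repeat that kind of measurement, here is a hedged Python 3 sketch using `timeit`. The function names, the generated 2000-word list, and the vocabulary size are assumptions made for illustration; the comments above do not say what data was used, and results will vary with the machine and Python version.

```python
import random
import timeit
from collections import Counter

def counter_hapax(words):
    """Counter-based approach: count everything, keep the words with count 1."""
    counts = Counter(words)
    return {word for word, count in counts.items() if count == 1}

def set_hapax(words):
    """Set-based approach: track all words and duplicates, then subtract."""
    all_words, dupes = set(), set()
    for word in words:
        if word in all_words:
            dupes.add(word)
        all_words.add(word)
    return all_words - dupes

# Hypothetical test data: 2000 words drawn from a 1500-word vocabulary
random.seed(0)
words = ["w%d" % random.randrange(1500) for _ in range(2000)]

# Both approaches must agree before their timings are worth comparing
assert counter_hapax(words) == set_hapax(words)

RUNS = 200
for func in (counter_hapax, set_hapax):
    total_seconds = timeit.timeit(lambda: func(words), number=RUNS)
    print("%s: %.3f ms per call" % (func.__name__, total_seconds / RUNS * 1000))
```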
