2012-03-29 44 views
0

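Performing a set difference operation on a list of tuples

I want to get the difference between two containers, but the containers have an odd structure, so I don't know the best way to compute the difference. I cannot change the type and structure of one container, but I can change the other (the delims variable).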

delims = ['on','with','to','and','in','the','from','or'] 
words = collections.Counter(s.split()).most_common() 
# words is a list of (word, count) tuples sorted by count, e.g. [("a", 9), ("the", 2), ("diplomacy", 1)]

#I want to perform a 'difference' operation on words to remove all the delims words 
descriptive_words = set(words) - set(delims) 

# Because of the unusual structure of words (a list of tuples), it's hard to perform a difference
# on it. What would be the best way to perform a difference? Maybe...

delims = [('on',0),('with',0),('to',0),('and',0),('in',0),('the',0),('from',0),('or',0)] 
words = collections.Counter(s.split()).most_common() 
descriptive_words = set(words) - set(delims) 

# Or maybe 
words = collections.Counter(s.split()).most_common() 
n_words = [] 
for w in words: 
    n_words.append(w[0]) 
delims = ['on','with','to','and','in','the','from','or'] 
descriptive_words = set(n_words) - set(delims) 

Answers

3

How about modifying words by simply deleting all the delimiters?

words = collections.Counter(s.split()) 
for delim in delims: 
    del words[delim] 
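
A minimal runnable sketch of this approach, assuming a made-up sample sentence `s`. Deleting a key from a `Counter` does not raise `KeyError` when the key is missing, so delimiters that never occur in the text are harmless:

```python
import collections

# Sample input, assumed for illustration only.
s = "the a a a a the a a a a a diplomacy"
delims = ['on', 'with', 'to', 'and', 'in', 'the', 'from', 'or']

# Keep the Counter object itself rather than calling most_common() right away.
words = collections.Counter(s.split())
for delim in delims:
    # Counter.__delitem__ silently ignores keys that are not present.
    del words[delim]

# Convert to the list-of-tuples form only after the delimiters are gone.
descriptive_words = words.most_common()
print(descriptive_words)
```

The deletion has to happen before `most_common()` is called, because that call returns a plain list of tuples that cannot be indexed by word.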
+0

That looks like it works and I think I'll use it, but words is a list of tuples, so how can I write `words[delim]`? – 2012-03-29 09:45:53

+0

@JakeM - apply it directly to the Counter object. – eumiro 2012-03-29 09:48:38

+0

Ah, I was thinking of words as the Counter object – 2012-03-29 09:49:03

1

This is how I would do it:

delims = set(['on','with','to','and','in','the','from','or']) 
# ... 
descriptive_words = filter(lambda x: x[0] not in delims, words)

That uses the filter approach. A viable alternative is:

delims = set(['on','with','to','and','in','the','from','or']) 
# ... 
descriptive_words = [(word, count) for word, count in words if word not in delims]

Make sure delims is a set, which allows O(1) lookup.
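
Both variants can be run end to end as in the sketch below. The sample sentence `s` is an assumption made for illustration. Under Python 3, `filter` returns a lazy iterator, so it is wrapped in `list()` here to get the same list of tuples as the comprehension:

```python
import collections

# Sample input, assumed for illustration only.
s = "the a a a a the a a a a a diplomacy"
delims = {'on', 'with', 'to', 'and', 'in', 'the', 'from', 'or'}

# most_common() yields (word, count) tuples, highest count first.
words = collections.Counter(s.split()).most_common()

# Variant 1: filter with a lambda that inspects the word in each tuple.
via_filter = list(filter(lambda x: x[0] not in delims, words))

# Variant 2: list comprehension that unpacks each tuple.
via_comprehension = [(word, count) for word, count in words if word not in delims]

print(via_filter)
print(via_comprehension)
```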

+0

The first approach uses `in`. Does that mean we iterate over the entire collection of delimiters for every comparison? – 2012-03-29 09:48:44

+0

Not if they are sets or dicts. Lookup is O(1), as [the docs say](http://wiki.python.org/moin/TimeComplexity). – brice 2012-03-29 09:51:25
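
One way to check the lookup cost yourself is a rough `timeit` comparison, sketched below. The `words` data and the `keep` helper are hypothetical names made up for this example, and the timings depend on the machine. With only eight delimiters the gap is modest, and it grows as the delimiter collection gets larger:

```python
import timeit

delims_list = ['on', 'with', 'to', 'and', 'in', 'the', 'from', 'or']
delims_set = set(delims_list)

# Hypothetical input: many (word, count) tuples, none of which is a delimiter,
# so every list lookup has to scan the whole delimiter list.
words = [("word%d" % i, i) for i in range(10000)]

def keep(delims):
    """Return the tuples whose word is not in delims."""
    return [pair for pair in words if pair[0] not in delims]

# Both containers give the same result; only the lookup cost differs.
same_result = keep(delims_list) == keep(delims_set)

list_time = timeit.timeit(lambda: keep(delims_list), number=20)
set_time = timeit.timeit(lambda: keep(delims_set), number=20)
print("same result:", same_result)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```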

0

If you are iterating over it anyway, why bother converting them to sets?

dwords = [delim[0] for delim in delims] 
words = [word for word in words if word[0] not in dwords] 
+0

@Rob Young Yes, I am trying to avoid iterating over them for efficiency. Any solution that avoids the iteration would be best, I think – 2012-03-29 09:47:19

+0

Bad idea. This would be O(n^2), wouldn't it? – brice 2012-03-29 09:50:13

0

For performance, you can use a lambda function:

filter(lambda word: word[0] not in delims, words)
+0

filter + lambda is less readable than a list comprehension, and list comprehensions are [usually faster](http://wiki.python.org/moin/PythonSpeed/PerformanceTips#Loops). – 2012-03-29 10:11:49

+0

Secondly, since delims is a list, this is still O(n^2). – brice 2012-03-29 10:27:35

1

The simplest answer is to do:

import collections 

s = "the a a a a the a a a a a diplomacy" 
delims = {'on','with','to','and','in','the','from','or'} 
# For older versions of Python without set literals:
# delims = set(['on','with','to','and','in','the','from','or'])
words = collections.Counter(s.split()) 

not_delims = {key: value for (key, value) in words.items() if key not in delims} 
# For older versions of Python without dict comprehensions:
# not_delims = dict(((key, value) for (key, value) in words.items() if key not in delims))

This gives us:

{'a': 9, 'diplomacy': 1} 

Another way is to do the filtering up front:

import collections 

s = "the a a a a the a a a a a diplomacy" 
delims = {'on','with','to','and','in','the','from','or'} 
counted_words = collections.Counter((word for word in s.split() if word not in delims)) 

Here, you filter the list of words before handing it to the Counter, which gives the same result.
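
As a quick check, the sketch below runs both variants on the same sample sentence and compares them. The names `filtered_after` and `filtered_before` are chosen here for illustration. Since the pre-filtered result is still a `Counter`, `most_common()` can turn it back into the list of tuples from the question:

```python
import collections

s = "the a a a a the a a a a a diplomacy"
delims = {'on', 'with', 'to', 'and', 'in', 'the', 'from', 'or'}

# Variant 1: count everything, then drop the delimiters.
filtered_after = {key: value
                  for key, value in collections.Counter(s.split()).items()
                  if key not in delims}

# Variant 2: drop the delimiters first, then count what is left.
filtered_before = collections.Counter(
    word for word in s.split() if word not in delims)

# Compare as plain dicts so the check does not depend on the container type.
print(dict(filtered_before) == filtered_after)

# The Counter can still produce the sorted list-of-tuples form.
print(filtered_before.most_common())
```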