Python, dictionaries, and chi-square contingency tables

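This is a problem I've been racking my brain over for a long time, so any help would be great. I have a file containing several lines in the following format (the word, the time at which the word occurred, and the frequency of documents containing the given word within the given time instance). Below is an example of what the input file looks like.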

#inputfile 
<word, time, frequency> 
apple, 1, 3 
banana, 1, 2 
apple, 2, 1 
banana, 2, 4 
orange, 3, 1

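Below is the Python class I use to create a 2-D dictionary that stores the file above, using <word, time> as the key and the frequency as the value: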

class Ddict(dict):
    '''
    2D dictionary class
    '''
    def __init__(self, default=None):
        self.default = default

    def __getitem__(self, key):
        if not self.has_key(key):
            self[key] = self.default()
        return dict.__getitem__(self, key)


wordtime=Ddict(dict) # Store each inputfile entry with a <word,time> key
timeword=Ddict(dict) # Store each inputfile entry with a <time,word> key

# Loop over every line of the inputfile
for line in open('inputfile'):
    word,time,count=line.split(',')

    # If <word,time> already a key, increment count
    try:
        wordtime[word][time]+=count
    # Otherwise, create the key
    except KeyError:
        wordtime[word][time]=count

    # If <time,word> already a key, increment count
    try:
        timeword[time][word]+=count
    # Otherwise, create the key
    except KeyError:
        timeword[time][word]=count

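The problem I'm having is computing certain things while iterating over the entries in this 2D dictionary. For each word 'w' and each time 't', compute: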

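The number of documents with word 'w' within time 't'. (a)
The number of documents without word 'w' within time 't'. (b)
The number of documents with word 'w' outside time 't'. (c)
The number of documents without word 'w' outside time 't'. (d)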

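Each of the items above represents one cell of a chi-square contingency table for each word and time. Can all of these be computed within a single loop, or do they need to be done one at a time?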

Ideally, I would like the output to look like the line below, where a, b, c, d are the items calculated above:

print "%s, %s, %s, %s" %(a,b,c,d)

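For the input file above, the contingency table for the word 'apple' at time '1' would be (3,2,1,6). I'll explain how each cell is computed:

'3' documents contain 'apple' within time '1'.
There are '2' documents within time '1' that don't contain 'apple'.
There is '1' document containing 'apple' outside time '1'.
There are 6 documents outside time '1' that don't contain the word 'apple' (1 + 4 + 1).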

Source

2010-06-12 GobiasKoffi

'dict.has_key()' is old, deprecated, and slow. Instead of 'd.has_key(k)' use 'k in d'. Another poster mentioned 'defaultdict'. Consider updating the tutorial/book that you are using. – 2010-06-12 22:47:43

@JohnMachin Thanks, I'll be sure to keep that in mind in the future. – GobiasKoffi 2010-06-12 22:53:54
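For readers following the comment above, here is a minimal sketch of the same two lookup tables built with 'defaultdict' and the 'in' operator, written for Python 3. The names 'build_counts' and 'SAMPLE_LINES' are made up for this illustration. Converting the time and count fields to int is my own assumption: the question's code keeps them as strings, so '+=' would concatenate text rather than add numbers.

```python
from collections import defaultdict

# Sample rows copied from the question's input file.
SAMPLE_LINES = [
    "apple, 1, 3",
    "banana, 1, 2",
    "apple, 2, 1",
    "banana, 2, 4",
    "orange, 3, 1",
]

def build_counts(lines):
    """Return (wordtime, timeword) nested count dictionaries (hypothetical helper)."""
    wordtime = defaultdict(lambda: defaultdict(int))
    timeword = defaultdict(lambda: defaultdict(int))
    for line in lines:
        word, time, count = [field.strip() for field in line.split(",")]
        time, count = int(time), int(count)  # assumption: numeric fields, so += adds numbers
        wordtime[word][time] += count        # missing keys start at 0, no try/except needed
        timeword[time][word] += count
    return wordtime, timeword

wordtime, timeword = build_counts(SAMPLE_LINES)
print("apple" in wordtime)   # membership test instead of has_key()
print(wordtime["apple"][1])
```

One thing to watch with this approach: merely reading a missing key from a 'defaultdict' creates it, so use 'in' when you only want to test for presence.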

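Your 4 numbers for apple/1 add up to 12, which is more than the total number of observations (11)! There are only 5 documents outside time '1' that don't contain the word 'apple'.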

You need to partition the observations into 4 disjoint subsets:
a: apple and 1 => 3
b: not-apple and 1 => 2
c: apple and not-1 => 1
d: not-apple and not-1 => 5
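As a quick sanity check of the partition above, the four cells can be summed directly from the raw observations. This is only an illustrative sketch in Python 3; the helper name 'two_by_two' and the 'observations' list are not part of the original answer.

```python
# Observations from the question: (word, time, document count).
observations = [
    ("apple", 1, 3),
    ("banana", 1, 2),
    ("apple", 2, 1),
    ("banana", 2, 4),
    ("orange", 3, 1),
]

def two_by_two(observations, word, time):
    """Return (a, b, c, d) for one word/time pair by direct summation."""
    a = sum(n for w, t, n in observations if w == word and t == time)
    b = sum(n for w, t, n in observations if w != word and t == time)
    c = sum(n for w, t, n in observations if w == word and t != time)
    d = sum(n for w, t, n in observations if w != word and t != time)
    return a, b, c, d

cells = two_by_two(observations, "apple", 1)
total = sum(n for _, _, n in observations)
print(cells)              # (3, 2, 1, 5)
print(sum(cells), total)  # 11 11
```

Because every observation falls into exactly one of the four conditions, the cells always add up to the grand total. This version rescans the data for every pair, so it is meant for checking results, not for large inputs.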

Here is some code that shows one way of doing it:

from collections import defaultdict

class Crosstab(object):
    """Cell counts plus row, column, and grand totals."""

    def __init__(self):
        self.count = defaultdict(lambda: defaultdict(int))
        self.row_tot = defaultdict(int)
        self.col_tot = defaultdict(int)
        self.grand_tot = 0

    def add(self, r, c, n):
        self.count[r][c] += n
        self.row_tot[r] += n
        self.col_tot[c] += n
        self.grand_tot += n

def load_data(line_iterator, conv_funcs):
    ct = Crosstab()
    for line in line_iterator:
        # Convert each comma-separated field with its matching function.
        r, c, n = [func(s) for func, s in zip(conv_funcs, line.split(','))]
        ct.add(r, c, n)
    return ct

def display_all_2x2_tables(crosstab):
    for rx in crosstab.row_tot:
        for cx in crosstab.col_tot:
            a = crosstab.count[rx][cx]          # rx and cx
            b = crosstab.col_tot[cx] - a        # not rx and cx
            c = crosstab.row_tot[rx] - a        # rx and not cx
            d = crosstab.grand_tot - a - b - c  # not rx and not cx
            assert all(x >= 0 for x in (a, b, c, d))
            print(",".join(str(x) for x in (rx, cx, a, b, c, d)))

if __name__ == "__main__":

    # inputfile
    # <word, time, frequency>
    lines = """\
    apple, 1, 3
    banana, 1, 2
    apple, 2, 1
    banana, 2, 4
    orange, 3, 1""".splitlines()

    ct = load_data(lines, (str.strip, int, int))
    display_all_2x2_tables(ct)

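And here is the output (the row order follows dictionary iteration order, so it may differ on your Python version):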

orange,1,0,5,1,5 
orange,2,0,5,1,5 
orange,3,1,0,0,10 
apple,1,3,2,1,5 
apple,2,1,4,3,3 
apple,3,0,1,4,6 
banana,1,2,3,4,2 
banana,2,4,1,2,4 
banana,3,0,1,6,4
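Since the goal is a chi-square test, each (a, b, c, d) row above can then be turned into a test statistic. The sketch below is not from the answer. It assumes Python 3 (true division) and uses the standard shortcut formula for a 2x2 table, N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)), without a continuity correction. The function name 'chi_square_2x2' is made up for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return None  # a zero marginal total leaves the statistic undefined
    return n * (a * d - b * c) ** 2 / denominator

# The apple/1 table from the output above.
stat = chi_square_2x2(3, 2, 1, 5)
print(round(stat, 4))  # 2.2131
```

With counts this small the chi-square approximation is unreliable, so treat the number as a demonstration of the arithmetic only.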

Source

2010-06-13 00:49:12

This is a nice approach. I especially like the technique in 'load_data' - using 'line_iterator' and 'conv_funcs'. – FMc 2010-06-13 01:38:43
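To show the 'conv_funcs' idea from this comment in isolation, here is a small sketch in Python 3. 'parse_fields' and its 'sep' parameter are hypothetical names of mine, not part of the answer. Each field is paired with its own converter through 'zip'.

```python
def parse_fields(line, conv_funcs, sep=","):
    """Split one line and convert each field with its paired function."""
    return [func(field) for func, field in zip(conv_funcs, line.split(sep))]

print(parse_fields("  apple, 1, 3 ", (str.strip, int, int)))
# ['apple', 1, 3]
print(parse_fields("2.5; x ;7", (float, str.strip, int), sep=";"))
# [2.5, 'x', 7]
```

Passing the converters in as a tuple keeps the parsing code independent of the column types. Note that 'zip' stops at the shorter input, so extra fields or extra converters are silently ignored.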
