2011-08-23 97 views
1

Extracting items, Python

I have a text file made up of chunks of five tab-delimited lines:

1 \t DESCRIPTION \t SENTENCE \t ITEMS 

1 \t DESCRIPTION \t SENTENCE \t ITEMS 

1 \t DESCRIPTION \t SENTENCE \t ITEMS 

1 \t DESCRIPTION \t SENTENCE \t ITEMS 

1 \t DESCRIPTION \t SENTENCE \t ITEMS 

2 \t DESCRIPTION \t SENTENCE \t ITEMS 

2 \t DESCRIPTION \t SENTENCE \t ITEMS 

2 \t DESCRIPTION \t SENTENCE \t ITEMS 

2 \t DESCRIPTION \t SENTENCE \t ITEMS 

2 \t DESCRIPTION \t SENTENCE \t ITEMS 

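Within each chunk, the DESCRIPTION and SENTENCE columns are identical. The data of interest is in the ITEMS column, which differs on every line of the chunk and has the following format: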

word1, word2, word3 

...and so on.

For each 5-line chunk, I need to count the frequency of word1, word2, etc. in ITEMS. For example, if the first 5-line chunk is as follows

1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3 

1 \t DESCRIPTION \t SENTENCE \t word1, word2 

1 \t DESCRIPTION \t SENTENCE \t word4 

1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3 

1 \t DESCRIPTION \t SENTENCE \t word1, word2 

then the correct output for this 5-line chunk would be

1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1) 

i.e., the chunk number, followed by the sentence, followed by the frequency counts of the words.

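I have some code that extracts the five-line chunks and counts the frequency of the words in a chunk, but I'm stuck on the task of isolating each chunk, getting its word frequencies, moving on to the next one, and so on.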

from itertools import groupby

def GetFrequencies(file):
    file_contents = open(file).readlines() #file as list
    """use zip to get the entire file as list of 5-line chunk tuples"""
    five_line_increments = zip(*[iter(file_contents)]*5)
    for chunk in five_line_increments: #for each 5-line chunk...
        for sentence in chunk: #...and for each sentence in that chunk
            words = sentence.split('\t')[3].split() #get the ITEMS column at index 3
            words_no_comma = [x.strip(',') for x in words] #get rid of the commas
            words_no_ws = [x.strip(' ') for x in words_no_comma] #get rid of the whitespace resulting from the removed commas

        """STUCK HERE The idea originally was to take the words lists for
        each chunk and combine them to create a big list, 'collection,' and
        feed this into the for-loop below."""

    for key, group in groupby(collection): #collection is a big list containing all of the words in the ITEMS section of the chunk, e.g., ['word1', 'word2', 'word3', 'word1', 'word1', 'word2', etc.]
        print key,len(list(group)),

Answers

0

Editing your code a little, I think this does what you want:

def GetFrequencies(file):
    file_contents = open(file).readlines()  # file as list
    # use zip to get the entire file as a list of 5-line chunk tuples
    five_line_increments = zip(*[iter(file_contents)]*5)
    for chunk in five_line_increments:  # for each 5-line chunk...
        word_freq = {}  # word frequencies for each chunk
        for sentence in chunk:  # ...and for each sentence in that chunk
            # get the ITEMS column at index 3 and put its words in a list
            words = sentence.split('\t')[3].strip().split(', ')
            for word in words:
                if word not in word_freq:
                    word_freq[word] = 1
                else:
                    word_freq[word] += 1
        print word_freq

Output:

{'word4': 1, 'word1': 4, 'word3': 2, 'word2': 4} 
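
For reference, here is a minimal Python 3 sketch of the same per-chunk counting, formatted the way the question asks. The helper name `chunk_frequencies` and the in-memory sample lines are assumptions made for illustration. It also assumes every chunk has exactly five non-blank lines, because the `zip(*[iter(...)] * 5)` idiom silently drops a trailing incomplete chunk.

```python
def chunk_frequencies(lines, size=5):
    """Yield (chunk_number, sentence, counts) for each block of `size` lines."""
    # Split every non-blank line into its tab-separated fields.
    rows = [line.rstrip('\n').split('\t') for line in lines if line.strip()]
    # Consume the rows `size` at a time; an incomplete last block is dropped.
    for chunk in zip(*[iter(rows)] * size):
        counts = {}
        for fields in chunk:
            for word in fields[3].split(','):
                word = word.strip()
                if word:
                    counts[word] = counts.get(word, 0) + 1
        yield chunk[0][0].strip(), chunk[0][2].strip(), counts


# Hypothetical sample data mirroring the example chunk in the question.
sample = [
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2\n",
    "1\tDESCRIPTION\tSENTENCE\tword4\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2\n",
]

for number, sentence, counts in chunk_frequencies(sample):
    pairs = ", ".join("%s: %d" % (w, counts[w]) for w in sorted(counts))
    print("%s, %s, (%s)" % (number, sentence, pairs))
# prints: 1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)
```

Sorting the words before printing is only there to make the output order predictable.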
+0

This does a nice bit of work. Thanks! – Renklauf

0

To summarize: you want to append the words to collection if they are not "DESCRIPTION" or "SENTENCE"? Try this:

for word in words_no_ws:
    if word not in ("DESCRIPTION", "SENTENCE"):
        collection.append(word)
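
One caveat if `collection` is then fed to `groupby`, as in the question: `itertools.groupby` only merges consecutive equal items, so the list has to be sorted first. A small sketch of that step (the function name `count_with_groupby` and the sample list are made up for illustration):

```python
from itertools import groupby


def count_with_groupby(collection):
    """Count occurrences of each word by grouping a sorted copy of the list."""
    # groupby only groups adjacent equal items, hence the sort.
    return dict((key, len(list(group))) for key, group in groupby(sorted(collection)))


words = ['word1', 'word2', 'word3', 'word1', 'word2', 'word4']
print(count_with_groupby(words))
# Without the sort, 'word1' would come out as two separate groups of length 1.
```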
1

Using Python 2.7

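#!/usr/bin/env python

import collections

chunks = {}  # chunk number -> [sentence, word, word, ...]

with open('input') as fd:
    for line in fd:
        fields = [f.strip() for f in line.split('\t')]
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        key = fields[0]
        if key not in chunks:
            chunks[key] = [fields[2]]  # first element holds the sentence
        # count the ITEMS of every line, including the first line of a chunk
        chunks[key].extend(w.strip() for w in fields[3].split(','))

for k, v in chunks.iteritems():
    counter = collections.Counter(v[1:])
    print k, v[0], counter

Output:

1 SENTENCE Counter({'word1': 4, 'word2': 4, 'word3': 2, 'word4': 1})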
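
Because a plain dict in Python 2 does not keep insertion order, the chunks above may print in any order. A possible alternative, sketched here in Python 3 syntax with hypothetical names (`frequencies_by_chunk`, the sample data), is to let `itertools.groupby` split the lines on the chunk number. This assumes the lines of a chunk are adjacent in the file, as they are in the question, and it still needs `collections.Counter`, so Python 2.7 or later.

```python
from collections import Counter
from itertools import groupby


def frequencies_by_chunk(lines):
    """Yield (chunk_number, sentence, Counter) in file order."""
    # Lazily split each non-blank line into stripped tab-separated fields.
    rows = ([f.strip() for f in line.split('\t')] for line in lines if line.strip())
    # Consecutive rows sharing the same first column form one chunk.
    for number, group in groupby(rows, key=lambda fields: fields[0]):
        counter = Counter()
        sentence = None
        for fields in group:
            sentence = fields[2]
            counter.update(w.strip() for w in fields[3].split(',') if w.strip())
        yield number, sentence, counter


# Hypothetical sample with two short chunks.
sample = [
    "1\tDESC\tSENTENCE A\tword1, word2\n",
    "1\tDESC\tSENTENCE A\tword1\n",
    "2\tDESC\tSENTENCE B\tword3\n",
]
for number, sentence, counter in frequencies_by_chunk(sample):
    print(number, sentence, dict(counter))
```

Grouping on the chunk number also removes the assumption that every chunk is exactly five lines long.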
+0

Can't update to 2.7 because of a time crunch, but this is the code – Renklauf

1

There is a CSV parser in the standard library that can handle splitting the input for you:

import csv 
import collections 

def GetFrequencies(file_in):
    sentences = dict()
    # csv.reader is not a context manager, so the with statement wraps open() instead
    with open(file_in, 'rb') as fh:
        for line in csv.reader(fh, delimiter='\t'):
            sentence = line[0]  # chunk number from the first column
            if sentence not in sentences:
                sentences[sentence] = collections.Counter()
            sentences[sentence].update([x.strip(' ') for x in line[3].split(',')])