Extracting items, Python

I have a text file made up of 5-line chunks of tab-delimited lines:
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
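etc.
Within each chunk, the DESCRIPTION and SENTENCE columns are the same. The data of interest is in the ITEMS column, which is different for each line in the chunk and has the following format: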
word1, word2, word3
...and so on.
For each 5-line chunk, I need to count the frequency of word1, word2, etc. in ITEMS. For example, if the first 5-line chunk were as follows
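To make the ITEMS format concrete, here is a minimal sketch of pulling the word list out of a single line. It assumes the columns are separated by real tab characters and that ITEMS is always the fourth column; `parse_items` is a hypothetical helper name, not something from my existing code.

```python
def parse_items(line):
    """Return the words in the ITEMS column (index 3) of one tab-delimited line."""
    fields = line.rstrip("\n").split("\t")
    # Split on commas, then trim the whitespace left around each word.
    return [word.strip() for word in fields[3].split(",") if word.strip()]


example = "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n"
print(parse_items(example))  # ['word1', 'word2', 'word3']
```

Splitting on the comma and stripping each piece handles both the commas and the surrounding spaces in one step.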
1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3
1 \t DESCRIPTION \t SENTENCE \t word1, word2
1 \t DESCRIPTION \t SENTENCE \t word4
1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3
1 \t DESCRIPTION \t SENTENCE \t word1, word2
then the correct output for this 5-line chunk would be
1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)
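That is, the chunk number, followed by the sentence, followed by the frequency counts for the words.
I have some code that extracts the five-line chunks and counts the frequency of the words in a chunk, but I'm stuck on the task of isolating each chunk, getting its word frequencies, moving on to the next one, and so on.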
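As a sanity check on that expected output, the counts for the example chunk can be reproduced with `collections.Counter` from the standard library. This is only a sketch: the word lists are typed in by hand, `format_chunk` is a hypothetical name, and the alphabetical ordering of the words is my assumption about how the output should be ordered.

```python
from collections import Counter

# The ITEMS column of the five example lines, already split into words.
chunk_items = [
    ["word1", "word2", "word3"],
    ["word1", "word2"],
    ["word4"],
    ["word1", "word2", "word3"],
    ["word1", "word2"],
]


def format_chunk(number, sentence, item_lists):
    """Build one output line: chunk number, sentence, then word counts."""
    counts = Counter(word for items in item_lists for word in items)
    # sorted() orders the (word, count) pairs alphabetically by word.
    pairs = ", ".join("{}: {}".format(w, n) for w, n in sorted(counts.items()))
    return "{}, {}, ({})".format(number, sentence, pairs)


print(format_chunk("1", "SENTENCE", chunk_items))
# 1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)
```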
from itertools import groupby

def GetFrequencies(file):
    with open(file) as f:
        file_contents = f.readlines()  # file as list
    # use zip to get the entire file as a list of 5-line chunk tuples
    five_line_increments = zip(*[iter(file_contents)] * 5)
    for chunk in five_line_increments:  # for each 5-line chunk...
        for sentence in chunk:  # ...and for each sentence in that chunk
            words = sentence.split('\t')[3].split()  # get the ITEMS column at index 3
            words_no_comma = [x.strip(',') for x in words]  # get rid of the commas
            words_no_ws = [x.strip(' ') for x in words_no_comma]  # get rid of the whitespace resulting from the removed commas
            # STUCK HERE. The idea originally was to take the words lists for
            # each chunk and combine them to create a big list, 'collection',
            # and feed this into the for-loop below.
        # collection (not built yet) is a big list containing all of the words
        # in the ITEMS section of the chunk, e.g.
        # ['word1', 'word2', 'word3', 'word1', 'word1', 'word2', etc.]
        for key, group in groupby(collection):
            print(key, len(list(group)), end=' ')
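One possible way past the stuck point, sketched under a few assumptions: the file has real tabs between columns, every chunk is exactly 5 lines, and a per-chunk `collections.Counter` replaces the `collection` list plus `groupby`. Note that `groupby` only merges adjacent equal items, so the combined list would have to be sorted first; `Counter` avoids that step. The name `get_frequencies` and the in-memory `sample` lines are illustrative only.

```python
from collections import Counter


def get_frequencies(lines, chunk_size=5):
    """Yield (chunk number, sentence, Counter of ITEMS words) per chunk."""
    # Zipping the same iterator chunk_size times groups consecutive lines.
    # A trailing chunk with fewer than chunk_size lines is silently dropped.
    for chunk in zip(*[iter(lines)] * chunk_size):
        rows = [[field.strip() for field in line.split("\t")] for line in chunk]
        counts = Counter()  # fresh counter, so each chunk is counted on its own
        for row in rows:
            counts.update(w.strip() for w in row[3].split(",") if w.strip())
        # Chunk number and sentence are the same on every line of the chunk.
        yield rows[0][0], rows[0][2], counts


sample = [
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2\n",
    "1\tDESCRIPTION\tSENTENCE\tword4\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2\n",
]
for number, sentence, counts in get_frequencies(sample):
    print(number, sentence, dict(counts))
# 1 SENTENCE {'word1': 4, 'word2': 4, 'word3': 2, 'word4': 1}
```

Because the function only iterates over `lines`, an open file object can be passed in place of the list, which avoids reading the whole file into memory with `readlines()`.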
This really does a nice bit of work. Thanks! – Renklauf