2013-04-28

I have a file with three columns (separated by \t; the first column is the word, the second the lemma, the third the tag). Some lines contain only a dot or a comma. How can I count lemma frequencies across the whole file?

<doc n=1 id="CMP/94/10"> 
<head p="80%"> 
Customs customs tag1 
union union tag2 
in in tag3 
danger danger tag4 
of of tag5 
the the tag6 
</head> 
<head p="80%"> 
New new tag7 
restrictions restriction tag8 
in in tag3 
the the tag6 
. 
Hi hi tag8 
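Each data line can be parsed by splitting on the tab character; markup lines (starting with "<") and punctuation-only lines produce fewer than three fields and can be skipped. A minimal sketch of that idea, in Python 3 syntax, with a hypothetical parse_corpus helper (not part of the question's code):

```python
def parse_corpus(lines):
    """Return (word, lemma, tag) triples, skipping markup and punctuation-only lines."""
    entries = []
    for line in lines:
        fields = line.strip().split("\t")
        if line.startswith("<") or len(fields) < 3:
            continue  # markup like <head ...>, or a lone "." / ","
        entries.append(tuple(fields))
    return entries

# A few lines shaped like the sample data above:
sample = [
    '<head p="80%">\n',
    "Customs\tcustoms\ttag1\n",
    ".\n",
    "union\tunion\ttag2\n",
]
# parse_corpus(sample) -> [('Customs', 'customs', 'tag1'), ('union', 'union', 'tag2')]
```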

Suppose the user searches for the lemma "in". I want the frequency of "in" and the frequencies of the lemmas immediately before and after each occurrence of "in". So I want the corpus-wide frequencies of "union", "danger", "restriction" and "the". The result should be:

union 1 
danger 1 
restriction 1 
the 2 

How can I do this? I tried using lemma_counter = {} but it doesn't work.

I have no experience with Python, so please correct me if I've got something wrong.

c = open("corpus.vert")

corpus = []

for line in c:
    if not line.startswith("<"):
        corpus.append(line)

lemma = raw_input("Lemma you are looking for: ")

counter = 0
lemmas_before_after = []
for i in range(len(corpus)):
    parsed_line = corpus[i].split("\t")
    if len(parsed_line) > 1:
        if parsed_line[1] == lemma:
            counter += 1  # this counts lemma frequency
            for j in range(i-1, i+2):
                if j < len(corpus) and j >= 0:
                    parsed_line_with_context = corpus[j].split("\t")
                    found_lemma = parsed_line_with_context[0].replace("\n", "")
                    if len(parsed_line_with_context) > 1:
                        if lemma != parsed_line_with_context[1].replace("\n", ""):
                            lemmas_before_after.append(found_lemma)
                    else:
                        lemmas_before_after.append(found_lemma)

print "list of lemmas ", lemmas_before_after 


lemma_counter = {}
for i in range(len(corpus)):
    for lemma in lemmas_before_after:
        if parsed_line[1] == lemma:
            if lemma in lemma_counter:
                lemma_counter[lemma] += 1
            else:
                lemma_counter[lemma] = 1

print lemma_counter 


fA = counter 
print "lemma frequency: ", fA 

Answer

This should get you 80% of the way there:

# Let's use some useful pieces of the awesome standard library 
from collections import namedtuple, Counter 

# Define a simple structure to hold the properties of each entry in corpus 
CorpusEntry = namedtuple('CorpusEntry', ['word', 'lemma', 'tag']) 

# Use a context manager ("with...") to automatically close the file when we no 
# longer need it 
with open('corpus.vert') as c:
    corpus = []
    for line in c:
        if len(line.strip()) > 1 and not line.startswith('<'):
            # Remove the newline character and split at tabs
            word, lemma, tag = line.strip().split('\t')
            # Put the obtained values in the structure
            entry = CorpusEntry(word, lemma, tag)
            # Put the structure in the corpus list
            corpus.append(entry)

# It's practical to wrap the counting in a function 
def get_frequencies(lemma):
    # Create a set of indices at which the lemma occurs in corpus. We use a
    # set because it is more efficient for the next part, checking if some
    # index is in this set
    lemma_indices = set()
    # Loop over corpus without manual indexing; enumerate provides information
    # about the current index and the value (some CorpusEntry added earlier).
    for index, entry in enumerate(corpus):
        if entry.lemma == lemma:
            lemma_indices.add(index)

    # Now that we have the indices at which the lemma occurs, we can loop over
    # corpus again and for each entry check if it is either one before or
    # one after the lemma. If so, add the entry's lemma to a new set.
    related_lemmas = set()
    for index, entry in enumerate(corpus):
        before_lemma = index+1 in lemma_indices
        after_lemma = index-1 in lemma_indices
        if before_lemma or after_lemma:
            related_lemmas.add(entry.lemma)

    # Finally, we need to count the number of occurrences of those related
    # lemmas
    counter = Counter()
    for entry in corpus:
        if entry.lemma in related_lemmas:
            counter[entry.lemma] += 1

    return counter

print get_frequencies('in') 
# Counter({'the': 2, 'union': 1, 'restriction': 1, 'danger': 1}) 
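The Counter returned above also supports ranked output directly via most_common, which is convenient for printing neighbours by frequency. A small illustration with a hand-built Counter matching the expected result (Python 3 syntax):

```python
from collections import Counter

freqs = Counter({"the": 2, "union": 1, "restriction": 1, "danger": 1})
ranked = freqs.most_common()   # all (lemma, count) pairs, highest count first
top = freqs.most_common(1)     # just the single most frequent lemma
```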

It could be written more concisely (see below), and the algorithm could be improved as well, though it is already O(n); the point is to keep it understandable.

For those interested:

with open('corpus.vert') as c:
    corpus = [CorpusEntry(*line.strip().split('\t')) for line in c
              if len(line.strip()) > 1 and not line.startswith('<')]

def get_frequencies(lemma):
    lemma_indices = {index for index, entry in enumerate(corpus)
                     if entry.lemma == lemma}
    related_lemmas = {entry.lemma for index, entry in enumerate(corpus)
                      if lemma_indices & {index+1, index-1}}
    return Counter(entry.lemma for entry in corpus
                   if entry.lemma in related_lemmas)

And here is a more procedural style, which runs about three times faster:

def get_frequencies(lemma):
    counter = Counter()
    related_lemmas = set()
    for index, entry in enumerate(corpus):
        counter[entry.lemma] += 1
        if entry.lemma == lemma:
            if index > 0:
                related_lemmas.add(corpus[index-1].lemma)
            if index < len(corpus)-1:
                related_lemmas.add(corpus[index+1].lemma)
    return {lemma: frequency for lemma, frequency in counter.iteritems()
            if lemma in related_lemmas}
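The "three times faster" claim can be checked with the standard timeit module. The sketch below mirrors the two approaches on a synthetic corpus (the two_pass and one_pass names and the test data are assumptions, and .items() is used instead of Python 2's .iteritems() so it runs on Python 3); the exact speedup depends on the corpus and the Python version:

```python
import timeit
from collections import Counter, namedtuple

CorpusEntry = namedtuple("CorpusEntry", ["word", "lemma", "tag"])
# A synthetic corpus just for timing; real numbers depend on the data.
corpus = [CorpusEntry("w", "l%d" % (i % 50), "t") for i in range(5000)]

def two_pass(lemma):
    # Collect neighbour lemmas of every occurrence, then count them.
    related = {corpus[i + d].lemma
               for i, e in enumerate(corpus) if e.lemma == lemma
               for d in (-1, 1) if 0 <= i + d < len(corpus)}
    return Counter(e.lemma for e in corpus if e.lemma in related)

def one_pass(lemma):
    # Count everything and collect neighbours in a single sweep.
    counter = Counter()
    related = set()
    for i, e in enumerate(corpus):
        counter[e.lemma] += 1
        if e.lemma == lemma:
            if i > 0:
                related.add(corpus[i - 1].lemma)
            if i < len(corpus) - 1:
                related.add(corpus[i + 1].lemma)
    return {l: n for l, n in counter.items() if l in related}

t_two = timeit.timeit(lambda: two_pass("l0"), number=20)
t_one = timeit.timeit(lambda: one_pass("l0"), number=20)
```

Both variants return the same frequencies, so timing them is a fair comparison.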

Thank you for your reply. I found that my file is not quite what I expected. Some lines contain only a dot or a comma, so the tuple unpacking does not work for them. I tried this: 'if not line.startswith('<'):' 'if len(line) > 1:' but it still gives me the error "need more than 1 value to unpack". – halik 2013-04-29 07:32:32


@halik You have to take into account that each 'line' still contains the newline character ('\n') before it is added to 'corpus', so initially every 'line' has a length greater than 1. I have adjusted my answer accordingly. – 2013-04-29 10:13:52
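The point about the newline can be seen directly: a line holding only a dot still has length 2 because of the trailing '\n', which is why the answer strips the line before checking its length. A tiny sketch of that check:

```python
line = ".\n"
raw_len = len(line)               # 2: the dot plus the newline character
stripped_len = len(line.strip())  # 1: just the dot
# The filter used in the answer: keep only real data lines
keep = stripped_len > 1 and not line.startswith("<")
```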
