
I want to parse a document and, if any of the names I care about are associated with a particular docno, count the total number of those names. After the for loop finishes, I want to store names[docno] = word count. So if namedict = {'henry': '', 'joe': ''}, and henry appears 4 times in docno = doc 1 and joe appears 6 times, the dictionary should store it as ('doc 1': 10). So far I can only count the total number of names across the whole text file.
Counting a dictionary of words inside specific HTML tags

from xml.dom.minidom import *
import re
from string import punctuation
from operator import itemgetter

def parseTREC1(atext):
    fc = open(atext, 'r').read()
    fc = '<DOCS>\n' + fc + '\n</DOCS>'
    dom = parseString(fc)
    w_re = re.compile('[a-z]+', re.IGNORECASE)
    doc_nodes = dom.getElementsByTagName('DOC')
    namelist = {'Matt': '', 'Earl': '', 'James': ''}
    default = 0
    indexdict = {}
    N = 10
    names = {}
    words = {}
    for doc_node in doc_nodes:
        docno = doc_node.getElementsByTagName('DOCNO')[0].firstChild.data
        cnt = 1
        for p_node in doc_node.getElementsByTagName('P'):
            p = p_node.firstChild.data
            words = w_re.findall(p)
            words_gen = (word.strip(punctuation).lower() for line in words
                         for word in line.split())
            for aword in words:
                if aword in namelist:
                    names[aword] = names.get(aword, 0) + 1
    print names

    # top_words = sorted(names.iteritems(), key=lambda (word, count): (-count, word))[:N]
    # for word, frequency in top_words:
    #     print "%s: %d" % (word, frequency)
    # print words + top_words
    # print docno + "\t" + str(numbers)

parseTREC1('LA010189.txt')

Answer


I've cleaned up your code a bit to make it easier to follow. Here are some comments and suggestions:

  • The key point for your question: you should store the count as names[docno] = names.get(docno, 0) + 1 (see the minimal sketch after this list).
  • Use defaultdict(int) instead of names.get(aword, 0) + 1 to accumulate the counts.
  • Use a set() for namelist.
  • Adding the re.MULTILINE option to your regular expression should remove the need for line.split().
  • You never use words_gen; is that an oversight?
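
To illustrate the first two points in isolation, here is a minimal sketch of per-docno counting with a set of names and a defaultdict(int); the docno and paragraph strings below are made up purely for the example:

from collections import defaultdict

names = set(['henry', 'joe'])        # names to look for
counts = defaultdict(int)            # docno -> total name occurrences

# made-up (docno, paragraph) pairs standing in for the parsed <P> nodes
paragraphs = [('doc 1', 'henry met joe'),
              ('doc 1', 'joe phoned henry and joe again')]

for docno, text in paragraphs:
    for word in text.lower().split():
        if word in names:
            counts[docno] += 1       # no need for counts.get(docno, 0) + 1

print dict(counts)                   # {'doc 1': 5}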

I used this document for testing, based on your code:

<DOC> 
    <DOCNO>1</DOCNO> 
    <P>groucho 
     harpo 
     zeppo</P> 
    <P>larry 
     moe 
     curly</P> 
</DOC> 
<DOC> 
    <DOCNO>2</DOCNO> 
    <P>zoe 
     inara 
     kaylie</P> 
    <P>mal 
     wash 
     jayne</P> 
</DOC> 

Here is a cleaned-up version of the code that counts the names in each paragraph:

import re 
from collections import defaultdict 
from string import punctuation 
from xml.dom.minidom import * 

RE_WORDS = re.compile('[a-z]+', re.IGNORECASE | re.M) 

def parse(path, names):
    data = '<DOCS>' + open(path, 'rb').read() + '</DOCS>'
    tree = parseString(data)
    hits = defaultdict(int)          # doc label -> accumulated name hits
    for doc in tree.getElementsByTagName('DOC'):
        doc_no = 'doc ' + doc.getElementsByTagName('DOCNO')[0].firstChild.data
        for node in doc.getElementsByTagName('P'):
            text = node.firstChild.data
            words = (w.strip(punctuation).lower()
                     for w in RE_WORDS.findall(text))
            # count the distinct target names appearing in this paragraph
            hits[doc_no] += len(names.intersection(words))
    for item in hits.iteritems():
        print item

names = set(['zoe', 'wash', 'groucho', 'moe', 'curly']) 
parse('doc.xml', names) 

Output:

(u'doc 2', 2) 
(u'doc 1', 3) 

Thank you - I got it working. – granimal 2011-05-01 04:28:33