NLTK POS tagger appears to be merging '.' into words

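I'm new to Python, NLP and NLTK, so please bear with me. I have a set of articles (~200) that have been categorized by hand. I'm hoping to develop a heuristic that assists with, or recommends, categories. As a first step, I want to establish relationships between the current categories and the words in the documents.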

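My premise is that nouns matter more than other parts of speech. For example, the category "Energy" is probably driven almost entirely by nouns: oil, battery, wind, and so on.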
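As a rough illustration of that premise, here is a minimal sketch that builds a per-category noun frequency profile. It assumes the articles have already been tagged into `(word, tag)` pairs with Penn Treebank tags, which is the shape `nltk.pos_tag` returns. The function names `noun_counts` and `category_noun_profile` and the sample data are hypothetical and exist only for this example.

```python
from collections import Counter


def noun_counts(tagged_tokens):
    """Count lower-cased words whose tag starts with 'NN' (NN, NNS, NNP, NNPS)."""
    return Counter(
        word.lower()
        for word, tag in tagged_tokens
        if tag.startswith("NN")
    )


def category_noun_profile(tagged_articles_by_category):
    """Map each category to a Counter of the nouns seen across its articles."""
    profile = {}
    for category, tagged_articles in tagged_articles_by_category.items():
        counts = Counter()
        for tagged_tokens in tagged_articles:
            counts.update(noun_counts(tagged_tokens))
        profile[category] = counts
    return profile


# Hypothetical hand-tagged sample; real input would come from a POS tagger.
sample = {
    "energy": [
        [("Oil", "NN"), ("prices", "NNS"), ("fell", "VBD")],
        [("Wind", "NN"), ("and", "CC"), ("oil", "NN"), ("grew", "VBD")],
    ],
}
profile = category_noun_profile(sample)
print(profile["energy"].most_common(1))  # [('oil', 2)]
```

A new article could then be compared against each category's profile, for example by counting how many of its nouns overlap with each one. How to weight that overlap is left open here.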

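The first thing I want to do is tag the parts of speech and evaluate them. With the very first article I ran into a problem: some tokens come out with punctuation attached.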

import nltk

for articles in articles[1]:
    articles_id, content = articles
    # Strip the HTML markup and replace the right-single-quote entity
    clean = nltk.clean_html(content).replace('&rsquo;', "'")
    tokens = nltk.word_tokenize(clean)
    pos_document = nltk.pos_tag(tokens)
    # Group the words by their part-of-speech tag
    pos = {}
    for word, part in pos_document:
        pos.setdefault(part, []).append(word)

Formatted output:

{ 
'VBG': ['continuing', 'paying', 'falling', 'starting'], 
'VBD': ['made', 'ended'], 'VBN': ['been', 'leaned', 'been', 'been'], 
'VBP': ['know', 'hasn', 'have', 'continue', 'expect', 'take', 'see', 'have', 'are'], 
'WDT': ['which', 'which'], 'JJ': ['negative', 'positive', 'top', 'modest', 'negative', 'real', 'financial', 'isn', 'important', 'long', 'short', 'next'], 
'VBZ': ['is', 'has', 'is', 'leads', 'is', 'is'], 'DT': ['Another', 'the', 'the', 'any', 'any', 'the', 'the', 'a', 'the', 'the', 'the', 'the', 'a', 'the', 'a', 'a', 'the', 'a', 'the', 'any'], 
'RP': ['back'], 
'NN': [ 'listless', 'day', 'rsquo', 'll', 'progress', 'rsquo', 't', 'news', 'season', 'corner', 'surprise', 'stock', 'line', 'growth', 'question', 
     'stop', 'engineering', 'growth', 'isn', 'rsquo', 't', 'rsquo', 't', 'stock', 'market', 'look', 'junk', 'bond', 'market', 'turning', 'junk', 
     'rock', 'history', 'guide', 't', 'day', '%', '%', '%', 'level', 'move', 'isn', 'rsquo', 't', 'indication', 'way'], 
',': [',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ','], '.': ['.'], 
'TO': ['to', 'to', 'to', 'to', 'to', 'to', 'to'], 
'PRP': ['them', 'they', 'they', 'we', 'you', 'they', 'it'], 
'RB': ['then', 'there', 'just', 'just', 'always', 'so', 'so', 'only', 'there', 'right', 'there', 'much', 'typically', 'far', 'certainly'], 
':': [';', ';', ';', ';', ';', ';', ';'], 
'NNS': ['folks', 'companies', 'estimates', 'covers', 's', 'equities', 'bonds', 'equities', 'flats'], 
'NNP': ['drift.', 'We', 'Monday', 'DC', 'note.', 'Earnings', 'EPS', 'same.', 'The', 'Street', 'now.', 'Since', 'points.', 'What', 'behind.', 'We', 'flat.', 'The'], 
'VB': ['get', 'manufacture', 'buy', 'boost', 'look', 'see', 'say', 'let', 'rsquo', 'rsquo', 'be', 'build', 'accelerate', 'be'], 
'WRB': ['when', 'where'], 
'CC': ['&', 'and', '&', 'and', 'and', 'or', 'and', '&', '&', '&', 'and', '&', 'and', 'but', '&'], 
'CD': ['47', '23', '30'], 
'EX': ['there'], 
'IN': ['on', 'if', 'until', 'of', 'around', 'as', 'on', 'down', 'since', 'of', 'for', 'under', 'that', 'about', 'at', 'at', 'that', 'like', 'if'], 
'MD': ['can', 'will', 'can', 'can', 'will'], 
'JJR': ['more'] 
}

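Notice the word 'drift.' under NNP. Shouldn't that period have been stripped off? Do I need to remove these myself, or am I missing something in the library?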

2013-12-18 akaphenom

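I'm not sure whether this will solve your problem, but instead of calling `word_tokenize` on the cleaned text (assuming your articles are longer than one sentence), you should have a line `sents = nltk.sent_tokenize(clean)` and then run `word_tokenize` on each item in `sents`. – aelfric5578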

That did it. If you post it as an answer I'll accept it; otherwise I'll post an answer myself in a few days. – akaphenom

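NLTK's word tokenizer assumes that its input has already been split into sentences. So, to get it to work, you need to call sent_tokenize on your input first. sent_tokenize returns a list of sentence strings, whereas word_tokenize expects a single string, so you will generally want to iterate over your sentences and tokenize each one in turn.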

import nltk

for articles in articles[1]:
    articles_id, content = articles
    clean = nltk.clean_html(content).replace('&rsquo;', "'")
    # Split into sentences first, then tokenize and tag each sentence
    sents = nltk.sent_tokenize(clean)
    pos = {}
    for sent in sents:
        tokens = nltk.word_tokenize(sent)
        pos_document = nltk.pos_tag(tokens)
        # Group the words by their part-of-speech tag
        for word, part in pos_document:
            pos.setdefault(part, []).append(word)

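I believe the reason this is necessary is to help distinguish sentence-ending periods from periods used in abbreviations (i.e. you don't want "Mr. Smith" to be split into 'Mr', '.', 'Smith').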
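To make the effect concrete, here is a minimal sketch of the "sentences first, then words" pipeline with the three steps passed in as callables. The helper name `group_words_by_tag` is hypothetical. The `toy_*` functions are deliberately simplistic stand-ins, not NLTK's implementation. They only imitate the symptom described in the question (a period is detached only at the very end of the string handed to the word tokenizer), so the example runs without NLTK or its data files.

```python
import re
from collections import defaultdict


def group_words_by_tag(text, split_sentences, split_words, tag_tokens):
    """Tokenize sentence by sentence, then collect words under their POS tag."""
    words_by_tag = defaultdict(list)
    for sentence in split_sentences(text):
        for word, tag in tag_tokens(split_words(sentence)):
            words_by_tag[tag].append(word)
    return dict(words_by_tag)


# --- Toy stand-ins (assumptions for illustration, NOT NLTK) ---

def toy_sentences(text):
    # Naive split after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def toy_words(sentence):
    # Detach a period only when it is the last character of the input,
    # then split on whitespace.
    return re.sub(r"\.$", " .", sentence).split()


def toy_tagger(tokens):
    # Tag '.' as punctuation and every other token as 'X'.
    return [(tok, "." if tok == "." else "X") for tok in tokens]


text = "Stocks drift. We wait."

# Whole text treated as one "sentence": the inner period stays attached.
whole = group_words_by_tag(text, lambda t: [t], toy_words, toy_tagger)
# Sentence splitting first: each period is detached.
per_sentence = group_words_by_tag(text, toy_sentences, toy_words, toy_tagger)

print(whole["X"])         # ['Stocks', 'drift.', 'We', 'wait']
print(per_sentence["X"])  # ['Stocks', 'drift', 'We', 'wait']
```

With NLTK installed, the same helper could presumably be called as `group_words_by_tag(clean, nltk.sent_tokenize, nltk.word_tokenize, nltk.pos_tag)`. The exact tokens and tags will depend on the NLTK version and models in use.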

2013-12-18 21:47:34 aelfric5578
