Extracting relationships with NLTK

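This is a follow-up of my question. I am using nltk to parse out people, organizations, and their relationships. Using this example, I was able to create chunks of persons and organizations; however, I get an error at the nltk.sem.extract_rels command: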

AttributeError: 'Tree' object has no attribute 'text'

Here is the complete code:

import nltk 
import re 
#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066 
with open('billgatesbio.txt', 'r') as f: 
    sample = f.read() 

sentences = nltk.sent_tokenize(sample) 
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences] 
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences] 
chunked_sentences = nltk.batch_ne_chunk(tagged_sentences) 

# tried plain ne_chunk instead of batch_ne_chunk as given in the book 
#chunked_sentences = [nltk.ne_chunk(sentence) for sentence in tagged_sentences] 

# pattern to find <person> served as <title> in <org> 
IN = re.compile(r'.+\s+as\s+') 
for doc in chunked_sentences: 
    for rel in nltk.sem.extract_rels('ORG', 'PERSON', doc, corpus='ieer', pattern=IN):
        print nltk.sem.show_raw_rtuple(rel)

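This example is very similar to the one given in the book, but that example uses prepared 'parsed docs', which appear out of nowhere, and I don't know where to find their object type. I have also searched through the git repositories. Any help is appreciated.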

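My ultimate goal is to extract persons, organizations, and titles (with dates) for some companies, and then create network maps of the persons and organizations.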

Source

2011-10-21 karlos

Did you ever figure out a solution? I would like to see what you came up with, since I am running into exactly the same problem. – user3314418

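It looks like a "parsed doc" object needs to have a headline member and a text member, both of which are lists of tokens, where some of the tokens are marked up as trees. For example, this (hacky) example works: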

import nltk 
import re 

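IN = re.compile(r'.*\bin\b(?!\b.+ing)')

# Minimal stand-in for a parsed document: it only needs .headline and .text
class doc():
    pass

doc.headline = ['foo']
# Named entities are Trees; every other token is a plain string
doc.text = [nltk.Tree('ORGANIZATION', ['WHYY']), 'in', nltk.Tree('LOCATION', ['Philadelphia']), '.', 'Ms.', nltk.Tree('PERSON', ['Gross']), ',']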

for rel in nltk.sem.extract_rels('ORG','LOC',doc,corpus='ieer',pattern=IN): 
    print nltk.sem.relextract.show_raw_rtuple(rel)

When run, this produces the output:

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']

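Obviously you would not actually write code like this, but it provides a working example of the data format that extract_rels expects. You just need to work out the preprocessing steps that get your data into that format.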
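One way to do that preprocessing step, as a minimal sketch rather than a tested recipe: flatten the per-sentence chunk trees into a single token list and attach it to an object with `headline` and `text` attributes. `chunks_to_doc` is a hypothetical helper name, the sentence is hand-built to stand in for `nltk.ne_chunk` output, and the code assumes the NLTK 3 `Tree` class.

```python
from types import SimpleNamespace

from nltk.tree import Tree


def chunks_to_doc(chunked_sentences):
    """Hypothetical helper: flatten chunk trees into a doc-like object."""
    text = []
    for sentence in chunked_sentences:
        # Each child is either a (word, tag) tuple or a named-entity subtree.
        text.extend(sentence)
    # Assumption: an empty headline is acceptable; only .text carries tokens here.
    return SimpleNamespace(headline=[], text=text)


# Hand-built stand-in for nltk.ne_chunk output, so no model download is needed.
sentence = Tree('S', [
    Tree('PERSON', [('Gates', 'NNP')]),
    ('served', 'VBD'), ('as', 'IN'), ('chairman', 'NN'), ('of', 'IN'),
    Tree('ORGANIZATION', [('Microsoft', 'NNP')]),
    ('.', '.'),
])

doc = chunks_to_doc([sentence])
entities = [tok for tok in doc.text if isinstance(tok, Tree)]
print(len(doc.text), [ent.label() for ent in entities])
```

The resulting `doc` could then be passed to `extract_rels` with `corpus='ieer'` in place of the hand-written class above. Whether any relation comes back still depends on the pattern matching the tokens that sit between two entities.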

Source

2011-10-21 17:27:52 bdk

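Thanks, bdk. I am trying to convert the trees I get in chunked_sentences into the parsed doc format. Using your method there is no error, but it does not give me any results either. The regular expression pattern may not be matching. – karlos

Hmm, not sure why you are not getting results from the script above. I just tried pasting it into a file (to make sure I had not mis-pasted it) and ran it, and it gave the expected result here. – bdk

No, I mean your script works fine, but when I modify it for my purposes (using my text/trees) it returns no relations. I suspect it has something to do with my regular expression or my trees. Thanks for your help. – karlos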

Here is the source code of the nltk.sem.extract_rels function:

def extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10): 
    """
    Filter the output of ``semi_rel2reldict`` according to specified NE classes and a filler pattern.

    The parameters ``subjclass`` and ``objclass`` can be used to restrict the
    Named Entities to particular types (any of 'LOCATION', 'ORGANIZATION',
    'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE').

    :param subjclass: the class of the subject Named Entity.
    :type subjclass: str
    :param objclass: the class of the object Named Entity.
    :type objclass: str
    :param doc: input document
    :type doc: ieer document or a list of chunk trees
    :param corpus: name of the corpus to take as input; possible values are
        'ieer' and 'conll2002'
    :type corpus: str
    :param pattern: a regular expression for filtering the fillers of
        retrieved triples.
    :type pattern: SRE_Pattern
    :param window: filters out fillers which exceed this threshold
    :type window: int
    :return: see ``mk_reldicts``
    :rtype: list(defaultdict)
    """
....

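So, if you pass the corpus parameter as 'ieer', the nltk.sem.extract_rels function expects the doc parameter to be an IEERDocument object. You should pass corpus as 'ace', or just not pass it at all (the default is 'ace'). In that case it expects a list of chunk trees, which is what you want. I modified the code as follows: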

import nltk 
import re 
from nltk.sem import extract_rels,rtuple 

#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066 
with open('billgatesbio.txt', 'r') as f: 
    sample = f.read().decode('utf-8') 

sentences = nltk.sent_tokenize(sample) 
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences] 
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences] 

# Changed the regex here, and swapped the subject and object classes below
OF = re.compile(r'.*\bof\b.*') 

for i, sent in enumerate(tagged_sentences):
    sent = nltk.ne_chunk(sent)  # ne_chunk expects one tagged sentence
    # extract_rels expects one chunked sentence
    rels = extract_rels('PER', 'ORG', sent, corpus='ace', pattern=OF, window=7)
    for rel in rels:
        print('{0:<5}{1}'.format(i, rtuple(rel)))

It gives the result:

[PER: u'Chairman/NNP'] u'and/CC Chief/NNP Executive/NNP Officer/NNP of/IN the/DT' [ORG: u'Company/NNP']
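The output above suggests why a whitespace-based pattern can come back empty in this mode: the filler between the two entities is a string of `word/TAG` tokens. The sketch below uses only `re` and assumes that extract_rels applies the pattern with `match()` to that filler string. `as_filler` is a made-up example rather than real output, and `AS_TAGGED` is an assumed tag-tolerant variant, not something from the answer.

```python
import re

# Filler copied from the output above: tokens are joined as word/TAG.
filler = 'and/CC Chief/NNP Executive/NNP Officer/NNP of/IN the/DT'

OF = re.compile(r'.*\bof\b.*')      # pattern used in this answer
AS_WS = re.compile(r'.+\s+as\s+')   # pattern from the question

# Hypothetical tagged filler for a "served as" sentence (not real output).
as_filler = 'served/VBD as/IN chairman/NN of/IN'

# Assumed variant: word boundaries instead of whitespace around "as".
AS_TAGGED = re.compile(r'.*\bas\b')

print(bool(OF.match(filler)))            # True: "of/IN" satisfies \bof\b
print(bool(AS_WS.match(as_filler)))      # False: "as" is followed by "/IN", not whitespace
print(bool(AS_TAGGED.match(as_filler)))  # True
```

If that assumption holds, patterns written for plain text may need word boundaries (or an explicit tag such as `as/IN`) when the input is POS-tagged chunk trees.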

Source

2014-12-15 01:09:35 cuneytyvz

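I don't get anything when I copy and paste this sample code. Is it correct? When I run it, it does not give me your output. –

I ran it and got the same result. I think the regular expression is correct. I really don't know what the problem might be. – cuneytyvz

The only thing I changed was removing '.decode()', since I am on python3. Do you think that has anything to do with the problem? –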
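On the `.decode()` question: in Python 3, `read()` on a file opened in text mode already returns `str`, so the decoding step moves into `open()`. A minimal sketch, using a temporary file with made-up sample text in place of billgatesbio.txt:

```python
import os
import tempfile

# Write a small UTF-8 sample so the sketch runs without billgatesbio.txt.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write('Zoë served as chairman of Example Corp.')
    path = tmp.name

# Python 3: pass the encoding to open(); read() then returns str directly.
with open(path, 'r', encoding='utf-8') as f:
    sample = f.read()

os.remove(path)

# str has no .decode() in Python 3, so calling it would raise AttributeError.
print(type(sample).__name__, hasattr(sample, 'decode'))
```

Dropping `.decode()` is therefore required on Python 3 and should not, by itself, change which relations are found.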

This is an NLTK version issue. Your code should work in NLTK 2.x, but for NLTK 3 you should write it like this:

import re
import nltk

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.relextract.rtuple(rel))

NLTK Example for Relation Extraction Does not work

Source

2015-07-07 07:03:38
