2011-10-21 46 views
7

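Extracting relationships using NLTK

This is a follow-up of my question. I am using nltk to parse out persons, organizations, and their relationships. Using this example, I was able to create chunks of persons and organizations; however, I am getting an error in the nltk.sem.extract_rels command: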

AttributeError: 'Tree' object has no attribute 'text' 

Here is the complete code:

import nltk 
import re 
#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066 
with open('billgatesbio.txt', 'r') as f: 
    sample = f.read() 

sentences = nltk.sent_tokenize(sample) 
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences] 
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences] 
chunked_sentences = nltk.batch_ne_chunk(tagged_sentences) 

# tried plain ne_chunk instead of batch_ne_chunk as given in the book 
#chunked_sentences = [nltk.ne_chunk(sentence) for sentence in tagged_sentences] 

# pattern to find <person> served as <title> in <org> 
IN = re.compile(r'.+\s+as\s+') 
for doc in chunked_sentences: 
    for rel in nltk.sem.extract_rels('ORG', 'PERSON', doc,corpus='ieer', pattern=IN): 
        print(nltk.sem.show_raw_rtuple(rel))

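This example is very similar to the one given in the book, but that example uses prepared 'parsed docs', which appear out of nowhere, and I don't know where to find their object type. I scoured the git libraries as well. Any help is appreciated.

My ultimate goal is to extract persons, organizations, and titles (with dates) for some companies, and then create network maps of persons and organizations.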

+0

Did you ever figure out a solution? I'd like to see what you came up with, because I'm getting exactly the same problem. – user3314418

Answers

4

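It looks like, to be a "parsed doc", an object needs to have a headline member and a text member, both of which are lists of tokens, where some of the tokens are marked up as trees. For example, this (hacky) example works: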

import nltk 
import re 

IN = re.compile (r'.*\bin\b(?!\b.+ing)') 

class doc(): 
    pass 

doc.headline=['foo'] 
doc.text=[nltk.Tree('ORGANIZATION', ['WHYY']), 'in', nltk.Tree('LOCATION',['Philadelphia']), '.', 'Ms.', nltk.Tree('PERSON', ['Gross']), ','] 

for rel in nltk.sem.extract_rels('ORG','LOC',doc,corpus='ieer',pattern=IN): 
    print(nltk.sem.relextract.show_raw_rtuple(rel))  # NLTK 2.x name; NLTK 3 uses rtuple

When run, this provides the output:

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia'] 

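Obviously you wouldn't actually code it like this, but it provides a working example of the data format expected by extract_rels; you just need to determine how to do your preprocessing steps to massage your data into that format.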
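One way that preprocessing could look, as a rough sketch rather than a tested recipe: flatten ne_chunk-style trees into a single token list, drop the POS tags, and attach the list to an object that has text and headline attributes. The helper name to_ieer_like_doc is made up for illustration, the sample sentence is hand-built instead of coming from a real tagger, and the code assumes NLTK 3 (Tree.label(), rtuple).

```python
import re
from types import SimpleNamespace

from nltk import Tree
from nltk.sem import extract_rels, rtuple


def to_ieer_like_doc(chunked_sentences):
    """Flatten ne_chunk-style trees into the token list used with corpus='ieer'.

    Hypothetical helper, not part of NLTK.
    """
    text = []
    for sentence in chunked_sentences:
        for node in sentence:
            if isinstance(node, Tree):
                # Named entity: keep the label, keep only the words of its leaves.
                words = [word for word, _tag in node.leaves()]
                text.append(Tree(node.label(), words))
            else:
                # Ordinary (word, tag) pair: keep only the word.
                word, _tag = node
                text.append(word)
    # An empty headline is used here; only the attribute has to exist.
    return SimpleNamespace(headline=[], text=text)


# Hand-built stand-in for what nltk.ne_chunk(nltk.pos_tag(...)) returns.
chunked = [Tree('S', [
    Tree('ORGANIZATION', [('WHYY', 'NNP')]),
    ('in', 'IN'),
    Tree('LOCATION', [('Philadelphia', 'NNP')]),
    ('.', '.'),
    ('Ms.', 'NNP'),
    Tree('PERSON', [('Gross', 'NNP')]),
    (',', ','),
])]

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
doc = to_ieer_like_doc(chunked)
rels = extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN)
for rel in rels:
    print(rtuple(rel))
```

Real ne_chunk output may use labels such as GPE for places, which the 'ieer' class list does not appear to include, so a label remapping step may be needed before this works on your own text.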

+0

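Thanks, bdk. I am trying to convert the trees obtained in chunked_sentences to the parsed doc format. Using your method there are no errors, but it doesn't give me any results either. The regex pattern may not be matching. – karlos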

+0

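Hmm, not sure why you aren't getting results from the above script. I just tried pasting it into a file (to make sure I hadn't mis-pasted it) and ran it, and it gave the expected results here. – bdk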

+0

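No, I mean your script works fine, but when I modify it for my purposes (using my text/trees), it returns no relations. I suspect it has something to do with my regex or my trees. Thanks for your help. – karlos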
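One way to narrow down whether the regular expression is the culprit is to test it directly against the text you expect between two entities. The sketch below assumes that extract_rels applies pattern.match() to the words between the subject and the object entity, joined by single spaces; the sample filler strings and the relaxed AS_LOOSE pattern are invented for illustration.

```python
import re

# Pattern from the question: requires whitespace *after* "as".
AS_STRICT = re.compile(r'.+\s+as\s+')
# Hypothetical relaxed variant: only requires "as" as a whole word.
AS_LOOSE = re.compile(r'.*\bas\b')

# Invented filler strings, i.e. the words between two named entities.
fillers = ['served as', 'served as chairman of', 'as', 'was founded by']


def accepted(pattern, candidates):
    """Return the candidates that pattern.match() accepts."""
    return [text for text in candidates if pattern.match(text)]


strict_hits = accepted(AS_STRICT, fillers)
loose_hits = accepted(AS_LOOSE, fillers)
print(strict_hits)
print(loose_hits)
```

If the strict pattern rejects a filler that simply ends in "as", that alone could explain an empty result even when the trees are fine.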

4

Here is the source code of the nltk.sem.extract_rels function:

def extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10): 
    """
    Filter the output of ``semi_rel2reldict`` according to specified NE classes and a filler pattern.

    The parameters ``subjclass`` and ``objclass`` can be used to restrict the
    Named Entities to particular types (any of 'LOCATION', 'ORGANIZATION',
    'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE').

    :param subjclass: the class of the subject Named Entity.
    :type subjclass: str
    :param objclass: the class of the object Named Entity.
    :type objclass: str
    :param doc: input document
    :type doc: ieer document or a list of chunk trees
    :param corpus: name of the corpus to take as input; possible values are
        'ieer' and 'conll2002'
    :type corpus: str
    :param pattern: a regular expression for filtering the fillers of
        retrieved triples.
    :type pattern: SRE_Pattern
    :param window: filters out fillers which exceed this threshold
    :type window: int
    :return: see ``mk_reldicts``
    :rtype: list(defaultdict)
    """
    ....

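So if you pass the corpus parameter as ieer, the nltk.sem.extract_rels function expects the doc parameter to be an IEERDocument object. You should pass corpus as ace, or just not pass it at all (the default is ace). In that case it expects a list of chunk trees (which is what you want). I modified the code as follows: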

import nltk 
import re 
from nltk.sem import extract_rels,rtuple 

#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066 
with open('billgatesbio.txt', 'r') as f: 
    sample = f.read().decode('utf-8')  # Python 2; on Python 3, read() already returns str

sentences = nltk.sent_tokenize(sample) 
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences] 
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences] 

# here i changed reg ex and below i exchanged subj and obj classes' places 
OF = re.compile(r'.*\bof\b.*') 

for i, sent in enumerate(tagged_sentences): 
    sent = nltk.ne_chunk(sent) # ne_chunk method expects one tagged sentence 
    rels = extract_rels('PER', 'ORG', sent, corpus='ace', pattern=OF, window=7) # extract_rels method expects one chunked sentence 
    for rel in rels: 
        print('{0:<5}{1}'.format(i, rtuple(rel)))

It gives the result:

[PER: u'Chairman/NNP'] u'and/CC Chief/NNP Executive/NNP Officer/NNP of/IN the/DT' [ORG: u'Company/NNP'] 

+1

I don't get anything when I copy and paste this sample code. Is that right?... When I run it, it doesn't give me your output. –

+1

I ran it and got the same result. I think the regex is correct. I really don't know what the problem might be. – cuneytyvz

+1

My only thought is that I removed '.decode()' because I'm on python3. Do you think that has something to do with the problem?... –
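The result printed above hints at one possible cause of empty output: with corpus='ace' the text between the two entities still carries POS tags (of/IN, the/DT), so a pattern has to tolerate them. The following sketch checks such a filler string by hand. The whitespace-delimited pattern OF_SPACED and the helpers within_window and untag are made up for illustration, and the idea that window is compared against the number of whitespace-separated tokens is an assumption, not something stated in this thread.

```python
import re

# Filler copied from the result above (the text between the PER and ORG entities).
filler = 'and/CC Chief/NNP Executive/NNP Officer/NNP of/IN the/DT'

OF = re.compile(r'.*\bof\b.*')         # pattern used in the answer
OF_SPACED = re.compile(r'.*\sof\s.*')  # hypothetical variant that ignores tags


def within_window(text, window):
    """Assumed window check: count whitespace-separated tokens."""
    return len(text.split()) <= window


def untag(text):
    """Strip the '/TAG' suffix from each token."""
    return ' '.join(token.rsplit('/', 1)[0] for token in text.split())


print(bool(OF.match(filler)))
print(bool(OF_SPACED.match(filler)))
print(bool(OF_SPACED.match(untag(filler))))
print(within_window(filler, 7), within_window(filler, 5))
```

A word-boundary pattern such as OF copes with the tags, whereas a pattern that expects whitespace right after the keyword only matches once the tags are stripped.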

0

This is an NLTK version issue. Your code should work in NLTK 2.x, but for NLTK 3 you should write it like this:

import re
import nltk

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.relextract.rtuple(rel))

NLTK Example for Relation Extraction Does not work