2017-01-04

Thanks to "alvas" for the answer at Named Entity Recognition with Regular Expression: NLTK, and for this example code, which I want to use on a column of my dataset:

from nltk import ne_chunk, pos_tag 
from nltk.tokenize import word_tokenize 
from nltk.tree import Tree 

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if isinstance(subtree, Tree):
            # collect the tokens of a named-entity subtree
            current_chunk.append(" ".join(token for token, pos in subtree.leaves()))
        elif current_chunk:
            # a non-entity token ends the current chunk
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []

    # flush a trailing entity at the end of the text
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

txt = 'The new GOP era in Washington got off to a messy start Tuesday as House Republicans, under pressure from President-elect Donald Trump.' 
print (get_continuous_chunks(txt)) 

The output is:

['GOP', 'Washington', 'House Republicans', 'Donald Trump']

When I replace this text with `txt = df['content'][38]` from my dataset, I get this result:

['Ina', 'Torus K.', 'Martin Cuilla', 'Phillip K', 'John J Lavorato']

This dataset has many rows and a column named 'content'. My question is: how can I use this code to extract the names from this column for each row, and store them in another column in the corresponding row?

import os 
from nltk.tag import StanfordNERTagger 
from nltk.tokenize import word_tokenize 
from nltk.tree import Tree 
# stanford_classifier / stanford_ner_path must point to the local model and jar files
st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8') 
text = df['content']  # note: this is the whole column (a Series), not a single string
tokenized_text = word_tokenize(text) 
classified_text = st.tag(tokenized_text) 
print (classified_text) 

Answer


Try `apply`:

df['ne'] = df['content'].apply(get_continuous_chunks) 

For the code in your second example, create a function and apply it the same way:

def my_st(text): 
    tokenized_text = word_tokenize(text) 
    return st.tag(tokenized_text) 

df['st'] = df['content'].apply(my_st) 
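To illustrate how `apply` maps an extractor over every row and stores the result in a new column: the sketch below uses a hypothetical stand-in extractor (a simple regex over runs of capitalized words) instead of `get_continuous_chunks`, so it runs without the NLTK models downloaded. The column names mirror the question; the extractor itself is an assumption, not the NLTK chunker.

```python
import re
import pandas as pd

def extract_caps(text):
    """Stand-in entity extractor: returns runs of capitalized words.
    (In the real pipeline, this role is played by get_continuous_chunks.)"""
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)

df = pd.DataFrame({"content": [
    "Donald Trump met House Republicans in Washington.",
    "Martin Cuilla emailed John Lavorato.",
]})

# apply() calls the extractor once per row; each row's list of
# entities lands in the new 'ne' column at the matching index
df["ne"] = df["content"].apply(extract_caps)
print(df["ne"].tolist())
# → [['Donald Trump', 'House Republicans', 'Washington'],
#    ['Martin Cuilla', 'John Lavorato']]
```

The same one-liner works unchanged with `get_continuous_chunks` or the `my_st` wrapper above, since `apply` only requires a function that takes one row's string and returns anything.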

Thank you so much. –


Could you tell me what I should do if I want to do the same thing with this code? –


I added it to my question. Thank you. –
