NLTK Named Entity Recognition for a column in a dataset
Thanks to the code by "alvas" from here, Named Entity Recognition with Regular Expression: NLTK, and as an example:
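from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    # Tokenize, POS-tag, and chunk the text into a tree of named entities.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for i in chunked:
        if isinstance(i, Tree):
            # Adjacent entity subtrees are merged into a single name.
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            # A non-entity token closes the current run of entity subtrees.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset even when the name is a duplicate, so runs do not leak together.
            current_chunk = []
    # Flush a run that reaches the end of the text.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk

txt = 'The new GOP era in Washington got off to a messy start Tuesday as House Republicans,under pressure from President-elect Donald Trump.'
print(get_continuous_chunks(txt))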
The output is:
['GOP', 'Washington', 'House Republicans', 'Donald Trump']
I replaced this text with a row from my dataset: txt = df['content'][38]
and I get this result:
['Ina', 'Tori K.', 'Martin Cuilla', 'Phillip K', 'John J Lavorato']
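This dataset has many rows and one column named 'content'. My question is: how can I use this code to extract the names from this column for every row, and store those names in another column in the corresponding row?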
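One way to approach this, as a minimal sketch rather than a definitive answer, is to apply the extractor to every value of the column with pandas `Series.apply` and assign the result to a new column. The helper name `add_entity_column`, the target column name `names`, and the stand-in extractor `capitalized_words` are all hypothetical; the stand-in only exists so the sketch runs without the NLTK models, and in practice `get_continuous_chunks` would be passed instead.

```python
import pandas as pd

def add_entity_column(df, extractor, source="content", target="names"):
    """Return a copy of df with extractor applied to each row of `source`."""
    result = df.copy()
    # Treat missing values as empty text so the extractor always gets a string.
    texts = result[source].fillna("").astype(str)
    # Each cell of the new column holds the list returned for that row.
    result[target] = texts.apply(extractor)
    return result

# Hypothetical stand-in extractor, used here instead of get_continuous_chunks.
def capitalized_words(text):
    return [word for word in text.split() if word[:1].isupper()]

demo = pd.DataFrame({"content": ["under pressure from Donald Trump", None]})
demo = add_entity_column(demo, capitalized_words)
print(demo["names"].tolist())  # [['Donald', 'Trump'], []]
```

With the real function this would be called as `df = add_entity_column(df, get_continuous_chunks)`, which is equivalent to `df['names'] = df['content'].apply(get_continuous_chunks)` apart from the handling of missing values. Each row then holds a list of names; if a single string per row is preferred, the lists can be joined afterwards.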
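import os
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
from nltk.tree import Tree

# stanford_classifier and stanford_ner_path (model and jar paths) are not defined in this snippet.
st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
# df['content'] is the whole column, while word_tokenize expects a single string.
text = df['content']
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
print(classified_text)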
Thank you so much.
Can you tell me what I should do if I want to do the same thing with this code?
I added it to my question. Thank you.
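For the `StanfordNERTagger` code, a possible sketch: `st.tag` returns a list of `(token, label)` pairs, with `'O'` for tokens outside any entity, so consecutive tokens that share a wanted label can be grouped into one name before the result is stored per row. The helper `group_entities` and the sample pairs below are hypothetical, and the label names (such as `PERSON`) depend on the classifier model that is loaded.

```python
from itertools import groupby

def group_entities(tagged_tokens, wanted=("PERSON",)):
    """Join consecutive tokens that share a wanted label into one entity string."""
    entities = []
    # groupby yields runs of adjacent pairs that carry the same label.
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label in wanted:
            name = " ".join(token for token, _ in run)
            if name not in entities:  # keep first occurrence only
                entities.append(name)
    return entities

# Hand-written pairs in the shape st.tag returns (not real tagger output).
sample = [("John", "PERSON"), ("J", "PERSON"), ("Lavorato", "PERSON"),
          ("emailed", "O"), ("Ina", "PERSON"), (".", "O")]
print(group_entities(sample))  # ['John J Lavorato', 'Ina']
```

Applied to the column, this would look like `df['names'] = df['content'].apply(lambda t: group_entities(st.tag(word_tokenize(t))))`. Note that two different people mentioned back to back with no token between them would be merged into one name by this grouping.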