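How can I use Stanford CoreNLP from Python to output a file in which named entities are replaced by their tags?

I am using Stanford's NLP tools from Python. I have a function that takes some text files and converts them to XML files (generated by Stanford CoreNLP). Now I want to write another function that takes these XML files and outputs corresponding files containing the same text, but with the named entities replaced by their tags, the end of each sentence marked with the word "STOP", and punctuation removed. The file should also begin with the word "STOP". The function that produces the XML files is: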

import subprocess

def generate_xml(input, output):
    # Note: the file list and output directory are hard-coded in the command;
    # the input and output arguments are not used yet.
    p = subprocess.Popen('java -cp stanford-corenlp-2012-07-09.jar:stanford-corenlp-2012-07-06-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -filelist /Users/akritibahal/Downloads/stanford-corenlp-2012-07-09/myfile_list.txt -outputDirectory /Users/akritibahal/Downloads/stanford-corenlp-2012-07-09', shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, universal_newlines=True)
    # Echo CoreNLP's log output as it arrives
    for line in p.stdout:
        print(line, end='')

    # Wait for the Java process to finish and return its exit code
    retval = p.wait()
    return retval

The function that should produce the output file with named-entity tags is:

def process_file(input_xml,output_file):

Can anyone help me work out how to get such an output file with named-entity tags?

Source

2014-10-20 Linda Su

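I have been using minidom to parse CoreNLP's output. Here is some starter code you may want to use, but you may also want to check out https://github.com/dasmith/stanford-corenlp-python

Note that you need to use the tokenization produced by Stanford CoreNLP, because the data it returns is expressed in terms of sentence and token offsets.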

from xml.dom import minidom

# raw_xml_data holds the contents of one CoreNLP XML output file
xmldoc = minidom.parseString(raw_xml_data)
dependencies = []  # must exist before the += below
for sentence_xml in xmldoc.getElementsByTagName('sentences')[0].getElementsByTagName('sentence'):
    # parser is a tree-reading helper that is not shown in this snippet;
    # the <parse> element is only present when the parse annotator is enabled
    parse = parser.parse(sentence_xml.getElementsByTagName('parse')[0].firstChild.nodeValue)
    token_nodes = sentence_xml.getElementsByTagName('tokens')[0].getElementsByTagName('token')
    tokens = list(zip(token_nodes, parse.get_leaves()))
    # example for processing dependencies
    for element in sentence_xml.getElementsByTagName('dependencies'):
        if element.getAttribute('type') == "collapsed-ccprocessed-dependencies":
            dependencies += element.getElementsByTagName('dep')
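
For the `process_file` stub in the question, the same minidom approach could be adapted along the following lines. This is only a rough sketch, not a tested solution for every CoreNLP version: it assumes each `<token>` element carries a `<word>` and an `<NER>` child and that non-entities are tagged `O`, as in CoreNLP's XML output for the `ner` annotator. The helper names `_child_text` and `replace_entities` are made up for illustration, and collapsing a run of tokens with the same tag into a single tag is a design choice of this sketch rather than something the question specifies.

```python
from xml.dom import minidom

# Penn Treebank bracket escapes; treated as punctuation here (an assumption
# about how the tokenizer writes brackets).
PTB_BRACKETS = {"-LRB-", "-RRB-", "-LSB-", "-RSB-", "-LCB-", "-RCB-"}


def _child_text(token, tag):
    """Return the text of the first <tag> child of a token, or '' if absent."""
    nodes = token.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return ""
    return nodes[0].firstChild.nodeValue


def replace_entities(xml_text):
    """Turn CoreNLP XML into 'STOP w1 TAG w2 ... STOP' text."""
    doc = minidom.parseString(xml_text)
    out = ["STOP"]  # the output starts with STOP
    sentences = doc.getElementsByTagName("sentences")[0]
    for sentence in sentences.getElementsByTagName("sentence"):
        previous_tag = "O"
        for token in sentence.getElementsByTagName("token"):
            word = _child_text(token, "word")
            tag = _child_text(token, "NER") or "O"
            if tag != "O":
                # Emit the tag once per run of tokens sharing that tag
                if tag != previous_tag:
                    out.append(tag)
            elif word not in PTB_BRACKETS and any(c.isalnum() for c in word):
                # Keep ordinary words; drop tokens that are pure punctuation
                out.append(word)
            previous_tag = tag
        out.append("STOP")  # mark the end of the sentence
    return " ".join(out)


def process_file(input_xml, output_file):
    """Read one CoreNLP XML file and write the tagged text file."""
    with open(input_xml, "rb") as f:
        text = replace_entities(f.read())
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(text + "\n")
```

With this sketch, a sentence tokenized as Barack/PERSON Obama/PERSON visited/O Paris/LOCATION ./O comes out as `STOP PERSON visited LOCATION STOP`. Two adjacent but distinct entities of the same type would be merged into one tag; if that matters, the run-collapsing check can be dropped so that every entity token emits its tag.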

Source

2015-04-19 17:10:41
