2016-09-21

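I am trying to run pycorenlp to tokenize text that contains non-ASCII characters. Sometimes `nlp.annotate()` returns a dictionary, and sometimes it returns a string. Is there a way to make pycorenlp's `nlp.annotate()` always return the same type of result?

For example: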

'''
From https://github.com/smilli/py-corenlp/blob/master/example.py
'''
from pycorenlp import StanfordCoreNLP
import pprint
import re

if __name__ == '__main__':
    nlp = StanfordCoreNLP('http://localhost:9000')

    # Case 1: plain text, annotate() returns a dict
    text = u"tab with good effect, denies pain".encode('utf-8')
    print('type(text): {0}'.format(type(text)))

    output = nlp.annotate(text, properties={
        'annotators': 'tokenize,ssplit',
        'outputFormat': 'json'
    })
    #pp = pprint.PrettyPrinter(indent=4)
    #pp.pprint(output)
    print('type(output): {0}'.format(type(output)))

    # Case 2: text containing U+0013, annotate() returns a string
    text = u"tab with good effect\u0013\u0013, denies pain".encode('utf-8')
    print('\ntype(text): {0}'.format(type(text)))
    output = nlp.annotate(text, properties={
        'annotators': 'tokenize,ssplit',
        'outputFormat': 'json'
    })
    print('type(output): {0}'.format(type(output)))

Output:

type(text): <type 'str'> 
type(output): <type 'dict'> 

type(text): <type 'str'> 
type(output): <type 'unicode'> 

I noticed that whenever type(output) is <type 'unicode'>, the Stanford CoreNLP server prints this warning:

WARNING: Untokenizable: ‼ (U+13, decimal: 19) 

Is there a way to make nlp.annotate() always return the same type of result?


The Stanford CoreNLP server was launched with:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer 9000 

I am using Stanford CoreNLP 3.6.0, pycorenlp 0.3.0, and Python 3.5 x64 on Windows 7 SP1 x64 Ultimate.
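One possible workaround is sketched below. It rests on an assumption not confirmed here: that the string returned in the second case is the server's JSON response, left unparsed because it contains a raw control character (such as U+0013) that Python's strict JSON parsing rejects. The helper name `normalize_annotation` is hypothetical and not part of pycorenlp; it only uses the standard `json` module.

```python
import json


def normalize_annotation(output):
    """Return a dict whether annotate() gave back a dict or raw JSON text.

    Hypothetical helper, not part of pycorenlp. Assumes that a non-dict
    result is the server's JSON response as text (or UTF-8 bytes).
    """
    if isinstance(output, dict):
        # Already parsed, nothing to do.
        return output
    if isinstance(output, bytes):
        output = output.decode('utf-8')
    # strict=False allows raw control characters (code points 0-31)
    # inside JSON strings, which the default strict mode rejects.
    return json.loads(output, strict=False)


if __name__ == '__main__':
    # Simulated server response containing a raw U+0013 inside a string.
    raw = u'{"sentences": [{"tokens": [{"word": "effect\u0013"}]}]}'
    parsed = normalize_annotation(raw)
    print('type(parsed): {0}'.format(type(parsed)))
```

With this, `normalize_annotation(nlp.annotate(text, properties={...}))` would yield a dict in both cases, provided the assumption above holds. If the response is not valid JSON for some other reason, `json.loads` still raises `ValueError`, which seems preferable to silently returning a string.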
