2017-05-21 112 views

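I plan to use the spaCy NLP engine, and I want to start with a dictionary. I have read this resource and this, but I could not work out how to begin. How do I create a dictionary for spaCy NLP?

I have this code: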

from spacy.en import English 
import _regex 
parser = English() 

# Test Data 
multiSentence = "There is an art, it says, or rather, a knack to flying. " \
       "The knack lies in learning how to throw yourself at the ground and miss. " \
       "In the beginning the Universe was created. This has made a lot of people " \
       "very angry and been widely regarded as a bad move."
parsedData = parser(multiSentence) 
for i, token in enumerate(parsedData): 
    print("original:", token.orth, token.orth_) 
    print("lowercased:", token.lower, token.lower_) 
    print("lemma:", token.lemma, token.lemma_) 
    print("shape:", token.shape, token.shape_) 
    print("prefix:", token.prefix, token.prefix_) 
    print("suffix:", token.suffix, token.suffix_) 
    print("log probability:", token.prob) 
    print("Brown cluster id:", token.cluster) 
    print("----------------------------------------") 
    if i > 1:
        break

# Let's look at the sentences 
sents = [] 
for span in parsedData.sents: 
    # go from the start to the end of each span, returning each token in the sentence 
    # combine each token using join() 
    sent = ''.join(parsedData[i].string for i in range(span.start, span.end)).strip() 
    sents.append(sent) 

print('To show sentence') 
for sentence in sents: 
    print(sentence) 


# Let's look at the part of speech tags of the first sentence 
for span in parsedData.sents: 
    sent = [parsedData[i] for i in range(span.start, span.end)] 
    break 

for token in sent: 
    print(token.orth_, token.pos_) 

# Let's look at the dependencies of this example: 
example = "The boy with the spotted dog quickly ran after the firetruck." 
parsedEx = parser(example) 
# shown as: original token, dependency tag, head word, left dependents, right dependents 
for token in parsedEx: 
    print(token.orth_, token.dep_, token.head.orth_, [t.orth_ for t in token.lefts], [t.orth_ for t in token.rights]) 

# Let's look at the named entities of this example: 
example = "Apple's stocks dropped dramatically after the death of Steve Jobs in October." 
parsedEx = parser(example) 
for token in parsedEx: 
    print(token.orth_, token.ent_type_ if token.ent_type_ != "" else "(not an entity)") 

print("-------------- entities only ---------------") 
# if you just want the entities and nothing else, you can access the parsed example's "ents" property like this:
ents = list(parsedEx.ents) 
for entity in ents: 
    print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity)) 

messyData = "lol that is rly funny :) This is gr8 i rate it 8/8!!!" 
parsedData = parser(messyData) 
for token in parsedData: 
    print(token.orth_, token.pos_, token.lemma_) 

Where can I change these token attributes (token.orth, token.orth_, and so on):

print("original:", token.orth, token.orth_)
print("lowercased:", token.lower, token.lower_)
print("lemma:", token.lemma, token.lemma_)
print("shape:", token.shape, token.shape_)
print("prefix:", token.prefix, token.prefix_)
print("suffix:", token.suffix, token.suffix_)
print("log probability:", token.prob)
print("Brown cluster id:", token.cluster)

Can I save these tokens in my own dictionary? Thanks for your help.

Could you explain further what you expect to get in the dictionary? – alvas

Answer

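It is not clear what data structure you need, but let's try to answer some of the questions.

Q: Where can I change these tokens (token.orth, token.orth_, ...)?

These tokens should not be changed, because they are annotations created by the spaCy English model. (See the definition of annotations.)

For details on what each individual annotation means, see the spaCy documentation for [orth, pos, tag, lemma and text].

Q: But can we change the annotations of these tokens?

Possibly, yes and no.

Looking at the code, we see that the spacy.tokens.doc.Doc class is a rather complex Cython object: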

cdef class Doc: 
    """ 
    A sequence of `Token` objects. Access sentences and named entities, 
    export annotations to numpy arrays, losslessly serialize to compressed 
    binary strings. 
    Aside: Internals 
     The `Doc` object holds an array of `TokenC` structs. 
     The Python-level `Token` and `Span` objects are views of this 
     array, i.e. they don't own the data themselves. 
    Code: Construction 1 
     doc = nlp.tokenizer(u'Some text') 
    Code: Construction 2 
     doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)]) 
    """ 

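But in general, it holds a sequence of spacy.tokens.token.Token objects that is inherently tightly tied to a spacy.Vocab object.

First, let's see whether some of these annotations are mutable. Let's start with the POS tag: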

>>> import spacy 
>>> nlp = spacy.load('en') 
>>> doc = nlp('This is a foo bar sentence.') 

>>> type(doc[0]) # First word. 
<class 'spacy.tokens.token.Token'> 

>>> dir(doc[0]) # Properties/functions available for the Token object. 
['__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_type', 'ent_type_', 'has_repvec', 'has_vector', 'head', 'i', 'idx', 'is_alpha', 'is_ancestor', 'is_ancestor_of', 'is_ascii', 'is_bracket', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_space', 'is_stop', 'is_title', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex_id', 'like_email', 'like_num', 'like_url', 'lower', 'lower_', 'n_lefts', 'n_rights', 'nbor', 'norm', 'norm_', 'orth', 'orth_', 'pos', 'pos_', 'prefix', 'prefix_', 'prob', 'rank', 'repvec', 'right_edge', 'rights', 'sentiment', 'shape', 'shape_', 'similarity', 'string', 'subtree', 'suffix', 'suffix_', 'tag', 'tag_', 'text', 'text_with_ws', 'vector', 'vector_norm', 'vocab', 'whitespace_'] 

# The POS tag assigned by spacy's model. 
>>> doc[0].tag_ 
'DT' 

# Let's try to override it. 
>>> doc[0].tag_ = 'NN' 

# It works!!! 
>>> doc[0].tag_ 
'NN' 

# What if we overwrite index of the tag_ rather than the form? 
>>> doc[0].tag 
474 
>>> doc[0].tag = 123 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "spacy/tokens/token.pyx", line 206, in spacy.tokens.token.Token.tag.__set__ (spacy/tokens/token.cpp:6755) 
    File "spacy/morphology.pyx", line 64, in spacy.morphology.Morphology.assign_tag (spacy/morphology.cpp:4540) 
KeyError: 123 
>>> doc[0].tag = 352 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "spacy/tokens/token.pyx", line 206, in spacy.tokens.token.Token.tag.__set__ (spacy/tokens/token.cpp:6755) 
    File "spacy/morphology.pyx", line 64, in spacy.morphology.Morphology.assign_tag (spacy/morphology.cpp:4540) 
KeyError: 352 

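So, to some extent, if you change the string form of the POS tag (.tag_), the change persists, but there is no principled way to pick the correct integer key directly, because those keys are generated automatically behind the Cython property.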
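
One way to picture why an arbitrary integer fails: the string form and the integer key behave like two views of one lookup table, and only IDs the table has actually handed out can be resolved. The following is a toy sketch of that idea, not spaCy's implementation; `ToyStringTable` is a hypothetical name, and the real KeyError in the session above is raised inside spaCy's morphology code rather than by a table like this one.

```python
class ToyStringTable:
    """Toy two-way string/ID table (hypothetical; not spaCy's StringStore)."""

    def __init__(self):
        self._ids = {}      # string -> integer ID
        self._strings = []  # integer ID -> string

    def __getitem__(self, key):
        if isinstance(key, str):
            # Unseen strings get the next free ID, so IDs depend on insertion order.
            if key not in self._ids:
                self._ids[key] = len(self._strings)
                self._strings.append(key)
            return self._ids[key]
        # Integer lookup: only IDs that were actually assigned resolve.
        if isinstance(key, int) and 0 <= key < len(self._strings):
            return self._strings[key]
        raise KeyError(key)


table = ToyStringTable()
dt_id = table["DT"]
nn_id = table["NN"]
print(dt_id, nn_id)      # the IDs assigned to the two tag strings
print(table[nn_id])      # resolving an assigned ID gives the string back

try:
    table[123]           # an ID nobody assigned
except KeyError as err:
    print("unknown ID:", err)
```

Under this picture, the safer direction is always string to ID: look the string up and use whatever integer comes back, rather than guessing an integer.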

Let's look at another annotation, .orth_:

>>> doc[0].orth_ 
'This' 
>>> doc[0].orth_ = 'that' 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
AttributeError: attribute 'orth_' of 'spacy.tokens.token.Token' objects is not writable 

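Now we see that some token annotations, such as .orth_, are protected from being overwritten. This is most likely because overwriting them would break how the token maps back to its original offsets in the input string.

Ans: It seems that some attributes of the Token object can be changed and some cannot.

Q: So which Token attributes can be changed and which cannot?

An easy way to check is to look for a __set__ function in the Cython properties at https://github.com/explosion/spaCy/blob/master/spacy/tokens/token.pyx#L32.

A property that defines __set__ is mutable, so these are most likely the Token attributes that can be overwritten or changed.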

E.g.

property lemma_:
    def __get__(self):
        return self.vocab.strings[self.c.lemma]
    def __set__(self, unicode lemma_):
        self.c.lemma = self.vocab.strings[lemma_]

property pos_:
    def __get__(self):
        return parts_of_speech.NAMES[self.c.pos]

property tag_:
    def __get__(self):
        return self.vocab.strings[self.c.tag]
    def __set__(self, tag):
        self.tag = self.vocab.strings[tag]

We can see that .tag_ and .lemma_ are mutable, but .pos_ is not:

>>> doc[0].lemma_ 
'this' 
>>> doc[0].lemma_ = 'that' 
>>> doc[0].lemma_ 
'that' 

>>> doc[0].tag_ 
'DT' 
>>> doc[0].tag_ = 'NN' 
>>> doc[0].tag_ 
'NN' 

>>> doc[0].pos_ 
'NOUN' 
>>> doc[0].pos_ = 'VERB' 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
AttributeError: attribute 'pos_' of 'spacy.tokens.token.Token' objects is not writable 
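
As an alternative to reading the source, writability can also be probed at runtime by assigning an attribute its own current value and watching for AttributeError. This is a rough sketch under assumptions: `is_writable` and `FakeToken` are hypothetical names, and `FakeToken` is a stand-in so the snippet runs without a spaCy model. On a real Token a setter may have side effects or raise other exceptions (such as the KeyError seen earlier), so treat the result as a hint only.

```python
def is_writable(obj, name):
    """Return True if assigning an attribute its own current value succeeds."""
    try:
        setattr(obj, name, getattr(obj, name))
    except AttributeError:
        # Read-only properties (no setter) raise AttributeError on assignment.
        return False
    return True


class FakeToken:
    """Hypothetical stand-in: lemma_ has a setter, pos_ does not."""

    def __init__(self):
        self._lemma = "this"

    @property
    def lemma_(self):
        return self._lemma

    @lemma_.setter
    def lemma_(self, value):
        self._lemma = value

    @property
    def pos_(self):
        return "DET"


token = FakeToken()
print({name: is_writable(token, name) for name in ("lemma_", "pos_")})
# prints {'lemma_': True, 'pos_': False}
```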

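Q: Can I save these tokens in my own dictionary?

I am not entirely sure what that means, but perhaps you mean something like pickle.

For some reason, the default pickle behaves oddly with Cython objects, so you may need another way to save the spacy.tokens.doc.Doc or spacy.tokens.token.Token objects created by spaCy, as this session shows: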

>>> import pickle 
>>> import spacy 

>>> nlp = spacy.load('en') 
>>> doc = nlp('This is a foo bar sentence.') 

>>> doc 
This is a foo bar sentence. 

# Pickle the Doc object. 
>>> pickle.dump(doc, open('spacy_processed_doc.pkl', 'wb')) 

# Now you see me. 
>>> doc 
This is a foo bar sentence. 
# Now you don't 
>>> doc = None 
>>> doc 

# Let's load the saved pickle. 
>>> doc = pickle.load(open('spacy_processed_doc.pkl', 'rb')) 
>>> doc 

>>> type(doc) 
<class 'spacy.tokens.doc.Doc'> 
>>> doc[0] 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "spacy/tokens/doc.pyx", line 185, in spacy.tokens.doc.Doc.__getitem__ (spacy/tokens/doc.cpp:5550) 
TypeError: 'NoneType' object is not subscriptable 
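
If the goal is to keep the annotation values rather than the Doc object itself, one possible workaround is to copy the string-valued attributes into a plain dict, which round-trips cleanly through json. This is a minimal sketch under assumptions: `token_to_dict` and `FIELDS` are hypothetical names, and a SimpleNamespace with made-up values stands in for a real Token so the snippet runs without a model.

```python
import json
from types import SimpleNamespace

# String-valued attributes taken from the question's print statements.
FIELDS = ("orth_", "lower_", "lemma_", "shape_", "prefix_", "suffix_")


def token_to_dict(token, fields=FIELDS):
    """Copy the named attributes of a token-like object into a plain dict."""
    return {name: getattr(token, name) for name in fields}


# Stand-in for a spaCy Token (illustrative values, not model output).
fake_token = SimpleNamespace(
    orth_="This", lower_="this", lemma_="this",
    shape_="Xxxx", prefix_="T", suffix_="his",
)

records = [token_to_dict(fake_token)]

# Plain dicts of strings serialize and reload without any Cython state.
saved = json.dumps(records)
restored = json.loads(saved)
print(restored[0])
```

With a real pipeline, the same function would be applied to each token in a doc, for example `[token_to_dict(t) for t in doc]`.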

Wow! Thank you for the very clear explanation. –