Lemmatize Italian sentences for frequency counting

I would like to lemmatize some Italian text in order to perform some frequency counting of the words and further investigate the output of this lemmatized content.

I prefer lemmatizing over stemming because I can extract the word meaning from the context of the sentence (e.g. distinguish between a verb and a noun) and obtain words that actually exist in the language, rather than roots that usually have no meaning on their own.

I found a library called pattern (pip2 install pattern) that should complement nltk in order to perform lemmatization for Italian, but I am not sure the approach below is correct, because each word is lemmatized by itself, not in the context of a sentence.

Probably I should give pattern the responsibility of tokenizing the sentence (and thus also annotating each word with metadata about whether it is a verb/noun/adjective etc.), and then retrieve the lemmatized words, but I have not been able to do this and I am not even sure it is possible at the moment?

Also: in Italian, some articles are rendered with an apostrophe, so for example "l'appartamento" (in English "the flat") is actually 2 words: "lo" and "appartamento". Right now I cannot find a way to split these 2 words with a combination of nltk and pattern, so I cannot count word frequencies in the correct way.

import nltk 
import string 
import pattern.it  # pattern's Italian module, provides pattern.it.parse 

# dictionary of Italian stop-words 
it_stop_words = nltk.corpus.stopwords.words('italian') 
# Snowball stemmer with rules for the Italian language 
ita_stemmer = nltk.stem.snowball.ItalianStemmer() 

# the following function is just to get the lemma 
# out of the original input word (but right now 
# it may be losing the context about the sentence 
# from where the word is coming from i.e. 
# the same word could either be a noun/verb/adjective 
# according to the context) 
def lemmatize_word(input_word): 
    in_word = input_word#.decode('utf-8') 
    # print('Something: {}'.format(in_word)) 
    word_it = pattern.it.parse(
     in_word, 
     tokenize=False, 
     tag=False, 
     chunk=False, 
     lemmata=True 
    ) 
    # print("Input: {} Output: {}".format(in_word, word_it)) 
    the_lemmatized_word = word_it.split()[0][0][4] 
    # print("Returning: {}".format(the_lemmatized_word)) 
    return the_lemmatized_word 

it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure." 

# 1st tokenize the sentence(s) 
word_tokenized_list = nltk.tokenize.word_tokenize(it_string) 
print("1) NLTK tokenizer, num words: {} for list: {}".format(len(word_tokenized_list), word_tokenized_list)) 

# 2nd remove punctuation and everything lower case 
word_tokenized_no_punct = [x.lower() for x in word_tokenized_list if x not in string.punctuation] 
print("2) Clean punctuation, num words: {} for list: {}".format(len(word_tokenized_no_punct), word_tokenized_no_punct)) 

# 3rd remove stop words (for the Italian language) 
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words] 
print("3) Clean stop-words, num words: {} for list: {}".format(len(word_tokenized_no_punct_no_sw), word_tokenized_no_punct_no_sw)) 

# 4.1 lemmatize the words 
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lemmatize_word(x) for x in word_tokenized_no_punct_no_sw] 
print("4.1) lemmatizer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_lemmatized), word_tokenize_list_no_punct_lc_no_stowords_lemmatized)) 

# 4.2 snowball stemmer for Italian 
word_tokenize_list_no_punct_lc_no_stowords_stem = [ita_stemmer.stem(i) for i in word_tokenized_no_punct_no_sw] 
print("4.2) stemmer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_stem), word_tokenize_list_no_punct_lc_no_stowords_stem)) 

# difference between stemmer and lemmatizer 
print(
    "For original word(s) '{}' and '{}' the stemmer: '{}' '{}' (count 1 each), the lemmatizer: '{}' '{}' (count 2)" 
    .format(
     word_tokenized_no_punct_no_sw[1], 
     word_tokenized_no_punct_no_sw[6], 
     word_tokenize_list_no_punct_lc_no_stowords_stem[1], 
     word_tokenize_list_no_punct_lc_no_stowords_stem[6], 
     word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1], 
     word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1] 
    ) 
) 

Which gives this output:

1) NLTK tokenizer, num words: 20 for list: ['Ieri', 'sono', 'andato', 'in', 'due', 'supermercati', '.', 'Oggi', 'volevo', 'andare', "all'ippodromo", '.', 'Stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure', '.'] 
2) Clean punctuation, num words: 17 for list: ['ieri', 'sono', 'andato', 'in', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure'] 
3) Clean stop-words, num words: 12 for list: ['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'pizza', 'verdure'] 
4.1) lemmatizer, num words: 12 for list: [u'ieri', u'andarsene', u'due', u'supermercato', u'oggi', u'volere', u'andare', u"all'ippodromo", u'stasera', u'mangiare', u'pizza', u'verdura'] 
4.2) stemmer, num words: 12 for list: [u'ier', u'andat', u'due', u'supermerc', u'oggi', u'vol', u'andar', u"all'ippodrom", u'staser', u'mang', u'pizz', u'verdur'] 
For original word(s) 'andato' and 'andare' the stemmer: 'andat' 'andar' (count 1 each), the lemmatizer: 'andarsene' 'andarsene' (count 2) 
  • How can I effectively lemmatize some sentences with pattern using its tokenizer? (assuming the lemmas are recognized as nouns/verbs/adjectives etc.)
  • Is there a Python alternative to pattern that can be used together with nltk for lemmatization of Italian?
  • How do I split articles that are bound to the next word with an apostrophe?

Answer


I will try to answer your question, bearing in mind that I do not know much about Italian:

1) As far as I know, the main responsibility for handling the apostrophe lies with the tokenizer, and as such the nltk Italian tokenizer seems to have failed here.

3) A simple thing you can do is to call the string split/replace methods (although you will probably have to use the re package for more complicated patterns), for example:

word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw] 
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x] 

Which yields:

['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', 'all', 'ippodromo', 'stasera', 'mangio', 'pizza', 'verdure'] 
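
For patterns more involved than a plain split on the apostrophe, a minimal sketch with the re module could look like the following; the tiny article mapping is only an illustrative assumption based on the "l'appartamento" -> "lo" + "appartamento" example from the question, not a complete treatment of Italian elision:

import re 

# illustrative mapping of an elided article back to its full form, 
# based on the "l'appartamento" -> "lo" + "appartamento" example above 
ELIDED_ARTICLES = {"l": "lo"} 

def split_apostrophe(token): 
    # split tokens such as "all'ippodromo" into the article part and the word part 
    m = re.match(r"^(\w+)'(\w+)$", token, re.UNICODE) 
    if not m: 
        return [token] 
    article, rest = m.groups() 
    return [ELIDED_ARTICLES.get(article.lower(), article), rest] 

tokens = ["ieri", "andato", "all'ippodromo", "l'appartamento", "pizza"] 
print([piece for tok in tokens for piece in split_apostrophe(tok)]) 
# ['ieri', 'andato', 'all', 'ippodromo', 'lo', 'appartamento', 'pizza'] 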

2) An alternative to pattern is treetagger, granted it is not the easiest of all to install (you need the Python package and the tool itself); after that part, however, it works on both Windows and Linux.

A simple example with your sentence from above:

import treetaggerwrapper 
from pprint import pprint 

it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure." 
tagger = treetaggerwrapper.TreeTagger(TAGLANG="it") 
tags = tagger.tag_text(it_string) 
pprint(treetaggerwrapper.make_tags(tags)) 

The pprint yields:

[Tag(word=u'Ieri', pos=u'ADV', lemma=u'ieri'), 
Tag(word=u'sono', pos=u'VER:pres', lemma=u'essere'), 
Tag(word=u'andato', pos=u'VER:pper', lemma=u'andare'), 
Tag(word=u'in', pos=u'PRE', lemma=u'in'), 
Tag(word=u'due', pos=u'ADJ', lemma=u'due'), 
Tag(word=u'supermercati', pos=u'NOM', lemma=u'supermercato'), 
Tag(word=u'.', pos=u'SENT', lemma=u'.'), 
Tag(word=u'Oggi', pos=u'ADV', lemma=u'oggi'), 
Tag(word=u'volevo', pos=u'VER:impf', lemma=u'volere'), 
Tag(word=u'andare', pos=u'VER:infi', lemma=u'andare'), 
Tag(word=u"all'", pos=u'PRE:det', lemma=u'al'), 
Tag(word=u'ippodromo', pos=u'NOM', lemma=u'ippodromo'), 
Tag(word=u'.', pos=u'SENT', lemma=u'.'), 
Tag(word=u'Stasera', pos=u'ADV', lemma=u'stasera'), 
Tag(word=u'mangio', pos=u'VER:pres', lemma=u'mangiare'), 
Tag(word=u'la', pos=u'DET:def', lemma=u'il'), 
Tag(word=u'pizza', pos=u'NOM', lemma=u'pizza'), 
Tag(word=u'con', pos=u'PRE', lemma=u'con'), 
Tag(word=u'le', pos=u'DET:def', lemma=u'il'), 
Tag(word=u'verdure', pos=u'NOM', lemma=u'verdura'), 
Tag(word=u'.', pos=u'SENT', lemma=u'.')] 

It also handles the tokenization of all'ippodromo to al and ippodromo quite nicely under the hood (which is hopefully correct) before lemmatizing. Now we only need to apply the removal of stop words and punctuation and it will be fine.
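
To tie this back to the frequency counting the question started from, a minimal sketch building on the TreeTagger output above (same it_string and Italian stop-word list as in the question, Counter from the standard library) might be:

import string 
from collections import Counter 

import nltk 
import treetaggerwrapper 

it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure." 
it_stop_words = set(nltk.corpus.stopwords.words('italian')) 

tagger = treetaggerwrapper.TreeTagger(TAGLANG="it") 
tags = treetaggerwrapper.make_tags(tagger.tag_text(it_string)) 

# drop punctuation and Italian stop-words, then count the remaining lemmas 
lemma_counts = Counter( 
    t.lemma for t in tags 
    if t.word not in string.punctuation and t.word.lower() not in it_stop_words 
) 
print(lemma_counts.most_common()) 
# e.g. [(u'andare', 2), (u'ieri', 1), (u'supermercato', 1), ...] depending on the stop-word list 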

The doc for installing the TreeTaggerWrapper python library

+1 Thanks, this works. On Linux the path to the install script can be specified directly in the python code, e.g. 'treetaggerwrapper.TreeTagger(TAGLANG="it", TAGDIR='/abs-path/to/tree-tagger-linux-3.2.1/')', or via an environment variable as explained here: http://treetaggerwrapper.readthedocs.io/en/latest/#configuration . Also, to get the lemmas out of the named tuple tags: 'tags_str = tagger.tag_text(it_string)' then 'tags = treetaggerwrapper.make_tags(tags_str)' then 'lemmas = map((lambda x: x.lemma), tags)' – TPPZ
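
Putting the hints from this comment together, a minimal sketch could look like the following; the TAGDIR value is just a placeholder for wherever the TreeTagger tool was unpacked:

import treetaggerwrapper 

# point the wrapper at a local TreeTagger install instead of using environment variables; 
# the path below is a placeholder, adjust it to your own installation directory 
tagger = treetaggerwrapper.TreeTagger( 
    TAGLANG="it", 
    TAGDIR="/abs-path/to/tree-tagger-linux-3.2.1/", 
) 

it_string = "Stasera mangio la pizza con le verdure." 
tags_str = tagger.tag_text(it_string) 
tags = treetaggerwrapper.make_tags(tags_str) 
lemmas = [t.lemma for t in tags] 
print(lemmas)  # e.g. [u'stasera', u'mangiare', u'il', u'pizza', u'con', u'il', u'verdura', u'.'] 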