Lemmatize Italian sentences for frequency counting

I would like to lemmatize some Italian text in order to perform frequency counting over the lemmatized output and investigate it further. I prefer lemmatizing over stemming because I can extract the word's meaning from its context in the sentence (e.g. distinguishing a verb from a noun) and obtain words that actually exist in the language, rather than word roots that usually have no meaning on their own.
I found the pattern library (pip2 install pattern), which is supposed to complement nltk in order to perform lemmatization of Italian. However, I am not sure the approach below is correct, because each word is lemmatized by itself, not in the context of its sentence.
Probably I should give pattern the responsibility of tokenizing the sentence (so that it also annotates each word with metadata about whether it is a verb/noun/adjective etc.) and then retrieve the lemmatized words, but I have not been able to do this, and right now I am not even sure it is possible.
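What I have in mind is roughly the sketch below, which I could not get working; I am assuming that pattern's split() helper can rebuild sentence objects from the parsed string for Italian the same way it does for English:

from pattern.it import parse, split

# parse the whole sentence so that POS context can inform the lemmas
parsed = parse(u"Ieri sono andato in due supermercati.",
               tokenize=True, tag=True, chunk=False, lemmata=True)
for sentence in split(parsed):
    for word in sentence.words:
        # word.string is the token, word.type its POS tag, word.lemma its lemma
        print("{} {} {}".format(word.string, word.type, word.lemma))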
Also: in Italian some articles are rendered with an apostrophe, so for example "l'appartamento" (in English "the flat") is actually two words, "lo" and "appartamento". Right now I cannot find a way to split these two words with any combination of nltk and pattern, so I am not able to count word frequencies in the correct way.
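The closest workaround I could come up with is a plain regular expression, sketched below, but it only detaches the article together with its apostrophe; it does not expand "all'" back to "allo":

import re

# keep the apostrophe attached to the article: "all'ippodromo" -> "all'", "ippodromo"
tokens = re.findall(r"\w+'|\w+", u"Oggi volevo andare all'ippodromo", re.UNICODE)
print(tokens)  # [u'Oggi', u'volevo', u'andare', u"all'", u'ippodromo']

Here is my current code: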
import nltk
import string
import pattern.it  # the Italian submodule must be imported explicitly
# dictionary of Italian stop-words
it_stop_words = nltk.corpus.stopwords.words('italian')
# Snowball stemmer with rules for the Italian language
ita_stemmer = nltk.stem.snowball.ItalianStemmer()
# the following function is just to get the lemma
# out of the original input word (but right now
# it may be losing the context about the sentence
# from where the word is coming from i.e.
# the same word could either be a noun/verb/adjective
# according to the context)
def lemmatize_word(input_word):
    in_word = input_word  # .decode('utf-8') may be needed for non-ASCII input on Python 2
    # print('Something: {}'.format(in_word))
    word_it = pattern.it.parse(
        in_word,
        tokenize=False,
        tag=False,
        chunk=False,
        lemmata=True
    )
    # print("Input: {} Output: {}".format(in_word, word_it))
    # the parsed token is a list of fields; field 4 holds the lemma
    the_lemmatized_word = word_it.split()[0][0][4]
    # print("Returning: {}".format(the_lemmatized_word))
    return the_lemmatized_word
it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
# 1st tokenize the sentence(s)
word_tokenized_list = nltk.tokenize.word_tokenize(it_string)
print("1) NLTK tokenizer, num words: {} for list: {}".format(len(word_tokenized_list), word_tokenized_list))
# 2nd remove punctuation and everything lower case
word_tokenized_no_punct = [x.lower() for x in word_tokenized_list if x not in string.punctuation]
print("2) Clean punctuation, num words: {} for list: {}".format(len(word_tokenized_no_punct), word_tokenized_no_punct))
# 3rd remove stop words (for the Italian language)
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
print("3) Clean stop-words, num words: {} for list: {}".format(len(word_tokenized_no_punct_no_sw), word_tokenized_no_punct_no_sw))
# 4.1 lemmatize the words
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lemmatize_word(x) for x in word_tokenized_no_punct_no_sw]
print("4.1) lemmatizer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_lemmatized), word_tokenize_list_no_punct_lc_no_stowords_lemmatized))
# 4.2 snowball stemmer for Italian
word_tokenize_list_no_punct_lc_no_stowords_stem = [ita_stemmer.stem(i) for i in word_tokenized_no_punct_no_sw]
print("4.2) stemmer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_stem), word_tokenize_list_no_punct_lc_no_stowords_stem))
# difference between stemmer and lemmatizer
print(
    "For original word(s) '{}' and '{}' the stemmer: '{}' '{}' (count 1 each), the lemmatizer: '{}' '{}'"
    .format(
        word_tokenized_no_punct_no_sw[1],
        word_tokenized_no_punct_no_sw[6],
        word_tokenize_list_no_punct_lc_no_stowords_stem[1],
        word_tokenize_list_no_punct_lc_no_stowords_stem[6],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[6]
    )
)
This gives the following output:
1) NLTK tokenizer, num words: 20 for list: ['Ieri', 'sono', 'andato', 'in', 'due', 'supermercati', '.', 'Oggi', 'volevo', 'andare', "all'ippodromo", '.', 'Stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure', '.']
2) Clean punctuation, num words: 17 for list: ['ieri', 'sono', 'andato', 'in', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure']
3) Clean stop-words, num words: 12 for list: ['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'pizza', 'verdure']
4.1) lemmatizer, num words: 12 for list: [u'ieri', u'andarsene', u'due', u'supermercato', u'oggi', u'volere', u'andare', u"all'ippodromo", u'stasera', u'mangiare', u'pizza', u'verdura']
4.2) stemmer, num words: 12 for list: [u'ier', u'andat', u'due', u'supermerc', u'oggi', u'vol', u'andar', u"all'ippodrom", u'staser', u'mang', u'pizz', u'verdur']
For original word(s) 'andato' and 'andare' the stemmer: 'andat' 'andar' (count 1 each), the lemmatizer: 'andarsene' 'andare'
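For completeness, the frequency count I am ultimately after would then just be a Counter over one of those lists, e.g. (continuing from the variables above):

from collections import Counter

lemma_counts = Counter(word_tokenize_list_no_punct_lc_no_stowords_lemmatized)
print(lemma_counts.most_common(3))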
My questions:

- How can I effectively lemmatize some sentences with pattern, using its own tokenizer? (Assuming lemmas are recognized as nouns/verbs/adjectives etc.)
- Is there a Python alternative to pattern for the lemmatization of Italian in combination with nltk?
- How can I split articles that are bound to the next word by an apostrophe?
Thanks, this works. On Linux the path of the installation script can be specified directly in the Python code, e.g. treetaggerwrapper.TreeTagger(TAGLANG="it", TAGDIR='/abs-path/to/tree-tagger-linux-3.2.1/'), or via environment variables as explained at http://treetaggerwrapper.readthedocs.io/en/latest/#configuration. Also, to get the lemmas of the tags as named tuples: tags_str = tagger.tag_text(it_string), then tags = treetaggerwrapper.make_tags(tags_str), then lemmas = map(lambda x: x.lemma, tags). – TPPZ
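Putting the pieces from that comment together, a minimal end-to-end sketch (the TAGDIR value is a placeholder for wherever TreeTagger is unpacked):

import treetaggerwrapper

# TAGDIR points at the local TreeTagger installation (placeholder path)
tagger = treetaggerwrapper.TreeTagger(TAGLANG='it',
                                      TAGDIR='/abs-path/to/tree-tagger-linux-3.2.1/')
tags_str = tagger.tag_text("Ieri sono andato in due supermercati.")
tags = treetaggerwrapper.make_tags(tags_str)  # named tuples with .word, .pos, .lemma
lemmas = [t.lemma for t in tags]
print(lemmas)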