Here is a small example that uses ngrams from nltk. Hope it helps:
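import pandas as pd
from nltk.util import ngrams
# word_tokenize needs the NLTK Punkt tokenizer data, downloaded once via nltk.download
from nltk import word_tokenize

# Create a small test dataframe
df = pd.DataFrame({'text': ['my first sentence',
                            'this is the second sentence',
                            'third sent of the dataframe']})
print(df)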
Input dataframe:
text
0 my first sentence
1 this is the second sentence
2 third sent of the dataframe
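Now we can use ngrams together with word_tokenize to build bigrams and trigrams and apply this to each row of the dataframe. For bigrams, we pass the tokenized words and the value 2 to the ngrams function; for trigrams, we pass the value 3. The result returned by ngrams is a generator, so it is converted to a list. For each row, the bigram and trigram lists are stored in separate columns.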
df['bigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df['trigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 3)))
print(df)
Result:
text \
0 my first sentence
1 this is the second sentence
2 third sent of the dataframe
bigram \
0 [(my, first), (first, sentence)]
1 [(this, is), (is, the), (the, second), (second, sentence)]
2 [(third, sent), (sent, of), (of, the), (the, dataframe)]
trigram
0 [(my, first, sentence)]
1 [(this, is, the), (is, the, second), (the, second, sentence)]
2 [(third, sent, of), (sent, of, the), (of, the, dataframe)]
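To make it clearer what ngrams is doing with the number it receives, here is a rough pure-Python sketch of the same sliding-window idea. Note that simple_ngrams is a hypothetical helper written for illustration, not part of nltk, and str.split stands in for word_tokenize here, so punctuation would be handled differently.

```python
from itertools import islice


def simple_ngrams(tokens, n):
    """Return an iterator over successive n-token tuples (illustrative helper, not nltk's)."""
    tokens = list(tokens)
    # Build n views of the token list, each shifted one position further,
    # and zip them together; zip stops as soon as the shortest view runs out.
    return zip(*(islice(tokens, i, None) for i in range(n)))


sentences = ['my first sentence',
             'this is the second sentence',
             'third sent of the dataframe']

for text in sentences:
    words = text.split()  # assumption: whitespace split instead of word_tokenize
    print(list(simple_ngrams(words, 2)))
    print(list(simple_ngrams(words, 3)))
```

Like ngrams, the helper returns a lazy iterator, which is why it is wrapped in list before printing. A sentence with fewer than n words simply produces an empty list.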
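Please 1) don't post images, 2) don't post links to images, and 3) least of all links to images of _excel_ data. –
And read: http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples –
There is an `ngrams` function in nltk that makes this easy; it takes an argument for the number of words you want to group together. – kev8484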