Splitting pandas DataFrame column values

I have an Excel dataset containing usertype, ID, and a description of properties. I have imported this file into Python pandas as a DataFrame (df).

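Now I want to split the contents of Description into one-word, two-word, and three-word tokens. I can do single-word tokenization with the help of the NLTK library, but I am stuck on the two- and three-word tokens. For example, one of the rows in the Description column contains the sentence -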

A brand new residential apartment in Mumbai main road with portable water.

I want this sentence to be split as

"A brand", "brand new", "new residential", "residential apartment" ... "portable water".

And this split should be applied to every row of that column.

Image of my dataset in excel format

Source

2017-08-24 Rajitha Naik

How about 1) not posting pictures, 2) not posting links to pictures, and 3) much less links to pictures of _excel_ data.

And read: http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples

There is an 'ngrams' function in nltk that does this quite easily; it takes an argument for the number of words you want to group together – kev8484

Here is a small example using ngrams from nltk. Hope it helps:

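import pandas as pd
from nltk.util import ngrams
from nltk import word_tokenize

# Note: word_tokenize relies on NLTK tokenizer data that has to be
# downloaded once with nltk.download(...) before the first call.

# Creating test dataframe
df = pd.DataFrame({'text': ['my first sentence',
                            'this is the second sentence',
                            'third sent of the dataframe']})
print(df)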

Input dataframe:

                          text
0            my first sentence
1  this is the second sentence
2  third sent of the dataframe

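Now we can use ngrams together with word_tokenize to build bigrams and trigrams and apply this to every row of the dataframe. For bigrams, we pass the tokenized words and the value 2 to the ngrams function; for trigrams, we pass 3. The result returned by ngrams is a generator, so it is converted to a list. For each row, the lists of bigrams and trigrams are stored in separate columns.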

df['bigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 2))) 
df['trigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 3))) 
print(df)

Result:

     text \ 
0   my first sentence 
1 this is the second sentence 
2 third sent of the dataframe 

                bigram \ 
0       [(my, first), (first, sentence)] 
1 [(this, is), (is, the), (the, second), (second, sentence)] 
2 [(third, sent), (sent, of), (of, the), (the, dataframe)] 

                trigram 
0          [(my, first, sentence)] 
1 [(this, is, the), (is, the, second), (the, second, sentence)] 
2  [(third, sent, of), (sent, of, the), (of, the, dataframe)]
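The question asked for strings such as "brand new" rather than tuples. A minimal sketch of one way to get them is below. It assumes a plain whitespace split is an acceptable stand-in for word_tokenize, and the helper name word_ngrams is hypothetical, not part of nltk or pandas:

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text as a space-joined string."""
    # Assumption: splitting on whitespace is good enough here; punctuation
    # stays attached to words, unlike with nltk's word_tokenize.
    words = text.split()
    # Slide a window of width n over the words; texts shorter than n words
    # give an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "A brand new residential apartment"
for n in (1, 2, 3):
    print(n, word_ngrams(sentence, n))
```

With a dataframe like the one above, the same helper could be applied per row, for example df['text'].apply(lambda s: word_ngrams(s, 2)).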

Source

2017-08-24 19:42:10 0p3n5ourcE
