2015-03-03 50 views

Sentence detection and extraction into the same data frame

I have the following data frame:

reviews <- data.frame(value = c("Product was received in excellent condition. Made with high quality materials. Very Good product", 
           "Inexpensive. An improvement over integrated graphics.", 
           "I love that product so excite. I will order again if I need more .", 
           "Excellent card, great graphics."), 
         user = c(1,2,3,4), 
         Review_Id = c("101968","101968","210546","112546"), 
         stringsAsFactors = FALSE) 

And I need the following output:

 user  review_Id  sentence
    1  101968     Made with high quality materials.
    1  101968     Very Good product
    2  101968     Inexpensive.
    2  101968     An improvement over integrated graphics.
    3  210546     I love that product so excite.
    3  210546     I will order again if I need more .
    4  112546     Excellent card, great graphics.

I know about something like this: sent_detect(reviews$value)

But how can I combine this function with the rest of the data frame to get the desired output?


Is your data really that clean? (For example, does every sentence end with a period followed by a space?) – A5C1D2H2I1M1N2O1R2T1 2015-03-03 11:19:53


If not, you could try [this](http://www.inside-r.org/packages/cran/openNLP/docs/Maxent_Sent_Token_Annotator); there is an example at the end. – NicE 2015-03-03 11:25:58

Answers


If your data really is that clean, you can use cSplit from my "splitstackshape" package:

library(splitstackshape) 
cSplit(reviews, "value", ".", direction = "long") 
#           value user Review_Id 
# 1: Product was received in excellent condition 1 101968 
# 2:   Made with high quality materials 1 101968 
# 3:       Very Good product 1 101968 
# 4:         Inexpensive 2 101968 
# 5:  An improvement over integrated graphics 2 101968 
# 6:    I love that product so excite 3 210546 
# 7:   I will order again if I need more 3 210546 
# 8:    Excellent card, great graphics 4 112546 
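Note that the cSplit output above drops the periods, whereas the desired output in the question keeps them. As a rough base R sketch (not part of the original answer, and assuming every sentence break is a period followed by whitespace), a lookbehind pattern in strsplit keeps the period attached to its sentence. The names `pieces` and `sentences` are just illustrative:

```r
# Same sample data as in the question
reviews <- data.frame(
  value = c("Product was received in excellent condition. Made with high quality materials. Very Good product",
            "Inexpensive. An improvement over integrated graphics.",
            "I love that product so excite. I will order again if I need more .",
            "Excellent card, great graphics."),
  user = c(1, 2, 3, 4),
  Review_Id = c("101968", "101968", "210546", "112546"),
  stringsAsFactors = FALSE
)

# Split on whitespace that directly follows a period; the lookbehind
# means the period itself is not consumed by the split
pieces <- strsplit(reviews$value, "(?<=\\.)\\s+", perl = TRUE)

# Repeat user and Review_Id once per sentence found in each review
sentences <- data.frame(
  user = rep(reviews$user, times = lengths(pieces)),
  Review_Id = rep(reviews$Review_Id, times = lengths(pieces)),
  sentence = unlist(pieces),
  stringsAsFactors = FALSE
)

print(sentences)
```

This gives one row per sentence, including the first sentence of review 1. It will mis-split on abbreviations such as "e.g. " so it is only suitable for tidy text.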

Thank you very much... this function is really great. It solved my task. Thanks again. – martinkabe 2015-03-03 12:51:44


One last question... what if sentences do not end only with a period, but also with, for example, ! or ? How can I add those to the cSplit function? – martinkabe 2015-03-03 12:53:25


@martinkabe, you can try something like `cSplit(reviews, "value", "[.!?]", fixed = FALSE, stripWhite = FALSE, direction = "long")` to split on ".", "!" or "?". – A5C1D2H2I1M1N2O1R2T1 2015-03-03 17:00:15
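To see what a character class like `[.!?]` does as a separator, here is a small base R sketch using strsplit rather than cSplit. The string `x` is a made-up example, not taken from the reviews, and the second variant (a lookbehind that keeps the punctuation) is an assumption about what might be wanted, not something suggested in the thread:

```r
# Hypothetical example text, not from the question's data
x <- "Great card! Would I buy it again? Yes. Definitely"

# Split on ".", "!" or "?"; the separators themselves are dropped
parts <- strsplit(x, "[.!?]")[[1]]
# Remove the whitespace left at the start of each piece
parts <- trimws(parts)
print(parts)

# Variant: split on the whitespace that follows ".", "!" or "?",
# so each sentence keeps its closing punctuation
kept <- strsplit(x, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
print(kept)
```

With `fixed = FALSE` the separator is treated as a regular expression, which is why the bracketed class matches any one of the three characters.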