2016-06-25 53 views
0

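Split a string in a data frame into two columns

In R, I converted a DocumentTermMatrix of 4-grams into a data frame, and now I want to split each n-gram into two columns: one with the first three words of the string and the other with the last word. I can do this in several steps, but given the size of the df I would like to do it in a single pipeline step.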

Here is what I am trying to accomplish:

#             str_name          w123   w4 freq
# 1 One Two Three Four One Two Three Four   10

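This gives me the first three words, but leaves w4 empty:

library(tidyr)  # provides separate() and re-exports %>%
df <- data.frame(str_name = "One Two Three Four", freq = 10) 
df %>% separate(str_name, c("w123","w4"), sep = "\\w+$", remove=FALSE) 

#             str_name           w123 w4 freq
# 1 One Two Three Four One Two Three       10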

This gives me the last word, but it also includes the leading space:

df <- data.frame(str_name = "One Two Three Four", freq = 10) 
df %>% separate(str_name, c("sp","w4"), sep = "\\w+\\s\\w+\\s\\w+", remove=FALSE) 

#             str_name sp    w4 freq
# 1 One Two Three Four     Four   10

This is the long way:

df <- data.frame(w4 = "One Two Three Four", freq = 10) 
df <- df %>% separate(w4, c('w1', 'w2', 'w3', 'w4'), " ") 
df$lookup <- paste(df$w1,df$w2,df$w3) 

#    w1  w2    w3   w4 freq        lookup
# 1 One Two Three Four   10 One Two Three

Answers

3

Try \\s(?=\\w+$), which splits the string at the whitespace character immediately before the last word:

df %>% separate(str_name, into = c("w123", "w4"), sep = "\\s(?=\\w+$)", remove = FALSE) 
#             str_name          w123   w4 freq
# 1 One Two Three Four One Two Three Four   10

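\\s(?=[\\S]+$) is another option. It is greedier than the pattern above: it splits on the last whitespace character in the string, whatever follows it, whereas \\w+ only accepts a final word made of letters, digits, and underscores.

df %>% separate(str_name, into = c("w123", "w4"), sep = "\\s(?=[\\S]+$)", remove = FALSE) 
#             str_name          w123   w4 freq
# 1 One Two Three Four One Two Three Four   10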
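
To see where the two patterns diverge, here is a minimal base R sketch. It uses strsplit() with perl = TRUE rather than separate(), and the sample strings are made up for illustration; the assumption is that separate() handles these regular expressions in the same way.

```r
# Hypothetical sample n-grams; the second one ends in a word containing an apostrophe.
x <- c("One Two Three Four", "I really really don't")

# \w+ only matches letters, digits and underscores, so "don't" is not
# recognised as a final word and the second string is left unsplit.
split_w <- strsplit(x, "\\s(?=\\w+$)", perl = TRUE)

# \S+ matches any run of non-whitespace, so both strings split at the last space.
split_S <- strsplit(x, "\\s(?=\\S+$)", perl = TRUE)

print(split_w)
print(split_S)
```

If the n-grams may contain punctuation inside words, the \\S+ variant is probably the safer choice; with the \\w+ variant such rows should end up with a missing w4.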

Perfect, thank you! – pheeper

0

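We can also solve this with a base R approach: replace the last space with a comma, then read the result back in as two comma-separated columns.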

res <- cbind(df, read.table(text=sub("\\s(\\S+)$", ",\\1", df$str_name), 
    sep=",", header=FALSE, col.names = c("w123", "w4"), stringsAsFactors=FALSE))[c(1,3,4,2)] 
res 
#            str_name          w123   w4 freq
#1 One Two Three Four One Two Three Four   10
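
As far as I can tell, the read.table() trick relies on the n-grams containing no commas. A minimal alternative sketch in base R, assuming the last word is whatever follows the final whitespace, uses two sub() calls instead; the second row of the sample data is made up to show the comma case.

```r
# Sample data; the second row is a hypothetical n-gram that contains a comma.
df <- data.frame(str_name = c("One Two Three Four", "Yes, One Two Three"),
                 freq = c(10, 5), stringsAsFactors = FALSE)

# Drop the last whitespace-delimited word to get the first three words.
df$w123 <- sub("\\s+\\S+$", "", df$str_name)
# Drop everything up to and including the last whitespace to get the last word.
df$w4 <- sub("^.*\\s", "", df$str_name)

# Reorder columns to match the layout asked for in the question.
res <- df[c("str_name", "w123", "w4", "freq")]
print(res)
```

Both sub() calls are vectorised over the whole column, so no intermediate text parsing is needed.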