2012-03-25 40 views
3

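Collapse a column by a grouping variable (in base R)

I have a text variable and a grouping variable. I'd like to collapse the text variable into one string per row (combining). So as long as the group column says m, I want to group the text together, and so on. I've provided a sample data set before and after. I am writing this for a package and have so far avoided all dependencies on other packages except wordcloud, and I would like to keep it that way.

I suspect rle may be useful together with cumsum, but I haven't been able to figure this one out.

Thank you in advance.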

What the data looks like

                               text group
1     Computer is fun. Not too fun.     m
2             No its not, its dumb.     m
3            How can we be certain?     f
4                  There is no way.     m
5                   I distrust you.     m
6       What are you talking about?     f
7      Shall we move on? Good then.     f
8 Im hungry. Lets eat. You already?     m

What I'd like the data to look like

                                                      text group
1      Computer is fun. Not too fun. No its not, its dumb.     m
2                                   How can we be certain?     f
3                         There is no way. I distrust you.     m
4 What are you talking about? Shall we move on? Good then.     f
5                        Im hungry. Lets eat. You already?     m

The data

dat <- structure(list(text = c("Computer is fun. Not too fun.", "No its not, its dumb.", 
"How can we be certain?", "There is no way.", "I distrust you.", 
"What are you talking about?", "Shall we move on? Good then.", 
"Im hungry. Lets eat. You already?"), group = structure(c(2L, 
2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("text", 
"group"), row.names = c(NA, 8L), class = "data.frame") 

EDIT: I found I can add a unique column for each run of the group variable with:

x <- rle(as.character(dat$group))[[1]] 
dat$new <- as.factor(rep(1:length(x), x)) 

Yielding:

                               text group new
1     Computer is fun. Not too fun.     m   1
2             No its not, its dumb.     m   1
3            How can we be certain?     f   2
4                  There is no way.     m   3
5                   I distrust you.     m   3
6       What are you talking about?     f   4
7      Shall we move on? Good then.     f   4
8 Im hungry. Lets eat. You already?     m   5

Answers

5

This uses rle to create an id for grouping the sentences. It then uses tapply together with paste to bring the output together.

## Your example data 
dat <- structure(list(text = c("Computer is fun. Not too fun.", "No its not, its dumb.", 
"How can we be certain?", "There is no way.", "I distrust you.", 
"What are you talking about?", "Shall we move on?  Good then.", 
"Im hungry.  Lets eat.  You already?"), group = structure(c(2L, 
2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("text", 
"group"), row.names = c(NA, 8L), class = "data.frame") 


# Needed for later 
k <- rle(as.numeric(dat$group)) 
# Create a grouping vector 
id <- rep(seq_along(k$len), k$len) 
# Combine the text in the desired manner 
out <- tapply(dat$text, id, paste, collapse = " ") 
# Bring it together into a data frame 
answer <- data.frame(text = out, group = levels(dat$group)[k$val]) 
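As a side note, the cumsum idea mentioned in the question can be sketched in base R as well. This is only an illustration, not part of the answer above: `g` and `txt` are made-up stand-in vectors, and `starts`, `run_id` and `collapsed` are names chosen here for the example.

```r
# Sketch only: `g` and `txt` are stand-in vectors, not objects from the answer.
g   <- c("m", "m", "f", "m", "m", "f", "f", "m")
txt <- c("a", "b", "c", "d", "e", "f", "g", "h")

# TRUE wherever the group differs from the previous row
# (the first row always starts a new run)
starts <- c(TRUE, g[-1] != g[-length(g)])

# A running count of run starts gives one id per consecutive run
run_id <- cumsum(starts)

# Collapse the text within each run and keep the group label of each run
collapsed <- data.frame(
  text  = as.vector(tapply(txt, run_id, paste, collapse = " ")),
  group = g[starts],
  stringsAsFactors = FALSE
)
```

The comparison of each element with its predecessor plays the role of rle here; which of the two reads better is largely a matter of taste.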
+1

I don't believe you need "seq(length(k$len))", since seq will act like "seq_along" on the k$length vector, giving the corresponding sequence of numbers: id <- rep(seq(k$length), k$length) – 2012-03-25 05:04:28

+0

@BryanGoodrich Good catch. Originally I was just going to do 1:length(k$len), but lately I've been using seq and seq_along more, and I guess I ended up mixing the two approaches. – Dason 2012-03-25 05:28:35

+0

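I usually just stick with seq, but for clarity I can see how seq_along makes it explicit that you are stepping numerically along a vector of values. I often take that clearer route when I deal with redundancy on boolean vectors using x[(some logic here...)]. It isn't necessary, but it does give me the linguistic clarity I prefer in code. – 2012-03-26 07:16:24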
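One practical difference between the two, beyond readability, shows up when the data contain a single run. The sketch below uses a made-up vector `g` in which every row has the same group; the names `id_seq` and `id_seq_along` are chosen here for illustration.

```r
# Hypothetical edge case: every row belongs to the same group, so there is one run.
g <- rep("m", 3)
k <- rle(g)

# k$lengths is the single number 3 here.
# seq() on a lone number counts up to it; seq_along() gives one index per element.
by_seq       <- seq(k$lengths)        # 1 2 3
by_seq_along <- seq_along(k$lengths)  # 1

# Building the id vector both ways
id_seq       <- rep(seq(k$lengths), k$lengths)        # 9 ids for 3 rows
id_seq_along <- rep(seq_along(k$lengths), k$lengths)  # 1 1 1
```

With two or more runs both forms give the same ids, so the difference only matters for this single-run case.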

1

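I worked out an answer and came back to post it, but Dason beat me to it, and with a more understandable solution than my own.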

x <- rle(as.character(dat$group))[[1]] 
dat$new <- as.factor(rep(1:length(x), x)) 

Paste <- function(x) paste(x, collapse=" ") 
aggregate(text~new, dat, Paste) 

EDIT: How I'd do it with aggregate, using what I learned from your response (though tapply is a better solution):

y <- rle(as.character(dat$group)) 
x <- y[[1]] 
dat$new <- as.factor(rep(1:length(x), x)) 

text <- aggregate(text~new, dat, paste, collapse = " ")[, 2] 
data.frame(text, group = y[[2]]) 
+1

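Note that you don't need to define Paste, since aggregate lets you pass additional arguments on to the function being applied. You should be able to drop Paste and use this instead: aggregate(text ~ new, dat, paste, collapse = " ") – Dason 2012-03-25 04:06:25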
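A small sketch of that point, assuming a toy data frame `dat2` invented for this example (it is not the `dat` from the question). The wrapper function and the extra-argument form should produce the same collapsed text:

```r
# Toy data, made up for illustration only
dat2 <- data.frame(
  text = c("a", "b", "c", "d"),
  new  = factor(c(1, 1, 2, 2)),
  stringsAsFactors = FALSE
)

# Version 1: a wrapper that fixes the collapse argument
Paste <- function(x) paste(x, collapse = " ")
with_wrapper <- aggregate(text ~ new, dat2, Paste)

# Version 2: pass collapse straight through aggregate's ... to paste
with_dots <- aggregate(text ~ new, dat2, paste, collapse = " ")

# Compare the collapsed text columns of the two results
same_text <- identical(as.character(with_wrapper$text),
                       as.character(with_dots$text))
```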