2017-10-15 45 views
-2

Collapsing multi-line strings into one row based on a condition

Say I have data like this:

df <- data.frame(
    text = c("Treatment1: This text is", "on two lines", "",
             "Treatment2:This text", "has", "three lines", "",
             "Treatment3: This has one")
)
df 
                      text
1 Treatment1: This text is
2             on two lines
3
4     Treatment2:This text
5                      has
6              three lines
7
8 Treatment3: This has one

How can I parse this text so that each "Treatment" starts its own row, with all the text that follows it on that same row?

For example, this is the desired output:

text 
1 Treatment1: This text is on two lines 
2 Treatment2: This text has three lines     
3 Treatment3: This has one 

Can anyone recommend a way to do this?

Answers

2

Maybe something like the following.
First, here is the data in dput format, which is the best way to share a dataset in a post.

df <- 
structure(list(text = c("Treatment1: This text is", "on two lines", 
"", "Treatment2:This text", "has", "three lines", "", "Treatment3: This has one" 
)), .Names = "text", class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8")) 

Now the base R code.

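# Running group id: increases by one at each line containing "treatment".
fact <- cumsum(grepl("treatment", df$text, ignore.case = TRUE))
# Paste the lines of each group together and trim the trailing space
# left behind by the empty separator lines.
result <- do.call(rbind, lapply(split(df, fact), function(x)
        trimws(paste(x$text, collapse = " "))))
result <- as.data.frame(result)
names(result) <- "text"
result
#                                    text
#1 Treatment1: This text is on two lines
#2  Treatment2:This text has three lines
#3              Treatment3: This has one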
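To see why the grouping works, it may help to look at the intermediate values. The sketch below rebuilds the example data and prints each step. It adds `stringsAsFactors = FALSE` as an assumption so the column is character in every R version, and the names `starts`, `grp` and `collapsed` are illustrative only. It also uses `vapply()` over `split()` instead of the `do.call(rbind, ...)` construction, purely to keep the result a plain character vector.

```r
# Example data from the question; stringsAsFactors = FALSE is an added
# assumption so the column is character regardless of R version.
df <- data.frame(
  text = c("Treatment1: This text is", "on two lines", "",
           "Treatment2:This text", "has", "three lines", "",
           "Treatment3: This has one"),
  stringsAsFactors = FALSE
)

# TRUE on every line that mentions "treatment" (case-insensitive).
starts <- grepl("treatment", df$text, ignore.case = TRUE)
starts
# [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

# cumsum() turns the flags into a running group id: it goes up by one at
# each "Treatment" line and stays constant on the lines that follow.
grp <- cumsum(starts)
grp
# [1] 1 1 1 2 2 2 2 3

# Lines that share an id are pasted together; trimws() removes the trailing
# space contributed by the empty separator lines.
collapsed <- vapply(split(df$text, grp),
                    function(x) trimws(paste(x, collapse = " ")),
                    character(1))
unname(collapsed)
```

Because the pattern is not anchored, a continuation line that happens to contain the word "treatment" would also start a new group; anchoring it as `"^treatment"` is one possible way to guard against that.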

Edit.
As Rich Scriven pointed out in his comment, tapply can greatly simplify the code above. (I didn't see that; I sometimes overcomplicate things.)

result2 <- data.frame(
    text = tapply(df$text, fact, function(x) trimws(paste(x, collapse = " "))) 
) 

all.equal(result, result2) 
#[1] "Component “text”: 'current' is not a factor"
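The message from `all.equal()` seems to concern the column type (one column is a factor, the other is not) rather than the text itself. Here is a minimal sketch that compares only the text of the two results. It assumes the same example data, with `stringsAsFactors = FALSE` added so it behaves the same across R versions; `same_text` is an illustrative name. Whether `all.equal()` reports a difference at all may depend on the R version and its `stringsAsFactors` default.

```r
# Example data from the question (stringsAsFactors = FALSE is an assumption).
df <- data.frame(
  text = c("Treatment1: This text is", "on two lines", "",
           "Treatment2:This text", "has", "three lines", "",
           "Treatment3: This has one"),
  stringsAsFactors = FALSE
)
fact <- cumsum(grepl("treatment", df$text, ignore.case = TRUE))

# First approach: split / lapply / rbind.
result <- as.data.frame(do.call(rbind, lapply(split(df, fact), function(x)
  trimws(paste(x$text, collapse = " ")))))
names(result) <- "text"

# Second approach: tapply.
result2 <- data.frame(
  text = tapply(df$text, fact, function(x) trimws(paste(x, collapse = " ")))
)

# as.character() converts a factor to its labels and drops names and
# dimensions, so only the text content is compared here.
same_text <- identical(as.character(result$text), as.character(result2$text))
same_text
```

If `same_text` is `TRUE`, the two approaches agree on content and differ only in how the column is stored.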
+0

Take a look at 'tapply()'. It can replace 'do.call(rbind, lapply(split(...), ...))'.

+0

@RichScriven Thank you, I have edited the answer with your suggestion.

0
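# Join all lines into one string, then mark the whitespace before each
# "Treatment" with a sentinel that can be split on.
x <- gsub("\\s+Treatment", "*BREAK*Treatment",
      paste(df[[1]], collapse = " "))
# Split at the sentinel; each piece becomes one row.
data.frame(text = unlist(strsplit(x, "\\*BREAK\\*")))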
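The sentinel step above can probably be avoided by splitting directly on the whitespace that precedes "Treatment". The following is a variation on the idea, not the answerer's code: it uses a lookahead, which needs `perl = TRUE` in `strsplit()`. The names `pieces` and `out` are illustrative, and `stringsAsFactors = FALSE` is an added assumption.

```r
# Example data from the question (stringsAsFactors = FALSE is an assumption).
df <- data.frame(
  text = c("Treatment1: This text is", "on two lines", "",
           "Treatment2:This text", "has", "three lines", "",
           "Treatment3: This has one"),
  stringsAsFactors = FALSE
)

# Join all lines into one string; empty lines only add extra spaces.
x <- paste(df$text, collapse = " ")

# Split on whitespace that is immediately followed by "Treatment".
# The lookahead (?=...) matches without consuming, so "Treatment" stays
# at the start of each piece.
pieces <- strsplit(x, "\\s+(?=Treatment)", perl = TRUE)[[1]]

out <- data.frame(text = trimws(pieces), stringsAsFactors = FALSE)
out
```

Like the original, this assumes the word "Treatment" never appears in the middle of a continuation line; if it did, that line would be split as well.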