2017-09-01


text <- " customer:customer text1 agent:agent text 1 customer:customer text2 agent:agent text 2"


# desired outputs: 
# field for customer texts 
"customer text1, customer text2" 
# field for agent texts 
"agent text 1, agent text 2" 

What I have been able to do so far (with my limited experience of regular expressions) is:

customerText <- gsub("^ customer:| agent:(.*)", "", text) 
[1] "customer text1" 



> callid <- c("1","2") 
> conversation <- c(" customer:customer text 1 agent:agent text 1 customer:customer text 2 agent:agent text 2", 
+     " agent:agent text 8 customer:customer text 8 agent:agent text 9 customer:customer text 9") 
> conversationCustomer <- c("customer text 1, customer text 2", "customer text 8, customer text 9") 
> conversationAgent <- c("agent text 1, agent text 2", "agent text 8, agent text 9") 
> df <- data.frame(callid, conversation) 
> dfDesired <- data.frame(callid, conversation, conversationCustomer, conversationAgent) 
> rm(callid, conversation, conversationCustomer, conversationAgent) 
> df 
    callid                    conversation 
1  1 customer:customer text 1 agent:agent text 1 customer:customer text 2 agent:agent text 2 
2  2 agent:agent text 8 customer:customer text 8 agent:agent text 9 customer:customer text 9 
> dfDesired 
    callid                    conversation    conversationCustomer   conversationAgent 
1  1 customer:customer text 1 agent:agent text 1 customer:customer text 2 agent:agent text 2 customer text 1, customer text 2 agent text 1, agent text 2 
2  2 agent:agent text 8 customer:customer text 8 agent:agent text 9 customer:customer text 9 customer text 8, customer text 9 agent text 8, agent text 9 



R for text parsing? God bless you. – Matt




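library(stringr)  # provides str_extract_all()
# Extract each "customer ..." / "agent ..." utterance in order of appearance
v1 <- str_extract_all(text, "(?<=:)(customer\\s+\\w+\\s*\\d*)|(agent\\s+\\w+\\s*\\d*)")[[1]] 
v1[c(TRUE, FALSE)]  # odd positions: customer texts (the customer speaks first here)
v1[c(FALSE, TRUE)]  # even positions: agent texts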


v1 <- strsplit(trimws(text), "(customer|agent):\\s*")[[1]] 
v2 <- trimws(v1[nzchar(v1)]) 
toString(v2[c(TRUE, FALSE)]) 
toString(v2[c(FALSE, TRUE)]) 
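
The alternating index `c(TRUE, FALSE)` assumes the customer always speaks first, which does not hold for the second row of the sample data frame in the question. The following is a minimal base-R sketch of one way to apply the same `strsplit` idea per row while picking utterances by their label instead of by position. The helper name `extract_speaker` is hypothetical, and the sketch assumes every `customer:` or `agent:` label is followed by some text.

```r
# Hypothetical helper (not part of the original answer): collect every
# utterance of one speaker from a single conversation string.
# Assumes each turn is prefixed with "customer:" or "agent:" and is non-empty.
extract_speaker <- function(conversation, speaker) {
  pattern <- "(customer|agent):\\s*"
  # Labels in order of appearance, e.g. "customer", "agent", ...
  labels <- regmatches(conversation, gregexpr(pattern, conversation))[[1]]
  labels <- sub(":\\s*$", "", labels)
  # Pieces between the labels; the first piece precedes the first label
  pieces <- strsplit(conversation, pattern)[[1]][-1]
  toString(trimws(pieces[labels == speaker]))
}

# Sample data copied from the question
df <- data.frame(
  callid = c("1", "2"),
  conversation = c(
    " customer:customer text 1 agent:agent text 1 customer:customer text 2 agent:agent text 2",
    " agent:agent text 8 customer:customer text 8 agent:agent text 9 customer:customer text 9"
  ),
  stringsAsFactors = FALSE
)

# One collapsed string per row, so the result fits a data frame column
df$conversationCustomer <- vapply(df$conversation, extract_speaker, character(1),
                                  speaker = "customer", USE.NAMES = FALSE)
df$conversationAgent <- vapply(df$conversation, extract_speaker, character(1),
                               speaker = "agent", USE.NAMES = FALSE)
```

Because `vapply` returns exactly one string per conversation, the assignment cannot fail with a row-count mismatch.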

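In the question above I gave the vector 'text' as an example, and your solution works well for it. Thank you! However, when I tried the 'strsplit' approach on the real data in my data frame, it raised the following error: > df$conversation_customer <- strsplit(trimws(df$conversation), "(customer|agent):\\s*")[[1]] Error in `$<-.data.frame`(`*tmp*`, conversationCustomer, value = c("", : replacement has 86 rows, data has 1. Then, with the help of your code, I came up with: df$conversationCustomer <- toString(strsplit(trimws(df$conversation), "(customer|agent):\\s*")[[1]][c(TRUE, FALSE)]) – kzmlbyrk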


@kzmlbyrk If it is a data.frame, then you don't need to extract the first element; use sapply: 'lst <- strsplit(trimws(df$conversation), "(customer|agent):\\s*"); do.call(rbind, lapply(lst, function(x) x[nzchar(x)][c(TRUE, FALSE)]))' and likewise with 'c(FALSE, TRUE)' – akrun


Am I missing something? df$conversationCustomer <- strsplit(trimws(df$conversation) :\\s*"); df$conversationCustomer <- do.call(rbind, lapply(df$conversationCustomer, function(x) x[nzcha (function (..., deparse.level = 1): number of columns of result is not a multiple of vector length (arg 1), and the 'conversationCustomer' column contains only the first value. – kzmlbyrk
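
A warning of the form "number of columns of result is not a multiple of vector length" is what `rbind` emits when it is handed vectors of different lengths, which can happen when conversations have different numbers of turns. The toy example below uses made-up values (not the poster's data) to show the recycling, and one possible way around it: collapsing each element to a single string first.

```r
# Hypothetical per-row pieces of unequal length (illustrative values only)
lst <- list(c("a", "b"), c("c", "d", "e"))

# rbind() recycles the shorter vector to fill a 2 x 3 matrix and warns;
# the warning is suppressed here so the sketch runs quietly
m <- suppressWarnings(do.call(rbind, lst))
dim(m)

# Collapsing each element to one string gives exactly one value per row
collapsed <- vapply(lst, toString, character(1))
collapsed
```

A character vector with one entry per row can be assigned to a data frame column directly, whereas a matrix with recycled cells cannot be trusted.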



# Replace every stretch that starts with "agent:" and ends with "customer:" by a comma, keeping only the customer text.
df$conversationCustomer <- gsub("agent:.*?customer:", ",", df$conversation)
# Remove a trailing agent turn at the end of the conversation, which the first regex cannot catch.
df$conversationCustomer <- gsub("agent:.*", "", df$conversationCustomer)
# Remove the leading "customer:" label in conversations that start with customer text, which the first regex also misses.
df$conversationCustomer <- gsub("customer:", "", df$conversationCustomer)
df$conversationAgent <- gsub("customer:.*?agent:", ",", df$conversation) 
df$conversationAgent <- gsub("customer:.*", "", df$conversationAgent) 
df$conversationAgent <- gsub("agent:", "", df$conversationAgent)
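
The `gsub` chain above leaves stray spaces around the commas, and a leading comma when the conversation starts with the other speaker. The sketch below shows one possible cleanup step. The helper `tidy_field` is a hypothetical name introduced here, and the sample strings are copied from the question.

```r
# Sample conversations copied from the question
conversation <- c(
  " customer:customer text 1 agent:agent text 1 customer:customer text 2 agent:agent text 2",
  " agent:agent text 8 customer:customer text 8 agent:agent text 9 customer:customer text 9"
)

# The three gsub() steps from above, applied to the customer side
customer <- gsub("agent:.*?customer:", ",", conversation)
customer <- gsub("agent:.*", "", customer)
customer <- gsub("customer:", "", customer)

# Hypothetical helper: strip leading/trailing spaces and commas,
# then normalise every separator to ", "
tidy_field <- function(x) {
  x <- gsub("^[[:space:],]+|[[:space:],]+$", "", x)
  gsub("[[:space:]]*,[[:space:]]*", ", ", x)
}

customerTidy <- tidy_field(customer)
customerTidy
```

The same helper could be applied to the agent column after its three `gsub` calls.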