2017-09-01

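Replacing text with regular expressions in R

I have a text column that contains speech-to-text transcripts of phone calls between customers and agents. After some text manipulation on the raw values, say I have something like the following vector as an example (note the space at the beginning of the text):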

text <- " customer:customer text1 agent:agent text 1 customer:customer text2 agent:agent text 2"

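Question: how can I extract the customer texts and the agent texts from the original source field (the text vector in this case) into two separate fields?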

# desired outputs: 
# field for customer texts 
"customer text1, customer text2" 
# field for agent texts 
"agent text1, agent text2" 

What I have been able to do so far (with my limited experience of regular expressions) is:

customerText <- gsub("^ customer:| agent:(.*)", "", text) 
customerText 
[1] "customer text1" 

Edit:

Please consider the following reproducible code for a data-frame-based approach, instead of the vector-based one above.

> callid <- c("1","2") 
> conversation <- c(" customer:customer text 1 agent:agent text 1 customer:customer text 2 agent:agent text 2", 
+     " agent:agent text 8 customer:customer text 8 agent:agent text 9 customer:customer text 9") 
> conversationCustomer <- c("customer text 1, customer text 2", "customer text 8, customer text 9") 
> conversationAgent <- c("agent text 1, agent text 2", "agent text 8, agent text 9") 
> df <- data.frame(callid, conversation) 
> dfDesired <- data.frame(callid, conversation, conversationCustomer, conversationAgent) 
> rm(callid, conversation, conversationCustomer, conversationAgent) 
> 
> df 
    callid                    conversation 
1  1 customer:customer text 1 agent:agent text 1 customer:customer text 2 agent:agent text 2 
2  2 agent:agent text 8 customer:customer text 8 agent:agent text 9 customer:customer text 9 
> dfDesired 
    callid                    conversation    conversationCustomer   conversationAgent 
1  1 customer:customer text 1 agent:agent text 1 customer:customer text 2 agent:agent text 2 customer text 1, customer text 2 agent text 1, agent text 2 
2  2 agent:agent text 8 customer:customer text 8 agent:agent text 9 customer:customer text 9 customer text 8, customer text 9 agent text 8, agent text 9 

Thanks!

R for text parsing? God bless you. – Matt

Answers

We can use str_extract_all from stringr:

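library(stringr)
# Pull out every "customer ..." / "agent ..." chunk; [[1]] takes the matches for the single input string.
v1 <- str_extract_all(text, "(?<=:)(customer\\s+\\w+\\s*\\d*)|(agent\\s+\\w+\\s*\\d*)")[[1]]
v1[c(TRUE, FALSE)]  # odd positions: customer texts (the sample starts with the customer)
v1[c(FALSE, TRUE)]  # even positions: agent texts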

Or use strsplit:

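# Split on the speaker labels; the leading label produces an empty first element.
v1 <- strsplit(trimws(text), "(customer|agent):\\s*")[[1]]
v2 <- trimws(v1[nzchar(v1)])  # drop the empty element and trim stray spaces
toString(v2[c(TRUE, FALSE)])  # customer texts, comma-separated
toString(v2[c(FALSE, TRUE)])  # agent texts, comma-separated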
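
The alternating index above assumes the conversation opens with the customer. As a hedged alternative, the sketch below keys each match on the speaker label itself, using base R with PCRE lookarounds. The pattern and variable names are illustrative assumptions, not part of the original answer, and it assumes the labels "customer:" and "agent:" never occur inside the spoken text.

```r
text <- " customer:customer text1 agent:agent text 1 customer:customer text2 agent:agent text 2"

# The lookbehind anchors each match right after the speaker label; the lazy
# `.*?` stops before the next label (or at the end of the string).
# These patterns are an illustrative assumption, not the answer's own regex.
customer_pattern <- "(?<=customer:).*?(?=\\s*(?:customer|agent):|$)"
agent_pattern    <- "(?<=agent:).*?(?=\\s*(?:customer|agent):|$)"

customer_parts <- regmatches(text, gregexpr(customer_pattern, text, perl = TRUE))[[1]]
agent_parts    <- regmatches(text, gregexpr(agent_pattern, text, perl = TRUE))[[1]]

# Collapse each speaker's turns into one comma-separated string.
customer_text <- toString(trimws(customer_parts))
agent_text    <- toString(trimws(agent_parts))
```

Because each pattern looks for its own label, the result does not depend on who speaks first.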

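In the question above I gave the vector `text` as an example, and your solution works well for that. Thanks! However, when I tried the `strsplit` approach on the real data in my data frame, it raised the following error: `> df$conversation_customer <- strsplit(trimws(df$conversation), "(customer|agent):\\s*")[[1]]` gives `Error in $<-.data.frame(*tmp*, conversationCustomer, value = c("", : replacement has 86 rows, data has 1`. Then, with the help of your code, I came up with: `df$conversationCustomer <- toString(strsplit(trimws(df$conversation), "(customer|agent):\\s*")[[1]][c(TRUE, FALSE)])` – kzmlbyrk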

@kzmlbyrk If it is a data.frame, then you don't need to extract the first element; use sapply/lapply instead: `lst <- strsplit(trimws(df$conversation), "(customer|agent):\\s*"); do.call(rbind, lapply(lst, function(x) x[nzchar(x)][c(TRUE, FALSE)]))`. Do the same with `c(FALSE, TRUE)`. – akrun

Am I missing something? `df$conversationCustomer <- strsplit(trimws(df$conversation) [...] :\\s*"); df$conversationCustomer <- do.call(rbind, lapply(df$conversationCustomer, function(x) x[nzcha [...]` gives `(function (..., deparse.level = 1) : number of columns of result is not a multiple of vector length (arg 1)`, and the 'conversationCustomer' column contains only the first value. – kzmlbyrk
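
The error and the warning in the comments above both seem to come from trying to store pieces of varying length in a single column: `[[1]]` keeps only the first row's pieces, and `rbind` recycles rows of unequal length. A minimal sketch of one way around this is to collapse each row to a single string before assigning. `extract_speaker` is a hypothetical helper introduced here for illustration, and the sample data frame is the one from the question.

```r
df <- data.frame(
  callid = c("1", "2"),
  conversation = c(
    " customer:customer text 1 agent:agent text 1 customer:customer text 2 agent:agent text 2",
    " agent:agent text 8 customer:customer text 8 agent:agent text 9 customer:customer text 9"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from the thread): returns one comma-separated
# string per input element, containing only the turns of `speaker`.
# Assumes the only labels are "customer:" and "agent:".
extract_speaker <- function(x, speaker) {
  x <- as.character(x)  # guard against factor columns
  pattern <- paste0("(?<=", speaker, ":).*?(?=\\s*(?:customer|agent):|$)")
  pieces <- regmatches(x, gregexpr(pattern, x, perl = TRUE))  # one vector per row
  vapply(pieces, function(p) toString(trimws(p)), character(1), USE.NAMES = FALSE)
}

df$conversationCustomer <- extract_speaker(df$conversation, "customer")
df$conversationAgent    <- extract_speaker(df$conversation, "agent")
```

`vapply(..., character(1))` guarantees exactly one value per row, so the assignment to the data frame column has the right length regardless of how many turns each call contains.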

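For now, I have been able to solve the problem as follows. I guess someone more experienced with regular expressions could shorten this.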

# Customer side: replace every agent turn that is followed by a customer turn ("agent: ... customer:") with a comma.
df$conversationCustomer <- gsub("agent:.*?customer:", ",", df$conversation)
# Drop an agent turn at the very end of the conversation, which the first regex cannot reach.
df$conversationCustomer <- gsub("agent:.*", "", df$conversationCustomer)
# Remove the "customer:" label that remains when the conversation starts with the customer.
df$conversationCustomer <- gsub("customer:", "", df$conversationCustomer)
# Agent side: the same three steps with the roles swapped.
df$conversationAgent <- gsub("customer:.*?agent:", ",", df$conversation)
df$conversationAgent <- gsub("customer:.*", "", df$conversationAgent)
df$conversationAgent <- gsub("agent:", "", df$conversationAgent)
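
On the sample data frame from the question, these substitutions appear to leave stray spaces around the commas and, for the call that opens with the agent, a leading comma. The sketch below adds a small cleanup step after the customer-side substitutions. `tidy_turns` is a hypothetical helper, not part of the answer, and it assumes commas in the result only come from the substitution step (commas inside the transcript itself would be re-spaced too).

```r
df <- data.frame(
  callid = c("1", "2"),
  conversation = c(
    " customer:customer text 1 agent:agent text 1 customer:customer text 2 agent:agent text 2",
    " agent:agent text 8 customer:customer text 8 agent:agent text 9 customer:customer text 9"
  ),
  stringsAsFactors = FALSE
)

# The three customer-side substitutions from the answer, unchanged.
customer <- gsub("agent:.*?customer:", ",", df$conversation)
customer <- gsub("agent:.*", "", customer)
customer <- gsub("customer:", "", customer)

# Hypothetical cleanup helper: trim the ends, normalise the separators
# to ", ", and drop a separator left at the very start.
tidy_turns <- function(x) {
  x <- trimws(x)
  x <- gsub("\\s*,\\s*", ", ", x)
  sub("^, ", "", x)
}

df$conversationCustomer <- tidy_turns(customer)
```

The same helper could be applied to the agent-side result in the same way.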