2017-09-13

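Merge two document-term matrices by row

I have customer queries and the corresponding customer-service answers in a CSV file. I need to identify the topic of each question and then develop a classification model on that basis. After cleaning the documents, I created two document-term matrices, one for the questions and one for the answers. I reduced their size by keeping only terms that occur more than 400 times across the whole collection (roughly 40,000 questions and answers).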
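The frequency cut-off described above can be sketched in base R as follows. This is only an illustration on a made-up dense matrix, assuming the DTM has been converted with as.matrix(); the threshold `min_count` and the toy counts are hypothetical. (tm also provides findFreqTerms() for listing terms above a frequency bound.)

```r
# Toy document-term counts standing in for as.matrix(dtm); values are made up.
m <- rbind(c(3, 1, 0),
           c(4, 0, 5),
           c(0, 1, 0))
dimnames(m) <- list(Docs = c("1", "2", "3"),
                    Terms = c("booking", "card", "change"))

min_count <- 4  # hypothetical threshold; the question uses 400

# Keep only the terms whose total count over all documents exceeds the threshold.
pruned <- m[, colSums(m) > min_count, drop = FALSE]
colnames(pruned)
```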

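I want to create a data frame that merges these two matrices by row, keeping only the words common to the question and answer DTMs and summing their frequencies. How should I do this in R?

Any help or suggestions on how to label each question with its highest-frequency words would be highly appreciated.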
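One possible way to label each document with its most frequent term, sketched in base R. The matrix below is made up and stands in for a dense as.matrix() conversion of a DTM; taking a single top term per document (and breaking ties by column order) is an assumption, not something the question specifies.

```r
# Toy count matrix standing in for as.matrix(dtm); values are made up.
m <- matrix(
  c(3, 1, 0,
    0, 2, 5,
    1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("1", "2", "3"),
                  Terms = c("booking", "card", "change"))
)

# max.col() returns, for each row, the column index of the largest value;
# ties.method = "first" makes the choice deterministic when counts are tied.
top_term <- colnames(m)[max.col(m, ties.method = "first")]
names(top_term) <- rownames(m)
top_term
```

Note that a document containing none of the retained terms would still receive the first term as its label, so such all-zero rows may need to be filtered out beforehand.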

> str(inspect(dtmaf)) 
<<DocumentTermMatrix (documents: 38697, terms: 237)>> 
Non-/sparse entries: 326124/8845065 
Sparsity   : 96% 
Maximal term length: 13 
Weighting   : term frequency (tf) 
Sample    : 
    Terms 
Docs booking card change check confirm confirmation email make port wish 
12316  3 1  0  0  0   0  0 0 1 1 
137   4 1  2  0  1   0  0 0 0 0 
17618  4 1  0  0  0   0  0 2 0 2 
18082  2 1  3  1  1   0  0 0 1 0 
19141  3 0  2  0  1   0  0 0 1 0 
21862  2 0  0  0  0   0  0 1 0 0 
2756  1 0  2  0  0   0  0 1 0 1 
27578  2 1  5  0  0   0  0 0 0 1 
30312  4 1  2  0  0   0  0 2 0 2 
9019  1 1  1  0  0   0  0 0 0 0 
num [1:10, 1:10] 3 4 4 2 3 2 1 2 4 1 ... 
- attr(*, "dimnames")=List of 2 
..$ Docs : chr [1:10] "12316" "137" "17618" "18082" ... 
..$ Terms: chr [1:10] "booking" "card" "change" "check" ... 

> str(inspect(dtmc)) 
<<DocumentTermMatrix (documents: 38697, terms: 189)>> 
Non-/sparse entries: 204107/7109626 
Sparsity   : 97% 
Maximal term length: 13 
Weighting   : term frequency (tf) 
Sample    : 
     Terms 
Docs booking car change confirmation like number possible reservation return ticket 
    14091  0 0  0   0 2  0  0   2  0  0 
    18220  6 0  0   2 0  0  0   0  0  0 
    20103  1 0  1   0 0  1  0   0  0  0 
    20184  0 3  0   0 0  1  0   4  1  0 
    21005  3 5  0   1 2  0  1   0  0  0 
    24877  0 1  1   0 0  0  0   2  0  1 
    26135  0 0  0   0 0  0  0   1  0  0 
    28200  5 2  1   0 0  0  0   1  0  0 
    2979  12 7  2   0 1  0  0   0  0  0 
    680   0 0  1   2 0  1  0   0  0  0 
num [1:10, 1:10] 0 6 1 0 3 0 0 5 12 0 ... 
- attr(*, "dimnames")=List of 2 
    ..$ Docs : chr [1:10] "14091" "18220" "20103" "20184" ... 
    ..$ Terms: chr [1:10] "booking" "car" "change" "confirmation" ... 

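The expected output is a matrix with 38697 rows and up to (237 + 189) terms. Terms that occur in both DTMs get a single column each, with their frequencies summed, and terms that occur in only one DTM are carried over unchanged.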
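The expected output described above could look roughly like this in base R. It is a minimal sketch on small dense matrices, assuming both matrices have the same documents in the same order; `merge_counts` is a hypothetical helper name, and for the full 38697-row data a sparse representation would probably be preferable.

```r
# Hypothetical helper: union of terms, with counts summed for shared terms.
merge_counts <- function(a, b) {
  stopifnot(identical(rownames(a), rownames(b)))  # same documents, same order
  all_terms <- sort(union(colnames(a), colnames(b)))
  out <- matrix(0, nrow = nrow(a), ncol = length(all_terms),
                dimnames = list(rownames(a), all_terms))
  out[, colnames(a)] <- a                                      # copy counts from a
  out[, colnames(b)] <- out[, colnames(b), drop = FALSE] + b   # add counts from b
  out
}

# Made-up toy matrices standing in for as.matrix(dtmc) and as.matrix(dtmaf).
q <- matrix(c(1, 0,
              2, 1), nrow = 2, byrow = TRUE,
            dimnames = list(c("d1", "d2"), c("booking", "car")))
a <- matrix(c(3, 1,
              0, 4), nrow = 2, byrow = TRUE,
            dimnames = list(c("d1", "d2"), c("booking", "card")))

merged <- merge_counts(q, a)
merged
```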

Here is a reproducible example with 10 documents:

> dput(datamsg) 
structure(list(cmessage = c("No answer ?", "Hello the third number is . I bought this boarding card immediately after the operator has told me from the previous logbook the number can not be found in the system. Therefore I request to return money. It was not my fault !", 
"Hi I forget probably choose items on the How can I do this now. ", 
"Hi I forget probably choose items How can i do this now. ", 
"Hello I tell if I have booked . If not is it possible and what would it cost? ", 
"First I wanted to transfer fromThen I wanted to know if you can spontaneously postpone the return ", 
"Hello. Does the have an exact address? With this address I do not find it on the navigation. Have an exact address where I can get the ticets. Where I get the Tikets then. Is the automatic chekin. Or do I then mot the tickets to the Chekin. Thank you. But rather ask more questions. ", 
"Dear booked everything again. Also the journey through In my previous message I stated that it is a complete cancellation and I have booked the return trip. I do not intend to pay twice for travel. ", 
"Thank you. When will the new registration show ?...as it still shows the . Thanks", 
"So my phone number is .Please tell me how this works."), afreply = c("Hello afraid there is no space on the September. I have also checked but are all fully booked. Would you like us to check any other dates for you? ", 
"Hello As far as we can see the booking No was a valid reservation. We have however contacted and can confirm that administration fee was refunded back to your card. ", 
"Good afternoon You are currently booked as high plane. You have requested an amendment to change the height which will be more expensive. Could you please confirm the actual height of . We have cancelled you amendment request please submit a new one with an accurate height ofreply to this message. ", 
"Hello thanks for your message. I have checked and can see you have amended your height to on your booking. If you require any other assistance with your booking please contact us.", 
"Hello you booked any In order to make a change to your booking kindly send us a amendment request via", 
"Dear Mr. what dimensions you want to take with you? here is only the possibility to change your departure for a change of booking fee and a possible ticket price difference. The ticket price difference can be requested if you call us an alternative travel date.", 
"Dear Sir or Madam we will send you the address ", "Hello your crossing with was already refunded. As my colleague told you your with was still valid. In case you have booked a second ticket with please send us the new booking reference number but we cannot guarantee that you will be entitle to a refund. ", 
"if you can authorise us to take the payment from the card you used to make the we can then make the change.", 
"Good morning we could not reach you by telephone. If you do not have we can send you an invoice via PayPal. The change can not be made until paid. . Do you want to pay the change to 1. " 
)), .Names = c("cmessage", "afreply"), class = "data.frame", row.names = c(NA, 
-10L)) 

library(tm)  # provides Corpus(), VectorSource(), DocumentTermMatrix(), weightTf
corpus1<-Corpus(VectorSource(datamsg$cmessage)) 
corpus2<-Corpus(VectorSource(datamsg$afreply)) 
dtmc<-DocumentTermMatrix(corpus1, control = list(weighting = weightTf)) 
dtmaf<-DocumentTermMatrix(corpus2, control = list(weighting = weightTf)) 
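
Post an example of the wanted output. – PoGibas

Just added the expected output to the question above. – NKaz

But this doesn't help anyone. If you want your question answered, post a reproducible example and an example of the wanted output: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – PoGibas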

Answers

Your code:

#dput(datamsg) 
datamsg <- 
     structure(
       list(
         cmessage = c(
           "No answer ?", 
           "Hello the third number is . I bought this boarding card immediately after the operator has told me from the previous logbook the number can not be found in the system. Therefore I request to return money. It was not my fault !", 
           "Hi I forget probably choose items on the How can I do this now. ", 
           "Hi I forget probably choose items How can i do this now. ", 
           "Hello I tell if I have booked . If not is it possible and what would it cost? ", 
           "First I wanted to transfer fromThen I wanted to know if you can spontaneously postpone the return ", 
           "Hello. Does the have an exact address? With this address I do not find it on the navigation. Have an exact address where I can get the ticets. Where I get the Tikets then. Is the automatic chekin. Or do I then mot the tickets to the Chekin. Thank you. But rather ask more questions. ", 
           "Dear booked everything again. Also the journey through In my previous message I stated that it is a complete cancellation and I have booked the return trip. I do not intend to pay twice for travel. ", 
           "Thank you. When will the new registration show ?...as it still shows the . Thanks", 
           "So my phone number is .Please tell me how this works." 
         ), 
         afreply = c(
           "Hello afraid there is no space on the September. I have also checked but are all fully booked. Would you like us to check any other dates for you? ", 

           "Hello As far as we can see the booking No was a valid reservation. We have however contacted and can confirm that administration fee was refunded back to your card. ", 
           "Good afternoon You are currently booked as high plane. You have requested an amendment to change the height which will be more expensive. Could you please confirm the actual height of . We have cancelled you amendment request please submit a new one with an accurate height ofreply to this message. ", 
           "Hello thanks for your message. I have checked and can see you have amended your height to on your booking. If you require any other assistance with your booking please contact us.", 
           "Hello you booked any In order to make a change to your booking kindly send us a amendment request via", 
           "Dear Mr. what dimensions you want to take with you? here is only the possibility to change your departure for a change of booking fee and a possible ticket price difference. The ticket price difference can be requested if you call us an alternative travel date.", 
           "Dear Sir or Madam we will send you the address ", 
           "Hello your crossing with was already refunded. As my colleague told you your with was still valid. In case you have booked a second ticket with please send us the new booking reference number but we cannot guarantee that you will be entitle to a refund. ", 
           "if you can authorise us to take the payment from the card you used to make the we can then make the change.", 
           "Good morning we could not reach you by telephone. If you do not have we can send you an invoice via PayPal. The change can not be made until paid. . Do you want to pay the change to 1. " 
         ) 
       ), 
       .Names = c("cmessage", "afreply"), 
       class = "data.frame", 
       row.names = c(NA,-10L) 
     ) 

corpus1<-Corpus(VectorSource(datamsg$cmessage)) # 10 docs 
corpus2<-Corpus(VectorSource(datamsg$afreply)) # 10 docs 


dtmc<-DocumentTermMatrix(corpus1, control = list(weighting = weightTf)) 
dtmaf<-DocumentTermMatrix(corpus2, control = list(weighting = weightTf)) 

My code continues:

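library(tm)     # must be loaded before the Corpus() / DocumentTermMatrix() calls above
library(dplyr)  # inner_join(), anti_join(), arrange(), %>%
library(tibble) # rownames_to_column(), column_to_rownames()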
# rename anonymous document ids: 
rownames(dtmc) <- dtmc %>% rownames() %>% as.numeric() %>% sprintf("doc%05d", .) 
rownames(dtmaf) <- dtmaf %>% rownames() %>% as.numeric() %>% sprintf("doc%05d", .) 

# transform to termDocumentmatrix 
tdmc <- dtmc %>% t() 
tdmaf<- dtmaf %>% t() 

# introduce new first column "word" 
tdmc_df <- tdmc %>% as.matrix() %>% as.data.frame() %>% rownames_to_column(var = "word") 
tdmaf_df <- tdmaf %>% as.matrix() %>% as.data.frame() %>% rownames_to_column(var = "word") 

# find common words 
tdm_df <- tdmc_df %>% inner_join(tdmaf_df, by=c("word")) 
tdm_df <- tdm_df %>% arrange(word) 
dtm_df <- tdm_df %>% column_to_rownames("word") %>% t() 


# count occurrences of matching words (summed over questions and answers)
colSums(dtm_df) 

# find words that occur in the questions but not in the answers
dtm_df_nonmatching <- tdmc_df %>% anti_join(tdmaf_df, by=c("word")) %>% arrange(word) %>% column_to_rownames("word") 

# count occurrences of these nonmatching words
rowSums(dtm_df_nonmatching) 

Common words and their counts:

colSums(dtm_df) 
 address     also      and   booked      but      can     card     dear      for     from     have    hello  message
       4        2        5        7        3       13        3        3        4        2       12        8        3
    more      new      not   number      pay   please possible  request    still   thanks     that      the     then
       2        3        8        4        2        5        2        3        2        2        3       32        3
    this     told   travel      was     what     will     with    would      you
       6        2        2        5        2        4        7        2       25
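For comparison, the same idea (restrict to the common terms, then sum question and answer counts) can be sketched without dplyr. This is an illustration on made-up dense matrices, not a replacement for the code above; it assumes both matrices have the same documents in the same order.

```r
# Made-up toy matrices standing in for as.matrix(dtmc) and as.matrix(dtmaf).
q <- rbind(d1 = c(1, 0, 2),
           d2 = c(2, 1, 0))
colnames(q) <- c("booking", "car", "change")
a <- rbind(d1 = c(3, 1, 0),
           d2 = c(0, 4, 1))
colnames(a) <- c("booking", "card", "change")

# Terms present in both matrices.
common <- intersect(colnames(q), colnames(a))

# Per-document counts for the common terms, questions plus answers.
per_doc <- q[, common, drop = FALSE] + a[, common, drop = FALSE]

# Overall totals per common term, comparable to colSums(dtm_df) above.
totals <- colSums(per_doc)
totals
```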

Here is a simpler approach using the quanteda package.

library("quanteda") 
packageVersion("quanteda") 
# [1] ‘0.99.9’ 

First, we create two document-feature matrices and find the terms they have in common:

dfm_c <- dfm(datamsg$cmessage, remove_punct = TRUE) 
dfm_af <- dfm(datamsg$afreply, remove_punct = TRUE) 
common_feature_names <- intersect(featnames(dfm_c), featnames(dfm_af)) 

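Then we can use cbind(), which (correctly) warns that combining the two now gives you duplicated features. The second line selects the common features, and the third line merges features with the same name in the dfm by summing them, which is what you want.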

combined_dfm <- cbind(dfm_c, dfm_af) %>% 
    dfm_select(pattern = common_feature_names) %>% 
    dfm_compress() 
head(combined_dfm) 
# Document-feature matrix of: 6 documents, 6 features (41.7% sparse). 
# 6 x 6 sparse Matrix of class "dfmSparse" 
#  features 
# docs no hello the number is i 
# text1 2  1 1  0 1 1 
# text2 1  2 6  2 1 2 
# text3 0  0 3  0 0 2 
# text4 0  1 0  0 0 3 
# text5 0  2 0  0 1 2 
# text6 0  0 3  0 1 2 
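To make the "combine, then merge same-named features" step concrete, here is a base-R sketch of the idea on made-up dense matrices. It is not quanteda's implementation, only an illustration of what summing identically named columns means; rowsum() is applied to the transposed matrix so that columns sharing a name are added together.

```r
# Made-up toy matrices with partly overlapping column names.
q <- rbind(d1 = c(1, 0, 2),
           d2 = c(2, 1, 0))
colnames(q) <- c("booking", "car", "change")
a <- rbind(d1 = c(3, 1, 0),
           d2 = c(0, 4, 1))
colnames(a) <- c("booking", "card", "change")

# cbind() keeps both copies of "booking" and "change" as separate columns.
combined <- cbind(q, a)

# rowsum() sums rows by group; on the transpose this sums same-named columns.
compressed <- t(rowsum(t(combined), group = colnames(combined)))
compressed
```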

If you really want to go back to tm, you can convert it like this:

convert(combined_dfm, to = "tm") 
# <<DocumentTermMatrix (documents: 10, terms: 49)>> 
# Non-/sparse entries: 189/301 
# Sparsity   : 61% 
# Maximal term length: 8 
# Weighting   : term frequency (tf) 

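Note: You have not stated whether you might need to combine dfms with different documents, so I am assuming here (from the example) that the documents are the same. If they differ, that is also easy to solve, but it was not specified in the question.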
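If the two matrices did cover different documents, one conceivable approach is to align both on the union of document ids before adding them. The sketch below uses base R on made-up dense matrices; it assumes documents are identified by row name and that a missing document should count as all zeros, and `align_rows` is a hypothetical helper name.

```r
# Hypothetical helper: expand m to the given document ids, zero-filling new rows.
align_rows <- function(m, doc_ids) {
  out <- matrix(0, nrow = length(doc_ids), ncol = ncol(m),
                dimnames = list(doc_ids, colnames(m)))
  out[rownames(m), ] <- m
  out
}

# Made-up toy matrices with the same terms but different documents.
q <- rbind(d1 = c(1, 2),
           d2 = c(2, 0))
a <- rbind(d2 = c(3, 1),
           d3 = c(0, 4))
colnames(q) <- colnames(a) <- c("booking", "change")

all_docs <- union(rownames(q), rownames(a))
total <- align_rows(q, all_docs) + align_rows(a, all_docs)
total
```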