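Merge two document term matrices by row

I have customer queries and customer-service answers in a csv file. I need to determine the topic of each question and then develop a classification model based on that. After cleaning the documents I created two document term matrices, one for the questions and one for the answers. I reduced their size by keeping only the terms that are used more than 400 times across all documents (about 40k questions and answers).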
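As a rough illustration of the frequency cut described above, the sketch below keeps only the columns whose total count reaches a threshold. The toy matrix `m` and the name `min_count` are made up for the example, and whether the cut-off should be `>=` or `>` is an assumption. On real data, `as.matrix()` on a tm DocumentTermMatrix should give the same kind of dense matrix.

```r
# Toy document-term counts (hypothetical data, 3 documents x 3 terms).
m <- matrix(c(2, 0, 1,
              3, 1, 0,
              1, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(Docs = c("1", "2", "3"),
                            Terms = c("booking", "port", "wish")))

min_count <- 2                    # toy threshold; the question uses 400
term_totals <- colSums(m)         # total frequency of each term over all documents

# Keep only the terms that reach the threshold (assumed to be inclusive).
m_small <- m[, term_totals >= min_count, drop = FALSE]
print(colnames(m_small))
```

In tm itself, `findFreqTerms(dtm, lowfreq = 400)` returns the names of such terms, which can then be used to subset the matrix by column.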
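I want to create a data frame that merges these two matrices by row, keeping only the words that are common to the question dtm and the answer dtm and adding up their frequencies. How should I do this in R? I then want to tag each question with its highest-frequency word.
Any help or suggestions on the approach are highly appreciated.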
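One possible reading of this request, sketched in base R on toy data: take the terms present in both matrices, add the two counts cell by cell, and tag each row with the column holding the largest sum. The matrices `q` and `a` are hypothetical stand-ins for `as.matrix(dtmc)` and `as.matrix(dtmaf)`, and the sketch assumes that row i of both matrices refers to the same question/answer pair.

```r
# Hypothetical question counts (stand-in for as.matrix(dtmc)).
q <- matrix(c(3, 0, 1,
              0, 2, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(Docs = c("1", "2"),
                            Terms = c("booking", "car", "change")))
# Hypothetical answer counts (stand-in for as.matrix(dtmaf)).
a <- matrix(c(1, 0, 2,
              0, 1, 3),
            nrow = 2, byrow = TRUE,
            dimnames = list(Docs = c("1", "2"),
                            Terms = c("booking", "card", "change")))

# Terms that occur in both matrices.
common <- intersect(colnames(q), colnames(a))

# Sum the frequencies of the shared terms, document by document.
combined <- q[, common, drop = FALSE] + a[, common, drop = FALSE]
combined_df <- as.data.frame(combined)

# Tag each document with its highest-frequency shared term.
# Note: a row of all zeros would still get the first term as its tag.
top_term <- colnames(combined)[max.col(combined, ties.method = "first")]
print(data.frame(combined_df, top_term))
```

With ties, `ties.method = "first"` picks the leftmost term. Whether a single top word is a good enough topic label is a separate modelling question.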
> str(inspect(dtmaf))
<<DocumentTermMatrix (documents: 38697, terms: 237)>>
Non-/sparse entries: 326124/8845065
Sparsity : 96%
Maximal term length: 13
Weighting : term frequency (tf)
Sample :
       Terms
Docs    booking card change check confirm confirmation email make port wish
  12316       3    1      0     0       0            0     0    0    1    1
  137         4    1      2     0       1            0     0    0    0    0
  17618       4    1      0     0       0            0     0    2    0    2
  18082       2    1      3     1       1            0     0    0    1    0
  19141       3    0      2     0       1            0     0    0    1    0
  21862       2    0      0     0       0            0     0    1    0    0
  2756        1    0      2     0       0            0     0    1    0    1
  27578       2    1      5     0       0            0     0    0    0    1
  30312       4    1      2     0       0            0     0    2    0    2
  9019        1    1      1     0       0            0     0    0    0    0
num [1:10, 1:10] 3 4 4 2 3 2 1 2 4 1 ...
- attr(*, "dimnames")=List of 2
..$ Docs : chr [1:10] "12316" "137" "17618" "18082" ...
..$ Terms: chr [1:10] "booking" "card" "change" "check" ...
> str(inspect(dtmc))
<<DocumentTermMatrix (documents: 38697, terms: 189)>>
Non-/sparse entries: 204107/7109626
Sparsity : 97%
Maximal term length: 13
Weighting : term frequency (tf)
Sample :
       Terms
Docs    booking car change confirmation like number possible reservation return ticket
  14091       0   0      0            0    2      0        0           2      0      0
  18220       6   0      0            2    0      0        0           0      0      0
  20103       1   0      1            0    0      1        0           0      0      0
  20184       0   3      0            0    0      1        0           4      1      0
  21005       3   5      0            1    2      0        1           0      0      0
  24877       0   1      1            0    0      0        0           2      0      1
  26135       0   0      0            0    0      0        0           1      0      0
  28200       5   2      1            0    0      0        0           1      0      0
  2979       12   7      2            0    1      0        0           0      0      0
  680         0   0      1            2    0      1        0           0      0      0
num [1:10, 1:10] 0 6 1 0 3 0 0 5 12 0 ...
- attr(*, "dimnames")=List of 2
..$ Docs : chr [1:10] "14091" "18220" "20103" "20184" ...
..$ Terms: chr [1:10] "booking" "car" "change" "confirmation" ...
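The expected output is a matrix with (237 + 189) terms and 38697 rows. Terms that match between the two dtms should get a single column with their frequencies summed, and terms that do not match should be carried over as they are.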
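A minimal base R sketch of that kind of merge, assuming both matrices have the same documents in the same row order: build a zero matrix over the union of the terms, copy the first matrix in, then add the second. The function name `merge_dtm_sum` and the toy matrices `q` and `a` are made up for illustration. On the real data the inputs would be `as.matrix(dtmc)` and `as.matrix(dtmaf)`.

```r
# Merge two document-term count matrices over the union of their terms.
# Shared terms get one column with summed counts; other terms are kept as-is.
merge_dtm_sum <- function(x, y) {
  stopifnot(nrow(x) == nrow(y))          # same documents, same order (assumed)
  all_terms <- union(colnames(x), colnames(y))
  out <- matrix(0, nrow = nrow(x), ncol = length(all_terms),
                dimnames = list(Docs = rownames(x), Terms = all_terms))
  out[, colnames(x)] <- x
  out[, colnames(y)] <- out[, colnames(y), drop = FALSE] + y
  out
}

# Hypothetical toy inputs: "booking" and "change" are shared terms.
q <- matrix(c(3, 0, 1,
              0, 2, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(Docs = c("1", "2"),
                            Terms = c("booking", "car", "change")))
a <- matrix(c(1, 0, 2,
              0, 1, 3),
            nrow = 2, byrow = TRUE,
            dimnames = list(Docs = c("1", "2"),
                            Terms = c("booking", "card", "change")))

merged <- merge_dtm_sum(q, a)
print(merged)
```

The result has one column per distinct term, so its width is the two term counts added together minus the number of shared terms. For much larger vocabularies a dense matrix may not fit in memory, and a sparse representation would be the safer choice.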
Here is a reproducible example with 10 documents:
> dput(datamsg)
structure(list(cmessage = c("No answer ?", "Hello the third number is . I bought this boarding card immediately after the operator has told me from the previous logbook the number can not be found in the system. Therefore I request to return money. It was not my fault !",
"Hi I forget probably choose items on the How can I do this now. ",
"Hi I forget probably choose items How can i do this now. ",
"Hello I tell if I have booked . If not is it possible and what would it cost? ",
"First I wanted to transfer fromThen I wanted to know if you can spontaneously postpone the return ",
"Hello. Does the have an exact address? With this address I do not find it on the navigation. Have an exact address where I can get the ticets. Where I get the Tikets then. Is the automatic chekin. Or do I then mot the tickets to the Chekin. Thank you. But rather ask more questions. ",
"Dear booked everything again. Also the journey through In my previous message I stated that it is a complete cancellation and I have booked the return trip. I do not intend to pay twice for travel. ",
"Thank you. When will the new registration show ?...as it still shows the . Thanks",
"So my phone number is .Please tell me how this works."), afreply = c("Hello afraid there is no space on the September. I have also checked but are all fully booked. Would you like us to check any other dates for you? ",
"Hello As far as we can see the booking No was a valid reservation. We have however contacted and can confirm that administration fee was refunded back to your card. ",
"Good afternoon You are currently booked as high plane. You have requested an amendment to change the height which will be more expensive. Could you please confirm the actual height of . We have cancelled you amendment request please submit a new one with an accurate height ofreply to this message. ",
"Hello thanks for your message. I have checked and can see you have amended your height to on your booking. If you require any other assistance with your booking please contact us.",
"Hello you booked any In order to make a change to your booking kindly send us a amendment request via",
"Dear Mr. what dimensions you want to take with you? here is only the possibility to change your departure for a change of booking fee and a possible ticket price difference. The ticket price difference can be requested if you call us an alternative travel date.",
"Dear Sir or Madam we will send you the address ", "Hello your crossing with was already refunded. As my colleague told you your with was still valid. In case you have booked a second ticket with please send us the new booking reference number but we cannot guarantee that you will be entitle to a refund. ",
"if you can authorise us to take the payment from the card you used to make the we can then make the change.",
"Good morning we could not reach you by telephone. If you do not have we can send you an invoice via PayPal. The change can not be made until paid. . Do you want to pay the change to 1. "
)), .Names = c("cmessage", "afreply"), class = "data.frame", row.names = c(NA,
-10L))
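library(tm)
# One corpus for the customer messages, one for the replies
corpus1 <- Corpus(VectorSource(datamsg$cmessage))
corpus2 <- Corpus(VectorSource(datamsg$afreply))
# Term-frequency document term matrices for questions (dtmc) and answers (dtmaf)
dtmc <- DocumentTermMatrix(corpus1, control = list(weighting = weightTf))
dtmaf <- DocumentTermMatrix(corpus2, control = list(weighting = weightTf))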
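Post an example of the wanted output. – PoGibas
Just added the expected output in the question above. – NKaz
This doesn't help anyone. If you want your question to be answered, post a reproducible example and an example of the wanted output: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – PoGibas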