I have the following training set, which I am using for machine learning with the RTextTools library:
Text,y
MRR 93345,1
MRR 93434,1
MRR 93554,1
MRR 938900,1
MRR 93970,1
MRR 937899,1
MRR 93868,1
MRR 938769,1
MRR 93930,1
MRR 92325,1
MRR 931932,1
MRR 933922,1
MRR 934390,1
MRR 93204,1
MRR 93023,1
MRR 930982,1
MRR 87678,-1
MRR 87956,-1
MRR 87890,-1
MRR 878770,-1
MRR 877886,-1
MRR 87678367,-1
MRR 8790,-1
MRR 87345,-1
MRR 87149,-1
MRR 873790,-1
MRR 873493,-1
MRR 874303,-1
MRR 874343,-1
MRR 874304,-1
MRR 879034,-1
MRR 879430,-1
MRR 87943,-1
MRR 879434,-1
MRR 871984,-1
MRR 873949,-1
My code is as follows:
# Create the document term matrix
dtMatrix <- create_matrix(data["Text"], language="english",
                          removePunctuation=TRUE, stripWhitespace=TRUE,
                          toLower=TRUE, removeStopwords=TRUE,
                          stemWords=TRUE, removeSparseTerms=.998)
# Configure the training data
container <- create_container(dtMatrix, data$y, trainSize=1:nrow(dtMatrix), virgin=FALSE)
# train a SVM Model
model <- train_model(container, "SVM", kernel="linear" ,cost=1)
# new data
predictionData <- list("MRR 93111")
# create a prediction document term matrix
predMatrix <- create_matrix(predictionData, originalMatrix=dtMatrix,
                            language="english",
                            removePunctuation=TRUE, stripWhitespace=TRUE,
                            toLower=TRUE, removeStopwords=TRUE,
                            stemWords=TRUE, removeSparseTerms=.998)
# create the corresponding container
predSize <- length(predictionData)
predictionContainer <- create_container(predMatrix, labels=rep(0,predSize), testSize=1:predSize, virgin=FALSE)
# predict
results <- classify_model(predictionContainer, model)
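Now, using the model returned by train_model, I want "MRR 93111" to be predicted as y = 1. In other words, if the string starts with "MRR 93" the output should be 1, and if it starts with "MRR 87" the output should be -1. In practice this does not work, because I get:
MRR 93111 -1 0.5778781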
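One possible explanation, sketched here in base R rather than with RTextTools: if the document-term matrix is built from whitespace-separated tokens (an assumption about the tokenizer, though it is how bag-of-words matrices usually work), then "MRR" and the number become two separate terms. The number in a new string has never been seen in training, and "MRR" occurs in both classes, so no informative term is shared. The helpers `tokenize` and `add_prefix` below are hypothetical names introduced only for this illustration.

```r
# Toy subset of the training texts from the question (labels omitted).
train_text <- c("MRR 93345", "MRR 93434", "MRR 87678", "MRR 87956")
new_text   <- "MRR 93111"

# Hypothetical helper: lower-case and split on whitespace, roughly what a
# bag-of-words tokenizer does.
tokenize <- function(x) unlist(strsplit(tolower(x), "[[:space:]]+"))

# Terms the new string has in common with the training vocabulary.
vocab      <- unique(tokenize(train_text))
new_tokens <- tokenize(new_text)
shared     <- intersect(new_tokens, vocab)
print(shared)

# Hypothetical workaround: append a token that fuses the letters with the
# first two digits, so the prefix itself becomes a term.
add_prefix <- function(x) {
  prefix <- sub("^([A-Za-z]+) +([0-9]{2}).*$", "\\1\\2", x)
  paste(x, prefix)
}
print(add_prefix(new_text))
```

Passing the texts through something like `add_prefix` before building the matrix would give the classifier a term that actually separates the two classes; whether this is the best encoding for the real data is a judgment call.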
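In addition, the result seems to change if I order the training set differently, or if I run the script several times on the same dataset, which seems strange to me.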
UPDATE1: dput(data)
structure(list(Text = structure(c(26L, 28L, 30L, 34L, 36L, 31L,
32L, 33L, 35L, 21L, 24L, 27L, 29L, 25L, 22L, 23L, 10L, 20L, 14L,
13L, 12L, 11L, 15L, 3L, 1L, 5L, 4L, 7L, 9L, 8L, 16L, 18L, 17L,
19L, 2L, 6L), .Label = c("MRR 87149", "MRR 871984", "MRR 87345",
"MRR 873493", "MRR 873790", "MRR 873949", "MRR 874303", "MRR 874304",
"MRR 874343", "MRR 87678", "MRR 87678367", "MRR 877886", "MRR 878770",
"MRR 87890", "MRR 8790", "MRR 879034", "MRR 87943", "MRR 879430",
"MRR 879434", "MRR 87956", "MRR 92325", "MRR 93023", "MRR 930982",
"MRR 931932", "MRR 93204", "MRR 93345", "MRR 933922", "MRR 93434",
"MRR 934390", "MRR 93554", "MRR 937899", "MRR 93868", "MRR 938769",
"MRR 938900", "MRR 93930", "MRR 93970"), class = "factor"), Y = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, -1L,
-1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L,
-1L, -1L, -1L, -1L, -1L, -1L)), .Names = c("Text", "Y"), class = "data.frame", row.names = c(NA,
-36L))
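A side note on the dput output above, as a small sketch: it names the label column `Y`, while the code reads `data$y`. If the data frame really has the names shown in dput (the CSV listing uses a lower-case `y`, so this may not apply), `$` is case-sensitive and `data$y` would be NULL. The two-row data frame below is a stand-in for the real data.

```r
# Stand-in for the real data, with the column names shown in the dput output.
data <- data.frame(Text = factor(c("MRR 93345", "MRR 87678")),
                   Y    = c(1L, -1L))

# `$` does not match "y" against "Y", so this yields NULL.
label_missing <- is.null(data$y)
print(label_missing)

# Reading the column by its actual name, and converting the factor to text.
labels <- data$Y
texts  <- as.character(data$Text)
print(names(data))
```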
Could you provide us with the dput of your training set instead of writing it out? – JonGrub
See UPDATE1: is this what you need? – unclejohn00