2017-07-28 59 views

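R - How to tune a text classifier created with RTextTools

I am trying to create a text classifier using the RTextTools library in R. The training and testing data frames have the same format: two columns, the first holding the text and the second holding the label.

A minimal reproducible example of my program so far (with substituted data):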

# Packages 
## Install 
install.packages(c('e1071', 'RTextTools'))
## Import 
library(e1071) 
library(RTextTools) 

data.train <- data.frame("content" = c("Lorem Ipsum is simply dummy text of the printing and typesetting industry.", "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.", "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."), "label" = c("yes", "yes", "no")) 
data.test <- data.frame("content" = c("It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout.", "The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English.", "Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy."), "label" = c("no", "yes", "yes")) 

# Process training dataset 
data.train.dtm <- create_matrix(data.train$content, language = "english", weighting = tm::weightTfIdf, removePunctuation = TRUE, removeNumbers = TRUE, removeSparseTerms = 0, removeStopwords = TRUE, stemWords = TRUE, stripWhitespace = TRUE, toLower = TRUE) 
data.train.container <- create_container(data.train.dtm, data.train$label, trainSize = 1:nrow(data.train), virgin = FALSE) 

# Create linear SVM model 
model.linear <- train_model(data.train.container, "SVM", kernel = "linear", cost = 10, gamma = 1^-2) 

# Process testing dataset 
data.test.dtm <- create_matrix(data.test$content, originalMatrix = data.train.dtm) 
data.test.container <- create_container(data.test.dtm, labels = rep(0, nrow(data.test)), testSize = 1:nrow(data.test), virgin = FALSE) 

# Classify testing dataset 
model.linear.results <- classify_model(data.test.container, model.linear) 
model.linear.results.table <- table(Predicted = model.linear.results$SVM_LABEL, Actual = data.test$label) 
model.linear.results.table 

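The code I have so far works, and it produces a table comparing the predicted values against the actual values. The results are very inaccurate, though, so it is clear to me that the model needs tuning.

I know the e1071 library (on which RTextTools is based) contains a function that returns the best cost and gamma values for producing the best results. The problem with using it is that the data argument of the tune.svm function expects a data frame, but since I am building a text classifier I am not feeding a simple data frame into the SVM; I am feeding it a document-term matrix.

To no avail, I tried reading the DTM in as a data frame like this: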

model.tuned <- tune.svm(label~., data = as.data.frame(data.train.dtm), gamma = 10^(-6:-1), cost = 10^(-1:1)) 

I am completely lost, and any insight would be greatly appreciated.

Answers

1

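You can look at the code in train_model (press F2 in RStudio) to see how it calls svm() with the container (in your case, data.train.container). By default, train_model uses

  • cross=0 (no cross-validation is performed on the training data)
  • cost=100 (the cost of constraint violations)
  • probability=TRUE (the model should allow probability predictions)
  • kernel="radial" (a radial kernel is used for SVM training)

as the arguments passed to svm().

To actually answer your question, the container returned by create_container() has the slots training_matrix and training_codes, which you can use as follows: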
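To make those defaults concrete, here is a minimal sketch of an equivalent direct call to e1071's svm(). The small dense matrix x and the factor y are made-up stand-ins (an assumption, not data from the question) for the container's training matrix and labels, so the snippet runs without RTextTools.

```r
library(e1071)

# Hypothetical stand-in data: 6 "documents" x 3 term weights.
# In the real workflow these would come from the RTextTools container.
x <- matrix(c(1.0, 0.0, 0.0,
              0.9, 0.1, 0.0,
              0.8, 0.0, 0.2,
              0.0, 1.0, 0.1,
              0.1, 0.9, 0.0,
              0.0, 0.8, 0.2),
            ncol = 3, byrow = TRUE)
y <- factor(c("yes", "yes", "yes", "no", "no", "no"))

# Same argument values that the answer lists as train_model's SVM defaults
model <- svm(x = x, y = y,
             cross = 0,           # no cross-validation
             cost = 100,          # cost of constraint violations
             probability = TRUE,  # allow probability predictions
             kernel = "radial")   # radial kernel

# Predict on the training rows just to show the model is usable
pred <- predict(model, x)
```

Any of these values (and gamma, which is left at svm()'s own default here) can be changed in this direct call, which is what a tuning function varies for you.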

model.tuned <- tune.svm(x = data.train.container@training_matrix,
         y = data.train.container@training_codes,
         gamma = 10^(-6:-1),
         cost = 10^(-1:1)
         # fill in any other SVM params as needed here
         )
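Once tuning has run, the result object holds the winning parameter combination and a model refit with it. The sketch below shows how that could be read back. It uses randomly generated stand-in data (an assumption, not the question's documents) and 5-fold cross-validation set through tune.control(), because the default 10-fold split needs more rows than a three-document toy set provides.

```r
library(e1071)

# Hypothetical stand-in data: two noisy clusters of 10 points each
set.seed(42)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 3), ncol = 2))
y <- factor(rep(c("no", "yes"), each = 10))

# Grid search over the same gamma and cost ranges as above
tuned <- tune.svm(x = x, y = y,
                  gamma = 10^(-6:-1),
                  cost = 10^(-1:1),
                  tunecontrol = tune.control(cross = 5))

# One-row data frame with the best gamma and cost found
best <- tuned$best.parameters
print(best)

# svm model refit on all the data with those parameters
best.model <- tuned$best.model
```

The tuned cost could then be supplied as the cost argument of train_model. Whether train_model also forwards gamma to svm() is worth checking in its source before relying on it.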