H2O: How to use gradient boosting on text data?

I want to implement a very simple ML problem in which I use text to predict some outcome. In R, a basic example would be:

Import some fake, but funny, text data

library(caret) 
library(dplyr) 
library(text2vec) 

# tibble() replaces the deprecated data_frame()
dataframe <- tibble(id = c(1, 2, 3, 4),
                    text = c("this is a this", "this is another", "hello", "what???"),
                    value = c(200, 400, 120, 300),
                    output = c("win", "lose", "win", "lose"))

> dataframe
# A tibble: 4 x 4
     id text            value output
  <dbl> <chr>           <dbl> <chr>
1     1 this is a this    200 win
2     2 this is another   400 lose
3     3 hello             120 win
4     4 what???           300 lose

Use text2vec to get a sparse matrix representation of my text (see also https://github.com/dselivanov/text2vec/blob/master/vignettes/text-vectorization.Rmd)

#these are text2vec functions to tokenize and lowercase the text 
prep_fun = tolower 
tok_fun = word_tokenizer 

#create the tokens 
train_tokens = dataframe$text %>% 
    prep_fun %>% 
    tok_fun 

it_train = itoken(train_tokens)  
vocab = create_vocabulary(it_train) 
vectorizer = vocab_vectorizer(vocab) 
dtm_train = create_dtm(it_train, vectorizer) 

> dtm_train
4 x 6 sparse Matrix of class "dgCMatrix"
  what hello another a is this
1    .     .       . 1  1    2
2    .     .       1 .  1    1
3    .     1       . .  .    .
4    1     .       . .  .    .

Finally, train an algorithm on my sparse matrix (for instance, using caret) to predict output.

mymodel <- train(x=dtm_train, y =dataframe$output, method="xgbTree") 

> confusionMatrix(mymodel)
Bootstrapped (25 reps) Confusion Matrix

(entries are percentual average cell counts across resamples)

          Reference
Prediction lose  win
      lose 17.6 44.1
      win  29.4  8.8

 Accuracy (average) : 0.264

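My problem is:

I can see how to import data into H2O using spark_read_csv, rsparkling and as_h2o_frame. However, for steps 2 and 3 above (building the sparse text matrix and training the model) I am completely lost.

Can someone please give me some hints, or tell me whether this approach is even possible with H2O?

Many thanks!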

Source

2017-06-14 ℕʘʘḆḽḘ

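What is the it_train variable? I think you are missing a step in the code (it is almost reproducible, but not quite).

Hi @ErinLeDell, you are right. Hold on a second.

@ErinLeDell question updated!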

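You can solve this in one of two ways: 1. do the text processing in R first, then move to H2O for modeling, or 2. do everything in H2O using H2O's word2vec implementation.

Use R data.frames and text2vec, then convert the sparse matrix to an H2O frame and do the modeling in H2O.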

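# Use same code as above to get to this point, then:
library(h2o)
h2o.init()

# Convert dgCMatrix to H2OFrame, cbind the response col
# (as.factor so that the response is categorical, as "bernoulli" requires)
train <- as.h2o(dtm_train)
train$y <- as.h2o(as.factor(dataframe$output))

# Train any H2O model (e.g. GBM)
# min_rows = 1 is only needed because the toy data has just 4 rows
mymodel <- h2o.gbm(y = "y", training_frame = train,
                   distribution = "bernoulli", seed = 1,
                   min_rows = 1)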

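Alternatively, you can train a word2vec embedding in H2O and apply it to your text to get the equivalent of a sparse matrix. Then train an H2O machine learning model (GBM). I will try to edit this answer later with a working example using your data, but in the meantime, here is an example that demonstrates the use of H2O's word2vec functionality in R.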
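As a rough sketch of what this second route might look like on the toy data from the question (this is not the answerer's promised example): the text is tokenized in H2O, a word2vec model is trained on the tokens, each document is represented by the average of its word vectors, and a GBM is trained on those averages. The settings vec_size = 10, min_word_freq = 1, epochs = 10, ntrees = 5 and min_rows = 1 are assumptions chosen only so that four tiny documents run at all, and splitting on a single space is a simplification. It needs a running H2O cluster.

```r
library(h2o)
h2o.init()

# Toy data from the question
texts  <- c("this is a this", "this is another", "hello", "what???")
output <- c("win", "lose", "win", "lose")

# Upload the text and make sure the column is a string column
text_hf <- as.h2o(data.frame(text = texts, stringsAsFactors = FALSE))
text_col <- h2o.ascharacter(text_hf$text)

# Tokenize on spaces and lowercase; documents are separated by NA rows
words <- h2o.tolower(h2o.tokenize(text_col, " "))

# Train word2vec (assumed toy settings; min_word_freq = 1 keeps every word)
w2v <- h2o.word2vec(words, vec_size = 10, min_word_freq = 1,
                    sent_sample_rate = 0, epochs = 10)

# One row per document: the average of its word vectors
doc_vecs <- h2o.transform(w2v, words, aggregate_method = "AVERAGE")

# Attach the response as a factor and train a GBM
train <- h2o.cbind(doc_vecs, as.h2o(data.frame(y = factor(output))))
w2v_gbm <- h2o.gbm(y = "y", training_frame = train,
                   distribution = "bernoulli", seed = 1,
                   ntrees = 5, min_rows = 1)
```

Note that the averaged vectors are dense rather than sparse, so this is an alternative representation of the text, not a reproduction of the text2vec document-term matrix.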

Source

2017-06-15 06:25:22

This is really cool, @ErinLeDell. Looking forward to your working example! Thanks

But where did you see a question about word2vec? text2vec != word2vec. The question is about how to export a sparse matrix to H2O! And to do that, convert the matrix to svmlight format.
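The svmlight route mentioned in this comment could be sketched roughly as follows. The helper to_svmlight is hypothetical (it is not part of text2vec or h2o), the 1/0 encoding of win/lose is an assumption, and the matrix is rebuilt by hand from the dtm_train printout in the question so that the snippet runs on its own. Each output line has the form "label index:value ...", with 1-based feature indices in ascending order.

```r
library(Matrix)

# Hypothetical helper: turn a sparse matrix and a numeric label vector
# into SVMLight-formatted lines, one line per row of the matrix.
to_svmlight <- function(x, y) {
  stopifnot(nrow(x) == length(y))
  trip <- summary(as(x, "CsparseMatrix"))   # triplets: i (row), j (col), x (value)
  trip <- trip[order(trip$i, trip$j), ]     # ascending feature index within each row
  rows <- factor(trip$i, levels = seq_len(nrow(x)))
  feats <- tapply(paste0(trip$j, ":", trip$x), rows, paste, collapse = " ")
  feats[is.na(feats)] <- ""                 # rows with no non-zero entries
  trimws(paste(y, feats))
}

# Same content as dtm_train in the question (columns: what hello another a is this)
dtm <- sparseMatrix(i = c(1, 1, 1, 2, 2, 2, 3, 4),
                    j = c(4, 5, 6, 3, 5, 6, 2, 1),
                    x = c(1, 1, 2, 1, 1, 1, 1, 1),
                    dims = c(4, 6))
labels <- c(1, 0, 1, 0)                     # assumed encoding: win = 1, lose = 0

svm_lines <- to_svmlight(dtm, labels)
svm_path <- tempfile(fileext = ".svmlight")
writeLines(svm_lines, svm_path)

# With a running H2O cluster, the file could then be loaded with something like:
# train <- h2o.importFile(svm_path)
```

Column names are lost in this format, so the vocabulary order has to be kept on the R side if feature names matter later.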

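The task Noobie is asking about is the H2O equivalent of training a model on text; my point is that you can perform the same task in H2O (using Word2Vec). The solution above lets you keep using text2vec, but it also requires doing the text-processing computation in R memory (as opposed to distributed H2O w2v), which is why I suggest H2O w2v as a solution.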
