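I understand that R's randomForest function can only handle categorical predictors with fewer than 54 categories (that is, at most 53). However, even after trimming my categorical predictor down to fewer than 54 categories, I still get the error. The only question about this limit that I have seen on Stack Overflow asks how to get around it; I am instead trying to reduce my number of categories to stay within the function's limit, and I still get the error. Related question: randomForest Categorical Predictor Limits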
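The following script creates a data frame so that we can predict 'profession'. As expected, because of the 'college_id' variable, running randomForest() on 'df' produces the "Can not handle categorical predictors with more than 53 categories." error.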
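However, when I trim my dataset to include only the top 40 college IDs, I get the same error. Am I missing some basic data frame concept, such that the 'df2' data frame still retains all of the categories even though only 40 of them are now populated? What workaround options can I use?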
library(dplyr)
library(randomForest)
# create data frame
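# stringsAsFactors = TRUE keeps 'profession' and 'zip' as factors on R >= 4.0,
# where data.frame() no longer converts strings to factors by default
df <- data.frame(profession = sample(c("accountant", "lawyer", "dentist"), 10000, replace = TRUE),
                 zip = sample(c("32801", "32807", "32827", "32828"), 10000, replace = TRUE),
                 salary = sample(c(50000:150000), 10000, replace = TRUE),
                 college_id = as.factor(c(sample(c(1001:1040), 9200, replace = TRUE),
                                          sample(c(1050:9999), 800, replace = TRUE))),
                 stringsAsFactors = TRUE)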
# results in error, as expected
rfm <- randomForest(profession ~ ., data = df)
# arrange college_ids by count and retain the top 40 in the 'df2' data frame
sdf <- df %>%
  dplyr::group_by(college_id) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n))
sdf <- sdf[1:40, ]
df2 <- dplyr::inner_join(df, sdf, by = "college_id")
df2$n <- NULL
# confirm that df2 only contains 40 categories of 'college_id'
nrow(df2[which(!duplicated(df2$college_id)), ])
# THIS IS WHAT I WANT TO RUN, BUT STILL RESULTS IN ERROR
rfm2 <- randomForest(profession ~ ., data = df2)
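
A likely explanation is that filtering rows does not shrink a factor's level set: the 'college_id' column in 'df2' can still carry every level from 'df', even though only 40 of them occur in the rows. The sketch below is a minimal base R illustration under that assumption. It uses a small stand-in data frame (this 'df' and the helper name 'top40' are made up for the example, not taken from the script above) to show how the count of distinct values and the count of levels can disagree, and how droplevels() reconciles them.

```r
# Stand-in data: 100 college IDs, of which the first 40 are frequent.
df <- data.frame(
  college_id = factor(c(rep(1001:1040, each = 20), 1041:1100)),
  salary     = seq_len(860)
)

# Keep only rows whose college_id is among the 40 most frequent values.
top40 <- names(sort(table(df$college_id), decreasing = TRUE))[1:40]
df2 <- df[df$college_id %in% top40, ]

# Distinct values present in the rows vs. levels stored on the factor.
values_present <- length(unique(df2$college_id))  # 40
levels_before  <- nlevels(df2$college_id)         # 100: unused levels remain

# Drop the unused levels so the factor carries only what is present.
df2$college_id <- droplevels(df2$college_id)
levels_after <- nlevels(df2$college_id)           # 40
```

Applied to the question's data, running `df2$college_id <- droplevels(df2$college_id)` (or equivalently `factor(df2$college_id)`) before calling randomForest() should bring the predictor under the 53-category limit, provided no other factor column exceeds it.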
That was it... thanks! – bshelt141