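I can help with two of your questions: 1 - stratified sampling; 2 - splitting into training and test sets (i.e., calibration and validation).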
n = c(2.23, 3.5, 12, 2, 93, 57, 0.2,
      33, 5, 2, 305, 5.3, 2, 3.9, 4)
s = c("aa", "bb", "aa", "aa", "bb", "cc", "aa", "bb",
      "bb", "aa", "aa", "aa", "aa", "bb", "cc")
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9,
       10, 11, 12, 13, 14, 15)
df = data.frame(id, n, s) # df is a data frame
source("http://news.mrdwab.com/stratified")
selection <- stratified(df = df,
                        id = 1,     # column index of the ID;
                                    # create an ID column if you don't have one
                        group = 3,  # column index of the grouping variable
                        size = 2,   # number of rows drawn from each group
                        seed = "NULL")
# flag every selected row
selection["cal_val"] <- 1
# you now have a random selection stratified on column 3,
# but it still has to be split into calibration and validation, so:
selection2 <- stratified(df = selection, # sample from the previous selection
                         id = 1,
                         group = 3, # same grouping column as before
                         size = 1,  # half of the previous selection
                         seed = "NULL")
selection2["val"] <- 1
# merge the two selections; rows absent from selection2 get NA in the .y columns
merged <- merge(selection, selection2, all.x = TRUE, by = "id")
# replace NA with 0 in the flag coming from selection2
merged$cal_val.y[is.na(merged$cal_val.y)] <- 0
# create a column where 1 is calibration and 2 is validation
merged["calVal"] <- merged$cal_val.x + merged$cal_val.y
# keep only the useful columns; the merge leaves many duplicates behind
final_sample <- data.frame(id = merged$id, n = merged$n.x,
                           s = merged$s.x, calval = merged$calVal)
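If the remote `stratified` script cannot be loaded, the same two-step split can be sketched in base R alone. This is only a minimal sketch on the toy data above: the helper `pick` and the seed value are my own assumptions, not part of the sourced function.

```r
# Same toy data as above
df <- data.frame(
  id = 1:15,
  n  = c(2.23, 3.5, 12, 2, 93, 57, 0.2, 33, 5, 2, 305, 5.3, 2, 3.9, 4),
  s  = c("aa", "bb", "aa", "aa", "bb", "cc", "aa", "bb",
         "bb", "aa", "aa", "aa", "aa", "bb", "cc")
)

set.seed(42)  # arbitrary seed, only so the draw is reproducible

# Hypothetical helper: draw `size` entries from one stratum's row indices.
# sample.int() avoids the surprise of sample(x) when x has length 1.
pick <- function(rows, size) rows[sample.int(length(rows), size)]

# Step 1: draw two rows from each level of s
sel_rows <- unlist(lapply(split(seq_len(nrow(df)), df$s), pick, size = 2),
                   use.names = FALSE)
selection <- df[sel_rows, ]

# Step 2: within each stratum, send one of the two rows to validation
val_rows <- unlist(lapply(split(seq_len(nrow(selection)), selection$s),
                          pick, size = 1),
                   use.names = FALSE)
selection$calval <- 1            # 1 = calibration
selection$calval[val_rows] <- 2  # 2 = validation

final_sample <- selection[order(selection$id), c("id", "n", "s", "calval")]
```

Every stratum needs at least `size` rows, otherwise `sample.int` stops with an error; here the smallest group ("cc") has exactly two.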
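Welcome to SO! I suggest: the **RMysqlite** package to extract your data, the *sample* function (**base** package) for sampling, the *kmeans* function (**stats** package), and the *knn* function (**class** package). – agstudy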
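How should I handle the large data set? That is a database issue: the pre-sample has to be held in memory, and I only have 4 GB of RAM. – erichfw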
Try having the database engine perform the random selection: http://stackoverflow.com/q/580639/269476. – James
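James's suggestion might look roughly like the sketch below, so that only the sampled rows ever reach R. It assumes SQLite accessed through the DBI and RSQLite packages; the table name `obs` and the in-memory database are hypothetical stand-ins for the real one, and other engines spell the random ordering differently.

```r
library(DBI)

# Toy in-memory database standing in for the real one (assumption)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "obs", data.frame(
  id = 1:15,
  s  = c("aa", "bb", "aa", "aa", "bb", "cc", "aa", "bb",
         "bb", "aa", "aa", "aa", "aa", "bb", "cc")
))

# Fetch only the distinct strata, not the whole table
strata <- dbGetQuery(con, "SELECT DISTINCT s FROM obs")$s

# For each stratum, let the engine shuffle and return two rows
picked <- do.call(rbind, lapply(strata, function(level) {
  dbGetQuery(con,
             "SELECT id, s FROM obs WHERE s = ? ORDER BY RANDOM() LIMIT 2",
             params = list(level))
}))

dbDisconnect(con)
```

`ORDER BY RANDOM()` still makes the engine scan and sort each stratum, so it can be slow on very large tables, but the memory cost stays on the database side.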