Splitting a continuous variable into equal-sized groups

Example data frame:

das <- data.frame(anim=1:15, 
        wt=c(181,179,180.5,201,201.5,245,246.4, 
         189.3,301,354,369,205,199,394,231.3))

I need to cut it into 3 categories according to the value of wt, stored in a new variable wt2, like this:

> das
   anim    wt wt2
1     1 181.0   1
2     2 179.0   1
3     3 180.5   1
4     4 201.0   2
5     5 201.5   2
6     6 245.0   2
7     7 246.4   3
8     8 189.3   1
9     9 301.0   3
10   10 354.0   3
11   11 369.0   3
12   12 205.0   2
13   13 199.0   1
14   14 394.0   3
15   15 231.3   2

This will be applied to a large data set.

Source

2011-05-24 baz

See, for example: http://stackoverflow.com/questions/5915916/divide-a-range-of-values-in-bins-of-equal-length-cut-vs-cut2, http://stackoverflow.com/questions/2647639/create-c-type-variable-in-r-based-on-range, http://stackoverflow.com/questions/5570293/r-adding-column-which-contains-bin-value-of-another-column, http://stackoverflow.com/questions/5161055/binning-data-finding-results-by-group-and-plotting-using-r, http://stackoverflow.com/questions/5731116/equal-frequency-discretization-in-r, http://stackoverflow.com/questions/3288361/create-size-categories-without-nested-ifelse-in-r, ... – 2011-05-24 08:30:24

Are you sure @Ben Bolker's answer isn't the correct one? You specified that you need groups of equal size. – pir 2015-10-31 18:29:14

Try this:

split(das, cut(das$anim, 3))

If you want to split based on the value of wt, then:

library(Hmisc) # cut2 
split(das, cut2(das$wt, g=3))

Either way, you can do this by combining cut, cut2, and split.
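
To make the result of split concrete, here is a minimal base R sketch that splits on anim, as in the first line above, and inspects the resulting list. The names parts, group_sizes and mean_wt are only illustrative and do not come from the answer.

```r
# Example data from the question
das <- data.frame(anim = 1:15,
                  wt = c(181, 179, 180.5, 201, 201.5, 245, 246.4,
                         189.3, 301, 354, 369, 205, 199, 394, 231.3))

# split() returns a named list with one data frame per level of the factor
parts <- split(das, cut(das$anim, 3))

# Number of rows and mean weight in each piece (illustrative summaries)
group_sizes <- sapply(parts, nrow)
mean_wt <- sapply(parts, function(d) mean(d$wt))

print(group_sizes)
print(mean_wt)
```

Because anim is simply 1 to 15, the three equal-width intervals each hold five rows here; that would not generally be the case for a skewed variable such as wt.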

UPDATED

If you want the group index as an additional column, then:

das$group <- cut(das$anim, 3)

If the column should hold an index such as 1, 2, ..., then:

das$group <- as.numeric(cut(das$anim, 3))

UPDATED

Try this approach, using cut2:

> das$wt2 <- as.numeric(cut2(das$wt, g=3))
> das
   anim    wt wt2
1     1 181.0   1
2     2 179.0   1
3     3 180.5   1
4     4 201.0   2
5     5 201.5   2
6     6 245.0   2
7     7 246.4   3
8     8 189.3   1
9     9 301.0   3
10   10 354.0   3
11   11 369.0   3
12   12 205.0   2
13   13 199.0   1
14   14 394.0   3
15   15 231.3   2

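Source

2011-05-24 01:31:14 kohske

You can drop as.numeric and use 'cut(das$anim, 3, labels = FALSE)' – Ben 2015-05-07 23:35:46

This should be updated so that it is clear how it differs from @Ben's answer below. I mistakenly assumed it would split the observations evenly. – pir 2015-10-31 18:27:57

Are you sure the 'Hmisc::cut2()' solution doesn't? Can you give a small example? – 2015-10-31 18:34:34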

An alternative without cut2:

das$wt2 <- as.factor(as.numeric(cut(das$wt,3)))

or

das$wt2 <- as.factor(cut(das$wt, 3, labels = FALSE))

Source

2011-10-05 10:27:05 pedrosaurio

I think this splits into bins of equal width rather than bins of equal size? – 2015-10-31 18:38:12

Or see cut_number from the ggplot2 package, e.g.

das$wt_2 <- as.numeric(cut_number(das$wt,3))

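Note that cut(..., 3) divides the range of the original data into three intervals of equal length; it does not necessarily give the same number of observations per group if the data are unevenly distributed. (You can replicate what cut_number does by using quantile appropriately, but it is a nice convenience function.) On the other hand, Hmisc::cut2() with the g= argument does split by quantiles, so it is more or less equivalent to ggplot2::cut_number. I might have thought that something like cut_number would have made its way into dplyr by now, but as far as I can tell it hasn't.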
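
As a rough illustration of "using quantile appropriately", the base R sketch below compares equal-width bins with bins whose break points are the tertiles of wt. The variable names width_grp, tertiles and count_grp are made up for this example, and it is only one possible way to mimic cut_number, not its actual implementation.

```r
# Example data from the question
das <- data.frame(anim = 1:15,
                  wt = c(181, 179, 180.5, 201, 201.5, 245, 246.4,
                         189.3, 301, 354, 369, 205, 199, 394, 231.3))

# Equal-width bins: the range of wt is cut into three intervals of equal length
width_grp <- cut(das$wt, 3, labels = FALSE)

# Equal-count bins: use the tertiles of wt as break points
# (include.lowest = TRUE keeps the minimum value in the first bin)
tertiles <- quantile(das$wt, probs = seq(0, 1, length.out = 4))
count_grp <- cut(das$wt, breaks = tertiles, labels = FALSE,
                 include.lowest = TRUE)

# Observations per group under each scheme
print(tabulate(width_grp, nbins = 3))
print(tabulate(count_grp, nbins = 3))
```

On this data the equal-width scheme puts 11, 1 and 3 observations in the three groups, while the tertile breaks give 5 in each. If the data contain many repeated values, quantile can return duplicate break points and cut will then stop with an error, so the breaks may need deduplicating first.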

Source

2011-11-01 11:41:05

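ntile from dplyr now does this, but it behaves oddly with NAs. I have used similar code in the following function, which works in base R and does the equivalent of the cut2 solution above:

ntile_ <- function(x, n) {
    b <- x[!is.na(x)]                      # rank only the non-missing values
    q <- floor((n * (rank(b, ties.method = "first") - 1) / length(b)) + 1)
    d <- rep(NA, length(x))                # missing inputs stay NA
    d[!is.na(x)] <- q
    return(d)
}

Source

2016-10-15 01:22:57

Here is another solution, using the bin_data() function from the mltools package: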

library(mltools) 

# Resulting bins have an equal number of observations in each group 
das[, "wt2"] <- bin_data(das$wt, bins=3, binType = "quantile") 

# Resulting bins are equally spaced from min to max 
das[, "wt3"] <- bin_data(das$wt, bins=3, binType = "explicit") 

# Or if you'd rather define the bins yourself 
das[, "wt4"] <- bin_data(das$wt, bins=c(-Inf, 250, 322, Inf), binType = "explicit") 

das 
   anim    wt                                  wt2                                  wt3         wt4
1     1 181.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
2     2 179.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
3     3 180.5              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
4     4 201.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
5     5 201.5 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
6     6 245.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
7     7 246.4              [245.466666666667, 394]              [179, 250.666666666667) [-Inf, 250)
8     8 189.3              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
9     9 301.0              [245.466666666667, 394] [250.666666666667, 322.333333333333)  [250, 322)
10   10 354.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
11   11 369.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
12   12 205.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
13   13 199.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
14   14 394.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
15   15 231.3 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)

Source

2017-07-13 04:02:55 Ben

Without any additional package, with 3 as the number of groups:

> findInterval(das$wt, unique(quantile(das$wt, seq(0, 1, length.out = 3 + 1))), rightmost.closed = TRUE) 
[1] 1 1 1 2 2 2 3 1 3 3 3 2 1 3 2

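You can speed up the quantile computation by using a representative sample of the values of interest. Check the documentation of the findInterval function carefully.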
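
The sampling idea might look like the sketch below. It assumes a large numeric vector x (simulated here) and a hypothetical sample size of 5000; neither comes from the original answer. The break points are estimated from the sample only, and the outer breaks are set to -Inf and Inf so that values outside the sampled range still land in the first or last group.

```r
set.seed(42)

# Hypothetical large, skewed vector standing in for the real data
x <- rlnorm(1e5)
n_groups <- 4

# Estimate only the inner quantiles from a random sample of the values
probs <- seq(0, 1, length.out = n_groups + 1)
smp <- sample(x, 5000)
inner <- quantile(smp, probs[-c(1, length(probs))], names = FALSE)

# Open-ended outer breaks; unique() guards against tied break points
breaks <- c(-Inf, unique(inner), Inf)

# findInterval() uses left-closed intervals [a, b) by default
grp <- findInterval(x, breaks)

# Share of observations per group: close to, but not exactly, 1 / n_groups
shares <- tabulate(grp, nbins = n_groups) / length(x)
print(round(shares, 3))
```

The groups are only approximately equal in size, since the break points come from a sample rather than from the full vector.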

Source

2017-12-17 16:28:26 SamGG
