Using R: frequency counts with variable bin widths and factors

I have a fairly large dataset (more than 10,000 rows); a small sample of it is here:

structure(list(Feret = c(0.017, 0.016, 2.12, 0.016, 0.02, 0.023, 
0.017, 0.021, 0.02, 0.016, 0.027, 0.052, 0.061, 0.033, 0.041, 
0.017, 6.561, 7.123, 0.027, 0.018, 0.024, 4.099, 0.022, 0.025, 
0.037, 0.037, 0.018, 0.039, 0.027, 0.053, 0.016, 0.107, 0.52, 
0.041, 0.038, 0.039, 0.03, 0.071, 0.022, 0.118, 0.032, 0.018, 
0.027, 0.035, 8.113, 0.078, 4.089, 0.035, 0.057, 6.905, 2.5, 
0.282, 0.045, 0.039, 0.071, 0.037, 0.029, 0.027, 0.016, 0.02, 
0.026, 0.025, 0.026, 0.016, 0.016, 0.021), sample.type = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L), .Label = c("flower", "leaf"), class = "factor"), leaf.side = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), .Label = c("lower", "upper"), class = "factor"), canopy = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L), .Label = c("bottom", "top"), class = "factor"), treatment = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L), .Label = c("blue", "green", "grey", "white", "yel-green" 
), class = "factor")), .Names = c("Feret", "sample.type", "leaf.side", 
"canopy", "treatment"), row.names = c(500000L, 500001L, 500002L, 
500003L, 500004L, 500005L, 500006L, 500007L, 500008L, 500009L, 
500010L, 800000L, 800001L, 800002L, 800003L, 800004L, 800005L, 
800006L, 800007L, 800008L, 800009L, 800010L, 1000L, 1001L, 1002L, 
1003L, 1004L, 1005L, 1006L, 1007L, 1008L, 1009L, 1010L, 10000L, 
10001L, 10002L, 10003L, 10004L, 10005L, 10006L, 10007L, 10008L, 
10009L, 10010L, 100000L, 100001L, 100002L, 100003L, 100004L, 
100005L, 100006L, 100007L, 100008L, 100009L, 100010L, 1160000L, 
1160001L, 1160002L, 1160003L, 1160004L, 1160005L, 1160006L, 1160007L, 
1160008L, 1160009L, 1160010L), class = "data.frame")

I have been trying to build frequency counts of 'Feret' using the following variable bin widths:

bins <- c(0.01,0.03,0.1,0.3,1,3,10)

and then using:

freq<-hist(df_temp$Feret, breaks=bins) 
ranges<-paste(head(bins,-1),bins[-1],sep=" - ") 
freq$counts 
df5<-data.frame(ranges = ranges, frequency = freq$counts) 
df5

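But what I really need to do is split the data frame by the various factors ("sample.type", "leaf.side", "canopy", "treatment") and extract the frequency counts for each subset. I can do this by creating each subset manually, but I would like a better approach. I tried using a loop to create the subsets and then applying the hist() function to each one, but that takes a long time. Is there a better way using dplyr or apply? I would prefer to just have the results in a table and then plot them as needed.

Source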

2015-08-18 Charles Whitfield

Maybe something like 'df %>% mutate(Feret = cut(Feret, breaks = bins)) %>% count_(., names(.))'?

'table(cut(df$Feret, bins))' – SabDeM

The following snippet should do what you want:

I loaded your sample into df.

library("dplyr")
# Group by all four factors, then count the Feret values per bin within each group
df %>% group_by(sample.type, leaf.side, canopy, treatment) %>%
    dplyr::select(Feret) %>%
    do(data.frame(table(cut(.$Feret, breaks = bins, include.lowest = TRUE))))

I refer you to the dplyr documentation. In short, x %>% f is f(x), and x %>% f(a) is f(x, a).
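To make the pipe rule concrete, here is a minimal sketch with toy numbers that are not from the question. It only assumes that dplyr is installed (dplyr re-exports %>% from magrittr).

```r
library(dplyr)  # provides the %>% pipe (re-exported from magrittr)

x <- c(1, 4, 9, 16)  # toy values, not from the question

# x %>% f is the same as f(x)
piped_sqrt <- x %>% sqrt
plain_sqrt <- sqrt(x)

# x %>% f(a) is the same as f(x, a): x is inserted as the first argument
piped_head <- x %>% head(2)
plain_head <- head(x, 2)

identical(piped_sqrt, plain_sqrt)  # TRUE
identical(piped_head, plain_head)  # TRUE
```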

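Note that dplyr::select is just select, but I have run into namespace problems so many times that I now always specify the package.

table(cut(df$Feret, breaks=bins)) is just a nicer way of doing what you did with hist. With cut you create a factor variable (remember to add include.lowest = TRUE if your values can be equal to the lowest break), and with table you count the frequency of each level.

This gives: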
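As a quick check that the two approaches agree, here is a small sketch using the first 11 Feret values from the sample in the question. The variable names feret, tab and h are just for illustration.

```r
bins <- c(0.01, 0.03, 0.1, 0.3, 1, 3, 10)

# First 11 Feret values of the sample data in the question
feret <- c(0.017, 0.016, 2.12, 0.016, 0.02, 0.023,
           0.017, 0.021, 0.02, 0.016, 0.027)

# cut() assigns each value to a bin (a factor level);
# table() counts the observations in each level, including empty ones
tab <- table(cut(feret, breaks = bins, include.lowest = TRUE))

# hist() computes the same counts; plot = FALSE suppresses the plot
h <- hist(feret, breaks = bins, plot = FALSE)

all(as.vector(tab) == h$counts)  # TRUE

# A value equal to the lowest break is only binned with include.lowest = TRUE
is.na(cut(0.01, breaks = bins))                         # TRUE
is.na(cut(0.01, breaks = bins, include.lowest = TRUE))  # FALSE
```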

   sample.type leaf.side canopy treatment        Var1 Freq
 1      flower     upper    top     green (0.01,0.03]    0
 2      flower     upper    top     green  (0.03,0.1]    6
 3      flower     upper    top     green   (0.1,0.3]    1
 4      flower     upper    top     green     (0.3,1]    0
 5      flower     upper    top     green       (1,3]    1
 6      flower     upper    top     green      (3,10]    3
 7      flower     upper    top     white (0.01,0.03]    4
 8      flower     upper    top     white  (0.03,0.1]    4
 9      flower     upper    top     white   (0.1,0.3]    0
10      flower     upper    top     white     (0.3,1]    0
11      flower     upper    top     white       (1,3]    0
12      flower     upper    top     white      (3,10]    3
13        leaf     lower bottom     white (0.01,0.03]    5
14        leaf     lower bottom     white  (0.03,0.1]    4
15        leaf     lower bottom     white   (0.1,0.3]    1
16        leaf     lower bottom     white     (0.3,1]    1
17        leaf     lower bottom     white       (1,3]    0
18        leaf     lower bottom     white      (3,10]    0
19        leaf     lower    top      grey (0.01,0.03]   10
20        leaf     lower    top      grey  (0.03,0.1]    1
21        leaf     lower    top      grey   (0.1,0.3]    0
22        leaf     lower    top      grey     (0.3,1]    0
23        leaf     lower    top      grey       (1,3]    0
24        leaf     lower    top      grey      (3,10]    0
25        leaf     upper bottom     white (0.01,0.03]    4
26        leaf     upper bottom     white  (0.03,0.1]    6
27        leaf     upper bottom     white   (0.1,0.3]    1
28        leaf     upper bottom     white     (0.3,1]    0
29        leaf     upper bottom     white       (1,3]    0
30        leaf     upper bottom     white      (3,10]    0
31        leaf     upper    top      blue (0.01,0.03]   10
32        leaf     upper    top      blue  (0.03,0.1]    0
33        leaf     upper    top      blue   (0.1,0.3]    0
34        leaf     upper    top      blue     (0.3,1]    0
35        leaf     upper    top      blue       (1,3]    1
36        leaf     upper    top      blue      (3,10]    0

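(In fact, it does not print exactly like this, because the result is a tbl, but you can use print.data.frame to print a tbl the old way.)

From here you can directly extract the information you want.

Source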

2015-08-18 14:18:08

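Awesome, this works perfectly. Thank you. Now that this has given me a taste of dplyr, I will go read the documentation and see whether I can find a tutorial. Having seen your code and the explanation, it no longer looks so daunting.

Start by defining a character vector with the names of the factors: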

factors <- c("sample.type","leaf.side","canopy", "treatment")

Then use this vector to apply the hist() function to each factor (assuming the data is stored in a data frame object called df):

res <- sapply(factors, function(factor) {
    lapply(split(df[, c("Feret", factor)], df[[factor]]), function(group) {
        hist(group$Feret, breaks = bins, plot = FALSE)
    })
}, simplify = FALSE)

You now have a list with one element per factor, each of which is in turn a list with one element per level of that factor:

> names(res) 
[1] "sample.type" "leaf.side" "canopy"  "treatment" 
> names(res$sample.type) 
[1] "flower" "leaf" 
> res$sample.type$flower 
$breaks 
[1] 0.01 0.03 0.10 0.30 1.00 3.00 10.00 

$counts 
[1] 4 10 1 0 1 6 

$density 
[1] 9.09090909 6.49350649 0.22727273 0.00000000 0.02272727 0.03896104 

$mids 
[1] 0.020 0.065 0.200 0.650 2.000 6.500 

$xname 
[1] "group$Feret" 

$equidist 
[1] FALSE 

attr(,"class") 
[1] "histogram" 
>

You can then format this into something suitable for printing.

Source
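One possible way to do that formatting is sketched below. It flattens the nested list into a single data frame with one row per factor, level and bin. The toy data frame, the reduced factors vector and the column names (factor, level, range, frequency) are assumptions made for the example, not part of the original answer.

```r
bins <- c(0.01, 0.03, 0.1, 0.3, 1, 3, 10)

# Toy data standing in for the question's df (illustrative values only)
df <- data.frame(
  Feret       = c(0.017, 0.052, 2.12, 0.016, 6.561, 0.107),
  sample.type = factor(c("leaf", "flower", "leaf", "leaf", "flower", "leaf")),
  canopy      = factor(c("top", "top", "top", "bottom", "top", "bottom"))
)
factors <- c("sample.type", "canopy")

# Same idea as above: one histogram object per level of each factor
res <- sapply(factors, function(factor) {
  lapply(split(df$Feret, df[[factor]]), function(x) {
    hist(x, breaks = bins, plot = FALSE)
  })
}, simplify = FALSE)

# Human-readable bin labels, as in the question
ranges <- paste(head(bins, -1), bins[-1], sep = " - ")

# Stack the counts: one row per factor / level / bin
counts <- do.call(rbind, lapply(names(res), function(f) {
  do.call(rbind, lapply(names(res[[f]]), function(lv) {
    data.frame(factor = f, level = lv, range = ranges,
               frequency = res[[f]][[lv]]$counts)
  }))
}))

counts
```

Because hist() returns a count for every bin, empty bins show up here with a frequency of 0.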

2015-08-18 14:56:40 mjkallen

So this is how to do it using apply. Thank you very much.

If we are not interested in the bins that contain no observations, we only need:

df %>%
    group_by(sample.type, leaf.side, canopy, treatment, groups = cut(Feret, bins)) %>%
    summarise(freq = n())

Output:

   sample.type leaf.side canopy treatment      groups freq
 1      flower     upper    top     green  (0.03,0.1]    6
 2      flower     upper    top     green   (0.1,0.3]    1
 3      flower     upper    top     green       (1,3]    1
 4      flower     upper    top     green      (3,10]    3
 5      flower     upper    top     white (0.01,0.03]    4
 6      flower     upper    top     white  (0.03,0.1]    4
 7      flower     upper    top     white      (3,10]    3
 8        leaf     lower bottom     white (0.01,0.03]    5
 9        leaf     lower bottom     white  (0.03,0.1]    4
10        leaf     lower bottom     white   (0.1,0.3]    1
11        leaf     lower bottom     white     (0.3,1]    1
12        leaf     lower    top      grey (0.01,0.03]   10
13        leaf     lower    top      grey  (0.03,0.1]    1
14        leaf     upper bottom     white (0.01,0.03]    4
15        leaf     upper bottom     white  (0.03,0.1]    6
16        leaf     upper bottom     white   (0.1,0.3]    1
17        leaf     upper    top      blue (0.01,0.03]   10
18        leaf     upper    top      blue       (1,3]    1

Source

2015-08-18 16:23:04 mpalanco

Thank you. I was just wondering how to handle the zero counts.
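On the zero counts raised in this comment, one option (a sketch, not something proposed in the thread) is to bin first and then count with .drop = FALSE, which keeps factor levels that have no rows. This assumes dplyr 0.8.0 or later, where count() accepts the .drop argument; the toy data frame is made up for the example.

```r
library(dplyr)

bins <- c(0.01, 0.03, 0.1, 0.3, 1, 3, 10)

# Toy data standing in for the question's df (illustrative values only)
toy <- data.frame(
  Feret       = c(0.017, 0.052, 2.12, 0.016, 6.561, 0.107),
  sample.type = factor(c("leaf", "flower", "leaf", "leaf", "flower", "leaf"))
)

zero_kept <- toy %>%
  mutate(groups = cut(Feret, breaks = bins, include.lowest = TRUE)) %>%
  # .drop = FALSE keeps empty factor levels, so empty bins get a count of 0
  count(sample.type, groups, .drop = FALSE)

zero_kept
```

With several grouping factors, .drop = FALSE should also produce rows for combinations of factor levels that never occur together in the data, so the result may need filtering afterwards.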
