Basic probability analysis in R

I'm not a statistician, but I do want to use basic probability to understand what is happening in my data.
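I put together a tedious but very useful method: I bin the data with a histogram, then, for each bin, compare the group I'm interested in against the group as a whole. It has shown us some remarkable insights at our company, and it is easy to explain what is happening in the plot. Tedious as it is, this type of analysis is so useful that someone else has probably already written a function for it.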
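My code is below. Does this type of analysis already exist as a function? I have also used logi.hist.plot(), which does something similar, but it can be buggy, and I prefer this "raw view" of the data.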
library(dplyr)
library(ggplot2)
#Create the data
set.seed(84102)
daba <- data.frame(YES_NO = c(0,0,1,1,1,1,0,0,0,1,0,1,0,1,0,1,0,0,0,1))
daba$UserCount <- c(23,43,45,65,32,10,34,68,65,75,43,24,37,54,73,29,87,32,21,12)
#Create the bins using hist(), clean up bins and make them integers
hist_breaks <- cut(daba$UserCount, breaks = hist(daba$UserCount, breaks = 20)$breaks)
daba$Breaks <- hist_breaks
daba$Breaks <- sub(".*,","",daba$Breaks)
daba$Breaks <- sub("]","",daba$Breaks)
daba$Breaks[is.na(daba$Breaks)] <- 0
daba$Breaks <- as.integer(daba$Breaks)
#Create two data groups to be compared
daba_NO <- filter(daba, daba$YES_NO == 0)
daba_YES <- filter(daba, daba$YES_NO == 1)
#Aggregate user count into histogram bins using aggregate()
daba_NOAgg <- aggregate(data = daba_NO, daba_NO$Breaks~daba_NO$UserCount, sum)
daba_YESAgg <- aggregate(data = daba_YES, daba_YES$Breaks~daba_YES$UserCount, sum)
#Rename the columns to clean it up
colnames(daba_NOAgg) <- c("UserCountNo", "Breaks")
colnames(daba_YESAgg) <- c("UserCountYes", "Breaks")
#Merge the two groups back together
daba_SUMAgg <- merge(x = daba_NOAgg, y = daba_YESAgg, by.x = "Breaks", by.y = "Breaks")
#Generate basic probability for Yes group of users
daba_SUMAgg$Probability <- (daba_SUMAgg$UserCountYes/(daba_SUMAgg$UserCountNo+daba_SUMAgg$UserCountYes))*100
#Graph the data
ggplot(data = daba_SUMAgg) +
  geom_point(alpha = 0.4, mapping = aes(x = Breaks, y = Probability)) +
  labs(x = "BINS", y = "PROBABILITY", title = "PROBABILITY ANALYSIS USING BINS")
daba_SUMAgg
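
For comparison, the bin-and-compare idea described in the question can be sketched in a few lines of base R. This is only a sketch under two assumptions that are not stated in the question: that the intended quantity is the share of UserCount belonging to the YES group within each bin, and that the smallest value should fall into the first bin (include.lowest = TRUE) rather than become NA. The names bin, yes_users, all_users and bin_share are made up for illustration.

```r
# Same example data as in the question
daba <- data.frame(
  YES_NO    = c(0,0,1,1,1,1,0,0,0,1,0,1,0,1,0,1,0,0,0,1),
  UserCount = c(23,43,45,65,32,10,34,68,65,75,43,24,37,54,73,29,87,32,21,12)
)

# Bin edges as chosen by hist(); plot = FALSE suppresses the drawing
brks <- hist(daba$UserCount, breaks = 20, plot = FALSE)$breaks

# Label each row with the upper edge of its bin (assumption: the minimum
# value is kept in the first bin instead of being turned into NA)
bin <- brks[-1][cut(daba$UserCount, breaks = brks,
                    labels = FALSE, include.lowest = TRUE)]

# Sum UserCount per bin for the YES rows and for all rows
yes_users <- tapply(daba$UserCount * daba$YES_NO, bin, sum)
all_users <- tapply(daba$UserCount, bin, sum)

# One row per occupied bin, including bins that contain only one group
bin_share <- data.frame(
  Breaks       = as.numeric(names(all_users)),
  UserCountYes = as.vector(yes_users),
  UserCountAll = as.vector(all_users),
  Probability  = as.vector(100 * yes_users / all_users)
)
bin_share
```

Because nothing is merged, every occupied bin appears exactly once, and a bin with only NO rows simply gets a share of 0.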
Are you sure your 'daba_SUMAgg' data frame makes sense? You get 2 rows each for breaks 25 and 35. Also, some of your breaks, such as 90, are missing. – AntoniosK
I think you need 'aggregate(data = daba_NO, daba_NO$UserCount ~ daba_NO$Breaks, sum)'. You have to switch the things you pass to '~'. – AntoniosK
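
A minimal sketch of the swap suggested in the comment above, applied to the question's data. The bin labels reproduce the question's convention (upper edge of each bin, NA replaced by 0). Two details go beyond the comment and are assumptions: passing all = TRUE to merge() so that bins present in only one group are kept, and filling the resulting NA counts with 0.

```r
# Data and bin labels as in the question (label = upper bin edge, NA -> 0)
daba <- data.frame(
  YES_NO    = c(0,0,1,1,1,1,0,0,0,1,0,1,0,1,0,1,0,0,0,1),
  UserCount = c(23,43,45,65,32,10,34,68,65,75,43,24,37,54,73,29,87,32,21,12)
)
brks <- hist(daba$UserCount, breaks = 20, plot = FALSE)$breaks
daba$Breaks <- brks[-1][cut(daba$UserCount, breaks = brks, labels = FALSE)]
daba$Breaks[is.na(daba$Breaks)] <- 0

# Response (UserCount) on the left of ~, grouping variable (Breaks) on the right
daba_NOAgg  <- aggregate(UserCount ~ Breaks, data = daba[daba$YES_NO == 0, ], FUN = sum)
daba_YESAgg <- aggregate(UserCount ~ Breaks, data = daba[daba$YES_NO == 1, ], FUN = sum)
colnames(daba_NOAgg)  <- c("Breaks", "UserCountNo")
colnames(daba_YESAgg) <- c("Breaks", "UserCountYes")

# Assumption beyond the comment: all = TRUE keeps bins found in only one group
daba_SUMAgg <- merge(daba_NOAgg, daba_YESAgg, by = "Breaks", all = TRUE)
daba_SUMAgg[is.na(daba_SUMAgg)] <- 0

# Share of UserCount in the YES group per bin, in percent
daba_SUMAgg$Probability <- 100 * daba_SUMAgg$UserCountYes /
  (daba_SUMAgg$UserCountNo + daba_SUMAgg$UserCountYes)
daba_SUMAgg
```

With the formula in this order, each break appears once per group, so the merged frame has one row per bin and break 90 is no longer dropped.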