2012-06-27 73 views
8

How can I overlay an arbitrary parametric distribution on a histogram using ggplot?

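I have attempted this based on the Quick-R example, but I don't understand where the scaling factor comes from. Is this approach reasonable? How can I modify it to use ggplot?

An example that overplots a normal and a log-normal distribution using this method follows: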

## Get a log-normalish data set: the number of characters per word in "Alice in Wonderland" 
alice.raw <- readLines(con = "http://www.gutenberg.org/cache/epub/11/pg11.txt", 
         n = -1L, ok = TRUE, warn = TRUE, 
         encoding = "UTF-8") 

alice.long <- paste(alice.raw, collapse=" ") 
alice.long.noboilerplate <- strsplit(alice.long, split="\\*\\*\\*")[[1]][3] 
alice.words <- strsplit(alice.long.noboilerplate, "[[:space:]]+")[[1]] 
alice.nchar <- nchar(alice.words) 
alice.nchar <- alice.nchar[alice.nchar > 0] 

# Plot the histogram, then overlay the normal and log-normal probability distributions 
library(MASS)  # provides fitdistr() 
h <- hist(alice.nchar, breaks=1:50, xlab="Characters in word", main="Count") 
xfit <- seq(1, 50, 0.1) 

# Plot a normal curve 
yfit<-dnorm(xfit,mean=mean(alice.nchar),sd=sd(alice.nchar)) 
yfit <- yfit * diff(h$mids[1:2]) * length(alice.nchar) 
lines(xfit, yfit, col="blue", lwd=2) 

# Now plot a log-normal curve 
params <- fitdistr(alice.nchar, densfun="lognormal") 
yfit <- dlnorm(xfit, meanlog=params$estimate["meanlog"], sdlog=params$estimate["sdlog"]) 
yfit <- yfit * diff(h$mids[1:2]) * length(alice.nchar) 
lines(xfit, yfit, col="red", lwd=2) 

This produces the following plot:

[Figure: histogram of word length overlaid with a normal distribution curve and a log-normal distribution curve]

To clarify, I would like counts on the y-axis rather than density estimates.

+0

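Note that the normal distribution doesn't make sense here, because words all have > 0 letters and the values are discrete integers; the normal is continuous. –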

+0

Agreed - this is a toy example with a convenient data set, and the normal curve is probably not appropriate. – fmark

Answers

11

Take a look at stat_function():

alice.raw <- readLines(con = "http://www.gutenberg.org/cache/epub/11/pg11.txt", 
         n = -1L, ok = TRUE, warn = TRUE, 
         encoding = "UTF-8") 

alice.long <- paste(alice.raw, collapse=" ") 
alice.long.noboilerplate <- strsplit(alice.long, split="\\*\\*\\*")[[1]][3] 
alice.words <- strsplit(alice.long.noboilerplate, "[[:space:]]+")[[1]] 
alice.nchar <- nchar(alice.words) 
alice.nchar <- alice.nchar[alice.nchar > 0] 
dataset <- data.frame(alice.nchar = alice.nchar) 
library(ggplot2) 
ggplot(dataset, aes(x = alice.nchar)) + 
    # after_stat(density) replaces the deprecated ..density.. (needs ggplot2 >= 3.3.0) 
    geom_histogram(aes(y = after_stat(density))) + 
    stat_function(fun = dnorm, 
    args = list(
     mean = mean(dataset$alice.nchar), 
     sd = sd(dataset$alice.nchar)), 
    colour = "red") 
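
The question also asked about a log-normal curve, and the same stat_function() pattern should carry over to any density function that takes x as its first argument. The following is a minimal sketch, not part of the original answer: it assumes ggplot2 >= 3.3.0 (for after_stat()) and uses simulated word lengths in place of alice.nchar so that it runs without downloading the book.

```r
library(ggplot2)
library(MASS)  # provides fitdistr()

# Assumption: simulated, log-normal-ish word lengths stand in for alice.nchar
set.seed(42)
dataset <- data.frame(
  alice.nchar = pmax(1, round(rlnorm(5000, meanlog = 1.3, sdlog = 0.5)))
)

# Maximum-likelihood estimates of the log-normal parameters
fit <- fitdistr(dataset$alice.nchar, densfun = "lognormal")

p <- ggplot(dataset, aes(x = alice.nchar)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1) +
  # Normal curve, parameterised by the sample mean and standard deviation
  stat_function(fun = dnorm,
                args = list(mean = mean(dataset$alice.nchar),
                            sd = sd(dataset$alice.nchar)),
                colour = "blue") +
  # Log-normal curve, parameterised by the fitted meanlog and sdlog
  stat_function(fun = dlnorm,
                args = list(meanlog = fit$estimate[["meanlog"]],
                            sdlog = fit$estimate[["sdlog"]]),
                colour = "red")
print(p)
```

Swapping dlnorm for another density (dgamma, dweibull, and so on) only requires changing fun and the names in args.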


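If you want counts on the y-axis, as in your example, you need a function that converts the density to counts. Each bar's height is the number of observations falling in that bin, so the density has to be multiplied by the number of observations and by the bin width: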

dnorm.count <- function(x, mean = 0, sd = 1, log = FALSE, n = 1, binwidth = 1){ 
    n * binwidth * dnorm(x = x, mean = mean, sd = sd, log = log) 
} 

ggplot(dataset, aes(x = alice.nchar)) + geom_histogram(binwidth=1.6) + 
    stat_function(fun = dnorm.count, 
       args = list(
        mean = mean(dataset$alice.nchar), 
        sd = sd(dataset$alice.nchar), 
        n = nrow(dataset), binwidth=1.6), 
       colour = "red") 
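
The n * binwidth factor in dnorm.count is the same scaling as diff(h$mids[1:2]) * length(alice.nchar) in the question: the bars of a count histogram enclose a total area of n * binwidth, while a density integrates to 1. As a sketch of how this might be generalised to an arbitrary distribution, the helper below wraps any density function. density_to_count is a hypothetical name introduced here, not a ggplot2 function, and the data are simulated stand-ins for alice.nchar.

```r
library(ggplot2)
library(MASS)  # provides fitdistr()

# Hypothetical helper (not part of ggplot2): wrap a density function so that
# it returns expected counts per bin instead of densities.
density_to_count <- function(dfun, n, binwidth) {
  function(x, ...) n * binwidth * dfun(x, ...)
}

# Assumption: simulated word lengths stand in for alice.nchar
set.seed(42)
dataset <- data.frame(
  alice.nchar = pmax(1, round(rlnorm(5000, meanlog = 1.3, sdlog = 0.5)))
)
n <- nrow(dataset)
binwidth <- 1

fit <- fitdistr(dataset$alice.nchar, densfun = "lognormal")
dlnorm_count <- density_to_count(dlnorm, n = n, binwidth = binwidth)

# Sanity check: the scaled curve encloses (approximately) n * binwidth,
# the same total area as the histogram bars.
area <- integrate(dlnorm_count, lower = 0, upper = Inf,
                  meanlog = fit$estimate[["meanlog"]],
                  sdlog = fit$estimate[["sdlog"]])$value

p <- ggplot(dataset, aes(x = alice.nchar)) +
  geom_histogram(binwidth = binwidth) +
  stat_function(fun = dlnorm_count,
                args = list(meanlog = fit$estimate[["meanlog"]],
                            sdlog = fit$estimate[["sdlog"]]),
                colour = "red")
print(p)
```

The binwidth passed to the helper has to match the one given to geom_histogram(); otherwise the curve and the bars end up on different scales.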


+0

Nice. I think stat_function must be new. It is a big improvement over my previous approach of first creating a data frame of x, dnorm(x,,). –

+1

@David `stat_function` has been around for as long as I can remember! :) – joran

+0

This is fantastic - is it possible to have counts on the y-axis, rather than densities as in the example above? – fmark
