Random sampling with conditions

My problem sits inside a loop. I have a large dataset (DF), a subset of which looks like this:

ID  Site Species 
101  4 x 
101  4 y 
101  4 z 
102  6 x 
102  6 z 
102  6 a 
102  6 b 
103  6 a 
103  6 z 
103  6 c 
103  6 x 
103  6 y 
105  6 x 
105  6 y 
105  6 a 
105  6 z 
108  1 x 
108  1 a 
108  1 c 
108  1 z

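On each iteration of my loop (so, i), I want to randomly select all rows for one individual ID from each site. Crucially, only one ID per site. I have a separate function that subsets my large dataset into sites, so if i=1 then only one of the above sites (for example) would appear in the subset.

If i=3, as in the example in this post, then I would want all rows of 101, all rows of either 102, 103 or 105, and all rows of 108.

I think something like ddply() with sample() should do this, but I can't get it to sample randomly.

Any advice would be much appreciated. Thanks

James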


2014-01-15 user3122022

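Can you explain why 'i = 3' means those 'ID's should be selected, and why '108' is different from '102, 103, 105'? Could you show some code to illustrate what you are doing, with some general setup? It is currently unclear what 'i' is. –

OK, sorry, here is some more context. I am using specaccum() to bootstrap the generation of species accumulation curves across varying numbers of remote cameras (the ID column) and varying numbers of sites (the Site column). So I need a curve for one site with one camera, two cameras, etc., then curves for two sites with one camera, two cameras, etc. My first loop, for (l in 1:length(sitelist)), subsets into l possible sites and generates a list of all possible cameras at those sites. My next nested loop, for (i in 1:l), is where I want to sample one camera, two cameras (from different sites), etc. – user3122022

108 is different from 102, 103 and 105 because it is at a different site (Site column). I want to randomly select one ID from each site. The dataset I provided shows the iteration for i = 3 (3 sites). Other iterations (more sites) have more IDs, but I still only want one ID from each site, regardless of how large i gets (how many sites there are). I hope this makes more sense. – user3122022

How about this? I've added a function to simulate what I think your data looks like.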

# dependencies (library() stops with an error if plyr is not installed) 
library(plyr) 

#function to make data (just to work with) 
make_data<-function(id){ 
    set.seed(id) 
    num_sites<-round(runif(1)*3,0)+1 
    num_sp<-round(runif(1)*7,0)+1 
    sites<-sample(1:10,num_sites,FALSE) 
    ldply(sites,function(x)data.frame(sites=x,sp=sample(letters[1:26],num_sp,FALSE))) 
} 

#make a data frame for example use (as per question) 
ids<-100:200 
df<-ldply(ids,function(x)data.frame(id=x,make_data(x))) 

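################################################ 
# HERE'S THE CODE FOR THE ANSWER               # 
# use ddply to summarise by site & sampled ids # 
# sample by index over the unique ids, so a site with a single id is not 
# expanded to 1:id by sample() and ids are not weighted by row count 
filter<-ddply(df,.(sites),summarise,set=unique(id)[sample.int(length(unique(id)),1)]) 
# then apply this filter to the original data frame, keeping all rows 
# for the sampled id at each site (the column is named sites, not site) 
ddply(filter,.(sites),.fun=function(x){return(df[df$sites==x$sites & df$id==x$set,])})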
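
For comparison, the same two-step idea (pick one ID per site, then keep all rows for the picked pairs) can be sketched in base R on the example data from the question. This is a swapped-in technique rather than the plyr code above. The helper name `sample_one` is hypothetical, and the column names `ID`, `Site` and `Species` are taken from the question's table.

```r
# Example data from the question (ID = camera, Site = site)
df <- data.frame(
  ID      = rep(c(101, 102, 103, 105, 108), times = c(3, 4, 5, 4, 4)),
  Site    = rep(c(4, 6, 6, 6, 1),           times = c(3, 4, 5, 4, 4)),
  Species = c("x", "y", "z",
              "x", "z", "a", "b",
              "a", "z", "c", "x", "y",
              "x", "y", "a", "z",
              "x", "a", "c", "z")
)

# Hypothetical helper: draw one element by position, so a single numeric
# candidate is never expanded to 1:x by sample()
sample_one <- function(x) x[sample.int(length(x), 1)]

# Step 1: one randomly chosen ID per site (named vector, names = sites)
picked <- sapply(split(df$ID, df$Site), function(ids) sample_one(unique(ids)))

# Step 2: keep every row whose ID is the one picked for its site
keep <- df$ID == picked[as.character(df$Site)]
res <- df[keep, ]
```

With this data, `res` should always contain all rows of 101 and 108, plus all rows of exactly one of 102, 103 or 105.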


2014-01-15 08:27:59 Troy

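Thanks, both answers are good, but I used this one because it is only 2 lines of code. – user3122022

I think you can use unique to find all the possible ID/Site combinations and then sample from that unique subset.

For example, let's create a dataset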

# Set the RNG seed for reproducibility 
set.seed(12345) 
ID <- rep(100:110, c(2, 6, 3, 1, 3, 8, 9, 2, 4, 5, 6)) 
site <- rep(1:6, c(8, 7, 8, 11, 4, 11)) 
species <- sample(letters[1:5], length(ID), replace=TRUE) 

df <- data.frame(ID=ID, Site=site, Species=species)

So df looks like this:

> head(df, 15) 
    ID Site Species
1  100    1       d
2  100    1       e
3  101    1       d
4  101    1       e
5  101    1       c
6  101    1       a
7  101    1       b
8  101    1       c
9  102    2       d
10 102    2       e
11 102    2       a
12 103    2       a
13 104    2       d
14 104    2       a
15 104    2       b

Summarising the data, we have:

Site 1 -> 100, 101 
Site 2 -> 102, 103, 104 
Site 3 -> 105 
Site 4 -> 106, 107 
Site 5 -> 108 
Site 6 -> 109, 110
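
The site-to-ID summary above can be reproduced directly from the data. A minimal sketch, rebuilding only the ID and Site columns (Species is not needed for this step); the names `pairs` and `ids.by.site` are chosen here for illustration:

```r
# Same ID/Site layout as the example data set
ID   <- rep(100:110, c(2, 6, 3, 1, 3, 8, 9, 2, 4, 5, 6))
Site <- rep(1:6, c(8, 7, 8, 11, 4, 11))

# All distinct (ID, Site) pairs
pairs <- unique(data.frame(ID = ID, Site = Site))

# IDs grouped by site, as a list named by site
ids.by.site <- split(pairs$ID, pairs$Site)
ids.by.site
```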

Now, let's say I want to select from 3 sites

# The number of sites we want to sample 
num.sites <- 3 
# Find all the sites 
all.sites <- unique(df$Site) 
# Pick the sites. 
# You may also want to check that num.sites <= length(all.sites) 
sites <- sample(all.sites, num.sites)

In this case, we picked

> sites 
[1] 4 5 6
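
One caveat with the `sample()` call above, in case it is ever run on a subset that contains a single site: when its first argument is a single number `n` (with `n >= 1`), `sample()` draws from `1:n` rather than from that one value. A minimal sketch of a guard, where `safe.sample` is a hypothetical helper name and not part of base R:

```r
# Hypothetical helper: sample by position so a length-one vector is
# treated as a vector, and fail early if too many items are requested
safe.sample <- function(x, size) {
  stopifnot(size <= length(x))
  x[sample.int(length(x), size)]
}

all.sites <- c(6)            # a subset that happens to hold one site
safe.sample(all.sites, 1)    # always 6
# sample(all.sites, 1) would instead return a value from 1:6
```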

OK, now we find the IDs available at each site

# Now find the IDs in each of those sites 
# simplify=FALSE is VERY important to ensure we get a list even if every 
# site has the same number of IDs 
# (sites is the vector of sampled sites created above) 
IDs <- sapply(sites, function(s) 
    { 
    unique(df$ID[df$Site==s]) 
    }, simplify=FALSE)

This gives us

> IDs 
[[1]] 
[1] 106 107 

[[2]] 
[1] 108 

[[3]] 
[1] 109 110

Now pick one ID for each site

# NOTE: this assumes the same ID is not found in multiple sites 
# but it's easy to deal with the opposite case 
# Again, we return a list, because sapply does not seem 
# to play well with data frames... (try it!) 
res <- sapply(IDs, function(i) 
    { 
    chosen.ID <- sample(as.list(i), 1) 
    df[df$ID==chosen.ID,] 
    }, simplify=FALSE) 

# Finally convert the list to a data frame 
res <- do.call(rbind, res) 


> res 
    ID Site Species
24 106    4       d
25 106    4       d
26 106    4       b
27 106    4       d
28 106    4       c
29 106    4       b
30 106    4       c
31 106    4       d
32 106    4       a
35 108    5       b
36 108    5       b
37 108    5       e
38 108    5       e
44 110    6       d
45 110    6       b
46 110    6       b
47 110    6       a
48 110    6       a
49 110    6       a

So, putting everything into a single function

pickSites <- function(df, num.sites) 
    { 
    all.sites <- unique(df$Site) 
    chosen.sites <- sample(all.sites, num.sites) 

    IDs <- sapply(chosen.sites, function(s) 
     { 
     unique(df$ID[df$Site==s]) 
     }, simplify=FALSE) 

    res <- sapply(IDs, function(i) 
     { 
     chosen.ID <- sample(as.list(i), 1) 
     df[df$ID==chosen.ID,] 
     }, simplify=FALSE) 

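    # Return the combined data frame visibly (a bare assignment as the 
    # last expression would return its value invisibly) 
    do.call(rbind, res) 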
    }
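
A possible way to use this from the nested loop described in the question is sketched below. To keep the block runnable on its own it uses a compact variant of the function rather than the code above: the name `pick.one.per.site`, the index-based sampling, and the toy data (ID and Site columns only) are assumptions for illustration.

```r
# Toy data with the same ID/Site layout as the example data set
df <- data.frame(
  ID   = rep(100:110, c(2, 6, 3, 1, 3, 8, 9, 2, 4, 5, 6)),
  Site = rep(1:6, c(8, 7, 8, 11, 4, 11))
)

# Compact variant: choose num.sites sites, then one ID within each,
# and return all rows for the chosen (Site, ID) pairs
pick.one.per.site <- function(df, num.sites) {
  all.sites <- unique(df$Site)
  stopifnot(num.sites <= length(all.sites))
  chosen.sites <- all.sites[sample.int(length(all.sites), num.sites)]
  res <- lapply(chosen.sites, function(s) {
    ids <- unique(df$ID[df$Site == s])
    chosen.ID <- ids[sample.int(length(ids), 1)]
    df[df$Site == s & df$ID == chosen.ID, ]
  })
  do.call(rbind, res)
}

# One draw for each number of sites (i = 1, 2, ..., number of sites)
n.sites <- length(unique(df$Site))
draws <- lapply(seq_len(n.sites), function(i) pick.one.per.site(df, i))

# Largest number of distinct IDs found at any site in each draw;
# every entry should be 1 if the one-ID-per-site condition holds
ids.per.site <- sapply(draws, function(d) {
  max(tapply(d$ID, d$Site, function(v) length(unique(v))))
})
```

Filtering on both Site and ID in the last step also covers the case where the same ID appears at more than one site.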


2014-01-15 08:38:14 nico
