2012-01-18

R - Filtering data from a data frame

I'm new to R and really not sure how to filter data in a data frame.

I have created a data frame with two columns: the monthly dates and the corresponding temperatures. It has 324 rows.

> head(Nino3.4_1974_2000) 
    Month_common    Nino3.4_degree_1974_2000_plain 
1 1974-01-15      -1.93025 
2 1974-02-15      -1.73535 
3 1974-03-15      -1.20040 
4 1974-04-15      -1.00390 
5 1974-05-15      -0.62550 
6 1974-06-15      -0.36915 

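The filtering rule is to select temperatures greater than or equal to 0.5 degrees. In addition, the condition must hold for at least 5 consecutive months.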

I have already dropped the rows with temperatures below 0.5 degrees (see below).

for (i in 1) { 
el_nino=Nino3.4_1974_2000[which(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain >= 0.5),] 
} 

> head(el_nino) 
    Month_common    Nino3.4_degree_1974_2000_plain 
32 1976-08-15      0.5192000 
33 1976-09-15      0.8740000 
34 1976-10-15      0.8864501 
35 1976-11-15      0.8229501 
36 1976-12-15      0.7336500 
37 1977-01-15      0.9276500 

However, I still need to extract the stretches of at least 5 consecutive months. I hope someone can help me.

Is the difference between the rows of your 'Month_common' column *always* one month? – 2012-01-18 05:20:06

Yes, the spacing is one month. – 2012-01-18 05:25:04

Answers

4

If you can always rely on the spacing being one month, then let's set the time information aside for the moment:

temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain 

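Since consecutive temperatures in that vector are always one month apart, we just need to look for runs where temps[i] >= 0.5, and each run must be at least 5 long.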
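
Before throwing the dates away, the one-month spacing can be checked rather than assumed. The following is only a sketch: has_monthly_spacing is a made-up helper name, the months vector is a stand-in for Nino3.4_1974_2000$Month_common, and it assumes that column is of class Date with every entry on the same day of the month (the 15th here).

```r
# Hypothetical helper: TRUE when `dates` advances by exactly one calendar
# month from its first element, i.e. no month is missing or duplicated.
has_monthly_spacing <- function(dates) {
  expected <- seq(dates[1], by = "month", length.out = length(dates))
  all(dates == expected)
}

# Stand-in for Nino3.4_1974_2000$Month_common (assumed to be class Date).
months <- seq(as.Date("1974-01-15"), by = "month", length.out = 324)

complete_ok <- has_monthly_spacing(months)       # complete series
gap_ok      <- has_monthly_spacing(months[-10])  # one month removed
complete_ok
gap_ok
```

If the check fails, counting consecutive vector elements is no longer the same as counting consecutive months, and the gaps would need handling first.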

If we do the following:

ofinterest <- temps >= 0.5 

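we get a vector ofinterest with values TRUE FALSE FALSE TRUE TRUE ... and so on. It is TRUE where temps[i] >= 0.5 and FALSE otherwise.

To restate your problem, we just need to find the places where there are at least 5 TRUEs in a row.

To do this we can use the function rle. ?rle gives: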

> ?rle 
Description 
    Compute the lengths and values of runs of equal values in a vector 
    - or the reverse operation. 
Value: 
    ‘rle()’ returns an object of class ‘"rle"’ which is a list with 
    components:  
lengths: an integer vector containing the length of each run. 
    values: a vector of the same length as ‘lengths’ with the 
      corresponding values. 

So we use rle, which counts up all the streaks of consecutive TRUEs and consecutive FALSEs, and then look for streaks of at least 5 TRUEs in a row.
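
To make the rle output concrete, here is a tiny hand-made logical vector (the values are invented purely for illustration) and what rle reports for it:

```r
# Invented example: 2 TRUEs, 1 FALSE, 5 TRUEs, 1 FALSE.
x <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)

r <- rle(x)
r$lengths  # 2 1 5 1
r$values   # TRUE FALSE TRUE FALSE

# Which runs are TRUE and at least 5 long? Only the third run.
long_true <- which(r$lengths >= 5 & r$values)
long_true  # 3
```

Note that the result indexes the runs, not the original vector, which is why a conversion step back to positions in the data is still needed.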

I'll just make up some data to demonstrate:

# for you, temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain 
temps <- runif(1000) 

# make a vector that is TRUE when temperature is >= 0.5 and FALSE otherwise 
ofinterest <- temps >= 0.5 

# count up the runs of TRUEs and FALSEs using rle: 
runs <- rle(ofinterest) 

# we need to find points where runs$lengths >= 5 (i.e. at least 5 in a row), 
# AND runs$values is TRUE (so at least 5 'TRUE's in a row). 
streakIs <- which(runs$lengths>=5 & runs$values) 

# these are all the el_nino occurrences. 
# We need to convert `streakIs` into indices into our original `temps` vector. 
# To do this we add up all the `runs$lengths` before `streakIs[i]` and add 1; 
# that gives the index into `temps` where the streak starts. 
# that is: 
# startMonths <- c() 
# for (n in streakIs) { 
#  startMonths <- c(startMonths, sum(runs$lengths[seq_len(n-1)]) + 1) 
# } 
# 
# However, since this is R we can do the same thing with sapply. 
# seq_len(n-1) is used rather than 1:(n-1) because 1:0 is c(1, 0), which would 
# wrongly count the first run when a streak starts at the very first month. 
startMonths <- sapply(streakIs, function(n) sum(runs$lengths[seq_len(n-1)])+1) 

Now, if you do Nino3.4_1974_2000$Month_common[startMonths] you will get all the months in which an El Niño started.
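
If the end month and duration of each episode are wanted as well, cumsum over the run lengths gives them without a loop. This is only a sketch under assumptions: the data frame nino, its columns Month and Temp, and the temperature values are all invented stand-ins for Nino3.4_1974_2000.

```r
# Invented stand-in data: 14 months, warm in months 1-5 and 8-13.
nino <- data.frame(
  Month = seq(as.Date("1974-01-15"), by = "month", length.out = 14),
  Temp  = c(0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2,
            0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 0.3)
)

runs <- rle(nino$Temp >= 0.5)
keep <- runs$lengths >= 5 & runs$values   # TRUE runs of at least 5 months

# cumsum(lengths) is the last position of every run; subtracting the run
# length and adding 1 gives its first position.
ends   <- cumsum(runs$lengths)[keep]
starts <- ends - runs$lengths[keep] + 1

episodes <- data.frame(
  start  = nino$Month[starts],
  end    = nino$Month[ends],
  months = runs$lengths[keep]
)
episodes
```

Because the first position is derived from cumsum, a streak that begins in the very first row needs no special handling.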

It boils down to just a few lines:

runs <- rle(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain>=0.5) 
streakIs <- which(runs$lengths>=5 & runs$values) 
# seq_len(n-1) is empty when n is 1, so a streak in the first month starts at 1 
startMonths <- sapply(streakIs, function(n) sum(runs$lengths[seq_len(n-1)])+1) 
Nino3.4_1974_2000$Month_common[startMonths] 
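
If the goal is the rows themselves rather than only the start months, one possible approach (a sketch, not something the answer above does) is to switch off the runs that are too short inside the rle object and expand it back to one value per row with inverse.rle. The data frame nino, its columns Month and Temp, and the values are invented stand-ins for Nino3.4_1974_2000.

```r
# Invented stand-in data: only rows 4-9 form a warm run of 5 or more months.
nino <- data.frame(
  Month = seq(as.Date("1974-01-15"), by = "month", length.out = 14),
  Temp  = c(0.6, 0.7, 0.1, 0.6, 0.7, 0.8, 0.9,
            1.0, 1.1, 0.3, 0.9, 0.2, 0.6, 0.6)
)

runs <- rle(nino$Temp >= 0.5)

# Keep a run only if it is TRUE *and* at least 5 long.
runs$values <- runs$values & runs$lengths >= 5

# Expand the runs back into a row-wise logical mask and subset with it.
in_streak <- inverse.rle(runs)
el_nino   <- nino[in_streak, ]
el_nino
```

The short warm spells (rows 1-2, 11 and 13-14) pass the 0.5 threshold but are dropped because their runs are shorter than 5.
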

Thanks, it works great. – 2012-01-18 07:36:03

1

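Here is an approach that relies on the fact that the rows are always one month apart. The problem then reduces to finding at least 5 consecutive rows with temps >= 0.5 degrees: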

# Some sample data 
d <- data.frame(Month=1:20, Temp=c(rep(1,6),0,rep(1,4),0,rep(1,5),0, rep(1,2))) 
d 

# Use rle to find runs of temps >= 0.5 degrees 
x <- rle(d$Temp >= 0.5) 

# Then find the last row in each run of 5 or more 
y <- x$lengths>=5 # BUG HERE: See update below! 
lastRow <- cumsum(x$lengths)[y] 

# Finally, deduce the first row and make a result matrix 
firstRow <- lastRow - x$lengths[y] + 1L 
res <- cbind(firstRow, lastRow) 
res 
#      firstRow lastRow
# [1,]        1       6
# [2,]       13      17

UPDATE: I had a bug that also detected runs of 5 or more values below 0.5. Here is the updated code (and test data):

d <- data.frame(Month=1:20, Temp=c(rep(0,6),1,0,rep(1,4),0,rep(1,5),0, 1)) 
x <- rle(d$Temp >= 0.5) 
y <- x$lengths>=5 & x$values 
lastRow <- cumsum(x$lengths)[y] 
firstRow <- lastRow - x$lengths[y] + 1L 
res <- cbind(firstRow, lastRow) 
res 
#      firstRow lastRow
# [1,]       14      18
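
To make it easy to re-check both test data sets, the updated code can be wrapped in a small function. This is only a sketch: find_runs, threshold and min_len are names made up here, while the body is the updated code from this answer.

```r
# Hypothetical wrapper around the updated code above.
find_runs <- function(temp, threshold = 0.5, min_len = 5L) {
  x <- rle(temp >= threshold)
  y <- x$lengths >= min_len & x$values   # long enough AND above threshold
  lastRow  <- cumsum(x$lengths)[y]
  firstRow <- lastRow - x$lengths[y] + 1L
  cbind(firstRow, lastRow)
}

# The two test vectors used in this answer.
first_data  <- c(rep(1, 6), 0, rep(1, 4), 0, rep(1, 5), 0, rep(1, 2))
second_data <- c(rep(0, 6), 1, 0, rep(1, 4), 0, rep(1, 5), 0, 1)

res1 <- find_runs(first_data)    # rows 1-6 and 13-17
res2 <- find_runs(second_data)   # rows 14-18 only
res1
res2
```

The second vector starts with six values below 0.5, which is the case the original version wrongly reported as a run.
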

I don't know, it doesn't work properly, especially when the data starts with numbers less than .5. – 2012-01-18 06:51:49

@YuDeng - Oops, small bug. Updated the answer. – Tommy 2012-01-18 08:05:45

Thanks, it works well. – 2012-01-19 00:10:59