2013-11-03

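R data frame: fill in missing values based on other rows

I have a large dataset in which each station always has the same longitude and latitude. Some rows are missing lat and lon and say 'unknown' instead. I need to fill in those unknowns using the rows for the same station where the data is not missing.

In this example, I would like row 5 to get 3 and 8 inserted for lat and lon: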

> station <- c("a","b","c","c","c") 
> lat <- c("1","2","3","3","unknown") 
> lon <- c("6","7","8","8","unknown") 
> data.frame(station,lat,lon) 
  station     lat     lon
1       a       1       6
2       b       2       7
3       c       3       8
4       c       3       8
5       c unknown unknown

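My dataset has a million rows. It is fine if this takes a few minutes to complete, because it only runs once before the analysis begins. I would rather not install additional packages unless it is really necessary.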
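One way to do this in base R, without extra packages, is to build a lookup table of known coordinates per station and fill the unknown rows with `match()`. The following is a minimal sketch, not a definitive implementation. It assumes the columns are named `station`, `lat`, and `lon` as in the example, that missing values are the literal string "unknown" (not `NA`), and that each station has a single valid coordinate pair. The helper name `fill_unknown_coords` is made up for illustration.

```r
# Hypothetical helper: fill "unknown" coordinates from other rows of the same station.
fill_unknown_coords <- function(df) {
  cols <- c("station", "lat", "lon")
  # Work on character columns so factor levels do not get in the way
  df[cols] <- lapply(df[cols], as.character)

  # Rows where both coordinates are present
  known <- df$lat != "unknown" & df$lon != "unknown"

  # Lookup table: the first known row for each station
  lookup <- df[known, cols]
  lookup <- lookup[!duplicated(lookup$station), ]

  # For every row, the position of its station in the lookup (NA if none)
  idx <- match(df$station, lookup$station)

  # Only touch rows that are missing a coordinate and have a lookup entry
  fill <- !known & !is.na(idx)
  df$lat[fill] <- lookup$lat[idx[fill]]
  df$lon[fill] <- lookup$lon[idx[fill]]
  df
}

df <- data.frame(
  station = c("a", "b", "c", "c", "c"),
  lat = c("1", "2", "3", "3", "unknown"),
  lon = c("6", "7", "8", "8", "unknown")
)
filled <- fill_unknown_coords(df)
```

Because `match()` is vectorized, there is no explicit loop over stations. Stations that have no known row at all are left as "unknown".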

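Is this truly representative of your data? In other words, does your dataset really contain the word "unknown", or is it coded as `NA` (as it should be)? Are the values of `lat` and `lon` in the `data.frame` actually numeric, or are they character, as they are in this question? – A5C1D2H2I1M1N2O1R2T1

It says "unknown" in the original dataset, and the columns are factors. If needed, I can use as.numeric to turn those into NA. – John

Is your data ordered by station? Are you sure every station has at least one row with valid values? – agstudy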
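Both points raised in these comments can be checked directly. The sketch below assumes factor columns, as John describes, and uses made-up sample values and variable names. Note that `as.numeric()` applied to a factor returns the internal level codes, so going through `as.character()` first is what actually turns "unknown" into `NA`. The last step lists stations that have no valid row at all, which cannot be filled from other rows.

```r
# Illustrative data; station "d" has no valid coordinates at all
df <- data.frame(
  station = c("a", "b", "c", "c", "d"),
  lat = c("10", "20", "30", "unknown", "unknown"),
  lon = c("60", "70", "80", "unknown", "unknown"),
  stringsAsFactors = TRUE
)

# as.numeric() on a factor gives the level codes, not the labels
codes <- as.numeric(df$lat)

# Converting through character gives the numbers; "unknown" becomes NA
# (the coercion warning is expected here, so it is suppressed)
lat_num <- suppressWarnings(as.numeric(as.character(df$lat)))
lon_num <- suppressWarnings(as.numeric(as.character(df$lon)))

# Does each station have at least one valid latitude?
has_valid <- tapply(!is.na(lat_num), df$station, any)
no_valid <- names(has_valid)[!has_valid]
```

Here `codes` holds level positions rather than latitudes, and `no_valid` names the stations that would need coordinates from some other source.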

Answers

2

Something like this, maybe:

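# Use character columns so comparisons and assignments behave predictably
df$station <- as.character(df$station)
df$lat <- as.character(df$lat)
df$lon <- as.character(df$lon)

# Stations that have at least one "unknown" row
unknownstations <- unique(subset(df, lat == "unknown", "station"))
# Known coordinates for those stations
unknownstationscoords <- unique(subset(df, station %in% unknownstations$station & lat != "unknown"))

for (i in unknownstations$station) {
  # Take the first known coordinate pair for this station
  known <- subset(unknownstationscoords, station == i)
  df[df$station == i, "lat"] <- known$lat[1]
  df[df$station == i, "lon"] <- known$lon[1]
}

0
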
y <- function(station, lat, lon) {
  temp <- cbind(station, lat, lon)
  lat_ind <- lat != "unknown"
  lon_ind <- lon != "unknown"

  # Rows where both coordinates are known
  known <- lat_ind & lon_ind

  # One reference row per station, taken from the known rows
  hash <- unique(temp[known, , drop = FALSE])

  # For every row, the position of its station in hash (NA if none)
  idx <- match(station, hash[, "station"])

  # Fill rows that miss a coordinate and whose station has a reference row
  fill <- !known & !is.na(idx)
  temp[fill, ] <- hash[idx[fill], , drop = FALSE]

  temp
}




##case1 

station <- c("a","b","c","c","c") 
lat <- c("1","2","3","3","unknown") 
lon <- c("6","7","8","8","unknown") 

y(station,lat,lon) 
# station lat lon 
# [1,] "a"  "1" "6" 
# [2,] "b"  "2" "7" 
# [3,] "c"  "3" "8" 
# [4,] "c"  "3" "8" 
# [5,] "c"  "3" "8" 


##case2 

station <- c("a","b","c","c","c") 
lat <- c("1","2","3","3","3") 
lon <- c("6","7","8","8","unknown") 
y(station,lat,lon) 
# station lat lon 
# [1,] "a"  "1" "6" 
# [2,] "b"  "2" "7" 
# [3,] "c"  "3" "8" 
# [4,] "c"  "3" "8" 
# [5,] "c"  "3" "8" 


##case3 

station <- c("a","b","c","c","c") 
lat <- c("1","2","3","3","unknown") 
lon <- c("6","7","8","8","8") 
y(station,lat,lon) 
# station lat lon 
# [1,] "a"  "1" "6" 
# [2,] "b"  "2" "7" 
# [3,] "c"  "3" "8" 
# [4,] "c"  "3" "8" 
# [5,] "c"  "3" "8" 

If you plan to set "unknown" to NA, just replace lat_ind = lat != "unknown" with lat_ind = !is.na(lat), and lon_ind = lon != "unknown" with lon_ind = !is.na(lon). –

Also, if lat, lon, and station are factors, call my function as y(as.character(station), as.character(lat), as.character(lon)). –

2

I would use na.locf from the zoo package. First I would change "unknown" to NA, then apply na.locf:

> library(zoo) 
> df[ df=="unknown"] <- NA 
> df2 <- do.call(rbind, lapply(split(df, df$station), na.locf)) 
> df2[, -1] <- sapply(df2[, -1], function(x) as.numeric(as.character(x))) # numeric variables should be numeric; going through character avoids factor level codes
> df2 
    station lat lon
a         a   1   6
b         b   2   7
c.3       c   3   8
c.4       c   3   8
c.5       c   3   8

If you want to change the rownames, use rownames and assign new names:

> rownames(df2) <- 1:nrow(df2) 
> df2 
  station lat lon
1       a   1   6
2       b   2   7
3       c   3   8
4       c   3   8
5       c   3   8

Fixed it! ;) Sunday + no coffee = mixing up packages –
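One caveat with a carry-forward approach such as na.locf: a missing row can only be filled if a known row for the same station comes before it, so unknowns at the start of a station's block have nothing to carry forward. A possible way to prepare the data, sketched here in base R with made-up sample values, is to sort so that known rows come first within each station. This assumes "unknown" has already been converted to `NA`.

```r
# Illustrative data where station "c" starts with a missing row
df <- data.frame(
  station = c("c", "a", "c", "b", "c"),
  lat = c(NA, 1, 3, 2, NA),
  lon = c(NA, 6, 8, 7, NA)
)

# Sort by station, then put rows with known lat (FALSE) before missing ones (TRUE)
ord <- order(df$station, is.na(df$lat))
sorted <- df[ord, ]
```

After this step every station's block starts with a known row, provided the station has one at all, so a per-station carry-forward has a value to propagate.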