2014-12-28 31 views

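My question is closely related to "Connecting across missing values with geom_line", but it is a follow-up rather than a duplicate: I want to connect points across selected runs of NAs with geom_line().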

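I have data with missing values (NA). The data have been melted to long format with the reshape2 package, and I plot them with ggplot2 using geom_point() and geom_line(). The example data contain only one group; my real data contain several. I want to draw a geom_line() that connects data points with no more than 4 years of missing data between them. In other words, if there are 3 adjacent NAs, apply na.rm to the data frame, whereas if there are at least 4 adjacent NA rows, do not apply na.rm to the data.frame.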

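Edit: Note: I am replicating a figure from a book, in which the points are connected even where data are missing. The segments that bridge missing data could use a different linetype or colour, with a note in the legend to explain it.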

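Below is a very tedious and ugly hack that will not scale to larger amounts of data. I would appreciate a simpler approach, and in particular a simple way to count runs of consecutive NAs in the data.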

### ggplot draws geom_line with NAs 

# Data (real-world example, so not exactly MWE) 
df <- 
structure(list(Year = c(1910, 1911, 1912, 1913, 1914, 1915, 1916, 
1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 
1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 
1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 
1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 
1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 
1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 
1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 
2005, 2006, 2007, 2008, 2009, 2010), variable = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L), .Label = c("France", "Germany", "Sweden", "Japan" 
), class = c("ordered", "factor")), value = c(0.1724, 0.1748, 
0.1752, 0.1777, 0.1778, 0.1953, 0.2132, 0.2242, 0.222, 0.1947, 
NA, NA, NA, NA, NA, 0.113, 0.113, 0.115, 0.112, 0.111, NA, NA, 
0.114, 0.109, 0.113, 0.12, 0.137, 0.15, 0.163, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, 0.116, NA, NA, NA, NA, NA, NA, 0.11, 
NA, NA, NA, 0.122, NA, NA, NA, 0.122, NA, NA, 0.112, NA, NA, 
0.113, NA, NA, 0.101, NA, NA, 0.102, NA, NA, 0.1043, NA, NA, 
0.0906, NA, NA, 0.0964, NA, NA, 0.1052, NA, NA, 0.1043, NA, NA, 
0.1005, NA, NA, 0.1088, NA, NA, 0.101139312657167, 0.0950290025146689, 
0.0901042749371333, 0.09, 0.107249622799665, 0.108891198658843, 
0.115913495389774, 0.110684772282761, 0.113299133836267, 0.111991953059514 
)), .Names = c("Year", "variable", "value"), row.names = 102:202, class = "data.frame") 

Default plot:

library("ggplot2") 
ggplot(data = df, aes(x = Year, y = value, group = variable, colour = variable, shape = variable)) + 
    geom_point(size = 3) + geom_line() 


Plot with all NAs removed (see Connecting across missing values with geom_line):

ggplot(data = df, aes(x = Year, y = value, group = variable, colour = variable, shape = variable)) + 
    geom_point(size = 3) + geom_line(data = df[!is.na(df$value), ]) 


Desired plot:

df2 <- df 
df2[df2$Year == 1922, ]$value <- "-999999" 
df2[df2$Year == 1948, ]$value <- "-999999" 
df2 <- df2[!is.na(df2$value), ] 
df2$value <- as.numeric(df2$value) 
ggplot(data = df2, aes(x = Year, y = value, group = variable, colour = variable, shape = variable)) + geom_point(size = 3) + 
    geom_line() + scale_y_continuous(limits = c(.08, .23)) 



Your desired plot is not consistent with your rule. The point at 1950 should be isolated, because 1939-1949 are NA and so are 1951-1956. Both are sequences of more than 3 NAs. – jlhoward

Answers


This produces your "desired plot", except for the discrepancy noted in the comment.

x <- rle(!is.na(df$value)) 
x$values[which(x$lengths>3 & !x$values)] <- TRUE 
indx <- inverse.rle(x) 
library(ggplot2) 
ggplot(df[indx,],aes(x=Year,y=value,color=variable))+ 
    geom_point(size=3)+ 
    geom_line() 

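Basically, we encode NA as FALSE and everything else as TRUE, then run-length encode the result to identify the runs of TRUE and FALSE. Any run of FALSE longer than 3 should be kept, so we convert those runs to TRUE (as if they were not NA). We then use inverse.rle to recover an index vector that is TRUE for every row that should be kept. Finally, we use this vector to subset df for ggplot. Because the rows of the long NA runs stay in the data, geom_line breaks there, while the rows of the short NA runs are dropped, so the line is drawn across them.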
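To make the steps concrete, here is a minimal sketch of the same logic on a toy vector. The vector is made up for illustration and is not part of the question's data; it contains one short gap (2 NAs) and one long gap (4 NAs).

```r
# Toy vector (hypothetical): a short gap of 2 NAs and a long gap of 4 NAs
value <- c(1, NA, NA, 2, 3, NA, NA, NA, NA, 4)

# TRUE = observed, FALSE = NA; rle() collapses this into runs
runs <- rle(!is.na(value))
# runs$lengths: 1 2 2 4 1
# runs$values:  TRUE FALSE TRUE FALSE TRUE

# Flag NA runs longer than 3 as TRUE so that their rows are kept
runs$values[runs$lengths > 3 & !runs$values] <- TRUE

# Expand the runs back into one logical value per row
keep <- inverse.rle(runs)
keep
# TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
```

Only the two rows of the short gap are dropped. The four NA rows of the long gap remain, which is what makes geom_line leave a break there.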


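Very nice, and thanks for the explanation: I had not heard of the `rle` function before, and it will be very useful. You also spotted the inconsistency in my verbal description of the selection rule! – PatrickT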
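The question mentions that the real data contain several groups. One possible extension, sketched here under assumptions, is to apply the same run-length logic within each group so that NA runs are not merged across group boundaries. The helper name keep_rows, its max_gap argument, and the toy data frame are all hypothetical and do not come from the original answer. The sketch also assumes the rows are already sorted by Year within each group.

```r
# Hypothetical helper: TRUE for rows to keep, i.e. observed values
# and NAs that belong to a run longer than max_gap
keep_rows <- function(v, max_gap = 3) {
  runs <- rle(!is.na(v))
  runs$values[runs$lengths > max_gap & !runs$values] <- TRUE
  inverse.rle(runs)
}

# Toy data with two groups (made up for illustration)
toy <- data.frame(
  Year     = rep(2001:2008, times = 2),
  variable = rep(c("A", "B"), each = 8),
  value    = c(1, NA, NA, 2, NA, NA, NA, NA,
               NA, NA, NA, NA, 3, NA, 4, 5)
)

# ave() applies the helper within each group and returns the results
# in the original row order (as 0/1, hence as.logical)
keep <- as.logical(ave(toy$value, toy$variable, FUN = keep_rows))
toy_kept <- toy[keep, ]
```

The filtered toy_kept could then be passed to ggplot with group = variable, in the same way as df[indx,] in the answer.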