2017-03-28 25 views
6

Number of companies per year using dplyr or data.table

Let's say I have a data frame:

df <- data.frame(City = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"), 
       YearFrom = c("2001", "2003", "2002", "2006", "2008", "2004", "2005", "2005", "2002"), 
       YearTo = c(NA, "2005", NA, NA, "2009", NA, "2008", NA, NA)) 

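where YearFrom is the year in which, for example, a company was established, and YearTo is the year in which it was cancelled. If YearTo is NA, the company is still active.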

I want to calculate the number of companies for each year.

The table should look like this:

City | Year | Count
"NY" | 2001 | 1
"NY" | 2002 | 2
"NY" | 2003 | 3
"NY" | 2004 | 3
"NY" | 2005 | 2
"NY" | 2006 | 3
"NY" | 2007 | 3
"NY" | 2008 | 4
"NY" | 2009 | 3
"LA" | 2001 | 0
"LA" | 2002 | 1
"LA" | 2003 | 1
"LA" | 2004 | 2
"LA" | 2005 | 4
"LA" | 2006 | 4
"LA" | 2007 | 4
"LA" | 2008 | 2
"LA" | 2009 | 2

I would like to solve this with the dplyr or data.table package, but I can't figure out how.

+0

Should the year of cancellation be included or excluded? – lmo

+0

It should be excluded. I think that is the correct way. – Mislav

Answers

7

A shorter tidyverse solution.

# First, some data prep
library(dplyr)
library(tidyr)
library(purrr)

df <- mutate(df,
    YearFrom = as.numeric(as.character(YearFrom)),      # Fix year coding
    YearTo = as.numeric(as.character(YearTo)),
    # Replace NA with max year + 1, so that still-active companies
    # are counted in the last observed year after the `YearTo - 1` below
    YearTo = coalesce(YearTo, max(c(YearFrom, YearTo), na.rm = TRUE) + 1))

df %>%
    mutate(Years = map2(YearFrom, YearTo - 1, `:`)) %>%   # Find all years
    unnest(Years) %>%                                     # Spread over rows
    count(Years, City) %>%                                # Count them
    complete(City, Years, fill = list(n = 0))             # Add in zeros, if needed
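
As a side note on the "find all years" step, here is a minimal base-R sketch of what the `:` expansion produces for each company. The vectors `year_from` and `last_year` are hypothetical toy values (not the question's data), and `Map()` stands in for `purrr::map2()`. It also shows one pitfall worth guarding against: when a company opens and closes in the same year, `YearTo - 1` is smaller than `YearFrom`, and `:` counts down rather than returning nothing.

```r
# Hypothetical start years and last active years (YearTo - 1) for three companies;
# the third opens and closes in the same year, so its last year precedes its start
year_from <- c(2001, 2003, 2005)
last_year <- c(2003, 2004, 2004)

# Map() is base R's counterpart to purrr::map2(): one sequence per company
years <- Map(`:`, year_from, last_year)
# For the third company `:` counts down, producing 2005 and 2004

# A guarded variant that yields an empty vector for such rows instead
safe_years <- Map(function(from, to) if (from <= to) from:to else numeric(0),
                  year_from, last_year)
```

Whether such rows occur depends on the real data; the sample data in the question contains none.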
+0

Concise and effective. Nice. – www

+0

All the answers are correct, but I find yours the most intuitive. The data.table way is faster, but hopefully my data set is not that big. – Mislav

+0

The last line can produce NA for City. I don't know why. When I restricted Years to be >= 2001, it worked fine. – Mislav

2

This solution uses dplyr and tidyr.

library(dplyr) 
library(tidyr) 

df %>% 
    # Change YearFrom and YearTo to numeric 
    mutate(YearFrom = as.numeric(as.character(YearFrom)), 
     YearTo = as.numeric(as.character(YearTo))) %>% 
    # Replace NA with 2017 in YearTo 
    mutate(YearTo = ifelse(is.na(YearTo), 2017, YearTo)) %>% 
    # All number in YearTo minus 1 to exclude the year of cancellation 
    mutate(YearTo = YearTo - 1) %>% 
    # Group by row 
    rowwise() %>% 
    # Create a tbl for each row, expand the Year column based on YearFrom and YearTo 
    do(tibble(City = .$City, Year = seq(.$YearFrom, .$YearTo, by = 1))) %>% 
    ungroup() %>% 
    # Count the number of each City and Year 
    count(City, Year) %>% 
    # Rename the column n to Count 
    rename(Count = n) %>% 
    # Spread the data frame to expose the implicitly missing value for LA, 2001 
    spread(Year, Count) %>% 
    # Gather the data frame back, converting Year from column names to integer 
    gather(Year, Count, -City, convert = TRUE) %>% 
    # Replace NA with 0 in Count 
    mutate(Count = ifelse(is.na(Count), 0L, Count)) %>% 
    # Arrange the data 
    arrange(desc(City), Year) %>% 
    # Filter the data until 2009 
    filter(Year <= 2009) 
4

Here is an answer using data.table. The data preparation is at the bottom.

# get list of businesses, one obs per year of operation 
cityList <- lapply(seq_len(nrow(df)), 
       function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))]) 

# combine to a single data.table 
dfNew <- rbindlist(cityList) 

# get counts 
dfNew <- dfNew[, .(Count=.N), by=.(City, Year)] 

Written as a single line, this is

# get the counts 
rbindlist(lapply(seq_len(nrow(df)), 
      function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))]))[, .(Count=.N), 
    by=.(City, Year)] 

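Here, lapply runs through each row and constructs a data.table with the City value repeated as one column and the years of operation as a second column. YearTo is decremented so that the closing year is not included. Note that in the data preparation, missing values are set to 2018, so the current year (2017) is included.

lapply returns a list of data.tables, which rbindlist combines into a single data.table. This data.table is then aggregated to City-Year pairs, with the counts computed using .N.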
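
To make the expand-then-count idea concrete, here is a minimal sketch in base R rather than data.table, so it runs without any packages. The `toy` data frame is a hypothetical two-company example, not the question's data; `do.call(rbind, ...)` and `aggregate()` stand in for `rbindlist()` and the `.N` aggregation.

```r
# Hypothetical toy input: two companies in one city
toy <- data.frame(City = c("NY", "NY"),
                  YearFrom = c(2001L, 2002L),
                  YearTo = c(2003L, 2004L))

# One data.frame per row: City repeated, one row per year of operation
# (YearTo - 1 drops the closing year)
pieces <- lapply(seq_len(nrow(toy)), function(i) {
  data.frame(City = toy$City[i],
             Year = seq(toy$YearFrom[i], toy$YearTo[i] - 1L))
})

# Stack the pieces, then count rows per City-Year pair
expanded <- do.call(rbind, pieces)
counts <- aggregate(Count ~ City + Year,
                    data = transform(expanded, Count = 1L), FUN = sum)
```

The first company contributes 2001 and 2002, the second 2002 and 2003, so only 2002 ends up with a count of two.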

These return

City Year Count 
1: NY 2001  1 
2: NY 2002  2 
3: NY 2003  3 
4: NY 2004  3 
5: NY 2005  2 
6: NY 2006  3 
7: NY 2007  3 
    ... 
26: LA 2012  3 
27: LA 2013  3 
28: LA 2014  3 
29: LA 2015  3 
30: LA 2016  3 
31: LA 2017  3 
32: LA 2002  1 
33: LA 2003  1 

Data

library(data.table) 
setDT(df) 
# convert string years to integers 
df[, grep("Year", names(df), value=TRUE) := 
    lapply(.SD, function(x) as.integer(as.character(x))), .SDcols=grep("Year", names(df))] 
# replace NA values with 2018 (to include 2017 in count) 
df[is.na(YearTo), YearTo := 2018] 
+0

Nice. But why is the count for New York in 2005 '3' rather than '2'? – www

+0

Ah. My code included the cancellation year. It would be worth clarifying whether that is desired. – lmo

+0

@ycw. After consulting with the OP, I adjusted my code to exclude the closing year but include the current year. – lmo

6

First, clean the data...

library(data.table) 
setDT(df) 

# year() comes from data.table, so the package is loaded first 
curr_year = as.integer(year(Sys.Date())) 
df[, YearTo := as.integer(as.character(YearTo)) ] 
df[, YearFrom := as.integer(as.character(YearFrom)) ] 
df[, quasiYearTo := YearTo ] 
df[is.na(YearTo), quasiYearTo := curr_year ] 

Then, a non-equi join:

df[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE), 
    on=.(City, YearFrom <= Year, quasiYearTo > Year), allow.cartesian = TRUE, 
    .N 
, by=.EACHI][, .(City, Year = YearFrom, N)] 

    City Year N 
1: LA 2001 0 
2: LA 2002 1 
3: LA 2003 1 
4: LA 2004 2 
5: LA 2005 4 
6: LA 2006 4 
7: LA 2007 4 
8: LA 2008 3 
9: LA 2009 3 
10: NY 2001 1 
11: NY 2002 2 
12: NY 2003 3 
13: NY 2004 3 
14: NY 2005 2 
15: NY 2006 3 
16: NY 2007 3 
17: NY 2008 4 
18: NY 2009 3 
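
The join conditions can be cross-checked with a brute-force sketch in base R. This is not the answer's method, just the same rule spelled out: a company counts for a given year if `YearFrom <= Year` and it is either still active (`NA`) or has `YearTo > Year`. The helper `count_active` is a hypothetical name introduced for illustration, and `expand.grid()` plays the role of `CJ()`.

```r
# The question's data, with years as integers and NA meaning "still active"
df <- data.frame(
  City = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
  YearFrom = c(2001L, 2003L, 2002L, 2006L, 2008L, 2004L, 2005L, 2005L, 2002L),
  YearTo = c(NA, 2005L, NA, NA, 2009L, NA, 2008L, NA, NA)
)

# Hypothetical helper: number of companies active in `city` during `year`
count_active <- function(city, year) {
  started <- df$YearFrom <= year
  not_closed <- is.na(df$YearTo) | df$YearTo > year
  sum(df$City == city & started & not_closed)
}

# Every City-Year combination, so that years with zero companies appear too
grid <- expand.grid(Year = 2001:2009, City = c("LA", "NY"),
                    stringsAsFactors = FALSE)
grid$N <- mapply(count_active, grid$City, grid$Year, USE.NAMES = FALSE)
```

This scans every company for every City-Year pair, so it is only meant as a check on small data.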
+0

Sorry for changing my mind, but this is the best answer. I had some problems with the other approaches! – Mislav