Counting instances

I have a large data table, Divvy (over 2.4 million records; some columns removed here):

X trip_id  from_station_id.x to_station_id.x 
1 1109420  94     69 
2 1109421  69     216 
3 1109427  240     245 
4 1109431  113     94 
5 1109433  127     332 
3 1109429  240     245

I want to find the number of trips from each station to each other station. So, for example,

From X  To Y  Sum 
94   69  1 
240  245  2

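and so on. I then want to use dplyr to produce something like the table below, join it back to the initial table, and limit it to distinct from_station_id/to_station_id combinations, which I will use to map routes (I have lat/long for each station):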

X trip_id  from_station_id.x to_station_id.x Sum 
1 1109420  94     69    1 
2 1109421  69     216    1 
3 1109427  240     245    2 
4 1109431  113     94    1 
5 1109433  127     332    1 
3 1109429  240     245    1

I have had some success using count to get part of the way there, for example:

count(Divvy$from_station_id.x==94 & Divvy$to_station_id.x == 69) 
    x freq 
1 FALSE 2454553 
2 TRUE  81

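However, this is obviously labor-intensive, since there are 300 unique stations and over 44k possible combinations. I created a helper table so that I could loop over it.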

n <- select(Divvy, from_station_id.x) 

    from_station_id.x 
1    94     
2    69     
3    240    
4    113    
5    113    
6    127    

    count(Divvy$from_station_id.x==n[1,1] & Divvy$to_station_id.x == n[2,1]) 

     x freq 
1 FALSE 2454553 
2 TRUE  81

I feel like a loop such as

output <- matrix(ncol=variables, nrow=iterations) 


output <- matrix() 
for(i in 1:n)(output[i, count(Divvy$from_station_id.x==n[1,1] & Divvy$to_station_id.x == n[2,1]))

should work, but thinking about it, it would still only return 300 rows rather than 44k, so it would then have to go back and do n[2] & n[1], etc...

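I suspect there is a faster dplyr solution that would return the count for each combination and append it directly, without the extra steps or helper table, but I haven't found it yet.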

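I'm fairly new to R. I have searched around and think I'm close, but I can't connect the last dot of joining the result back to Divvy. Any help appreciated.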


2015-04-03 ike

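I tried all three of these solutions, and I have to say they all got the sums right and worked wonderfully. I went with the dplyr option as the "best" one because it gave me the limited number of rows I wanted, but I think the data.table option may be the most elegant. – ike 2015-04-10 00:33:45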

Also: if anyone else wants to see or use the original data set, it is available at http://www.divvybikes.com/data – ike 2015-04-10 00:41:31

Since you said you want to "limit it to distinct from_station_id/to_combos", the following code seems to give you what you are after. Your data is called mydf here.

library(dplyr) 
group_by(mydf, from_station_id.x, to_station_id.x) %>% 
count(from_station_id.x, to_station_id.x) 

# from_station_id.x to_station_id.x n 
#1    69    216 1 
#2    94    69 1 
#3    113    94 1 
#4    127    332 1 
#5    240    245 2
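
If the per-route count is also needed on every trip row, as in the question's second table, one possible sketch is to count once and join the result back. The sample rows are copied from the question; the column name Sum and the object names route_counts and with_counts are assumptions made for illustration, and the name argument of count() needs a reasonably recent dplyr.

```r
library(dplyr)

# Sample rows copied from the question.
mydf <- data.frame(
  X = c(1, 2, 3, 4, 5, 3),
  trip_id = c(1109420, 1109421, 1109427, 1109431, 1109433, 1109429),
  from_station_id.x = c(94, 69, 240, 113, 127, 240),
  to_station_id.x = c(69, 216, 245, 94, 332, 245)
)

# One row per distinct from/to pair, with the number of trips in "Sum"
# (hypothetical column name).
route_counts <- count(mydf, from_station_id.x, to_station_id.x, name = "Sum")

# Join the counts back so that every trip row carries its route's count.
with_counts <- left_join(mydf, route_counts,
                         by = c("from_station_id.x", "to_station_id.x"))
```

route_counts is the table of distinct routes to plot, while with_counts keeps one row per trip.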


2015-04-04 00:42:06 jazzurro

I ended up using this as: counts4 <- group_by(divvydata, trip_id, from_station_id.x, to_station_id.x) %>% count(from_station_id.x, to_station_id.x, From_Station_Lat, From_Station_Long, End_Station_Lat, End_Station_Long) – ike 2015-04-10 00:34:24

@ike I'm glad you found your own solution based on this suggestion. :) – jazzurro 2015-04-10 14:04:57

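I'm not entirely sure this is the result you are looking for, but it counts the number of trips that share the same origin and destination. Feel free to comment and let me know if this is not the end result you expected.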

dat <- read.table(text="X trip_id  from_station_id.x to_station_id.x 
1 1109420  94     69 
2 1109421  69     216 
3 1109427  240     245 
4 1109431  113     94 
5 1109433  127     332 
3 1109429  240     245", header=TRUE) 

dat$from.to <- paste(dat$from_station_id.x, dat$to_station_id.x, sep="-") 
freqs <- as.data.frame(table(dat$from.to)) 
names(freqs) <- c("from.to", "sum") 
dat2 <- merge(dat, freqs, by="from.to") 
dat2 <- dat2[order(dat2$trip_id),-1]

Result

dat2 

# X trip_id from_station_id.x to_station_id.x sum 
# 6 1 1109420    94    69 1 
# 5 2 1109421    69    216 1 
# 3 3 1109427    240    245 2 
# 4 3 1109429    240    245 2 
# 1 4 1109431    113    94 1 
# 2 5 1109433    127    332 1
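
As a possible variation on the same idea, base R's ave() can attach the group size to each row directly, without building a pasted key or merging, and it leaves the original row order untouched. This is only a sketch on the sample rows above, not a replacement for the code in the answer.

```r
# Sample rows from the question.
dat <- read.table(text = "X trip_id from_station_id.x to_station_id.x
1 1109420 94 69
2 1109421 69 216
3 1109427 240 245
4 1109431 113 94
5 1109433 127 332
3 1109429 240 245", header = TRUE)

# For each row, the number of rows sharing its from/to pair.
dat$sum <- ave(dat$trip_id,
               dat$from_station_id.x, dat$to_station_id.x,
               FUN = length)
```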


2015-04-03 23:53:07

This works really well, thanks. I created dat with read.csv instead, so that I could import the file directly and skip some of the other steps. Thanks. – ike 2015-04-10 00:41:02

#Here is the data.table solution, which is useful if you are working with large data: 
library(data.table) 
setDT(DF)[,sum:=.N,by=.(from_station_id.x,to_station_id.x)][] #DF is your dataframe 

    X trip_id from_station_id.x to_station_id.x sum 
1: 1 1109420    94    69 1 
2: 2 1109421    69    216 1 
3: 3 1109427    240    245 2 
4: 4 1109431    113    94 1 
5: 5 1109433    127    332 1 
6: 3 1109429    240    245 2
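
If only one row per distinct from/to pair is wanted (the form the question plans to use for mapping routes), a data.table aggregation along these lines should work. The object names DT and route_counts are assumptions for illustration, and the rows are the sample from the question.

```r
library(data.table)

# Sample rows from the question.
DT <- data.table(
  X = c(1, 2, 3, 4, 5, 3),
  trip_id = c(1109420, 1109421, 1109427, 1109431, 1109433, 1109429),
  from_station_id.x = c(94, 69, 240, 113, 127, 240),
  to_station_id.x = c(69, 216, 245, 94, 332, 245)
)

# Aggregate instead of assigning by reference: one row per route,
# with the number of trips on that route in "sum".
route_counts <- DT[, .(sum = .N), by = .(from_station_id.x, to_station_id.x)]
```

Unlike the := form, this leaves DT unchanged and returns a new, smaller table.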


2015-04-04 00:09:23 Metrics

This is a nice solution. – 2015-04-04 07:15:35

This is beautiful, thank you. – ike 2015-04-10 00:33:58
