2015-10-26 21 views
0

Return a column if a value falls within a range

I have two different data frames:

str(drivenum) 
'data.frame': 95841 obs. of 7 variables: 
$ team: chr "SF" "ATL" "SF" "ATL" ... 
$ year: int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ... 
$ opp : chr "ATL" "SF" "ATL" "SF" ... 
$ drvn: int 1 2 3 4 5 6 7 8 9 10 ... 
$ fpid: int 2 12 19 23 36 40 54 58 66 71 ... 
$ lpid: num 9 17 22 34 39 52 57 64 70 75 ... 
$ pts : num 6 3 0 3 0 3 0 3 0 6 ... 

str(drivedata) 
'data.frame': 669217 obs. of 7 variables: 
$ team: chr "SF" "SF" "SF" "SF" ... 
$ year: int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ... 
$ opp : chr "ATL" "ATL" "ATL" "ATL" ... 
$ pid : int 1 2 3 4 5 6 7 8 9 10 ... 
$ dwn : int 0 1 2 1 2 1 1 1 2 0 ... 
$ ytg : int 0 10 9 10 6 10 10 6 4 0 ... 
$ yfog: int 0 26 27 37 41 60 70 94 96 0 ... 

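I want to return drivenum$drvn whenever drivedata$pid falls within the range between drivenum$fpid and drivenum$lpid, but because the two data frames have different sizes I am running into problems. Does anyone have an idea?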

+1

You could check 'foverlaps' from 'library(data.table)'. – akrun
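As a rough sketch of what the foverlaps suggestion might look like on a small example: the toy data and the helper columns start and end are assumptions made for illustration, not part of the question. foverlaps() expects an interval in both tables, so each pid is treated as the one-point interval [pid, pid], and the table holding the ranges must be keyed on its interval columns.

```r
library(data.table)

# Toy data (assumed for illustration); integer columns throughout.
drivenum <- data.table(fpid = c(2L, 12L, 19L, 23L, 36L),
                       lpid = c(9L, 17L, 22L, 34L, 39L),
                       drvn = 1:5)
drivedata <- data.table(pid = 1:20)

# Hypothetical helper columns: every play becomes the interval [pid, pid].
drivedata[, `:=`(start = pid, end = pid)]

# The range table must be keyed; the last two key columns are the interval.
setkey(drivenum, fpid, lpid)

# type = "within" keeps matches where [start, end] lies inside [fpid, lpid];
# nomatch = NA keeps plays that belong to no drive.
res <- foverlaps(drivedata, drivenum,
                 by.x = c("start", "end"),
                 by.y = c("fpid", "lpid"),
                 type = "within", nomatch = NA)

res[order(pid), .(pid, drvn)]
```

With the full data, grouping columns such as team, year and opp could presumably be placed in front of the interval columns in both by.x and by.y, since foverlaps() treats only the last two columns as the interval.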

Answers

0

You can use which to find the rows of drivenum that apply to a given value of drivedata$pid:

drivenum <- data.frame(fpid = c(2, 12, 19, 23, 36), 
         lpid = c(9, 17, 22, 34, 39), 
         drvn = c(1, 2, 3, 4, 5)) 

drivedata <- data.frame(pid = 1:20) 

drvn.list <- sapply(drivedata$pid, 
        function(x){ drivenum$drvn[which((drivenum$fpid <= x) & (x <= drivenum$lpid))]}) 

> drvn.list 
[[1]] 
numeric(0) 

[[2]] 
[1] 1 

[[3]] 
[1] 1 

[[4]] 
[1] 1 

[[5]] 
[1] 1 

[[6]] 
[1] 1 

[[7]] 
[1] 1 

[[8]] 
[1] 1 

[[9]] 
[1] 1 

[[10]] 
numeric(0) 

[[11]] 
numeric(0) 

[[12]] 
[1] 2 

[[13]] 
[1] 2 

[[14]] 
[1] 2 

[[15]] 
[1] 2 

[[16]] 
[1] 2 

[[17]] 
[1] 2 

[[18]] 
numeric(0) 

[[19]] 
[1] 3 

[[20]] 
[1] 3 

> 
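Since the question asks for a column rather than a list, the list could be collapsed as in the following sketch. It assumes that each pid matches at most one drive, and it uses NA for plays that match none; the column name drvn on drivedata is chosen here for illustration.

```r
drivenum <- data.frame(fpid = c(2, 12, 19, 23, 36),
                       lpid = c(9, 17, 22, 34, 39),
                       drvn = c(1, 2, 3, 4, 5))

drivedata <- data.frame(pid = 1:20)

drvn.list <- sapply(drivedata$pid,
                    function(x){ drivenum$drvn[which((drivenum$fpid <= x) & (x <= drivenum$lpid))]})

# Take the single match if there is one, otherwise NA
# (assumption: the ranges do not overlap, so there is never more than one).
drivedata$drvn <- vapply(drvn.list,
                         function(v) if (length(v) > 0) v[1] else NA_real_,
                         numeric(1))

head(drivedata, 12)
```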

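Here is an alternative solution that works if

  • for each value of drivedata$pid there is at most one matching value of drivenum$drvn, and

  • drivenum$fpid is in increasing order, i.e. drivenum$fpid[i] < drivenum$fpid[j] whenever i < j, and likewise for drivenum$lpid.

    Although it contains a loop, it is faster, so loops are not always that bad. The example uses a drivenum with 8000 rows and a drivedata with 60000 rows.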

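    # Start with an empty integer vector for every row of drivedata.
    drvn.list.2 <- lapply(as.list(as.integer(rep(0,nrow(drivedata)))),head,0) 
    # pos[p] is the row of drivedata whose pid equals p (NA if there is none);
    # this relies on the pid values being unique.
    pos <- rep(NA,max(drivenum$lpid)) 
    pos[drivedata$pid] <- 1:nrow(drivedata) 
    
    for (i in 1:nrow(drivenum)) 
    { 
        # fpid is increasing, so no later range can contain any pid.
        if (max(drivedata$pid)<drivenum$fpid[i]) { break } 
    
        # Write drvn[i] into every row whose pid lies in fpid[i]:lpid[i].
        drvn.list.2[pos[drivenum$fpid[i]:drivenum$lpid[i]]] <- 
        drivenum$drvn[i] 
    } 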
    

    Speed comparison:

    #--------------------------------------------------------- 
    # Generate example data: 
    
    set.seed(1) 
    
    n <- 8000 
    d1 <- sample(1:3,n,replace=TRUE) 
    d2 <- sample(1:10,n,replace=TRUE) 
    
    drivenum <- data.frame(fpid = cumsum(d1+(c(0,d2)[-n])), 
             lpid = cumsum(d1+d2), 
             drvn = sample(1:n)) 
    
    drivedata <- data.frame(pid = sample(1:60000)) 
    
    #---------------------------------------------------------- 
    # Speed comparison: 
    
    system.time(
        for (k in 1:10) 
        { 
        drvn.list.1 <- sapply(drivedata$pid, 
              function(x){ drivenum$drvn[which((drivenum$fpid <= x) & (x <= drivenum$lpid))] }) 
        } 
    ) 
    
    system.time(
        for (k in 1:10) 
        { 
        drvn.list.2 <- lapply(as.list(as.integer(rep(0,nrow(drivedata)))),head,0) 
        pos <- rep(NA,max(drivenum$lpid)) 
        pos[drivedata$pid] <- 1:nrow(drivedata) 
    
        for (i in 1:nrow(drivenum)) 
        { 
         if (max(drivedata$pid)<drivenum$fpid[i]) { break } 
    
         drvn.list.2[pos[drivenum$fpid[i]:drivenum$lpid[i]]] <- 
         drivenum$drvn[i] 
        } 
        } 
    ) 
    

    > system.time(
    + for (k in 1:10) 
    + { 
    +  drvn.list.1 <- .... [TRUNCATED] 
        user system elapsed 
    432.12 0.46 436.73 
    
    > system.time(
    + for (k in 1:10) 
    + { 
    +  drvn.list.2 <- lapply(as.list(as.integer(rep(0,nrow(drivedata)))),head,0) 
    +  pos <- rep(NA,max(dr .... [TRUNCATED] 
        user system elapsed 
        51.07 0.03 51.41 
    > 
    

    The results are identical:

    > identical(drvn.list.1,drvn.list.2) 
    [1] TRUE 
    > 
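Under the same two assumptions as the loop solution (sorted, non-overlapping ranges), a fully vectorised variant based on base R's findInterval is also conceivable. The sketch below is not one of the two solutions timed above and has not been benchmarked here; it returns a plain vector with NA for plays outside every range instead of a list.

```r
drivenum <- data.frame(fpid = c(2, 12, 19, 23, 36),
                       lpid = c(9, 17, 22, 34, 39),
                       drvn = c(1, 2, 3, 4, 5))

drivedata <- data.frame(pid = 1:20)

# For every pid, index of the last range whose fpid is <= pid (0 if none).
# findInterval() requires drivenum$fpid to be sorted in increasing order.
idx <- findInterval(drivedata$pid, drivenum$fpid)
idx[idx == 0] <- NA

# The candidate range only counts if pid does not exceed its lpid.
inside <- !is.na(idx) & drivedata$pid <= drivenum$lpid[idx]

drivedata$drvn <- ifelse(inside, drivenum$drvn[idx], NA)
```

Because each pid is checked against only one candidate range, this avoids comparing every pid with every row of drivenum.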
    
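+0

Awesome, thanks a ton, this does the job. Any ideas on how to speed this up? Running it on the whole data set takes more than 22 minutes. – NateN

+0

I have extended my answer with a second, faster solution. – mra68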
