2013-04-26 138 views
-3

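Improve R script efficiency

I am trying to match two very large data sets (nsar & crsp). My code works well, but it takes a very long time. My procedure works as follows:

  • Try to match by ticker (while checking that the NAV (just a number) and the date are the same)
  • Try to match by exact fund name (controlling for NAV and date)
  • Try to match by closest match: first search for the same NAV and date -> take that list and keep only the companies that are the best match on two matching measures -> take the remaining entries and find the closest match (with a limit on the match distance).

    Any suggestions on how I could improve the efficiency of the code?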

    #Go through each nsar entry and try to match with crsp 
    trackchanges = sapply(seq_along(nsar$fund),function(x){ 
    
        #Define vars 
        ticker = nsar$ticker[x] 
        r_date = format(nsar$r_date[x], "%m%Y") 
        nav1 = nsar$NAV_share[x] 
        nav2 = nsar$NAV_sshare[x] 
        searchbyname = 0 
    
        if(nav1 == 0) nav1 = -99 
        if(nav2 == 0) nav2 = -99 
    
        ########## If ticker is available --> Merge via ticker and NAV 
        if(is.na(ticker) == F) 
        { 
    
         #Look for same NAV, date and ticker 
         found = which(crsp$nasdaq == ticker & crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1))) 
    
    
         #If nothing found 
         if(length(found) == 0) 
         { 
    
          #Mark that you should search by names 
          searchbyname = 1 
    
         } else { #ticker found 
    
            #Record crsp_fundno and that match is found 
          nsar$match[x] = 1 
          nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]] 
          assign("nsar",nsar,envir=.GlobalEnv) 
    
          #Return: 1 --> Merged by ticker 
          return(1) 
         } 
    
        } 
    
        ########### 
    
        ########### No Ticker available or found --> Exact name matching 
        if(is.na(ticker) == T | searchbyname == 1) 
        { 
    
         #Define vars 
         name = tolower(nsar$fund[x]) 
         company = tolower(nsar$company[x]) 
    
         #Exact name, date and same NAV 
         found = which(crsp$fund_name2 == name & crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1))) 
    
    
    
         #If nothing found 
         if(length(found) == 0) 
         { 
    
          #####Continue searching by closest match 
    
           #First search for nav and date to get list of funds 
           allfunds = which(crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1))) 
           allfunds_companies = crsp$company[allfunds] 
    
           #Check if anything found 
           if(length(allfunds) == 0) 
           { 
            #Return: 0 --> nothing found 
            return(0) 
           } 
    
           #Get best match by lev and substring measure for company 
           levmatch = levenstheinMatch(company, allfunds_companies) 
           submatch = substringMatch(company, allfunds_companies) 
    
           allfunds = levmatch[levmatch %in% submatch] 
           allfunds_names = crsp$fund_name2[allfunds] 
    
           #Check if now anything found 
           if(length(allfunds) == 0) 
           { 
            #Mark match (5=Company not found) 
            nsar$match[x] = 5 
    
            #Save globally 
            assign("nsar",nsar,envir=.GlobalEnv) 
    
            #Return: 5 --> Company not found 
            return(5) 
           } 
    
    
           #Get best match by all measures 
           levmatch = levenstheinMatch(name, allfunds_names) 
           submatch = substringMatch(name, allfunds_names) 
    
    
           #Only accept if identical 
           allfunds = levmatch[levmatch %in% submatch] 
           allfunds_names = crsp$fund_name2[allfunds] 
    
    
           if(length(allfunds) > 0) 
           { 
            #Mark match (3=closest name matching) 
            nsar$match[x] = 3 
    
            #Add crsp_fundno to nsar data 
            nsar$crsp_fundno[x] = crsp$crsp_fundno[allfunds[1]] 
    
            #Save globally 
            assign("nsar",nsar,envir=.GlobalEnv) 
    
            #Return 3=closest name matching 
            return(3) 
    
           } else { 
            #return 0 -> no match 
            return(0) 
           } 
    
          ##### 
    
         } else { #If exact name,date,nav found 
    
          #Mark match (2=exact name matching) 
          nsar$match[x] = 2 
    
          #Add crsp_fundno to nsar data 
          nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]] 
    
          #Return 2=exact name matching 
          return(2) 
         } 
        } 
    
    
    
    
    
    })#End sapply 
    

    Thanks a lot for your help! Laurenz

  • +0

    Can you post a simpler, reproducible example? – Nishanth 2013-04-26 12:36:06

    +0

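    Some general advice: write fewer comments, and split the workflow into functions instead. That way your central loop might be ten lines or so. This makes the main idea easy to grasp, while the details live inside the functions. – 2013-04-26 14:00:18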

    Answers

    2

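    The script is too complicated to give a complete answer, but the basic problem is in the first line

    #Go through each nsar entry... 
    

    where you set the problem up in an iterative way. R works best with vectors.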

    Hoist the vectorizable parts of the calculation out of the sapply. For example, format r_date once for the whole column:

    nsar$r_date_f <- format(nsar$r_date, "%m%Y") 
    

    The same advice applies to lines buried deeper in your code, too. For example, rounding crsp$mnav should be done just once, on the entire column:

    crsp$mnav_r <- round(crsp$mnav, 1) 
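
    To see that hoisting changes only the cost and not the result, here is a minimal sketch on a toy nsar table. The values are invented for illustration, and only the column names come from the question. It compares a per-row format() call with a single vectorized call:

    ```r
    # Toy stand-in for the real nsar table (values are invented).
    nsar <- data.frame(
      r_date    = as.Date(c("2010-01-31", "2010-02-28", "2010-02-28")),
      NAV_share = c(10.04, 0, 25.31)
    )

    # Per-row version: format() is called once for every row, as in the sapply loop.
    loop_dates <- sapply(seq_len(nrow(nsar)),
                         function(i) format(nsar$r_date[i], "%m%Y"))

    # Hoisted version: one vectorized call per column, done before any matching.
    nsar$r_date_f <- format(nsar$r_date, "%m%Y")
    nsar$nav1_r   <- round(nsar$NAV_share, 1)

    # Both approaches produce the same strings.
    same_result <- identical(loop_dates, nsar$r_date_f)
    ```

    The precomputed columns (r_date_f, nav1_r, mnav_r) can then be compared directly, instead of calling format() and round() again for every nsar row.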
    

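    Use R idioms where appropriate. If the "-99" sentinel (which your loop substitutes for a NAV of 0) represents a missing value, then use NA instead:

    nav1 <- nsar$NAV_share 
    nav1[nav1 == 0] <- NA   # the raw data marks a missing NAV with 0; the loop recoded it to -99 
    nsar$nav1 <- nav1 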
    

    Code from other packages that you might use is more likely to treat NA correctly.
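
    As a small illustration of why NA is more convenient than a sentinel, the sketch below uses invented toy vectors (crsp_nav is a hypothetical stand-in for the rounded crsp$mnav column). A missing NAV never matches anything, and summaries can skip it explicitly:

    ```r
    nav <- c(10.0, 0, 25.3)        # toy values; 0 stands for "not reported"
    nav[nav == 0] <- NA

    crsp_nav <- c(0, 10.0, 25.3)   # toy lookup values, including a genuine 0

    # Comparing against NA yields NA everywhere, and which() drops NAs,
    # so a missing NAV cannot match even a genuine 0 in the lookup column.
    hits_missing <- which(crsp_nav == nav[2])

    # An ordinary lookup is unaffected.
    hits_real <- which(crsp_nav == nav[1])

    # Summaries can skip missing values instead of averaging in a sentinel.
    mean_nav <- mean(nav, na.rm = TRUE)
    ```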

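    Use well-established R functions for more complicated queries. This is tricky, but if I am reading your code correctly, your "same NAV, date and ticker" query could use merge to do the join, assuming that the columns have already been created by vectorized operations earlier in the code, as in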

    nsar1 <- nsar[!is.na(nsar$ticker), , drop=FALSE] 
    # assumes nsar$nav1_r <- round(nsar$nav1, 1) was created beforehand 
    df0 <- merge(nsar1, crsp, 
          by.x = c("ticker", "r_date_f", "nav1_r"), 
          by.y = c("nasdaq", "caldt2", "mnav_r")) 
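
    For a runnable version of that join, here is a minimal sketch on toy tables. All values are invented, and the column names follow the question plus the helper columns r_date_f, nav1_r and mnav_r, which are assumed to be built by vectorized code first:

    ```r
    # Toy stand-ins for the two tables (values are invented).
    nsar <- data.frame(
      fund      = c("alpha growth", "beta income", "gamma value"),
      ticker    = c("AAAXX", NA, "CCCXX"),
      r_date    = as.Date(c("2010-01-31", "2010-01-31", "2010-02-28")),
      NAV_share = c(10.04, 12.50, 25.31),
      stringsAsFactors = FALSE
    )
    crsp <- data.frame(
      crsp_fundno = c(101L, 102L, 103L),
      nasdaq      = c("AAAXX", "BBBXX", "CCCXX"),
      caldt2      = c("012010", "012010", "022010"),
      mnav        = c(10.01, 12.52, 30.00),
      stringsAsFactors = FALSE
    )

    # Build the join keys once, on whole columns.
    nsar$r_date_f <- format(nsar$r_date, "%m%Y")
    nsar$nav1_r   <- round(nsar$NAV_share, 1)
    crsp$mnav_r   <- round(crsp$mnav, 1)

    # Join the rows that have a ticker on ticker, month-year and rounded NAV.
    nsar1 <- nsar[!is.na(nsar$ticker), , drop = FALSE]
    df0 <- merge(nsar1, crsp,
                 by.x = c("ticker", "r_date_f", "nav1_r"),
                 by.y = c("nasdaq", "caldt2", "mnav_r"))
    ```

    In this toy data only the first fund survives the join: the second has no ticker and the third has a different NAV. Rows that drop out here are the ones that would go on to the name-matching steps.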
    

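    This does not cover the "|" condition, so additional work would be required. The plyr, data.table and sqldf packages (among others) were developed in part to simplify these kinds of operations, so they may be worth investigating as you become more comfortable with vectorized calculations.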
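
    One possible way to handle the "|" condition in base R, offered as a sketch rather than a tested solution for the real data: stack the two NAV columns into long form, join once, and keep a single hit per entry. The id column and all values below are invented for illustration:

    ```r
    # Toy tables (values invented; id is a hypothetical row key added for the example).
    nsar <- data.frame(
      id         = 1:3,
      ticker     = c("AAAXX", "BBBXX", "CCCXX"),
      r_date_f   = c("012010", "012010", "022010"),
      NAV_share  = c(10.04, 0, 25.31),
      NAV_sshare = c(0, 12.49, 0),
      stringsAsFactors = FALSE
    )
    crsp <- data.frame(
      crsp_fundno = c(101L, 102L, 103L),
      nasdaq      = c("AAAXX", "BBBXX", "CCCXX"),
      caldt2      = c("012010", "012010", "022010"),
      mnav        = c(10.01, 12.52, 30.00),
      stringsAsFactors = FALSE
    )
    crsp$mnav_r <- round(crsp$mnav, 1)

    # Long form: one row per (entry, candidate NAV), so "nav1 OR nav2" becomes one key.
    keys <- nsar[c("id", "ticker", "r_date_f")]
    long <- rbind(
      cbind(keys, nav_r = round(nsar$NAV_share, 1)),
      cbind(keys, nav_r = round(nsar$NAV_sshare, 1))
    )
    long <- long[long$nav_r != 0, ]   # 0 means "not reported" in the question's data

    hits <- merge(long, crsp,
                  by.x = c("ticker", "r_date_f", "nav_r"),
                  by.y = c("nasdaq", "caldt2", "mnav_r"))
    hits <- hits[!duplicated(hits$id), ]   # keep a single CRSP hit per nsar entry

    # Write the results back in one vectorized step.
    nsar$crsp_fundno <- hits$crsp_fundno[match(nsar$id, hits$id)]
    nsar$match       <- ifelse(is.na(nsar$crsp_fundno), NA, 1)
    ```

    The crsp_fundno and match columns are filled in a single pass, with no assign() into the global environment. Note that merge sorts its output by the join keys, so when several CRSP rows match, the one kept here may differ from the found[1] picked by the loop.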

    It is hard to say for sure, but I think these three steps address the main challenges in your code.