2016-05-17 53 views
1

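R - grepl over 7 million observations - how to improve efficiency?

I've run into a problem with some R code I wrote, and I'm hoping you might know how to make the whole thing workable, in the sense of making it more efficient.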

Here is what I'm trying to do:

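I have a Twitter dataset of ~7 million observations. For now I'm not interested in the tweets or any other metadata, only in the "location" field, so I've extracted it into a new data.frame containing the location variable (a string) and a new, currently empty "isRelevant" variable (a logical). I also have a vector holding text of the form "Placename(1)|Placename(2)[...]|Placename(i)". What I want is to grepl each row of the location variable against the Placenames vector, returning TRUE in the isRelevant variable if there is a match and FALSE otherwise.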

To do this I wrote some R code, which basically boils down to this line:

locations.df$isRelevant <- sapply(locations.df$locations, function(s) grepl(grep_places, s, ignore.case = TRUE)) 

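Here grep_places is the list of possible matches separated by the "|" character, which tells R that it may match any element of the vector. I'm running this on a remote high-capacity machine with more than 2 TB of RAM, using RStudio (R 3.2.0), and I run it with 'pbsapply', which gives me a progress bar. It turns out to take a ridiculously long time. So far it has completed about 45% (I started more than a week ago), and it reports that it needs another 270+ hours to finish. That is clearly not workable, since I will have to run similar code on much larger datasets in the future. Do you have any ideas for getting this done in a more acceptable time, say a day or so (bearing in mind the very powerful machine)?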
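As a minimal sketch of how a pattern like grep_places can be built, assuming the place names start out as an ordinary character vector (the `places` vector below is a hypothetical three-item subset, not the asker's data):

```r
# Hypothetical vector of place names (a small subset for illustration only).
places <- c("Acworth NH", "Albany NH", "Hart'sLocation NH")

# Join the names with "|" so the regular expression matches any one of them.
grep_places <- paste(places, collapse = "|")

# Check a single location string against the combined pattern.
one_hit <- grepl(grep_places, "albany nh usa", ignore.case = TRUE)
```

Place names containing regex metacharacters (such as "." or "(") would need escaping before being joined this way; the names shown here contain none.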

Edit

Here is some semi-simulated data showing roughly what I'm working with:

print(grep_places) 
> grep_places 
"Acworth NH|Albany NH|Alexandria NH|Allenstown NH|Alstead NH|Alton NH|Amherst NH|Andover NH|Antrim NH|Ashland NH|Atkinson NH|Auburn NH|Barnstead NH|Barrington NH|Bartlett NH|Bath NH|Bedford NH|Belmont NH|Bennington NH|Benton NH|Berlin NH|Bethlehem NH|Boscawen NH|Bow NH|Bradford NH|Brentwood NH|Bridgewater NH|Bristol NH|Brookfield NH|Brookline NH|Campton NH|Canaan NH|Candia NH|Canterbury NH|Carroll NH|CenterHarbor NH|Charlestown NH|Chatham NH|Chester NH|Chesterfield NH|Chichester NH|Claremont NH|Clarksville NH|Colebrook NH|Columbia NH|Concord NH|Conway NH|Cornish NH|Croydon NH|Dalton NH|Danbury NH|Danville NH|Deerfield NH|Deering NH|Derry NH|Dorchester NH|Dover NH|Dublin NH|Dummer NH|Dunbarton NH|Durham NH|EastKingston NH|Easton NH|Eaton NH|Effingham NH|Ellsworth NH|Enfield NH|Epping NH|Epsom NH|Errol NH|Exeter NH|Farmington NH|Fitzwilliam NH|Francestown NH|Franconia NH|Franklin NH|Freedom NH|Fremont NH|Gilford NH|Gilmanton NH|Gilsum NH|Goffstown NH|Gorham NH|Goshen NH|Grafton NH|Grantham NH|Greenfield NH|Greenland NH|Greenville NH|Groton NH|Hampstead NH|Hampton NH|HamptonFalls NH|Hancock NH|Hanover NH|Harrisville NH|Hart'sLocation NH|Haverhill NH|Hebron NH|Henniker NH|Hill NH|Hillsborough NH|Hinsdale NH|Holderness NH|Hollis NH|Hooksett NH|Hopkinton NH|Hudson NH|Jackson NH|Jaffrey NH|Jefferson NH|Keene NH|Kensington NH|Kingston NH|Laconia NH|Lancaster NH|Landaff NH|Langdon NH|Lebanon NH|Lee NH|Lempster NH|Lincoln NH|Lisbon NH|Litchfield NH|Littleton NH|Londonderry NH|Loudon NH|Lyman NH|Lyme NH|Lyndeborough NH|Madbury NH|Madison NH|Manchester NH|Marlborough NH|Marlow NH|Mason NH|Meredith NH|Merrimack NH|Middleton NH|Milan NH|Milford NH|Milton NH|Monroe NH|MontVernon NH|Moultonborough NH|Nashua NH|Nelson NH|NewBoston NH|NewCastle NH|NewDurham NH|NewHampton NH|NewIpswich NH|NewLondon NH|Newbury NH|Newfields NH|Newington NH|Newmarket NH|Newport NH|Newton NH|NorthHampton NH|Northfield NH|Northumberland NH|Northwood NH|Nottingham NH|Orange NH|Orford NH|Ossipee NH|Pelham 
NH|Pembroke NH|Peterborough NH|Piermont NH|Pittsburg NH|Pittsfield NH|Plainfield NH|Plaistow NH|Plymouth NH|Portsmouth NH|Randolph NH|Raymond NH|Richmond NH|Rindge NH|Rochester NH|Rollinsford NH|Roxbury NH|Rumney NH|Rye NH|Salem NH|Salisbury NH|Sanbornton NH|Sandown NH|Sandwich NH|Seabrook NH|Sharon NH|Shelburne NH" 


head(location.df, n=20) 
>      location isRelevant 
1      London   NA 
2  Orleans village VT USA   NA 
3     The World   NA 
4    D M V Towson   NA 
5 Playa del Sol Solidaridad   NA 
6 Beautiful Downtown Burbank   NA 
7      <NA>   NA 
8       US   NA 
9    Gaithersburg Md   NA 
10      <NA>   NA 
11    California   NA 
12      Indy   NA 
13     Florida   NA 
14    exsnaveen com   NA 
15     Houston TX   NA 
16     Tweaking   NA 
17    Phoenix AZ   NA 
18    Malibu Ca USA   NA 
19   Hermosa Beach CA   NA 
20    California USA   NA 

Thanks in advance, everyone. I would really appreciate any help with this.

+1

This is a reasonable question as it stands, but it would be better if you provided a little (simulated) data to make a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)... –

+0

Hi Ben. Sorry about that. I've now added some data. Cheers! – nikUoM

+0

You might have better luck with some of the functions in the 'stringi' package, which tends to outperform the other regex functions. – nrussell
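A hedged sketch of what the stringi suggestion might look like, assuming the package is installed (the code falls back to skipping the comparison if it is not). The toy `locations` vector is hypothetical. Whether stringi is actually faster on the real data would need to be measured.

```r
# Toy stand-ins for the real data (hypothetical values).
locations <- c("Gaithersburg Md", "London", NA, "phoenix az")
grep_places <- "Gaithersburg Md|Phoenix AZ"

# Base R reference result.
base_hits <- grepl(grep_places, locations, ignore.case = TRUE)

stri_hits <- NULL
if (requireNamespace("stringi", quietly = TRUE)) {
  # Case-insensitive regex detection over the whole vector in one call.
  stri_hits <- stringi::stri_detect_regex(
    locations, grep_places,
    opts_regex = stringi::stri_opts_regex(case_insensitive = TRUE)
  )
  # stringi returns NA for missing input, whereas grepl() returns FALSE;
  # recode so the two results are comparable.
  stri_hits[is.na(stri_hits)] <- FALSE
}
```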

Answers

3

grepl is a vectorized function, so there should be no need to loop over it. Have you tried:

#dput(location.df)  
location.df<-structure(list(location = structure(c(12L, 14L, 17L, 5L, 16L, 
      2L, 1L, 19L, 8L, 1L, 3L, 11L, 7L, 6L, 10L, 18L, 15L, 13L, 9L, 
     4L), .Label = c("<NA>", "Beautiful Downtown Burbank", "California", 
      "California USA", "D M V Towson", "exsnaveen com", "Florida", 
      "Gaithersburg Md", "Hermosa Beach CA", "Houston TX", "Indy", 
      "London", "Malibu Ca USA", "Orleans village VT USA", "Phoenix AZ", 
      "Playa del Sol Solidaridad", "The World", "Tweaking", "US"), class = "factor"), 
      isRelevant = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
      NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("location", 
      "isRelevant"), row.names = c(NA, -20L), class = "data.frame") 

#grep_places with places in the test data 
grep_places<-"Gaithersburg Md|Phoenix AZ" 

location.df$isRelevant[grepl(grep_places, location.df$location, ignore.case = TRUE)]<-TRUE 

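Or a slightly faster implementation, following David Arenburg's comment, which also sets non-matching rows to FALSE instead of leaving them NA: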

location.df$isRelevant <- grepl(grep_places, location.df$location, ignore.case = TRUE)
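To illustrate the point about vectorization, here is a small sketch on toy data (hypothetical values) checking that the per-row sapply call from the question and the single vectorized grepl call give the same result. The last step, matching each distinct location only once and mapping the results back, is an extra idea that is not from the thread; it may help if many rows share the same location string, but that is an assumption about the data.

```r
# Toy data standing in for the 7-million-row location column (hypothetical).
locations <- c("Gaithersburg Md", "London", NA, "phoenix az", "London", "Houston TX")
grep_places <- "Gaithersburg Md|Phoenix AZ"

# Per-element version from the question: one grepl() call per row.
slow <- sapply(locations,
               function(s) grepl(grep_places, s, ignore.case = TRUE),
               USE.NAMES = FALSE)

# Vectorized version from the answer: a single grepl() call for the column.
fast <- grepl(grep_places, locations, ignore.case = TRUE)

# Optional extra step (an assumption, not from the thread): run the regex
# once per distinct location, then map the results back onto the rows.
uniq <- unique(locations)
uniq_hits <- grepl(grep_places, uniq, ignore.case = TRUE)
dedup <- uniq_hits[match(locations, uniq)]

# All three approaches should agree.
stopifnot(identical(slow, fast), identical(fast, dedup))
```

On the real data, wrapping each variant in system.time() on a sample of rows would show how much each step actually saves.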