2017-01-30


The target data frame t_offices has four fields of interest here:

       administrative_area_level_1 administrative_area_level_2       country     locality
1                          Arizona             Maricopa County United States      Phoenix
2             District of Columbia                        <NA> United States   Washington
3                             <NA>                        <NA>         India         <NA>
4                         New York               Albany County United States       Albany
5                          Utrecht                  Nieuwegein   Netherlands   Nieuwegein
6                      Connecticut            Fairfield County United States     Stamford
707                       Illinois                        <NA> United States         <NA>
4241                      Illinois                        <NA> United States West Chicago
999998                     Alabama                        <NA> United States      Altoona
999999                Pennsylvania                        <NA> United States   Washington

For U.S. records, I need to replace the NA values in administrative_area_level_2 with the county. The county is in the data frame t_places:

      state_ab          place_name                  county_name place_nameshort
1           AL          Abanda CDP              Chambers County          Abanda
2           AL      Abbeville city                 Henry County       Abbeville
3           AL     Adamsville city             Jefferson County      Adamsville
4           AL        Addison town               Winston County         Addison
5           AL          Akron town                  Hale County           Akron
6           AL      Alabaster city                Shelby County       Alabaster
12          AL        Altoona town Blount County, Etowah County         Altoona
4298        DC     Washington city         District of Columbia      Washington
7527        IL   West Chicago city                DuPage County    West Chicago
32611       PA Washington township             Armstrong County      Washington
32612       PA Washington township                 Berks County      Washington




State abbreviations are mapped to state names in r_states:

   state_ab           state_name
1        AL              Alabama
2        AK               Alaska
3        AZ              Arizona
4        AR             Arkansas
5        CA           California
6        CO             Colorado
9        DC District of Columbia
17       IL             Illinois
42       PA         Pennsylvania



no_county <- (!is.na(t_offices$country) 
      & t_offices$country == "United States" 
      & !is.na(t_offices$administrative_area_level_1) 
      & is.na(t_offices$administrative_area_level_2) 
      & !is.na(t_offices$locality)) 

t_offices$administrative_area_level_2[no_county] <- 
    t_places$county_name[!grepl(",", t_places$county_name) 
         & match(t_places$place_nameshort, t_offices$locality[no_county]) 
         & match(t_places$state_ab, 

Edit: following @r2evans's comment, here is my new attempt at the code. It still does not work:

library(splitstackshape)  # provides cSplit()

# split multiple counties into columns 
t_places <- cSplit(t_places, "county_name", sep = ", ", drop = FALSE, type.convert = FALSE) 

# merge state names into places 
places_statename <- merge(t_places, r_states[,2:3]) 

# define condition to select t_offices records in U.S. with state and place but no county 
no_county <- (
    # country is U.S. 
    !is.na(t_offices$country) 
    & t_offices$country == "United States" 
    # with state 
    & !is.na(t_offices$administrative_area_level_1) 
    # blank county 
    & is.na(t_offices$administrative_area_level_2) 
    # with place 
    & !is.na(t_offices$locality)) 

# update blank counties 
t_offices$administrative_area_level_2[no_county] <- 
    # unambiguous counties 
    places_statename$county_name_1[is.na(places_statename$county_name_2) 
           # locality matches place 
           & match(t_offices$locality[no_county], places_statename$place_nameshort) 
           # administrative_area_level_1 matches state 
           & match(t_offices$administrative_area_level_1[no_county], places_statename$state_name)] 

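I suggest you reshape your data in favor of a direct join (via `merge` or `dplyr::left_join` and friends). That makes everything easier, more robust, and simpler to handle and troubleshoot. To start: if `county_name` can contain multiple comma-separated values, you can split them with `tidyr::separate` and `tidyr::gather`, which makes the join more intuitive. It would also help to make the question reproducible; right now, we don't have representative data that meets all your requirements. – r2evans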


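@r2evans Thanks for the suggestion! I have added (real and made-up) sample data to make the question reproducible. Is your first suggestion that I should merge t_places and r_states, melt the county names into a single table, and then join that table with t_offices? – syre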


@r2evans Not melt, but convert to multiple columns – syre
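
The join-based approach suggested in the comments can be sketched in base R roughly as follows. This is a minimal illustration, not the commenter's exact code: the toy data is a subset of the sample tables, it assumes place_nameshort is "West Chicago" for West Chicago city and "Washington" for both Washington townships, and the names lookup, key, joined, fill and row_id are made up for the example. Instead of splitting multi-county strings, it simply drops them as ambiguous.

```r
# Toy data transcribed from the sample tables (only the columns needed here).
t_offices <- data.frame(
  administrative_area_level_1 = c("District of Columbia", "Illinois",
                                  "Alabama", "Pennsylvania"),
  administrative_area_level_2 = NA_character_,
  country  = "United States",
  locality = c("Washington", "West Chicago", "Altoona", "Washington"),
  stringsAsFactors = FALSE
)
t_places <- data.frame(
  state_ab        = c("AL", "DC", "IL", "PA", "PA"),
  county_name     = c("Blount County, Etowah County", "District of Columbia",
                      "DuPage County", "Armstrong County", "Berks County"),
  place_nameshort = c("Altoona", "Washington", "West Chicago",
                      "Washington", "Washington"),
  stringsAsFactors = FALSE
)
r_states <- data.frame(
  state_ab   = c("AL", "DC", "IL", "PA"),
  state_name = c("Alabama", "District of Columbia", "Illinois", "Pennsylvania"),
  stringsAsFactors = FALSE
)

# 1. Attach full state names to the places.
lookup <- merge(t_places, r_states, by = "state_ab")

# 2. Keep unambiguous entries only: a single county per place, and a
#    place name that occurs exactly once within its state.
lookup <- lookup[!grepl(",", lookup$county_name), ]
key    <- paste(lookup$state_name, lookup$place_nameshort, sep = "|")
lookup <- lookup[!(key %in% key[duplicated(key)]), ]

# 3. Left join: all.x = TRUE keeps every office, matched or not.
#    merge() reorders rows, so remember the original order first.
t_offices$row_id <- seq_len(nrow(t_offices))
joined <- merge(t_offices,
                lookup[, c("state_name", "place_nameshort", "county_name")],
                by.x = c("administrative_area_level_1", "locality"),
                by.y = c("state_name", "place_nameshort"),
                all.x = TRUE)
joined <- joined[order(joined$row_id), ]

# 4. Fill only the NA counties of U.S. offices, then drop helper columns.
fill <- is.na(joined$administrative_area_level_2) &
  !is.na(joined$country) & joined$country == "United States"
joined$administrative_area_level_2[fill] <- joined$county_name[fill]
joined$county_name <- NULL
joined$row_id      <- NULL
```

With this toy data, Washington (DC) and West Chicago (IL) receive a county, while Altoona (two counties) and Washington (PA, two townships) stay NA because they are ambiguous. Note that merge() moves the join columns to the front of the result.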




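library(data.table)       # := and join syntax used below
library(splitstackshape)  # provides cSplit()

# split multiple counties into columns 
t_places <- cSplit(t_places, "county_name", sep = ", ", drop = FALSE, type.convert = FALSE) 
# subset original places with single county 
places_singlecounty <- t_places[is.na(t_places$county_name_2), 
          c("state_ab", "place_nameshort", "county_name_1")] 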
# subset truncated places with single county 
places_singlecounty <- merge(places_singlecounty, 
          places_singlecounty[, .N, by = c("state_ab", "place_nameshort")][N == 1, 1:2]) 
# merge state names into single-county truncated places 
places_statename <- merge(places_singlecounty, r_states[,2:3], by = "state_ab") 

# define condition to select t_offices records in U.S. with state and place but no county 
no_county <- (
    # country is U.S. 
    !is.na(t_offices$country) 
    & t_offices$country == "United States" 
    # with state 
    & !is.na(t_offices$administrative_area_level_1) 
    # NA county 
    & is.na(t_offices$administrative_area_level_2) 
    # with place 
    & !is.na(t_offices$locality)) 

# update t_offices NA counties based on single-county truncated places 
t_offices[no_county, administrative_area_level_2 := 
      places_statename[.(.SD), county_name_1, 
          on = c(state_name = "administrative_area_level_1", 
            place_nameshort = "locality")]]
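
To see the whole pipeline run end to end, here is a self-contained sketch on a subset of the sample data. It makes several assumptions: t_offices is a data.table (required for `:=`), data.table's `tstrsplit()` stands in for `cSplit()`, `.SD` is passed directly as the join table, place_nameshort is "West Chicago" for West Chicago city and "Washington" for both Washington townships, and the names single and lookup are made up for the example.

```r
library(data.table)

t_offices <- data.table(
  administrative_area_level_1 = c("District of Columbia", "Illinois",
                                  "Alabama", "Pennsylvania"),
  administrative_area_level_2 = NA_character_,
  country  = "United States",
  locality = c("Washington", "West Chicago", "Altoona", "Washington")
)
t_places <- data.table(
  state_ab        = c("AL", "DC", "IL", "PA", "PA"),
  county_name     = c("Blount County, Etowah County", "District of Columbia",
                      "DuPage County", "Armstrong County", "Berks County"),
  place_nameshort = c("Altoona", "Washington", "West Chicago",
                      "Washington", "Washington")
)
r_states <- data.table(
  state_ab   = c("AL", "DC", "IL", "PA"),
  state_name = c("Alabama", "District of Columbia", "Illinois", "Pennsylvania")
)

# Split multi-county strings into two columns (NA where there is no second county).
t_places[, c("county_name_1", "county_name_2") :=
           tstrsplit(county_name, ", ", fixed = TRUE)]

# Keep places with a single county whose short name is unique within the state.
single <- t_places[is.na(county_name_2)]
single <- single[, if (.N == 1L) .SD, by = .(state_ab, place_nameshort)]

# Attach full state names so the lookup can be joined on t_offices columns.
lookup <- merge(single, r_states, by = "state_ab")

# U.S. offices with a state and a locality but no county.
no_county <- !is.na(t_offices$country) &
  t_offices$country == "United States" &
  !is.na(t_offices$administrative_area_level_1) &
  is.na(t_offices$administrative_area_level_2) &
  !is.na(t_offices$locality)

# Update join: for each selected office row, look up county_name_1;
# rows without a match get NA.
t_offices[no_county, administrative_area_level_2 :=
            lookup[.SD, county_name_1,
                   on = c(state_name = "administrative_area_level_1",
                          place_nameshort = "locality")]]
```

Because lookup holds at most one row per state and place, the inner join returns exactly one value per selected office, which is what the assignment by reference needs.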