Adding a new variable by comparing existing variables in a data frame

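I have a dataset containing the 2016 primary results. The dataset has 8 columns: State, state_abbr, county, fips (the combined state and county ID number), party, candidate, votes, and fraction_votes. I want to create a new column named "result" that marks each candidate in each county as a "Win" or a "Loss". I used dplyr to filter the data down to the 2 Democratic candidates, then added the column with this code: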

Democrat$result <- ifelse(Democrat$fraction_votes > .5, "Win","Loss")

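This is obviously inaccurate, because the winner does not always get more than 50% of the vote. How can I get R to compare fraction_votes or the vote totals within each county and return "Win" or "Loss"? Is the apply() family, a for loop, or writing a function the best way to create the new column?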

    state state_abbreviation  county fips    party       candidate votes fraction_votes
1 Alabama                 AL Autauga 1001 Democrat  Bernie Sanders   544          0.182
2 Alabama                 AL Autauga 1001 Democrat Hillary Clinton  2387          0.800
3 Alabama                 AL Baldwin 1003 Democrat  Bernie Sanders  2694          0.329
4 Alabama                 AL Baldwin 1003 Democrat Hillary Clinton  5290          0.647
5 Alabama                 AL Barbour 1005 Democrat  Bernie Sanders   222          0.078
6 Alabama                 AL Barbour 1005 Democrat Hillary Clinton  2567          0.906

Source

2017-02-22 Andrew Lastrapes

Can we get an example of your data setup?

[Edit] your post!

OK, there it is.

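I would first use the summarise function from the dplyr package to find the maximum vote share any candidate received in a given county, then join that county maximum back onto the original dataset as a new column, and finally compute the result.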

# create a sample dataset akin to the question setup
df <- data.frame(abrev = rep("AL", 6),
                 county = c("Autauga", "Autauga", "Baldwin", "Baldwin",
                            "Barbour", "Barbour"),
                 party = rep("Democrat", 6),
                 candidate = rep(c("Bernie", "Hillary"), 3),
                 fraction_votes = c(0.18, 0.8, 0.32, 0.64, 0.07, 0.9))

# load the dplyr library
library(dplyr)

# calculate the maximum vote share any candidate received in each county

# take the df dataset
winners <- df %>%
  # group it by county
  group_by(county) %>%
  # for each county, calculate the maximum vote share
  summarise(score = max(fraction_votes))

# join the original dataset and the dataset with county maximums
# on the county column
df <- left_join(df, winners, by = c("county"))

# calculate the result column
df$result <- ifelse(df$fraction_votes == df$score, "Win", "Loss")

If different counties share the same name, you will have to adjust the grouping and the join, but the logic stays the same.
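As a rough sketch of that adjustment, the grouping and the join can both use state plus county as the key. The data frame below is made up for illustration (the vote shares are hypothetical, and the state column is an assumption, since the sample above only has abrev); with the real data, grouping by fips alone should work just as well.

```r
library(dplyr)

# Hypothetical sample: a county named "Washington" appears in two states
df <- data.frame(
  state = c("Alabama", "Alabama", "Arkansas", "Arkansas"),
  county = rep("Washington", 4),
  candidate = rep(c("Bernie", "Hillary"), 2),
  fraction_votes = c(0.30, 0.65, 0.55, 0.40)  # made-up shares
)

# Maximum vote share per (state, county) pair, not per county name
winners <- df %>%
  group_by(state, county) %>%
  summarise(score = max(fraction_votes), .groups = "drop")

# Join on both keys so same-named counties stay separate
df <- left_join(df, winners, by = c("state", "county"))

# Mark the row holding the county maximum as the winner
df$result <- ifelse(df$fraction_votes == df$score, "Win", "Loss")
```

Grouping by county alone here would treat both states as one county and give a single "Win" across all four rows.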

Source

2017-02-22 17:28:40 ira

Very well done!

In base R, you can compute a binary vector with ave:

Democrat$winner <- ave(Democrat$fraction_votes, Democrat$fips, FUN=function(i) i == max(i))

Democrat
    state state_abbreviation  county fips    party candidate votes fraction_votes winner
1 Alabama                 AL Autauga 1001 Democrat    Bernie   544          0.182      0
2 Alabama                 AL Autauga 1001 Democrat   Hillary  2387          0.800      1
3 Alabama                 AL Baldwin 1003 Democrat    Bernie  2694          0.329      0
4 Alabama                 AL Baldwin 1003 Democrat   Hillary  5290          0.647      1
5 Alabama                 AL Barbour 1005 Democrat    Bernie   222          0.078      0
6 Alabama                 AL Barbour 1005 Democrat   Hillary  2567          0.906      1

If needed, this can be converted to a logical vector by wrapping ave in as.logical.
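A minimal sketch of that conversion, using only the first four rows of the sample data and carrying it one step further to the "Win"/"Loss" labels the question asked for. The is_max and result names are just illustrative.

```r
# First four rows of the question's sample, reduced to the needed columns
Democrat <- data.frame(
  fips = c(1001L, 1001L, 1003L, 1003L),
  candidate = c("Bernie", "Hillary", "Bernie", "Hillary"),
  fraction_votes = c(0.182, 0.800, 0.329, 0.647)
)

# ave() returns numeric 0/1 here because its first argument is numeric
is_max <- ave(Democrat$fraction_votes, Democrat$fips,
              FUN = function(i) i == max(i))

# Convert to logical, then map to the labels from the question
Democrat$winner <- as.logical(is_max)
Democrat$result <- ifelse(Democrat$winner, "Win", "Loss")
```

Note that if two candidates tie for the maximum within a fips group, both rows come out as "Win" with this approach.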

This is also very simple with data.table, assuming fips is a unique state-county ID:

library(data.table) 
# convert to data.table 
setDT(Democrat) 

# get logical vector that proclaims winner if vote fraction is maximum 
Democrat[, winner := fraction_votes == max(fraction_votes), by=fips]

Democrat
     state state_abbreviation  county fips    party candidate votes fraction_votes winner
1: Alabama                 AL Autauga 1001 Democrat    Bernie   544          0.182  FALSE
2: Alabama                 AL Autauga 1001 Democrat   Hillary  2387          0.800   TRUE
3: Alabama                 AL Baldwin 1003 Democrat    Bernie  2694          0.329  FALSE
4: Alabama                 AL Baldwin 1003 Democrat   Hillary  5290          0.647   TRUE
5: Alabama                 AL Barbour 1005 Democrat    Bernie   222          0.078  FALSE
6: Alabama                 AL Barbour 1005 Democrat   Hillary  2567          0.906   TRUE
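The same grouped comparison could also write the "Win"/"Loss" labels directly and use the raw vote totals instead of the fractions, which the question mentioned as an alternative. This is only a sketch on the first four rows of the sample; the result column name is taken from the question, not from the answer above.

```r
library(data.table)

# First four rows of the question's sample, reduced to the needed columns
Democrat <- data.table(
  fips = c(1001L, 1001L, 1003L, 1003L),
  candidate = c("Bernie", "Hillary", "Bernie", "Hillary"),
  votes = c(544L, 2387L, 2694L, 5290L)
)

# Within each fips group, label the row with the most votes as the winner
Democrat[, result := ifelse(votes == max(votes), "Win", "Loss"), by = fips]
```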

Data

Democrat <- 
structure(list(state = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "Alabama", class = "factor"), 
    state_abbreviation = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "AL", class = "factor"), 
    county = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label = c("Autauga", 
    "Baldwin", "Barbour"), class = "factor"), fips = c(1001L, 
    1001L, 1003L, 1003L, 1005L, 1005L), party = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = "Democrat", class = "factor"), 
    candidate = structure(c(1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Bernie", 
    "Hillary"), class = "factor"), votes = c(544L, 2387L, 2694L, 
    5290L, 222L, 2567L), fraction_votes = c(0.182, 0.8, 0.329, 
    0.647, 0.078, 0.906)), .Names = c("state", "state_abbreviation", 
"county", "fips", "party", "candidate", "votes", "fraction_votes" 
), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")

Source

2017-02-22 17:29:53 lmo
