2017-04-20

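(I have a feeling I'm going to feel very stupid once I get an answer, but I just can't figure this out.) In R, how do I perform an operation on a specific subset of a data.frame?

I have a data.frame with an empty column at the end. It will mostly be filled with NA, but I want to fill some of its rows with a value. This column represents a guess at data that is missing from another column of the data.frame.

My initial data.frame looks like this: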

Game | Rating | MinPlayers | MaxPlayers | MaxPlayersGuess
---------------------------------------------------------
A    | 6      | 3          | 6          |
B    | 7      | 3          | 7          |
C    | 6.5    | 3          | N/A        | median(df$MaxPlayers[df$MinPlayers == 3,])
D    | 7      | 3          | 6          |
E    | 7      | 3          | 5          |
F    | 9.5    | 2          | 5          |
G    | 6      | 2          | 4          |
H    | 7      | 2          | 4          |
I    | 6.5    | 2          | N/A        | median(df$MaxPlayers[df$MinPlayers == 2,])
J    | 7      | 2          | 2          |
K    | 7      | 2          | 4          |

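Notice that two rows have "N/A" for MaxPlayers. What I'm trying to do is use the information I do have to guess what MaxPlayers might be. If the median(MaxPlayers) for 3-player games is 6, then MaxPlayersGuess should equal 6 for games with MinPlayers == 3 and MaxPlayers == N/A. (I've tried to indicate in code, in the table above, what value MaxPlayersGuess should take in this example.)

The resulting data.frame should look like this: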

Game | Rating | MinPlayers | MaxPlayers | MaxPlayersGuess
---------------------------------------------------------
A    | 6      | 3          | 6          |
B    | 7      | 3          | 7          |
C    | 6.5    | 3          | N/A        | 6
D    | 7      | 3          | 6          |
E    | 7      | 3          | 5          |
F    | 9.5    | 2          | 5          |
G    | 6      | 2          | 4          |
H    | 7      | 2          | 4          |
I    | 6.5    | 2          | N/A        | 4
J    | 7      | 2          | 2          |
K    | 7      | 2          | 4          |

Here is the result of one attempt:

gld$MaxPlayersGuess <- ifelse(is.na(gld$MaxPlayers), median(gld$MaxPlayers[gld$MinPlayers,]), NA) 


Error in gld$MaxPlayers[gld$MinPlayers, ] : 
incorrect number of dimensions 

Answers

2

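Updated to match the posted example.

In my experience, it is sometimes easier to compute what you want first and then grab it when you need it, rather than juggling all of those logical conditions. You are trying to come up with a way to compute everything at once, which is what makes it confusing; break it down into steps. You need to know the median of "MaxPlayer" for each possible "MinPlayer" group. Then you want to use that value wherever MaxPlayer is missing. So here is a simple way to do that.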

#generate fake data 
MinPlayer <- rep(3:2, each = 4) 
MaxPlayer <- rep(2:5, each = 2, times = 2) 

df <- data.frame(MinPlayer, MaxPlayer) 

#replace some values of MaxPlayer with NA 
df$MaxPlayer <- ifelse(df$MaxPlayer == 3, NA, df$MaxPlayer) 

####STARTING DATA 
# > df 
# MinPlayer MaxPlayer 
# 1   3   2 
# 2   3   2 
# 3   3  NA 
# 4   3  NA 
# 5   2   4 
# 6   2   4 
# 7   2   5 
# 8   2   5 
# 9   3   2 
# 10   3   2 
# 11   3  NA 
# 12   3  NA 
# 13   2   4 
# 14   2   4 
# 15   2   5 
# 16   2   5 

####STEP 1 
#find the median of MaxPlayer for each group of MinPlayer (e.g., when MinPlayer == 1, 2 or whatever) 
#just add a column to the data frame that has the right median value for each subset of MinPlayer in it and grab that value to use later. 
library(plyr) #plyr is a great way to compute things across data subsets 
df <- ddply(df, c("MinPlayer"), transform, 
      median.minp = median(MaxPlayer, na.rm = TRUE)) #ignore NAs in the median 

####STEP 2 
#anytime that MaxPlayer == NA, grab the median value to replace the NA, otherwise keep the MaxPlayer value 
df$MaxPlayer <- ifelse(is.na(df$MaxPlayer), df$median.minp, df$MaxPlayer) 

####STEP 3 
#you had to compute an extra column you don't really want, so drop it now that you're done with it 
df <- df[ , !(names(df) %in% "median.minp")] 

####RESULT 
# > df 
# MinPlayer MaxPlayer 
# 1   2   4 
# 2   2   4 
# 3   2   5 
# 4   2   5 
# 5   2   4 
# 6   2   4 
# 7   2   5 
# 8   2   5 
# 9   3   2 
# 10   3   2 
# 11   3   2 
# 12   3   2 
# 13   3   2 
# 14   3   2 
# 15   3   2 
# 16   3   2 
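
The same two steps can also be done without plyr. The following is only a minimal sketch in base R, assuming the same MinPlayer/MaxPlayer fake data as above; `ave()` is swapped in here as an alternative to `ddply` and is not part of the original answer, and `df2` and `group_median` are names chosen just for this illustration.

```r
# Rebuild the fake data from the example above
MinPlayer <- rep(3:2, each = 4)
MaxPlayer <- rep(2:5, each = 2, times = 2)
df2 <- data.frame(MinPlayer, MaxPlayer)
df2$MaxPlayer <- ifelse(df2$MaxPlayer == 3, NA, df2$MaxPlayer)

# Step 1: per-group median of MaxPlayer, repeated on every row of its group
group_median <- ave(df2$MaxPlayer, df2$MinPlayer,
                    FUN = function(x) median(x, na.rm = TRUE))

# Step 2: fill in only the missing values, keep everything else
df2$MaxPlayer <- ifelse(is.na(df2$MaxPlayer), group_median, df2$MaxPlayer)
```

Unlike `ddply`, which returns the rows sorted by group, `ave()` keeps the original row order, and no helper column has to be dropped afterwards.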

Old answer below...

Please post a reproducible example!

#fake data 
this <- rep(1:2, each = 1, times = 2) 
that <- rep(3:2, each = 1, times = 2) 

df <- data.frame(this, that) 

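If you are just asking about basic indexing, for example finding the values that satisfy some condition, this returns the row indices of the values that match the condition (look up ?which):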

> which(df$this < df$that) 
[1] 1 3 

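This returns the values of the rows that match the condition rather than the row indices. You simply use the row indices returned by "which" to look up the corresponding values in the right column of the data frame (here, "this"):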

> df[which(df$this < df$that), "this"] 
[1] 1 1 
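
As a side note, the row indices from `which()` are not strictly required, because a logical vector can be used as an index directly. The sketch below reuses the same fake `this`/`that` data; the vector `x` and the result names are made up for illustration. The one difference worth knowing is how an NA in the condition is handled, which matters for columns with missing values.

```r
this <- rep(1:2, each = 1, times = 2)
that <- rep(3:2, each = 1, times = 2)
df <- data.frame(this, that)

# Logical indexing: same values as df[which(df$this < df$that), "this"]
vals <- df$this[df$this < df$that]

# With an NA in the condition the two approaches differ
x <- c(1, NA, 3)
with_which   <- x[which(x > 0)]  # which() drops the NA position
with_logical <- x[x > 0]         # logical indexing keeps an NA placeholder
```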

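If you want to apply some calculation when "this" is less than "that" and add the result to your data frame as a new column, just use "ifelse". It evaluates the condition as a logical vector and returns the computed value only where the condition holds (that is, where the logical test is TRUE).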

#if "this" is < "that", multiply by 2 
df$result <- ifelse(df$this < df$that, df$this * 2, NA) 

> df 
this that result 
1 1 3  2 
2 2 2  NA 
3 1 3  2 
4 2 2  NA 

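Without a reproducible example, I can't offer anything more specific.

Apologies. Since I didn't even know how to start coding this, I didn't know how to provide a reproducible example. – Zelbinian

Thank you for trying to answer. By trying some of your suggestions I was able to see the problem more clearly and work out how to post an example. – Zelbinian

@Zelbinian, so generally you would mark griffmer's as the answer. – Chris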

0

I think you already have everything you need in @griffmer's answer. But a less elegant, though perhaps more intuitive, way might be a loop:

## Your data: 
df <- data.frame(
     Game = LETTERS[1:11], 
     Rating = c(6,7,6.5,7,7,9.5,6,7,6.5,7,7), 
     MinPlayers = c(rep(3,5), rep(2,6)), 
     MaxPlayers = c(6,7,NA,6,5,5,4,4,NA,2,4)  
) 

## Loop over rows: 
df$MaxPlayersGuess <- vapply(1:nrow(df), function(ii){ 
      if (is.na(df$MaxPlayers[ii])){ 
       median(df$MaxPlayers[df$MinPlayers == df$MinPlayers[ii]], 
         na.rm = TRUE)    
      } else { 
       df$MaxPlayers[ii] 
      }   
     }, numeric(1)) 

This gives you:

df 
# Game Rating MinPlayers MaxPlayers MaxPlayersGuess 
# 1  A 6.0   3   6    6 
# 2  B 7.0   3   7    7 
# 3  C 6.5   3   NA    6 
# 4  D 7.0   3   6    6 
# 5  E 7.0   3   5    5 
# 6  F 9.5   2   5    5 
# 7  G 6.0   2   4    4 
# 8  H 7.0   2   4    4 
# 9  I 6.5   2   NA    4 
# 10 J 7.0   2   2    2 
# 11 K 7.0   2   4    4 
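
The loop above recomputes the group median for every missing row and copies MaxPlayers into the guess column for the other rows. A possible variant, sketched here under the assumption of the same data, computes each MinPlayers group's median once with `tapply()` and looks it up by name. It also leaves the non-missing rows as NA, which is the layout the question asked for. The names `med` and `missing_rows` are chosen only for this illustration.

```r
df <- data.frame(
  Game = LETTERS[1:11],
  Rating = c(6, 7, 6.5, 7, 7, 9.5, 6, 7, 6.5, 7, 7),
  MinPlayers = c(rep(3, 5), rep(2, 6)),
  MaxPlayers = c(6, 7, NA, 6, 5, 5, 4, 4, NA, 2, 4)
)

# One median per MinPlayers group; the result is named by group value ("2", "3")
med <- tapply(df$MaxPlayers, df$MinPlayers, median, na.rm = TRUE)

# Start with an all-NA guess column, then fill only the rows where MaxPlayers is missing
missing_rows <- is.na(df$MaxPlayers)
df$MaxPlayersGuess <- NA_real_
df$MaxPlayersGuess[missing_rows] <-
  as.numeric(med[as.character(df$MinPlayers[missing_rows])])
```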
0

If you want to use dplyr, you can try:

Input:

df <- data.frame(
    Game = LETTERS[1:11], 
    Rating = c(6,7,6.5,7,7,9.5,6,7,6.5,7,7), 
    MinPlayers = c(rep(3,5), rep(2,6)), 
    MaxPlayers = c(6,7,NA,6,5,5,4,4,NA,2,4)  
) 

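Process:

library(dplyr) # provides %>%, group_by(), mutate() and if_else()

df %>% 
    group_by(MinPlayers) %>% 
    mutate(MaxPlayers = if_else(is.na(MaxPlayers), median(MaxPlayers, na.rm=TRUE), MaxPlayers)) 

This groups the data by MinPlayers and then assigns the group's median MaxPlayers to the rows where that value is missing.

Output: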

Source: local data frame [11 x 4] 
Groups: MinPlayers [2] 

    Game Rating MinPlayers MaxPlayers 
    <fctr> <dbl>  <dbl>  <dbl> 
1  A 6.0   3   6 
2  B 7.0   3   7 
3  C 6.5   3   6 
4  D 7.0   3   6 
5  E 7.0   3   5 
6  F 9.5   2   5 
7  G 6.0   2   4 
8  H 7.0   2   4 
9  I 6.5   2   4 
10  J 7.0   2   2 
11  K 7.0   2   4 
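
The pipeline above overwrites MaxPlayers itself. If the goal is the layout from the question, where MaxPlayers stays untouched and a separate MaxPlayersGuess column is filled only for the missing rows, a small variation may work. This is a sketch under the assumption that dplyr is installed; `guessed` is a name chosen for illustration.

```r
library(dplyr)

df <- data.frame(
  Game = LETTERS[1:11],
  Rating = c(6, 7, 6.5, 7, 7, 9.5, 6, 7, 6.5, 7, 7),
  MinPlayers = c(rep(3, 5), rep(2, 6)),
  MaxPlayers = c(6, 7, NA, 6, 5, 5, 4, 4, NA, 2, 4)
)

guessed <- df %>%
  group_by(MinPlayers) %>%
  # Guess only where MaxPlayers is missing; NA_real_ keeps the column numeric
  mutate(MaxPlayersGuess = if_else(is.na(MaxPlayers),
                                   median(MaxPlayers, na.rm = TRUE),
                                   NA_real_)) %>%
  ungroup()
```

The final `ungroup()` removes the grouping so that later operations on `guessed` are not silently applied per MinPlayers group.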