2016-03-01 34 views

R: using sapply with a function that takes a subset of a data frame as well as data frame columns

Consider the following data frame:

#build sample data.frame 
theData <- data.frame(surname = c("Smith","Parker", "Allen", "McGraw", "Parker", "Smith", "Smith"), 
        FamilySize = c(3, 2, 1, 1, 2, 3, 3)) 

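First, I need to verify that the number of people sharing a surname matches the size of the family they belong to. For example, there are 3 people with surname = "Smith", and their FamilySize value is 3. If this condition is met, the family size is prepended to the surname (e.g. "3Smith"). If not, the result should be the word "small".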

To do this, I wrote the following function:

# function 
familyKount <- function(df, lastName, famSize){ 
    # calculate number of persons sharing same surname 
    nPersons <- dim(subset(df, surname == lastName))[1] 

    # number of persons agrees with family size 
    if(nPersons == famSize) { 
      idFam <- paste(as.character(famSize), lastName, sep="") 
    } else {    # number of persons does not agree with family size 
      idFam <- "small" 
    } 
    idFam 
} 

So if I call the function like this:

familyKount(theData, theData$surname[1], theData$FamilySize[1]) 

I get the correct answer: "3Smith"

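However, what I want is to apply this function to the whole data frame without having to specify an index for surname and FamilySize (and I don't want to use a for loop). I have tried variants of the apply family of functions, but I haven't figured out how, in this case, to pass both the whole data frame and specific columns of it as arguments to the function.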

Cheers

Answers


There are many possible solutions. For example, you can use table():

table(theData$surname) 

##  Allen McGraw Parker  Smith 
##      1      1      2      3 
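The counts from table() can also be mapped back onto the rows to build the labels the question asks for. The following is only a minimal sketch: it rebuilds the sample data with stringsAsFactors = FALSE (an assumption, so that surname is a character vector and can be used as a lookup key), and nPersons and idFam are simply the names used in the question.

```r
# Sample data from the question; surname is kept as character (assumption)
theData <- data.frame(
  surname = c("Smith", "Parker", "Allen", "McGraw", "Parker", "Smith", "Smith"),
  FamilySize = c(3, 2, 1, 1, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Count persons per surname, then look up that count for every row by name
counts <- table(theData$surname)
nPersons <- as.vector(counts[theData$surname])

# "<size><surname>" where the count matches FamilySize, otherwise "small"
idFam <- ifelse(nPersons == theData$FamilySize,
                paste0(theData$FamilySize, theData$surname),
                "small")
idFam
```

Because everything here is vectorized, no explicit loop or apply call is needed.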

Or with dplyr:

library(dplyr) 
group_by(theData, surname) %>% 
    summarize(SizeCalculated = n()) 
## Source: local data frame [4 x 2] 
## 
##   surname SizeCalculated 
##    (fctr)          (int) 
## 1   Allen              1 
## 2  McGraw              1 
## 3  Parker              2 
## 4   Smith              3 

Or with aggregate():

aggregate(theData, list(theData$surname), length) 
##   Group.1 surname FamilySize 
## 1   Allen       1          1 
## 2  McGraw       1          1 
## 3  Parker       2          2 
## 4   Smith       3          3 

You can also build a solution with sapply(), which may be closer to what you had in mind:

surnames <- unique(theData$surname) 
counts <- sapply(surnames, function(s) sum(theData$surname == s)) 
data.frame(surnames, counts) 
##   surnames counts 
## 1    Smith      3 
## 2   Parker      2 
## 3    Allen      1 
## 4   McGraw      1 

The idea is to apply the function over the surnames.
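If the goal is to keep a function with the question's signature (df, lastName, famSize), one option is mapply(), which iterates over several vectors in parallel and passes fixed arguments through MoreArgs. This is only a sketch: familyKount below is a condensed restatement of the question's function, and stringsAsFactors = FALSE is an assumption made so that surname is a character vector.

```r
# Sample data from the question; surname is kept as character (assumption)
theData <- data.frame(
  surname = c("Smith", "Parker", "Allen", "McGraw", "Parker", "Smith", "Smith"),
  FamilySize = c(3, 2, 1, 1, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Condensed restatement of the question's function (same inputs, same result)
familyKount <- function(df, lastName, famSize) {
  nPersons <- sum(df$surname == lastName)   # persons sharing this surname
  if (nPersons == famSize) paste0(famSize, lastName) else "small"
}

# lastName and famSize vary row by row; df is passed unchanged on every call
theData$idFam <- mapply(familyKount,
                        lastName = theData$surname,
                        famSize  = theData$FamilySize,
                        MoreArgs = list(df = theData),
                        USE.NAMES = FALSE)
theData
```

The whole data frame goes in MoreArgs because it must not be split element by element, while the two columns are the vectors that mapply() walks over.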

All of these solutions can be extended to include a check against theData$FamilySize. For example, for the aggregate() solution:

# length() is applied to every column, so surname and FamilySize both hold the row count per group 
tab <- aggregate(theData, list(theData$surname), length) 
tab$size_check <- tab$surname == tab$FamilySize 
tab 
##   Group.1 surname FamilySize size_check 
## 1   Allen       1          1       TRUE 
## 2  McGraw       1          1       TRUE 
## 3  Parker       2          2       TRUE 
## 4   Smith       3          3       TRUE
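Note that in the aggregate() call above both compared columns are produced by length(), so they are row counts rather than the recorded family sizes. A sketch of a check that compares the counted persons with the FamilySize values stored in the data could use ave(), which returns one value per row. The column names nPersons and size_check are illustrative choices, not part of the original answer.

```r
# Sample data from the question
theData <- data.frame(
  surname = c("Smith", "Parker", "Allen", "McGraw", "Parker", "Smith", "Smith"),
  FamilySize = c(3, 2, 1, 1, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Number of rows per surname, repeated on every row of that surname
theData$nPersons <- ave(theData$FamilySize, theData$surname, FUN = length)

# TRUE where the counted persons agree with the recorded family size
theData$size_check <- theData$nPersons == theData$FamilySize

# One summary row per distinct surname / FamilySize combination
unique(theData[, c("surname", "FamilySize", "nPersons", "size_check")])
```

For the sample data every row passes the check; a surname whose recorded FamilySize differs from its row count would show FALSE.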