Reading a value based on a different value in R


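Okay, so I'm having a bit of trouble with something that I know must have a solution. I have a data table with 13 columns, but we only care about two of them (Fare and pClass). There are 1309 rows, 1308 of which have a Fare value, and I want to fill in the missing value using the average for its class (pClass). So what I want is to tell R to find a row where Fare is NA, read the value in pClass (1, 2, or 3), compute the average Fare for that class, and then replace the missing Fare with that average.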

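So, to sum up your mission, for whoever is brave and kind enough to help me: I want to find a missing value, work out which class it belongs to, average the values for that specific class, and replace the missing value with the correct average.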

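Using this approach, rather than just finding the row with the missing value and reading it by hand, is the better route for when I have multiple missing values in R, because each one can be replaced with the correct average no matter what the deciding column is.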

Thank you for your time,

-Dylan

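Okay, since this was too specific, here's an edit to the original question. New plan, boys (and girls, and whatever else you want to be; IDRC as long as you know what you're talking about). So! The new plan is to make 3 variables corresponding to the three different pClass values (1, 2, and 3), each holding the average for that pClass (I'll call 'em pClassAVG.x, where x = 1, 2, or 3), and then have R find the missing Fare values and replace them with the variable (average) for the corresponding pClass. R's thought process should go like this: "Okay, a missing value. What's the pClass? Okay, it's 2, so we should replace the missing value with pClassAVG.2."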

Last time I got a -1 for not including my code, so here it is:

setwd("C:/Users/Maker/Desktop/Data Science/Data/Dylan T/Titanic data") 
titanic.train <- read.csv(file = "train.csv", stringsAsFactors = FALSE, header = TRUE) 
titanic.test <- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE) 
# Line 1 sets the working directory where the data files live; lines 2 and 3 read the two CSV files into data frames.
# stringsAsFactors = FALSE keeps text columns as character vectors, which is easier when cleaning the two data sheets together.
# header = TRUE tells R that the first line holds column names and should not be read as data.
# currently reads incorrectly

titanic.train$IsTrainSet <- TRUE 
titanic.test$IsTrainSet <- FALSE 
# Adds a new column that tells us whether a row comes from the train set or the test set

titanic.test$Survived <- NA 
# Adds a new column filled with NA so that both data frames have the same column names and line up

titanic.full <- rbind(titanic.train, titanic.test) 
titanic.full[titanic.full$Embarked=='', "Embarked"] <- 'S' 
#ended day 1 at 12 minutes 

age.median <- median(titanic.full$Age, na.rm = TRUE) 
# Creates a variable called age.median and assigns it the median of the Age column, excluding the missing values
# (with the NAs included, the result would itself be NA).
# Computing the median from the data, rather than hard-coding it, suits data that can change, for example real-time data
# that is updated over the course of the day: the replacement value stays correct each time the data is refreshed.

titanic.full[is.na(titanic.full$Age), "Age"] <- age.median 
# table(is.na(titanic.full$Age)) counts the missing values in the Age column of titanic.full; the TRUE count is the number of missing values

pClassAVG.1 <- median(titanic.full$Fare, na.rm = TRUE, titanic.full$Pclass == 1) 
pClassAVG.2 <- median(titanic.full$Fare, na.rm = TRUE, titanic.full$Pclass == 2)

The last two lines are my attempt at creating pClassAVG.1 and pClassAVG.2 in the way described above.
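A minimal sketch of the plan described above, assuming a small made-up data frame called toy in place of titanic.full (the values are invented for illustration and are not from the Titanic data). median() has no argument for filtering rows, so the class condition is applied by subsetting Fare before the median is taken. The median is kept here to match the attempt; mean() could be swapped in the same way.

```r
# Toy stand-in for titanic.full (hypothetical values, for illustration only)
toy <- data.frame(
  Pclass = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Fare   = c(80, 60, NA, 20, NA, 30, 8, 7, 9)
)

# Subset Fare by class first, then take the median of that subset
pClassAVG.1 <- median(toy$Fare[toy$Pclass == 1], na.rm = TRUE)
pClassAVG.2 <- median(toy$Fare[toy$Pclass == 2], na.rm = TRUE)
pClassAVG.3 <- median(toy$Fare[toy$Pclass == 3], na.rm = TRUE)

# Replace the missing fares class by class
toy$Fare[is.na(toy$Fare) & toy$Pclass == 1] <- pClassAVG.1
toy$Fare[is.na(toy$Fare) & toy$Pclass == 2] <- pClassAVG.2
toy$Fare[is.na(toy$Fare) & toy$Pclass == 3] <- pClassAVG.3
```

Because each replacement selects rows with a logical condition, it handles any number of missing values per class, including none.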


2017-10-13 Dylan

A reproducible example w/ your data would be helpful (https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Tung

Dylan, for your next question, please take a look at the link @thecatalyst just provided – Thai

df <- data.frame(Fare = c(10,20,30,40,50,60,NA,70,80), pClass = c(1,2,3,1,2,3,1,2,3))

a <- df$pClass[which(is.na(df$Fare))] # find the pClass where Fare is missing

df$Fare[which(is.na(df$Fare))] <- mean(df$Fare[df$pClass == a], na.rm = TRUE) # replace the missing Fare with the mean of the corresponding pClass

This only works if a single Fare value is missing.
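When more than one fare is missing, one option is base R's ave(), which applies a function within each group and returns a vector aligned with the original rows. A sketch, assuming the same column names as the example above and made-up values with two missing fares:

```r
df <- data.frame(
  Fare   = c(10, 20, 30, 40, NA, 60, NA, 70, 80),
  pClass = c(1, 2, 3, 1, 2, 3, 1, 2, 3)
)

# Per-row class mean: every row gets the mean Fare of its own pClass
class_mean <- ave(df$Fare, df$pClass, FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the rows where Fare is missing
missing <- is.na(df$Fare)
df$Fare[missing] <- class_mean[missing]
```

Rows that already have a fare are left untouched, since only the positions flagged in missing are overwritten.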


2017-10-13 23:13:47 Swapnil

What do Fare = c and pClass = c do? – Dylan

@Dylan c() creates a vector, which is then assigned to the variables Fare and pClass. These variables are then used as columns to create df – Swapnil
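To make that concrete, here is a small sketch using base R's data.frame() and made-up values. The vectors are built with c() first and then passed as named columns:

```r
# c() combines its arguments into a single vector
Fare <- c(10, 20, NA)
pClass <- c(1, 2, 3)

# Inside data.frame(), name = vector defines a column with that name
df <- data.frame(Fare = Fare, pClass = pClass)

length(Fare)  # number of elements in the vector
names(df)     # column names of the data frame
```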

This should work. Let me know if a more elegant solution with apply is possible, but this works as well.

# Creating a data frame named df
fare <- c(6,8,3,NA,5,1,0,7,NA,4,1,8,6,NA,2)
pclass <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
df <- data.frame(fare, pclass)

# Loop over each row
for (i in seq_along(df$fare)) {
  # If the value for fare is missing...
  if (is.na(df$fare[i])) {
    # ...replace it with the mean of the group defined in pclass
    df$fare[i] <- mean(df$fare[df$pclass == df$pclass[i]], na.rm = TRUE)
  }
}
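On the question of an apply-style alternative, one possibility is tapply(), which computes the per-class means once and returns them as a named vector that can be indexed by class. A sketch reusing the same fare and pclass values (class_means and missing are names introduced here for illustration):

```r
fare   <- c(6, 8, 3, NA, 5, 1, 0, 7, NA, 4, 1, 8, 6, NA, 2)
pclass <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
df <- data.frame(fare, pclass)

# Named vector of class means; the names are "1", "2", "3"
class_means <- tapply(df$fare, df$pclass, mean, na.rm = TRUE)

# Look up each missing row's class mean by name and fill it in
missing <- is.na(df$fare)
df$fare[missing] <- class_means[as.character(df$pclass[missing])]
```

This avoids the explicit loop and computes each class mean only once.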


2017-10-13 23:22:14 Thai
