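Okay, so I'm having a bit of trouble, and I know there has to be a solution. I have a data table with 13 columns, but we only care about two of them (Fare and pClass). There are 1309 rows, 1308 of which have a Fare value, and I want to fill in the missing value using the average for its class (pClass). So what I want is to tell R to find a row where Fare = NA, read the value in pClass (1, 2 or 3), find the average Fare for that class, and then replace the missing Fare with that average.
So, to sum up your mission, for whoever is brave and kind enough to help me: I want to find a missing value, find out which class it belongs to, average the Fare for that specific class, and replace the missing value with the correct average.
I think this approach is a better route than just finding the one row that is missing and reading it off by hand, because when I have multiple missing values in R I can replace each of them with the correct average, regardless of the value in the deciding column.
Thanks for your time,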
-Dylan
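Okay, since the original question was too specific to answer, here's a new plan, boys (and girls, and whatever else you want to be, IDRC as long as you know what you're talking about). So! The new plan is to make 3 variables corresponding to the three different pClass values (1, 2 and 3). Each one holds the average Fare for its pClass (I'm going to call 'em pClassAVG.(x), where x = 1, 2 or 3). Then I want R to find the missing values in Fare and replace them with the variable (the average) for the corresponding pClass. R's thought process should go something like this: "Okay, a missing value. What's the pClass? Okay, it's 2, so we should replace the missing value with pClassAVG.2."
Last time I got a -1 for not including my code, so here it is: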
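The plan above can be sketched roughly as follows. This is only a minimal illustration on a made-up `toy` data frame (the values are hypothetical, not taken from the real files), and it assumes the class column is spelled `Pclass`, as in the code further down. It uses `mean()` because the text says "average"; `median()` could be swapped in the same way.

```r
# Toy data standing in for titanic.full (hypothetical values, not the real file)
toy <- data.frame(
  Pclass = c(1, 1, 2, 2, 3, 3, 3),
  Fare   = c(80, 100, 20, NA, 8, 10, NA)
)

# One average per class, computed only from the rows of that class
pClassAVG.1 <- mean(toy$Fare[toy$Pclass == 1], na.rm = TRUE)
pClassAVG.2 <- mean(toy$Fare[toy$Pclass == 2], na.rm = TRUE)
pClassAVG.3 <- mean(toy$Fare[toy$Pclass == 3], na.rm = TRUE)

# "Missing value. What's the Pclass? It's 2, so use pClassAVG.2."
# Each line only touches rows that are both missing and in the given class.
toy$Fare[is.na(toy$Fare) & toy$Pclass == 1] <- pClassAVG.1
toy$Fare[is.na(toy$Fare) & toy$Pclass == 2] <- pClassAVG.2
toy$Fare[is.na(toy$Fare) & toy$Pclass == 3] <- pClassAVG.3
```

Because each replacement is filtered by both `is.na()` and the class, the same three lines should handle any number of missing fares, not just one.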
setwd("C:/Users/Maker/Desktop/Data Science/Data/Dylan T/Titanic data")
titanic.train <- read.csv(file = "train.csv", stringsAsFactors = FALSE, header = TRUE)
titanic.test <- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE)
# line one tells R where to look for the data. lines 2 & 3 read the two CSV files in so we can manipulate them
# stringsAsFactors = FALSE keeps text columns as plain character strings instead of factors, which makes it
# easier to combine the two sheets because we are going to clean them together
# header = TRUE tells R that the first line of each file holds the column names, so it is not read as data
#currently reads incorrectly
titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet <- FALSE
# makes a new column that tells us whether a row came from the train set or the test set
titanic.test$Survived <- NA
# makes a new column and fills it with NA so that both data frames have the same column names and rbind() can line them up
titanic.full <- rbind(titanic.train, titanic.test)
titanic.full[titanic.full$Embarked=='', "Embarked"] <- 'S'
#ended day 1 at 12 minutes
age.median <- median(titanic.full$Age, na.rm = TRUE)
# creates a variable called age.median and assigns it the median of the Age column, excluding the missing values (if we included the
# missing values the result would itself be NA, because the median of an undefined number is undefined)
# this method is better for data that can change, for example real-time data that is updated over the course of the day: the median
# is recomputed from the current data every time the script runs, so the value used to fill the gaps does not go stale.
titanic.full[is.na(titanic.full$Age), "Age"] <- age.median
# table(is.na(titanic.full$Age)) counts how many values in the Age column of titanic.full are missing (TRUE) and how many are not (FALSE)
pClassAVG.1 <- median(titanic.full$Fare, na.rm = TRUE, titanic.full$Pclass == 1)
pClassAVG.2 <- median(titanic.full$Fare, na.rm = TRUE, titanic.full$Pclass == 2)
The last two lines are my attempt at making pClassAVG.1 and pClassAVG.2 the way described above.
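One possible reason the attempt does not behave as intended: as far as I can tell, `median()` only uses `x` and `na.rm`, so the extra `titanic.full$Pclass == 1` argument is not applied as a filter and both calls would return the median of the whole Fare column. A hedged sketch of an alternative, again on a hypothetical `toy` data frame rather than the real data, is to subset Fare before calling `median()`, or to let base R's `ave()` compute the per-class median for every row at once.

```r
# Toy data (hypothetical values) standing in for titanic.full
toy <- data.frame(
  Pclass = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Fare   = c(50, 80, 200, 10, NA, 30, 7, 8, NA)
)

# Subset Fare to one class first, then take the median of that subset
median_class1 <- median(toy$Fare[toy$Pclass == 1], na.rm = TRUE)

# ave() returns a vector as long as Fare in which every row holds
# the median of its own Pclass group (missing values ignored)
class_median <- ave(toy$Fare, toy$Pclass,
                    FUN = function(x) median(x, na.rm = TRUE))

# Fill only the missing fares, each with the median of its own class
missing <- is.na(toy$Fare)
toy$Fare[missing] <- class_median[missing]
```

The `ave()` version needs no separate variable per class, so it should keep working if the deciding column has more than three distinct values.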
[A reproducible example w/ your data would be helpful](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Tung
Dylan, for your next question, please take a look at the link @thecatalyst just provided – Thai