2017-05-08 33 views
1

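How to replace NA in R with the median based on a condition

I have diamonds data from https://drive.google.com/file/d/0B9YMMvghK2ytSXI4RFo0clNLc28/view

It has roughly 600,000 rows.

It is a diamonds dataset with missing values. I want to replace each missing price with the median price for that diamond's color.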

summary(BigDiamonds)
##        X1             carat           cut               color
##  Min.   :     1   Min.   :0.200   Length:598024      Length:598024
##  1st Qu.:149507   1st Qu.:0.500   Class :character   Class :character
##  Median :299013   Median :0.900   Mode  :character   Mode  :character
##  Mean   :299013   Mean   :1.071
##  3rd Qu.:448518   3rd Qu.:1.500
##  Max.   :598024   Max.   :9.250
##
##    clarity              table           depth           cert
##  Length:598024      Min.   : 0.00   Min.   : 0.00   Length:598024
##  Class :character   1st Qu.:56.00   1st Qu.:61.00   Class :character
##  Mode  :character   Median :58.00   Median :62.10   Mode  :character
##                     Mean   :57.63   Mean   :61.06
##                     3rd Qu.:59.00   3rd Qu.:62.70
##                     Max.   :75.90   Max.   :81.30
##
##  measurements           price             x                y
##  Length:598024      Min.   :  300   Min.   : 0.150   Min.   : 1.000
##  Class :character   1st Qu.: 1220   1st Qu.: 4.740   1st Qu.: 4.970
##  Mode  :character   Median : 3503   Median : 5.780   Median : 6.050
##                     Mean   : 8753   Mean   : 5.991   Mean   : 6.199
##                     3rd Qu.:11174   3rd Qu.: 6.970   3rd Qu.: 7.230
##                     Max.   :99990   Max.   :13.890   Max.   :13.890
##                     NA's   :713     NA's   :1815     NA's   :1852
##        z
##  Min.   : 0.040
##  1st Qu.: 3.120
##  Median : 3.860
##  Mean   : 4.033
##  3rd Qu.: 4.610
##  Max.   :13.180
##  NA's   :2544

table(BigDiamonds$color)
##
##     D     E     F     G     H     I     J     K     L
## 73630 93483 93573 96204 86619 70282 48709 25868  9656

Diamonds2 <- BigDiamonds[is.na(BigDiamonds$price), ]   # rows with a missing price
Diamonds3 <- BigDiamonds[!is.na(BigDiamonds$price), ]  # rows with a known price
library(Hmisc)
summarize(Diamonds3$price, Diamonds3$color, median)    # median price per color
##   Diamonds3$color Diamonds3$price
## 1               D            2690
## 2               E            2342
## 3               F            2966
## 4               G            3720
## 5               H            4535
## 6               I            4717
## 7               J            4697
## 8               K            4418
## 9               L            3017

I tried the following to replace the NA values with the per-color median price, but it does not work:

Diamonds21=select(Diamonds2,price,color,cut) 

Diamonds21$newprice=ifelse(Diamonds21$color=="J",4697,Diamonds21$newprice) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="D",2690,Diamonds21$newprice) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="E",2342,Diamonds21$newprice) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="F",2966,Diamonds21$newprice) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="G",3720,Diamonds21$newprice) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="H",4535,Diamonds21$newprice) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="I",4717,Diamonds21$newprice) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="K",4418,Diamonds21$newprice) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="L",3017,Diamonds21$newprice) 

Can you tell me what is wrong with my logic?

+1

Linking to a 600K-row dataset is overkill. Just use a version of "mtcars" or a small representative sample as your example. – thelatemail

+0

I think I would give it artificial missing values and replace them with the median price based on gear or cylinder –

Answers

0

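I tried this and it worked.

Note that the first line is different: it falls back to Diamonds21$price, because the newprice column does not exist yet when that line runs.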

Diamonds21$newprice=ifelse(Diamonds21$color=="J",4697,Diamonds21$price) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="D",2690,Diamonds21$newprice) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="E",2342,Diamonds21$newprice) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="F",2966,Diamonds21$newprice) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="G",3720,Diamonds21$newprice) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="H",4535,Diamonds21$newprice) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="I",4717,Diamonds21$newprice) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="K",4418,Diamonds21$newprice) 
Diamonds21$newprice<-ifelse(Diamonds21$color=="L",3017,Diamonds21$newprice) 
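
The nine ifelse() calls can also be collapsed into a single lookup. The following is only a sketch, not the poster's code: `toy` is a made-up stand-in for Diamonds21, and the medians are copied from the summarize() output in the question. It fills in only the rows whose price is NA, which is an assumption about the intent.

```r
# Hypothetical toy data standing in for Diamonds21
toy <- data.frame(
  color = c("D", "J", "E", "J"),
  price = c(NA, NA, 500, NA),
  stringsAsFactors = FALSE
)

# Per-color medians copied from the summarize() output in the question
color_median <- c(D = 2690, E = 2342, F = 2966, G = 3720, H = 4535,
                  I = 4717, J = 4697, K = 4418, L = 3017)

# Indexing the named vector by color looks up one median per row;
# ifelse() keeps the existing price wherever it is not NA
toy$newprice <- ifelse(is.na(toy$price), color_median[toy$color], toy$price)

print(toy)
```

Because the medians live in one named vector, adding or correcting a color means changing a single entry rather than another ifelse() line.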
1

We can use na.aggregate to replace the NA values. It accepts a FUN argument, where we specify median; by default it uses mean.

library(zoo) 
na.aggregate(BigDiamonds$price, FUN = median) 
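
The call above fills every NA with one overall median. If a separate median per color is wanted, the same idea can be sketched in base R with ave(), which applies a function within each group. This is an illustration on made-up data, not part of the zoo API: `toy` and `impute_median` are hypothetical names.

```r
# Hypothetical toy data: two colors, one missing price in each
toy <- data.frame(
  color = c("D", "D", "D", "E", "E", "E"),
  price = c(100, NA, 300, 10, 20, NA),
  stringsAsFactors = FALSE
)

# Replace the NAs in a vector with the median of its non-missing values
impute_median <- function(v) replace(v, is.na(v), median(v, na.rm = TRUE))

# ave() splits price by color, applies impute_median to each piece,
# and returns the results in the original row order
toy$price_filled <- ave(toy$price, toy$color, FUN = impute_median)

print(toy)
```

Here the missing D price becomes the median of 100 and 300, and the missing E price becomes the median of 10 and 20.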
+0

Will it give a different median for each color? –

+1

@AjayOhri In that case you have to use a grouped operation, i.e. 'library(data.table); setDT(BigDiamonds)[, price := na.aggregate(price, FUN = median), by = color]' – akrun

1

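There are a few different ways to approach this problem, depending on what suits your needs.

First, let's build a diamonds dataset with missing price values: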

library(dplyr) 

data(diamonds, package = "ggplot2") 

diamonds_missing <- diamonds %>% 
    mutate(price = ifelse(sample(1:0, 
           size = length(diamonds$price), 
           replace = TRUE, 
           prob = c(0.8, 0.2)), 
          price, NA)) 

Now roughly 20% of the price values in the diamonds dataset are missing.

You can replace them with the median using mutate() and ifelse():

diamonds_missing %>% 
    mutate(price = ifelse(is.na(price), median(price, na.rm = TRUE), price)) 

#> # A tibble: 53,940 × 10
#>    carat       cut color clarity depth table price     x     y     z
#>    <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1   0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
#> 2   0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
#> 3   0.23      Good     E     VS1  56.9    65  2396  4.05  4.07  2.31
#> 4   0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
#> 5   0.31      Good     J     SI2  63.3    58  2396  4.34  4.35  2.75
#> 6   0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
#> 7   0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
#> 8   0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
#> 9   0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
#> 10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
#> # ... with 53,930 more rows

Or, if you prefer, you can use the replace_na() function from the tidyr package:

library(tidyr) 
diamonds_missing %>% 
    replace_na(list(price = median(diamonds_missing$price, na.rm = TRUE))) 
#> # A tibble: 53,940 × 10
#>    carat       cut color clarity depth table price     x     y     z
#>    <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1   0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
#> 2   0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
#> 3   0.23      Good     E     VS1  56.9    65  2396  4.05  4.07  2.31
#> 4   0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
#> 5   0.31      Good     J     SI2  63.3    58  2396  4.34  4.35  2.75
#> 6   0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
#> 7   0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
#> 8   0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
#> 9   0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
#> 10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
#> # ... with 53,930 more rows
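
Both calls above use a single median computed over the whole price column. A per-color variant of the mutate() approach might look like the sketch below, which adds group_by() so that median() is evaluated within each color. The `toy` tibble is made-up data for illustration; the same pipeline should apply to diamonds_missing, though that has not been run here.

```r
library(dplyr)

# Hypothetical toy data; the real diamonds_missing has many more rows
toy <- tibble(
  color = c("D", "D", "D", "E", "E", "E"),
  price = c(100, NA, 300, 10, 20, NA)
)

toy_filled <- toy %>%
  group_by(color) %>%   # median() below is now computed per color
  mutate(price = ifelse(is.na(price), median(price, na.rm = TRUE), price)) %>%
  ungroup()             # drop the grouping so later steps see all rows

print(toy_filled)
```

The ungroup() at the end is a precaution so that any later summaries are not silently computed per color.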
+0

Will it give the median price for each color? –