2015-01-14 35 views
1

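1) I want to do a subset operation in GNU R. From the data set here, I want to keep only Brazil, Time, and every Series.Name about income share (such as "Income share held by lowest 10%", "Income share held by lowest 20%", and so on). There are 7 income-share series names in total. The data set is loaded as a data frame.

I tried the following command, but I cannot subset on more than one "Series.Name":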

test <- melt(subset(WDI, subset = Series.Name == "Income share held by lowest 10%", select = -c(Time.Code, Series.Code, Argentina, Canada, Chile, Colombia, Mexico, USA, Venezuela)), id.vars = c("Series.Name", "Time")) 

2) In a second step, I want to remove all rows that contain NA values.

The complete code I am using is the following:

WDI <- read.csv("https://dl.dropboxusercontent.com/u/109495328/WDI_Data_final.csv", na.strings = "..") 
library(reshape) 
library(reshape2) 
WDI <- rename(WDI, (c(Argentina..ARG.="Argentina", Brazil..BRA.="Brazil", Canada..CAN.="Canada", Chile..CHL.="Chile", Colombia..COL.="Colombia", Mexico..MEX.="Mexico", United.States..USA.="USA", Venezuela..RB..VEN.="Venezuela"))) 
income_brazil_long <- melt(subset(WDI, subset = Series.Name == "Income share held by lowest 10%", select = -c(Time.Code, Series.Code, Argentina, Canada, Chile, Colombia, Mexico, USA, Venezuela)), id.vars = c("Series.Name", "Time")) 

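The data set in question is 128 KB in size. I thought it would be better to provide the original data than to make up some random data, to prevent misunderstandings.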

See also http://stackoverflow.com/questions/4862178/remove-rows-with-nas-in-data-frame, just in case.

Thanks, that is the solution.

Answers

3

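Looking at your data, it is probably easiest to do the subsetting with the help of grepl.

We use grepl to search the "Series.Name" column for all rows that contain the string "Income share held". This creates a logical vector that marks the rows we want. The columns we want are the first, third, and sixth.

Wrap all of that in na.omit to drop any rows with NA values.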

WDI_Brazil <- na.omit(WDI[grepl("Income share held", WDI$Series.Name), 
          c(1, 3, 6)]) 

The data are already "long", so there is no need to melt. What does the data.frame look like?

summary(WDI_Brazil) 
#                            Series.Name      Time       Brazil..BRA.
#  Income share held by fourth 20% :28   Min.   :1981   Min.   : 0.600
#  Income share held by highest 10%:28   1st Qu.:1988   1st Qu.: 2.895
#  Income share held by highest 20%:28   Median :1996   Median :10.320
#  Income share held by lowest 10% :28   Mean   :1996   Mean   :20.948
#  Income share held by lowest 20% :28   3rd Qu.:2004   3rd Qu.:43.797
#  Income share held by second 20% :28   Max.   :2012   Max.   :67.310
#  (Other)                         :28
table(droplevels(WDI_Brazil$Series.Name)) 
# 
#  Income share held by fourth 20% Income share held by highest 10% Income share held by highest 20%
#                               28                               28                               28
#  Income share held by lowest 10%  Income share held by lowest 20%  Income share held by second 20%
#                               28                               28                               28
#   Income share held by third 20%
#                               28

Note that, as expected, there are seven factor levels in "Series.Name".
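
The same grepl plus na.omit pattern can be tried on a small made-up data frame first. The sketch below is only an illustration: the data frame `toy`, its values, and the already-renamed "Brazil" column are assumptions, not the real WDI data.

```r
# Toy data frame (hypothetical values, NOT the real WDI data)
toy <- data.frame(
  Series.Name = c("Income share held by lowest 10%",
                  "Income share held by lowest 20%",
                  "GINI index",
                  "Income share held by highest 10%"),
  Time   = c(2001, 2001, 2001, 2002),
  Brazil = c(0.7, NA, 59.3, 46.1),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where Series.Name contains the pattern
keep <- grepl("Income share held", toy$Series.Name)

# Keep the matching rows and the three columns of interest,
# then drop any remaining row that contains an NA
toy_brazil <- na.omit(toy[keep, c("Series.Name", "Time", "Brazil")])
```

Selecting the columns by name instead of by position (1, 3, 6) is a design choice made here so the sketch keeps working if the column order changes. If the pattern should be matched literally rather than as a regular expression, grepl also accepts fixed = TRUE.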

Awesome! I love how clear it is.

1

Well, you can do what you are looking for with base functions.

WDI <- read.csv("WDI_Data_final.csv", header=TRUE, na.strings="..") 

# The colnames are strange from the file so reset for clarity 
colnames(WDI) <- c("Series.Name", "Series.Code", "Time","Time.Code","Argentina", 
        "Brazil", "Canada", "Chile", "Colombia","Mexico", 
        "USA", "Venezuela") 

# do the subsetting 
test <- with(WDI, 
      WDI[Series.Name=="Income share held by lowest 10%", 
       c("Brazil","Time", "Series.Name")]) 

# if you want more, use %in% and specify the Series.Names you care about 
test <- with(WDI, 
      WDI[Series.Name %in% c("Income share held by lowest 10%", 
            "Income share held by lowest 20%"), 
       c("Brazil","Time", "Series.Name")]) 

# if you want all the 'income shares', the grepl solution above by 
# Ananda is the most concise. 

# you can then use reshape2::melt (the reshape2 package must be installed) 
melted_test <- reshape2::melt(test, id.vars=c("Series.Name", "Time")) 

To remove the NAs, just use complete.cases

test[complete.cases(test),] 
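
As a rough illustration of what complete.cases returns, here is a minimal sketch on a made-up data frame. The name `toy` and its values are assumptions for demonstration only and are not taken from the WDI file.

```r
# Toy data frame (hypothetical values, NOT the real WDI data)
toy <- data.frame(
  Series.Name = c("Income share held by lowest 10%",
                  "Income share held by lowest 20%",
                  "Income share held by lowest 10%"),
  Time   = c(2001, 2001, 2002),
  Brazil = c(0.7, NA, 0.8),
  stringsAsFactors = FALSE
)

# One logical value per row: TRUE when the row has no NA in any column
ok <- complete.cases(toy)

# Keep only the complete rows
cleaned <- toy[ok, ]

# To check only some columns, pass just those columns
ok_brazil <- complete.cases(toy[, "Brazil", drop = FALSE])
```

On a data frame like this, toy[complete.cases(toy), ] keeps the same rows as na.omit(toy). The indexing form is handy when the logical vector is needed for something else as well, or when only a subset of columns should decide which rows are dropped.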

@ColonelBeauvel good eye, fixed – cdeterman

Thank you very much for your answer, cdeterman. My takeaway is that my code was too complicated. I did not know that 'colnames' existed. Thumbs up, man!