When reading the data with read.csv, you can use skip = 1 to avoid this problem. I grabbed a few lines from the raw data, and it seems to work fine.
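The first line of the file is unnecessary. It pushes the header row down into the first data row, so every column contains text and is converted to a factor on read. When you then call as.numeric, you replace each factor value with its internal integer code, which differs from the original value and is almost certainly wrong. That is the "skew" you describe.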
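To make the factor issue concrete, here is a minimal sketch using four values borrowed from the MEDV column of the sample. The variable names (f, codes, values) are only illustrative. Note also that since R 4.0.0 read.csv defaults to stringsAsFactors = FALSE, so a newer R would likely give character columns here instead of factors.

```r
# Values as they would look after being read in as text and stored as a factor.
f <- factor(c("24", "21.6", "34.7", "33.4"))

# as.numeric() on a factor returns the internal level codes, not the values.
codes <- as.numeric(f)

# Going through as.character() first recovers the original numbers.
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

If you ever do end up with a factor that holds numbers, as.numeric(as.character(x)) is the usual way to get the real values back, though fixing the read itself is cleaner.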
txt <- '506,13,,,,,,,,,,,,
"CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4'
Your current call produces factors:
sapply(read.csv(text = txt), class)
# X506 X13 X X.1 X.2 X.3 X.4
# "factor" "factor" "factor" "factor" "factor" "factor" "factor"
# X.5 X.6 X.7 X.8 X.9 X.10 X.11
# "factor" "factor" "factor" "factor" "factor" "factor" "factor"
skip = 1 seems to do the trick, as it produces numeric columns:
sapply(read.csv(text = txt, skip = 1), class)
# CRIM ZN INDUS CHAS NOX RM AGE
# "numeric" "integer" "numeric" "integer" "numeric" "numeric" "numeric"
# DIS RAD TAX PTRATIO B LSTAT MEDV
# "numeric" "integer" "integer" "numeric" "numeric" "numeric" "numeric"
So if you change your first line to
y <- read.csv("boston_house_prices.csv", skip = 1)
everything should be fine afterwards, with no other conversions necessary.
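If you want to confirm that the read worked before going further, a quick check along these lines may help. This is only a sketch: sample_txt is a hypothetical three-column stand-in for the real file, not its actual contents.

```r
# Hypothetical stand-in for the file: a junk first line, then header and data.
sample_txt <- '506,13,
"CRIM","ZN","MEDV"
0.00632,18,24
0.02731,0,21.6'

# skip = 1 drops the junk line, so the next line is used as the header.
y <- read.csv(text = sample_txt, skip = 1)

# Stop early if any column was not parsed as a number.
stopifnot(all(sapply(y, is.numeric)))

str(y)
```

The stopifnot call fails loudly if a stray text line sneaks back in, which is easier to notice than silently converted values.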
That was not so obvious. It works now, though! Where is this documented? I looked at http://cran.r-project.org/doc/manuals/R-data.html and couldn't find much on the skip argument. – leonard 2014-09-28 04:15:55
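Well, this is a Python package, so I wouldn't expect this to happen with R GitHub data sets. 'skip' is documented in the '?read.table' help file, and really the whole help file is very useful. – 2014-09-28 04:18:02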