How to programmatically get dataset header information from the UCI data repository in R

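I am trying to collect publicly available datasets from the UCI repository for use in R. I know that many of these datasets already ship with R packages such as mlbench, but there are still some that I need to fetch from the UCI repository directly.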

Here is one trick I have learned:

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
credit <- read.csv(url, header = FALSE)

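However, this does not bring in the header (the variable names). That information is stored as free text in the *.names file. Any ideas on how I can get the header information programmatically?

Source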

2012-11-08 Tae-Sung Shin

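I suspect you will have to use regular expressions to do this. Here is an ugly but general solution that should work for a variety of *.names files, assuming they are formatted like the one you posted.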

names.file.url <-'http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names' 
names.file.lines <- readLines(names.file.url) 

# get run lengths of consecutive lines containing a colon. 
# then find the position of the subgrouping that has a run length 
# equal to the number of columns in credit and sum the run lengths up 
# to that value to get the index of the last line in the names block. 
end.of.names <- with(rle(grepl(':', names.file.lines)), 
         sum(lengths[1:match(ncol(credit), lengths)])) 

# extract those lines 
names.lines <- names.file.lines[(end.of.names - ncol(credit) + 1):end.of.names] 

# extract the names from those lines 
names <- regmatches(names.lines, regexpr('(\\w)+(?=:)', names.lines, perl=TRUE)) 

# [1] "A1"  "A2"  "A3"  "A4"  "A5"  "A6"  "A7"  "A8"  "A9"  "A10" "A11" 
# [12] "A12" "A13" "A14" "A15" "A16"
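
The snippet above stops at extracting the names and never attaches them to the data frame. A minimal sketch of that last step follows. It uses a tiny made-up CSV in place of crx.data, and the variable name var.names is an assumption introduced here for illustration.

```r
# Toy stand-ins (assumptions): a headerless CSV and the names extracted for it.
toy <- read.csv(text = "b,30.83,0\na,58.67,4.46", header = FALSE)
var.names <- c("A1", "A2", "A3")

# Guard against a length mismatch before replacing the default V1, V2, ... names.
stopifnot(length(var.names) == ncol(toy))
names(toy) <- var.names
str(toy)
```

For the real dataset the equivalent call would be names(credit) <- names, using the vector built in the answer.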

Source

2012-11-08 18:23:01

The third line looks like magic to me. Thanks.

@Thomas, sorry. I usually try to avoid writing things that look like magic. I am short on time at the moment.
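
For readers who find the run-length step opaque, it can be unpacked on a small example. The lines below are made up to imitate the layout of a *.names file (they are not the real crx.names), and n.cols is a hypothetical column count.

```r
# Made-up lines imitating the layout of a *.names file.
lines <- c(
  "1. Title: Toy data",
  "",
  "7. Attribute Information:",
  "",
  "    A1: b, a.",
  "    A2: continuous.",
  "    A3: continuous.",
  "",
  "8. Missing values: none"
)
n.cols <- 3  # hypothetical column count of the matching data file

has.colon <- grepl(":", lines)  # TRUE for every line that contains a colon
runs <- rle(has.colon)          # consecutive TRUE/FALSE runs and their lengths
print(runs$lengths)             # 1 1 1 1 3 1 1

# Find the first run whose length equals the column count, then add up the
# run lengths up to and including it to get the last line of the names block.
run.idx <- match(n.cols, runs$lengths)
end.of.names <- sum(runs$lengths[1:run.idx])
block <- lines[(end.of.names - n.cols + 1):end.of.names]
print(block)
```

Here the only run of length 3 is the block of attribute lines, so end.of.names is 7 and block holds lines 5 to 7. Note that match() looks at all runs, including runs of lines without a colon, so a blank stretch of the same length earlier in the file would be picked up by mistake.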

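I guess the Attribute Information section must hold the names in the particular file you specified. Here is a very, very dirty solution. It relies on the fact that there is a pattern: each name is followed by a colon. So we use scan to split the text on ":" and then grab the names from the resulting vector: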

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
credit <- read.csv(url, header = FALSE)
url.names <- "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names"
mess <- scan(url.names, what = "character", sep = ":")
# the names sit at positions 31 to 61 of the vector, at every second element
mess.names <- mess[seq(31, 61, 2)]
names(credit) <- mess.names

Source

2012-11-08 18:19:31

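Actually, in this case the variable names are 'A1', ..., 'A16'. But I see what you mean. Thanks.

@Thomas OK, that makes sense. I have corrected my answer, but it is not as general as you wanted.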
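
One way to avoid hard-coded positions such as 31 and 61 is to wrap the run-length idea from the other answer in a function that fails loudly when no suitable block is found. The following is only a sketch under that assumption: extract.uci.names is a hypothetical helper, not part of any package, and it is exercised here on made-up lines rather than on a real *.names file.

```r
# Hypothetical helper (not from any package): return the n.cols names found in
# the first block of exactly n.cols consecutive lines that contain a colon.
extract.uci.names <- function(lines, n.cols) {
  runs <- rle(grepl(":", lines))
  # Only consider runs of colon-containing lines, not runs of other lines.
  hit <- which(runs$values & runs$lengths == n.cols)
  if (length(hit) == 0) stop("no block of ", n.cols, " 'name:' lines found")
  end <- sum(runs$lengths[seq_len(hit[1])])
  block <- lines[(end - n.cols + 1):end]
  # Take the word that sits directly before the first colon on each line.
  found <- regmatches(block, regexpr("\\w+(?=:)", block, perl = TRUE))
  if (length(found) != n.cols) stop("could not parse a name from every line")
  found
}

# Made-up input imitating a *.names file.
toy.lines <- c("Attribute Information:", "", "  A1: b, a.",
               "  A2: continuous.", "", "Class: +, -")
print(extract.uci.names(toy.lines, 2))
```

With the objects from the answers above, the call would be extract.uci.names(readLines(url.names), ncol(credit)). The function still assumes one "name: description" line per column, so files that describe attributes in another layout will trigger the error rather than return wrong names.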
