2012-11-08 27 views
3

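I am trying to collect publicly available datasets from the UCI repository in R. I know that many datasets are already available through several R packages, such as mlbench, but there are still some that I need to fetch from the UCI repository directly. How can I programmatically get the header information for a dataset from the UCI data repository in R?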

Here is a trick I learned:

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
# header = FALSE: the data file has no header row
credit <- read.csv(url, header = FALSE)

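However, this does not retrieve the header (variable name) information. That information is stored as plain text in the *.names file. Any ideas on how I can get the header information programmatically?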
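To make the problem concrete, here is a minimal offline sketch. The rows in `toy.text` are made up for illustration and are not taken from crx.data; the point is only to show what `read.csv` does when the file carries no header row.

```r
# Made-up rows standing in for a headerless data file (not real crx.data rows).
toy.text <- c("b,30.83,0,+",
              "a,58.67,4.46,+")

# header = FALSE tells read.csv that the first row is data, not column names.
toy <- read.csv(text = toy.text, header = FALSE)

# R fills in placeholder names V1, V2, ... because the file carries none.
print(names(toy))
```

Once the real names are known, they can replace the placeholders with `names(toy) <- ...`.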

Answers

3

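I suspect you will have to use regular expressions to do this. Here is an ugly but general solution that should work for a variety of *.names files, assuming they are formatted similarly to the one you posted.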

names.file.url <-'http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names' 
names.file.lines <- readLines(names.file.url) 

# get run lengths of consecutive lines containing a colon. 
# then find the position of the subgrouping that has a run length 
# equal to the number of columns in credit and sum the run lengths up 
# to that value to get the index of the last line in the names block. 
end.of.names <- with(rle(grepl(':', names.file.lines)), 
         sum(lengths[1:match(ncol(credit), lengths)])) 

# extract those lines 
names.lines <- names.file.lines[(end.of.names - ncol(credit) + 1):end.of.names] 

# extract the names from those lines 
names <- regmatches(names.lines, regexpr('(\\w)+(?=:)', names.lines, perl=TRUE)) 

# [1] "A1"  "A2"  "A3"  "A4"  "A5"  "A6"  "A7"  "A8"  "A9"  "A10" "A11" 
# [12] "A12" "A13" "A14" "A15" "A16" 
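The run-length step is easier to follow on a small example. The sketch below applies the same logic to a made-up vector of lines; `toy.lines` and `n.cols` are hypothetical stand-ins for `names.file.lines` and `ncol(credit)`, and the content is not taken from crx.names.

```r
# Made-up lines standing in for a *.names file.
toy.lines <- c(
  "1. Title: Toy data",
  "",
  "7. Attribute Information:",
  "",
  "    A1: b, a.",
  "    A2: continuous.",
  "    A3: u, y, l, t.",
  "",
  "8. Missing values: none"
)
n.cols <- 3  # plays the role of ncol(credit)

# TRUE for every line that contains a colon.
has.colon <- grepl(":", toy.lines)

# Collapse consecutive equal values into runs (lengths and values).
runs <- rle(has.colon)

# Find the first run whose length equals the number of columns, then add up
# all run lengths through that run to get the last line of the names block.
end.of.names <- sum(runs$lengths[1:match(n.cols, runs$lengths)])

# Take the n.cols lines ending there and keep the word before each colon.
block <- toy.lines[(end.of.names - n.cols + 1):end.of.names]
toy.names <- regmatches(block, regexpr("\\w+(?=:)", block, perl = TRUE))
print(toy.names)
```

Note that `match` looks only at run lengths, so a run of lines without colons that happens to have the same length would also match. Checking `runs$values` as well would be one way to guard against that.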

The third line looks like magic to me. Thanks.


@Thomas, sorry about that. I usually try to avoid writing things that look like magic. I am short on time at the moment.

1

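I guess the Attribute Information section must hold the names in the particular file you specified. This is a very, very dirty solution. I am relying on the fact that there is a pattern: each name is followed by a `:`. So we use scan with `sep=":"` to split the text into character strings, and then pick the names out of the resulting vector: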

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
credit <- read.csv(url, header = FALSE)
url.names <- "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names"
# split the whole names file on ":" into one character vector
mess <- scan(url.names, what = "character", sep = ":")
# your names are located from 31 to 61, every second place in the vector
mess.names <- mess[seq(31, 61, 2)]
names(credit) <- mess.names
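The fixed positions 31 to 61 depend on the exact layout of crx.names. To show why every second element is a name, here is a small offline sketch on made-up text. `toy.text` is hypothetical, and `strip.white`, `quote` and `quiet` are arguments I added for the illustration; the answer's code does not use them.

```r
# Made-up attribute lines standing in for part of a *.names file.
toy.text <- c("A1: b, a.",
              "A2: continuous.",
              "A3: u, y, l, t.")

# Split on ":" so names and descriptions alternate in one character vector.
# strip.white trims the space after each colon; quote = "" disables quote
# handling so apostrophes in free text are read literally.
mess <- scan(text = toy.text, what = "character", sep = ":",
             strip.white = TRUE, quote = "", quiet = TRUE)

# Names sit at the odd positions: 1, 3, 5.
toy.names <- mess[seq(1, length(mess), by = 2)]
print(toy.names)
```

On a real *.names file the vector also contains the surrounding free text, which is why the answer has to offset the sequence to start at position 31.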

Actually, in this case the variable names are 'A1', ..., 'A16'. But I see what you mean. Thanks.


@Thomas OK, that makes sense. I have corrected my answer, but it is not as general as you wanted.
