2016-03-15
0

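Reading an oddly formatted CSV file

I am looking to download some data from the statistics.gov.scot website. For example, I would like to source some data on the hospital admissions rate. The query for the data table I'm interested in takes this format: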

http://statistics.gov.scot/slice/observations.csv?&dataset=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions&http%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23measureType=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fmeasure-properties%2Fratio&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fage=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fage%2Fall&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fgender=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fgender%2Fall 

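and it can be accessed via this link by anyone who wants to try it. The query generates a *.CSV file containing the relevant information, but the file's format poses some challenges.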

Example file

The file contents look like this:

Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00 
http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions 
measure type,"" 
Admission Type,"" 
Age,"" 
Gender,"" 
Measure (cell values): ,"Ratio (Rate Per 100,000 Population)" 

,,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003,http://reference.data.gov.uk/id/year/2004,http://reference.data.gov.uk/id/year/2005,http://reference.data.gov.uk/id/year/2006,http://reference.data.gov.uk/id/year/2007,http://reference.data.gov.uk/id/year/2008,http://reference.data.gov.uk/id/year/2009,http://reference.data.gov.uk/id/year/2010,http://reference.data.gov.uk/id/year/2011,http://reference.data.gov.uk/id/year/2012 
http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012 
http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262","9,261","9,347","9,723","10,517","10,293","10,150","10,024","10,232","10,194" 

When imported into Excel:

Excel import

However, when imported into R via read.csv, it looks like this:

> head(problematicFile) 
                V1      V2 
1    Generated by http://statistics.gov.scot 2016-03-15T10:36:29+00:00 
2 http://statistics.gov.scot/data/hospital-admissions  Hospital Admissions 
3          measure type       
4          Admission Type       
5             Age       
6            Gender 

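Problem

The read.csv import returns only two columns. My guess is that the problem has to do with some of the initial columns being empty. I want to read this file in a way that resembles the import shown in Excel. The point is that I intend to use the rows in columns A and B and, naturally, the data table below them. As for the resulting data.frame, I am happy for it to contain NA values where cells are empty, as long as its dimensions match those in Excel. I tried: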

read.csv(file = link, header = FALSE, na.strings = "", 
           fill = TRUE) 

but I keep arriving at the same problem.

Desired results

The desired result should look like this (extract produced by hand):

Generated by http://statistics.gov.scot 2016-03-15T10:41:28+00:00 NA NA NA NA NA NA NA 
http://statistics.gov.scot/data/hospital-admissions Hospital Admissions NA NA NA NA NA NA NA 
measure type NA NA NA NA NA NA NA NA 
Admission Type NA NA NA NA NA NA NA NA 
Age NA NA NA NA NA NA NA NA 
Gender NA NA NA NA NA NA NA NA 
Measure (cell values): Ratio (Rate Per 100,000 Population)   NA NA NA NA NA 
NA NA NA NA NA NA NA NA NA 
NA NA http://reference.data.gov.uk/id/year/2002 http://reference.data.gov.uk/id/year/2003 http://reference.data.gov.uk/id/year/2004 http://reference.data.gov.uk/id/year/2005 http://reference.data.gov.uk/id/year/2006 http://reference.data.gov.uk/id/year/2007 http://reference.data.gov.uk/id/year/2008 
http://purl.org/linked-data/sdmx/2009/dimension#refArea Reference Area 2002 2003 2004 2005 2006 2007 2008 
http://statistics.gov.scot/id/statistical-geography/S92000003 Scotland 9,351 9,262 9,261 9,347 9,723 10,517 10,293 
http://statistics.gov.scot/id/statistical-geography/S16000082 Angus South 8,236 8,500 8,523 8,371 8,616 8,978 9,325 
http://statistics.gov.scot/id/statistical-geography/S16000106 Edinburgh Northern and Leith 9,040 8,040 7,925 9,042 10,355 11,833 8,916 
http://statistics.gov.scot/id/statistical-geography/S16000140 Renfrewshire South 9,391 9,122 9,491 9,586 10,425 10,900 11,065 
http://statistics.gov.scot/id/statistical-geography/S16000108 Edinburgh Southern 5,878 5,910 6,101 6,035 7,426 9,343 6,766 
http://statistics.gov.scot/id/statistical-geography/S16000075 Aberdeen Donside 10,047 10,963 10,629 10,512 10,383 10,787 10,685 
http://statistics.gov.scot/id/statistical-geography/S16000137 Perthshire North 9,388 9,524 7,799 9,350 9,543 9,791 9,991 
http://statistics.gov.scot/id/statistical-geography/S16000077 Aberdeenshire East 7,211 7,300 7,153 7,411 7,435 7,268 7,547 
http://statistics.gov.scot/id/statistical-geography/S16000114 Galloway and West Dumfries 9,861 9,165 8,143 9,258 7,508 10,213 10,399 
http://statistics.gov.scot/id/statistical-geography/S16000096 Dumbarton 8,703 8,570 8,727 9,310 9,389 9,885 10,237 

Screenshot

Just to illustrate the point further, I want to keep the dimensions and fill the missing values with NA:

Excel with NAs

Answers

2

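Parsing the metadata from the header is a bit tricky. You might prefer to download the whole normalised dataset rather than that cross-tabulated slice.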

> reconv <- read.csv("http://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions") 

> head(reconv) 

    GeographyCode DateCode Measurement        Units Value Gender Age 
1  S92000003  2003  Mean Average reconvictions per offender 0.62 All All 
2  S92000003  2004  Mean Average reconvictions per offender 0.33 All All 
3  S92000003  2004  Mean Average reconvictions per offender 0.61 All All 
4  S92000003  2005  Mean Average reconvictions per offender 0.60 All All 
5  S92000003  2006  Mean Average reconvictions per offender 0.60 All All 
6  S92000003  2007  Mean Average reconvictions per offender 0.11 All All 

This puts all of the metadata into factor levels (so you don't have to parse it):

> str(reconv) 

'data.frame': 10119 obs. of 7 variables: 
$ GeographyCode: Factor w/ 26 levels "S12000005","S12000006",..: 26 26 26 26 26 26 26 26 26 26 ... 
$ DateCode  : int 2003 2004 2004 2005 2006 2007 2007 2008 2008 2009 ... 
$ Measurement : Factor w/ 2 levels "Mean","Ratio": 1 1 1 1 1 1 1 1 1 1 ... 
$ Units  : Factor w/ 2 levels "Average reconvictions per offender",..: 1 1 1 1 1 1 1 1 1 1 ... 
$ Value  : num 0.62 0.33 0.61 0.6 0.6 0.11 0.57 0.6 0.33 0.33 ... 
$ Gender  : Factor w/ 3 levels "All","Female",..: 1 1 1 1 1 1 1 1 1 1 ... 
$ Age   : Factor w/ 6 levels "21-25","26-30",..: 4 4 4 4 4 4 4 4 4 4 ... 
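
One caveat about the str() output above: from R 4.0.0 onwards, read.csv no longer converts strings to factors by default, so the same columns show up as chr unless stringsAsFactors = TRUE is passed. The sketch below lists the distinct labels per column in a way that works in either case. The toy data frame and its values are made up for illustration and only borrow a few column names from the output above.

```r
# Toy stand-in for the downloaded table; every value here is made up.
toy <- data.frame(
  Measurement = c("Mean", "Ratio", "Ratio"),
  Gender = c("All", "All", "Female"),
  Value = c(0.6, 30.1, 25.4),
  stringsAsFactors = FALSE
)

# Pick the non-numeric columns, whether they arrived as character or factor.
is_label <- vapply(toy, function(x) is.character(x) || is.factor(x), logical(1))

# Distinct labels per column: the same information str() reports as factor levels.
labels <- lapply(toy[is_label], function(x) sort(unique(as.character(x))))
labels
```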

You can select the slice you are interested in:

> slice <- subset(reconv, Measurement=="Ratio" & Gender=="All" & Age=="All") 

and cast it back to the original cross-tabulated slice, if you want:

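> library(reshape2) 
> # first() is not in base R or reshape2 (dplyr and data.table each provide one), so take the first value explicitly 
> dcast(slice, GeographyCode ~ DateCode, value.var="Value", fun.aggregate = function(x) x[1]) 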

    GeographyCode 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 
1  S12000005 41.4 34.3 41.0 40.7 37.4 37.2 33.3 34.6 35.8 33.0 32.8 
2  S12000006 34.9 36.0 31.9 34.2 31.1 28.7 27.9 29.6 27.5 26.8 27.0 
3  S12000008 33.7 33.2 33.7 33.2 31.7 32.8 30.4 31.5 29.1 28.1 28.7 
4  S12000010 26.7 24.5 25.7 26.9 26.7 27.8 29.3 25.1 22.4 29.0 28.2 
5  S12000013 31.7 26.1 30.6 35.4 31.6 25.9 24.0 18.9 30.5 22.8 18.6 
... 
1

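You need to specify col.names manually to force read.csv to read more than two columns. Also specifying na.strings as the empty string will put NA values in the empty columns.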

read.csv(<parameters>, col.names=c("Col1","Col2".....), na.strings="") 
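
For context on why this works: the read.table documentation states that the number of columns is determined from the first five lines of input, or from the length of col.names when that is longer. The sketch below reproduces the effect on a small inline sample. The sample lines are made up to mimic the layout of the file and are not the real download.

```r
# Made-up sample mimicking the file layout: five short metadata lines
# followed by a wider data table (values are illustrative only).
sample_lines <- c(
  'Generated by example,2016-03-15',
  'dataset,Hospital Admissions',
  'measure type,""',
  'Age,""',
  'Gender,""',
  ',,2002,2003',
  'S92000003,Scotland,"9,351","9,262"'
)
txt <- paste(sample_lines, collapse = "\n")

# Without col.names, the width is inferred from the first five lines only.
narrow <- read.csv(text = txt, header = FALSE, na.strings = "", fill = TRUE)

# With four column names supplied, the table is read four columns wide
# and the shorter metadata rows are padded.
wide <- read.csv(text = txt, header = FALSE, na.strings = "", fill = TRUE,
                 col.names = paste0("V", 1:4))

ncol(narrow)  # 2
ncol(wide)    # 4
```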
+0

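Thanks for your interest, but I am trying to avoid that. I need this information because it contains the indicator name and some other data that I will use. If I skipped it, I would have to read the file **twice**: once to get the first 9 rows with the relevant metadata, and then again to get the actual data. I want to avoid that. I would like to have one big table with NAs placed in the empty columns and then reference the values I need, **including** what is in the first columns. – Konrad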

+0

@Konrad See if this change helps –

+0

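'more columns than column names'. The problem is that I won't know the size of the file before importing it. Another approach could be to read it via 'readLines' and then derive the table from the lines holding the data, and the other values from the first few lines. Ideally, I would rather have one table with NAs, so I could do 'indicatorName <- x[7,2]' or pick out anything else I might need. – Konrad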

0

You can specify the number of columns by using read.table and supplying the column names:

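read.table(file = link, 
      fill = TRUE, 
      sep = ",", 
      quote = "\"",       # only double quotes delimit fields, as in read.csv 
      comment.char = "",  # read.table treats "#" as a comment by default; the URIs here contain "#" 
      na.strings = "", 
      col.names = paste("c", 1:12, sep = "")) 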

However, I am not sure this is a good solution, because you need to know the number of columns a priori.
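
The column count does not have to be known in advance: base R's count.fields() reports the number of fields on each line, and the maximum can be used to build col.names. Below is a minimal sketch of that idea. The helper name read_ragged_csv is hypothetical, and the sketch assumes the file is comma-separated with double-quoted fields, as in the sample shown in the question.

```r
# Hypothetical helper (not from the answers above): read a ragged CSV into
# one data frame that is as wide as its widest line.
read_ragged_csv <- function(path) {
  lines <- readLines(path, warn = FALSE)  # fetch the file (or URL) only once

  # Count the fields on every line; "#" must not start a comment because
  # the URIs in these files contain it.
  con <- textConnection(lines)
  on.exit(close(con))
  n_fields <- count.fields(con, sep = ",", quote = "\"", comment.char = "")
  n_cols <- max(n_fields, na.rm = TRUE)

  # Supplying that many column names makes read.csv keep every column;
  # shorter rows are padded and empty cells become NA.
  read.csv(text = lines, header = FALSE, na.strings = "", fill = TRUE,
           blank.lines.skip = FALSE, stringsAsFactors = FALSE,
           col.names = paste0("V", seq_len(n_cols)))
}
```

With the result stored in x, the lookup mentioned in the comments, x[7, 2], picks the indicator name from the seventh row, assuming the metadata block is laid out as in the sample file. Everything is read as text, so the quoted figures such as "9,351" still need converting to numbers afterwards.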

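Another approach is to read the whole CSV in as strings. You can then preprocess it by storing the header in another object (e.g. a list) and use the "table part" as a data frame.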
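
A minimal sketch of this second approach follows. It assumes, based only on the sample in the question, that the metadata block ends at the first blank line and that the table part starts with a row of year URIs followed by the real header row. The helper name split_slice is hypothetical.

```r
# Hypothetical helper (an assumption, not code from the answers): split the
# lines of a slice download into a metadata list and a data frame.
split_slice <- function(lines) {
  blank <- which(trimws(lines) == "")[1]  # first blank line ends the metadata
  stopifnot(!is.na(blank))
  meta_lines <- lines[seq_len(blank - 1)]
  table_lines <- lines[-seq_len(blank)]

  # Metadata: two fields per line, read as key/value pairs.
  meta <- read.csv(text = meta_lines, header = FALSE, stringsAsFactors = FALSE,
                   col.names = c("key", "value"))

  # Table: drop the row of year URIs and use the next row as the header.
  tab <- read.csv(text = table_lines[-1], header = TRUE, check.names = FALSE,
                  stringsAsFactors = FALSE)

  # Values such as "9,351" carry thousands separators; strip them and convert.
  value_cols <- -(1:2)
  tab[value_cols] <- lapply(tab[value_cols],
                            function(x) as.numeric(gsub(",", "", x)))

  list(meta = setNames(as.list(meta$value), trimws(meta$key)), data = tab)
}
```

Called as split_slice(readLines(link)), this reads the file only once and returns both parts, so a loop over several files can keep the metadata and the data side by side.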

+0

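Thanks, that's a start. I was hoping to read everything in one go somehow, since I could then move around the 'data.frame' and pick out what I want. I have a list of these files in a loop, so I could break each one down further into two objects, one of them the header, but I thought that could be avoided. – Konrad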