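Reading a strangely formatted CSV file
I am looking to download some data from the statistics.gov.scot website. For example, I would like to source some data on hospital admission rates. The query for the source data table I am interested in has the following format: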
http://statistics.gov.scot/slice/observations.csv?&dataset=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions&http%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23measureType=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fmeasure-properties%2Fratio&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fage=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fage%2Fall&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fgender=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fgender%2Fall
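The same query can be accessed via this link, for those who want to try it. It generates a *.CSV file with the relevant information, but the format of the file presents some challenges.
Example file
The file content looks like this: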
Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00
http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions
measure type,""
Admission Type,""
Age,""
Gender,""
Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"
,,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003,http://reference.data.gov.uk/id/year/2004,http://reference.data.gov.uk/id/year/2005,http://reference.data.gov.uk/id/year/2006,http://reference.data.gov.uk/id/year/2007,http://reference.data.gov.uk/id/year/2008,http://reference.data.gov.uk/id/year/2009,http://reference.data.gov.uk/id/year/2010,http://reference.data.gov.uk/id/year/2011,http://reference.data.gov.uk/id/year/2012
http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012
http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262","9,261","9,347","9,723","10,517","10,293","10,150","10,024","10,232","10,194"
When imported into Excel: [screenshot]
However, when imported into R via read.csv, it looks like this:
> head(problematicFile)
V1 V2
1 Generated by http://statistics.gov.scot 2016-03-15T10:36:29+00:00
2 http://statistics.gov.scot/data/hospital-admissions Hospital Admissions
3 measure type
4 Admission Type
5 Age
6 Gender
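Problem
The read.csv import returns only two columns. My guess is that the problem is related to some of the initial columns being empty. I want to read this file in a way similar to the Excel import shown above. The point is that I intend to use the rows in columns A and B and, naturally, the data table below them. As for the resulting data.frame, I would be happy for it to contain NA values where cells are empty, as long as its dimensions match those in Excel. I tried: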
read.csv(file = link, header = FALSE, na.strings = "",
fill = TRUE)
but I keep arriving at the same problem.
Desired results
The desired result should look like this (extract produced by hand):
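A likely explanation, going by the read.table documentation: the number of columns is determined from the first five lines of input, or from col.names if that is supplied and longer. Here the first five lines all have two fields. The sketch below is one possible workaround, not a definitive fix: it counts the fields on every line with count.fields and passes that many column names to read.csv. The shortened sample lines, the temporary file and the names `csv_text`, `path` and `n_cols` are assumptions made for illustration.

```r
# Shortened stand-in for the downloaded file (hypothetical local copy)
csv_text <- c(
  'Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00',
  'http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions',
  'measure type,""',
  'Admission Type,""',
  'Age,""',
  'Gender,""',
  'Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"',
  ',,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003',
  'http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003',
  'http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262"'
)
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# Number of fields on each line; take the widest one.
# comment.char = "" is needed because the default "#" would cut the
# '...dimension#refArea' line short.
n_cols <- max(count.fields(path, sep = ",", quote = "\"", comment.char = ""),
              na.rm = TRUE)

# Supplying n_cols column names makes read.csv allocate that many columns;
# fill = TRUE pads the short metadata rows and na.strings = "" turns the
# padding into NA.
x <- read.csv(path, header = FALSE, col.names = paste0("V", seq_len(n_cols)),
              na.strings = "", fill = TRUE, stringsAsFactors = FALSE)

dim(x)
indicatorName <- x[7, 2]
```

This still scans the file once before importing it, so for a remote query it is probably easiest to download the file to disk first and run both steps on the local copy.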
Generated by http://statistics.gov.scot 2016-03-15T10:41:28+00:00 NA NA NA NA NA NA NA
http://statistics.gov.scot/data/hospital-admissions Hospital Admissions NA NA NA NA NA NA NA
measure type NA NA NA NA NA NA NA NA
Admission Type NA NA NA NA NA NA NA NA
Age NA NA NA NA NA NA NA NA
Gender NA NA NA NA NA NA NA NA
Measure (cell values): Ratio (Rate Per 100,000 Population) NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA http://reference.data.gov.uk/id/year/2002 http://reference.data.gov.uk/id/year/2003 http://reference.data.gov.uk/id/year/2004 http://reference.data.gov.uk/id/year/2005 http://reference.data.gov.uk/id/year/2006 http://reference.data.gov.uk/id/year/2007 http://reference.data.gov.uk/id/year/2008
http://purl.org/linked-data/sdmx/2009/dimension#refArea Reference Area 2002 2003 2004 2005 2006 2007 2008
http://statistics.gov.scot/id/statistical-geography/S92000003 Scotland 9,351 9,262 9,261 9,347 9,723 10,517 10,293
http://statistics.gov.scot/id/statistical-geography/S16000082 Angus South 8,236 8,500 8,523 8,371 8,616 8,978 9,325
http://statistics.gov.scot/id/statistical-geography/S16000106 Edinburgh Northern and Leith 9,040 8,040 7,925 9,042 10,355 11,833 8,916
http://statistics.gov.scot/id/statistical-geography/S16000140 Renfrewshire South 9,391 9,122 9,491 9,586 10,425 10,900 11,065
http://statistics.gov.scot/id/statistical-geography/S16000108 Edinburgh Southern 5,878 5,910 6,101 6,035 7,426 9,343 6,766
http://statistics.gov.scot/id/statistical-geography/S16000075 Aberdeen Donside 10,047 10,963 10,629 10,512 10,383 10,787 10,685
http://statistics.gov.scot/id/statistical-geography/S16000137 Perthshire North 9,388 9,524 7,799 9,350 9,543 9,791 9,991
http://statistics.gov.scot/id/statistical-geography/S16000077 Aberdeenshire East 7,211 7,300 7,153 7,411 7,435 7,268 7,547
http://statistics.gov.scot/id/statistical-geography/S16000114 Galloway and West Dumfries 9,861 9,165 8,143 9,258 7,508 10,213 10,399
http://statistics.gov.scot/id/statistical-geography/S16000096 Dumbarton 8,703 8,570 8,727 9,310 9,389 9,885 10,237
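Screenshot
Just to illustrate the point further, I would like to keep the dimensions and fill the missing values with NA: [screenshot]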
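Thanks for your interest, but I am trying to avoid that. I need this information because it contains the indicator name and some other data that I am going to use. If I skip those rows, I will have to read the file **twice**: once to get the first 9 rows with the relevant metadata, and then again to get the actual data. I would like to avoid that and have one big table, with NAs placed in the empty columns, from which I can reference the values I need, **including** the content of the first columns. – Konrad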
@Konrad See whether this change helps –
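The problem with the 'more columns than column names' error is that I won't know the size of the file before importing it. Another approach could be to read it via 'readLines' and then derive the table from the rows with data, and the other values from the first few lines. Ideally, I would prefer to have one table with NAs, so I can do: 'indicatorName <- x[7, 2]' or pick anything else I may need from it. – Konrad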
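The readLines route mentioned in the last comment could look roughly like the sketch below. It reads the lines once and then parses the metadata and the table from the copy held in memory. It assumes that the table header is the first line containing 'dimension#refArea' and that the line just before it holds only the year URLs; the shortened sample and the names `raw_lines`, `header_at`, `meta` and `dat` are hypothetical.

```r
# Shortened stand-in for the file; in practice: raw_lines <- readLines(path)
raw_lines <- c(
  'Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00',
  'http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions',
  'measure type,""',
  'Admission Type,""',
  'Age,""',
  'Gender,""',
  'Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"',
  ',,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003',
  'http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003',
  'http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262"'
)

# Assumption: the table header is the first line mentioning the refArea dimension
header_at <- grep("dimension#refArea", raw_lines, fixed = TRUE)[1]

# Metadata: everything above the year-URL line that precedes the header
meta <- read.csv(text = raw_lines[seq_len(header_at - 2)], header = FALSE,
                 na.strings = "", stringsAsFactors = FALSE)

# Data table: header line onwards; check.names = FALSE keeps "2002" as a name
dat <- read.csv(text = raw_lines[header_at:length(raw_lines)], header = TRUE,
                check.names = FALSE, stringsAsFactors = FALSE)

indicatorName <- meta[7, 2]

# The rates arrive as text with thousands separators, e.g. "9,351"
rate_2002 <- as.numeric(gsub(",", "", dat[["2002"]], fixed = TRUE))
```

This gives two objects rather than the single NA-padded table asked for, but the file is still read only once.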