2017-06-10
0

Importing a .tsv file

I can't import a .tsv data file into R. The file comes from Eurostat and is publicly available: http://ec.europa.eu/eurostat/en/web/products-datasets/-/MIGR_IMM10CTB

I use the following code to import it:

immig <- read.table(file="immig.tsv", sep="\t", header=TRUE) 

However, the code doesn't seem to work. I don't get an error message, but the output looks like this:

> immig[1:3, 1:3] 
    age.agedef.c_birth.unit.sex.geo.time X2015 X2014 
1 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,AT 4723 4093 
2 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BE 1017 953 
3 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BG 559 577 

What am I doing wrong? I tried using sep="," instead, but that seemed to fix some problems while creating others.

+1

The table you're pointing to isn't a tab-separated-values text file... – ssp3nc3r

+0

Download the file in a usable format by browsing to: http://appsso.eurostat.ec.europa.eu/nui/setupDownloads.do – ssp3nc3r

+0

The page says "Session invalid!" – neutral

Answers

1

Is the problem that you're missing the 2013 data?

I downloaded the file from the link, unzipped it with a command-line tool, and then it imported just fine with the readr package:

library(readr) 

immigration <- read_tsv("~/Downloads/migr_imm10ctb.tsv", na = ":") 
#> Parsed with column specification: 
#> cols(
#> `age,agedef,c_birth,unit,sex,geo\\time` = col_character(), 
#> `2015` = col_character(), 
#> `2014` = col_character(), 
#> `2013` = col_character() 
#>) 

immigration 
#> # A tibble: 45,558 x 4 
#> `age,agedef,c_birth,unit,sex,geo\\time` `2015` `2014` `2013` 
#>          <chr> <chr> <chr> <chr> 
#> 1 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,AT 4723 4093 4085 
#> 2 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BE 1017 953 1035 
#> 3 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BG 559 577 743 p 
#> 4 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CH 2876 2766 2758 
#> 5 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CY <NA> <NA>  54 
#> 6 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CZ 120 106 155 
#> 7 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,DE <NA> <NA> 14984 
#> 8 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,DK 372 365 405 
#> 9 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,EE  23  7  16 
#> 10 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,EL <NA> <NA> 234 
#> # ... with 45,548 more rows 

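It looks like there are some stray characters floating around (743 p) where there should only be numbers, so you'll need to do a bit more cleaning before converting the columns to numeric.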

library(dplyr) 
library(stringr) 

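# Keep only the first run of digits in each year column (drops flags such as "p"),
# then convert the cleaned strings to numeric.
immigration %>% 
    mutate_at(vars(`2015`:`2013`), str_extract, pattern = "[0-9]+") %>% 
    mutate_at(vars(`2015`:`2013`), as.numeric) 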
#> # A tibble: 45,558 x 4 
#> `age,agedef,c_birth,unit,sex,geo\\time` `2015` `2014` `2013` 
#>          <chr> <dbl> <dbl> <dbl> 
#> 1 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,AT 4723 4093 4085 
#> 2 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BE 1017 953 1035 
#> 3 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BG 559 577 743 
#> 4 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CH 2876 2766 2758 
#> 5 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CY  NA  NA  54 
#> 6 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CZ 120 106 155 
#> 7 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,DE  NA  NA 14984 
#> 8 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,DK 372 365 405 
#> 9 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,EE  23  7  16 
#> 10 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,EL  NA  NA 234 
#> # ... with 45,548 more rows 
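
If the trailing letters carry meaning (they look like status flags; reading `p` as "provisional" is an assumption on my part, so check the dataset's metadata), it may be preferable to keep them in a separate column rather than discard them. A minimal base-R sketch on made-up values shaped like the output above; `split_flag` is a hypothetical helper, not part of any package:

```r
# Hypothetical helper: split strings such as "743 p" into a number and a flag.
split_flag <- function(x) {
  x <- trimws(x)
  x[!is.na(x) & x == ":"] <- NA                 # ":" marks a missing value in this file
  value <- as.numeric(sub("^([0-9]+).*$", "\\1", x))
  flag  <- trimws(sub("^[0-9]*", "", x))        # whatever follows the leading digits
  flag[is.na(x) | flag == ""] <- NA             # no flag present
  data.frame(value = value, flag = flag, stringsAsFactors = FALSE)
}

res <- split_flag(c("4085", "743 p", ":", " 54"))
print(res)
```

The same idea could be applied column by column to `2015`, `2014` and `2013`; whether the flags matter depends on what you plan to do with the numbers.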

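This is a tab-delimited file, but the first column has several fields lumped together with commas, so if you want that information separated out, you can do it with tidyr::separate():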

library(tidyr) 

immigration %>% 
    separate(`age,agedef,c_birth,unit,sex,geo\\time`, 
      c("age", "agedef", "c_birth", "unit", "sex", "geo"), 
      sep = ",") 
#> # A tibble: 45,558 x 9 
#>  age agedef   c_birth unit sex geo `2015` `2014` `2013` 
#> * <chr> <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> 
#> 1 TOTAL COMPLET CC5_13_FOR_X_IS NR  F AT 4723 4093 4085 
#> 2 TOTAL COMPLET CC5_13_FOR_X_IS NR  F BE 1017 953 1035 
#> 3 TOTAL COMPLET CC5_13_FOR_X_IS NR  F BG 559 577 743 p 
#> 4 TOTAL COMPLET CC5_13_FOR_X_IS NR  F CH 2876 2766 2758 
#> 5 TOTAL COMPLET CC5_13_FOR_X_IS NR  F CY <NA> <NA>  54 
#> 6 TOTAL COMPLET CC5_13_FOR_X_IS NR  F CZ 120 106 155 
#> 7 TOTAL COMPLET CC5_13_FOR_X_IS NR  F DE <NA> <NA> 14984 
#> 8 TOTAL COMPLET CC5_13_FOR_X_IS NR  F DK 372 365 405 
#> 9 TOTAL COMPLET CC5_13_FOR_X_IS NR  F EE  23  7  16 
#> 10 TOTAL COMPLET CC5_13_FOR_X_IS NR  F EL <NA> <NA> 234 
#> # ... with 45,548 more rows 
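
For anyone who would rather stay with base R, as in the question, the same two-step idea (tabs between the year columns, commas inside the first column) can be sketched roughly as below. The sample rows are copied from the output above and the header line is assumed to match the real file; `na.strings = ":"`, `strip.white = TRUE` and `check.names = FALSE` are my own choices here, not something the answer above used.

```r
# Inline sample in the assumed layout of the file: tabs between columns,
# commas inside the first column.
txt <- paste(
  "age,agedef,c_birth,unit,sex,geo\\time\t2015\t2014\t2013",
  "TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,AT\t4723\t4093\t4085",
  "TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BE\t1017\t953\t1035",
  "TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CY\t:\t:\t54",
  sep = "\n"
)

# Step 1: split on tabs only; ":" is read as NA, column names are kept as-is.
immig <- read.table(text = txt, sep = "\t", header = TRUE,
                    na.strings = ":", strip.white = TRUE,
                    check.names = FALSE, stringsAsFactors = FALSE)

# Step 2: split the comma-joined first column into six key columns.
keys <- do.call(rbind, strsplit(immig[[1]], ",", fixed = TRUE))
colnames(keys) <- c("age", "agedef", "c_birth", "unit", "sex", "geo")

immig <- cbind(as.data.frame(keys, stringsAsFactors = FALSE), immig[-1])
print(immig)
```

This also shows why the original `read.table(..., sep = "\t")` call ran without error: it split the year columns correctly and simply left the comma-joined key column as one string.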
0

Something like this might be a starting point:

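library(tidyr)  # provides separate() and re-exports %>%

link <- "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/migr_imm10ctb.tsv.gz" 

# read_csv() splits the comma-joined key columns; the last column still holds
# geo plus the tab-joined year values, so separate() splits it on tabs.
data <- readr::read_csv(link) %>% 
     separate("geo\\time\t2015 \t2014 \t2013", into = c("geo", "2015", "2014", "2013"), sep = "\t")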
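
If the download link is unavailable, the compressed file can also be saved manually and read without unzipping it first, since base R can decompress gzip on the fly. A small offline sketch, assuming a saved `.tsv.gz`; the tiny sample file and its `key` header are invented here purely so the example runs on its own:

```r
# Write a tiny gzip-compressed sample so the sketch runs offline;
# with the real download, `path` would point at the saved .tsv.gz instead.
path <- tempfile(fileext = ".tsv.gz")
con <- gzfile(path, "w")
writeLines(c("key\t2015\t2014",
             "F,AT\t4723\t4093",
             "F,CY\t:\t:"), con)
close(con)

# read.table() accepts a connection, so gzfile() handles the decompression.
sample_gz <- read.table(gzfile(path), sep = "\t", header = TRUE,
                        na.strings = ":", check.names = FALSE,
                        stringsAsFactors = FALSE)
print(sample_gz)
unlink(path)
```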