2017-03-16
0

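How to split a data frame column with no defined delimiter

How can I split the column 'seriesID' into multiple columns so that it looks like the second table below? Basically, I need to split each string into substrings of lengths (3, 3, 6, 1, 1, 3).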

           seriesID
1 ISU111aaaaaa33001
2 ISU222bbbbbb33001
3 ISU000cccccc63001
4 ISU333dddddd63001


           seriesID pre supp    ind data case area
1 ISU111aaaaaa33001 ISU  111 aaaaaa    3    3  001
2 ISU222bbbbbb33001 ISU  222 bbbbbb    3    3  001
3 ISU000cccccc63001 ISU  000 cccccc    6    3  001
4 ISU333dddddd63001 ISU  333 dddddd    6    3  001

Thanks!

Answers

1
seriesID <- c('ISU00000000033001',
              'ISU00000000033001',
              'ISU00000000063001',
              'ISU00000000063001')

# Extract each fixed-width field by its start and end position
df <- data.frame(pre  = substr(seriesID, 1, 3),
                 supp = substr(seriesID, 4, 6),
                 ind  = substr(seriesID, 7, 12),
                 data = substr(seriesID, 13, 13),
                 case = substr(seriesID, 14, 14),
                 area = substr(seriesID, 15, 17))

df

  pre supp    ind data case area
1 ISU  000 000000    3    3  001
2 ISU  000 000000    3    3  001
3 ISU  000 000000    6    3  001
4 ISU  000 000000    6    3  001
1

You can use readr to "re-read" your data as a fixed-width file. For example:

library(readr)

series <- c("ISU00000000033001", "ISU00000000033001", "ISU00000000063001", "ISU00000000063001")

# Join the strings with newlines so read_fwf() treats them as literal data
read_fwf(paste(series, collapse = "\n"), fwf_widths(c(3, 3, 6, 1, 1, 3)))
# A tibble: 4 × 6
#   X1    X2    X3        X4    X5 X6
#   <chr> <chr> <chr>  <int> <int> <chr>
# 1 ISU   000   000000     3     3 001
# 2 ISU   000   000000     3     3 001
# 3 ISU   000   000000     6     3 001
# 4 ISU   000   000000     6     3 001

Note that this collapses the character vector into a single newline-separated string, which may be inefficient for large vectors.

+0

Isn't this a duplicate? – zx8754

0

It sounds like you should really handle this when you read the data in, using read.fwf(): https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.fwf.html
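As a rough illustration of the read.fwf() route, the sketch below feeds the strings from the question through a text connection instead of a file. The in-memory vector, the column names, and the choice of colClasses = "character" (to keep leading zeros such as "001") are assumptions made for this example; with a real file you would pass its path as the first argument instead.

```r
# Sample strings from the question (assumed to be in memory, not in a file)
seriesID <- c("ISU111aaaaaa33001", "ISU222bbbbbb33001",
              "ISU000cccccc63001", "ISU333dddddd63001")

# read.fwf() accepts a connection, so wrap the vector in a text connection
con <- textConnection(seriesID)
parts <- read.fwf(con,
                  widths = c(3, 3, 6, 1, 1, 3),
                  col.names = c("pre", "supp", "ind", "data", "case", "area"),
                  colClasses = "character")  # keep "000" and "001" as text
close(con)

# Put the original string back next to the extracted fields
result <- cbind(seriesID = seriesID, parts, stringsAsFactors = FALSE)
print(result)
```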

But to answer the question as asked, just use substr():

seriesID <- c('ISU00000000033001', 'ISU00000000033001', 'ISU00000000063001', 'ISU00000000063001') 

df <- data.frame(seriesID = seriesID, 
    pre = substr(seriesID, 1, 3), 
    supp = substr(seriesID, 4, 6), 
    ind = substr(seriesID, 7, 12), 
    data = substr(seriesID, 13, 13), 
    case = substr(seriesID, 14, 14), 
    area = substr(seriesID, 15, 17)) 

print(df) 
#            seriesID pre supp    ind data case area
# 1 ISU00000000033001 ISU  000 000000    3    3  001
# 2 ISU00000000033001 ISU  000 000000    3    3  001
# 3 ISU00000000063001 ISU  000 000000    6    3  001
# 4 ISU00000000063001 ISU  000 000000    6    3  001
2

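You can also use substr, computing the start and end positions from the widths:

# df is assumed to be a data frame with a seriesID column, as in the question
widths <- c(3, 3, 6, 1, 1, 3)
end <- cumsum(widths)              # last character of each field
start <- c(1, head(end, -1) + 1)   # first character of each field

as.data.frame(mapply(substr, start, end, MoreArgs = list(x = df$seriesID)))

#    V1  V2     V3 V4 V5  V6
# 1 ISU 000 000000  3  3 001
# 2 ISU 000 000000  3  3 001
# 3 ISU 000 000000  6  3 001
# 4 ISU 000 000000  6  3 001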
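The result above comes back with generic names (V1 to V6). One way to get named columns is to wrap the same cumsum-based idea in a small function. This is only a sketch: split_fixed and its arguments are hypothetical names introduced here, not part of any package.

```r
# Hypothetical helper: split each string in x into fixed-width fields
# and return them as a data frame with the given column names.
split_fixed <- function(x, widths, col_names) {
  stopifnot(length(widths) == length(col_names))
  end <- cumsum(widths)        # last character of each field
  start <- end - widths + 1    # first character of each field
  parts <- lapply(seq_along(widths),
                  function(i) substr(x, start[i], end[i]))
  names(parts) <- col_names
  as.data.frame(parts, stringsAsFactors = FALSE)
}

seriesID <- c("ISU111aaaaaa33001", "ISU222bbbbbb33001",
              "ISU000cccccc63001", "ISU333dddddd63001")

res <- split_fixed(seriesID,
                   widths = c(3, 3, 6, 1, 1, 3),
                   col_names = c("pre", "supp", "ind", "data", "case", "area"))
print(res)
```

All columns stay as character here, so values such as "000" and "001" keep their leading zeros; convert individual columns with as.integer() if numbers are needed.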
1

You can use separate from the tidyr package. When sep is a numeric vector, it is interpreted as the character positions to split at:

library(tidyr)

df <- data.frame(series = c("ISU00000000033001", "ISU00000000033001", "ISU00000000063001", "ISU00000000063001"), stringsAsFactors = FALSE)

df %>%
    separate(series,
             c("pre", "supp", "ind", "data", "case", "area"),
             sep = cumsum(c(3, 3, 6, 1, 1)))

  pre supp    ind data case area
1 ISU  000 000000    3    3  001
2 ISU  000 000000    3    3  001
3 ISU  000 000000    6    3  001
4 ISU  000 000000    6    3  001
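If tidyr is not available, a similar one-step split into named columns can be sketched in base R with utils::strcapture() (available from R 3.4.0), using a regular expression with one capture group per field. The pattern and the prototype data frame below are illustrative assumptions written for the widths (3, 3, 6, 1, 1, 3) from the question.

```r
seriesID <- c("ISU111aaaaaa33001", "ISU222bbbbbb33001",
              "ISU000cccccc63001", "ISU333dddddd63001")

# One capture group per field, with widths 3, 3, 6, 1, 1, 3
pattern <- "^(.{3})(.{3})(.{6})(.)(.)(.{3})$"

# The prototype supplies the column names and types of the result
proto <- data.frame(pre = character(), supp = character(),
                    ind = character(), data = character(),
                    case = character(), area = character(),
                    stringsAsFactors = FALSE)

parts <- strcapture(pattern, seriesID, proto)
print(parts)
```

Strings that do not match the pattern (for example, ones that are not exactly 17 characters long) produce a row of NA values, which makes malformed IDs easy to spot.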