2017-01-12 91 views
2

Reading a CSV file with commas in a column

I have a CSV file with 6 columns, one of which contains commas within its values, e.g. BOLT,RD HD SQ SHORT NECK,METRIC.

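When I read this file in R, that column spills over into the following columns, and the remaining data is pushed onto a new row.

A few lines from the file are pasted below: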

014003051906,ETN5080,0450,BOLT KIT UPPER SHAFT WITH 5 SPEED,1.000,F
014003051906,ETN5967,0460,SENSOR SENSOR FH BACKSHAFT SPEED,1.000,F
014003051906,ETN64267,0470,TILT UNIT SENSOR,1.000,F

014003065376,03M7184,0020,BOLT - M 8.0 X 1.250 X 20.0 - 8.8-Zinc,4.000,G
014003065376,03M7386,0090,BOLT,RD HD SQ SHORT NECK,METRIC,18.000,G
014003065376,14M7296,0090,NUT,METRIC,HEX FLANGE,14.000,G

The last two lines are the problem. For example, "NUT,METRIC,HEX FLANGE" should belong to a single variable.

How can this be solved?

+4

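How are you getting this data? (Saved as CSV from Excel?) The best solution would be to ask for the data to be saved with quoted fields or with a different delimiter. – Benjamin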

+0

@Benjamin I thought of that. Unfortunately, this is the only source we have. – darkage

+0

You could use a regular expression. –

Answers

8
library(stringr)

# Sample rows from the question; the blank line is kept, so it shows up as a row of NAs.
data <- readLines(con = textConnection("014003051906,ETN5080 ,0450,BOLT KIT UPPER SHAFT WITH 5 SPEED,1.000,F
014003051906,ETN5967 ,0460,SENSOR SENSOR FH BACKSHAFT SPEED,1.000,F
014003051906,ETN64267 ,0470,TILT UNIT SENSOR,1.000,F

014003065376,03M7184 ,0020,BOLT - M 8.0 X 1.250 X 20.0 - 8.8-Zinc,4.000,G
014003065376,03M7386 ,0090,BOLT, RD HD SQ SHORT NECK, METRIC,18.000,G
014003065376,14M7296 ,0090,NUT, METRIC, HEX FLANGE,14.000,G"))

# Fields 1-3 and 5-6 contain no commas; field 4 (.*) absorbs any commas in between.
pattern <- "^([^,]*),([^,]*),([^,]*),(.*),([^,]*),([^,]*)$"

# Column 1 of str_match() is the full match, so drop it and keep the six groups.
str_match(data, pattern)[, -1]
#      [,1]           [,2]        [,3]   [,4]                                     [,5]     [,6]
# [1,] "014003051906" "ETN5080 "  "0450" "BOLT KIT UPPER SHAFT WITH 5 SPEED"      "1.000"  "F"
# [2,] "014003051906" "ETN5967 "  "0460" "SENSOR SENSOR FH BACKSHAFT SPEED"       "1.000"  "F"
# [3,] "014003051906" "ETN64267 " "0470" "TILT UNIT SENSOR"                       "1.000"  "F"
# [4,] NA             NA          NA     NA                                       NA       NA
# [5,] "014003065376" "03M7184 "  "0020" "BOLT - M 8.0 X 1.250 X 20.0 - 8.8-Zinc" "4.000"  "G"
# [6,] "014003065376" "03M7386 "  "0090" "BOLT, RD HD SQ SHORT NECK, METRIC"      "18.000" "G"
# [7,] "014003065376" "14M7296 "  "0090" "NUT, METRIC, HEX FLANGE"                "14.000" "G"

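Edit:
An explanation of the regular expression for beginners, in plain words (please excuse any inaccuracies):

  • The leading ^ and the trailing $ mean the start and the end of the string.
  • Parentheses are used for grouping (str_match() extracts the groups).
  • . means any character, and .* means any number of any characters.
  • [^,] means any character that is not a comma.

Put together, it reads: start of string - substring without a comma - comma (repeated 3 times) - substring possibly containing commas - comma - substring without a comma - comma - substring without a comma - end of string. Only the parenthesized groups are extracted.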
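To make the explanation above concrete, here is a minimal sketch that applies the same pattern with base R only (regexec() and regmatches() instead of stringr) and turns the captured groups into a data frame. The column names are invented for illustration and do not come from the original data.

```r
# Two sample rows from the question; the second has commas inside field 4.
rows <- c(
  "014003051906,ETN5080 ,0450,BOLT KIT UPPER SHAFT WITH 5 SPEED,1.000,F",
  "014003065376,14M7296 ,0090,NUT, METRIC, HEX FLANGE,14.000,G"
)

pattern <- "^([^,]*),([^,]*),([^,]*),(.*),([^,]*),([^,]*)$"

# regexec() + regmatches() return the full match followed by each capture group.
parts <- regmatches(rows, regexec(pattern, rows))

# Drop the full match (element 1) and stack the six groups into a matrix.
fields <- do.call(rbind, lapply(parts, function(p) p[-1]))

# Hypothetical column names, chosen only for this example.
df <- data.frame(fields, stringsAsFactors = FALSE)
names(df) <- c("order", "part", "seq", "description", "qty", "flag")

# Clean up the padded part number and convert the quantity to numeric.
df$part <- trimws(df$part)
df$qty <- as.numeric(df$qty)
```

Because only the fourth group is allowed to contain commas, the split is unambiguous as long as every other field is comma-free. A line that does not match the pattern (such as an empty line) yields no groups and would need to be filtered out before the rbind() step.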

+1

Extending your idea, you can add quotes around the fourth field and call 'read.csv': 'read.csv(text = gsub("^([^,]*,[^,]*,[^,]*,)(.*)(,[^,]*,[^,]*)$", "\\1\"\\2\"\\3", data), header = FALSE)' – nicola

+1

@Apom I am new to regular expressions; could you explain the regex part? – darkage

+0

@darkage See my edit. –
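The approach from nicola's comment (wrap the free-text field in double quotes, then let read.csv do the parsing) might look like the sketch below. It assumes the free-text field never contains a double quote itself; the colClasses argument is an addition here, used only to keep leading zeros in values such as "0090".

```r
rows <- c(
  "014003065376,03M7386 ,0090,BOLT, RD HD SQ SHORT NECK, METRIC,18.000,G",
  "014003065376,14M7296 ,0090,NUT, METRIC, HEX FLANGE,14.000,G"
)

# Group 1: the first three fields with their trailing commas.
# Group 2: the free-text field, which may contain commas.
# Group 3: the last two fields with their leading commas.
quoted <- gsub(
  "^([^,]*,[^,]*,[^,]*,)(.*)(,[^,]*,[^,]*)$",
  "\\1\"\\2\"\\3",
  rows
)

# Read everything as character so that IDs keep their leading zeros.
df <- read.csv(text = quoted, header = FALSE, colClasses = "character")
```

Compared with extracting the groups directly, this hands type handling and header options over to read.csv, at the cost of an extra pass over the text.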
