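r - Error: text after processing all cols in fread (data.table)
I am trying to import a text file into R (3.4.0). The file actually contains 4 columns, but the 4th column is mostly empty until line 200,000+. I am using fread() from the data.table package (version 1.10.4):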
fread("test.txt",fill = TRUE, sep = "\t", quote = "", header = FALSE)
I get this error message:
Error in fread("test.txt", fill = TRUE, sep = "\t", quote = "", header = FALSE) :
Expecting 3 cols, but line 258088 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
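I checked the file, and line 258088 does contain additional text in a 4th column ("8-4").
However, fill = TRUE did not solve the problem as I expected it would. I thought fread() might be determining the number of columns incorrectly, because the additional column appears very late in the file. So I tried this: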
fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 250000)
The error persisted. On the other hand,
fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 258080)
gives no error.
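I thought I had found the cause, but something strange happened when I tested with a dummy file generated by:
write.table(matrix(c(1:990000), nrow = 330000), "test2.txt", sep = "\t", row.names = FALSE)
with an extra "8-4" added in the 4th column of row 250,000 using Excel. When read with fread():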
fread("test2.txt", fill = TRUE, header = FALSE, sep = "\t")
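It worked fine, with no error message, which suggests that a late additional column does not necessarily trigger the error.
I also tried changing the encoding ("Latin-1" and "UTF-8") and the quote setting, but neither helped.
Now I am out of ideas. I hope I have done my homework and provided reproducible information. Thank you for your help.
For environment information, my sessionInfo() is: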
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.5.0 purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3
[5] tibble_1.3.3 ggplot2_2.2.1 tidyverse_1.1.1 stringr_1.2.0
[9] microbenchmark_1.4-2.1 data.table_1.10.4
loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 cellranger_1.1.0 compiler_3.4.0 plyr_1.8.4 forcats_0.2.0
[6] tools_3.4.0 jsonlite_1.5 lubridate_1.6.0 nlme_3.1-131 gtable_0.2.0
[11] lattice_0.20-35 rlang_0.1.1 psych_1.7.5 DBI_0.6-1 parallel_3.4.0
[16] haven_1.0.0 xml2_1.1.1 httr_1.2.1 hms_0.3 grid_3.4.0
[21] R6_2.2.1 readxl_1.0.0 foreign_0.8-68 reshape2_1.4.2 modelr_0.1.0
[26] magrittr_1.5 scales_0.4.1 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5
[31] colorspace_1.3-2 stringi_1.1.5 lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2
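I think the easiest way to solve this is to add a header line to the top of the file, with tab-separated column names matching the columns in your file. 'fread' by default looks at the first 30 rows of data and uses them to infer how many columns there are, so with no data in the 4th column there, it assumes there are only 3 fields. – Marius
Maybe add 'quote = ""' –
I don't think this is a problem with 'fread' or 'read.csv'. The file is the problem. Every row of a csv should have the same number of columns, and yours does not. You should fix the process that generates the file rather than how you import it. – nicola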
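To illustrate the two points above (the column count being guessed from the top of the file, and lines with unequal numbers of fields), here is a minimal sketch using base R rather than fread. It is not a fix for fread itself. The temporary file and the names path, n_fields, max_cols, wide_lines and dat are made up for illustration; the small file only mimics the shape of test.txt, with the extra "8-4" field placed on line 90.

```r
# Build a small stand-in file (hypothetical data): 3 tab-separated fields
# per line, except one late line that carries a 4th field ("8-4").
path <- tempfile(fileext = ".txt")
lines <- paste(1:100, 101:200, 201:300, sep = "\t")
lines[90] <- paste(lines[90], "8-4", sep = "\t")
writeLines(lines, path)

# Count the fields on every line instead of sampling only the top of the file.
n_fields <- count.fields(path, sep = "\t", quote = "", comment.char = "")
max_cols <- max(n_fields)
wide_lines <- which(n_fields == max_cols)  # lines that carry the extra field

# Supplying max_cols column names tells read.table how many columns to
# allocate; fill = TRUE pads the shorter lines with blank fields.
dat <- read.table(path, sep = "\t", quote = "", comment.char = "",
                  header = FALSE, fill = TRUE,
                  col.names = paste0("V", seq_len(max_cols)),
                  stringsAsFactors = FALSE)
```

On the real file, the same count.fields() call would show which lines have a 4th field, and read.table() is likely to be noticeably slower than fread on a file of this size.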