read.csv warning 'EOF within quoted string' prevents complete reading of file

104

I have a CSV file (24.1 MB) that I cannot fully read into my R session. When I open the file in a spreadsheet program I can see 112,544 rows. When I read it into R with read.csv I only get 56,952 rows and this warning:

cit <- read.csv("citations.CSV", row.names = NULL, 
       comment.char = "", header = TRUE, 
       stringsAsFactors = FALSE, 
       colClasses= "character", encoding= "utf-8") 

Warning message: 
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : 
    EOF within quoted string

I can read the whole file into R with readLines:

rl <- readLines(file("citations.CSV", encoding = "utf-8")) 
length(rl) 
[1] 112545

But I can't get this back into R as a table (via read.csv):

write.table(rl, "rl.txt", quote = FALSE, row.names = FALSE) 
rl_in <- read.csv("rl.txt", skip = 1, row.names = NULL) 

Warning message: 
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : 
    EOF within quoted string

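How can I solve or work around this EOF message (which seems to be more of an error than a warning) so that the entire file is imported into my R session?

I have similar problems with other methods of reading CSV files: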

require(sqldf) 
cit_sql <- read.csv.sql("citations.CSV", sql = "select * from file") 
require(data.table) 
cit_dt <- fread("citations.CSV") 
require(ff) 
cit_ff <- read.csv.ffdf(file="citations.CSV")

Here is my sessionInfo:

R version 3.0.1 (2013-05-16) 
Platform: x86_64-w64-mingw32/x64 (64-bit) 

locale: 
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C       
[5] LC_TIME=English_United States.1252  

attached base packages: 
[1] tools  tcltk  stats  graphics grDevices utils  datasets methods base  

other attached packages: 
[1] ff_2.2-11    bit_1.1-10   data.table_1.8.8  sqldf_0.4-6.4   
[5] RSQLite.extfuns_0.0.1 RSQLite_0.11.4  chron_2.3-43   gsubfn_0.6-5   
[9] proto_0.3-10   DBI_0.2-7

Source

2013-07-01 Ben

154

You need to disable quoting.

cit <- read.csv("citations.CSV", quote = "", 
       row.names = NULL, 
       stringsAsFactors = FALSE) 

str(cit) 
## 'data.frame': 112543 obs. of 13 variables: 
## $ row.names : chr "10.2307/675394" "10.2307/30007362" "10.2307/4254931" "10.2307/20537934" ... 
## $ id   : chr "10.2307/675394\t" "10.2307/30007362\t" "10.2307/4254931\t" "10.2307/20537934\t" ... 
## $ doi   : chr "Archaeological Inference and Inductive Confirmation\t" "Sound and Sense in Cath Almaine\t" "Oak Galls Preserved by the Eruption of Mount Vesuvius in A.D. 79_ and Their Probable Use\t" "The Arts Four Thousand Years Ago\t" ... 
## $ title  : chr "Bruce D. Smith\t" "Tomás Ó Cathasaigh\t" "Hiram G. Larew\t" "\t" ... 
## $ author  : chr "American Anthropologist\t" "Ériu\t" "Economic Botany\t" "The Illustrated Magazine of Art\t" ... 
## $ journaltitle : chr "79\t" "54\t" "41\t" "1\t" ... 
## $ volume  : chr "3\t" "\t" "1\t" "3\t" ... 
## $ issue  : chr "1977-09-01T00:00:00Z\t" "2004-01-01T00:00:00Z\t" "1987-01-01T00:00:00Z\t" "1853-01-01T00:00:00Z\t" ... 
## $ pubdate  : chr "pp. 598-617\t" "pp. 41-47\t" "pp. 33-40\t" "pp. 171-172\t" ... 
## $ pagerange : chr "American Anthropological Association\tWiley\t" "Royal Irish Academy\t" "New York Botanical Garden Press\tSpringer\t" "\t" ... 
## $ publisher : chr "fla\t" "fla\t" "fla\t" "fla\t" ... 
## $ type   : logi NA NA NA NA NA NA ... 
## $ reviewed.work: logi NA NA NA NA NA NA ...

I think it's because of lines like this one (check "Thorn" and "Minus"):

readLines("citations.CSV")[82] 
[1] "10.2307/3642839,10.2307/3642839\t,\"Thorn\" and \"Minus\" in Hieroglyphic Luvian Orthography\t,H. Craig Melchert\t,Anatolian Studies\t,38\t,\t,1988-01-01T00:00:00Z\t,pp. 29-42\t,British Institute at Ankara\t,fla\t,\t,"
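To see the mechanism in isolation, here is a minimal sketch using a made-up ten-record file (the file contents and the names `tmp`, `with_quotes`, and `no_quotes` are my own assumptions, not taken from citations.CSV). One record contains a single unbalanced double quote, so with the default quoting everything after it is swallowed into one field; with `quote = ""` each line is parsed as its own record.

```r
# Hypothetical data: ten records, one with a stray, unbalanced double quote.
tmp <- tempfile(fileext = ".csv")
titles <- paste("Title", 1:10)
titles[7] <- "A 5\" floppy disk"
writeLines(c("id,title", paste(1:10, titles, sep = ",")), tmp)

# Default quoting: the stray quote opens a string that never closes.
# The warning is suppressed only to keep the demo quiet; normally you
# would see "EOF within quoted string" here.
with_quotes <- suppressWarnings(read.csv(tmp, stringsAsFactors = FALSE))

# Quoting disabled: the quote character is treated as ordinary text.
no_quotes <- read.csv(tmp, quote = "", stringsAsFactors = FALSE)

nrow(with_quotes)   # fewer rows than the file really has
nrow(no_quotes)     # one row per record

# With quoting off, every line should report the same number of fields.
unique(count.fields(tmp, sep = ",", quote = ""))
```

Comparing the two row counts against `length(readLines(tmp)) - 1` is a quick way to confirm that nothing was lost.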

Source

2013-07-01 23:04:57 dickoa

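Thanks, that's a simple fix. Now how do you think fread would work in this situation? I'd prefer that since it's much faster than 'read.csv'. But 'fread' doesn't seem to take a 'quote' argument. – Ben

@Ben I tried to make it work too, without success. As you pointed out, 'fread' doesn't handle embedded quotes well, but I'm sure there will be a workaround soon. http://stackoverflow.com/questions/16094025/data-tablefread-and-unbalanced – dickoa

I see, thanks for checking. – Ben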

I also ran into this problem, and was able to work around a similar EOF error using:

read.table("....csv", sep=",", ...)

Note that the separator argument is defined within the more general read.table().

Source

2013-08-01 17:38:20

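Hi, this doesn't work for me... I get an error: Error in read.table(".csv", : more columns than column names. It seems that skipping (skip = 6) doesn't work properly... – maycca

I had a similar problem: an EOF warning, and only part of the data was loaded with read.csv(). I tried quote = "", but it only removed the EOF warning.

But looking at the first row that was not loaded, I found a special character in one of the cells: an arrow → (hex value 0x1A). After deleting the arrow, the data loaded normally.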
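For anyone who wants to locate such a byte without opening the file in an editor, here is a rough sketch. The demo file and the names `tmp`, `bytes`, `hits`, `hit_lines`, and `clean` are my own assumptions. 0x1A is the ASCII SUB control character (Ctrl-Z), which some Windows tools treat as an end-of-file marker, so the file is read as raw bytes to sidestep any text-mode handling.

```r
# Hypothetical demo file with a 0x1A byte inside the third record.
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("id,note\n1,ok\n2,also ok\n3,bad"),
           as.raw(0x1A),
           charToRaw("cell\n4,fine\n")), tmp)

# Read the whole file as raw bytes.
bytes <- readBin(tmp, what = "raw", n = file.info(tmp)$size)

# Byte offsets of every 0x1A, and the line number each one falls on
# (number of newlines before the offset, plus one).
hits <- which(bytes == as.raw(0x1A))
hit_lines <- vapply(hits,
                    function(i) sum(bytes[seq_len(i)] == as.raw(0x0A)) + 1L,
                    integer(1))
hit_lines

# Write a cleaned copy with those bytes dropped, then read it as usual.
clean <- tempfile(fileext = ".csv")
writeBin(bytes[bytes != as.raw(0x1A)], clean)
dat <- read.csv(clean, stringsAsFactors = FALSE)
nrow(dat)
```

Writing to a separate cleaned copy keeps the original file untouched in case the removed bytes turn out to matter.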

Source

2015-07-12 08:30:02 ElinaJ

Same problem here. Is there another way to solve this without any manual intervention? – Mohit

As pointed out above and in the R help, simply disabling quoting altogether by adding:

quote = ""

to read.csv() worked for me.

The error, "EOF within quoted string", occurred with:

> iproscan.53A.neg  = read.csv("interproscan.53A.neg.n.csv", 
    +      colClasses=c(pb.id  = "character", 
    +          genLoc  = "character", 
    +          icode  = "character", 
    +          length  = "character", 
    +          proteinDB = "character", 
    +          protein.id = "character", 
    +          prot.desc = "character", 
    +          start  = "character", 
    +          end  = "character", 
    +          evalue  = "character", 
    +          tchar  = "character", 
    +          date  = "character", 
    +          ipro.id = "character", 
    +          prot.name = "character", 
    +          go.cat  = "character", 
    +          reactome.id= "character"), 
    +          as.is=T,header=F) 
    Warning message: 
    In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : 
     EOF within quoted string 
    > dim(iproscan.53A.neg) 
    [1] 69383 16

and the file that was read in was missing 6,619 lines. But disabling quoting:

> iproscan.53A.neg  = read.csv("interproscan.53A.neg.n.csv", 
    +      colClasses=c(pb.id  = "character", 
    +          genLoc  = "character", 
    +          icode  = "character", 
    +          length  = "character", 
    +          proteinDB = "character", 
    +          protein.id = "character", 
    +          prot.desc = "character", 
    +          start  = "character", 
    +          end  = "character", 
    +          evalue  = "character", 
    +          tchar  = "character", 
    +          date  = "character", 
    +          ipro.id = "character", 
    +          prot.name = "character", 
    +          go.cat  = "character", 
    +          reactome.id= "character"), 
    +          as.is=T,header=F,**quote=""**)  
    > 
    > dim(iproscan.53A.neg) 
    [1] 76002 16

worked without error, and all lines were read in successfully.

Source

2015-09-26 21:06:19

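You are repeating an earlier answer and then reducing its usefulness by adding unnecessary flanking double asterisks inside the code block. –

I'm a new-ish R user and thought I'd post this in case it helps anyone else. I was trying to read in data from a comma-separated text file that includes some Spanish characters, and it took me forever to figure it out. I knew I needed to use UTF-8 encoding, set the header argument to TRUE, and set the sep argument to ",", but I still got hang-ups. After reading this post I tried setting the fill argument to TRUE, but then got the same "EOF within quoted string" warning, which I was able to fix in the same way as above. My successful read.table call looks like this: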

target <- read.table("target2.txt", fill=TRUE, header=TRUE, quote="", sep=",", encoding="UTF-8")

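The result has the Spanish-language characters and the same dimensions as my original data, so I'm calling it a success! Thanks, all!

Source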

2015-10-13 21:30:59 mjd876

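Actually, using read.csv() to read a file with free-text content is not a good idea. Disabling quoting by setting quote = "" is only a temporary solution, and it only helps when the problem is unpaired quotation marks. The warning can have other causes, such as certain special characters.

So for the special-character case, the permanent solution is to inspect your file, find out what those special characters are, and use a regular expression to eliminate them.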
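A minimal sketch of that regular-expression cleanup, assuming the offending characters are ASCII control characters (the sample lines and the names `lines`, `ctrl`, `flagged`, and `cleaned` are hypothetical). Tab and newline are deliberately left out of the pattern, since tabs can be real data, as in the citations file above.

```r
# Hypothetical lines as they might come back from readLines();
# "\032" is the octal escape for 0x1A and "\a" is the bell character (0x07).
lines <- c("id,note", "1,ok", "2,bad\032cell", "3,rings\abell")

# ASCII control characters except tab (0x09) and newline (0x0A).
ctrl <- "[\\x01-\\x08\\x0B-\\x1F\\x7F]"

# Which lines contain at least one of them?
flagged <- grep(ctrl, lines, perl = TRUE)
flagged

# Strip them and parse the cleaned text.
cleaned <- gsub(ctrl, "", lines, perl = TRUE)
dat <- read.csv(text = cleaned, stringsAsFactors = FALSE)
dat
```

Inspecting `lines[flagged]` before deleting anything is worthwhile, because a control character sometimes stands in for a character that was lost in an earlier encoding conversion.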

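Have you thought about installing the package {data.table} and using fread() to read the file? It is much faster and would not bother you with this EOF warning. Note that the file you read in is stored as a data.table rather than a data.frame. data.table has many nice features, but you can convert it with as.data.frame() if needed.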
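A rough sketch of the fread() route, with some assumptions stated up front: it relies on the `quote` argument that more recent data.table releases document for fread() (the comments earlier in this thread predate it), and the demo file and the names `tmp`, `cit_dt`, and `cit_df` are made up. The block skips itself if data.table is not installed.

```r
# Hypothetical data: ten records, one with a stray, unbalanced double quote.
tmp <- tempfile(fileext = ".csv")
titles <- paste("Title", 1:10)
titles[7] <- "A 5\" floppy disk"
writeLines(c("id,title", paste(1:10, titles, sep = ",")), tmp)

if (requireNamespace("data.table", quietly = TRUE)) {
  # quote = "" asks fread() to treat quote characters as ordinary text.
  cit_dt <- data.table::fread(tmp, sep = ",", header = TRUE, quote = "")
  # Convert to a plain data.frame if the rest of the code expects one.
  cit_df <- as.data.frame(cit_dt)
  print(dim(cit_df))
} else {
  message("data.table is not installed; skipping the fread() demo")
}
```

If your installed version of fread() rejects `quote`, check `?fread` for that version before relying on this.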

Source

2016-12-16 14:08:08 floatsd
