2013-07-01 53 views
104

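read.csv warning 'EOF within quoted string' prevents complete reading of file

I have a CSV file (24.1 MB) that I cannot fully read into my R session. When I open the file in a spreadsheet program I can see 112,544 rows. When I read it into R with read.csv I only get 56,952 rows and this warning: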

cit <- read.csv("citations.CSV", row.names = NULL, 
       comment.char = "", header = TRUE, 
       stringsAsFactors = FALSE, 
       colClasses= "character", encoding= "utf-8") 

Warning message: 
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : 
    EOF within quoted string 

I can read the whole file into R with readLines:

rl <- readLines(file("citations.CSV", encoding = "utf-8")) 
length(rl) 
[1] 112545 

But I can't get it back into R as a table (via read.csv):

write.table(rl, "rl.txt", quote = FALSE, row.names = FALSE) 
rl_in <- read.csv("rl.txt", skip = 1, row.names = NULL) 

Warning message: 
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : 
    EOF within quoted string 

How can I solve or work around this EOF message (which seems to be more of an error than a warning) to get the entire file into my R session?

I have similar problems with other methods of reading CSV files:

require(sqldf) 
cit_sql <- read.csv.sql("citations.CSV", sql = "select * from file") 
require(data.table) 
cit_dt <- fread("citations.CSV") 
require(ff) 
cit_ff <- read.csv.ffdf(file="citations.CSV") 

Here is my sessionInfo:

R version 3.0.1 (2013-05-16) 
Platform: x86_64-w64-mingw32/x64 (64-bit) 

locale: 
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C       
[5] LC_TIME=English_United States.1252  

attached base packages: 
[1] tools  tcltk  stats  graphics grDevices utils  datasets methods base  

other attached packages: 
[1] ff_2.2-11    bit_1.1-10   data.table_1.8.8  sqldf_0.4-6.4   
[5] RSQLite.extfuns_0.0.1 RSQLite_0.11.4  chron_2.3-43   gsubfn_0.6-5   
[9] proto_0.3-10   DBI_0.2-7 

Answers

154

You need to disable quoting.

cit <- read.csv("citations.CSV", quote = "", 
       row.names = NULL, 
       stringsAsFactors = FALSE) 

str(cit) 
## 'data.frame': 112543 obs. of 13 variables: 
## $ row.names : chr "10.2307/675394" "10.2307/30007362" "10.2307/4254931" "10.2307/20537934" ... 
## $ id   : chr "10.2307/675394\t" "10.2307/30007362\t" "10.2307/4254931\t" "10.2307/20537934\t" ... 
## $ doi   : chr "Archaeological Inference and Inductive Confirmation\t" "Sound and Sense in Cath Almaine\t" "Oak Galls Preserved by the Eruption of Mount Vesuvius in A.D. 79_ and Their Probable Use\t" "The Arts Four Thousand Years Ago\t" ... 
## $ title  : chr "Bruce D. Smith\t" "Tomás Ó Cathasaigh\t" "Hiram G. Larew\t" "\t" ... 
## $ author  : chr "American Anthropologist\t" "Ériu\t" "Economic Botany\t" "The Illustrated Magazine of Art\t" ... 
## $ journaltitle : chr "79\t" "54\t" "41\t" "1\t" ... 
## $ volume  : chr "3\t" "\t" "1\t" "3\t" ... 
## $ issue  : chr "1977-09-01T00:00:00Z\t" "2004-01-01T00:00:00Z\t" "1987-01-01T00:00:00Z\t" "1853-01-01T00:00:00Z\t" ... 
## $ pubdate  : chr "pp. 598-617\t" "pp. 41-47\t" "pp. 33-40\t" "pp. 171-172\t" ... 
## $ pagerange : chr "American Anthropological Association\tWiley\t" "Royal Irish Academy\t" "New York Botanical Garden Press\tSpringer\t" "\t" ... 
## $ publisher : chr "fla\t" "fla\t" "fla\t" "fla\t" ... 
## $ type   : logi NA NA NA NA NA NA ... 
## $ reviewed.work: logi NA NA NA NA NA NA ... 

I think it is because of lines like this one (check "Thorn" and "Minus"):

readLines("citations.CSV")[82] 
[1] "10.2307/3642839,10.2307/3642839\t,\"Thorn\" and \"Minus\" in Hieroglyphic Luvian Orthography\t,H. Craig Melchert\t,Anatolian Studies\t,38\t,\t,1988-01-01T00:00:00Z\t,pp. 29-42\t,British Institute at Ankara\t,fla\t,\t," 
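
To see the effect in isolation, here is a minimal sketch using a made-up file (the file contents and the object names `broken` and `fixed` are illustrative assumptions, not taken from citations.CSV). One field contains a lone double quote, which read.csv treats as the start of a quoted string that never closes:

```r
# Build a small made-up CSV; data row 6 contains a lone double quote.
tmp <- tempfile(fileext = ".csv")
titles <- paste("Title", 1:8)
titles[6] <- "A 5\" disk"
writeLines(c("id,title", paste(1:8, titles, sep = ",")), tmp)

# Default quoting: the lone quote opens a quoted string, so the
# remaining lines are swallowed into one field (with the EOF warning).
broken <- suppressWarnings(read.csv(tmp, stringsAsFactors = FALSE))

# Quoting disabled: every physical line becomes its own row.
fixed <- read.csv(tmp, quote = "", stringsAsFactors = FALSE)

print(nrow(broken))
print(nrow(fixed))
```

The trade-off is that with `quote = ""` any quote characters stay in the data as literal text, and fields that legitimately contain commas inside quotes will be split.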
+0

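Thanks, that's a simple fix. Now how do you get fread to work in this situation? I'd prefer to use it because it's much faster than `read.csv`, but `fread` doesn't seem to take a `quote` argument. – Ben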

+1

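@Ben I tried to make it work too, without success. As you pointed out, `fread` doesn't handle embedded quotes well, but I'm sure there will be a workaround soon. http://stackoverflow.com/questions/16094025/data-tablefread-and-unbalanced – dickoa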

+0

I see, thanks for checking. – Ben

2

I also ran into this problem and was able to work around a similar EOF error using:

read.table("....csv", sep=",", ...) 

Note that the separator argument is defined in the more general read.table().
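
For context, read.csv() is a thin wrapper around read.table() with different defaults. The sketch below uses a made-up file to show which arguments read.table() needs in order to reproduce read.csv(); it does not by itself explain why the call above avoided the warning for this poster.

```r
# Made-up example file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name", "1,alpha", "2,beta"), tmp)

via_csv <- read.csv(tmp)

# The same read spelled out with read.table(): these are the
# defaults that read.csv() supplies on your behalf.
via_table <- read.table(tmp, header = TRUE, sep = ",", quote = "\"",
                        dec = ".", fill = TRUE, comment.char = "")

print(identical(via_csv, via_table))
```

Calling read.table() with only `sep = ","` therefore changes more than the separator: `header`, `quote`, `fill` and `comment.char` fall back to read.table()'s own defaults.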

+0

Hi, this doesn't work for me... I get an error: Error in read.table(".csv", : more columns than column names. It seems that skip (skip = 6) does not work properly... – maycca

1

I had a similar problem: an EOF warning, and only part of the data was loaded with read.csv(). I tried quote = "", but it only removed the EOF warning.

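But looking at the first row that was not loaded, I found a special character in one of the cells: an arrow → (hex value 0x1A). After deleting the arrow, the data loaded normally.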
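
If editing the file by hand is not practical, one option is to strip that byte programmatically before parsing. This is only a sketch: `strip_sub_bytes` is a hypothetical helper name, the file contents are made up, and it assumes the 0x1A bytes carry no meaning in your data.

```r
# Hypothetical helper: copy a file, dropping every 0x1A byte.
strip_sub_bytes <- function(path_in, path_out) {
  bytes <- readBin(path_in, what = "raw", n = file.info(path_in)$size)
  writeBin(bytes[bytes != as.raw(0x1A)], path_out)
  invisible(path_out)
}

# Made-up file with a 0x1A byte in the middle of a field.
dirty <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("a,b\n1,x"), as.raw(0x1A), charToRaw("y\n2,z\n")), dirty)

cleaned <- tempfile(fileext = ".csv")
strip_sub_bytes(dirty, cleaned)

clean <- read.csv(cleaned, stringsAsFactors = FALSE)
print(clean)
```

Working on raw bytes avoids any text-mode interpretation of the file while it is being cleaned.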

+0

Same problem here. Is there another way to solve this without any manual intervention? – Mohit

4

As the R help says, and as pointed out above, simply disabling quoting altogether by adding:

quote = "" 

to read.csv() worked for me.

The error, "EOF within quoted string", occurred with:

> iproscan.53A.neg  = read.csv("interproscan.53A.neg.n.csv", 
    +      colClasses=c(pb.id  = "character", 
    +          genLoc  = "character", 
    +          icode  = "character", 
    +          length  = "character", 
    +          proteinDB = "character", 
    +          protein.id = "character", 
    +          prot.desc = "character", 
    +          start  = "character", 
    +          end  = "character", 
    +          evalue  = "character", 
    +          tchar  = "character", 
    +          date  = "character", 
    +          ipro.id = "character", 
    +          prot.name = "character", 
    +          go.cat  = "character", 
    +          reactome.id= "character"), 
    +          as.is=T,header=F) 
    Warning message: 
    In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : 
     EOF within quoted string 
    > dim(iproscan.53A.neg) 
    [1] 69383 16 

and the file as read was missing 6,619 lines. But disabling quoting

> iproscan.53A.neg  = read.csv("interproscan.53A.neg.n.csv", 
    +      colClasses=c(pb.id  = "character", 
    +          genLoc  = "character", 
    +          icode  = "character", 
    +          length  = "character", 
    +          proteinDB = "character", 
    +          protein.id = "character", 
    +          prot.desc = "character", 
    +          start  = "character", 
    +          end  = "character", 
    +          evalue  = "character", 
    +          tchar  = "character", 
    +          date  = "character", 
    +          ipro.id = "character", 
    +          prot.name = "character", 
    +          go.cat  = "character", 
    +          reactome.id= "character"), 
    +          as.is=T,header=F,**quote=""**)  
    > 
    > dim(iproscan.53A.neg) 
    [1] 76002 16 

worked without error, and all lines were read successfully.

+4

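You are repeating an earlier answer, and then undermining its usefulness by adding unnecessary flanking double asterisks inside the code block. –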

6

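I'm a new-ish R user and thought I'd post this in case it helps someone else. I was trying to read data from a comma-separated text file that included some Spanish characters, and it took me forever to figure out. I knew I needed to use UTF-8 encoding, set the header argument to TRUE, and set the sep argument to ",", but I was still getting hung up. After reading this post I tried setting the fill argument to TRUE, but then got the same "EOF within quoted string" warning, which I was able to fix in the same way as above. My successful read.table call looks like this: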

target <- read.table("target2.txt", fill=TRUE, header=TRUE, quote="", sep=",", encoding="UTF-8")

The result has the Spanish characters and the same dimensions as my original data, so I'm calling it a success! Thanks all!

2

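Actually, using read.csv() to read a file with free-text content is not a good idea. Disabling quoting by setting quote = "" is only a temporary solution; it only helps with stray quotation marks. There are other causes of the warning, such as certain special characters.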

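So for the special-character case, the permanent solution is to inspect your file, find out what those special characters are, and use regular expressions to remove them.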
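
As a rough sketch of that approach, the snippet below locates lines containing control characters and removes them with a regular expression. The sample lines and the character class are assumptions for illustration: the class deliberately keeps tab, line feed and carriage return, and you may need a different one for your file.

```r
# Made-up lines; the third one contains a 0x1A control character.
raw_lines <- c("id,note", "1,fine", "2,bad\x1Avalue", "3,also fine")

# Control characters other than tab, LF and CR (an assumed choice).
ctrl_pattern <- "[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]"

# Which lines are affected?
bad_idx <- grep(ctrl_pattern, raw_lines, perl = TRUE)
print(bad_idx)

# Remove the offending characters and parse the cleaned text.
clean_lines <- gsub(ctrl_pattern, "", raw_lines, perl = TRUE)
dat <- read.csv(text = clean_lines, stringsAsFactors = FALSE)
print(dat)
```

In practice `raw_lines` would come from readLines() on the real file; printing `bad_idx` first lets you inspect the affected lines before deleting anything.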

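Have you thought about installing the {data.table} package and using fread() to read the file? It is faster and won't bother you with this EOF warning. Note that the object you read in will have class data.table rather than being a plain data.frame. data.table has many nice features, but you can convert the result with as.data.frame() if needed.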
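
The comments on the accepted answer note that fread() had no `quote` argument at the time. Current data.table releases do document one, so something along these lines may work today. This is a sketch with a made-up file, and it is skipped entirely if data.table is not installed.

```r
# Made-up file with a lone double quote in one field.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,title",
             "1,Plain title",
             "2,A 5\" disk",
             "3,Another title"), tmp)

if (requireNamespace("data.table", quietly = TRUE)) {
  # quote = "" asks fread() to treat quote characters as ordinary text.
  dt <- data.table::fread(tmp, sep = ",", header = TRUE, quote = "")

  # Convert to a plain data.frame if downstream code expects one.
  df <- as.data.frame(dt)
  print(class(dt))
  print(class(df))
}
```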