
Sparklyr ignores the line delimiter

I am trying to read a ~2 GB .csv file (about 5 million lines) with sparklyr:

bigcsvspark <- spark_read_csv(sc, "bigtxt", "path",
                              delimiter = "!",
                              infer_schema = FALSE,
                              memory = TRUE,
                              overwrite = TRUE,
                              columns = list(
                                # column names suppressed; all declared as 'character'
                              ))

and I get the following error:

Job aborted due to stage failure: Task 9 in stage 15.0 failed 4 times, most recent failure: Lost task 9.3 in stage 15.0 (TID 3963, 
10.1.4.16): com.univocity.parsers.common.TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000). Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content: ---lines of my csv---[\n] 
---begin of a split line --- Parser Configuration: CsvParserSettings:  ... default settings ...

and:

CsvFormat: 
    Comment character=\0 
    Field delimiter=! 
    Line separator (normalized)=\n 
    Line separator sequence=\n 
    Quote character=" 
    Quote escape character=\ 
    Quote escape escape character=null Internal state when error was thrown: 
     line=10599, 
     column=6, 
     record=8221, 
     charIndex=4430464, 
     headers=[---SUPPRESSED HEADER---],
     content parsed=---more lines without the delimiter.--- 

As shown above, at some point the line separator starts to be ignored. In plain R the file can be read without any problem, simply by passing the path and the delimiter to read.csv.
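For reference, the 1,000,000-character cap in the error message is univocity's per-field limit, which Spark's CSV reader exposes as the maxCharsPerColumn option; sparklyr can pass it through via the options argument of spark_read_csv. A minimal sketch to test whether the limit, rather than the data, is the problem (the columns argument is omitted for brevity; -1 means unlimited in recent Spark 2.x releases):

bigcsvspark <- spark_read_csv(sc, "bigtxt", "path",
                              delimiter = "!",
                              infer_schema = FALSE,
                              memory = TRUE,
                              overwrite = TRUE,
                              # raise univocity's per-field character cap
                              options = list(maxCharsPerColumn = "-1"))

If the read then succeeds but produces garbage rows, that points back at malformed data rather than at the parser settings.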


As the author suggested, try using dplyr's filter to remove/identify the unwanted rows. https://github.com/rstudio/sparklyr/issues/83 – Igor
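A hedged sketch of that idea: read the file as plain lines and flag those whose delimiter count is off. The expected count of 4 delimiters (i.e. 5 fields) and the table name are invented for illustration:

library(sparklyr)
library(dplyr)

# Read each physical line as one row with a single string column `line`.
raw <- spark_read_text(sc, "bigtxt_lines", "path")

# Count "!" occurrences per line by comparing string lengths;
# nchar() translates to Spark SQL LENGTH(), regexp_replace() passes through.
suspects <- raw %>%
  mutate(n_delim = nchar(line) - nchar(regexp_replace(line, "!", ""))) %>%
  filter(n_delim != 4)   # 4 is a placeholder for (number of columns - 1)

Inspecting a few rows of `suspects` should show whether stray line breaks or embedded delimiters are corrupting the records.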


I'll give that a try. At first I suspected the buffer couldn't handle the data, but since the data is a huge mess it may well be a data problem. I'm also thinking about writing a Scala script to convert the file to Parquet. –
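As an aside, once the data loads cleanly a separate Scala script may not be necessary, since sparklyr can write Parquet directly. A one-line sketch (the output path is invented):

# Persist the cleaned Spark DataFrame as Parquet for faster reloads.
spark_write_parquet(bigcsvspark, "path_to_output/bigtxt_parquet")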

Answer


It looks like the file is not really a CSV. I wonder whether spark_read_text() would work better in this case. You should be able to get all of the lines into Spark and then split them into fields in memory; that last part will be the trickiest.
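A minimal sketch of that approach, assuming the split can be expressed with Spark SQL functions (the table name, path, and output column names are placeholders):

library(sparklyr)
library(dplyr)

# Read every physical line as a single string column named `line`.
raw <- spark_read_text(sc, "bigtxt_raw", "path")

parsed <- raw %>%
  # split() is passed through to Spark SQL's split(), yielding an array column.
  mutate(fields = split(line, "!")) %>%
  # Expand the array into individual columns; the names are placeholders.
  sdf_separate_column("fields", into = c("col1", "col2", "col3"))

If the splitting logic turns out to need more than Spark SQL can express (for example, repairing records that span several physical lines), it would have to move into spark_apply() or a custom job, which is where it gets trickier.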