2015-11-11 24 views
1

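Exception-tolerant CSV parsing in Groovy

I have been trying to parse a CSV file in Groovy, currently using the library org.apache.commons.csv 2.4. My requirement is this: when a CSV cell contains an invalid data value, such as an invalid character, I don't want an exception thrown at the first invalid row or cell. Instead, I want to collect those exceptions and keep iterating to the end of the file, so that I end up with a complete list of the invalid data in the CSV file.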

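To achieve this I have tried several ways of using this Apache library, but unfortunately the iterator aborts as soon as it hits the bad data, because it iterates using CSVParser.getNextRecord().

The code looks like this: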

def records = new CSVParser(reader, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces())

    // The iterator() inside CSVParser always uses getNextRecord() for its next() implementation, and that may throw an exception on an invalid char
    records.each { record ->
        // If the exception is thrown from .each itself, the try/catch below is useless
        try {
            // process the record
        } catch (e) {
            // want to collect errors here
        }
    }

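So, is there anything else in this library I should dig into? Or can someone point me to another, more feasible solution? Many thanks, everyone!

Update: sample CSV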

"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status" 

"1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X" 
"1001","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X" 

The second data row contains an invalid character ("), which makes the parser throw an exception.

+0

Can you give the format and an example of the "invalid characters"? – jalopaba

Answers

2

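The problem you are running into is that one of the characters in a cell is the quote character the parser uses for the chosen format, CSVFormat.EXCEL:

Quote character

Used to encapsulate values that contain special characters

So in your example the quote character is being misused, and the parser complains about it.

You can work around this with a different CSVFormat. For example, one with no quote character: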

@Grapes(
    @Grab(group='org.apache.commons', module='commons-csv', version='1.2') 
) 

import java.nio.charset.* 
import org.apache.commons.csv.* 

def text = '''"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status" 

"1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X" 
"1002","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X" 
"1003","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"''' 

def parsed = CSVParser.parse(text, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces().withQuote(null)) 

parsed.getRecords().each { 
    println it.toMap().values() 
} 

The above yields:

[] 
["0000016400", "1001", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"] 
["0000016400", "1002", "RE-01768-011", "Opex - To present a paper on "Career con", "X", "PR00031497"] 
["0000016400", "1003", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"] 

Of course, with the workaround above, the quote character (") is included in every field.

You can strip them all if you want:

parsed.getRecords().each { 
    println it.toMap().values().collect({ it.replace('"', '') }) 
} 
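One caveat with disabling the quote character is that quotes no longer protect a delimiter inside a field, so a comma within a quoted value splits the row into extra columns. The following is a minimal sketch of that effect, not part of the original answer; the sample line is made up for illustration, and it assumes commons-csv 1.2 as in the snippet above.

```groovy
@Grab(group='org.apache.commons', module='commons-csv', version='1.2')
import org.apache.commons.csv.CSVFormat
import org.apache.commons.csv.CSVParser

// Hypothetical sample row: the second field contains a comma inside quotes
def line = '"1001","Opex, travel","X"'

// With the default EXCEL format, quotes encapsulate the embedded comma
def quoted = CSVParser.parse(line, CSVFormat.EXCEL).getRecords()[0]

// With quoting disabled, every comma is treated as a field separator
def unquoted = CSVParser.parse(line, CSVFormat.EXCEL.withQuote(null)).getRecords()[0]

println "fields with quoting:    ${quoted.size()}"
println "fields without quoting: ${unquoted.size()}"
```

So withQuote(null) is only a safe workaround when no field value can contain the delimiter.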
+0

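This is better, @jalopaba, but what about other invalid characters? I mean, this time the problem is an extra quote in a cell, and there happens to be a format setting that ignores quotes, but does it cover every invalid character? Or is the extra quote really the only troublesome case in the world of CSV parsing? I am quite new to this and may just be asking a silly question. –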

+0

You should also consider the escape character in CSVFormat, since it affects how parsing is done as well. – jalopaba

+0

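I just checked the commons.csv.Lexer class, and it only raises character validation exceptions for the quote and escape characters. withQuote(null) works in my case, but another problem comes up immediately... if a cell value contains the delimiter char (',' in this case), it will break the whole data row... –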

0

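The problem is that if a CSV file has invalid data, meaning data that breaks the rules of the CSV format, then the parser can't... parse. That is why it cannot reliably continue parsing past the first error it encounters.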
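If no field contains an embedded line break, one possible way to still get a complete list of bad rows is to hand each physical line to the parser separately, so a failure in one line cannot stop the others. This is only a sketch under that assumption; parseTolerantly is a hypothetical helper name, and the sample rows are taken from the question with the header omitted.

```groovy
@Grab(group='org.apache.commons', module='commons-csv', version='1.2')
import org.apache.commons.csv.CSVFormat
import org.apache.commons.csv.CSVParser

// Hypothetical helper: parses each physical line on its own and collects failures.
// Assumption: no quoted field spans more than one line.
def parseTolerantly(String text, CSVFormat format) {
    def good = []    // successfully parsed CSVRecord objects
    def errors = []  // maps of [line: 1-based line number, message: parser message]
    text.readLines().eachWithIndex { line, i ->
        if (!line.trim()) return  // skip blank lines (acts like "continue")
        try {
            good.addAll(CSVParser.parse(line, format).getRecords())
        } catch (Exception e) {
            // The parser rejected this line; record it and move on
            errors << [line: i + 1, message: e.message]
        }
    }
    [records: good, errors: errors]
}

def sample = '''"1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"
"1002","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"
"1003","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"'''

def result = parseTolerantly(sample, CSVFormat.EXCEL.withIgnoreSurroundingSpaces())
result.errors.each { println "line ${it.line}: ${it.message}" }
println "parsed ${result.records.size()} valid rows"
```

The format passed in deliberately has no withHeader(), because each line is parsed in isolation; the header row would need to be read separately and matched to the values by position. The error messages carry the parser's own line counter, which restarts for every line, so the line number stored alongside each message is the one to trust.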