Reading gzip-compressed XML in Scala
When I try to read an xml.gz file in Scala, I get the following error:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:701)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:567)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1896)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1761)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1799)
at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:156)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:812)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
at scala.xml.factory.XMLLoader$class.loadXML(XMLLoader.scala:41)
at scala.xml.XML$.loadXML(XML.scala:60)
at scala.xml.factory.XMLLoader$class.loadFile(XMLLoader.scala:50)
at scala.xml.X
I have the following code:
import scala.xml.XML
val xml = XML.loadFile("/home/vagrant/miniprojects/spark/allVotes/part-00380.xml.gz")
I have more than 2000 xml.gz files to read in. Is there a workaround for this? Thank you very much!
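Showing what you're doing in the smallest complete, testable form possible (how you do the parsing, and in particular how you do the gzip decompression) would be a good place to start. See http://stackoverflow.com/help/mcve –
...so you *aren't* doing any gzip decompression. Does it surprise you that you can't read a gzip file as though it were a plain XML file? –
Thanks for the pointer. When I gunzip the files first, they take up too much memory... – achimneyswallow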
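As the comments point out, XML.loadFile hands the raw gzip bytes straight to the parser, which is most likely why it rejects the first byte sequence as invalid UTF-8. A minimal sketch of one way around this, assuming the files are ordinary single-member gzip archives and that scala-xml is on the classpath: wrap the file in java.util.zip.GZIPInputStream and pass the stream to XML.load, so the data is decompressed on the fly and no gunzipped copy is written out. The helper name loadGzippedXml is made up for illustration.

```scala
import java.io.{BufferedInputStream, FileInputStream}
import java.util.zip.GZIPInputStream
import scala.xml.{Elem, XML}

// Hypothetical helper: parse a gzip-compressed XML file without
// decompressing it to disk first.
def loadGzippedXml(path: String): Elem = {
  val fileIn = new FileInputStream(path)
  try {
    // GZIPInputStream inflates the bytes as the parser reads them.
    val gzIn = new GZIPInputStream(new BufferedInputStream(fileIn))
    try XML.load(gzIn)
    finally gzIn.close()
  } finally {
    // Also covers the case where the GZIPInputStream constructor throws
    // (for example, when the file is not actually gzip data).
    fileIn.close()
  }
}
```

With this in place, the call from the question would become loadGzippedXml("/home/vagrant/miniprojects/spark/allVotes/part-00380.xml.gz"), and the 2000 or so files could be handled one at a time in a loop. Note that XML.load still builds the whole document tree in memory, so this avoids the uncompressed copies on disk but not the per-document memory cost.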