XML parser in a MapReduce program fails with the message: The entity name must immediately follow the '&' in the entity reference.

I have been running distributed XML parsing on a Hadoop cluster. I use this XmlInputFormat in my map-reduce program. It works well, and I sincerely thank its contributor.

However, here is the problem I ran into:

During testing, a few of the map-reduce jobs failed with the following XMLStreamException.

java.io.IOException: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[21,69] 
Message: The entity name must immediately follow the '&' in the entity reference. 
at org.apache.hadoop.examples.XMLRecordCount$Map.map(XMLRecordCount.java:197) 
at org.apache.hadoop.examples.XMLRecordCount$Map.map(XMLRecordCount.java:1) 
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145) 
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) 
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) 
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214) 
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[21,69] 
Message: The entity name must immediately follow the '&' in the entity reference. 
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:594) 
at org.apache.hadoop.examples.XMLRecordCount$Map.map(XMLRecordCount.java:168)

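As I understand it, this is caused by the & character in the data. For example: "<name>Alen & Bob </name>"

I am processing logs that contain data like the above, and because of it the entire job fails.

I could preprocess the data as a workaround, but that may not be an efficient option for me.

Can you suggest how I can skip such bad records, or simply replace such characters using the Java XML API?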


2013-07-01 Dev.Next

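In the map function (in the example you posted), instead of catching and rethrowing every exception, catch only XMLStreamException and do not rethrow it. Nothing is emitted for that record, and the job does not fail. You may still want to increment a counter to keep track of invalid records.

Pseudocode: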

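@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    try {
        XMLStreamReader reader = ...
        context.write(...);
    } catch (XMLStreamException e) {
        // Skip the malformed record, but count it so it stays visible.
        // INVALID_RECORDS is assumed to be an enum constant you define yourself.
        context.getCounter(INVALID_RECORDS).increment(1);
    }
}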
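The skip-and-count idea can be tried outside Hadoop. The following is a minimal sketch using only the JDK's StAX API; `RecordValidator`, `isWellFormed`, and `countInvalid` are hypothetical names introduced here for illustration, and the plain `long` counter merely stands in for a job counter.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of Hadoop or of the question's code.
class RecordValidator {
    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    // Returns true if StAX can read the record to the end, false if it rejects it.
    static boolean isWellFormed(String record) {
        try {
            XMLStreamReader reader = FACTORY.createXMLStreamReader(new StringReader(record));
            try {
                while (reader.hasNext()) {
                    reader.next(); // a bare '&' raises XMLStreamException here
                }
            } finally {
                reader.close();
            }
            return true;
        } catch (XMLStreamException e) {
            return false; // swallow the error so one bad record cannot fail the run
        }
    }

    // Stands in for the job counter: how many records would be skipped.
    static long countInvalid(List<String> records) {
        long invalid = 0;
        for (String record : records) {
            if (!isWellFormed(record)) {
                invalid++;
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "<name>Alen & Bob </name>",
                "<name>Alen &amp; Bob </name>");
        System.out.println("invalid records: " + countInvalid(records));
    }
}
```

In a real mapper the same try/catch would wrap the parsing code, with the counter increment in the catch block as in the pseudocode above.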


2013-07-01 15:38:51 jkovacs

You could do it this way if you are using XmlParser11.java, as you mentioned: replace the '&' in the document String object like this:

document = document.replace("&", "your_desired_working_string_here");
...
...
XMLStreamReader reader = XMLInputFactory.newInstance()
        .createXMLStreamReader(new ByteArrayInputStream(document.getBytes()));

Then, when emitting from map(), you can replace "your_desired_working_string_here" with '&' again.

Hope that helps.
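The mask-and-restore round trip described in this answer might look like the sketch below. `AmpersandPlaceholder`, `TOKEN`, and `textOf` are hypothetical names, and the token value is an assumption: it has to be a string that never occurs in the real data. Note that masking every '&' also masks valid references such as `&lt;`, which then come back as the literal text `&lt;` rather than `<`.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not taken from XmlParser11.java.
class AmpersandPlaceholder {
    // Assumed placeholder; choose a string that cannot appear in your logs.
    static final String TOKEN = "__AMP__";

    // Hide every '&' so the parser never sees an entity reference.
    static String mask(String document) {
        return document.replace("&", TOKEN);
    }

    // Restore the original '&' in text that is about to be emitted.
    static String unmask(String text) {
        return text.replace(TOKEN, "&");
    }

    // Parses the masked document and returns its character data, unmasked.
    static String textOf(String document) {
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(mask(document)));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Still malformed for a reason other than '&'.
            throw new IllegalArgumentException("record is not well-formed XML", e);
        }
        return unmask(text.toString());
    }
}
```

Calling `AmpersandPlaceholder.textOf("<name>Alen & Bob </name>")` parses without the entity error and gives back the text with its original '&'.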


2013-07-01 20:19:37

Instead of & in your XML, try using &amp;. That is, instead of <name>Alen & Bob </name>, write <name>Alen &amp; Bob </name>
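If the input cannot be fixed at its source, the same substitution could be applied to each record just before parsing. The sketch below is one possible approach, not something from the original answers: `AmpersandEscaper` and `escapeBareAmpersands` are hypothetical names, and the regular expression is an assumption that treats any '&' not followed by a name or numeric reference ending in ';' as a bare ampersand. It does not understand CDATA sections, and an undeclared named entity such as `&nbsp;` is left alone and would still be rejected by the parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the pattern is a heuristic, not a full XML lexer.
class AmpersandEscaper {
    // Matches '&' unless it already starts a reference like &amp; &#38; or &#x26;
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z_:][A-Za-z0-9_.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Rewrites bare ampersands as &amp; and leaves existing references untouched.
    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }
}
```

Because the parser expands `&amp;` back to '&' on its own, the text emitted from the mapper needs no extra restore step with this variant.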


2013-07-03 06:41:05
