2014-02-10 100 views
0

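Split XML into smaller XML files of a specified size

I am very new to XML, and the bad news is that I have an XML file with the following structure: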

<record> 
    <record_id>200</record_id> 
    <record_rows> 
     <record_row>some text</record_row> 
     ................................. 
    </record_rows> 
</record> 

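The number of record rows differs from record to record, so every record has a different size. My task is to split the file (larger than 1 GB) into separate XML files of a specified size. Which parser is best for this? I also think I need some record selection strategy to get close to the target size, but I can't come up with one, given the size of the input file and the fact that the size of the next record is unpredictable.

You are my only hope, my friends. How would you approach this problem?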

+0

Does the size have to be exact? (And do the resulting files need to be _valid_ XML?) –

+0

The files should be as close to the specified size as possible, but it does not have to be exact. The files should be valid XML. – StackExploded

+1

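"Which parser" is a matter of opinion. So is "how would you actually do it"... but my own suggestion would be to modify the standard SAX read-and-write-back example so that each time it exits a "" element, it checks the length of the output document and, if that is too close to the boundary, closes that file and starts a new one. – keshlam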

Answers

1

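Assuming a single record row never exceeds the desired size of one file, you can use a SAX parser to read the file sequentially, counting the characters read and storing the data read so far in a buffer. When the character count gets close to your size limit, you write a new file containing only the rows buffered so far, reset the buffer and the character count, and keep reading the next set until the limit is reached again, and so on. In the end you will have a set of files of roughly the same size (except the last one, which may be smaller) that together contain the same data.

To use a SAX parser you will need an executable class with the code below (PATH is relative to the directory where you run the application):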
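The splitting rule described above can be tried out on plain numbers before touching any XML. This is only a sketch: `SplitPlanner` and `plan` are hypothetical names that do not appear in the answer's code, and the row sizes are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of the answer's code): decides which rows end up in which file.
final class SplitPlanner {

    // Groups consecutive row sizes (in characters) into chunks. A chunk is
    // closed as soon as its running total reaches maxSize, so a chunk can
    // overshoot the limit by at most one row.
    static List<List<Integer>> plan(List<Integer> rowSizes, long maxSize) {
        List<List<Integer>> chunks = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long count = 0;
        for (int size : rowSizes) {
            current.add(size);
            count += size;
            if (count >= maxSize) {      // limit reached: close this chunk
                chunks.add(current);
                current = new ArrayList<>();
                count = 0;
            }
        }
        if (!current.isEmpty()) {        // leftover rows form the last, smaller chunk
            chunks.add(current);
        }
        return chunks;
    }

    public static void main(String[] args) {
        // prints [[400, 300, 500], [200, 900], [100]]
        System.out.println(plan(Arrays.asList(400, 300, 500, 200, 900, 100), 1000));
    }
}
```

Because the check happens after a row is added, every file except the last is at least the target size and exceeds it by less than one row. If files must stay under the limit instead, the check would have to happen before adding the row.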

import java.io.*; 
import javax.xml.parsers.*; 
import org.xml.sax.*; 

public class SAXReader { 

    public static final String PATH = "src/main/resources"; 

    public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException { 
     SAXParserFactory spf = SAXParserFactory.newInstance(); 
     SAXParser sp = spf.newSAXParser(); 
     XMLReader reader = sp.getXMLReader(); 
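     reader.setContentHandler(new DataSaxHandler()); // DataSaxHandler is the handler class shown further down 
     try (InputStream in = new FileInputStream(new File(PATH, "data.xml"))) { // close the input stream when parsing ends 
      reader.parse(new InputSource(in)); 
     } 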
    } 
} 

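This expects your XML file at src/main/resources/data.xml. You will probably want to change that.

If the split files are to be well-formed XML, each of them needs a root element, and they should probably keep information such as record_id so that you can tell which record they came from. I added an attribute, part, containing a sequence number for ordering the file fragments. The resulting files look like this: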

data_part_1.xml

<record part='1'><record_id>200</record_id><record_rows><record_row>...</record_row><record_row>...</record_row> ... <record_row>...</record_row></record_rows></record> 

data_part_2.xml

<record part='2'><record_id>200</record_id><record_rows><record_row>...</record_row><record_row>...</record_row> ... <record_row>...</record_row></record_rows></record> 

...

data_part_n.xml

<record part='n'><record_id>200</record_id><record_rows><record_row>...</record_row><record_row>...</record_row><record_row>...</record_row><record_row>...</record_row></record_rows></record> 

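where 'n' is the number of files created.

A SAX ContentHandler implementation that produces this result is shown below. You may want to change the DIRECTORY and MAX_SIZE constants: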
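Since the fragments are supposed to be well-formed XML, it may be worth checking them once they have been written. The following is a minimal sketch, not part of the original code: `FragmentCheck` and `isWellFormed` are hypothetical names, and the directory is assumed to be the target/results output directory used by the handler.

```java
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of the answer's code): re-parses each fragment.
final class FragmentCheck {

    // Returns true if the file parses as well-formed XML, false if the
    // parser reports a fatal error.
    static boolean isWellFormed(File file) throws IOException, ParserConfigurationException {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(file, new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed output directory; adjust to wherever the fragments were written.
        File[] parts = new File("target/results").listFiles((dir, name) -> name.endsWith(".xml"));
        if (parts == null) {
            System.out.println("output directory not found");
            return;
        }
        for (File part : parts) {
            System.out.println(part.getName() + ": " + (isWellFormed(part) ? "ok" : "NOT well-formed"));
        }
    }
}
```

This only checks well-formedness. Validating against a schema or DTD would need a validating parser and the schema itself, which the question does not provide.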

import java.io.*; 
import org.xml.sax.*; 
import org.xml.sax.helpers.DefaultHandler; 

class DataSaxHandler extends DefaultHandler { 

    // Change this to the directory where the files will be stored 
    public static final String DIRECTORY = "target/results"; 

    // Change this to the approximate size of the resulting files (in characters) 
    public static final long MAX_SIZE = 1024; 


    public static final long TAG_CHAR_SIZE = 5; //"<></>" 

    // counts number of files created 
    private int fileCount = 0; 

    // counts characters to decide where to split file 
    private long charCount = 0; 
    // data line buffer (is reset when the file is split) 
    private StringBuilder recordRowDataLines = new StringBuilder(); 

    // temporary variables used for the parser events 
    private String currentElement = null; 
    private String currentRecordId = null; 
    private String currentRecordRowData = null; 

    @Override 
    public void startDocument() throws SAXException { 
     File dir = new File(DIRECTORY); 
     if (!dir.exists()) { 
      dir.mkdirs(); // mkdirs() also creates missing parent directories such as "target" 
     } 
    } 

    @Override 
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException { 
     currentElement = qName; 
     // characters() may deliver the text of one element in several chunks, 
     // so start each text-bearing element with an empty value and append to it 
     if (qName.equals("record_id")) { 
      currentRecordId = ""; 
     } else if (qName.equals("record_row")) { 
      currentRecordRowData = ""; 
     } 
    } 

    @Override 
    public void endElement(String uri, String localName, String qName) throws SAXException { 
     // end of this record's rows - save whatever is still buffered 
     // (skipped when the last row already triggered a save, to avoid an empty file) 
     if (qName.equals("record_rows") && recordRowDataLines.length() > 0) { 
      try { 
       saveFragment(); 
      } catch (IOException ex) { 
       throw new SAXException(ex); 
      } 
     } 
     if (qName.equals("record_row")) { // one row finished - save in buffer & calculate size so far 
      charCount += tagSize("record_row"); 
      recordRowDataLines.append("<record_row>") 
           .append(currentRecordRowData) 
           .append("</record_row>"); 
      if (charCount >= MAX_SIZE) { // if max size was reached, save what was read so far in a new file 
       try { 
        saveFragment(); 
       } catch (IOException ex) { 
        throw new SAXException(ex); 
       } 
      } 
     } 
     currentElement = null; 
    } 

    @Override 
    public void characters(char[] ch, int start, int length) throws SAXException { 
     if (currentElement == null) { 
      return; // text outside the elements we track (e.g. whitespace between tags) 
     } 
     String text = new String(ch, start, length); 
     if (currentElement.equals("record_id")) { 
      currentRecordId += text; // append: this may be one of several chunks 
     } 
     if (currentElement.equals("record_row")) { 
      currentRecordRowData += text; // append: this may be one of several chunks 
      charCount += length; // storing size so far 
     } 
    } 

    public long tagSize(String tagName) { 
     return TAG_CHAR_SIZE + tagName.length() * 2; // size of text + tags 
    } 

    /** 
    * Saves a new file containing approximately MAX_SIZE in chars 
    */ 
    public void saveFragment() throws IOException { 
     ++fileCount; 
     StringBuilder fileContent = new StringBuilder(); 
     fileContent.append("<record part='") 
        .append(fileCount) 
        .append("'><record_id>") 
        .append(currentRecordId) 
        .append("</record_id>") 
        .append("<record_rows>") 
        .append(recordRowDataLines) 
        .append("</record_rows></record>"); 
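     File fragment = new File(DIRECTORY, "data_part_" + fileCount + ".xml"); 
     // write as UTF-8 (what XML parsers assume when there is no XML declaration) 
     // and close the file even if writing fails 
     try (Writer out = new OutputStreamWriter(new FileOutputStream(fragment), java.nio.charset.StandardCharsets.UTF_8)) { 
      out.write(fileContent.toString()); 
     } 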

     // reset fragment data - record buffer and char count 
     recordRowDataLines = new StringBuilder(); 
     charCount = 0; 
    } 

}
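
One caveat with the handler above: SAX delivers element text already unescaped, so a row containing `&` or `<` would be written back as-is and the fragment would no longer be well-formed. A minimal sketch of an escaping helper follows. `XmlText` and `escape` are hypothetical names, not part of the original code, and whether your data actually contains such characters is an assumption you would need to check.

```java
// Hypothetical helper (not part of the answer's code): escapes element text
// before it is written between tags.
final class XmlText {

    // Replaces the characters that must not appear literally in element
    // content (& and <), plus > for safety.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

To use it, wrap the row text in endElement, for example `.append(XmlText.escape(currentRecordRowData))`, and do the same for currentRecordId in saveFragment. The escaped text is slightly longer than the raw text, so the character count becomes a little more approximate.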