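Assuming that no single record row is larger than the size you want for each file, you can use a SAX parser to read the file sequentially, counting the characters read and storing the data read so far in a buffer. When the character count reaches a value close to your size limit, the handler creates a new file containing only the rows read so far, resets the buffer and the character count, and continues reading the next set of rows until the limit is reached again, and so on. In the end you have a set of files of roughly the same size (the last one may be smaller) that together contain the same data as the original.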
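The count-and-flush idea can be looked at in isolation before any XML is involved. The following is only a minimal sketch under that simplification: `RowBatcher` and `split` are hypothetical names, and the rows are plain strings rather than parser events.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only: groups rows into batches,
// closing a batch as soon as its character count reaches maxChars.
class RowBatcher {
    static List<List<String>> split(List<String> rows, long maxChars) {
        List<List<String>> batches = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        long charCount = 0;
        for (String row : rows) {
            buffer.add(row);
            charCount += row.length();
            if (charCount >= maxChars) { // limit reached: flush and reset
                batches.add(buffer);
                buffer = new ArrayList<>();
                charCount = 0;
            }
        }
        if (!buffer.isEmpty()) { // the last batch may be smaller
            batches.add(buffer);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("aaaa", "bbbb", "cccc", "dddd", "ee");
        // prints [[aaaa, bbbb], [cccc, dddd], [ee]]
        System.out.println(split(rows, 8));
    }
}
```

Because a batch is closed only after the row that crosses the limit has been added, a batch can exceed the limit by up to one row, which is why the sizes are approximate.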
To use the SAX parser, you will need an executable class containing the code below (paths are relative to where you run the application):
import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.*;

public class SAXReader {
    // Directory that contains data.xml
    public static final String PATH = "src/main/resources";

    public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser sp = spf.newSAXParser();
        XMLReader reader = sp.getXMLReader();
        reader.setContentHandler(new DataSaxHandler()); // DataSaxHandler is implemented further down
        // try-with-resources closes the input stream even if parsing fails
        try (InputStream in = new FileInputStream(new File(PATH, "data.xml"))) {
            reader.parse(new InputSource(in));
        }
    }
}
Your XML file is expected at src/main/resources/data.xml. You may want to change that.
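If the split files are to be well-formed XML, each of them also needs a root element, and they should probably keep information such as record_id so that you can tell which record they came from. I added an attribute, part, that holds the sequence number of the file fragment. The generated files look like this: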
data_part_1.xml
<record part='1'><record_id>200</record_id><record_rows><record_row>...</record_row><record_row>...</record_row> ... <record_row>...</record_row></record_rows></record>
data_part_2.xml
<record part='2'><record_id>200</record_id><record_rows><record_row>...</record_row><record_row>...</record_row> ... <record_row>...</record_row></record_rows></record>
...
data_part_n.xml
<record part='n'><record_id>200</record_id><record_rows><record_row>...</record_row><record_row>...</record_row><record_row>...</record_row><record_row>...</record_row></record_rows></record>
where 'n' is the number of files created.
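One way to confirm that a fragment with this shape is well-formed is to run it through a parser again. This is only a sketch: `WellFormedCheck` and `isWellFormed` are hypothetical names, and the fragment is passed in as a string instead of being read from one of the generated files.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
class WellFormedCheck {
    // Returns true if the text parses as XML, false if the parser rejects it.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // the parser reported a well-formedness error
        } catch (Exception e) {
            // parser configuration or I/O problem, unrelated to the content
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<record part='1'><record_id>200</record_id>"
                + "<record_rows><record_row>a</record_row></record_rows></record>";
        String bad = "<record part='1'><record_rows></record>";
        System.out.println(isWellFormed(good));
        System.out.println(isWellFormed(bad));
    }
}
```

This checks well-formedness only. Validating against a schema or DTD would be a separate step.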
The SAX ContentHandler implementation that produces this result is shown below. You may want to change the DIRECTORY and MAX_SIZE constants:
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
class DataSaxHandler extends DefaultHandler {
    // Change this to the directory where the files will be stored
    public static final String DIRECTORY = "target/results";
    // Change this to the approximate size of the resulting files (in characters)
    public static final long MAX_SIZE = 1024;
    // Markup characters in a start/end tag pair, not counting the tag name: "<></>"
    public static final long TAG_CHAR_SIZE = 5;

    // counts number of files created
    private int fileCount = 0;
    // counts characters to decide where to split file
    private long charCount = 0;
    // data line buffer (is reset when the file is split)
    private StringBuilder recordRowDataLines = new StringBuilder();

    // temporary variables used for the parser events
    private String currentElement = null;
    private String currentRecordId = null;
    private String currentRecordRowData = null;

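    @Override
    public void startDocument() throws SAXException {
        File dir = new File(DIRECTORY);
        // mkdirs() also creates missing parent directories such as "target"
        if (!dir.exists() && !dir.mkdirs()) {
            throw new SAXException("Could not create directory " + DIRECTORY);
        }
    }
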
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        currentElement = qName;
        // start each value empty; characters() appends to it
        if (qName.equals("record_id")) {
            currentRecordId = "";
        } else if (qName.equals("record_row")) {
            currentRecordRowData = "";
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        // no more rows - save what is left in the buffer as the last file
        // (skipped when the buffer is empty, to avoid writing a file without rows)
        if (qName.equals("record_rows") && recordRowDataLines.length() > 0) {
            try {
                saveFragment();
            } catch (IOException ex) {
                throw new SAXException(ex);
            }
        }
        if (qName.equals("record_row")) { // one row finished - save in buffer & calculate size so far
            charCount += tagSize("record_row");
            recordRowDataLines.append("<record_row>")
                    .append(currentRecordRowData)
                    .append("</record_row>");
            if (charCount >= MAX_SIZE) { // if max size was reached, save what was read so far in a new file
                try {
                    saveFragment();
                } catch (IOException ex) {
                    throw new SAXException(ex);
                }
            }
        }
        currentElement = null;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (currentElement == null) {
            return;
        }
        // The parser may deliver the text of one element in several chunks,
        // so append each chunk instead of overwriting the value.
        String chunk = new String(ch, start, length);
        if (currentElement.equals("record_id")) {
            currentRecordId += chunk;
        }
        if (currentElement.equals("record_row")) {
            currentRecordRowData += chunk;
            charCount += chunk.length(); // storing size so far
        }
    }

    public long tagSize(String tagName) {
        // markup characters plus the tag name in both the start and the end tag
        return TAG_CHAR_SIZE + tagName.length() * 2;
    }

    /**
     * Saves a new file containing approximately MAX_SIZE characters of row data
     */
    public void saveFragment() throws IOException {
        ++fileCount;
        StringBuilder fileContent = new StringBuilder();
        fileContent.append("<record part='")
                .append(fileCount)
                .append("'><record_id>")
                .append(currentRecordId)
                .append("</record_id>")
                .append("<record_rows>")
                .append(recordRowDataLines)
                .append("</record_rows></record>");
        File fragment = new File(DIRECTORY, "data_part_" + fileCount + ".xml");
        // Write as UTF-8, which is what an XML parser assumes when there is no XML declaration.
        // try-with-resources flushes and closes the writer even if the write fails.
        try (Writer out = new OutputStreamWriter(new FileOutputStream(fragment), "UTF-8")) {
            out.write(fileContent.toString());
        }
        // reset fragment data - record buffer and char count
        recordRowDataLines = new StringBuilder();
        charCount = 0;
    }
}
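One caveat with the handler above: a SAX parser reports character data with entity references already resolved, so a row that contained `&amp;` or `&lt;` in the source arrives as `&` or `<`, and writing it back verbatim would give a fragment that is not well-formed. A minimal sketch of an escaping step follows. `XmlText` and `escape` are hypothetical names, not part of the code above, and the idea is to apply it to the row text and the record id before they are appended to the output.

```java
// Hypothetical helper for illustration only: escapes the characters that
// are not allowed to appear literally in XML element content.
class XmlText {
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints a &lt; b &amp;&amp; c
        System.out.println(escape("a < b && c"));
    }
}
```

Escaped text is longer than the parsed text, so if this is used, counting the length of the escaped string keeps the size estimate closer to what is actually written.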
Does the size have to be exact? (And do the resulting files need to be _valid_ XML?) –
The files should be as close to the specified size as possible, but it does not have to be exact. The files should be valid XML – StackExploded
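"Which parser" is a matter of opinion. So is "how would you actually do it"... but my own suggestion would be to modify the standard SAX read-and-write-back example so that each time it exits a "", it checks the length of the output document and, if it is too close to the limit, ends that file and starts a new one. – keshlam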