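I ran into this problem with 800,000 cells and 3M characters, where XSSF allocates 1GB of heap!
I used Python with openpyxl and numpy to read the xlsx file (invoked from the Java code) and first convert it to plain text. Then I load the text file in Java. It may look like a lot of overhead, but it is actually fast.
The Python script looks like this: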
import sys

import numpy as np
import openpyxl as px

# the xlsx file is given on the command line, e.g. foo.xlsx
fname = sys.argv[1]
W = px.load_workbook(fname, read_only=True)
p = W['Sheet1']

a = []
# number of rows and columns
m = p.max_row
n = p.max_column

# collect every cell value, row by row
for row in p.iter_rows():
    for k in row:
        a.append(k.value)

# convert list a to an m x n matrix (maxRows x maxColumns)
aa = np.resize(a, [m, n])

# the output file is also given on the command line, e.g. foo.txt
oname = sys.argv[2]
print(oname)
with open(oname, "w") as file:
    for i in range(m - 1):
        for j in range(n):
            file.write("%s " % aa[i, j])
        file.write("\n")
    # write the last row separately to prevent an extra newline in the text file
    for j in range(n):
        file.write("%s " % aa[m - 1, j])
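As a variation on the script above, here is a minimal sketch that skips numpy and writes each row as it is read, so the full list of cell values is never held in memory. It assumes an openpyxl version that supports `iter_rows(values_only=True)` and the same `Sheet1` sheet name; `write_rows` is a hypothetical helper name, not part of the original script.

```python
import sys


def write_rows(rows, out, sep=" "):
    """Write each row as one line of text, with no newline after the last row.

    `rows` is any iterable of cell-value sequences (hypothetical helper).
    Returns the number of rows written.
    """
    count = 0
    for row in rows:
        if count:
            out.write("\n")
        # every value is followed by `sep`, like the "%s " format in the script above
        out.write("".join("%s%s" % (value, sep) for value in row))
        count += 1
    return count


# usage: python script.py foo.xlsx foo.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    import openpyxl  # imported here so write_rows works without openpyxl installed

    workbook = openpyxl.load_workbook(sys.argv[1], read_only=True)
    sheet = workbook["Sheet1"]  # assumption: same sheet name as above
    with open(sys.argv[2], "w") as out:
        # values_only=True yields one tuple of plain cell values per row
        write_rows(sheet.iter_rows(values_only=True), out)
    workbook.close()
```

The output format is meant to match the original script (a trailing space after each value, no newline after the last row), so the Java side should not need to change.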
Then, in my Java code, I wrote:
try {
    // `pwd`\python_script foo.xlsx foo.txt
    String pythonScript = System.getProperty("user.dir") + "\\exread.py ";
    String cmdline = "python " + pythonScript +
            workingDirectoryPath + "\\" + fullFileName + " " +
            workingDirectoryPath + "\\" + shortFileName + ".txt";
    Process p = Runtime.getRuntime().exec(cmdline);
    int exitCode = p.waitFor();
    if (exitCode != 0) {
        throw new IOException("Python command exited with " + exitCode);
    }
} catch (IOException e) {
    System.out.println(e.getMessage());
} catch (InterruptedException e) {
    ReadInfo.append(e.getMessage());
}
After that, you get foo.txt, which has the same content as foo.xlsx but in plain-text format.
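One thing to watch with this text format: values are separated by single spaces, so a cell that itself contains a space cannot be told apart from two cells once the line is split again. The small sketch below illustrates this with made-up data; it assumes the reader splits on whitespace, which is a guess, since the Java loading code is not shown here.

```python
def parse_line(line):
    """Split one line of the text output back into cell strings (hypothetical reader)."""
    return line.split()


# made-up example row: three cells, one of which contains a space
row = ["id", "New York", 42]

# same "%s " formatting as the conversion script
line = "".join("%s " % value for value in row)

cells = parse_line(line)
print(len(row), len(cells))
```

If the data can contain spaces, writing with a delimiter that does not occur in the cells (a tab, for example) would avoid the ambiguity.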
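Where are you running this code? Inside an application/web server, or standalone? – JSS 2011-02-04 12:28:47
I'm running it inside Tomcat 6.0 – miah 2011-02-04 12:48:43
What is the default memory allocated to Tomcat at startup? – JSS 2011-02-04 12:58:18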