0
PDFTextStripper stripper = new PDFText2HTML(encoding);
String result = stripper.getText(document).trim();
結果解析包含有類似PDFTextStripper與錯誤的編碼
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat
SeLe EE rev</title> <meta http-equiv="Content-Type"
content="text/html; charset=utf-8"> </head> <body> <div
style="page-break-before:always;
page-break-after:always"><div><p>�&#...
,而不是
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat
SeLe EE rev</title> <meta http-equiv="Content-Type"
content="text/html; charset=utf-8"> </head> <body> <div
style="page-break-before:always; page-break-after:always"><div><p>any
blablabla characters...
當我改變編碼窗口1252或UTF-8的結果沒有改變。 Bad pdf url http://www.permaco.ch/fileadmin/user_upload/jobs/Inserat_SeLe_EE_rev.pdf
如何解析此pdf?
+1很好的答案,很好的總結。 – nickb