Remove special characters from text/PDF using Apache Tika

I am parsing a PDF file to extract its text using Apache Tika.
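
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.sax.BodyContentHandler;

    // Content handler that collects the extracted body text
    BodyContentHandler handler = new BodyContentHandler();
    // Metadata object populated by the parser
    Metadata metadata = new Metadata();
    // Parse context passed along to the parser
    ParseContext pcontext = new ParseContext();
    // Open the input file; try-with-resources closes the stream afterwards
    try (InputStream inputstream = new FileInputStream(new File(faInputFileName))) {
        // Parse the document using the PDF parser from Tika
        PDFParser pdfparser = new PDFParser();
        pdfparser.parse(inputstream, handler, metadata, pcontext);
    } catch (Exception e) {
        // Report the actual exception instead of silently discarding it
        System.out.println("Exception caught: " + e);
    }
    String extractedText = handler.toString();
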
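The above code works and the text is extracted from the PDF.
There are some special characters in the PDF file (such as @, /, &, £ or the trademark symbol). How can I remove those special characters, either during the extraction or afterwards?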
With a regular expression on the string? Or with [String.replace](https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#replace(java.lang.CharSequence,%20java.lang.CharSequence))? – Gagravarr
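
The regular-expression route suggested in the comment could look roughly like the sketch below, applied to the string after extraction. The class name `SpecialCharCleaner` and the method `clean` are made up for illustration and are not part of Tika. The sketch assumes that "special" means anything that is not a Unicode letter, a digit, or whitespace, so ordinary punctuation such as `.` and `,` is removed as well; add those characters to the negated class if they should be kept.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Apache Tika: post-processes extracted text.
final class SpecialCharCleaner {
    // Assumption: a "special" character is anything that is not a
    // Unicode letter (\p{L}), a Unicode digit (\p{N}), or whitespace (\s).
    private static final Pattern SPECIAL = Pattern.compile("[^\\p{L}\\p{N}\\s]");
    // Runs of spaces or tabs; line breaks are left untouched.
    private static final Pattern SPACES = Pattern.compile("[ \\t]+");

    static String clean(String text) {
        // Replace each special character with a space so adjacent words stay separated
        String withoutSpecials = SPECIAL.matcher(text).replaceAll(" ");
        // Collapse the resulting runs of spaces and trim both ends
        return SPACES.matcher(withoutSpecials).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Stand-in for handler.toString(); \u2122 is the trademark sign, \u00a3 the pound sign
        String extractedText = "Acme\u2122 & Co. costs \u00a3100 @ shop/online";
        System.out.println(clean(extractedText));
    }
}
```

In the question's code this would be called as `SpecialCharCleaner.clean(handler.toString())`. If only one or two known characters need to go, `String.replace` with the literal character and an empty string is simpler than a pattern.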