How to make Apache Tika find text in .java and .xml (etc.) files

OK. I figured out how to use Apache Tika to search some of the file types it can handle, without supplying much more code than already exists in tika-example:

public class MyFirstTika { 

    public static boolean contains(File file, String s) throws MalformedURLException, 
    IOException, MimeTypeException, SAXException, TikaException{ 

    ContentHandler handler = new BodyContentHandler(); 

    MimeTypes mimeRegistry = TikaConfig.getDefaultConfig().getMimeRepository(); 

    Detector mimeDetector = (Detector) mimeRegistry; 

    LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile(FileUtils.readFileToString(file))); 

    Parser parser = TikaConfig.getDefaultConfig().getParser(MediaType.parse(mimeRegistry.getMimeType(file).getName())); 

    Metadata parsedMet = new Metadata(); 

    parser.parse(file.toURI().toURL().openStream(), handler,parsedMet, new ParseContext()); 

    return handler.toString().toLowerCase().contains(s.toLowerCase()); 
    } 

    public static void main(String[] args) throws Exception 
    { 
    String searchString = "champion"; 
    String filename = "schedule.pdf"; //test.docx";//"meds.xlsx";//Test2.Doc"; 
    File file = new File(filename); 

    System.out.println(file + " contains " + searchString + ": " 
      + contains(file, searchString)); 
    } 
}

The code above can tell whether files of the following types contain a word or phrase: .doc, .docx, .xlsx, .pdf, .txt, .html

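It does not work for .java or .xml files. (a) What should I do if I want to check whether a text file with a .java or .xml extension contains a word or phrase? (b) These are not the only types of files I routinely create or edit. Is there a way to have Apache Tika detect whether a file is a text file without my specifying its extension?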
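For question (b), one rough way to guess whether a file is text without looking at its extension is to sample its first bytes. The sketch below is a JDK-only heuristic under stated assumptions, not a description of how Tika's detector works: the class name TextSniffer, the method looksLikeText, the 4096-byte sample and the 5% threshold are all made up for illustration. UTF-16 text contains NUL bytes and would be misclassified as binary.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TextSniffer {

    // Hypothetical helper: guesses "text" when the sample contains no NUL byte
    // and at most 5% other control characters. The threshold is an assumption.
    public static boolean looksLikeText(byte[] sample, int length) {
        if (length == 0) {
            return true; // an empty file is treated as text
        }
        int suspicious = 0;
        for (int i = 0; i < length; i++) {
            int b = sample[i] & 0xFF;
            if (b == 0) {
                return false; // NUL bytes almost never occur in ASCII/UTF-8 text
            }
            boolean allowedControl = b == '\t' || b == '\n' || b == '\r' || b == '\f';
            if (b < 0x20 && !allowedControl) {
                suspicious++;
            }
        }
        return suspicious * 20 <= length; // suspicious / length <= 5%
    }

    // Reads up to 4096 bytes from the start of the file and applies the heuristic.
    public static boolean looksLikeText(Path file) throws IOException {
        byte[] buffer = new byte[4096];
        try (InputStream in = Files.newInputStream(file)) {
            int n = in.readNBytes(buffer, 0, buffer.length); // requires Java 9+
            return looksLikeText(buffer, n);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "copy.java");
        System.out.println(file + " looks like text: " + looksLikeText(file));
    }
}
```

A check like this only says "probably text"; it does not identify the format, so it would be a pre-filter in front of the search rather than a replacement for MIME detection.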

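Edit, for background: I wrote a Windows search program that works better than the built-in search command. Now I am trying to add a search for specific text inside the files that match the pattern.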

Edit

Here is the output of the program (modified to print the information below) when I search for void in copy.java:

Examining: [copy.java] 
The MIME type (based on filename) is: [text/x-java-source] 
The MIME type (based on MAGIC) is: [application/octet-stream 
The MIME type (based on the Detector interface) is: [text/plain] 
The language of this content is: [et] 
Parsed Metadata: 

Parsed Text: 

copy.java contains void: false

So why didn't it find void? (Answer: because it found no Parsed Metadata and no Parsed Text. But why didn't it find those? It should have displayed the entire file.)

I copied copy.java to copy.txt, and the program did find void in the copy. The same thing happened when I copied build.xml to build.txt.
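Since the .txt copies are searched correctly, one workaround for files already known to be plain text is to skip parsing and search the raw contents with the JDK alone. This is a minimal sketch, assuming an ASCII search phrase and an ASCII-compatible file encoding; PlainTextSearch and its methods are hypothetical names and are not part of Tika.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

public class PlainTextSearch {

    // Case-insensitive substring test; Locale.ROOT avoids locale-specific
    // case mappings (for example the Turkish dotless i).
    public static boolean containsIgnoreCase(String text, String phrase) {
        return text.toLowerCase(Locale.ROOT).contains(phrase.toLowerCase(Locale.ROOT));
    }

    // Reads the whole file into memory and searches it as text.
    // Assumption: ISO-8859-1 is used only because it maps every byte to a
    // character, so decoding never fails and ASCII phrases still match.
    public static boolean contains(Path file, String phrase) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
        return containsIgnoreCase(text, phrase);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "copy.java");
        String phrase = args.length > 1 ? args[1] : "void";
        System.out.println(file + " contains " + phrase + ": " + contains(file, phrase));
    }
}
```

This avoids renaming files to .txt, but it only makes sense for plain text; formats such as .docx or .pdf still need a real parser.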

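Maybe this added information helps answer the question: "How do I get .java, .xml, .c and similar files handled like text files?"

Note the output from searching copy.TXT: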

run: 
Examining: [copy.TXT] 
The MIME type (based on filename) is: [text/plain] 
The MIME type (based on MAGIC) is: [application/octet-stream 
The MIME type (based on the Detector interface) is: [text/plain] 
The language of this content is: [et]

Parsed Metadata:

Content-Encoding=UTF-8 Content-Type=text/plain; charset=UTF-8

Parsed Text:

public static void main(String[] args) throws IOException { 
    EventQueue.invokeLater(new Runnable() 
     { @Override 
      public void run() { 
       insUserIO = new UserIO(); 
      } 
     } 
    ); 
    } 

copy.TXT contains void: true 
BUILD SUCCESSFUL (total time: 1 second)

The entire revised program

package org.apache.tika.example; 

    import java.io.File; 
    import java.io.IOException; 
    import java.net.MalformedURLException; 

    import org.apache.commons.io.FileUtils; 
    import org.apache.tika.config.TikaConfig; 
    import org.apache.tika.detect.Detector; 
    import org.apache.tika.exception.TikaException; 
    import org.apache.tika.language.LanguageIdentifier; 
    import org.apache.tika.language.LanguageProfile; 
    import org.apache.tika.metadata.Metadata; 
    import org.apache.tika.mime.MediaType; 
    import org.apache.tika.mime.MimeTypeException; 
    import org.apache.tika.mime.MimeTypes; 
    import org.apache.tika.parser.ParseContext; 
    import org.apache.tika.parser.Parser; 
    import org.apache.tika.sax.BodyContentHandler; 
    import org.xml.sax.ContentHandler; 
    import org.xml.sax.SAXException; 

    public class MyFirstTika { 

     static boolean debugging = true; 

     public static boolean contains(File file, String s) throws MalformedURLException, IOException, MimeTypeException, SAXException, TikaException{ 

     ContentHandler handler = new BodyContentHandler(); 

      MimeTypes mimeRegistry = TikaConfig.getDefaultConfig() 
        .getMimeRepository(); 

      if(debugging) System.out.println("Examining: [" + file + "]"); 

      if(debugging) System.out.println("The MIME type (based on filename) is: [" 
        + mimeRegistry.getMimeType(file.toString()) + "]"); 

      if(debugging) System.out.println("The MIME type (based on MAGIC) is: [" 
        + mimeRegistry.getMimeType(file + "]")); 

      Detector mimeDetector = (Detector) mimeRegistry; 
      if(debugging) System.out 
        .println("The MIME type (based on the Detector interface) is: [" 
          + mimeDetector.detect(file.toURI().toURL() 
            .openStream(), new Metadata()) + "]"); 

      LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile(
        FileUtils.readFileToString(file))); 

      if(debugging) System.out.println("The language of this content is: [" 
        + lang.getLanguage() + "]"); 

      Parser parser = TikaConfig.getDefaultConfig().getParser(
        MediaType.parse(mimeRegistry.getMimeType(file).getName())); 

     Metadata parsedMet = new Metadata(); 
      parser.parse(file.toURI().toURL().openStream(), handler, 
        parsedMet, new ParseContext()); 

      if(debugging) System.out.println("Parsed Metadata: "); 
      if(debugging) System.out.println(parsedMet); 
      if(debugging) System.out.println("Parsed Text: "); 
      if(debugging) System.out.println(handler.toString()); 
     return handler.toString().toLowerCase().contains(s.toLowerCase()); 
     } 

     public static void main(String[] args) throws Exception 
     { 
     File file = new File(filename); 

     System.out.println(file + " contains " + searchString + ": " 
       + contains(file, searchString)); 
     } 

     static String searchString = "void"; 
     static String filename = "copy.TXT"; 
    }

Source

2015-08-08 DSlomer64

Which version of Apache Tika are you using? If it isn't the latest (1.10 as of August 2015), what happens when you upgrade? – Gagravarr

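@Gagravarr - I downloaded 1.9 a few days ago. I didn't ask for 1.9 or 1.10; I just took whatever was offered that day at www.apache.tika. But since '.java' and '.xml' files are ASCII text (aren't they?), the text parser should work on them just as it does on '.txt'. Is that wrong? – DSlomer64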

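Apache Tika has a bunch of JUnit unit tests that successfully extract the text content of Java and XML files, so you must be doing something wrong. What happens if you follow the examples on the Tika website and use a simple 'AutoDetectParser' instead of your current odd and complicated setup? – Gagravarr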

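Many thanks to @Gagravarr for steering me to AutoDetectParser!

The small program below finds text in all sorts of "text" (human-readable) files, as well as others (e.g., .doc*), including private.properties, ParsingExample.java (itself!), test.doc, test.pdf (produced by Word), gradlew.bat, and so on.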

package org.apache.tika.example; 
import java.io.File; 
import java.io.FileInputStream; 
import java.io.IOException; 
import java.io.InputStream; 
import java.net.MalformedURLException; 
import org.apache.tika.exception.TikaException; 
import org.apache.tika.metadata.Metadata; 
import org.apache.tika.mime.MimeTypeException; 
import org.apache.tika.parser.AutoDetectParser; 
import org.apache.tika.sax.BodyContentHandler; 
import org.xml.sax.SAXException; 

public class ParsingExample { 

    public static boolean contains(File file, String s) throws MalformedURLException, 
        IOException, MimeTypeException, SAXException, TikaException 
    { 
    AutoDetectParser parser = new AutoDetectParser(); 
    // -1 disables BodyContentHandler's default write limit of 100,000 characters 
    BodyContentHandler handler = new BodyContentHandler(-1); 
    Metadata   metadata = new Metadata(); 
    // try-with-resources closes the stream even if parsing fails 
    try (InputStream stream = new FileInputStream(file)) { 
     parser.parse(stream, handler, metadata); 
     return handler.toString().toLowerCase().contains(s.toLowerCase()); 
    } 
    catch (IOException | SAXException | TikaException e){ 
     // report the problem and treat an unreadable or unparsable file as "no match" 
     System.out.println(file + ": " + e + "\n"); 
     return false; 
    } 
    } 
    public static void main(String[] args) 
    { 
     try { 
     System.out.println("File " + filename + " contains <" + searchString + "> : " 
      + contains(new File(filename), searchString)); 
     } catch (IOException | SAXException | TikaException ex){ 
     System.out.println("Error: " + ex); 
     } 
    } 

    static String parseExample = ":("; 
    static String searchString = "slom"; 
    static String filename = "C:\\Users\\Dov\\x.pdf"; 
} 
    /** 
    * Example of how to use Tika to parse a file when you do not know its file type 
    * ahead of time. 
    * 
    * AutoDetectParser attempts to discover the file's type automatically, then call 
    * the exact Parser built for that file type. 
    * 
    * The stream to be parsed by the Parser. In this case, we get a file from the 
    * resources folder of this project. 
    * 
    * Handlers are used to get the exact information you want out of the host of 
    * information gathered by Parsers. The body content handler, intuitively, extracts 
    * everything that would go between HTML body tags. 
    * 
    * The Metadata object will be filled by the Parser with Metadata discovered about 
    * the file being parsed. 
    * 
    * Note: This example will extract content from the outer document and all 
    * embedded documents. However, if you choose to use a {@link ParseContext}, 
    * make sure to set a {@link Parser} or else embedded content will not be 
    * parsed. 
    * 
    * @return The content of a file. 
    * I let Netbeans add next 3 lines: 
    * @throws java.io.IOException 
    * @throws org.xml.sax.SAXException 
    * @throws org.apache.tika.exception.TikaException 
    */

Source

2015-08-09 18:15:56 DSlomer64

Instead of 'FileInputStream', I'd suggest 'InputStream stream = TikaInputStream.get(file)'; it will be slightly faster and use less memory on some file types. – Gagravarr
