Using Apache Tika to get the content, keywords, and title of a page

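Is there anything wrong with this code? If I add the line `String ct = t.parseToString(content);` below `Tika t = new Tika();`, I get back the actual content of the URL, but after that I get null for the keywords, title, and authors. If I delete that line, I get the actual values for the title, authors, and keywords. Why does this happen?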

    HttpGet request = new HttpGet("http://xyz.com/d/index.html");

    HttpResponse response = client.execute(request);
    HttpEntity entity = response.getEntity();
    InputStream content = entity.getContent();
    System.out.println(content);

    Tika t = new Tika();
    String ct = t.parseToString(content);
    System.out.println(ct);

    Metadata md = new Metadata();

    Reader r = t.parse(content, md);
    System.out.println(md);

    System.out.println("Keywords: " + md.get("keywords"));
    System.out.println("Title: " + md.get("title"));
    System.out.println("Authors: " + md.get("authors"));


2011-07-16 ferhan

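You are reading the same stream more than once. Once a stream has been read to the end, you cannot read it again. Do something like this: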

HttpResponse response = client.execute(request);
HttpEntity entity = response.getEntity();

// Buffer the whole response once. For streamToByteArray, see
// http://stackoverflow.com/questions/1264709/convert-inputstream-to-byte-in-java
byte[] content = streamToByteArray(entity.getContent());

// Give each parse call its own fresh stream over the buffered bytes.
String ct = t.parseToString(new ByteArrayInputStream(content));
System.out.println(ct);

Metadata md = new Metadata();
Reader r = t.parse(new ByteArrayInputStream(content), md);
System.out.println(md);
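
To see why the second parse comes back empty, here is a minimal sketch that uses only the standard library. It assumes an in-memory `ByteArrayInputStream` as a stand-in for the HTTP response stream; the class name `StreamExhaustionDemo` and the helper `drain` are made up for illustration and are not part of Tika or HttpClient.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class StreamExhaustionDemo {

    // Hypothetical helper: reads every remaining byte and returns how many were read.
    static int drain(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><title>demo</title></html>".getBytes(StandardCharsets.UTF_8);
        InputStream content = new ByteArrayInputStream(html);

        // The first consumer sees all of the bytes.
        int first = drain(content);
        // The second consumer starts at the end of the stream and sees nothing.
        int second = drain(content);

        System.out.println("first pass read " + first + " of " + html.length + " bytes");
        System.out.println("second pass read " + second + " bytes");
    }
}
```

The second pass has nothing left to read, which is the same situation the second Tika call is in when it is handed a stream that the first call already consumed.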


2011-07-16 03:14:41 sbridges

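Your code confuses me. What is content? And where do we use content..? – ferhan

Updated, content is a byte[] – sbridges

Do I have to create a method 'streamToByteArray' myself, or do I have to include something? I'm asking because I get an error... – ferhan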
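
`streamToByteArray` is not a library method, so it has to be written by hand or replaced with an equivalent utility. One possible version, as a sketch using only the standard library, is shown below. The class name `StreamBuffering` and the 4096-byte buffer size are arbitrary choices for illustration, and the in-memory source stream stands in for `entity.getContent()`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class StreamBuffering {

    // Copies everything remaining in the stream into a byte array held in memory.
    static byte[] streamToByteArray(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for entity.getContent(); any InputStream works here.
        InputStream source = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        byte[] content = streamToByteArray(source);

        // Each consumer gets its own fresh stream over the same buffered bytes.
        String firstRead = new String(
                streamToByteArray(new ByteArrayInputStream(content)), StandardCharsets.UTF_8);
        String secondRead = new String(
                streamToByteArray(new ByteArrayInputStream(content)), StandardCharsets.UTF_8);

        System.out.println("both reads match: " + firstRead.equals(secondRead));
    }
}
```

This holds the entire response in memory, which is reasonable for a single HTML page but may not be for very large downloads.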
