2011-07-15

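Extracting the content (text) of a URL using Tika

How can I extract the text from a URL? My code below extracts the source code of that URL rather than its text.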

DefaultHttpClient client = new DefaultHttpClient();
client.getCredentialsProvider().setCredentials(
       new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT, AuthScope.ANY_REALM), 
       new UsernamePasswordCredentials("test", "test")); 
client.getParams().setParameter(ClientPNames.ALLOW_CIRCULAR_REDIRECTS, true);     
HttpGet request = new HttpGet("http://somehost.com");   
HttpResponse response = client.execute(request); 
HttpEntity entity = response.getEntity(); 
InputStream content = entity.getContent(); 

Tika t = new Tika(); 
Metadata md = new Metadata(); 
Reader r = t.parse(content, md); 
System.out.println(md); 
System.out.println("Yes1: " +md.get("keywords")); 
System.out.println("Yes2: " +md.get("title")); 
System.out.println("Yes3: " +md.get("authors")); 

//This gives the source code of that url not the actual content... 
String ss= t.parseToString(content); 
System.out.println("Yes4: " +ss); 

Any suggestions?

Answers

1

From what I have read, you can use this code with Tika:

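// Assumption: `content` here is an object whose getContent() returns the raw page bytes,
// not the InputStream from the question.
byte[] raw = content.getContent();
ContentHandler handler = new BodyContentHandler(); // collects the text of the document body
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser(); // chooses a parser from the detected content type
parser.parse(new ByteArrayInputStream(raw), handler, metadata, new ParseContext());
LOG.info("content: " + handler.toString());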

I did test it myself, though, and found that handler.toString() came back empty.
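Buffering the page into a byte[] first also sidesteps something that may be going wrong in the question's code: an InputStream can only be read once, yet the same `content` stream is handed to t.parse(...) and then again to t.parseToString(...). Here is a minimal JDK-only sketch of giving each pass its own stream. `StreamBuffering` and `readAll` are hypothetical names, not Tika API, and the inline HTML string just stands in for the HTTP response body.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Tika): buffers a stream so it can be read more than once.
class StreamBuffering {

    // Reads everything that is left in the stream and returns it as a byte array.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for entity.getContent() from the question.
        InputStream content = new ByteArrayInputStream(
                "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8));

        byte[] raw = readAll(content);

        // The original stream is now exhausted, so a second reader gets nothing.
        System.out.println("left in original stream: " + readAll(content).length);

        // Each pass gets a fresh stream over the same buffered bytes.
        InputStream forMetadata = new ByteArrayInputStream(raw);
        InputStream forText = new ByteArrayInputStream(raw);
        System.out.println("first pass:  " + readAll(forMetadata).length + " bytes");
        System.out.println("second pass: " + readAll(forText).length + " bytes");
    }
}
```

In the question's code, that would mean wrapping `raw` in a new ByteArrayInputStream for every Tika call instead of reusing `content`.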

1

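BoilerpipeContentHandler lets you extract the main content without the markup. The bundled command-line utility shows how to use it in a program and lets you test the various output formats.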
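A minimal sketch of how that might look in a program. It assumes Tika 1.x with the tika-parsers jar and its boilerpipe dependency on the classpath, where the class lives in org.apache.tika.parser.html; the package may differ in other versions. `MainTextExtractor` and `extractMainText` are hypothetical names, and the inline HTML stands in for the stream returned by entity.getContent() in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;

// Hypothetical wrapper class, for illustration only.
class MainTextExtractor {

    // Parses the stream once; returns the text boilerpipe keeps and fills in the metadata.
    static String extractMainText(InputStream html, Metadata metadata) throws Exception {
        StringWriter out = new StringWriter();
        // Writes the text that boilerpipe's default extractor keeps into the writer.
        BoilerpipeContentHandler handler = new BoilerpipeContentHandler(out);
        new AutoDetectParser().parse(html, handler, metadata, new ParseContext());
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample page; replace with entity.getContent() in real code.
        String page = "<html><head><title>Sample page</title></head><body>"
                + "<p>This paragraph stands in for the article text of the page, "
                + "which is the part a reader actually cares about.</p>"
                + "</body></html>";
        Metadata md = new Metadata();
        String text = extractMainText(
                new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8)), md);
        System.out.println("title: " + md.get("title"));
        System.out.println("text: " + text);
    }
}
```

Boilerpipe's extraction is heuristic, so very short pages may lose some of their text. Because the text and the metadata both come from the same parse call, the stream only needs to be read once.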

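Thanks for the reply. Is there any way to do this with Tika? – ferhan

That class is part of Tika! – fvu

An example based on my code would be much appreciated. A link to some examples would be great too. – ferhan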