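Extracting the content (text) of a URL using Tika
How can I extract the text from a URL? My code returns the source code of that URL rather than its actual content...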
DefaultHttpClient client = null;
client = new DefaultHttpClient();
client.getCredentialsProvider().setCredentials(
new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT, AuthScope.ANY_REALM),
new UsernamePasswordCredentials("test", "test"));
client.getParams().setParameter(ClientPNames.ALLOW_CIRCULAR_REDIRECTS, true);
HttpGet request = new HttpGet("http://somehost.com");
HttpResponse response = client.execute(request);
HttpEntity entity = response.getEntity();
InputStream content = entity.getContent();
Tika t = new Tika();
Metadata md = new Metadata();
Reader r = t.parse(content, md);
System.out.println(md);
System.out.println("Yes1: " +md.get("keywords"));
System.out.println("Yes2: " +md.get("title"));
System.out.println("Yes3: " +md.get("authors"));
//This gives the source code of that url not the actual content...
String ss= t.parseToString(content);
System.out.println("Yes4: " +ss);
Any suggestions?
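Thanks for the reply.. is there any way to do this with Tika..? – ferhan
That class is part of Tika! – fvu
Any example based on my code would be much appreciated.. a link to some examples would be great too... – ferhan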
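One thing worth checking in the posted code: `t.parse(content, md)` and `t.parseToString(content)` both read from the same `content` stream, and an `InputStream` can normally be consumed only once. Whether that explains the output here is not certain, but a minimal sketch of the usual workaround is to buffer the response bytes once and hand each consumer its own `ByteArrayInputStream`. The sketch uses only the JDK; the class name `StreamBuffering`, the helper `readFully`, and the sample HTML are made up for illustration and are not part of Tika or HttpClient.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class StreamBuffering {

    // Hypothetical helper: copies everything left in the stream into memory
    // so the bytes can be read again later.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for entity.getContent(); the HTML is invented for the demo.
        InputStream content = new ByteArrayInputStream(
                "<html><head><title>Demo</title></head><body>Hello</body></html>"
                        .getBytes(StandardCharsets.UTF_8));

        // Read the response body exactly once.
        byte[] page = readFully(content);

        // The original stream is now exhausted, so a second reader gets nothing.
        System.out.println("left in original stream: " + readFully(content).length);

        // Each consumer gets a fresh stream over the same buffered bytes.
        InputStream forMetadata = new ByteArrayInputStream(page);
        InputStream forText = new ByteArrayInputStream(page);
        System.out.println("first copy: " + readFully(forMetadata).length);
        System.out.println("second copy: " + readFully(forText).length);
    }
}
```

In the question's code, the buffered array would take the place of `content`, with one new `ByteArrayInputStream` passed to each Tika call instead of the shared stream.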