2013-08-06 36 views
0

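Using XPath with Google in Java

Before asking this question I tried several different approaches and searched Google for pointers. I also looked through StackOverflow and could not find a solution.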

Basically, I want to create a tool that returns results based on a URL and an XPath expression, for example:

URL:  http://www.google.co.uk/search?q=wicked+games 
XPath:  id('rso')/li/div/h3/a 

It should return these results:

http://puu.sh/3V4JG.jpg

I can parse XML data from other URLs just fine, for example when I fetch an actual XML file such as http://renualsoft.com/jordon/person.xml, but I am not sure how to do the same with Google.

I tried this:

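import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    DocumentBuilder builder;
    Document doc = null;
    XPathExpression expr = null;
    builder = factory.newDocumentBuilder();
    // The parser fetches the URL itself; this is the call that fails with HTTP 403
    doc = builder.parse("http://www.google.co.uk/search?q=wicked+games");
    XPathFactory xFactory = XPathFactory.newInstance();
    XPath xpath = xFactory.newXPath();

    expr = xpath.compile("id('rso')/li/div/h3/a/@href");
    Object result = expr.evaluate(doc, XPathConstants.NODESET);
    NodeList nodes = (NodeList) result;
    for (int i = 0; i < nodes.getLength(); i++) {
        System.out.println(nodes.item(i).getNodeValue());
    }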

But I get this exception:

Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.google.co.uk/search?q=wicked+games 
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1625) 
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:633) 
    at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:189) 
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:799) 
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764) 
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123) 
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:237) 
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:300) 
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177) 
    at NewEmptyJUnitTest.query(NewEmptyJUnitTest.java:35) 
    at NewEmptyJUnitTest.main(NewEmptyJUnitTest.java:77) 
Java Result: 1 

Any help or guidance would be greatly appreciated. I have tried looking for answers elsewhere, but as I said, I can't find anything useful.

+0

I just noticed an interesting tag description. Check out the google tag. – keyser

+2

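This happens because no user agent is set. Google also does not want you to fetch their search results this way; it is against their TOS. Using the Google search API is a better, cleaner way to search. –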

+0

@keyser y。好的發現;) –

Answers

0

Wouldn't HtmlUnit work for you?

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

class Example
{
    public static void main(final String args[]) throws FailingHttpStatusCodeException, MalformedURLException, IOException
    {
        // Emulate a real browser so the request carries a browser user agent
        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_17);
        webClient.getOptions().setCssEnabled(false);

        final HtmlPage page = webClient.getPage("http://www.google.co.uk/search?q=wicked+games");

        // [@id='rso'] selects the result list; a bare ['rso'] predicate is always true and matches every <ol>
        final List<?> byXPath = page.getByXPath("//ol[@id='rso']//h3/a");

        for (final Object object : byXPath)
        {
            // Print the link text of each result
            System.out.println(((HtmlAnchor) object).getTextContent());
        }
    }
}

This prints:

Chris Isaak - Wicked Game - YouTube The Weeknd - Wicked Games (Explicit) - 
YouTube Emika - Wicked Game - YouTube Wicked Game - Wikipedia, the 
free encyclopedia THE WEEKND - WICKED GAMES LYRICS THE WEEKND LYRICS - 
Wicked Games - A-Z Lyrics The Weeknd – Wicked Games Lyrics | Rap 
Genius Chris Isaak - Wicked Game - Video Dailymotion Wicked Game | 
Chris Isaak | Music Video | MTV Wicked Games 

Maven dependency:

<dependency> 
    <groupId>net.sourceforge.htmlunit</groupId> 
    <artifactId>htmlunit</artifactId> 
    <version>2.12</version> 
</dependency> 
+0

Hey, this throws an exception: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/NoHttpResponseException – TehBawz

+0

@JordonBarber Did you add the Maven dependency? – d0x

+0

This class comes from the commons-httpclient package. It should be on your classpath. (It ships with HTMLUnit) – d0x
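One quick way to confirm the classpath point above is to ask the class loader whether the class named in the error can be found before running the scraper. This is only a diagnostic sketch; the class name `ClasspathCheck` and its `isPresent` helper are hypothetical and not part of HtmlUnit.

```java
class ClasspathCheck {

    // Returns true if the named class can be located by this class's loader, without initializing it.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class reported as missing in the NoClassDefFoundError above
        String name = "org.apache.http.NoHttpResponseException";
        System.out.println(name + " on classpath: " + isPresent(name));
    }
}
```

If this reports `false`, the transitive dependencies of HtmlUnit were probably not resolved, for example because the jar was added by hand instead of through the Maven dependency.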