Why does parsing with Jsoup give different HTML than the browser on http://www.flashscore.com/nhl/?

I am trying to extract the links from the "Today's Matches" table on this website, but the HTML I get back is different from what the browser shows.

I tried the code below, but it does not work. Can you point out what the mistake is?

final Document page = Jsoup 
    .connect("http://d.flashscore.com/x/feed/t_4_200_G2Op923t_1_en_1") 
    .cookie("_ga","GA1.2.47011772.1485726144") 
    .referrer("http://d.flashscore.com/x/feed/proxy-local") 
    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36") 
    .header("X-Fsign", "SW9D1eZo") 
    .header("X-GeoIP", "1") 
    .header("X-Requested-With", "XMLHttpRequest") 
    .header("Accept" , "*/*") 
    .get(); 

for (Element game : page.select("table.hockey tr")) {
    Elements links = game.getElementsByClass("tr-first stage-finished");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
    }
}

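To try to fix it, I started debugging. The debugger shows that we do get a page (although the HTML looks strange). After that, it shows that the for loop never even starts. I tried changing the page.select("") part to other calls (such as getElementsByAttribute), but I have only just started learning web scraping, so I still need to get familiar with these methods for navigating a document. How should I extract this data?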

Source

2017-04-10 Armin Beda

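As the comments say, this website needs to execute some JavaScript in order to build the link elements. Jsoup only parses HTML; it does not run any JS, so the HTML you get from Jsoup is not the same as the HTML you see in the browser.

You need to load the site as if it were running in a real browser. You can do this programmatically with WebDriver and Firefox.

I tried it with your example site and it works:

pom.xml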

<project>

    <modelVersion>4.0.0</modelVersion>
    <groupId>com.test</groupId>
    <artifactId>test</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>test</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-firefox-driver</artifactId>
            <version>2.43.0</version>
        </dependency>
    </dependencies>

</project>

App.java

package com.test; 

import org.openqa.selenium.By; 
import org.openqa.selenium.WebDriver; 
import org.openqa.selenium.firefox.FirefoxDriver; 
import java.util.Collections; 
import java.util.List; 
import java.util.stream.Collectors; 

public class App {

    public static void main(String[] args) {
        App app = new App();
        List<String> links = app.parseLinks();
        links.forEach(System.out::println);
    }

    public List<String> parseLinks() {
        WebDriver driver = null;
        try {
            // Download geckodriver from https://github.com/mozilla/geckodriver/releases
            // and point this property at your local copy.
            System.setProperty("webdriver.firefox.marionette", "C:\\apps\\geckodriver.exe");
            driver = new FirefoxDriver();
            String baseUrl = "http://www.flashscore.com/nhl/";

            driver.get(baseUrl);

            // Each row id in the "hockey" table carries a match id; turn it into a match link.
            return driver.findElement(By.className("hockey"))
                    .findElements(By.tagName("tr"))
                    .stream()
                    .distinct()
                    .filter(we -> !we.getAttribute("id").isEmpty())
                    .map(we -> createLink(we.getAttribute("id")))
                    .collect(Collectors.toList());

        } catch (Exception e) {
            e.printStackTrace();
            return Collections.emptyList();
        } finally {
            // Close the browser whether or not parsing succeeded.
            if (driver != null) {
                driver.quit();
            }
        }
    }

    private String createLink(String id) {
        return String.format("http://www.flashscore.com/match/%s/#match-summary", extractId(id));
    }

    // Row ids look like "x_4_<matchId>" or "g_4_<matchId>"; strip the prefix.
    private String extractId(String id) {
        if (id.contains("x_4_")) {
            id = id.replace("x_4_", "");
        } else if (id.contains("g_4_")) {
            id = id.replace("g_4_", "");
        }

        return id;
    }
}

Output:

http://www.flashscore.com/match/f9MJJI69/#match-summary 
http://www.flashscore.com/match/zZCyd0dC/#match-summary 
http://www.flashscore.com/match/drEXdts6/#match-summary 
http://www.flashscore.com/match/EJOScMRa/#match-summary 
http://www.flashscore.com/match/0GKOb2Cg/#match-summary 
http://www.flashscore.com/match/6gLKarcm/#match-summary 
... 
...

PS: This works with Firefox version 32.0 and Selenium 2.43.0. Using a combination of Selenium and Firefox versions that do not support each other is a common mistake.

Source

2017-04-10 11:44:09 exoddus

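Hi @exoddus, thank you very much for your solution, it works like a charm. Can you tell me how I could keep only the matches that have today's date? Let's say today's date is in the variable `String date`. I think I should use `.filter()`.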

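At first glance, "Today's Matches" is placed inside a div with the id "fscountry". One way could be to restrict the search to the tr elements inside that div. Instead of the first two findElement calls, try something like .findElement(By.id("fscountry")).findElements(By.tagName("tr")). - exoddus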

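Hi @exoddus, that was a good hint. It did not work with "fscountry", but it did with "fs"; if you inspect the element, you can see why. In the "Today's Matches" table there are always two elements with the same id (two rows, the home team above and the away team below), so I changed the end of the chain to something like `... .collect(toSet())`, so each id appears only once. I don't know if this is the best solution, but it works.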
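The de-duplication described in the comment above can be sketched without a browser. This is only an illustration, not the commenter's actual code: the class name `MatchLinkDedup` and the method `uniqueLinks` are hypothetical, the sample row ids are made up from match ids shown in the output above, and the pairing of a "g_4_" row with an "x_4_" row is an assumption. A likely reason the `.distinct()` call in the answer does not help is that it compares row elements rather than the links built from them, so de-duplicating after the ids are mapped to links is what removes the repeats. A `LinkedHashSet` is used here instead of `toSet()` so that the page order is kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class MatchLinkDedup {

    // Strips the "x_4_" or "g_4_" row prefix, mirroring extractId in the answer above.
    static String extractId(String rowId) {
        return rowId.replace("x_4_", "").replace("g_4_", "");
    }

    // Builds the match-summary URL in the same format as createLink in the answer above.
    static String createLink(String rowId) {
        return String.format("http://www.flashscore.com/match/%s/#match-summary", extractId(rowId));
    }

    // Maps row ids to links, skips rows without an id, and drops repeated links
    // while keeping the order in which they were first seen.
    static List<String> uniqueLinks(List<String> rowIds) {
        LinkedHashSet<String> links = new LinkedHashSet<>();
        for (String rowId : rowIds) {
            if (!rowId.isEmpty()) {
                links.add(createLink(rowId));
            }
        }
        return new ArrayList<>(links);
    }

    public static void main(String[] args) {
        // Hypothetical row ids: two rows per match, plus one row with no id.
        List<String> rowIds = Arrays.asList(
                "g_4_f9MJJI69", "x_4_f9MJJI69", "", "g_4_zZCyd0dC", "x_4_zZCyd0dC");
        uniqueLinks(rowIds).forEach(System.out::println);
    }
}
```

With WebDriver, the same idea would be to collect `we.getAttribute("id")` for each row into a list first and then pass that list to a helper like this one.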

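You have the wrong address in .connect("http://d.flashscore.com/x/feed/t_4_200_G2Op923t_1_en_1"); you need to use .connect("http://www.flashscore.com/nhl/") there.

Also, this site uses JS, so even after you get the right page it will be rendered differently than in the browser. For example, there will be no table with the class "hockey"; you can check this in the page you get back. So you need to change your locators, or consider using WebDriver.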

Source

2017-04-10 06:46:16

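You are right, I changed the string in .connect and now I get the correct HTML. Thank you. :) Do you have any idea how I should write the for loop to extract the links to the games?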

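The results are wrapped in `div id="tournament-page-data-summary-results"`, which does not contain any links. I don't think this is possible with Jsoup. Try a `WebDriver` implementation such as ChromeDriver or FirefoxDriver, which supports JS.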
