Crawling https pages with crawler4j

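For several months we have been crawling an https site with crawler4j. Suddenly, since last Friday, we can no longer crawl that same https site. Has something changed in the https protocol? The site is https://enot.publicprocurement.be/enot-war/home.do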

As a test, we are only trying to grab the title: Welkom op het platform e-Notification
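To tell a crawler4j problem apart from a JVM-level TLS or certificate problem, the same title test can be run outside the crawler. The following is only a minimal sketch using the plain JDK: the class name TitleCheck and its helper methods are made up for illustration, it assumes Java 9+ (for readAllBytes) and a UTF-8 page, and the regex-based title extraction is deliberately naive.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawler4j.
final class TitleCheck {

    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed text of the first <title> element, or null if there is none. */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    /** Downloads the page with the JVM's default TLS settings and returns its title. */
    static String fetchTitle(String pageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(pageUrl).toURL().openConnection();
        conn.setConnectTimeout(30_000);
        conn.setReadTimeout(60_000);
        try (InputStream in = conn.getInputStream()) {
            // Assumption: the page is encoded as UTF-8.
            String html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            return extractTitle(html);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // An SSLHandshakeException thrown here points at the JVM's TLS or
        // certificate setup rather than at crawler4j itself.
        System.out.println(fetchTitle("https://enot.publicprocurement.be/enot-war/home.do"));
    }
}
```

If this fails with an SSLHandshakeException, the exception message usually indicates whether the certificate chain or the protocol negotiation is the issue.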

Any help is greatly appreciated.

Source

2014-01-28 Heinz Uller

I had the same problem. To solve it, you need a custom PageFetcher. You can find a sample here: http://code.google.com/p/crawler4j/issues/detail?id=174

Source

2014-02-03 10:06:27 SANN3

You can use this PageFetcher subclass in place of PageFetcher. It solved all of my problems.

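import java.security.KeyManagementException;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;

import javax.net.ssl.SSLContext;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.NoopHostnameVerifier;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.ssl.SSLContextBuilder;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;

public class PageFetcher2 extends PageFetcher {

    public static final String DEFAULT_USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:45.0) Gecko/20100101 Firefox/45.0";
    public static final RequestConfig DEFAULT_REQUEST_CONFIG = RequestConfig.custom()
            .setConnectTimeout(30 * 1000)
            .setSocketTimeout(60 * 1000)
            .build();

    public PageFetcher2(CrawlConfig config) throws KeyManagementException, NoSuchAlgorithmException, KeyStoreException {
        super(config);

        // WARNING: this context trusts every certificate, and hostname verification
        // is switched off below, so the crawler is open to man-in-the-middle attacks.
        SSLContext sslContext = new SSLContextBuilder()
                .loadTrustMaterial(null, (certificate, authType) -> true)
                .build();

        // HttpClientBuilder ignores setSSLContext() and setSSLHostnameVerifier() once a
        // connection manager is supplied, so the SSL settings are registered on the
        // connection manager itself.
        Registry<ConnectionSocketFactory> socketFactories = RegistryBuilder.<ConnectionSocketFactory>create()
                .register("http", PlainConnectionSocketFactory.getSocketFactory())
                .register("https", new SSLConnectionSocketFactory(sslContext, NoopHostnameVerifier.INSTANCE))
                .build();

        PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager(socketFactories);
        connectionManager.setMaxTotal(30);
        connectionManager.setDefaultMaxPerRoute(30);

        // Replace the client created by the superclass constructor.
        httpClient = HttpClients.custom()
                .setConnectionManager(connectionManager)
                .setUserAgent(DEFAULT_USER_AGENT)
                .setDefaultRequestConfig(DEFAULT_REQUEST_CONFIG)
                .build();
    }

}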
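To make explicit what the trust-everything strategy in PageFetcher2 switches off, here is roughly the same idea written against the plain JDK API. This is only an illustrative sketch: TrustAllContext is a hypothetical name, and the class is not a drop-in part of crawler4j or HttpClient. Both trust checks are empty, which means any certificate chain is accepted without validation.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;

import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper, shown only to illustrate the trust-all technique.
final class TrustAllContext {

    /** Builds an SSLContext whose trust manager accepts any certificate chain. */
    static SSLContext create() throws GeneralSecurityException {
        TrustManager trustAll = new X509TrustManager() {
            @Override
            public void checkClientTrusted(X509Certificate[] chain, String authType) {
                // Intentionally empty: no client certificate is ever rejected.
            }

            @Override
            public void checkServerTrusted(X509Certificate[] chain, String authType) {
                // Intentionally empty: no server certificate is ever rejected.
            }

            @Override
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0];
            }
        };

        SSLContext context = SSLContext.getInstance("TLS");
        // null key managers and null SecureRandom select the JDK defaults.
        context.init(null, new TrustManager[] { trustAll }, null);
        return context;
    }

    public static void main(String[] args) throws GeneralSecurityException {
        SSLContext context = create();
        System.out.println("Created context for protocol: " + context.getProtocol());
    }
}
```

Because nothing is validated, a setup like this is best limited to sites you control or to debugging. Importing the site's certificate into a trust store is the safer long-term fix.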

Source

2017-09-29 10:27:59 ed22

I found that it works best when CrawlConfig is set up like this:

CrawlConfig config = new CrawlConfig();
config.setIncludeHttpsPages(true);
config.setUserAgentString("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36");
PageFetcher pageFetcher = new PageFetcher(config);

Source

2018-01-25 11:07:09 KompiKompi
