Parallelizing many GET requests

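Is there an efficient way to parallelize a large number of GET requests in Java? I have a file with 200,000 lines, and each line requires a GET request to Wikimedia. I then have to write part of each response to a common file. I have pasted the main part of my code below for reference.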

while ((line = br.readLine()) != null) {
    count++;
    // Report progress and flush both writers every 1000 lines
    if ((count % 1000) == 0) {
        System.out.println(count + " tags parsed");
        fbw.flush();
        bw.flush();
    }
    //System.out.println(line);
    // Strip the double quotes from titles that are wrapped in them
    String target = line;
    if (target.startsWith("\"") && target.endsWith("\"")) {
        target = target.replaceAll("\"", "");
    }
    String url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=xml&rvprop=timestamp&rvlimit=1&rvdir=newer&titles=";
    url = url + URLEncoder.encode(target, "UTF-8");
    URL obj = new URL(url);
    HttpURLConnection con = (HttpURLConnection) obj.openConnection();
    // optional, default is GET
    con.setRequestMethod("GET");
    //add request header
    //con.setRequestProperty("User-Agent", USER_AGENT);
    int responsecode = con.getResponseCode();
    //System.out.println("Sending 'Get' request to URL: " + url);
    // Read the whole response body; try-with-resources closes the reader
    StringBuilder response = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            response.append(inputLine);
        }
    }
    Document doc = loadXMLFromString(response.toString());
    NodeList x = doc.getElementsByTagName("revisions");
    // Keep the date part of the revision timestamp, formatted as yyyyMMdd
    if (x.getLength() == 1) {
        String time = x.item(0).getFirstChild().getAttributes().item(0).getTextContent().substring(0, 10).replaceAll("-", "");
        bw.write(line + "\t" + time + "\n");
    } else if (x.getLength() == 2) {
        String time = x.item(1).getFirstChild().getAttributes().item(0).getTextContent().substring(0, 10).replaceAll("-", "");
        bw.write(line + "\t" + time + "\n");
    } else {
        // No usable revision element: record the line in the failure file
        fbw.write(line + "\t" + "NULL" + "\n");
    }
}

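I googled it and there seem to be two options: one is to create threads myself, the other is to use something called an Executor. Could someone give me some guidance on which is better suited to this task?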

Source

2013-07-31 user1943079

Use an 'Executor'; it is a simpler way to work with threads. Also consider a dedicated library that minimizes TCP overhead by reusing connections.

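A valid question, but with this many requests, have you considered simply [downloading the Wikipedia database](http://en.wikipedia.org/wiki/Wikipedia:Database_download) instead of requesting it piece by piece? They don't necessarily [like web crawlers](http://en.wikipedia.org/wiki/Wikipedia:Database_download#Why_not_just_retrieve_data_from_wikipedia.org_at_runtime.3F).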

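If you really, really need to do this via GET requests, I suggest using a ThreadPoolExecutor with a small thread pool (2 or 3 threads) to avoid overloading Wikipedia's servers. That will also save you a lot of coding...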
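A minimal sketch of the small-pool idea, using only the JDK. The class name SmallPoolFetcher, the method fetchAll, and the fetch parameter are hypothetical names introduced for illustration; in real use fetch would wrap the HttpURLConnection code from the question. Executors.newFixedThreadPool returns a ThreadPoolExecutor under the hood.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper; names and signatures are illustrative assumptions.
class SmallPoolFetcher {

    // Runs `fetch` for every title on a fixed pool of `poolSize` threads and
    // returns the results in input order. A failed fetch yields null.
    static List<String> fetchAll(List<String> titles, int poolSize,
                                 Function<String, String> fetch) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String title : titles) {
                Callable<String> task = () -> fetch.apply(title);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    results.add(future.get()); // blocks until that task is done
                } catch (ExecutionException e) {
                    results.add(null); // the fetch itself threw
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    results.add(null);
                }
            }
            return results;
        } finally {
            pool.shutdown(); // no new tasks; lets the worker threads exit
        }
    }

    public static void main(String[] args) {
        // Stand-in for the real HTTP call, so the demo needs no network.
        List<String> out = fetchAll(List.of("Java", "Groovy"), 3, t -> "fetched:" + t);
        System.out.println(out); // prints [fetched:Java, fetched:Groovy]
    }
}
```

With 200,000 lines you would probably submit in batches rather than hold every Future in memory at once; the pool size of 3 is what keeps the load on the server low.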

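Also consider using the Apache HttpClient library (persistent connections!).

But using the database download option is better. Depending on what you are doing, you may be able to pick a smaller download. This page discusses the various options.

Note: Wikipedia prefers that people download the database dumps (etcetera) rather than hammering their web servers.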

Source

2013-07-31 06:55:46

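What you need is this:

Have a producer thread that reads each line and adds it to a queue.
Have a ThreadPool in which each thread takes a URL and performs the GET request.
Each thread takes the response and adds it to a second queue.
Have one more consumer thread that checks that queue and writes the results to the file.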
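The steps above can be sketched roughly as follows. This is only one way to wire it up, under several assumptions: the class GetPipeline, the DONE sentinel, the queue capacity of 1000, and the fetch parameter are all hypothetical; the calling thread plays the producer; and, unlike the question's code, successes and failures go to a single output.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

// Hypothetical class; names and queue sizes are illustrative assumptions.
class GetPipeline {
    // Sentinel compared by identity (==), so no real input line can collide with it.
    private static final String DONE = new String("DONE");

    static void run(Iterable<String> lines, int workers,
                    Function<String, String> fetch, Writer out) {
        BlockingQueue<String> pending = new LinkedBlockingQueue<>(1000);
        BlockingQueue<String> results = new LinkedBlockingQueue<>(1000);
        AtomicReference<IOException> writeError = new AtomicReference<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Workers: take a line, perform the request, queue the result.
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    for (String line = pending.take(); line != DONE; line = pending.take()) {
                        String value;
                        try {
                            value = fetch.apply(line);
                        } catch (RuntimeException e) {
                            value = "NULL"; // same failure marker as in the question
                        }
                        results.put(line + "\t" + value);
                    }
                    results.put(DONE); // tell the consumer this worker has finished
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Consumer: the only thread that writes to the output.
        Thread consumer = new Thread(() -> {
            try {
                for (int finished = 0; finished < workers; ) {
                    String result = results.take();
                    if (result == DONE) {
                        finished++;
                    } else if (writeError.get() == null) {
                        try {
                            out.write(result + "\n");
                        } catch (IOException e) {
                            writeError.set(e); // keep draining so workers never block
                        }
                    }
                }
                try {
                    out.flush();
                } catch (IOException e) {
                    writeError.compareAndSet(null, e);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // Producer: here simply the calling thread.
        try {
            for (String line : lines) pending.put(line);
            for (int i = 0; i < workers; i++) pending.put(DONE); // one sentinel per worker
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
            consumer.interrupt();
            throw new IllegalStateException("interrupted while running the pipeline", e);
        } finally {
            pool.shutdown();
        }
        if (writeError.get() != null) throw new UncheckedIOException(writeError.get());
    }

    public static void main(String[] args) {
        // Stand-in fetch function so the demo needs no network; output order may vary.
        run(List.of("Java", "Groovy", "Scala"), 3, t -> "fetched:" + t, new PrintWriter(System.out));
    }
}
```

The bounded queues give back-pressure: the producer blocks when the workers fall behind instead of loading all 200,000 lines into memory. Results arrive in completion order, not input order.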

Source

2013-07-31 06:42:43 Jatin

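As mentioned above, you should size the number of parallel GET requests according to the capacity of the server. If you want to stay on the JVM but are willing to use Groovy, here is a very short example of parallel GET requests.

You start with a list of URLs you want to fetch. When it finishes, the tasks list holds all the results, which you can access through the get() method for later processing. Here they are simply printed as an example.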

import groovyx.net.http.AsyncHTTPBuilder 

def urls = [ 
    'http://www.someurl.com', 
    'http://www.anotherurl.com' 
] 
AsyncHTTPBuilder http = new AsyncHTTPBuilder(poolSize:urls.size()) 
def tasks = [] 
urls.each{ 
    tasks.add(http.get(uri:it) { resp, html -> return html }) 
} 
tasks.each { println it.get() }

Note that in a production environment you need to take care of timeouts, error responses, and so on.
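For readers staying in plain Java, here is a rough sketch of what handling timeouts and error responses might look like with HttpURLConnection. The class SafeGet, its methods, and the linear back-off policy are assumptions made up for this example, not part of the Groovy answer; setConnectTimeout, setReadTimeout and getResponseCode are standard JDK calls.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Hypothetical helper; names and the retry policy are illustrative assumptions.
class SafeGet {

    // Performs a GET with connect and read timeouts; any non-200 status,
    // timeout, or I/O failure surfaces as an UncheckedIOException.
    static String get(String url, int timeoutMillis) {
        try {
            HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
            con.setConnectTimeout(timeoutMillis);
            con.setReadTimeout(timeoutMillis);
            int status = con.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("HTTP " + status + " for " + url);
            }
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
                return in.lines().collect(Collectors.joining("\n"));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Retries `action` up to maxAttempts times, sleeping backoffMillis * attempt
    // between tries; rethrows the last failure if every attempt fails.
    static <T> T withRetries(Supplier<T> action, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        UncheckedIOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (UncheckedIOException e) {
                last = e;
                if (attempt < maxAttempts) sleep(backoffMillis * attempt);
            }
        }
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }

    public static void main(String[] args) {
        // Simulated flaky call so the demo needs no network: fails twice, then succeeds.
        int[] calls = {0};
        String body = withRetries(() -> {
            if (++calls[0] < 3) throw new UncheckedIOException(new IOException("simulated timeout"));
            return "ok after " + calls[0] + " attempts";
        }, 5, 10);
        System.out.println(body); // prints "ok after 3 attempts"
    }
}
```

A real call would look something like SafeGet.withRetries(() -> SafeGet.get(url, 5000), 3, 500), where the 5-second timeout and 3 attempts are arbitrary starting values to tune.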

Source

2014-10-20 20:40:14
