Java URL library for grabbing lines from a website

I would like to grab N lines (HTML text content that begins on a new line) from a particular URL (for example, www.sitename.com) and store them as strings in an array.

Something like:

public void grabLines(){ 

//create instance of class from imported library 

//pass sitename into it 

//from the instance, call a method for grabbing the lines on the site and pass in "N" as a parameter 

//the method returns an array/list of N Strings that I can access later 

}

Is there a native Java library I can import to do this? Would it let me do what I want?

Thanks


2011-06-25 algorithmicCoder

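What is a line on a website? – Sjoerd

Do you mean lines of the HTML content? Not part of the URL itself? – Bozho

Lines of text... sentences that begin on different lines. @Bozho yes, I mean lines of the HTML content. – algorithmicCoder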

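Are you trying to build a screen scraper? You will be pulling the HTML source, not the rendered page you see in a browser. Also, if the site is dynamic, you won't get everything you can see. If you just want the HTML and so on, you can try something like this. I was trying to build a Bloomberg screen scraper and then parse out assorted HTML tags.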

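// Needs java.io.BufferedReader, java.io.InputStreamReader and java.net.URL imported.
try {
    URL bbg = new URL("http://www.bloomberg.com/markets/economic-calendar/");
    // try-with-resources closes the reader (and the underlying stream) when done
    try (BufferedReader r = new BufferedReader(new InputStreamReader(bbg.openStream()))) {
        String temp; // was undeclared in the original snippet
        // Print the raw HTML one line at a time
        while ((temp = r.readLine()) != null) {
            System.out.println(temp);
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}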
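To get closer to the `grabLines` shape asked for in the question, the loop above can stop after N lines and collect them in a list instead of printing them. This is only a minimal sketch: the class name `LineGrabber` and the method names are made up for illustration, the URL must include its protocol (for example `http://`), and UTF-8 is assumed as the page encoding, which will not hold for every site.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
class LineGrabber {

    /** Reads at most n lines from the reader; returns fewer if the input is shorter. */
    static List<String> firstLines(Reader reader, int n) throws IOException {
        List<String> lines = new ArrayList<>();
        BufferedReader buffered = new BufferedReader(reader);
        String line;
        // Stop as soon as n lines are collected or the input runs out
        while (lines.size() < n && (line = buffered.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }

    /** Opens the URL and returns up to n lines of the raw HTML it serves. */
    static List<String> grabLines(String url, int n) throws IOException {
        // Assumption: the page is encoded as UTF-8
        try (Reader reader = new InputStreamReader(
                new URL(url).openStream(), StandardCharsets.UTF_8)) {
            return firstLines(reader, n);
        }
    }
}
```

Calling `LineGrabber.grabLines("http://www.sitename.com", 10)` would then return up to ten lines of markup. Keeping `firstLines` separate from the network code means it can be tried on any `Reader`, such as a `StringReader`, without touching a real site.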


2011-06-25 18:53:37 jhlu87

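Note that this solution does not handle different character sets correctly. You should use the page's charset to convert the bytes to characters. – jtahlborn

@jtahlborn, yes, you're right. As you can probably tell, this is pretty lazy coding. I'm curious, though: how do you get the charset? Is there a better way than trying to detect the tag and switching based on it? – jhlu87

@jtahlborn can you explain what you mean? – algorithmicCoder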
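One way to act on the charset point, offered as a sketch and not as what jtahlborn necessarily had in mind: `InputStreamReader` without an explicit charset uses the platform default, so the bytes can be decoded wrongly. `URLConnection.getContentType()` returns the `Content-Type` header, which often carries a `charset=` parameter. The class `CharsetAwareReader` and its methods are hypothetical names; the UTF-8 fallback is an assumption, and pages that declare their encoding only in a `<meta>` tag are not handled here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class CharsetAwareReader {

    /** Extracts the charset parameter from a Content-Type value, or returns the fallback. */
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // server sent no Content-Type header
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" prefix (8 characters)
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback;
    }

    /** Prints the page line by line, decoded with the charset the server declared. */
    static void printPage(String url) throws IOException {
        URLConnection conn = new URL(url).openConnection();
        // Assumption: fall back to UTF-8 when the header names no charset
        Charset cs = charsetFrom(conn.getContentType(), StandardCharsets.UTF_8);
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), cs))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

For a header such as `text/html; charset=ISO-8859-1`, `charsetFrom` picks ISO-8859-1; for a bare `text/html` it returns the fallback.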

Apache's HttpClient is an abstraction on top of the URL/reader technique above, but works in a similar way: Apache HTTP Client


2011-06-25 18:58:49
