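How to get HTML links from a URL

I've just started working on a networking assignment and I'm already stuck. The assignment asks me to check the website links supplied by the user and determine whether they are active or inactive by reading the header information. So far, after some Googling, all I have is this code, which retrieves the website. I don't understand how to go through this information and look for HTML links. Here is the code: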

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class url_checker {
    public static void main(String[] args) throws Exception {
        URL yahoo = new URL("http://yahoo.com");
        URLConnection yc = yahoo.openConnection();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(yc.getInputStream()))) {
            String inputLine;
            // Print the page source line by line
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
        }
    }
}

Please help. Thanks!

2011-02-07 careless_monkey

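You need to get the HTTP status code of the response the server returns. If the page does not exist, the server returns 404.

Take a look at this: http://download.oracle.com/javase/1.4.2/docs/api/java/net/HttpURLConnection.html

In particular, the getResponseCode method.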
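
For illustration, here is a minimal sketch of that check using only the standard library. The class name LinkStatusChecker, the helper names, the 5-second timeouts, and the rule that any 2xx or 3xx code counts as "active" are my assumptions, not something the assignment specifies:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class LinkStatusChecker {

    // Assumption: 2xx and 3xx responses count as "active", everything else as "inactive".
    static boolean isActive(int statusCode) {
        return statusCode >= 200 && statusCode < 400;
    }

    // Sends a HEAD request, so only the status line and headers come back, not the body.
    static int fetchStatus(String link) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(5000); // hypothetical timeout values
        conn.setReadTimeout(5000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String link = args.length > 0 ? args[0] : "http://yahoo.com";
        int code = fetchStatus(link);
        System.out.println(link + " -> " + code
                + (isActive(code) ? " (active)" : " (inactive)"));
    }
}
```

Some servers reject HEAD requests; if that happens, dropping the setRequestMethod call falls back to the default GET, and getResponseCode works the same way.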

2011-02-07 01:06:54 SammoSammo

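Thanks for your reply! My problem is finding the links on the page the user supplies. Once I have identified all the links, I will use your approach. – 2011-02-07 02:26:29

I would parse the HTML with a tool like NekoHTML. It basically fixes malformed HTML for you and lets you access it as if it were XML. You can then process the link elements and try to follow them from the original page.

You can take a look at some sample code that does this.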

2011-02-07 01:16:41

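Thanks for the reply. Sadly, I can't use any external libraries for my assignment. :-( – 2011-02-07 02:28:31

I don't understand how to go through this information and look for HTML links

I can't use any external libraries for my assignment

You have two options:

1) You can read the web page into an HTMLDocument. You can then get an iterator from the document to find all the HTML.Tag.A tags. Once you have found an anchor tag, you can get the HTML.Attribute.HREF value from that tag's attribute set.

2) You can extend HTMLEditorKit.ParserCallback and implement the handleStartTag(...) method. Then, whenever you find an A tag, you can get its href attribute, which again contains the link. The basic code for invoking the parser callback is: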

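// Needs java.io.*, java.net.* and javax.swing.text.html.parser.ParserDelegator.
// MyParserCallback is your own subclass of HTMLEditorKit.ParserCallback.
MyParserCallback parser = new MyParserCallback();

// simple test
String file = "<html><head><here>abc<div>def</div></here></head></html>";
StringReader reader = new StringReader(file);

// read a page from the internet
//URLConnection conn = new URL("http://yahoo.com").openConnection();
//Reader reader = new InputStreamReader(conn.getInputStream());

try
{
    // the third argument (true) tells the parser to ignore charset directives
    new ParserDelegator().parse(reader, parser, true);
}
catch (IOException e)
{
    System.out.println(e);
}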
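
Option 1 above could look roughly like the sketch below. The class name DocumentLinkLister and both extractLinks helpers are hypothetical names I made up for illustration; the Swing calls themselves (HTMLEditorKit.read, HTMLDocument.getIterator) are standard JDK API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class DocumentLinkLister {

    // Reads the markup into an HTMLDocument and collects the href of every <a> tag.
    static List<String> extractLinks(Reader reader) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Without this, a <meta> charset declaration can abort parsing with an exception.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(reader, doc, 0);

        List<String> links = new ArrayList<>();
        for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A); it.isValid(); it.next()) {
            AttributeSet attrs = it.getAttributes();
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) { // anchors such as <a name="..."> have no href
                links.add(href.toString());
            }
        }
        return links;
    }

    // Convenience overload for trying the method on an in-memory string.
    static List<String> extractLinks(String html) {
        try {
            return extractLinks(new StringReader(html));
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><p><a href=\"http://example.com/\">one</a>"
                + " and <a href=\"/about\">two</a></p></body></html>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

To run it against a live page, pass new InputStreamReader(conn.getInputStream()) to the Reader overload, as in the commented-out lines above.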

2011-02-07 04:21:53 camickr

You could also try jsoup, an HTML retriever and parser.

Document doc = Jsoup.parse(new URL("<url>"), 2000); 

Elements resultLinks = doc.select("div.post-title > a"); 
for (Element link : resultLinks) { 
    String href = link.attr("href"); 
    System.out.println("title: " + link.text()); 
    System.out.println("href: " + href); 
}

With this code you can list and process every a element that is a direct child of a div with the class "post-title" on the page at the given URL.

2011-02-07 09:22:15 Impiastro

You can try this:

URL url = new URL(link);
// openStream() avoids the unchecked cast of url.getContent() to InputStream
Reader reader = new InputStreamReader(url.openStream());
new ParserDelegator().parse(reader, new Page(), true);

Then create a class named Page:

class Page extends HTMLEditorKit.ParserCallback {

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) {
            // Look up href directly; checking only the first attribute name
            // misses links such as <a class="x" href="...">
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                String link = href.toString();
                // save link somewhere
            }
        }
    }
}
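
For anyone who wants something to paste and run, here is a self-contained sketch of the callback approach that stores the links in a list instead of discarding them. LinkCollector, collect and getLinks are names I invented for the example; only ParserDelegator and HTMLEditorKit.ParserCallback come from the JDK.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkCollector extends HTMLEditorKit.ParserCallback {

    private final List<String> links = new ArrayList<>();

    // Called by the parser for every opening tag; only <a> tags with an href are kept.
    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) {
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    List<String> getLinks() {
        return links;
    }

    // Parses everything the reader supplies and returns the hrefs in document order.
    static List<String> collect(Reader reader) throws IOException {
        LinkCollector collector = new LinkCollector();
        new ParserDelegator().parse(reader, collector, true);
        return collector.getLinks();
    }

    // Convenience overload for in-memory markup.
    static List<String> collect(String html) {
        try {
            return collect(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"http://example.com/\">one</a>"
                + "<a name=\"top\">no href</a><a href=\"/about\">two</a></body></html>";
        for (String link : collect(html)) {
            System.out.println(link);
        }
    }
}
```

The links come back exactly as written in the page, so relative ones like /about still need to be resolved against the page's URL (for example with new URL(baseUrl, link)) before you request them.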

2012-04-24 13:15:18

The HtmlParser class is what you need here. A lot can be done with it.

2012-04-24 19:21:48 mtk
