Parsing links with JTidy

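I am currently using JTidy to parse an HTML document and collect all of the anchor tags in it. I then extract the value of each tag's href attribute to build a collection of the links on the page.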

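Unfortunately, these links can be expressed in several different ways: some are absolute (http://www.example.com/page.html) and some are relative (/page.html, page.html, or ../page.html). Worse still, some are just anchors (#paragraphA). When I visit the page in a browser and click a link, the browser automatically knows how to handle these different href values. But if I want to retrieve one of the links obtained from JTidy programmatically with HTTPClient, I first have to supply a valid URL (for example, I first need to convert /page.html, page.html, and http://www.example.com/page.html into http://www.example.com/page.html).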

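Is there some built-in functionality, either in JTidy or elsewhere, that can do this for me? Or do I need to write my own rules for converting these different URLs into absolute URLs?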

2011-12-19 Andrew

Assuming you can work out which context to use, the plain java.net.URL class will probably get you most of the way there. Here are some examples:

package grimbo.url;

import java.net.MalformedURLException;
import java.net.URL;

public class TestURL {
    public static void main(String[] args) {
        // context1: the page sits at the site root
        URL c1 = u(null, "http://www.example.com/page.html");
        u(c1, "http://www.example.com/page.html");
        u(c1, "/page.html");
        u(c1, "page.html");
        u(c1, "../page.html");
        u(c1, "#paragraphA");

        System.out.println();

        // context2: the page sits two directories below the root
        URL c2 = u(null, "http://www.example.com/path/to/page.html");
        u(c2, "http://www.example.com/page.html");
        u(c2, "/page.html");
        u(c2, "page.html");
        u(c2, "../page.html");
        u(c2, "#paragraphA");
    }

    // Resolves url against context (or parses it on its own when context is null),
    // prints the result, and returns it. Returns null if the URL is malformed.
    public static URL u(URL context, String url) {
        try {
            URL u = null != context ? new URL(context, url) : new URL(url);
            System.out.println(u);
            return u;
        } catch (MalformedURLException e) {
            e.printStackTrace();
            return null;
        }
    }
}

Output:

http://www.example.com/page.html 
http://www.example.com/page.html 
http://www.example.com/page.html 
http://www.example.com/page.html 
http://www.example.com/../page.html 
http://www.example.com/page.html#paragraphA 

http://www.example.com/path/to/page.html 
http://www.example.com/page.html 
http://www.example.com/page.html 
http://www.example.com/path/to/page.html 
http://www.example.com/path/page.html 
http://www.example.com/path/to/page.html#paragraphA

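As you can see, some of the results are not what you want (for example, http://www.example.com/../page.html). So perhaps you could first try parsing the value with new URL(value), and if that throws a MalformedURLException, try resolving it relative to the context URL.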
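
The fallback described above could be sketched roughly as follows. This is only an illustration, not part of JTidy or HTTPClient: the class name LinkResolver and the method resolve are hypothetical, and the sketch assumes you already know the URL of the page the href was taken from.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for illustration only.
final class LinkResolver {

    /**
     * Tries href as an absolute URL first. If that fails, resolves it
     * against pageUrl. Returns null when neither attempt yields a URL.
     */
    static String resolve(String pageUrl, String href) {
        try {
            return new URL(href).toString();
        } catch (MalformedURLException notAbsolute) {
            // Not an absolute URL on its own; fall through to the context.
        }
        try {
            return new URL(new URL(pageUrl), href).toString();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/path/to/page.html";
        String[] hrefs = {
            "http://www.example.com/page.html",
            "/page.html",
            "page.html",
            "../page.html",
            "#paragraphA"
        };
        for (String href : hrefs) {
            System.out.println(href + " -> " + resolve(page, href));
        }
    }
}
```

Returning null keeps the caller in charge of what to do with hrefs that cannot be turned into a URL at all. On recent JDKs the URL constructors are deprecated in favor of java.net.URI, so treat this as a sketch of the idea rather than a recommended implementation.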

2011-12-20 00:16:06

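Your best bet is most likely to follow the same resolution process that browsers do, as outlined in the HTML spec:

User agents must calculate the base URI according to the following precedences (highest priority to lowest):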

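1. The base URI is set by the BASE element.

2. The base URI is given by meta data discovered during a protocol interaction, such as an HTTP header (see [RFC2616]).

3. By default, the base URI is that of the current document. Not all HTML documents have a base URI (e.g., a valid HTML document may appear in an email and may not be designated by a URI). Such HTML documents are considered erroneous if they contain relative URIs and rely on a default base URI.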

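In practice, you are probably most concerned with numbers 1 and 3 (that is, check whether there is a <base href="..."> element and use either that, if it exists, or the current document's URI).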
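
As a rough sketch of that precedence, the following uses java.net.URI to pick a base and resolve an href against it. The class and method names are hypothetical. It assumes you have already pulled the <base href> value (if any) out of the parsed document yourself, and it ignores the HTTP-header case entirely.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not from any library.
final class BaseUriSketch {

    /** Picks the base URI: a <base href> value wins, otherwise the document's own URI. */
    static URI effectiveBase(URI documentUri, String baseHref) {
        if (baseHref == null || baseHref.trim().isEmpty()) {
            return documentUri;
        }
        // Resolving against the document URI also covers a relative <base href>.
        return documentUri.resolve(baseHref.trim());
    }

    /** Resolves one href. URI.create throws IllegalArgumentException on invalid syntax. */
    static String resolve(String documentUrl, String baseHref, String href) {
        URI base = effectiveBase(URI.create(documentUrl), baseHref);
        return base.resolve(URI.create(href.trim())).toString();
    }

    public static void main(String[] args) {
        String doc = "http://www.example.com/path/to/page.html";
        // No <base> element: the document's own URI is the base.
        System.out.println(resolve(doc, null, "../page.html"));
        // With an assumed <base href="http://cdn.example.org/assets/">.
        System.out.println(resolve(doc, "http://cdn.example.org/assets/", "img/logo.png"));
    }
}
```

Real-world href values often contain characters that URI.create rejects (spaces, for instance), so a production version would need to catch IllegalArgumentException or clean the value first.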

2011-12-19 23:57:43
