Find URLs in a string

Hi, I'm trying to find URLs in a string. I know there are many topics on this that use regular expressions, but I have a problem. I'm using this pattern:

String regex = "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" +
      "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" +
      "|mil|biz|info|mobi|name|aero|jobs|museum" +
      "|travel|[a-z]{2}))(:[\\d]{1,5})?" +
      "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +
      "((\\?([-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
      "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +
      "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
      "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
      "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b";

It works well on most pages, but I run into problems with some others. For example:

http://hello.com/hello world

returns

http://hello.com/hello

The problem is the space.

Does anyone have a good pattern that solves this?

Thanks.

EDIT: Here is my code

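import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private ArrayList<String> pullLinks(String text) {
    ArrayList<String> links = new ArrayList<String>();

    // Scheme (http, https, ftp, ftps), "~/", "/" or a leading "www."
    String regex = "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" +
      // Optional user:password@, then the host name and a listed TLD
      "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" +
      "|mil|biz|info|mobi|name|aero|jobs|museum" +
      // Optional port
      "|travel|[a-z]{2}))(:[\\d]{1,5})?" +
      // Path segments; %XX escapes are allowed, a raw space is not
      "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +
      // Query string: first parameter, then further "&" parameters
      "((\\?([-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
      "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +
      "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
      "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
      // Optional fragment
      "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b";

    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(text);
    while (m.find()) {
        String urlStr = m.group();
        // Strip a surrounding pair of parentheses, if any
        if (urlStr.startsWith("(") && urlStr.endsWith(")")) {
            urlStr = urlStr.substring(1, urlStr.length() - 1);
        }
        links.add(urlStr);
    }
    return links;
}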

Source

2012-03-16 Alexx Perez

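Offtopic: there are many more top-level domains (TLDs) than the two-letter ones and the ones you listed. See [Wikipedia's list of TLDs](http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains). Your regex will also miss URLs written like this: 'example.com'. – 2012-03-16 13:10:46

Offtopic, but here is a good pattern for matching URLs, explained line by line: http://daringfireball.net/2010/07/improved_regex_for_matching_urls – Holm 2012-03-16 13:18:14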

Spaces are not allowed in URLs (they need to be replaced by %20). See, for example, the answers to this question:

Spaces in URLs?

If URLs were allowed to contain spaces anyway, how would you interpret, for example, http://www.google.com/ig is a nice webpage? Clearly the part after /ig should not be included!
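To illustrate the point about %20, here is a minimal sketch using the standard java.net.URI class. The class name SpaceInUrlDemo and the helpers parses and encodePath are made up for this example; it only shows that a raw space is rejected by the parser, while building the URI from its parts percent-encodes the space.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class SpaceInUrlDemo {

    /** Returns true if java.net.URI accepts the string exactly as written. */
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /**
     * Builds a URI from its parts. The multi-argument URI constructors
     * quote illegal characters, so a space in the path becomes %20.
     */
    static String encodePath(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("http://hello.com/hello world"));   // false
        System.out.println(parses("http://hello.com/hello%20world")); // true
        System.out.println(encodePath("http", "hello.com", "/hello world"));
    }
}
```

If the text comes from a source you control, encoding the links this way before they are embedded avoids the ambiguity entirely.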

Source

2012-03-16 13:01:44 aioobe

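So there is no way to detect URLs with %20? – 2012-03-16 13:38:00

Sure there is. The expression you already have does that. Look for the occurrences of '%[a-f\d]{2}' (meaning '%' followed by '{2}' characters in the range 'a-f' or digits). – aioobe 2012-03-16 13:42:41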
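The '%[a-f\d]{2}' alternative can be checked in isolation. The sketch below uses a deliberately simplified pattern (an assumption for illustration, not the full regex from the question), and the names PercentEscapeDemo and firstUrl are hypothetical. It shows that an encoded space is matched as part of the path, while a raw space still ends the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PercentEscapeDemo {

    // Simplified pattern: scheme, host, then path segments made of
    // word characters, '-', '.', '~' or %XX escapes.
    // CASE_INSENSITIVE lets [a-f] also match upper-case hex digits (e.g. %2F).
    static final Pattern URL = Pattern.compile(
            "https?://[-\\w.]+(?:/(?:[-\\w.~]|%[a-f\\d]{2})*)*",
            Pattern.CASE_INSENSITIVE);

    /** Returns the first URL found in the text, or null if there is none. */
    static String firstUrl(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Encoded space: the whole URL is matched.
        System.out.println(firstUrl("see http://hello.com/hello%20world now"));
        // Raw space: the match stops at "hello".
        System.out.println(firstUrl("see http://hello.com/hello world now"));
    }
}
```

Note that the regex in the question is compiled without flags, so its '[a-f]' only accepts lower-case hex digits; an escape such as %2F would need Pattern.CASE_INSENSITIVE or an 'a-fA-F' range.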

That doesn't work for me. I've edited the question with my code. Thanks – 2012-03-16 13:55:39

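A space is not a valid URL character.

Besides, if you don't use the space as a terminator, how will you find the end of the URL?

Your regex also fails to account for other TLDs (such as .int). I'm not sure why it looks for specific TLDs at all, since they aren't needed to form a valid URL.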

Source

2012-03-16 13:03:52 Dev

It's not a problem for me that .int or other TLDs fail. My URLs always look like this: http://something.es/some some.jpg – 2012-03-16 13:39:32
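If every link really does end in a known file extension, the extension can serve as the terminator instead of the space. The following is only a sketch under that assumption; the class name ImageLinkDemo, the method pullImageLinks and the extension list (jpg, jpeg, png, gif) are invented for the example and would need adjusting to the real data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImageLinkDemo {

    // Assumption: each link ends in one of the listed extensions.
    // ".*?" is lazy, so the match ends at the first such extension;
    // spaces in between are accepted because '.' matches them.
    static final Pattern IMAGE_URL = Pattern.compile(
            "https?://[^\\s/]+/.*?\\.(?:jpe?g|png|gif)\\b",
            Pattern.CASE_INSENSITIVE);

    /** Collects every match of IMAGE_URL in the given text. */
    static List<String> pullImageLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = IMAGE_URL.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "a http://something.es/some some.jpg b http://a.es/b.png c";
        for (String link : pullImageLinks(text)) {
            System.out.println(link);
        }
    }
}
```

The trade-off is the one raised in the answers: a link without one of the expected extensions will either be missed or swallow the following text up to the next extension on the same line, so this only suits input whose shape is known in advance.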
