What is the regular expression to get a token from a URL?

Say I have strings like these:

bunch of other html<a href="http://domain.com/133742/The_Token_I_Want.zip" more html and stuff 
bunch of other html<a href="http://domain.com/12345/another_token.zip" more html and stuff 
bunch of other html<a href="http://domain.com/0981723/YET_ANOTHER_TOKEN.zip" more html and stuff

What regular expression would match The_Token_I_Want, another_token, and YET_ANOTHER_TOKEN?

Source

2010-08-15 Nick Strupat

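Don't use regular expressions to parse HTML. What platform are you on? Can there be multiple subdirectories? – 2010-08-15 20:33:27

Working back from the end of the string: /([^\/]+)\..+$/ – 2010-08-15 20:34:12

Will the regular expression run in JavaScript? – Topera 2010-08-15 20:34:50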

Appendix B of RFC 2396 gives a doozy of a regular expression for splitting a URI into its components, and we can adapt it to your case:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))? 
                                     #######

This leaves The_Token_I_Want in $6, which is the "hashderlined" subexpression above. (Note that the hashes are not part of the pattern.) See it in action:

#! /usr/bin/perl 

$_ = "http://domain.com/133742/The_Token_I_Want.zip";  
if (m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?!) { 
    print "$6\n"; 
} 
else { 
    print "no match\n"; 
}

Output:

$ ./prog.pl 
The_Token_I_Want

Update: I see from your comment that you're using boost::regex, so remember to escape the backslashes in your C++ program.

#include <boost/foreach.hpp> 
#include <boost/regex.hpp> 
#include <iostream> 
#include <string> 

int main() 
{ 
    boost::regex token("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*" 
        "/([^.]+)" 
        // ####### I CAN HAZ HASHDERLINE PLZ 
        "[^?#]*)(\\?([^#]*))?(#(.*))?"); 

    const char * const urls[] = {
        "http://domain.com/133742/The_Token_I_Want.zip",
        "http://domain.com/12345/another_token.zip",
        "http://domain.com/0981723/YET_ANOTHER_TOKEN.zip",
    };

    BOOST_FOREACH(const char *url, urls) {
        std::cout << url << ":\n";

        std::string t;
        boost::cmatch m;
        if (boost::regex_match(url, m, token))
            t = m[6];
        else
            t = "<no match>";

        // Print t rather than m[6] so a failed match reports "<no match>".
        std::cout << " - " << t << '\n';
    }

    return 0; 
}

Output:

http://domain.com/133742/The_Token_I_Want.zip:
 - The_Token_I_Want
http://domain.com/12345/another_token.zip:
 - another_token
http://domain.com/0981723/YET_ANOTHER_TOKEN.zip:
 - YET_ANOTHER_TOKEN

Source

2010-08-15 20:41:13

Isn't this a bit overkill for just one component? – Thomas 2010-08-15 20:48:23

Overkill or not, I vote for adding "hashderlined" to the dictionary. – 2010-08-15 20:50:58

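First, use an HTML parser and get a DOM. Then get the anchor elements and loop over them looking for the hrefs. Don't try to pull the token straight out of the string.

Then:

The glib answer is: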

/(The_Token_I_Want.zip)/

You might want to be a bit more precise than a single example.

I'm guessing you are actually looking for:

/([^/]+)$/
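As written, the pattern above captures the whole last path segment, extension included (The_Token_I_Want.zip). A minimal C++ sketch of a variant that also drops the extension, using std::regex rather than the boost::regex mentioned elsewhere in this thread; the helper name token_from_last_segment and the added `\.[^/.]+` tail are assumptions for illustration, not part of the answer:

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the thread): returns the last path segment
// of `url` without its extension, or an empty string if nothing matches.
std::string token_from_last_segment(const std::string& url) {
    // ([^/]+)     the last run of non-slash characters, minus...
    // \.[^/.]+$   ...a final ".ext" that ends the string
    static const std::regex last_segment(R"(([^/]+)\.[^/.]+$)");
    std::smatch m;
    return std::regex_search(url, m, last_segment) ? m[1].str() : std::string();
}
```

Calling token_from_last_segment("http://domain.com/12345/another_token.zip") should give another_token. The sketch assumes the URL is the entire string; it will not find a URL embedded in surrounding HTML.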

Source

2010-08-15 20:33:58 Quentin

m/The_Token_I_Want/

You have to be more specific about what kind of token it is. A number? A string? Does it repeat? Does it have a form or pattern to it?

Source

2010-08-15 20:34:03

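It's probably better to use something smarter than a regex. For example, if you're using C#, you can use the System.Uri class to parse it for you.

Source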
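The C++ standard library has no URI class, but for URLs shaped like the ones in the question the token can be cut out with plain string searches instead of a regex. A rough sketch, assuming the URL has already been isolated from the HTML; the function name token_without_regex is made up for this example:

```cpp
#include <string>

// Hypothetical helper (not from the thread): extracts the file name without
// its extension from a URL using only std::string searches.
std::string token_without_regex(const std::string& url) {
    // Drop any query string or fragment (everything from the first ? or #).
    const std::string path = url.substr(0, url.find_first_of("?#"));

    // Keep what follows the last slash.
    const std::string::size_type slash = path.rfind('/');
    const std::string name =
        (slash == std::string::npos) ? path : path.substr(slash + 1);

    // Strip the extension, if there is one.
    const std::string::size_type dot = name.rfind('.');
    return (dot == std::string::npos) ? name : name.substr(0, dot);
}
```

This is not a real URI parser: a URL with no path at all, such as http://domain.com, would return part of the host name, so check the input shape before relying on it.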

2010-08-15 20:36:08

Try this:

/(?:f|ht)tps?:\/{2}(?:www\.)?domain[^\/]+\/([^\/]+)\/([^\/]+)/i

or

/\w{3,5}:\/{2}(?:w{3}\.)?domain[^\/]+\/([^\/]+)\/([^\/]+)/i

Source

2010-08-15 20:45:33 Jet

/a href="http://domain.com/[0-9]+/([a-zA-Z_]+).zip"/

You may want to add more characters to [a-zA-Z_]+.

Source
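To pull every token out of a larger chunk of text with this kind of pattern, one option in C++ is std::sregex_iterator. The sketch below is only an illustration: the function name tokens_in_html is invented, the dots are escaped so they match literally, and it assumes every link looks exactly like the ones in the question (an HTML parser remains the more robust route).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): collects the captured token
// from every matching <a href="..."> found in `html`, in order.
std::vector<std::string> tokens_in_html(const std::string& html) {
    // Group 1 is the file name between the numeric directory and ".zip".
    static const std::regex link(
        R"re(<a href="http://domain\.com/[0-9]+/([A-Za-z_]+)\.zip")re");

    std::vector<std::string> tokens;
    for (std::sregex_iterator it(html.begin(), html.end(), link), end;
         it != end; ++it) {
        tokens.push_back((*it)[1].str());
    }
    return tokens;
}
```

Fed the three sample lines from the question, this should return The_Token_I_Want, another_token and YET_ANOTHER_TOKEN in that order.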

2010-08-15 20:46:03 Thomas

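You can use:

(http|ftp)+://[[:alnum:]./_]+/([[:alnum:]._-]+)\.[[:alnum:]_-]+

([[:alnum:]._-]+) is the group that captures the token; in your example its value will be The_Token_I_Want. To access this group, use \2 or $2, because (http|ftp) is the first group and ([[:alnum:]._-]+) is the second. Note that the dot before the extension must be escaped: with a bare ., the second group would greedily swallow part of the extension (The_Token_I_Want.z).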
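In C++ the second group is read as m[2] from the match results. A small sketch with std::regex, whose default grammar accepts the [[:alnum:]] classes; the function name token_from_group_two is hypothetical, and the dot before the extension is written as an escaped literal so that group 2 stops at the file name:

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the thread): returns capture group 2 of the
// first URL found in `text`, or an empty string if there is no match.
std::string token_from_group_two(const std::string& text) {
    // Group 1 is (http|ftp); group 2 is the file name before the extension.
    static const std::regex url(
        "(http|ftp)+://[[:alnum:]./_]+/([[:alnum:]._-]+)\\.[[:alnum:]_-]+");
    std::smatch m;
    return std::regex_search(text, m, url) ? m[2].str() : std::string();
}
```

Because regex_search is used instead of regex_match, the URL may sit inside a longer line of HTML, as in the question's samples.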

Source

2010-08-15 20:49:13
