How does Google+ parse URLs in posts?

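Google+ seems to use The-King-of-URL-Regexes to parse URLs out of user posts. It doesn't require a protocol and is good at ignoring punctuation. For example: if I post "I like plus.google.com.", the site turns it into "I like plus.google.com." with plus.google.com rendered as a link and the trailing period left out of it. So, if anyone knows of a regular expression that can parse URLs both with and without a protocol, and that is good at ignoring punctuation, please answer with it.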

I don't think this question is a duplicate, because all the answers I've seen to similar questions seem to require a protocol in the URL.

Thanks


2012-12-20 JoshNaro

This blog has what you need, I think. http://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without-the – zer0bit

@zer0bit It looks like the regex in the link you provided fails to match the URL plus.google.com – cheesemacfly

This is a tricky one... but here is a good start. http://mathiasbynens.be/demo/url-regex – zer0bit

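A reasonable strategy is to use a regular expression that matches a top-level domain (TLD) preceded by a dot, and then to run a lookup in a table of well-known hosts, or a DNS query, as a verification step on the suspected hostname string.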

For example, here is a session using perl that demonstrates the first part of the strategy:

$ cat hostname-detector 
#!/usr/bin/perl -w 
# Add more country/new TLDs for completeness 
my $TLD = '(?:com|net|info|org|gov|edu)'; 
while (<>) { 
    while (/((?:[-\w]+\.)+?$TLD)/g) { 
        print "found hostname: $&\n";
    } 
} 


$ ./hostname-detector 
"I like plus.google.com." 
found hostname: plus.google.com 

a sentence without a hostname. 

here's another host: free.org 
found hostname: free.org 

a longer.host.name.psu.edu should work too.      
found hostname: longer.host.name.psu.edu 

a host.with-dashes.gov ... 
found hostname: host.with-dashes.gov
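
The second part of the strategy, the verification step, is not shown in the session above. A minimal sketch of it follows, under some assumptions: the helper names (`in_host_table`, `resolves_in_dns`, `verified_hosts`) and the contents of `%KNOWN_HOSTS` are hypothetical, and the trailing `\b` in the pattern is an addition so that a word such as "community" is not cut at "com". The checker is passed in as a code reference so that the table lookup can be swapped for a DNS query.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same illustrative TLD list as in the session above; extend for real use.
my $TLD = '(?:com|net|info|org|gov|edu)';

# Hypothetical table of well-known hosts (sample data only).
my %KNOWN_HOSTS = map { $_ => 1 } qw(plus.google.com free.org);

# Checker 1: lookup in the host table.
sub in_host_table {
    my ($host) = @_;
    return exists $KNOWN_HOSTS{ lc $host } ? 1 : 0;
}

# Checker 2: DNS query through Perl's built-in gethostbyname, which
# returns an empty list when the name does not resolve (needs network).
sub resolves_in_dns {
    my ($host) = @_;
    my @entry = gethostbyname($host);
    return @entry ? 1 : 0;
}

# Return the candidate hostnames in $text that the checker accepts.
sub verified_hosts {
    my ($text, $checker) = @_;
    my @hosts;
    while ($text =~ /((?:[-\w]+\.)+?$TLD)\b/g) {
        my $candidate = $1;
        push @hosts, $candidate if $checker->($candidate);
    }
    return @hosts;
}

my $post = 'I like plus.google.com, but nosuchhost.invalid.com is made up.';
print "verified hostname: $_\n" for verified_hosts($post, \&in_host_table);
```

With the sample table, only plus.google.com passes the check. Passing `\&resolves_in_dns` instead would run the same filter against live DNS.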


2013-02-05 04:50:22 arielf

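The end goal is to hit the site and retrieve metadata, so a verification step against the target will happen anyway. However, I'd like to be able to detect all valid URLs, including forward slashes, query strings, and all the other good stuff URLs can contain. – JoshNaro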

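Here is a more complete (full URL) implementation. Note that it is not fully RFC 3986 compliant, is missing some TLDs, allows some illegal country TLDs, allows the protocol part to be dropped (as requested in the original question), and has a few other flaws. On the plus side, it is quite simple, much shorter than many other implementations, and does more than 95% of the job.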

#!/usr/bin/perl -w 
# URL grammar, not 100% RFC 3986 but pretty good considering the simplicity. 
# For more complete implementation options see: 
# http://mathiasbynens.be/demo/url-regex 
# https://gist.github.com/dperini/729294 
# https://github.com/garycourt/uri-js (RFC 3986 compliant) 
# 
my $Protocol = '(?:https?|ftp)://'; 
# Add more new TLDs for completeness 
my $TLD = '(?:com|net|info|org|gov|edu|[a-z]{2})'; 
my $UserAuth = '(?:[^\s:@]+:[^\s@]*@)';
my $HostName = '(?:(?:[-\w]+\.)+?' . ${TLD} . ')'; 
my $Port = '(?::\d+)'; 
my $Pathname = '/[^\s?#&]*'; 
my $Arg = '\w+(?:=[^\s&]*)?'; # name, optionally followed by =value
my $ArgList = "${Arg}(?:\&${Arg})*"; 
my $QueryArgs = '\?' . ${ArgList}; 
my $URL = qr/ 
    (?:${Protocol})? # Optional, not per RFC! 
    ${UserAuth}? 
    ${HostName} 
    ${Port}? 
    (?:${Pathname})? 
    (?:${QueryArgs})? 
/sox; 

while (<>) { 
    while (/($URL)/g) { 
        print "found URL: $&\n";
    } 
}
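
The question also asks for trailing sentence punctuation to be ignored, and a path pattern that accepts any non-space character will swallow a final comma or period. One possible way to handle this is a post-processing step that trims such characters from each match. The sketch below uses a deliberately simplified pattern (`$Loose`, with a greedy host part and a `\b` after the TLD) rather than the grammar above; the pattern, the `extract_urls` helper, and the set of trimmed characters are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified stand-in pattern (assumption): optional protocol, host, optional path.
my $TLD   = '(?:com|net|info|org|gov|edu)';
my $Loose = qr{(?:https?://)?(?:[-\w]+\.)+$TLD\b(?:/\S*)?};

# Hypothetical helper: find URL candidates, then trim sentence
# punctuation that was captured at the very end of a match.
sub extract_urls {
    my ($text) = @_;
    my @urls;
    while ($text =~ /($Loose)/g) {
        my $url = $1;
        $url =~ s/[.,;:!?'")\]]+\z//;    # trailing punctuation only
        push @urls, $url;
    }
    return @urls;
}

my $post = 'I like plus.google.com. See http://example.org/a/b?x=1, too!';
print "found URL: $_\n" for extract_urls($post);
```

The trade-off is that a URL which legitimately ends in one of the trimmed characters, such as a closing parenthesis, loses it.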


2013-02-06 07:35:44 arielf

@arielf

It looks to me like the following line:

my $HostName = '(?:(?:[-\w]+\.)+?' . ${TLD} . ')';

should be fixed like this:

my $HostName = '(?:(?:[-\w]+\.)+' . ${TLD} . ')';

Otherwise, the input http://www.google.com is parsed as

found URL: http://www.go 
found URL: ogle.com
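
The difference can be reproduced in isolation. The sketch below compares the lazy `+?` and the greedy `+` on the host part only; the `all_matches` helper is hypothetical, and the TLD list is the one from the script above. With `+?`, the engine stops at the first label after which the TLD alternation can match, and the two-letter country-code branch `[a-z]{2}` accepts "go".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# TLD list from the script above, including the two-letter branch.
my $TLD    = '(?:com|net|info|org|gov|edu|[a-z]{2})';
my $Lazy   = qr/(?:[-\w]+\.)+?$TLD/;    # original quantifier
my $Greedy = qr/(?:[-\w]+\.)+$TLD/;     # proposed fix

# Hypothetical helper: collect every match of $re in $text.
sub all_matches {
    my ($re, $text) = @_;
    my @found;
    push @found, $1 while $text =~ /($re)/g;
    return @found;
}

for my $case (['lazy', $Lazy], ['greedy', $Greedy]) {
    my ($label, $re) = @$case;
    print "$label: ", join(', ', all_matches($re, 'www.google.com')), "\n";
}
```

This prints `lazy: www.go, ogle.com` followed by `greedy: www.google.com`.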


2013-04-18 14:54:13 aixtal
