2016-08-14 37 views
1

Extract the domain name from a URL

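This is yet another request about parsing URLs, but most of the examples I have found are incomplete or theoretical. I want something that definitely works in Perl.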

I have the following URLs:

https://vimdoc.sourceforge.net/htmldoc/pattern.html 
http://linksyssmartwifi.com/ui/1.0.1.1001/dynamic/login.html 
http://www.catonmat.net/download/perl1line.txt 
https://github.com/robbyrussell/oh-my-zsh/wiki/Cheatsheet 
https://drive.google.com/drive/u/0/folders/0B5jNDUmF2eUJuSnM 
http://www.gnu.org/software/coreutils/manual/coreutils.html 
http://www.catonmat.net/download/perl1line.txt 
https://feedly.com/i/my 
http://vimhelp.appspot.com/ 
https://git-scm.com/doc 
https://read.amazon.com/ 
https://github.com/netsamir/following 
https://scotch.io/ 
https://servicios.dgi.gub.uy/ 
https://sourcemaking.com/ 
https://stackedit.io/editor 
https://stripe.com/be 
https://toolbelt.heroku.com/ 
https://training.github.com/ 
https://vimeo.com/54505525 
https://vimeo.com/tag:drew+neil 
https://web.whatsapp.com/ 
https://www.ctan.org/ 
https://www.eff.org/ 
https://www.mybeluga.com/ 
https://www.solveforx.com/ 
https://www.symynd.com/ 
https://www.symynd.com/# 
https://www.tizen.org/ 
http://workforall.net/CDS-Credit-default-Swaps.html#Credit_Default_Swaps_CDS 

I am trying to extract only the domain name. For example:

linksyssmartwifi.com 
amazon.com 
github.com 

I tried with Perl and Vim but could not get it done. My best approximation is the following:

perl -pe 's!(^https?\://.*[\.](.+\..+?)/.*$)!$1 -- [$2] !g' all_urls_sorted.txt 

Some of the URLs are parsed correctly (see the []), others are not:

https://sites.google.com/site/steveyegge2/singleton-considered-stupid -- [google.com] 
https://sourcemaking.com/ 
https://stackedit.io/editor 
https://stripe.com/be 
https://toolbelt.heroku.com/ -- [heroku.com] 
https://training.github.com/ -- [github.com] 
https://vimeo.com/54505525 
https://vimeo.com/tag:drew+neil 
https://web.whatsapp.com/ -- [whatsapp.com] 
https://wiki.haskell.org/GHC -- [haskell.org] 

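As my tests show, URLs whose domain starts directly after the // (in https?://), with no subdomain in front of it, are left out.

If you know how to solve this, I would be glad to hear it.

Thanks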

Answers

5

Use the URI module:

#!/usr/bin/env perl 

use strict; 
use warnings; 
use v5.10; 

use URI; 

while (<DATA>) { 
    chomp; 
    my $uri = URI->new($_); 
    my $host = $uri->host; 
    my ($domain) = $host =~ m/([^.]+\.[^.]+$)/; 
    say $domain; 
} 

__DATA__ 
https://vimdoc.sourceforge.net/htmldoc/pattern.html 
http://linksyssmartwifi.com/ui/1.0.1.1001/dynamic/login.html 
http://www.catonmat.net/download/perl1line.txt 
https://github.com/robbyrussell/oh-my-zsh/wiki/Cheatsheet 
https://drive.google.com/drive/u/0/folders/0B5jNDUmF2eUJuSnM 
http://www.gnu.org/software/coreutils/manual/coreutils.html 
http://www.catonmat.net/download/perl1line.txt 
https://feedly.com/i/my 
http://vimhelp.appspot.com/ 
https://git-scm.com/doc 
https://read.amazon.com/ 
https://github.com/netsamir/following 
https://scotch.io/ 
https://servicios.dgi.gub.uy/ 
https://sourcemaking.com/ 
https://stackedit.io/editor 
https://stripe.com/be 
https://toolbelt.heroku.com/ 
https://training.github.com/ 
https://vimeo.com/54505525 
https://vimeo.com/tag:drew+neil 
https://web.whatsapp.com/ 
https://www.ctan.org/ 
https://www.eff.org/ 
https://www.mybeluga.com/ 
https://www.solveforx.com/ 
https://www.symynd.com/ 
https://www.symynd.com/# 
https://www.tizen.org/ 
http://workforall.net/CDS-Credit-default-Swaps.html#Credit_Default_Swaps_CDS 

Output:

sourceforge.net 
linksyssmartwifi.com 
catonmat.net 
github.com 
google.com 
gnu.org 
catonmat.net 
feedly.com 
appspot.com 
git-scm.com 
amazon.com 
github.com 
scotch.io 
gub.uy 
sourcemaking.com 
stackedit.io 
stripe.com 
heroku.com 
github.com 
vimeo.com 
vimeo.com 
whatsapp.com 
ctan.org 
eff.org 
mybeluga.com 
solveforx.com 
symynd.com 
symynd.com 
tizen.org 
workforall.net 
+2

What about "www.bbc.co.uk"? – Borodin

+1

That is what `Domain::PublicSuffix` tries to do. – ernix
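
The `www.bbc.co.uk` case raised above can be sketched with core Perl only, assuming a tiny hand-written suffix table. The `%multi_label_suffix` hash and the `registrable_domain` helper are hypothetical names used for illustration; the real Public Suffix List has thousands of entries, so this is only a sketch of the idea, not a replacement for a module that reads the full list.

```perl
use strict;
use warnings;
use v5.10;

# Illustrative only: two multi-label suffixes picked by hand (assumption).
my %multi_label_suffix = map { $_ => 1 } qw(co.uk gub.uy);

# Hypothetical helper: keep the public suffix plus one label to its left.
sub registrable_domain {
    my ($host) = @_;
    my @labels = split /\./, lc $host;
    return if @labels < 2;

    # Use a two-label suffix only when it is in the table and a label precedes it.
    my $suffix_len =
        ( @labels >= 3 && $multi_label_suffix{"$labels[-2].$labels[-1]"} ) ? 2 : 1;

    return join '.', @labels[ -( $suffix_len + 1 ) .. -1 ];
}

say registrable_domain($_)
    for qw(www.bbc.co.uk servicios.dgi.gub.uy read.amazon.com);
```

With this table, `servicios.dgi.gub.uy` yields `dgi.gub.uy` instead of `gub.uy`, which is where the "last two labels" approach falls short.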

3

My best approximation uses URI::URL

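use URI::URL; 

# @filecontents is assumed to hold one URL per element 
foreach my $uri (@filecontents) { 
    my $uriobj = URI::URL->new($uri); 
    my $host = $uriobj->host; 
    my @parts = split /\./, $host; 
    # Join the last two labels with a dot, e.g. "github.com" 
    print "$uri -- $parts[-2].$parts[-1]\n"; 
} 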

Hope that helps.

+2

URI::URL exists for backward compatibility. For new code, use [URI](https://metacpan.org/pod/URI) instead. – Schwern

1

A regex solution would be:

//(?:[^./]+[.])*([^/.]+[.][^/.]+)/ 

If the trailing slash is optional, just add a ?

//(?:[^./]+[.])*([^/.]+[.][^/.]+)/? 

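This should be used with the global modifier and a delimiter other than /.

Essentially, it looks at what sits between the // and the next /.

If there are any extra subdomains, they are consumed by the non-capturing group (?:[^./]+[.])*. The main domain name ends up in the capture group ([^/.]+[.][^/.]+)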
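
As a minimal sketch, the pattern with the optional trailing slash could be used from Perl as follows. The `extract_domain` helper and the `$domain_re` variable are hypothetical names, and `qr{...}` is chosen as the delimiter so the slashes in the pattern need no escaping. The sketch assumes one URL per string and hosts without a port or userinfo part.

```perl
use strict;
use warnings;
use v5.10;

# The regex from the answer, with the optional trailing slash.
my $domain_re = qr{//(?:[^./]+[.])*([^/.]+[.][^/.]+)/?};

# Hypothetical helper: returns the captured domain, or undef when nothing matches.
sub extract_domain {
    my ($url) = @_;
    my ($domain) = $url =~ $domain_re;
    return $domain;
}

for my $url (
    'https://training.github.com/',
    'https://vimeo.com/54505525',
    'https://scotch.io',
) {
    my $domain = extract_domain($url) // '(no match)';
    say "$url -- [$domain]";
}
```

Because the subdomain group may repeat zero times, hosts such as `vimeo.com` that follow the `//` directly are matched as well.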

+0

What if there is no / after the host name? – ysth

+0

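I have tested this solution and it works as a one-liner, since it is pure Perl with a regex. It fails when there is no / after the host name. The (?:[^./]+[.])* part is what differs from my own solution. Thanks. –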

+0

@SamirSadek Adjusting for an optional trailing slash is very simple, just add a '?' (see the edit). – Laurel