Extracting the domain name from a URL
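This is yet another request about parsing URLs, but the examples I have found are mostly incomplete or theoretical. I would like to nail down something that actually works in Perl.
I have the following URLs: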
https://vimdoc.sourceforge.net/htmldoc/pattern.html
http://linksyssmartwifi.com/ui/1.0.1.1001/dynamic/login.html
http://www.catonmat.net/download/perl1line.txt
https://github.com/robbyrussell/oh-my-zsh/wiki/Cheatsheet
https://drive.google.com/drive/u/0/folders/0B5jNDUmF2eUJuSnM
http://www.gnu.org/software/coreutils/manual/coreutils.html
http://www.catonmat.net/download/perl1line.txt
https://feedly.com/i/my
http://vimhelp.appspot.com/
https://git-scm.com/doc
https://read.amazon.com/
https://github.com/netsamir/following
https://scotch.io/
https://servicios.dgi.gub.uy/
https://sourcemaking.com/
https://stackedit.io/editor
https://stripe.com/be
https://toolbelt.heroku.com/
https://training.github.com/
https://vimeo.com/54505525
https://vimeo.com/tag:drew+neil
https://web.whatsapp.com/
https://www.ctan.org/
https://www.eff.org/
https://www.mybeluga.com/
https://www.solveforx.com/
https://www.symynd.com/
https://www.symynd.com/#
https://www.tizen.org/
http://workforall.net/CDS-Credit-default-Swaps.html#Credit_Default_Swaps_CDS
I am trying to extract only the domain name. For example:
linksyssmartwifi.com
amazon.com
github.com
I have tried with Perl and with Vim, but could not get the job done. My best approximation is the following:
perl -pe 's!(^https?\://.*[\.](.+\..+?)/.*$)!$1 -- [$2] !g' all_urls_sorted.txt
Some of the URLs are parsed correctly (see the part in []), others are not:
https://sites.google.com/site/steveyegge2/singleton-considered-stupid -- [google.com]
https://sourcemaking.com/
https://stackedit.io/editor
https://stripe.com/be
https://toolbelt.heroku.com/ -- [heroku.com]
https://training.github.com/ -- [github.com]
https://vimeo.com/54505525
https://vimeo.com/tag:drew+neil
https://web.whatsapp.com/ -- [whatsapp.com]
https://wiki.haskell.org/GHC -- [haskell.org]
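As my tests show, the URLs in which the domain name starts directly after the // (in https?://) are left out.
If you know how to solve this, I would be glad to hear it.
Thanks
What about "www.bbc.co.uk"? – Borodin
That is what Domain::PublicSuffix tries to do. – ernix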
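A possible starting point, offered as a rough sketch rather than a complete answer: the pattern in the question requires a literal dot before the captured part, so a host with only two labels (stripe.com, vimeo.com) can never match. The script below first isolates the host and then keeps its last two labels. The sub name naive_domain and the sample list are made up for illustration, and the last URL is built from the host named in the first comment. As that comment points out, this gives the wrong answer for suffixes such as co.uk, which is the case a public suffix list (for example through Domain::PublicSuffix) is meant to handle. URLs with a user:password@ part are not handled either.

```perl
use strict;
use warnings;

# Naive sketch: return the last two dot-separated labels of the host.
# Known limitation: wrong for multi-label suffixes such as "co.uk".
sub naive_domain {
    my ($url) = @_;

    # Capture the host: everything after "http://" or "https://" up to
    # the first "/", ":", "?" or "#". Give up if the input does not match.
    my ($host) = $url =~ m{^https?://([^/:?#]+)}i or return;

    $host = lc $host;
    my @labels = split /\./, $host;
    return $host if @labels < 2;    # single-label host, e.g. "localhost"
    return join '.', @labels[-2, -1];
}

# A few URLs from the question, plus one built from the first comment.
my @urls = (
    'https://vimdoc.sourceforge.net/htmldoc/pattern.html',
    'https://stripe.com/be',
    'https://vimeo.com/tag:drew+neil',
    'https://www.symynd.com/#',
    'https://read.amazon.com/',
    'http://www.bbc.co.uk/',
);

for my $url (@urls) {
    my $domain = naive_domain($url);
    print "$url -- [", (defined $domain ? $domain : 'no match'), "]\n";
}
```

With this sketch the two-label hosts now get a result (stripe.com, vimeo.com), while the bbc.co.uk example comes out as co.uk, which shows why counting labels alone is not enough.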