Unexpected result from a regular expression on a URL

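I want to match part of a URL. The URL has already been processed and contains only the domain name.

For example, the URL I have now is business.time.com, and I want to get rid of the top-level domain (.com). The result I want is business.time.

I am using the following code: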

gawk '{
match($1, /[a-zA-Z0-9\-\.]+[^(.com|.org|.edu|.gov|.mil)]/, where) 
print where[0] 
print where[1] 
}' test

The file test contains four lines:

business.time.com 
mybest.try.com 
this.is.a.example.org 
this.is.another.example.edu

I expect this:

business.time 

mybest.try 

this.is.a.example 

this.is.another.example

However, the output is:

business.t 

mybest.try 

this.is.a.examp 

this.is.another.examp

Can anyone tell me what is wrong and what I should do?

Thanks

Source

2013-07-29 TJ Liu

Why not use the dot as the field separator, like this: awk -F. 'sub(FS $NF,x)' test

Or use something more readable, such as rev test|cut -d. -f 2-|rev.
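The field-separator idea can also be spelled out without a regex. In sub(FS $NF, x) the pattern is built as ".com", where the dot matches any character and only the first match is replaced, so a more explicit version may be easier to reason about. The following is a minimal sketch, assuming the input has one host name per line; the function name strip_last_label is hypothetical and not from the thread.

```shell
# strip_last_label: hypothetical helper, reads host names on stdin
# and prints each one without its last dot-separated label.
strip_last_label() {
  awk -F. '{
    out = $1
    # Rejoin every field except the last one.
    for (i = 2; i < NF; i++) out = out "." $i
    print out
  }'
}

printf '%s\n' business.time.com this.is.a.example.org | strip_last_label
```

With those two input lines this prints business.time and this.is.a.example. A name with no dot at all is printed unchanged.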

Source

2013-07-29 22:47:07 micke

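Thanks for your answer. I think the first one might apply better, because some domain names also contain a country code. I am thinking of using rev/cut first and then searching the remaining domain name. Hopefully this works.

You can say: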

rev domains.txt | cut -d '.' -f 2- | rev

But if you have more complex suffixes to remove, you can use sed with an explicit list:

sed -r 's/\.(com(\.hk)?|org|edu|net|gov|mil)//' domains.txt
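One thing to watch with an explicit list is that an unanchored pattern replaces the first match anywhere in the line, so a name such as www.company.com would come out as wwwpany.com. A hedged variant that anchors the suffix to the end of the line might look like the sketch below. The function name strip_tld is hypothetical, the suffix list is only an illustration, and -E is used as the spelling of -r that both GNU and BSD sed accept.

```shell
# strip_tld: hypothetical helper, reads host names on stdin and removes
# a known suffix only when it sits at the very end of the line.
strip_tld() {
  sed -E 's/\.(com(\.hk)?|org|edu|net|gov|mil)$//'
}

printf '%s\n' google.com google.com.hk www.company.com | strip_tld
```

For these three lines the output is google, google and www.company, which also covers the case where google.com and google.com.hk appear in the same file.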

Source

2013-07-29 22:46:52

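Aww man, I was just about to post that. – micke

Thanks for your suggestion. However, I think some domain names have two top-level domains, for example google.com.hk. How should I cut both of them, especially when google.com and google.com.hk both exist?

@TJLiu: You can use sed, see my example.

The problem is that [^] only excludes single characters, not expressions, so what you basically have is this regular expression: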

match($1, /[a-zA-Z0-9\-\.]+[^()|.cedgilmoruv)]/, where)

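That is why it cannot match the ime.com part of business.time.com: every one of those characters is in the [^] set. The match has to end on a character outside the set, so for business.time.com it backs up to the t and returns business.t.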
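The effect is easy to reproduce with the two-argument match() that any POSIX awk provides. This is only a sketch for illustration: the function name show_match is hypothetical, and the bracket expressions are written without the backslashes and the duplicate parenthesis, which does not change the set of characters.

```shell
# show_match: hypothetical helper, prints the text that the
# character-class version of the pattern actually matches.
show_match() {
  awk '{
    # The negated class is a set of single characters, not a list of words.
    if (match($0, /[a-zA-Z0-9.-]+[^()|.cedgilmoruv]/))
      print substr($0, RSTART, RLENGTH)
  }'
}

printf '%s\n' business.time.com mybest.try.com | show_match
```

This prints business.t and mybest.try. The second line only looks right by accident, because y happens to be outside the set.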

I could not find a good negative match for gawk, but I did put together the following, which I hope will work for you:

gawk '{
match($1, /([a-zA-Z0-9\-\.]+)(\.com|\.org|\.edu|\.gov|\.mil)/, where)
print where[0]
print where[1]
print where[2]
}' test

So the first part ends up in where[1], and where[2] holds the top-level domain:

business.time.com 
business.time 
.com 
mybest.try.com 
mybest.try 
.com 
this.is.a.example.org 
this.is.a.example 
.org 
this.is.another.example.edu 
this.is.another.example 
.edu

Source

2013-07-29 23:12:48

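This really helps! This is exactly what confused me. Also, when I searched online for the match function, it only had two arguments. How does the third one work?

@TJ Liu: It works by using expression groups (the parentheses around parts of the regular expression): ([a-z...]+) and (.com|...). In other words, [0] is the whole matched string, [1] is what the first group matched, and [2] is what the second group matched.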
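For context, the array argument to match() is a gawk extension, which is why most references show only two arguments. The two-argument form sets RSTART and RLENGTH instead, and the same split can be approximated with substr(). The sketch below assumes the suffix list from the question and anchors it to the end of the line; the function name split_tld is hypothetical.

```shell
# split_tld: hypothetical helper, prints the name and its suffix
# separated by a space, using only the two-argument match().
split_tld() {
  awk '{
    if (match($0, /\.(com|org|edu|gov|mil)$/))
      # RSTART is the 1-based position where the suffix begins.
      print substr($0, 1, RSTART - 1), substr($0, RSTART)
    else
      print $0
  }'
}

printf '%s\n' business.time.com this.is.another.example.edu | split_tld
```

This prints "business.time .com" and "this.is.another.example .edu". Lines without a listed suffix are passed through unchanged.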

Thanks for the explanation. This is much clearer now.
