2012-06-11 49 views
0

Retrieve unique websites from a list of URLs

I have a list that contains many page URLs. I want to retrieve the unique websites from it.

"http://www.gadgetgiants.com/products/mica-8-inch-touchscreen-android-2-3-tablet-wifi-1-2ghz-cpu-flash10-3" 
"http://www.malma.mx/products/pan-digital" 
"http://www.gadgetgiants.com/products/snowpad-7-capacitive-multi-touch-screen-android-2-3-tabletwifi-samsung-cortex-a8-1-2ghz-cpu-camera-1080p-external-3g" 
"http://www.spiritualityandwellness.com/products/internalized-motivation" 
"http://www.spiritualityandwellness.com/products/evergreen-motivation" 

should result in:

www.gadgetgiants.com 
www.malma.mx 
www.spiritualityandwellness.com 

Answers

1

egrep -o "www\.[a-zA-Z0-9.-]*\.[a-zA-Z]{2,4}" YOUR_FILE_NAME | sort -u


(Edit) Example usage and output

$ cat ur.txt 
"http://www.gadgetgiants.com/products/mica-8-inch-touchscreen-android-2-3" 
"http://www.malma.mx/products/pan-digital" 
"http://www.gadgetgiants.com/products/snowpad-7-capacitive-multi-touch" 
"http://www.spiritualityandwellness.com/products/internalized-motivation" 
"http://www.spiritualityandwellness.com/products/evergreen-motivation" 
"http://www.swellness.com.au/products/evergreen-motivation" 

$ egrep -o "www\.[a-zA-Z0-9.-]*\.[a-zA-Z]{2,4}" ur.txt | sort -u 
www.gadgetgiants.com 
www.malma.mx 
www.spiritualityandwellness.com 
www.swellness.com.au 
The regular expression was taken from another source.
+0

Does this also remove duplicates? – user1095332

+0

Yes, the last part, 'sort -u', sorts the matches, and '-u' stands for **unique**, so duplicates are dropped. –

+0

This doesn't work on Windows. :( – user1095332

0

An idea without regular expressions:

Retrieve the host from each address:

Uri uri = new Uri(yourLink);   // yourLink: one URL string from the list
string host = uri.Host;        // host part only, e.g. "www.gadgetgiants.com"

Now you can put all of these hosts into a HashSet or something similar, which keeps only one copy of each.
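For anyone who can't run the shell pipeline, the same "parse the host, collect into a set" idea can be sketched in Python instead of C#. This is only a rough illustration: the function name `unique_hosts` is made up for this example, and it assumes each entry is a full URL with a scheme, possibly wrapped in double quotes as in the question.

```python
from urllib.parse import urlsplit


def unique_hosts(urls):
    """Return the distinct host names found in urls, sorted alphabetically."""
    hosts = set()  # a set keeps only one copy of each host
    for url in urls:
        # Drop surrounding whitespace and the double quotes used in the sample list.
        cleaned = url.strip().strip('"')
        host = urlsplit(cleaned).hostname
        if host:  # hostname is None when there is no scheme://host part
            hosts.add(host)
    return sorted(hosts)


if __name__ == "__main__":
    sample = [
        '"http://www.gadgetgiants.com/products/mica-8-inch-touchscreen-android-2-3"',
        '"http://www.malma.mx/products/pan-digital"',
        '"http://www.gadgetgiants.com/products/snowpad-7-capacitive-multi-touch"',
        '"http://www.spiritualityandwellness.com/products/internalized-motivation"',
        '"http://www.spiritualityandwellness.com/products/evergreen-motivation"',
    ]
    print("\n".join(unique_hosts(sample)))
```

Unlike the regular expression above, this doesn't depend on the host starting with "www." or on the length of the top-level domain, since the URL parser does the splitting.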

+0

In what language? – user1095332

+0

C Sharp. http://msdn.microsoft.com/en-us/library/system.uri.host.aspx http://msdn.microsoft.com/en-us/library/bb359438.aspx – zgnilec