Find http:// and/or www. and strip them from the domain, leaving domain.com

I'm quite new to Python. I'm trying to parse a file of URLs so that only the domain name is left.

Some of the URLs in my log file begin with http:// and some begin with www. Some begin with both.

This is the part of my code that strips the http:// part. What do I need to add so that it looks for both http and www. and removes both?

line = re.findall(r'(https?://\S+)', line)

At the moment, when I run the code, only http:// is stripped. If I change the code to the following:

line = re.findall(r'(https?://www.\S+)', line)

then only domains that begin with both are affected. I need the code to be more conditional. TIA

Edit: here is my full code.

import re
import sys
from urlparse import urlparse

f = open(sys.argv[1], "r")

for line in f.readlines():
    line = re.findall(r'(https?://\S+)', line)
    if line:
        parsed = urlparse(line[0])
        print parsed.hostname
f.close()

I mistagged my original post as regex. It does in fact use urlparse.

Source

2013-01-31 Paul Tricklebank

Just a note: you do know that 'www.domain.com' is *different* from 'domain.com', right, and may point to an entirely different IP address?

What about the domains "www.www.com" and "www.com"? – Matthias

Duplicate: http://stackoverflow.com/questions/1521592/get-root-domain-of-link

You can do this without regular expressions.

with open("file_path","r") as f: 
    lines = f.read() 
    lines = lines.replace("http://","") 
    lines = lines.replace("www.", "") # May replace some false positives ('www.com') 
    urls = [url.split('/')[0] for url in lines.split()] 
    print '\n'.join(urls)

Example file input:

http://foo.com/index.html 
http://www.foobar.com 
www.bar.com/?q=res 
www.foobar.com

Output:

foo.com 
foobar.com 
bar.com 
foobar.com

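Edit:

There may be tricky URLs such as foobarwww.com, where the approach above would strip the www. For those we have to fall back on regular expressions.

Replace the line lines = lines.replace("www.", "") with lines = re.sub(r'(www\.)(?!com)', r'', lines). Of course, every possible TLD would have to be listed in the negative lookahead for this to be complete.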
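Rather than listing TLDs, another option is to strip www. only where it starts a whitespace-separated token. The following is a minimal Python 3 sketch of that idea, not the answerer's code; the names LEADING_WWW and hosts, and the inline sample text, are made up for illustration. It still turns a bare "www.com" into "com", so it does not address that edge case.

```python
import re

# "www." counts as a prefix only when no non-space character precedes it,
# so the "www." inside "foobarwww.com" is left alone.
LEADING_WWW = re.compile(r'(?<!\S)www\.')

def hosts(text):
    """Return the host part of every whitespace-separated URL in text."""
    text = re.sub(r'https?://', '', text)   # drop the scheme first
    text = LEADING_WWW.sub('', text)        # then a leading "www."
    return [token.split('/')[0] for token in text.split()]

sample = (
    "http://foo.com/index.html\n"
    "http://www.foobar.com\n"
    "www.bar.com/?q=res\n"
    "http://foobarwww.com\n"
)
print('\n'.join(hosts(sample)))
```

With the sample above this prints foo.com, foobar.com, bar.com and foobarwww.com, one per line.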

Source

2013-01-31 12:25:15 sidi

What if the URL is "http://abcwww.com"? – DSM

@DSM Don't worry, it isn't in use ;)

Thanks, this works :) Any idea how I can remove everything after the .co.uk/.com etc.?

Check out the urlparse library, which can do these things for you automatically.

>>> import urlparse  # Python 2; the module is urllib.parse in Python 3
>>> urlparse.urlsplit('http://www.google.com.au/q?test')
SplitResult(scheme='http', netloc='www.google.com.au', path='/q', query='test', fragment='')
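For readers on Python 3, roughly the same call lives in urllib.parse. The short sketch below also shows the difference between netloc and hostname, which matters once a URL carries a port or mixed case; the second URL is an invented example, not one from the question.

```python
from urllib.parse import urlsplit

parts = urlsplit('http://www.google.com.au/q?test')
print(parts.netloc)     # www.google.com.au
print(parts.path)       # /q

# netloc keeps the port and the original case; hostname drops the port
# and is lowercased.
with_port = urlsplit('http://WWW.Example.com:8080/path')
print(with_port.netloc)    # WWW.Example.com:8080
print(with_port.hostname)  # www.example.com
```

Everything after the host (path, query, fragment) is split off into separate fields, so reading hostname alone discards it.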

Source

2013-01-31 12:27:59 Tom

This may be overkill for this specific case, but in general I would use urlparse.urlsplit (Python 2) or urllib.parse.urlsplit (Python 3).

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2
import re

url = 'www.python.org' 

# URLs must have a scheme 
# www.python.org is an invalid URL 
# http://www.python.org is valid 

if not re.match(r'http(s?)\:', url): 
    url = 'http://' + url 

# url is now 'http://www.python.org' 

parsed = urlsplit(url) 

# parsed.scheme is 'http' 
# parsed.netloc is 'www.python.org' 
# parsed.path is '' (an empty string), since no path was given

host = parsed.netloc # www.python.org 

# Removing www. 
# This is a bad idea, because www.python.org could 
# resolve to something different than python.org 

if host.startswith('www.'): 
    host = host[4:]

Source

2013-01-31 12:31:11

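This won't work out of the box for URLs without "http://". urlparse.urlsplit("www.foo.com").netloc returns ''. – sidi

Yes, that's because "www.foo.com" is not a valid URL.

The problem is that some of the URLs in the OP's file are in this format. – sidi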
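One possible way to cope with the scheme-less entries, sketched here for Python 3: if the first parse yields no netloc, prefix the string with "//" so that urlsplit treats what follows as the network location. The helper name host_of is hypothetical, and the sketch assumes each input is a single URL per string.

```python
from urllib.parse import urlsplit

def host_of(url):
    """Return the host of url, whether or not it starts with a scheme."""
    url = url.strip()
    parts = urlsplit(url)
    if not parts.netloc:
        # No "//" authority marker was found, so add one and parse again.
        parts = urlsplit('//' + url)
    return parts.hostname

for sample in ('http://foo.com/index.html', 'www.bar.com/?q=res', 'www.foobar.com'):
    print(host_of(sample))
```

This leaves any www. in place; whether to remove it afterwards is a separate decision, given the caveat raised in the comments on the question.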

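You can use urlparse. Also, the solution should be generic, so that it removes things other than 'www' in front of the domain name (i.e. it handles cases like server1.domain.com). Here is a quick attempt that should work: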

from urlparse import urlparse 

url = 'http://www.muneeb.org/files/alan_turing_thesis.jpg' 

o = urlparse(url) 

domain = o.hostname 

temp = domain.rsplit('.') 

if(len(temp) == 3): 
    domain = temp[1] + '.' + temp[2] 

print domain
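The three-label check above does not cover hosts such as www.example.co.uk or a.b.domain.com. A rough Python 3 sketch of one way to extend it follows; the set TWO_LABEL_SUFFIXES is a deliberately tiny, hypothetical sample (a complete version would need a full list of public suffixes), and registered_domain is an invented name.

```python
from urllib.parse import urlsplit

# Hypothetical sample only; real-world suffix lists are far longer.
TWO_LABEL_SUFFIXES = {'co.uk', 'org.uk', 'com.au'}

def registered_domain(url):
    """Keep the last two labels, or three when the suffix has two labels."""
    host = urlsplit(url).hostname or ''
    labels = host.split('.')
    keep = 3 if '.'.join(labels[-2:]) in TWO_LABEL_SUFFIXES else 2
    return '.'.join(labels[-keep:])

print(registered_domain('http://www.muneeb.org/files/alan_turing_thesis.jpg'))
print(registered_domain('http://server1.domain.com/'))
print(registered_domain('http://a.b.example.co.uk/x'))
```

The URL must include a scheme here, otherwise hostname is None and the function returns an empty string.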

Source

2013-07-03 17:54:53

I ran into the same problem. Here is a regex-based solution:

>>> import re 
>>> rec = re.compile(r"https?://(www\.)?") 

>>> rec.sub('', 'https://domain.com/bla/').strip().strip('/') 
'domain.com/bla' 

>>> rec.sub('', 'https://domain.com/bla/ ').strip().strip('/') 
'domain.com/bla' 

>>> rec.sub('', 'http://domain.com/bla/ ').strip().strip('/') 
'domain.com/bla' 

>>> rec.sub('', 'http://www.domain.com/bla/ ').strip().strip('/') 
'domain.com/bla'
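Note that the pattern above removes www. only when a scheme comes before it, so a log line starting with a bare www. passes through unchanged. A variant with both parts optional, sketched below for Python 3, would cover that case too. The name strip_prefixes is hypothetical, and cutting off the path is an added assumption about what "leaving domain.com" means.

```python
import re

# Optional scheme, then optional "www.", both anchored at the start.
PREFIX = re.compile(r'^(?:https?://)?(?:www\.)?', re.IGNORECASE)

def strip_prefixes(url):
    """Drop a leading scheme and/or "www.", then everything after the host."""
    rest = PREFIX.sub('', url.strip())
    return rest.split('/', 1)[0]

for sample in ('https://domain.com/bla/', 'www.domain.com/bla',
               'http://www.domain.com ', 'domain.com'):
    print(strip_prefixes(sample))
```

All four samples print domain.com.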

Source

2016-04-20 20:16:12 thet
