2015-10-27 61 views
0

I have an HTML string and want to replace its bare URLs with anchor tags using a Python regular expression. I want to turn this:

I was surfing http://www.google.com, where I found my tweet, 
check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a> 
<span>http://www.google.com</span> 

into this:

I was surfing <a href="http://www.google.com">http://www.google.com</a>, where I found my tweet, 
check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a> 
<span><a href="http://www.google.com">http://www.google.com</a></span> 

I tried this demo.

My Python code:

import re

# Alternation: the first branch matches a whole anchor, the second captures a bare URL.
p = re.compile(r'<a\b[^>]*>.*?</a>|((ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?)', re.MULTILINE)
test_str = "I was surfing http://www.google.com, where I found my tweet, check it out <a href=\"http://tinyurl.com/blah\">http://tinyurl.com/blah</a>"

for item in p.finditer(test_str):
    print(item.group(0))

Output:

>>> http://www.google.com, 
>>> <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a> 
+0

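So what are you missing? You have found the URLs; now you just need to check whether each one is already inside an anchor and replace it if it is not, right? – mikus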

+0

@mikus I updated my question. When I use the pattern in my Python code, it also returns the anchor tags. –

+0

So the desired output is just ">>> http://www.google.com,"? –

Answers

0

You could make the regular expression more complex, but as mikus suggested, it seems easier to do the following:

for item in re.finditer(p, test_str):
    result = item.group(0)
    # Skip matches that are existing anchor tags.
    if "<a " not in result.lower():
        print(result)
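
A variation on the same filter, as a minimal sketch: instead of searching the matched text for "<a ", check whether the URL capture group took part in the match. The pattern below is a simplified stand-in for the one in the question (an assumption, not the asker's exact regex), and the names pattern and bare_urls are made up for illustration. Trailing punctuation such as the comma is stripped afterwards, because a permissive URL pattern tends to swallow it.

```python
import re

# Simplified, hypothetical pattern: the first branch consumes a whole anchor,
# the second branch captures a bare URL in group 1.
pattern = re.compile(
    r'<a\b[^>]*>.*?</a>|((?:ftp|https?)://[^\s<>"]+)',
    re.IGNORECASE | re.DOTALL)

test_str = ('I was surfing http://www.google.com, where I found my tweet, '
            'check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>')

# group(1) is None whenever the anchor branch matched, so those are skipped.
bare_urls = [m.group(1).rstrip('.,;:!?')
             for m in pattern.finditer(test_str)
             if m.group(1) is not None]
print(bare_urls)
```

Because the anchor branch comes first in the alternation, any URL inside an existing anchor is consumed along with it and never reaches the second branch.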
+0

That is not the right approach; I want it done with the regular expression itself. Thanks! –

1

I hope this helps.

Code:

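import re

# A URL is matched only if it is not directly preceded by <, " or >.
p = re.compile(r'''[^<">]((ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?)[^< ,"'>]''', re.MULTILINE)
test_str = "I was surfing http://www.google.com, where I found my tweet, check it out <a href=\"http://tinyurl.com/blah\">http://tinyurl.com/blah</a>"

# Start from the original text so that every match is kept, not only the last one.
end_result = test_str
for item in p.finditer(test_str):
    # group(0) includes the character before the URL; drop it when it is a space.
    result = item.group(0).replace(' ', '')
    print(result)
    # Note: str.replace rewrites every occurrence of the matched text.
    end_result = end_result.replace(result, '<a href="' + result + '">' + result + '</a>')

print(end_result)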

Output:

http://www.google.com 
I was surfing <a href="http://www.google.com">http://www.google.com</a>, where I found my tweet, check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a> 
+0

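It works, but if the URL is inside a span or some other tag, it gets ignored as well. I only want to skip anchor tags, so please help me fix this. Thanks!! –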

+0

I changed the string in the question. Thanks! –

0

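OK, I think I finally found what you are looking for. The basic idea is to try to match <a href together with a URL. If the <a href is there, do nothing; if it is not, add the link. Here is the code: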

import re 
test_str = """I was surfing http://www.google.com, where I found my tweet, 
check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a> 
<span>http://www.google.com</span> 
""" 
def repl_func(matchObj):
    href_tag, url = matchObj.groups()
    if href_tag:
        # Since it has an href tag, this isn't what we want to change,
        # so return the whole match.
        return matchObj.group(0)
    else:
        return '<a href="%s">%s</a>' % (url, url)

pattern = re.compile(
    r'((?:<a href[^>]+>)|(?:<a href="))?' 
    r'((?:https?):(?:(?://)|(?:\\\\))+' 
    r"(?:[\w\d:#@%/;$()~_?\+\-=\\\.&](?:#!)?)*)", 
    flags=re.IGNORECASE) 
result = re.sub(pattern, repl_func, test_str) 
print(result) 

Output:

I was surfing <a href="http://www.google.com">http://www.google.com</a>, where I found my tweet, 
check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a> 
<span><a href="http://www.google.com">http://www.google.com</a></span> 

The main idea comes from https://stackoverflow.com/a/3580700/5100564. I also borrowed from https://stackoverflow.com/a/6718696/5100564.
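
As a usage sketch, the same pattern and callback can be wrapped in a small helper. The names PATTERN, linkify and sample are hypothetical and not part of the answer above. The example also checks that running the helper a second time leaves already-linked text unchanged.

```python
import re

# The pattern from the answer above, unchanged.
PATTERN = re.compile(
    r'((?:<a href[^>]+>)|(?:<a href="))?'
    r'((?:https?):(?:(?://)|(?:\\\\))+'
    r"(?:[\w\d:#@%/;$()~_?\+\-=\\\.&](?:#!)?)*)",
    flags=re.IGNORECASE)


def linkify(text):
    """Wrap bare URLs in anchors; leave URLs that follow '<a href' alone."""
    def repl(match):
        href_tag, url = match.groups()
        if href_tag:
            # Already part of an anchor: return the match untouched.
            return match.group(0)
        return '<a href="%s">%s</a>' % (url, url)
    return PATTERN.sub(repl, text)


sample = ('<span>http://www.google.com</span> and '
          '<a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>')
once = linkify(sample)
twice = linkify(once)
print(once)
print(once == twice)
```

The second alternative in the first group, <a href=", appears to be what covers anchors whose link text is not itself a URL: the URL inside the href attribute is then matched together with the <a href=" prefix and returned as is.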