Python string replacement: keywords to URLs

I want to replace certain keywords in a string with URLs, for example:

content.replace("Google", '<a href="http://www.google.com">Google</a>')

However, I only want to replace a keyword with a URL if it is not already wrapped in a URL.

The content is simple HTML:

<p><b>This is an example!</b></p><p>I love <a href="http://www.google.com">Google</a></p><p><a href="http://www.google.com"><img src="/google.jpg" /></a></p>

Mainly <a> and <img> tags.
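To make the problem concrete, here is a small check of what the plain replace does to markup like the sample above. The variable names are only illustrative, and the sample is shortened to the two paragraphs that contain links.

```python
# Shortened version of the sample HTML from the question.
content = ('<p>I love <a href="http://www.google.com">Google</a></p>'
           '<p><a href="http://www.google.com"><img src="/google.jpg" /></a></p>')
link = '<a href="http://www.google.com">Google</a>'

# str.replace knows nothing about tags, so the keyword that is already
# inside a link gets wrapped a second time.
result = content.replace("Google", link)
print(result)
```

The replace is case-sensitive, so the lowercase "google" in the URLs and the image path is left alone, but the already linked "Google" ends up inside a nested pair of <a> tags.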

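The main question: how can I determine whether a keyword is already wrapped in an <a> or <img> tag?

There is a similar question for PHP, "find and replace keywords with urls ONLY if not already wrapped in a url", but the answer there is not an effective one.

Is there a better solution in Python? Code examples would be even better. Thanks!

Source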

2012-06-09 Susan Mayer

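Could you give an example of the kind of text you want to run this on? – Acorn

@Acorn HTML pages. For example: 'This is an example! I love Google' –

You can use the example I have below to create a regular expression that matches <a> or <img> tags. – tabchas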

As Chris-Top said, BeautifulSoup is the way to go:

import re
from bs4 import BeautifulSoup

html = """
<div>
    <p>The quick brown <a href='http://en.wikipedia.org/wiki/Dog'>fox</a> jumped over the lazy Dog</p>
    <p>The <a href='http://en.wikipedia.org/wiki/Dog'>dog</a>, who was, in reality, not so lazy, gave chase to the fox.</p>
    <p>See image for reference:</p>
    <img src='dog_chasing_fox.jpg' title='Dog chasing fox'/>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# (search term, URL) pairs
keywords = [("dog", "http://en.wikipedia.org/wiki/Dog"),
            ("fox", "http://en.wikipedia.org/wiki/Fox")]

def insertLinks(string_value, string_href):
    pattern = re.compile(re.escape(string_value), re.IGNORECASE)
    # find_all returns a list, so the tree can be modified inside the loop
    for t in soup.find_all(string=pattern):
        # skip text that is already the direct child of a link
        if t.parent.name != 'a':
            a = soup.new_tag('a', href=string_href)
            a.string = string_value
            # split on the first match only: the text before and after the keyword
            before, after = pattern.split(t, maxsplit=1)
            replacement_text = soup.new_string(before)
            t.replace_with(replacement_text)
            replacement_text.insert_after(a)
            a.insert_after(soup.new_string(after))

for keyword, url in keywords:
    insertLinks(keyword, url)

print(soup)

This produces:

<div> 
    <p>The quick brown <a href="http://en.wikipedia.org/wiki/Dog">fox</a> jumped over the lazy <a href="http://en.wikipedia.org/wiki/Dog">dog</a></p> 
    <p>The <a href="http://en.wikipedia.org/wiki/Dog">dog</a>, who was, in reality, not so lazy, gave chase to the <a href="http://en.wikipedia.org/wiki/Fox">fox</a>.</p> 
    <p>See image for reference:</p> 
    <img src="dog_chasing_fox.jpg" title="Dog chasing fox"/> 
</div>

Source

2012-06-09 22:02:40

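Wow, this whole time I was trying to solve the problem with the HTMLParser library... I worked on it for 3 hours... and there was already a library for it :( – tabchas

@Kevin P Thanks for taking the time to submit some working code :) – topless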
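For readers who cannot install BeautifulSoup, the same idea can be sketched with the standard library's html.parser, the module the comment above mentions trying. This is only a rough sketch under a few assumptions: the input is an HTML fragment (doctype declarations and processing instructions are dropped), text inside <script> or <style> is not treated specially, and the names KeywordLinker and link_keywords are made up for this example.

```python
import re
from html import escape
from html.parser import HTMLParser


class KeywordLinker(HTMLParser):
    """Re-emit an HTML fragment, linking keywords found outside <a> elements."""

    def __init__(self, keywords):
        # convert_charrefs=False so entities such as &amp; pass through unchanged
        super().__init__(convert_charrefs=False)
        self.urls = {keyword.lower(): url for keyword, url in keywords}
        self.pattern = re.compile(
            "|".join(re.escape(k) for k in self.urls), re.IGNORECASE)
        self.out = []      # pieces of the rewritten document
        self.a_depth = 0   # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.a_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text, verbatim

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <img ... />

    def handle_endtag(self, tag):
        if tag == "a" and self.a_depth:
            self.a_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def _link(self, match):
        # Look up the URL by the lowercased match, keep the original casing as link text
        url = self.urls[match.group(0).lower()]
        return '<a href="%s">%s</a>' % (escape(url), match.group(0))

    def handle_data(self, data):
        if self.a_depth == 0:  # only touch text that is not inside a link
            data = self.pattern.sub(self._link, data)
        self.out.append(data)


def link_keywords(html, keywords):
    """Return html with each (keyword, url) pair linked where not already linked."""
    parser = KeywordLinker(keywords)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    sample = "<p>The <a href='x'>fox</a> chased the Dog &amp; another fox.</p>"
    print(link_keywords(sample, [("dog", "http://en.wikipedia.org/wiki/Dog"),
                                 ("fox", "http://en.wikipedia.org/wiki/Fox")]))
```

Because attribute values never reach handle_data, a keyword inside an <img> title or src is left alone without any extra check. Unlike the answer above, this version links every occurrence in a text run and keeps the original capitalisation.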

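You could try adding the regular expression mentioned in the previous post. First check your string against the regular expression to see whether it is already wrapped in a URL. This should be straightforward: importing the re library and calling its search() method should do it.

Here is a good tutorial if you need details on regular expressions and the search method: http://www.tutorialspoint.com/python/python_reg_expressions.htm

After you have checked whether the string is already wrapped in a URL, you can call the replace function if it is not.

Here is a simple example I wrote: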

import re

x = '<a href="http://www.google.com">Google</a>'
y = 'Google'

def checkURL(string):
    # Treat the string as already linked if it contains an opening <a href...> tag
    if re.search(r'<a href.+', string):
        print("URL Wrapped Already")
        print(string)
    else:
        string = string.replace('Google', '<a href="http://www.google.com">Google</a>')
        print("URL Not Wrapped:")
        print(string)

checkURL(x)
checkURL(y)

I hope this answers your question!
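The check above looks at the string as a whole, so it cannot handle a page where some occurrences are linked and others are not. One common regex workaround, sketched here with a hypothetical helper named link_keyword_regex, is to let the pattern match whole <a>...</a> elements and individual tags first and hand them back unchanged, so that only keywords in plain text are replaced. It assumes simple, well-formed markup: a ">" inside an attribute value, nested links, comments, or scripts would confuse it.

```python
import re


def link_keyword_regex(html, keyword, url):
    """Link keyword in html unless it sits inside a tag or an existing <a> element."""
    pattern = re.compile(
        r"(<a\b[^>]*>.*?</a>|<[^>]*>)"        # group 1: a whole <a>...</a> element, or any single tag
        r"|\b" + re.escape(keyword) + r"\b",  # otherwise: the bare keyword as a whole word
        re.IGNORECASE | re.DOTALL,
    )

    def replace(match):
        if match.group(1) is not None:
            return match.group(1)  # markup or already linked text: leave untouched
        return '<a href="%s">%s</a>' % (url, match.group(0))

    return pattern.sub(replace, html)


if __name__ == "__main__":
    print(link_keyword_regex("<p>I love Google</p>", "Google", "http://www.google.com"))
```

Because the regex scans left to right and the markup alternatives come first, a keyword inside an existing link or inside an attribute such as src="/google.jpg" is consumed as part of group 1 and never reaches the replacement branch.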

Source

2012-06-09 11:34:18 tabchas

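Huh? I don't quite follow you. I'm not searching for a specific string. I just want to replace keywords with URLs if they are not already wrapped in a URL. –

Can you give an example of the text you would be working with? – tabchas

I use Beautiful Soup to parse my HTML, since parsing HTML with regular expressions can prove tricky. If you use Beautiful Soup, you can play with previous_sibling and previous_element to figure out what you need.

You interact with it like this: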

soup.find_all('img')
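As one possible way to turn that navigation into the "already wrapped?" test, the sketch below asks each matching text node for an enclosing <a> with find_parent instead of walking siblings. It assumes the bs4 package (Beautiful Soup 4) is installed; link_unlinked_keyword is a hypothetical helper name, not part of the library.

```python
import re
from bs4 import BeautifulSoup


def link_unlinked_keyword(soup, keyword, url):
    """Wrap every occurrence of keyword that has no <a> ancestor in a link."""
    # The capturing group makes re.split keep the matched keywords in its result
    pattern = re.compile("(%s)" % re.escape(keyword), re.IGNORECASE)
    for text in soup.find_all(string=pattern):
        if text.find_parent("a") is not None:
            continue  # already inside a link, even if nested as <a><b>...</b></a>
        for i, part in enumerate(pattern.split(text)):
            if i % 2:  # odd indexes hold the matched keyword
                a = soup.new_tag("a", href=url)
                a.string = part
                text.insert_before(a)
            elif part:  # even indexes hold the surrounding plain text
                text.insert_before(soup.new_string(part))
        text.extract()  # remove the original, now duplicated, text node


if __name__ == "__main__":
    soup = BeautifulSoup("<p>I love Google</p>", "html.parser")
    link_unlinked_keyword(soup, "Google", "http://www.google.com")
    print(soup)
```

Checking find_parent("a") rather than only the direct parent also covers keywords nested deeper inside a link, and attribute values such as an <img> src are never visited because they are not text nodes.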

Source

2012-06-09 21:02:42 topless
