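Matching against unwanted links

I wrote a library that builds a persistence layer by extracting href links from Wikipedia and saving them. I realized there is one link I don't care about: the one whose href is /wiki/Cookbook:Table_of_Contents.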

What is the best way to mimic !~ (does not match) while staying Pythonic?

For better context, this is how I would solve it in Ruby:

if link =~ %r{^/wiki/Cookbook} && link !~ /Table_of_Contents/

My code:

def fetch_links(self, proxy):
    if not self._valid_proxy(proxy):
        raise ValueError('invalid proxy address: {}'.format(proxy))
    self.browser.set_proxies({'http': proxy})
    page = self.browser.open(self.wiki_recipes)
    html = page.read()

    link_tags = SoupStrainer('a', href=True)
    soup = BeautifulSoup(html, parse_only=link_tags)
    recipe_regex = r'^\/wiki\/Cookbook'
    return [link['href'] for link in soup.find_all('a') if
            re.match(recipe_regex, link['href'])]


2014-10-06 theGrayFox

Why the downvote? I'm just looking for a second opinion or a better alternative, not fishing for anything. – theGrayFox 2014-10-06 22:19:46

There are several ways to exclude unwanted links.

One option is to pass a function as the href argument value:

soup.find_all('a', href=lambda x: 'Table_of_Contents' not in x)

This keeps only the a tags whose href attribute does not contain Table_of_Contents.

Example:

from bs4 import BeautifulSoup 

data = """ 
<div> 
    <a href="/wiki/Cookbook:Table_of_Contents">cookbook</a> 
    <a href="/wiki/legal_link">legal</a> 
    <a href="http://google.com">google</a> 
    <a href="/Table_of_Contents/">contents</a> 
</div> 
""" 

soup = BeautifulSoup(data, 'html.parser')
print([a.text for a in soup.find_all('a', href=lambda x: 'Table_of_Contents' not in x)])

Prints (on Python 3; Python 2 shows the strings with a u prefix):

['legal', 'google']
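To relate this back to the Ruby condition in the question, here is a minimal sketch of a predicate that combines both tests: the href must match the Cookbook prefix and must not contain Table_of_Contents. The names RECIPE_RE, UNWANTED_RE and is_recipe_link, as well as the sample hrefs, are made up for illustration. The sketch filters a plain list of strings so that it runs with the standard library alone.

```python
import re

RECIPE_RE = re.compile(r'^/wiki/Cookbook')       # must match at the start
UNWANTED_RE = re.compile(r'Table_of_Contents')   # must not occur anywhere


def is_recipe_link(href):
    """Rough Python analogue of Ruby's `link =~ A && link !~ B`."""
    if href is None:  # e.g. an <a> tag that has no href attribute
        return False
    return bool(RECIPE_RE.match(href)) and not UNWANTED_RE.search(href)


# Hypothetical sample hrefs, for illustration only.
hrefs = [
    '/wiki/Cookbook:Table_of_Contents',
    '/wiki/Cookbook:Pancakes',
    '/wiki/legal_link',
    None,
]

recipe_links = [href for href in hrefs if is_recipe_link(href)]
print(recipe_links)  # ['/wiki/Cookbook:Pancakes']
```

Since find_all accepts a function as an attribute value, the same predicate could presumably be passed as soup.find_all('a', href=is_recipe_link). The None check matters in that case, because the function may be called for tags that lack the attribute.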


2014-10-06 22:21:55 alecxe

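+1 for the documentation link. I would never have thought of passing a function to href, but it makes sense as long as it returns a boolean. How did you come up with this idea? Very clever. – theGrayFox 2014-10-06 22:26:07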

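@theGrayFox yup, this is what makes BeautifulSoup beautiful - it's a great library. The more familiar you get with it, the more you realize it is one of the handiest and most pleasant libraries in Python. And, FYI, you can pass a [compiled regular expression pattern](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-regular-expression) as an argument value: `soup.find_all('a', href=re.compile(r'my_pattern_here'))`. Thanks. – alecxe 2014-10-06 22:28:36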
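The compiled-pattern tip in the comment above can also express the exclusion itself, using a negative lookahead. This is only a sketch: the pattern name recipe_pattern and the sample hrefs are assumptions for illustration, and the pattern is exercised with plain re so that it runs without BeautifulSoup.

```python
import re

# Negative lookahead: after the '/wiki/Cookbook' prefix, the text
# 'Table_of_Contents' must not appear anywhere in the rest of the string.
recipe_pattern = re.compile(r'^/wiki/Cookbook(?!.*Table_of_Contents)')

# Hypothetical sample hrefs, for illustration only.
sample_hrefs = [
    '/wiki/Cookbook:Table_of_Contents',
    '/wiki/Cookbook:Pancakes',
    '/wiki/legal_link',
    '/Table_of_Contents/',
]

# search() is used here because Beautiful Soup filters against a compiled
# pattern with its search() method.
kept = [href for href in sample_hrefs if recipe_pattern.search(href)]
print(kept)  # ['/wiki/Cookbook:Pancakes']
```

The pattern can then be passed in the form the comment shows, soup.find_all('a', href=recipe_pattern). Whether one lookahead reads better than two separate tests is a matter of taste.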

Thanks for the tip, I'll clean it up. – theGrayFox 2014-10-06 22:33:25
