
Blacklisting links in Python while fetching data from a web page

Basically, I have put together some very messy code to grab links from a Bing search query. The problem I am facing is that I get far too many Bing-related links back in the results.

I have tried the current code below to strip these out, but I would much rather use a blacklist.

Here is my code:

import re, urllib

class MyOpener(urllib.FancyURLopener):
    # spoof a browser user-agent so Bing serves a normal results page
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

myopener = MyOpener()
dork = raw_input("Dork:")
pagevar = ['1', '11', '23', '34', '45', '46', '47', '58', '69']
for page in pagevar:
    bingdork = "http://www.bing.com/search?q=" + str(dork) + "&first=" + str(page)
    bingdork = bingdork.replace(" ", "+")  # str.replace returns a new string, so reassign it
    links = re.findall('''href=["'](.[^"']+)["']''', myopener.open(bingdork).read(), re.I)
    # collect every Bing-internal link so it can be removed afterwards
    toremove = []
    for i in links:
        if "bing.com" in i:
            toremove.append(i)
        elif "wlflag.ico" in i:
            toremove.append(i)
        elif "/account/web?sh=" in i:
            toremove.append(i)
        elif "/?FORM" in i:
            toremove.append(i)
        elif "javascript:void(0);" in i:
            toremove.append(i)
        elif "javascript:" in i:
            toremove.append(i)
        elif "go.microsoft.com/fwlink" in i:
            toremove.append(i)
        elif "g.msn.com" in i:
            toremove.append(i)
        elif "onlinehelp.microsoft.com" in i:
            toremove.append(i)
        elif "feedback.discoverbing.com" in i:
            toremove.append(i)
        elif "/account/web?sh=" in i:
            toremove.append(i)
        elif "/?scope=web" in i:
            toremove.append(i)
        elif "/explore?q=" in i:
            toremove.append(i)
        elif "https://feedback.discoverbing.com" in i:
            toremove.append(i)
        elif "/images/" in i:
            toremove.append(i)
        elif "/videos/" in i:
            toremove.append(i)
        elif "/maps/" in i:
            toremove.append(i)
        elif "/news/" in i:
            toremove.append(i)
    # drop the collected junk, then print what is left for this page
    for i in toremove:
        links.remove(i)
    for i in links:
        print i

Say I enter the dork: CFM ID

The results I get are:

http://pastebin.com/eGgUKYwV

Whereas the results I want are:

http://pastebin.com/Xi28BzXs

The things I want to remove are ones like:

/search?q=cfm+id&lf=1&qpvt=cfm+id 
/account/web?sh=5&ru=%2fsearch%3fq%3dcfm%2520id%26first%3d69&qpvt=cfm+id 
/search?q=cfm+id&rf=1&qpvt=cfm+id 
/search?q=cfm+id&first=69&format=rss 
/search?q=cfm+id&first=69&format=rss 
/?FORM=Z9FD1 
javascript:void(0); 
/account/general?ru=http%3a%2f%2fwww.bing.com%2fsearch%3fq%3dcfm+id%26first%3d69&FORM=SEFD 
/?scope=web&FORM=HDRSC1 
/images/search?q=cfm+id&FORM=HDRSC2 
/videos/search?q=cfm+id&FORM=HDRSC3 

Basically, I need a filter that lets me grab only the valid result links from Bing and drops all of the junk that comes from Bing's side.
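Something like this rough sketch is the shape of blacklist I have in mind (the BLACKLIST list and keep_link helper are just placeholder names I made up, built from the same substrings as my elif chain above, and I have not tested it):

BLACKLIST = [
    "bing.com", "wlflag.ico", "/account/", "/?FORM", "javascript:",
    "go.microsoft.com/fwlink", "g.msn.com", "onlinehelp.microsoft.com",
    "feedback.discoverbing.com", "/?scope=web", "/explore?q=",
    "/images/", "/videos/", "/maps/", "/news/",
]

def keep_link(href):
    # a link survives only if none of the blacklisted substrings appear in it
    return not any(bad in href for bad in BLACKLIST)

links = [href for href in links if keep_link(href)]
for link in links:
    print link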

Thanks a lot, BK. P.S. Sorry if my explanation is poor.

Answer


Have you tried the HTML-parsing route with beautifulsoup, lxml or html5lib (lxml.etree preferred), querying the HTML with CSS/XPath? In pseudocode:

html = htmlparse.parse(open(url)) 
hrefs = [] 

for a in html.xpath('//a'):
    if a['href'].startswith('http://') or a['href'].startswith('https://'):
        hrefs.append(a['href'])

This is of course pseudocode; you should adjust it to whichever of beautifulsoup, lxml or html5lib you actually use.
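For instance, here is a minimal concrete sketch of the same idea with lxml.html (the external_links helper is just mine for illustration, and it is untested against Bing's current markup):

import lxml.html

def external_links(url):
    # parse the results page and keep only absolute http(s) links;
    # most of Bing's own navigation links are relative or javascript:, so they drop out
    doc = lxml.html.parse(url).getroot()
    hrefs = []
    for a in doc.xpath('//a[@href]'):
        href = a.get('href')
        if href.startswith('http://') or href.startswith('https://'):
            hrefs.append(href)
    return hrefs

for link in external_links('http://www.bing.com/search?q=cfm+id&first=1'):
    print link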

If you are looking for something more like sanitising/whitelist-based cleaning of HTML pages, you might like to use CleanText; it can be customised further to filter on attribute regexes, but that is left as an exercise ;)