
Blacklisting links in Python while fetching data from a web page

Basically, I have put together some very messy code to grab links from a Bing search query. The problem I am facing is that I get far too many Bing-related links back in the results.

I have tried the current code below to strip these out, but I would much rather use a blacklist.

Here is my code:

import re, urllib

class MyOpener(urllib.FancyURLopener):
    # spoof a browser user-agent so Bing serves a normal results page
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

myopener = MyOpener()
dork = raw_input("Dork:")
pagevar = ['1', '11', '23', '34', '45', '46', '47', '58', '69']
for page in pagevar:
    bingdork = "http://www.bing.com/search?q=" + str(dork) + "&first=" + str(page)
    bingdork = bingdork.replace(" ", "+")  # str.replace returns a new string, so reassign it
    links = re.findall('''href=["'](.[^"']+)["']''', myopener.open(bingdork).read(), re.I)
    # collect every Bing-internal link so it can be removed afterwards
    toremove = []
    for i in links:
        if "bing.com" in i:
            toremove.append(i)
        elif "wlflag.ico" in i:
            toremove.append(i)
        elif "/account/web?sh=" in i:
            toremove.append(i)
        elif "/?FORM" in i:
            toremove.append(i)
        elif "javascript:void(0);" in i:
            toremove.append(i)
        elif "javascript:" in i:
            toremove.append(i)
        elif "go.microsoft.com/fwlink" in i:
            toremove.append(i)
        elif "g.msn.com" in i:
            toremove.append(i)
        elif "onlinehelp.microsoft.com" in i:
            toremove.append(i)
        elif "feedback.discoverbing.com" in i:
            toremove.append(i)
        elif "/account/web?sh=" in i:
            toremove.append(i)
        elif "/?scope=web" in i:
            toremove.append(i)
        elif "/explore?q=" in i:
            toremove.append(i)
        elif "https://feedback.discoverbing.com" in i:
            toremove.append(i)
        elif "/images/" in i:
            toremove.append(i)
        elif "/videos/" in i:
            toremove.append(i)
        elif "/maps/" in i:
            toremove.append(i)
        elif "/news/" in i:
            toremove.append(i)
    # drop the collected junk, then print what is left for this page
    for i in toremove:
        links.remove(i)
    for i in links:
        print i

Say I enter the dork: CFM ID

The results I get are:

http://pastebin.com/eGgUKYwV

Whereas the results I want are:

http://pastebin.com/Xi28BzXs

The things I want to remove are ones like:

/search?q=cfm+id&lf=1&qpvt=cfm+id 
/account/web?sh=5&ru=%2fsearch%3fq%3dcfm%2520id%26first%3d69&qpvt=cfm+id 
/search?q=cfm+id&rf=1&qpvt=cfm+id 
/search?q=cfm+id&first=69&format=rss 
/search?q=cfm+id&first=69&format=rss 
/?FORM=Z9FD1 
javascript:void(0); 
/account/general?ru=http%3a%2f%2fwww.bing.com%2fsearch%3fq%3dcfm+id%26first%3d69&FORM=SEFD 
/?scope=web&FORM=HDRSC1 
/images/search?q=cfm+id&FORM=HDRSC2 
/videos/search?q=cfm+id&FORM=HDRSC3 

Basically, I need a filter that lets me grab only the valid result links from Bing and drops all of the junk that comes from Bing's side.
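Something like this rough sketch is the shape of blacklist I have in mind (the BLACKLIST list and keep_link helper are just placeholder names I made up, built from the same substrings as my elif chain above, and I have not tested it):

BLACKLIST = [
    "bing.com", "wlflag.ico", "/account/", "/?FORM", "javascript:",
    "go.microsoft.com/fwlink", "g.msn.com", "onlinehelp.microsoft.com",
    "feedback.discoverbing.com", "/?scope=web", "/explore?q=",
    "/images/", "/videos/", "/maps/", "/news/",
]

def keep_link(href):
    # a link survives only if none of the blacklisted substrings appear in it
    return not any(bad in href for bad in BLACKLIST)

links = [href for href in links if keep_link(href)]
for link in links:
    print link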

Thanks a lot, BK. P.S. Sorry if my explanation is poor.

Answer


Have you tried the HTML-parsing route with beautifulsoup, lxml or html5lib (lxml.etree preferred), querying the HTML with CSS/XPath? In pseudocode:

html = htmlparse.parse(open(url)) 
hrefs = [] 

for a in html.xpath('//a'):
    if a['href'].startswith('http://') or a['href'].startswith('https://'):
        hrefs.append(a['href'])

This is of course pseudocode; you should adjust it to whichever of beautifulsoup, lxml or html5lib you actually use.
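For instance, here is a minimal concrete sketch of the same idea with lxml.html (the external_links helper is just mine for illustration, and it is untested against Bing's current markup):

import lxml.html

def external_links(url):
    # parse the results page and keep only absolute http(s) links;
    # most of Bing's own navigation links are relative or javascript:, so they drop out
    doc = lxml.html.parse(url).getroot()
    hrefs = []
    for a in doc.xpath('//a[@href]'):
        href = a.get('href')
        if href.startswith('http://') or href.startswith('https://'):
            hrefs.append(href)
    return hrefs

for link in external_links('http://www.bing.com/search?q=cfm+id&first=1'):
    print link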

If you are looking for something more like sanitising/whitelist-based cleaning of HTML pages, you might like to use CleanText; it can be customised further to filter on attribute regexes, but that is left as an exercise ;)