
I am trying to create a small script that simply takes a given website and a keyword, follows all the links a certain number of times (only links on the website's own domain), and finally searches all the links it found for the keyword, returning any successful matches. The ultimate goal is that if you remember a website where you once saw something, and you know a good keyword that the page contained, this program might be able to help find the link to the lost page. Now my bug: while looping through all these pages, extracting their URLs, and building a list of them, it somehow seems to end up repeatedly duplicating and removing the same links from the list. I did add a safeguard for this, but it does not seem to be working as intended. I feel like some URLs are being wrongly duplicated into the list and end up being checked countless times. Link scraper redundancy?

Here is my full code (sorry about the length); the problem area seems to be the for loop at the very end:

import bs4, requests, sys

def getDomain(url):
    '''Extracts the bare domain name from a URL string.'''
    if "www" in url:
        domain = url[url.find('.')+1:url.rfind('.')]
    elif "http" in url:
        domain = url[url.find("//")+2:url.rfind('.')]
    else:
        domain = url[:url.rfind(".")]
    return domain

def findHref(html):
    '''Will find the link in a given BeautifulSoup match object (as a string).'''
    link_start = html.find('href="')+6
    link_end = html.find('"', link_start)
    return html[link_start:link_end]

def pageExists(url):
    '''Returns True if the URL string returns a 200 response and
    doesn't redirect to a DNS search page.'''
    response = requests.get(url)
    try:
        response.raise_for_status()
        if response.text.find("dnsrsearch") >= 0:
            print response.text.find("dnsrsearch")
            print "Website does not exist"
            return False
    except Exception as e:
        print "Bad response:", e
        return False
    return True

def extractURLs(url):
    '''Returns a list of URLs found at url that belong to the same domain.'''
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text)
    matches = soup.find_all('a')
    urls = []
    for index, link in enumerate(matches):
        match_url = findHref(str(link).lower())
        if "." in match_url:
            if domain not in match_url:  # relies on the global `domain` set below
                print "Removing", match_url
            else:
                urls.append(match_url)
        else:
            urls.append(url + match_url)  # treat as a relative link
    return urls

def searchURL(url):
    '''Search url for keyword.'''
    pass

print "Enter homepage:(no http://)"
homepage = "http://" + raw_input("> ")
if not pageExists(homepage):
    sys.exit()
domain = getDomain(homepage)

print "Enter keyword:"
#keyword = raw_input("> ")
print "Enter maximum branches:"
max_branches = int(raw_input("> "))

links = [homepage]
for n in range(max_branches):
    for link in links:
        results = extractURLs(link)
        for result in results:
            if result not in links:
                links.append(result)

Part of the output (roughly 0.000000000001% of it):

Removing /store/apps/details?id=com.handmark.sportcaster 
Removing /store/apps/details?id=com.handmark.sportcaster 
Removing /store/apps/details?id=com.mobisystems.office 
Removing /store/apps/details?id=com.mobisystems.office 
Removing /store/apps/details?id=com.mobisystems.office 
Removing /store/apps/details?id=com.mobisystems.office 
Removing /store/apps/details?id=com.mobisystems.office 
Removing /store/apps/details?id=com.mobisystems.office 
Removing /store/apps/details?id=com.joelapenna.foursquared 
Removing /store/apps/details?id=com.joelapenna.foursquared 
Removing /store/apps/details?id=com.joelapenna.foursquared 
Removing /store/apps/details?id=com.joelapenna.foursquared 
Removing /store/apps/details?id=com.joelapenna.foursquared 
Removing /store/apps/details?id=com.joelapenna.foursquared 
Removing /store/apps/details?id=com.dashlabs.dash.android 
Removing /store/apps/details?id=com.dashlabs.dash.android 
Removing /store/apps/details?id=com.dashlabs.dash.android 
Removing /store/apps/details?id=com.dashlabs.dash.android 
Removing /store/apps/details?id=com.dashlabs.dash.android 
Removing /store/apps/details?id=com.dashlabs.dash.android 
Removing /store/apps/details?id=com.eweware.heard 
Removing /store/apps/details?id=com.eweware.heard 
Removing /store/apps/details?id=com.eweware.heard 

Answers


You are repeatedly looping over the same links multiple times with the outer loop:

for n in range(max_branches):
    for link in links:
        results = extractURLs(link)

I would also be careful about appending to the list you are iterating over, or you may well end up in an infinite loop.
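
A minimal sketch of the level-by-level crawl this answer points toward, reusing the asker's extractURLs; crawl, visited, and frontier are illustrative names, not part of the original code:

def crawl(homepage, max_branches):
    visited = set([homepage])   # every URL ever seen; checked before queueing
    frontier = [homepage]       # links discovered in the previous level
    for n in range(max_branches):
        next_frontier = []
        for link in frontier:   # only expand links found in the last level
            for result in extractURLs(link):
                if result not in visited:   # each URL is queued at most once
                    visited.add(result)
                    next_frontier.append(result)
        frontier = next_frontier
    return visited

Because new links go into next_frontier rather than the list being iterated, each level only expands pages discovered in the level before it, and the visited set guarantees no page is fetched twice.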


OK, I found a solution. All I did was change the links variable to a dictionary, with a value of 0 representing an unsearched link and 1 representing a searched one. I then iterate over a copy of the keys, which preserves the branching instead of madly chasing every link added during the loop. Finally, if a found link is not already in links, it is added and set to 0 to be searched.

links = {homepage: 0}
for n in range(max_branches):
    for link in links.keys()[:]:    # iterate over a snapshot of the keys
        if not links[link]:
            results = extractURLs(link)
            links[link] = 1         # mark this link as searched
            for result in results:
                if result not in links:
                    links[result] = 0
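
For comparison, the same deduplication can be expressed with an explicit queue that carries each link's depth, instead of flagging dictionary entries. This is only a sketch assuming the asker's extractURLs; the names are illustrative:

import collections

def crawl(homepage, max_branches):
    seen = set([homepage])
    queue = collections.deque([(homepage, 0)])  # (url, depth) pairs
    while queue:
        link, depth = queue.popleft()
        if depth >= max_branches:       # stop branching past the requested depth
            continue
        for result in extractURLs(link):
            if result not in seen:      # never queue the same URL twice
                seen.add(result)
                queue.append((result, depth + 1))
    return seen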