Python scraper ignores the links on a certain page

So I wrote a scraper for a friend that walks through a large set of search-result pages, pulls all the links off each page, checks whether they are already in an output file, and adds them if they are not. It took a lot of debugging, but it works great! Unfortunately, the little thing is really picky about which anchor tags it considers important enough. The Python scraper ignores the links on the page mentioned below.

Here is the code:

#!C:\Python27\Python.exe
from bs4 import BeautifulSoup
from urlparse import urljoin #urljoin is a function in the urlparse module
import urllib2
import requests #not necessary, but kept here in case of future additions to the code

urls_filename = "myurls.txt" #input text file: the list of urls or objects to scan
output_filename = "output.txt" #output file that you will export to Excel
keyword = "skin" #optional keyword, not used by this script; ignore

with open(urls_filename, "r") as f:
    url_list = f.read() #open the input text file and read its contents

with open(output_filename, "w") as f:
    for url in url_list.split("\n"): #split the file into separate lines so each url can be scanned
        hdr = {'User-Agent': 'Mozilla/5.0'} #this (attempts) to tell the webpage that the program is a Firefox browser
        try:
            response = urllib2.urlopen(url) #open the url from the text file
        except:
            print "Could not access", url
            continue
        page = response.read() #assign the opened page to a variable; like algebra, X = page opened
        soup = BeautifulSoup(page) #feed the page to BeautifulSoup so it can analyze it
        urls_all = soup('a') #collect all the anchor tags on the page
        for link in urls_all:
            if 'href' in dict(link.attrs):
                url = urljoin(url, link['href']) #combine a relative link, e.g. "/support/contactus.html", with the domain
            if url.find("'") != -1: continue #skip any url that contains a single quote
            url = url.split('#')[0] #drop any fragment identifier
            if url[0:4] == 'http' and url not in output_filename: #keep only http(s) urls; note this tests against the file name string, not the file's contents
                f.write(url + "\n") #write the url to output_filename

It works great, except on the following page: "tvotech.asp?Submit=List&ID=796" at https://research.bidmc.harvard.edu/TVO/tvotech.asp

That page has a number of anchors on it, and the scraper ignores them outright. The only anchor that makes it into my output file is the homepage itself. That is strange, because looking at the page source the anchors are perfectly standard: they have an 'a' and an 'href', and I see no reason for bs4 to pass over them and include only the main link. Please help. I tried removing the http from line 30 (the url[0:4] == 'http' check) or changing it to https, and that just removed all of the results; even the homepage no longer appeared in the output.

Answer


This is caused by one of the links having a mailto: in its href. It then gets assigned to the url variable and breaks the joining of all of the remaining links, which in turn fail the url[0:4] == 'http' condition. The href looks like this:

mailto:[email protected]?subject=Question about TVO Available Technology Abstracts 
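
For illustration, here is a minimal sketch (my addition, not part of the original post) of why one stray mailto: poisons every iteration after it: urljoin returns an absolute mailto: URL unchanged, and once that string has overwritten url, later relative hrefs no longer resolve to http addresses. The /TVO/contact.html path below is made up for the demo:

from urlparse import urljoin

base = "https://research.bidmc.harvard.edu/TVO/tvotech.asp"

# an absolute mailto: href replaces the base entirely
print urljoin(base, "mailto:[email protected]")
# -> mailto:[email protected]

# mailto is not a relative-capable scheme, so joining against it
# just hands back the raw path, which then fails the url[0:4] == 'http' test
print urljoin("mailto:[email protected]", "/TVO/contact.html")
# -> /TVO/contact.html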

You should either filter those out, or avoid reusing the same url variable inside the loop; note the change to url1:

for link in urls_all:
    if 'href' in dict(link.attrs):
        url1 = urljoin(url, link['href']) #combine the relative link, e.g. "/support/contactus.html", with the domain
    if url1.find("'") != -1: continue #skip any url that contains a single quote
    url1 = url1.split('#')[0] #drop any fragment identifier
    if url1[0:4] == 'http' and url1 not in output_filename: #keep only http(s) urls; note this tests against the file name string, not the file's contents
        f.write(url1 + "\n") #write the url to output_filename
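
And for the "filter it out" option, a minimal sketch of the inner loop under my own naming (candidate is a hypothetical variable, and the scheme check is my addition, not the original answer's code) that drops any non-web href before it can clobber the base url:

from urlparse import urljoin, urlparse

for link in urls_all:
    if 'href' not in dict(link.attrs):
        continue
    candidate = urljoin(url, link['href'])
    # keep only web links; this drops mailto:, javascript:, ftp:, etc.
    if urlparse(candidate).scheme not in ('http', 'https'):
        continue
    candidate = candidate.split('#')[0] #drop any fragment identifier
    f.write(candidate + "\n")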