2017-02-19 45 views

I've been learning Python as my first language for a few months and am trying to build a web scraper that generates its own list of URLs to crawl, rather than relying on URLs I feed it.

I've identified which parts of the site contain the URLs I need, and I think I need two lists to accomplish this.

The first is a list of city URLs, and the second is a list of unit URLs found on those city pages. The unit URLs are what I ultimately want to iterate over and scrape data from. So far I have the following code:

import urllib.request
from bs4 import BeautifulSoup

def get_cities():
    city_sauce = urllib.request.urlopen('the_url')
    city_soup = BeautifulSoup(city_sauce, 'html.parser')
    the_city_links = []
    for city in city_soup.findAll('div', class_="city-location-menu"):
        for a in city.findAll('a', href=True, text=True):
            the_city_links.append('first_half_of_url' + a['href'])
    return the_city_links

When I print this out it shows all the URLs I need, so I believe I've successfully created the list of city links.

The second part is as follows:

def get_units():
    for theLinks in get_cities():
        unit_sauce = urllib.request.urlopen(theLinks)
        unit_soup = BeautifulSoup(unit_sauce, 'html.parser')
        the_unit_links = []
        for unit in unit_soup.findAll('div', class_="btn white-green icon-right-open-big"):
            for aa in unit.findAll('a', href=True, text=True):
                the_unit_links.append(aa)
        return the_unit_links

When printed, this simply returns []. I'm not sure where I've gone wrong; any help would be much appreciated!

Part 2, revised:

def get_units():
    for the_city_link in get_cities():
        unit_sauce = urllib.request.urlopen(the_city_link)
        unit_soup = BeautifulSoup(unit_sauce, 'html.parser')
        the_unit_links = []
        for unit in unit_soup.findAll('div', class_="btn white-green icon-right-open-big"):
            for aa in unit.findAll('a', href=True, text=True):
                the_unit_links.append(aa)
        return the_unit_links

Can you share which links you're trying to fetch? You may have missed grabbing something, or you may be targeting the wrong class. –


I put the URL in 'city_sauce'. I want 'unit_sauce' to open each of those links, parse them into 'unit_soup', then go into each page and grab the hrefs from the 'div' with class_="btn white-green icon-right-open-big", and append them to the 'the_unit_links' list, which my scraper will then iterate over. Any ideas? @PiyushS.Wanare I've slightly revised the second part, see the revision. – Maverick


It would be better if you put that data handling in a function. –

Answers

# Crawls main site to get a list of city URLs
def getCityLinks():
    city_sauce = urllib.request.urlopen('the_url')
    city_soup = BeautifulSoup(city_sauce, 'html.parser')
    the_city_links = []

    for city in city_soup.findAll('div', class_="city-location-menu"):
        for a in city.findAll('a', href=True, text=True):
            the_city_links.append('the_url' + a['href'])
    # print(the_city_links)
    return the_city_links

# Crawls each of the city web pages to get a list of unit URLs
def getUnitLinks():
    for city_link in getCityLinks():
        unit_sauce = urllib.request.urlopen(city_link)
        unit_soup = BeautifulSoup(unit_sauce, 'html.parser')
        the_unit_links = []
        for unit_href in unit_soup.findAll('a', class_="btn white-green icon-right-open-big", href=True):
            the_unit_links.append('the_url' + unit_href['href'])
        yield the_unit_links
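Since a function with yield is a generator that produces one list of unit links per city page, the caller has to iterate it (or flatten it) rather than treat the result as a single list. A minimal sketch of the consumption pattern, using a hypothetical stand-in generator instead of live HTTP requests:

```python
from itertools import chain

# Hypothetical stand-in for getUnitLinks(): yields one list of unit URLs
# per city page, mirroring the one-yield-per-city shape of the answer.
def get_unit_links_demo():
    pages = {
        "city/a": ["unit/1", "unit/2"],
        "city/b": [],              # a city with no units still yields a list
        "city/c": ["unit/3"],
    }
    for city, units in pages.items():
        yield ["the_url/" + u for u in units]

# Flatten the per-city lists into one stream of unit URLs to scrape.
all_unit_links = list(chain.from_iterable(get_unit_links_demo()))
# all_unit_links is now every unit URL across all cities, in order.
```
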

Assuming I understand how you're using this: your function returns after the first link from get_cities(), which may have no units. I think you need to set the_unit_links = [] at the start of the function, and then move the function's return line out one level of indentation, so it only returns once all the links from get_cities have been scraped.
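The restructuring suggested above (accumulator initialised once, return dedented out of the loop) can be sketched as follows, with a hypothetical fetch_unit_urls callable standing in for the urlopen + BeautifulSoup step:

```python
# Sketch of the suggested fix: fetch_unit_urls is a hypothetical callable
# standing in for the real urlopen + BeautifulSoup parsing of one city page.
def get_units_fixed(city_links, fetch_unit_urls):
    the_unit_links = []            # initialise once, before the loop
    for link in city_links:
        the_unit_links.extend(fetch_unit_urls(link))
    return the_unit_links          # return only after every city is processed

# With a fake fetcher: the first "city" has no units, yet links from
# later cities are still collected instead of an early return of [].
links = get_units_fixed(
    ["city/a", "city/b"],
    lambda c: [] if c == "city/a" else [c + "/unit/1"],
)
```
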


Thanks for the suggestion; unfortunately that returns [] as well! – Maverick

def getLinks():
    city_sauce = urllib.request.urlopen('the_url')
    city_soup = BeautifulSoup(city_sauce, 'html.parser')
    the_city_links = []

    for city in city_soup.findAll('div', class_="city-location-menu"):
        for a in city.findAll('a', href=True, text=True):
            the_city_links.append('first_half_of_url' + a['href'])
    # return the_city_links

    # print(the_city_links)

    for city_link in the_city_links:
        unit_sauce = urllib.request.urlopen(city_link)
        unit_soup = BeautifulSoup(unit_sauce, 'html.parser')
        the_unit_links = []
        for unit in unit_soup.findAll('div', class_="btn white-green icon-right-open-big"):
            for aa in unit.findAll('a', href=True, text=True):
                the_unit_links.append(aa)
        return the_unit_links

Note: print the_city_links and check that you get the expected output, then run the second loop to fetch the corresponding unit_links.
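If the_city_links prints correctly but the_unit_links still comes back empty, a likely cause is that the "btn white-green icon-right-open-big" class sits on the &lt;a&gt; tag itself rather than on a wrapping &lt;div&gt;, so searching divs finds nothing. A stdlib-only sketch of pulling hrefs from anchors by class (the HTML fixture here is hypothetical):

```python
from html.parser import HTMLParser

# Collects href values from <a> tags whose class attribute matches exactly.
# This illustrates matching the class on the anchor itself, not a parent div.
class UnitLinkParser(HTMLParser):
    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == self.wanted_class and "href" in attrs:
            self.links.append(attrs["href"])

# Hypothetical snippet of a city page: the class is on the <a>, no wrapping div.
page = '<a class="btn white-green icon-right-open-big" href="/unit/1">View unit</a>'
parser = UnitLinkParser("btn white-green icon-right-open-big")
parser.feed(page)
# parser.links now holds the anchor's href.
```
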