2017-01-01 135 views

Scraping with Python: I recently posted a question about scraping hrefs from a directory, and @alecxe helped a ton and showed me some new methods for extracting data, but I'm stuck again. I want to scrape the data for each link on Yellow Pages so that I can get to the Yellow Pages page that has more data for each business. I want to add a variable called "url" and get the href for each business: not the actual business website, but the business's Yellow Pages page. I've tried all sorts of things but nothing seems to work. The href is under `class="business-name"`.

import csv 
import requests 
from bs4 import BeautifulSoup 


with open('cities_louisiana.csv', 'r') as cities:
    lines = cities.read().splitlines()

for city in lines:
    print(city)

# NOTE: "count" is not defined at this point, and the quoting in the original
# line was unbalanced; the space in "baton rouge" also has to be URL-encoded.
url = "http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=baton+rouge+LA&page=" + str(count)

for city in lines:
    for x in range(0, 50):
        page_url = "http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=baton+rouge+LA&page=" + str(x)
        print(page_url)
        page = requests.get(page_url)
        soup = BeautifulSoup(page.text, "html.parser")
        for result in soup.select(".search-results .result"):
            try:
                name = result.select_one(".business-name").get_text(strip=True, separator=" ")
            except AttributeError:
                pass
            try:
                streetAddress = result.select_one(".street-address").get_text(strip=True, separator=" ")
            except AttributeError:
                pass
            try:
                city = result.select_one(".locality").get_text(strip=True, separator=" ")
                city = city.replace(",", "")
                state = "LA"
                zip_code = result.select_one('span[itemprop$="postalCode"]').get_text(strip=True, separator=" ")
            except AttributeError:
                pass
            try:
                telephone = result.select_one(".phones").get_text(strip=True, separator=" ")
            except AttributeError:
                telephone = "No Telephone"
            try:
                categories = result.select_one(".categories").get_text(strip=True, separator=" ")
            except AttributeError:
                categories = "No Categories"
            completeData = name, streetAddress, city, state, zip_code, telephone, categories
            print(completeData)
            with open("yellowpages_businesses_louisiana.csv", "a", newline="") as write:
                wrt = csv.writer(write)
                wrt.writerow(completeData)
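As a side note on the troublesome url line: the query string can be built with `urlencode` from the standard library instead of manual concatenation, which avoids invalid escapes like `baton%rouge`. This is just a sketch; the helper name `search_url` and the page range are illustrative, not part of the original question.

```python
from urllib.parse import urlencode


def search_url(terms, location, page):
    # urlencode escapes spaces and punctuation correctly, so
    # "baton rouge, LA" becomes "baton+rouge%2C+LA" in the query string.
    query = urlencode({
        "search_terms": terms,
        "geo_location_terms": location,
        "page": page,
    })
    return "http://www.yellowpages.com/search?" + query


url = search_url("businesses", "baton rouge, LA", 1)
print(url)
# http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=baton+rouge%2C+LA&page=1
```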

Answer


Multiple things you should implement:

  • extract the href attribute of the business link element having the business-name class - in BeautifulSoup this can be done by treating the element like a dictionary
  • make the link absolute using urljoin()
  • make a request to the business page while maintaining the web-scraping session
  • parse the business page with BeautifulSoup as well and extract the desired information
  • add a time delay to avoid hitting the site too often

Complete working example that prints out the business name from the search results page and the business description from the business profile page:

from urllib.parse import urljoin 

import requests 
import time 
from bs4 import BeautifulSoup 


url = "http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=baton+rouge+LA&page=1" 


with requests.Session() as session: 
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'} 

    page = session.get(url) 
    soup = BeautifulSoup(page.text, "html.parser") 
    for result in soup.select(".search-results .result"): 
        business_name_element = result.select_one(".business-name") 
        name = business_name_element.get_text(strip=True, separator=" ") 

        link = urljoin(page.url, business_name_element["href"]) 

        # extract additional business information 
        business_page = session.get(link) 
        business_soup = BeautifulSoup(business_page.text, "html.parser") 
        description = business_soup.select_one("dd.description").text 

        print(name, description) 

        time.sleep(1)  # time delay to not hit the site too often 
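The two key steps, dictionary-style attribute access and urljoin(), can be exercised offline against a small HTML fragment. The markup below is a made-up stand-in; real Yellow Pages markup will differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical search result snippet, mimicking the classes the answer selects on.
html = """
<div class="search-results">
  <div class="result">
    <a class="business-name" href="/baton-rouge-la/mip/some-business-12345">Some Business</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
link_element = soup.select_one(".search-results .result .business-name")

# Treating the element like a dictionary returns the attribute value ...
relative_href = link_element["href"]

# ... and urljoin() resolves it against the URL the page was fetched from.
absolute_url = urljoin("http://www.yellowpages.com/search?page=1", relative_href)

print(absolute_url)
# http://www.yellowpages.com/baton-rouge-la/mip/some-business-12345
```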

Very nice! I'm still quite new to Python and programming in general. Your solution was great even though I only made some small changes by adding `business_name_element = result.select_one(".business-name")` and `link = urljoin(page.url, business_name_element["href"])`. As I read your code I reverse-engineer it so it makes sense. Thanks for the help! –