2014-09-20 40 views

I would like to pull the company name, address, and zip code from [www.quicktransportsolutions.com][1]. I wrote the code below to crawl the site and return the information I need. (Beautiful Soup, nested divs — adding an extra function.)

import requests 
from bs4 import BeautifulSoup 

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('div', {'class': 'well well-sm'}):
            title = link.string
            print(link)
        page += 1  # advance the counter so the while loop can end

trade_spider(1)

After running the code, I can see the information I want, but I'm confused about how to pull it out without all of the unrelated information printing as well.

In place of the

print(link)

above, I thought I could have link.string pull the company's name, but that failed. Any suggestions?
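For what it's worth, link.string fails here because .string only returns text when a tag has exactly one child; the well div contains several nested tags, so it returns None. A minimal self-contained illustration (the HTML snippet below is modeled on the site's markup, not fetched from it):

```python
from bs4 import BeautifulSoup

# Illustrative snippet shaped like one company div from the page
html = ('<div class="well well-sm"><b>2 OLD BOYS TRUCKING LLC</b><br/>'
        '<span itemprop="name"><b>2 OLD BOYS TRUCKING</b></span></div>')
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "well well-sm"})

print(div.string)  # None: the div has several children, so .string gives up
print(div.find("span", itemprop="name").get_text())  # 2 OLD BOYS TRUCKING
```

Drilling down to the specific child tag (or using get_text()) is the usual way around this.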

Output:

<div class="well well-sm"> 
<b>2 OLD BOYS TRUCKING LLC</b><br><a href="/truckingcompany/missouri/2-old-boys-trucking-usdot-2474795.php" itemprop="url" target="_blank" title="Missouri Trucking Company 2 OLD BOYS TRUCKING ADRIAN"><u><span itemprop="name"><b>2 OLD BOYS TRUCKING</b></span></u></a><br> <span itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress"><a href="http://maps.google.com/maps?q=227+E+2ND,ADRIAN,MO+64720&amp;ie=UTF8&amp;z=8&amp;iwloc=addr" target="_blank"><span itemprop="streetAddress">227 E 2ND</span></a> 
<br> 
<span itemprop="addressLocality">Adrian</span>, <span itemprop="addressRegion">MO</span> <span itemprop="postalCode">64720</span></br></span><br> 
       Trucks: 2  Drivers: 2<br> 
<abbr class="initialism" title="Unique Number to identify Companies operating commercial vehicles to transport passengers or haul cargo in interstate commerce">USDOT</abbr> 2474795    <br><span class="glyphicon glyphicon-phone"></span><b itemprop="telephone"> 417-955-0651</b> 
<br><a href="/inspectionreports/2-old-boys-trucking-usdot-2474795.php" itemprop="url" target="_blank" title="Trucking Company 2 OLD BOYS TRUCKING Inspection Reports"> 

Everyone,

Thanks for your help so far... I would like to add one more feature to my little crawler. I wrote the following code:

def Crawl_State_Page(max_pages):
    url = 'http://www.quicktransportsolutions.com/carrier/alabama/trucking-companies.php'
    while i <= len(url):
        response = requests.get(url)
        soup = BeautifulSoup(response.content)
        table = soup.find("table", {"class": "table table-condensed table-striped table-hover table-bordered"})
        for link in table.find_all(href=True):
            print(link['href'])

Output: 

    abbeville.php 
    adamsville.php 
    addison.php 
    adger.php 
    akron.php 
    alabaster.php 
    alberta.php 
    albertville.php 
    alexander-city.php 
    alexandria.php 
    aliceville.php 


    alpine.php 

... # goes all the way to Z; I cut the output short for spacing.

What I'm trying to accomplish here is to pull all of the hrefs ending in city.php and write them to a file... but right now, I'm stuck in an infinite loop where it keeps cycling through the same URL. Any tips on how to increment past it? My end goal is to create another function that feeds back into my trade_spider with www.site.com/state/city.php and loops through all 50 states... something like:

while i < len(states):
    url = "http://www.quicktransportsolutions.com/carrier/" + states[i] + "/" + cities[i]

If that works, it would then loop into my trade_spider function, pulling all of the information I need.
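As a sketch of that idea (the state and city lists below are made-up placeholders; in practice they would come from the hrefs collected off each state's trucking-companies.php page), a pair of for loops terminates on its own and avoids the counter bookkeeping entirely:

```python
# Hypothetical state -> cities mapping, purely for illustration
cities_by_state = {
    "alabama": ["abbeville", "adamsville"],
    "missouri": ["adrian"],
}

urls = []
for state, cities in cities_by_state.items():
    for city in cities:
        # for loops end when the data runs out, so no manual i is needed
        urls.append("http://www.quicktransportsolutions.com/carrier/"
                    + state + "/" + city + ".php")

print(len(urls))  # 3 — one URL per (state, city) pair
```

Each URL built this way could then be handed straight to trade_spider.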

But before I get to that part, I need a little help getting out of my infinite loop. Any suggestions? Or foreseeable problems that I'm going to run into?
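On the infinite loop itself: while i <= len(url) never ends because i is never incremented, and len(url) is the character count of the URL string, not a page count. Since each state has a single listing page, no while loop is needed at all — iterate over the links and write them out. A minimal sketch against canned HTML (the table markup below is a stand-in for the real page):

```python
from bs4 import BeautifulSoup

# Stand-in for response.content from a state page, which uses the same
# kind of table full of city links
html = """<table class="table">
<tr><td><a href="abbeville.php">Abbeville</a></td></tr>
<tr><td><a href="adamsville.php">Adamsville</a></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# The for loop ends on its own when the links run out
hrefs = [a["href"] for a in table.find_all(href=True)]
with open("city_links.txt", "w") as f:
    f.write("\n".join(hrefs))

print(hrefs)  # ['abbeville.php', 'adamsville.php']
```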

I tried to create a crawler that would cycle through every link on the page and then, if it found content that trade_spider could crawl, write it to a file... however, that's a bit beyond my skill set for now. So, I'm trying this method instead.

Answer


I would rely on the itemprop attributes of the different tags for each company. They are conveniently set to name, url, address, etc.:

import requests 
from bs4 import BeautifulSoup 

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
        response = requests.get(url)
        soup = BeautifulSoup(response.content)
        for company in soup.find_all('div', {'class': 'well well-sm'}):
            link = company.find('a', itemprop='url').get('href').strip()
            name = company.find('span', itemprop='name').text.strip()
            address = company.find('span', itemprop='address').text.strip()

            print(name, link, address)
            print("----")
        page += 1  # stop after max_pages iterations

trade_spider(1)

Prints:

2 OLD BOYS TRUCKING /truckingcompany/missouri/2-old-boys-trucking-usdot-2474795.php 227 E 2ND 

Adrian, MO 64720 
---- 
HILLTOP SERVICE & EQUIPMENT /truckingcompany/missouri/hilltop-service-equipment-usdot-1047604.php ROUTE 2 BOX 453 

Adrian, MO 64720 
----
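As a side note, the itemprop lookups above can be checked offline against the markup pasted in the question, trimmed here to one company:

```python
from bs4 import BeautifulSoup

# Trimmed version of the output snippet shown in the question
html = '''<div class="well well-sm">
<a href="/truckingcompany/missouri/2-old-boys-trucking-usdot-2474795.php" itemprop="url">
<span itemprop="name"><b>2 OLD BOYS TRUCKING</b></span></a>
<span itemprop="address"><span itemprop="streetAddress">227 E 2ND</span>
<span itemprop="addressLocality">Adrian</span>, <span itemprop="addressRegion">MO</span>
<span itemprop="postalCode">64720</span></span></div>'''

company = BeautifulSoup(html, "html.parser").find("div", {"class": "well well-sm"})
print(company.find("a", itemprop="url")["href"])          # the profile link
print(company.find("span", itemprop="name").text.strip())  # the company name
print(company.find("span", itemprop="postalCode").text)    # the zip code
```

Because the site marks its data up with schema.org microdata, these attribute filters are much more stable than matching on visual classes or tag position.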