2017-01-24 91 views

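Python BeautifulSoup: extracting titles with a web crawler

I want to extract the title from the image. I managed to extract the URL, but I don't know how to write the code that extracts the image's title.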

Code:

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        # Build the listing URL for the current results page.
        url = 'http://www.gumtree.com.au/s-cars-vans-utes/melbourne/page-' + str(page) + '/c1832013001317'
        source_code = requests.get(url)
        plain_text = source_code.text
        # Name the parser explicitly so bs4 does not have to guess one.
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.find_all('a', {'class': 'ad-listing_title-link'}):
            # href already starts with '/', so the base URL has no trailing slash.
            href = 'http://www.gumtree.com.au' + link.get('href')
            print(href)
        page += 1

trade_spider(1)

The HTML is:

<a itemprop="url" class="ad-listing__thumb-link" name="1124692138" href="/s-ad/derrimut/cars-vans-utes/2015-toyota-86-coupe-12-month-warranty-/1124692138" data-ref="searchTopAd"> 
    <span id="r-image-TOP_AD-1124692138" title="2015 Toyota 86 Coupe **12 MONTH WARRANTY** Derrimut Brimbank Area Preview" class="j-responsive-image ad-listing__thumb" data-index="1">...</span> 
</a> 

The first line holds the href, but what I want is the title attribute of the span element shown in the HTML above.

Thanks!

Post your code as text, not as an image.

Can you add the URL here? It is hard to work from a picture of code.

Answer

link.span.get('title') 

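Use dot access (link.span) to reach the first span nested inside the link, then call get('title') to read its title attribute.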
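
As a rough, self-contained sketch of the line above, the snippet below parses the HTML fragment from the question instead of fetching the live page. The helper name thumb_titles is made up for illustration, and the class name ad-listing__thumb-link is taken from that fragment; the real page may differ.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question (the span body is elided there as "...").
html = '''
<a itemprop="url" class="ad-listing__thumb-link" name="1124692138" href="/s-ad/derrimut/cars-vans-utes/2015-toyota-86-coupe-12-month-warranty-/1124692138" data-ref="searchTopAd">
    <span id="r-image-TOP_AD-1124692138" title="2015 Toyota 86 Coupe **12 MONTH WARRANTY** Derrimut Brimbank Area Preview" class="j-responsive-image ad-listing__thumb" data-index="1">...</span>
</a>
'''

soup = BeautifulSoup(html, 'html.parser')


def thumb_titles(soup):
    """Return (href, title) pairs for every thumbnail link (hypothetical helper)."""
    results = []
    for link in soup.find_all('a', {'class': 'ad-listing__thumb-link'}):
        span = link.span  # first <span> nested inside the link, or None
        title = span.get('title') if span is not None else None
        results.append((link.get('href'), title))
    return results


pairs = thumb_titles(soup)
print(pairs)
```

The None check is there because link.span is None when a link has no span inside it, and calling get on None would raise an AttributeError.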

Use a regex to match the string in the attribute:

import re  
soup.find('span', id=re.compile(r'r-image')) 
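
To show how this copes with a number that changes per post, here is a small sketch. The first id comes from the question; the second id, the "unrelated" span, and all three titles are invented placeholders. The pattern is a slightly stricter variant of the one above (fixed prefix plus digits), which is an assumption about how the ids are formed.

```python
import re
from bs4 import BeautifulSoup

# First id is from the question; the other spans and all titles are made up.
html = '''
<span id="r-image-TOP_AD-1124692138" title="First ad">...</span>
<span id="r-image-TOP_AD-1000000001" title="Second ad">...</span>
<span id="unrelated" title="Not an ad">...</span>
'''
soup = BeautifulSoup(html, 'html.parser')

# Fixed prefix followed by any run of digits, so the number is never hard-coded.
top_ad_id = re.compile(r'^r-image-TOP_AD-\d+$')

# find_all returns every span whose id matches the compiled pattern.
titles = [span.get('title') for span in soup.find_all('span', id=top_ad_id)]
print(titles)
```

soup.find would return only the first match; find_all is used here so that every post on the page is collected.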

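OK, I managed to get it working with link.get('title'). If I wanted to reference the 'id' instead, e.g. 'r-image-TOP_AD-1124692138', how could I use it when the number after -TOP_AD- changes for every post? – Chris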


Awesome, thank you! – Chris