Python Beautiful Soup: scraping URLs from a web page

I want to scrape URLs from an HTML page, and I am using Beautiful Soup. This is part of the HTML:

      <li style="display: block;"> 
           <article itemscope itemtype="http://schema.org/Article"> 
            <div class="col-md-3 col-sm-3 col-xs-12" > 
             <a href="/stroke?p=3083" class="article-image"> 
              <img itemprop="image" src="/FileUploads/Post/3083.jpg?w=300&h=160&mode=crop" alt="Banana" title="Good for health"> 
             </a> 
            </div> 

            <div class="col-md-9 col-sm-9 col-xs-12"> 
             <div class="article-content"> 

               <a href="/stroke"> 
                <img src="/assets/home/v2016/img/icon/stroke.png" style="float:left;margin-right:5px;width: 4%;"> 
               </a> 
              <a href="/stroke?p=3083" class="article-title"> 
               <div> 
                <h4 itemprop="name" id="playground"> 
Banana Good for health               </h4> 
               </div> 
              </a> 
              <div>            
               <div class="clear"></div> 
               <span itemprop="dateCreated" style="font-size:10pt;color:#777;"> 
                <i class="fa fa-clock-o" aria-hidden="true"></i> 
09/10              </span> 
              </div> 
              <p itemprop="description" class="hidden-phone"> 
               <a href="/stroke?p=3083"> 
                I love Banana. 
               </a> 
              </p> 
             </div> 
            </div> 
           </article> 
          </li>

My code:

import requests
from bs4 import BeautifulSoup

# Named "response" rather than "re" so the standard-library re module is not shadowed
response = requests.get('http://xxxxxx')
bs = BeautifulSoup(response.text, "html.parser")
for link in bs.find_all('a'):
    if link.has_attr('href'):
        print(link.attrs['href'])

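This prints every URL on the page, which is not what I want. I only want specific ones such as "/stroke?p=3083" in this example. How can I set that condition in Python? (I know "/stroke?p=3083" appears three times here, but I only need it once.)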

Another question: the URL is incomplete, so I need to combine it with "http://www.abcde.com" to get "http://www.abcde.com/stroke?p=3083". I know I can use paste in R, but how do I do this in Python? Thanks in advance! :)


2017-10-12 Makiyo

Just replace some_link in the scraper below with your page's URL and give it a go. I think you'll get the link you want, in its complete form.

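import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

some_link = 'http://xxxxxx'  # placeholder: replace with the URL of the page to scrape

res = requests.get(some_link).text
soup = BeautifulSoup(res, "lxml")  # needs the lxml package; "html.parser" also works
# Only the thumbnail link carries the article-image class, so each article appears once
for item in soup.select(".article-image"):
    print(urljoin(some_link, item['href']))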
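
As a rough offline check of the same idea, here is a minimal sketch that runs against a trimmed copy of the HTML from the question instead of a live page. The inlined `html` string, the `base_url` variable, and the use of `html.parser` in place of `lxml` are assumptions made for illustration only.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question, inlined so the example runs offline.
html = """
<li>
  <a href="/stroke?p=3083" class="article-image"><img src="/FileUploads/Post/3083.jpg"></a>
  <a href="/stroke"><img src="/assets/home/v2016/img/icon/stroke.png"></a>
  <a href="/stroke?p=3083" class="article-title"><h4>Banana Good for health</h4></a>
  <p><a href="/stroke?p=3083">I love Banana.</a></p>
</li>
"""

base_url = "http://www.abcde.com"  # site root named in the question

soup = BeautifulSoup(html, "html.parser")

# Selecting on the class keeps one <a> per article, so the duplicate
# title and description links are never collected in the first place.
links = [urljoin(base_url, a["href"]) for a in soup.select("a.article-image")]
print(links)
```

Selecting by class sidesteps the duplicates rather than removing them afterwards, which only works as long as the page keeps exactly one `article-image` link per article.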


2017-10-12 08:32:39 SIM

"Another question: the URL is incomplete, so I need to combine it with "http://www.abcde.com" to get "http://www.abcde.com/stroke?p=3083". I know I can use paste in R, but how do I do this in Python? Thanks in advance! :)"

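# link is the <a> tag from the loop in the question; concatenate its href, not the tag itself
full_url = 'http://www.abcde.com' + link['href']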
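
Plain string concatenation is enough when every href starts with "/", as in the question's HTML. If a page mixes relative and absolute links, `urllib.parse.urljoin` from the standard library is a more forgiving option. The sketch below compares the two; `to_absolute` is a hypothetical helper name, not something from the thread.

```python
from urllib.parse import urljoin

BASE = "http://www.abcde.com"  # site root named in the question


def to_absolute(href):
    """Resolve an href against BASE (hypothetical helper for illustration)."""
    return urljoin(BASE, href)


# Root-relative path: concatenation and urljoin agree.
concatenated = BASE + "/stroke?p=3083"
joined = to_absolute("/stroke?p=3083")
print(concatenated)
print(joined)

# Already-absolute link: urljoin leaves it alone, while concatenation
# would glue the two URLs together into something invalid.
external = to_absolute("http://other.example/page")
print(external)
```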


2017-10-12 08:20:23

You already have most of it right. Collect the links as follows (this is what you are doing now, just written as a list comprehension):

urls = [a['href'] for a in bs.find_all('a') if a.has_attr('href')]

This gives you the URLs. To take one of them and append it to the abcde URL, you can simply do the following:

if urls: 
    new_url = 'http://www.abcde.com{}'.format(urls[0])
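
Taking `urls[0]` returns whichever link happens to come first on the page. To address the "only once" part of the question more directly, one possible approach is to keep only hrefs containing "?p=" and then drop duplicates while preserving order. This is a sketch under assumptions: the "?p=" filter is a guess at what marks an article link on this site, and the inlined `html` is a trimmed copy of the question's markup.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question, inlined so the example runs offline.
html = """
<li>
  <a href="/stroke?p=3083" class="article-image"><img src="/FileUploads/Post/3083.jpg"></a>
  <a href="/stroke"><img src="/assets/home/v2016/img/icon/stroke.png"></a>
  <a href="/stroke?p=3083" class="article-title"><h4>Banana Good for health</h4></a>
  <p><a href="/stroke?p=3083">I love Banana.</a></p>
</li>
"""

bs = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that actually have an href attribute.
hrefs = [a["href"] for a in bs.find_all("a", href=True)]

# Assumption: article links are the ones carrying a "?p=" query parameter.
article_hrefs = [h for h in hrefs if "?p=" in h]

# dict.fromkeys drops repeats while keeping first-seen order.
unique_hrefs = list(dict.fromkeys(article_hrefs))

full_urls = ["http://www.abcde.com{}".format(h) for h in unique_hrefs]
print(full_urls)
```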


2017-10-12 08:31:07 CHURLZ
