2016-09-18 28 views
-2

BS4 web scraping doesn't return anything

My code:

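import requests
import bs4

res = requests.get('https://www.flickr.com/photos/')
res.raise_for_status()  # stop early on an HTTP error

soup = bs4.BeautifulSoup(res.text, 'html.parser')
# Anchors whose href starts with /photos, inside the interaction div
linkItem = soup.select('div.photo-list-photo-interaction a[href^="/photos"]')
print(linkItem)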

Nothing is returned. When I inspect the element, the photos are inside <div class="photo-list-photo-interaction">, so the soup.select call above should have worked. But it doesn't. Any ideas?


It's href:

You can do it using CSS selectors with a regular expression. I must have accidentally deleted the "h". Still not working...

Answer

1

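If you look at the actual page source, you can see:

The URLs you see in your browser are created dynamically using CSS. Inside the source you can see background-image: url(//c6.staticflickr.com/9/8279/29103697453_ca811d0e07_z.jpg), which is what you need to get. Starting from the original code: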

In [1]: import requests 
In [2]: from bs4 import BeautifulSoup  
In [3]: import re 
In [4]: url_re = re.compile(r"url\(//(.*?)\)")

In [5]: res = requests.get('https://www.flickr.com/photos/') 

In [6]: soup = BeautifulSoup(res.text, 'html.parser') 

In [7]: urls = [url_re.search(d["style"]).group(1) for d in soup.select('div.view.photo-list-view div[style*="url(//"]')]

In [8]: print(urls) 
[u'c1.staticflickr.com/9/8385/29133157664_856aef9bc3_n.jpg', u'c5.staticflickr.com/9/8212/29128075804_0c166556c5_n.jpg', u'c3.staticflickr.com/9/8070/29138685794_984cf0a7f2.jpg', u'c3.staticflickr.com/9/8084/29465161650_4a1a160928.jpg', u'c5.staticflickr.com/9/8202/29642526492_357d7da694_n.jpg', u'c8.staticflickr.com/9/8164/29769287735_6523928d3d.jpg', u'c5.staticflickr.com/9/8313/29722500236_76d7bdbdd8.jpg', u'c8.staticflickr.com/9/8580/29776721935_f1ce85e967_n.jpg', u'c3.staticflickr.com/9/8364/29731556026_1f9d166845.jpg', u'c3.staticflickr.com/9/8178/29726200506_4439500c3d.jpg', u'c4.staticflickr.com/9/8405/29138108963_288aa48d06.jpg', u'c4.staticflickr.com/9/8565/29137949003_fb41535bd6.jpg', u'c5.staticflickr.com/9/8109/29735723636_0e494810a2.jpg', u'c3.staticflickr.com/9/8482/29662415042_5b0d05c8f3.jpg', u'c1.staticflickr.com/9/8346/29726788896_8c293fbdf7.jpg', u'c3.staticflickr.com/9/8524/29725439906_2b067f0212.jpg', u'c6.staticflickr.com/9/8303/29140293093_e355f8e8cd.jpg', u'c3.staticflickr.com/9/8011/29477607810_db00655d55.jpg', u'c1.staticflickr.com/9/8227/29465026920_36ab1c9637.jpg', u'c2.staticflickr.com/9/8014/29770085625_5163a499d1.jpg', u'c1.staticflickr.com/9/8090/29719718136_5f5ab26519.jpg', u'c1.staticflickr.com/9/8198/29645435472_f5284dedfd.jpg', u'c1.staticflickr.com/9/8692/29469829440_4481cea5e2.jpg', u'c4.staticflickr.com/9/8126/29142193643_f7a2100439.jpg', u'c3.staticflickr.com/9/8395/29646613162_8bbfcb4783.jpg', u'c1.staticflickr.com/9/8182/29482891560_66a7453201.jpg', u'c6.staticflickr.com/9/8078/29137768373_f8c8ebc474.jpg', u'c4.staticflickr.com/9/8142/29754486795_5517360b29.jpg', u'c1.staticflickr.com/9/8276/29138669944_c94fb64f7e.jpg', u'c7.staticflickr.com/9/8189/29658148142_44845e5842.jpg', u'c3.staticflickr.com/9/8168/29724488906_dd17d56015_n.jpg', u'c1.staticflickr.com/9/8450/29727877336_b9d852bc7b.jpg', u'c7.staticflickr.com/8/7471/29129926854_ceff45aaeb.jpg', u'c4.staticflickr.com/9/8298/29690071131_5a7589870d.jpg', 
u'c8.staticflickr.com/9/8003/29131670143_9cb629648a.jpg', u'c3.staticflickr.com/9/8022/29722826586_c07240a926_n.jpg', u'c3.staticflickr.com/9/8332/29663153602_c9364a94ac.jpg', u'c4.staticflickr.com/9/8219/29767151515_3c9c12d47a.jpg', u'c6.staticflickr.com/9/8475/29675880341_baa5c43403.jpg', u'c5.staticflickr.com/9/8246/29646906852_ff44a93f55_n.jpg', u'c2.staticflickr.com/9/8113/29141997673_64184d61fd_n.jpg', u'c7.staticflickr.com/9/8517/29131116894_1319f5a4af.jpg', u'c5.staticflickr.com/9/8169/29472205700_4930f81031_n.jpg', u'c3.staticflickr.com/9/8051/29466854090_804671e48d.jpg', u'c4.staticflickr.com/9/8459/29772050115_0d602920a9_n.jpg', u'c6.staticflickr.com/9/8413/29762049765_951f4c683c_n.jpg', u'c8.staticflickr.com/9/8480/29132401623_50619e22c5_n.jpg', u'c7.staticflickr.com/9/8410/29482793550_8b338c8432_z.jpg', u'c6.staticflickr.com/8/7501/29693717381_dd907ac02a.jpg'] 
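The URLs in this output are protocol-relative (the leading `//` is consumed by the pattern), and `url_re.search(...)` returns `None` for a style that has no background image. As a minimal sketch, assuming the same pattern as above, the helper below skips non-matching styles and prepends a scheme. The function name `background_urls`, the `scheme` parameter, and the sample style strings are made up for illustration; only the first URL is taken from the answer.

```python
import re

# Same pattern as in the answer, written as a raw string.
URL_RE = re.compile(r"url\(//(.*?)\)")


def background_urls(styles, scheme="https"):
    """Return absolute URLs found in a list of inline style strings."""
    urls = []
    for style in styles:
        match = URL_RE.search(style)
        if match is None:  # no background image in this style: skip it
            continue
        urls.append(f"{scheme}://{match.group(1)}")
    return urls


# Made-up sample input; the first URL is the one quoted in the answer.
sample_styles = [
    "width: 240px; background-image: url(//c6.staticflickr.com/9/8279/29103697453_ca811d0e07_z.jpg)",
    "width: 240px; height: 160px",
]
print(background_urls(sample_styles))
```

With a parsed page, the input could be something like `[d["style"] for d in soup.select("div[style]")]`.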

Thanks. The book I'm following doesn't use re. Is it possible to achieve the same goal using only what bs4 provides?


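Unfortunately, bs4 is an HTML parser; it can only parse what it is given. If you want the source as you see it in the browser, you will need something that can render dynamic content, such as Selenium.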
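On the narrower point of avoiding re: once bs4 has returned the style attribute, plain string methods can pull out the URL. This is only a sketch under assumptions. The markup below is a made-up stand-in for the served page source, and `url_from_style` is a hypothetical helper, not part of bs4.

```python
from bs4 import BeautifulSoup


def url_from_style(style):
    """Return the text between 'url(' and ')' in an inline style, or None."""
    _, found, rest = style.partition("url(")
    if not found:
        return None
    inner, closed, _ = rest.partition(")")
    # Strip optional quotes or spaces around the URL.
    return inner.strip("'\" ") if closed else None


# Hypothetical markup standing in for the served page source.
html = """
<div class="view photo-list-view">
  <div style="background-image: url(//c6.staticflickr.com/9/8279/29103697453_ca811d0e07_z.jpg)"></div>
  <div style="width: 10px"></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Only divs whose style attribute contains "url(" are selected.
urls = [
    url_from_style(d["style"])
    for d in soup.select('div.photo-list-view div[style*="url("]')
]
print(urls)
```

This still only sees what the server sends; anything built by JavaScript after load is out of reach for the parser.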


I still get nothing. Here is what I have: urls = [url_re.search(d['style']).group(1) for d in soup.select('div.view.photo-list-view requiredToShowOnServer')] print(url)
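One possible reason for the empty result in that last snippet: in a CSS selector a bare word is read as a tag name, so `requiredToShowOnServer` without a leading dot looks for a `<requiredToShowOnServer>` element. The snippet also prints `url`, while the list is assigned to `urls`. The sketch below illustrates the selector difference on made-up markup; treating `requiredToShowOnServer` as a class name is an assumption, as is the example.com URL.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are assumptions based on the comment.
html = """
<div class="view photo-list-view">
  <div class="requiredToShowOnServer"
       style="background-image: url(//example.com/a.jpg)"></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Bare word: treated as an element name, so nothing matches here.
as_tag = soup.select("div.view.photo-list-view requiredToShowOnServer")

# Leading dot: matches descendants whose class list contains the name.
as_class = soup.select("div.view.photo-list-view .requiredToShowOnServer")

print(len(as_tag), len(as_class))
```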