Beautiful Soup scraping schema?

I am trying to scrape a website that contains the following HTML code:

<div class="content-sidebar-wrap"><main class="content"><article
class="post-773 post type-post status-publish format-standard has-post-thumbnail category-money entry"
itemscope itemtype="http://schema.org/CreativeWork">

This contains the data I am interested in. I have tried parsing it with BeautifulSoup, but it returns the following:

<div class="content-sidebar-wrap"><main class="content"><article 
class="entry"> 
<h1 class="entry-title">Not found, error 404</h1><div class="entry-content 
"><p>"The page you are looking for no longer exists. Perhaps you can return 
back to the site's "<a href="http://www.totalsportek.com/">homepage</a> and 
see if you can find what you are looking for. Or, you can try finding it 
by using the search form below.</p><form 
action="http://www.totalsportek.com/" class="search-form" 
itemprop="potentialAction" itemscope="" 
itemtype="http://schema.org/SearchAction" method="get" role="search"> 

# I've made small modifications to make it readable

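The Beautiful Soup element does not contain the code I want. I am not very familiar with HTML, but I assume the page calls some external service to return the data? I have read that this has something to do with Schema.

Is there any way I can access this data?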

Source

2017-07-29 T.Mung

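What do you want to get from the HTML code?

An HTML table. Trying to parse the table directly returns None.

Hmm, I still don't understand: which website exactly are you trying to get information from? If the information is built by JavaScript, "requests" will not work.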
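One way to test the JavaScript hypothesis from the comment above is to look for the target element in the raw HTML, before any script runs. The following is only a sketch: the helper name diagnose and the three inline HTML strings are made up for illustration, and the "404" check assumes the error page looks like the one quoted in the question.

```python
from bs4 import BeautifulSoup


def diagnose(html):
    """Classify a fetched page (hypothetical helper, not part of any library)."""
    soup = BeautifulSoup(html, "html.parser")

    # The error page quoted in the question has an <h1 class="entry-title">
    # whose text mentions 404.
    title = soup.select_one("h1.entry-title")
    if title is not None and "404" in title.get_text():
        return "error page"

    # If a <table> is already present, no JavaScript is needed to see it.
    if soup.find("table") is not None:
        return "table in static HTML"

    # Otherwise the table may be built client-side, or may simply not exist.
    return "no table in static HTML"


# Made-up inputs standing in for real responses.
ERROR_PAGE = '<article class="entry"><h1 class="entry-title">Not found, error 404</h1></article>'
STATIC_PAGE = '<article class="post entry"><table><tr><td>cell</td></tr></table></article>'
SCRIPTED_PAGE = '<article class="post entry"><div id="data"></div><script></script></article>'

for page in (ERROR_PAGE, STATIC_PAGE, SCRIPTED_PAGE):
    print(diagnose(page))
```

In practice you would pass response.text from your own request; an "error page" result suggests the problem lies with the request itself rather than with JavaScript.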

You need to specify a User-Agent header when making the request. A working example that prints the article title and content:

import requests
from bs4 import BeautifulSoup

url = "http://www.totalsportek.com/money/barcelona-player-salaries/"

# Send a browser-like User-Agent header with the request.
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36"})
soup = BeautifulSoup(response.content, "html.parser")

# Locate the published article, then extract its header and body text.
article = soup.select_one(".content article.post.entry.status-publish")
header = article.header.get_text(strip=True)
content = article.select_one(".entry-content").get_text(strip=True)

print(header)
print(content)
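If the header seems to have no effect in your own code, it can help to confirm which User-Agent would actually be sent. The sketch below builds the request without sending it, so it needs no network access. The helper name build_session is an assumption for illustration; Session.prepare_request and requests.utils.default_user_agent are part of the requests library.

```python
import requests

URL = "http://www.totalsportek.com/money/barcelona-player-salaries/"
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36"
)


def build_session(user_agent):
    """Return a Session that sends user_agent on every request (hypothetical helper)."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


session = build_session(BROWSER_UA)

# prepare_request merges the session headers into the request
# but does not send anything over the network.
prepared = session.prepare_request(requests.Request("GET", URL))

# What requests identifies itself as when no header is set.
print(requests.utils.default_user_agent())
# What this session would send instead.
print(prepared.headers["User-Agent"])
```

Setting the header on a Session rather than on each call means later requests through that session carry it too, which avoids forgetting it on one of them.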

Source

2017-07-29 03:37:23 alecxe

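Your code works. However, I specified your user agent in my own code and it still does not work. Does something else need to be done besides setting the user agent? Running a simple soup.table returns None where there should be at least one table.

Also, I would prefer to access the HTML table directly rather than parse it as text.

Figured out the table... I just used your code. If possible, I would like to know why soup.find did not work.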
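For reading the table itself rather than the flattened article text, as the comments above ask, a minimal sketch might look like the following. The sample HTML and its cell values are placeholders, not data from the site, and table_to_rows is a hypothetical helper name. It assumes the table sits inside the .entry-content element used in the answer.

```python
from bs4 import BeautifulSoup

# Placeholder markup; the real page structure and values may differ.
SAMPLE_HTML = """
<article class="post entry status-publish">
  <div class="entry-content">
    <table>
      <tr><th>Player</th><th>Weekly wage</th></tr>
      <tr><td>Player A</td><td>100</td></tr>
      <tr><td>Player B</td><td>200</td></tr>
    </table>
  </div>
</article>
"""


def table_to_rows(table):
    """Turn a <table> Tag into a list of rows, each a list of cell strings."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.select_one(".entry-content table")

# select_one returns None when nothing matches, so guard before using it.
rows = table_to_rows(table) if table is not None else []
for row in rows:
    print(row)
```

The None guard matters here: if the request returned the error page, select_one finds no table and the code yields an empty list instead of raising an AttributeError.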
