Scraping a table from a web page with BeautifulSoup: no table found

I have been trying to scrape the table from here, but it looks like BeautifulSoup does not find any table.

I wrote:

import requests 
import pandas as pd 
from bs4 import BeautifulSoup 
import csv 

url = "http://www.payscale.com/college-salary-report/bachelors?page=65" 
r=requests.get(url) 
data=r.text 

soup=BeautifulSoup(data,'xml') 
table=soup.find_all('table') 
print table #prints nothing..

Based on other similar questions, I assume the HTML is somehow broken, but I'm not an expert. I could not find an answer in these: (Beautiful soup missing some html table tags), (Extracting a table from a website), (Scraping a table using BeautifulSoup), or even (Python+BeautifulSoup: scraping a particular table from a webpage).

Thanks a bunch!

Source

2017-02-18 oba2311

I would print 'data' and check whether you can find the table in the page. – metame

Thanks @metama. I did that; the only match was <!-- Tablet Image -->, and bsoup wouldn't find it. Also, if looking for a table is not the right approach here, how would you go about it? Thanks! – oba2311

There is no table tag in the page. All the table information is inside a script tag. –

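You are parsing HTML, but you used the XML parser.
You should use soup=BeautifulSoup(data,"html.parser")
The data you need is in a script tag; there is actually no table tag in the page. So you need to find the text inside the script.
N.B.: BeautifulSoup accepts the parser name "html.parser" on both Python 2 and Python 3; HTMLParser is only the name of the underlying standard-library module in Python 2.x.

Here is the code.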

import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r = requests.get(url)
data = r.text

soup = BeautifulSoup(data, "html.parser")
scripts = soup.find_all("script")

# newline="" is a Python 3 argument to open(); Python 2 does not accept it.
file_name = open("table.csv", "w", newline="")
writer = csv.writer(file_name)

# Column order of the CSV file.
header = ["Rank", "School Name", "School Type", "Early Career Median Pay",
          "Mid-Career Median Pay", "% High Job Meaning", "% STEM"]
# Order in which the same keys appear inside each record of the script text.
keys_in_page_order = ["School Name", "Early Career Median Pay", "Mid-Career Median Pay",
                      "Rank", "% High Job Meaning", "School Type", "% STEM"]

list_to_write = [header]

for script in scripts:
    text = script.text
    # Only the large script that holds the table data is of interest.
    if len(text) <= 10000:
        continue
    start = 0
    # Each record starts with "School Name"; stop when no further record is found.
    while text.find('"School Name":"', start) != -1:
        record = {}
        for key in keys_in_page_order:
            marker = '"' + key + '":"'
            # The value runs from the end of the marker to the next double quote.
            start = text.find(marker, start) + len(marker)
            end = text.find('"', start)
            record[key] = text[start:end]
        list_to_write.append([record[key] for key in header])

writer.writerows(list_to_write)
file_name.close()

This will generate the table you need in a CSV file. Don't forget to close the file when you are done.

Source

2017-02-18 03:50:33

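**Thanks** @Khairul, your code (almost) works. I had to remove _newline=""_, because Python complained that the _open()_ function does not take this argument. When I looked it up, I saw that it is a new parameter introduced in Python 3. I am talking about this line: 'file_name = open("table.csv","w",newline="")'. This was very helpful, thanks. – oba2311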

I wrote this in Python 3.5. Sorry, I didn't know it was a new parameter in Python 3, otherwise I would have mentioned it. @oba2311 –

No worries @Khairul, you did solve the problem! Hence the check mark. Thanks. – oba2311
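
The Python 2 versus Python 3 difference discussed in these comments could be handled with a small helper. This is only a sketch: `open_csv_for_writing` and the temporary file path are hypothetical names introduced for illustration, and the sample row is illustrative rather than scraped.

```python
import csv
import os
import sys
import tempfile


def open_csv_for_writing(path):
    """Open `path` for use with csv.writer on either Python 2 or Python 3."""
    if sys.version_info[0] >= 3:
        # Python 3: text mode; newline="" lets the csv module control line endings.
        return open(path, "w", newline="")
    # Python 2: open() has no newline argument; binary mode plays the same role.
    return open(path, "wb")


# Usage with a throwaway file and an illustrative row.
demo_path = os.path.join(tempfile.mkdtemp(), "table.csv")
csv_file = open_csv_for_writing(demo_path)
try:
    writer = csv.writer(csv_file)
    writer.writerow(["Rank", "School Name"])
    writer.writerow(["963", "Shaw University"])
finally:
    csv_file.close()

# Read the file back to confirm what was written.
with open(demo_path) as check:
    demo_lines = check.read().splitlines()
print(demo_lines)
```

Branching on `sys.version_info` keeps the rest of the script identical on both interpreters; only the call that opens the file changes.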

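Although this will not find the table, because it is not in r.text, you are asking BeautifulSoup to use the xml parser rather than html.parser, so I would suggest changing that line to: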

soup=BeautifulSoup(data,'html.parser')

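One of the problems you run into with web scraping is so-called "client-rendered" websites versus server-rendered websites. Basically, this means that the page you get from a basic HTML request, via the requests module or curl, is not the same as what is rendered in a web browser. Some common frameworks for this are React and Angular. If you inspect the source of the page you want to scrape, you will see data-react-id on several of its HTML elements. A common tell for Angular pages is similar element attributes prefixed with ng, such as ng-if or ng-bind. You can view a page's source in Chrome or Firefox through their respective developer tools, which can be opened in either browser with the keyboard shortcut Ctrl+Shift+I. It is worth noting that not all React & Angular pages are client-rendered only.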
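
A quick check for these tells can be sketched with the standard library alone. The class `RenderingHintParser`, the function `inspect_html`, and the `SAMPLE_HTML` string are all made up for illustration; a real page would come from `r.text`, and the attribute prefixes are only the heuristics described above, not a reliable detector.

```python
from html.parser import HTMLParser


class RenderingHintParser(HTMLParser):
    """Counts <table> and <script> tags and notes attributes that hint at React or Angular."""

    def __init__(self):
        super().__init__()
        self.table_count = 0
        self.script_count = 0
        self.framework_hints = set()

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_count += 1
        elif tag == "script":
            self.script_count += 1
        # Attribute names arrive lowercased from HTMLParser.
        for name, _value in attrs:
            if name.startswith("data-react"):
                self.framework_hints.add("React")
            elif name.startswith("ng-"):
                self.framework_hints.add("Angular")


def inspect_html(html):
    """Return (table count, script count, sorted framework hints) for an HTML string."""
    parser = RenderingHintParser()
    parser.feed(html)
    parser.close()
    return parser.table_count, parser.script_count, sorted(parser.framework_hints)


# Hypothetical stand-in for a client-rendered page: no <table>, data inside <script>.
SAMPLE_HTML = """
<html><body>
<div data-react-id="root"><span ng-if="ready">Loading</span></div>
<script>var rows = [{"School Name":"Shaw University"}];</script>
</body></html>
"""

print(inspect_html(SAMPLE_HTML))
```

A table count of zero together with framework hints suggests that the data is either embedded in a script or fetched later by JavaScript.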

To get this kind of content, you need a headless browser tool such as Selenium. There are many resources on web scraping with Selenium and Python.

Source

2017-02-18 02:46:50 metame

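The data is in a JavaScript variable. You should find the JS text that holds it and then use a regular expression to extract it. Once extracted, the data is a JSON list containing 900+ school dictionaries, and you should use the json module to load it into a Python list object.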

import requests, bs4, re, json 
from pprint import pprint 

url = "http://www.payscale.com/college-salary-report/bachelors?page=65" 
r = requests.get(url) 
data = r.text 
soup = bs4.BeautifulSoup(data, 'lxml') 
var = soup.find(text=re.compile('collegeSalaryReportData')) 
table_text = re.search(r'collegeSalaryReportData = (\[.+\]);\n var', var, re.DOTALL).group(1) 
table_data = json.loads(table_text) 
pprint(table_data) 
print('The number of school', len(table_data))

Output:

{'% Female': '0.57', 
    '% High Job Meaning': 'N/A', 
    '% Male': '0.43', 
    '% Pell': 'N/A', 
    '% STEM': '0.1', 
    '% who Recommend School': 'N/A', 
    'Division 1 Basketball Classifications': 'Not Division 1 Basketball', 
    'Division 1 Football Classifications': 'Not Division 1 Football', 
    'Early Career Median Pay': '36200', 
    'IPEDS ID': '199643', 
    'ImageUrl': '/content/school_logos/Shaw University_50px.png', 
    'Mid-Career Median Pay': '45600', 
    'Rank': '963', 
    'School Name': 'Shaw University', 
    'School Sector': 'Private not-for-profit', 
    'School Type': 'Private School, Religious', 
    'State': 'North Carolina', 
    'Undergraduate Enrollment': '1664', 
    'Url': '/research/US/School=Shaw_University/Salary', 
    'Zip Code': '27601'}] 
The number of school 963
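
Once the data is a list of dictionaries, writing the columns of interest to CSV might look like the following sketch. The single record is copied from the output above, `records_to_csv` is a hypothetical helper, and an in-memory buffer stands in for a real file so the example has no side effects.

```python
import csv
import io

# One record copied from the output above; the real list holds one dict per school.
table_data = [
    {
        "Rank": "963",
        "School Name": "Shaw University",
        "School Type": "Private School, Religious",
        "Early Career Median Pay": "36200",
        "Mid-Career Median Pay": "45600",
        "% High Job Meaning": "N/A",
        "% STEM": "0.1",
        "State": "North Carolina",
    }
]

columns = ["Rank", "School Name", "School Type", "Early Career Median Pay",
           "Mid-Career Median Pay", "% High Job Meaning", "% STEM"]


def records_to_csv(records, fieldnames):
    """Serialise a list of dicts to CSV text, keeping only the given columns."""
    buffer = io.StringIO(newline="")
    # extrasaction="ignore" drops keys such as "State" that are not in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = records_to_csv(table_data, columns)
print(csv_text)
```

`csv.DictWriter` quotes the school type automatically because the value contains a comma, which the manual string handling would otherwise have to take care of.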

Source

2017-02-18 10:04:47
