2015-09-07 121 views

I have the following table on a website which I am extracting with BeautifulSoup. This is the URL (I have also attached a screenshot of the table contents).

Ideally I would like to have each company on one row in the CSV, but I am getting them spread across different rows. See the attached image.

(screenshot of the resulting CSV attached)

I would like the fields laid out like column 'D', but instead everything ends up stacked in A1, A2, A3, ...

This is the code I am using to extract it:

import csv

import requests
from bs4 import BeautifulSoup


def _writeInCSV(text):
    print "Writing in CSV File"
    with open('sara.csv', 'wb') as csvfile:
        # spamwriter = csv.writer(csvfile, delimiter='\t', quotechar='\n', quoting=csv.QUOTE_MINIMAL)
        spamwriter = csv.writer(csvfile, delimiter='\t', quotechar="\n")

        for item in text:
            spamwriter.writerow([item])

read_list = []
initial_list = []


url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")  # use the public .content, not ._content

# gdata_even = soup.find_all("td", {"class": "ms-rteTableEvenRow-3"})

gdata_even = soup.find_all("td", {"class": "ms-rteTable-default"})

for item in gdata_even:
    print item.text.encode("utf-8")
    initial_list.append(item.text.encode("utf-8"))
    print ""

_writeInCSV(initial_list)

Can someone help?


It would be even better if I could copy the entire table to CSV, but I am struggling with how to do that. – Nant

Answers


The idea here is to:

  • read the header cells from the table
  • read the data cells from all the other rows of the table
  • zip each data row's cells with the headers to produce a list of dictionaries
  • use csv.DictWriter() to dump them to csv

Implementation:

import csv 
from pprint import pprint 

from bs4 import BeautifulSoup 
import requests 

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register" 
soup = BeautifulSoup(requests.get(url).content, "html.parser") 

rows = soup.select("table.ms-rteTable-default tr") 
headers = [header.get_text(strip=True).encode("utf-8") for header in rows[0].find_all("td")] 

data = [dict(zip(headers, [cell.get_text(strip=True).encode("utf-8") for cell in row.find_all("td")])) 
     for row in rows[1:]] 

# see what the data looks like at this point 
pprint(data) 

with open('sara.csv', 'wb') as csvfile:
    spamwriter = csv.DictWriter(csvfile, headers, delimiter='\t', quotechar="\n")
    spamwriter.writeheader()  # write the header row first

    for row in data:
        spamwriter.writerow(row)
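One note for Python 3 users: the DictWriter flow above needs two small changes there, since `'wb'` mode and the `.encode("utf-8")` calls are Python 2 specific. A minimal sketch, using hypothetical sample rows in place of the scraped table:

```python
import csv

# stand-ins for the headers / data produced by the scraping step above
headers = ["Company", "Dividend", "Bonus"]
data = [
    {"Company": "Nigerian Breweries Plc", "Dividend": "N3.50", "Bonus": "Nil"},
    {"Company": "Forte Oil Plc", "Dividend": "N2.50", "Bonus": "1 for 5"},
]

# Python 3: open in text mode with newline='' (prevents blank lines on Windows);
# no .encode() needed, the csv module handles unicode strings natively
with open("sara.csv", "w", newline="") as csvfile:
    writer = csv.DictWriter(csvfile, headers, delimiter="\t")
    writer.writeheader()            # one header row
    for row in data:
        writer.writerow(row)        # one company per row
```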

Since @alecxe has already provided an amazing answer, here is an alternative approach using the pandas library.

import pandas as pd 

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register" 
tables = pd.read_html(url) 

tb1 = tables[0] # Get the first table. 
tb1.columns = tb1.iloc[0] # Assign the first row as header. 
tb1 = tb1.iloc[1:] # Drop the first row. 
tb1.reset_index(drop=True, inplace=True) # Reset the index. 

print tb1.head() # Print first 5 rows. 
# tb1.to_csv("table1.csv") # Export to CSV file. 
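As a side note, `read_html()` can take a `header` argument that promotes the first table row to column names directly, which replaces the manual `iloc[0]` / `iloc[1:]` / `reset_index(...)` steps above. A sketch against a tiny stand-in table (the live page's markup is assumed to be similar):

```python
from io import StringIO

import pandas as pd

# a minimal stand-in for the NSE table
html = """<table>
  <tr><td>Company</td><td>Dividend</td></tr>
  <tr><td>Forte Oil Plc</td><td>N2.50</td></tr>
</table>"""

# header=0 uses the first table row as the column names
tb1 = pd.read_html(StringIO(html), header=0)[0]
print(tb1.columns.tolist())
```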

The result:

In [5]: runfile('C:/Users/.../.spyder2/temp.py', wdir='C:/Users/.../.spyder2') 
0     Company  Dividend Bonus  Closure of Register \ 
0 Nigerian Breweries Plc   N3.50  Nil 5th - 11th March 2015 
1   Forte Oil Plc   N2.50 1 for 5 1st – 7th April 2015 
2   Nestle Nigeria   N17.50  Nil   27th April 2015 
3  Greif Nigeria Plc  60 kobo  Nil 25th - 27th March 2015 
4  Guaranty Bank Plc N1.50 (final)  Nil   17th March 2015 

0   AGM Date  Payment Date 
0  13th May 2015 14th May 2015 
1 15th April 2015 22nd April 2015 
2  11th May 2015 12th May 2015 
3 28th April 2015  5th May 2015 
4  31st March 2015 31st March 2015

In [6]: 

I get the error:
C:\Python27\python.exe C:/Users/Anant/XetraWebBot/Test/ReadCSV.py
Traceback (most recent call last):
  File "C:/Users/Anant/XetraWebBot/Test/ReadCSV.py", line 4, in <module>
    tables = pd.read_html(url)
AttributeError: 'module' object has no attribute 'read_html' – Nant


Most likely your 'pandas' is not up to date, or you are missing the 'html5lib' module. Be forewarned: 'pandas' can simplify table scraping, as you saw above, but setting it up can be quite problematic unless you use a distribution like Anaconda (which is what I used above). – Manhattan
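Following up on that comment, the usual fix is to upgrade pandas and install one of the HTML parsers that `read_html()` depends on; a setup sketch (package names assumed current):

```shell
pip install --upgrade pandas html5lib lxml beautifulsoup4
```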