2015-09-07 121 views

I have the following table on a website which I am extracting with BeautifulSoup. This is the URL (I have also attached a screenshot of the table contents).

Ideally I would like to have each company on one row in the CSV, but I am getting them spread across different rows. See the attached image.

(screenshot of the resulting CSV attached)

I would like the fields laid out like column 'D', but instead everything ends up stacked in A1, A2, A3, ...

This is the code I am using to extract it:

import csv

import requests
from bs4 import BeautifulSoup


def _writeInCSV(text):
    print "Writing in CSV File"
    with open('sara.csv', 'wb') as csvfile:
        # spamwriter = csv.writer(csvfile, delimiter='\t', quotechar='\n', quoting=csv.QUOTE_MINIMAL)
        spamwriter = csv.writer(csvfile, delimiter='\t', quotechar="\n")

        for item in text:
            spamwriter.writerow([item])

read_list = []
initial_list = []


url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")  # use the public .content, not ._content

# gdata_even = soup.find_all("td", {"class": "ms-rteTableEvenRow-3"})

gdata_even = soup.find_all("td", {"class": "ms-rteTable-default"})

for item in gdata_even:
    print item.text.encode("utf-8")
    initial_list.append(item.text.encode("utf-8"))
    print ""

_writeInCSV(initial_list)

Can someone help?


It would be even better if I could copy the entire table to CSV, but I am struggling with how to do that. – Nant

Answers


The idea here is to:

  • read the header cells from the table
  • read the data cells from all the other rows of the table
  • zip each data row's cells with the headers to produce a list of dictionaries
  • use csv.DictWriter() to dump them to csv

Implementation:

import csv 
from pprint import pprint 

from bs4 import BeautifulSoup 
import requests 

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register" 
soup = BeautifulSoup(requests.get(url).content, "html.parser") 

rows = soup.select("table.ms-rteTable-default tr") 
headers = [header.get_text(strip=True).encode("utf-8") for header in rows[0].find_all("td")] 

data = [dict(zip(headers, [cell.get_text(strip=True).encode("utf-8") for cell in row.find_all("td")])) 
     for row in rows[1:]] 

# see what the data looks like at this point 
pprint(data) 

with open('sara.csv', 'wb') as csvfile:
    spamwriter = csv.DictWriter(csvfile, headers, delimiter='\t', quotechar="\n")
    spamwriter.writeheader()  # write the header row first

    for row in data:
        spamwriter.writerow(row)
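One note for Python 3 users: the DictWriter flow above needs two small changes there, since `'wb'` mode and the `.encode("utf-8")` calls are Python 2 specific. A minimal sketch, using hypothetical sample rows in place of the scraped table:

```python
import csv

# stand-ins for the headers / data produced by the scraping step above
headers = ["Company", "Dividend", "Bonus"]
data = [
    {"Company": "Nigerian Breweries Plc", "Dividend": "N3.50", "Bonus": "Nil"},
    {"Company": "Forte Oil Plc", "Dividend": "N2.50", "Bonus": "1 for 5"},
]

# Python 3: open in text mode with newline='' (prevents blank lines on Windows);
# no .encode() needed, the csv module handles unicode strings natively
with open("sara.csv", "w", newline="") as csvfile:
    writer = csv.DictWriter(csvfile, headers, delimiter="\t")
    writer.writeheader()            # one header row
    for row in data:
        writer.writerow(row)        # one company per row
```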

Since @alecxe has already provided an amazing answer, here is an alternative approach using the pandas library.

import pandas as pd 

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register" 
tables = pd.read_html(url) 

tb1 = tables[0] # Get the first table. 
tb1.columns = tb1.iloc[0] # Assign the first row as header. 
tb1 = tb1.iloc[1:] # Drop the first row. 
tb1.reset_index(drop=True, inplace=True) # Reset the index. 

print tb1.head() # Print first 5 rows. 
# tb1.to_csv("table1.csv") # Export to CSV file. 
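As a side note, `read_html()` can take a `header` argument that promotes the first table row to column names directly, which replaces the manual `iloc[0]` / `iloc[1:]` / `reset_index(...)` steps above. A sketch against a tiny stand-in table (the live page's markup is assumed to be similar):

```python
from io import StringIO

import pandas as pd

# a minimal stand-in for the NSE table
html = """<table>
  <tr><td>Company</td><td>Dividend</td></tr>
  <tr><td>Forte Oil Plc</td><td>N2.50</td></tr>
</table>"""

# header=0 uses the first table row as the column names
tb1 = pd.read_html(StringIO(html), header=0)[0]
print(tb1.columns.tolist())
```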

The result:

In [5]: runfile('C:/Users/.../.spyder2/temp.py', wdir='C:/Users/.../.spyder2') 
0     Company  Dividend Bonus  Closure of Register \ 
0 Nigerian Breweries Plc   N3.50  Nil 5th - 11th March 2015 
1   Forte Oil Plc   N2.50 1 for 5 1st – 7th April 2015 
2   Nestle Nigeria   N17.50  Nil   27th April 2015 
3  Greif Nigeria Plc  60 kobo  Nil 25th - 27th March 2015 
4  Guaranty Bank Plc N1.50 (final)  Nil   17th March 2015 

0   AGM Date  Payment Date 
0  13th May 2015 14th May 2015 
1 15th April 2015 22nd April 2015 
2  11th May 2015 12th May 2015 
3 28th April 2015  5th May 2015 
4  31st March 2015 31st March 2015

In [6]: 

I get the error:
C:\Python27\python.exe C:/Users/Anant/XetraWebBot/Test/ReadCSV.py
Traceback (most recent call last):
  File "C:/Users/Anant/XetraWebBot/Test/ReadCSV.py", line 4, in <module>
    tables = pd.read_html(url)
AttributeError: 'module' object has no attribute 'read_html' – Nant


Most likely your 'pandas' is not up to date, or you are missing the 'html5lib' module. Be forewarned: 'pandas' can simplify table scraping, as you saw above, but setting it up can be quite problematic unless you use a distribution like Anaconda (which is what I used above). – Manhattan
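Following up on that comment, the usual fix is to upgrade pandas and install one of the HTML parsers that `read_html()` depends on; a setup sketch (package names assumed current):

```shell
pip install --upgrade pandas html5lib lxml beautifulsoup4
```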