2016-12-14

I am trying to read a web page with Python and save the data in CSV format to import as a pandas DataFrame, i.e. extract specific columns from a given web page.

I have the code below, which extracts the links from all the pages; now I want to read certain column fields instead.

import urllib2
from bs4 import BeautifulSoup

for i in range(10):
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    try:
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page, 'html.parser')
        # the first nine col-xs-8 divs hold the workshop fields
        for anchor in soup.find_all('div', {'class': 'col-xs-8'})[:9]:
            print i, anchor.text
    except urllib2.URLError:
        pass

Can I save these 9 columns as a pandas DataFrame?

df.columns=['Organiser', 'Instructors', 'Date', 'Venue', 'Level', 'participants', 'Section', 'Status', 'Description'] 

Can't you just select the columns of interest from the result? e.g. `df = df[cols_I_want]` – EdChum
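EdChum's suggestion can be sketched as follows; the DataFrame here is a toy stand-in (one made-up row) using column names from the question:

```python
import pandas as pd

# Toy stand-in for the scraped DataFrame (values are invented)
df = pd.DataFrame(
    [["PyCon", "Alice", "2016-12-14"]],
    columns=["Organiser", "Instructors", "Date"],
)

# Keep only the columns of interest
cols_I_want = ["Organiser", "Date"]
df = df[cols_I_want]
print(list(df.columns))  # → ['Organiser', 'Date']
```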

Answer


This returns the correct results for the first 10 pages, but 100 pages take a long time. Any suggestions to make it faster?

import urllib2
from bs4 import BeautifulSoup
import pandas as pd

finallist = []
for i in range(10):
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    try:
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page, 'html.parser')
        # collect the nine workshop fields for this page as one row
        mylist = []
        for anchor in soup.find_all('div', {'class': 'col-xs-8'})[:9]:
            mylist.append(anchor.text)
        finallist.append(mylist)
    except urllib2.URLError:
        pass

df = pd.DataFrame(finallist)
df.columns = ['Organiser', 'Instructors', 'Date', 'Venue', 'Level',
              'participants', 'Section', 'Status', 'Description']

df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
df['participants'] = df['participants'].astype(int)
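On the speed question: the fetches are I/O-bound, so overlapping them with a thread pool is a common fix. A minimal sketch using `concurrent.futures` (Python 3); `scrape_page` is a hypothetical helper whose body is stubbed out here, since a real version would wrap the `urlopen` + BeautifulSoup logic from the loop above:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_page(i):
    # Hypothetical stand-in: a real version would urlopen() the page
    # and return the nine 'col-xs-8' div texts, as in the loop above.
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    return [url]  # placeholder for the nine-column row

# Threads overlap the network waits; 10 workers for 100 pages
with ThreadPoolExecutor(max_workers=10) as pool:
    finallist = list(pool.map(scrape_page, range(100)))
```

`pool.map` preserves input order, so the rows come back in page order even though the fetches finish out of order, and the resulting `finallist` can be fed to `pd.DataFrame` exactly as before.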