2017-01-27 24 views

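Parsing and extracting data into a pandas DataFrame: BeautifulSoup and XML

I hope you can help me. I need to write a function that parses text and extracts the data into a pandas DataFrame: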

"""
Function
--------
rcp_poll_data

Extract poll information from an XML string, and convert to a DataFrame 

Parameters 
---------- 
xml : str 
    A string, containing the XML data from a page like 
    get_poll_xml(1044) 

Returns 
------- 
A pandas DataFrame with the following columns: 
    date: The date for each entry 
    title_n: The data value for the gid=n graph (take the column name from the `title` tag) 

This DataFrame should be sorted by date 

Example 
------- 
Consider the following simple xml page: 

<chart> 
<series> 
<value xid="0">1/27/2009</value> 
<value xid="1">1/28/2009</value> 
</series> 
<graphs> 
<graph gid="1" color="#000000" balloon_color="#000000" title="Approve"> 
<value xid="0">63.3</value> 
<value xid="1">63.3</value> 
</graph> 
<graph gid="2" color="#FF0000" balloon_color="#FF0000" title="Disapprove"> 
<value xid="0">20.0</value> 
<value xid="1">20.0</value> 
</graph> 
</graphs> 
</chart> 

Given this string, rcp_poll_data should return 
result = pd.DataFrame({'date': pd.to_datetime(['1/27/2009', '1/28/2009']), 
         'Approve': [63.3, 63.3], 'Disapprove': [20.0, 20.0]}) 

My code:

def rcp_poll_data(xml):
    soup = BeautifulSoup(xml,'xml')
    dates=soup.find("series")
    datesval=soup.findChildren(string=True)
    del datesval[-7:]
    obama=soup.find("graph",gid="1")
    obamaval={"title":obama["title"],"color":obama["color"]}
    romney=soup.find("graph",gid="2")
    romneyval={"title":romney["title"],"color":romney["color"]}
    result = pd.DataFrame({'date': pd.to_datetime(datesval,errors="ignore"), 'GID1':obamaval, 'GID2':romneyval})
    return result

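"""
But whenever I run the program, I always get this error: Mixing dicts with non-Series may lead to ambiguous ordering.

Please help! PS: the get_poll_xml function looks like this:

def get_poll_xml(poll_id):
    url="http://charts.realclearpolitics.com/charts/"+str(poll_id)+".xml"
    return requests.get(url).content

For example, poll_id = 1044.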

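Answers

Consider using the built-in xml.etree.ElementTree rather than BeautifulSoup (which is better suited to HTML web scraping) to parse the XML. It provides methods such as iterfind, findall, and find that locate child nodes through XPath expressions, including predicates such as @gid='1'. And because the <value> elements under the <series> and <graph> parent tags have the same length, you can loop over them together with zip():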

import requests 
import pandas as pd 
import xml.etree.ElementTree as et 

def get_poll_xml(poll_id): 
    url="http://charts.realclearpolitics.com/charts/{}.xml".format(poll_id) 
    return requests.get(url).content 

def rcp_poll_data(xml):

    tree = et.fromstring(xml)

    dates = []; graphlist1 = []; graphlist2 = []

    # Column names come from each graph's title attribute
    g1title = tree.find("./graphs/graph[@gid='1']").get('title')
    g2title = tree.find("./graphs/graph[@gid='2']").get('title')

    # Walk the date values and both graphs' values in parallel
    for s, g1, g2 in zip(tree.iterfind("./series/value"),
                         tree.iterfind("./graphs/graph[@gid='1']/value"),
                         tree.iterfind("./graphs/graph[@gid='2']/value")):
        dates.append(s.text)
        graphlist1.append(g1.text)
        graphlist2.append(g2.text)

    # Convert the text to dates and numbers; unparseable entries become NaT/NaN
    return pd.DataFrame({'Date': pd.to_datetime(dates, errors="coerce"),
                         g1title: pd.to_numeric(graphlist1, errors="coerce"),
                         g2title: pd.to_numeric(graphlist2, errors="coerce")})

poll_id = 1044 
xml_str = get_poll_xml(poll_id) 
df = rcp_poll_data(xml_str) 

Output

print(df.head(20)) 

# Approve  Date Disapprove 
# 0  63.3 2009-01-27  20.0 
# 1  63.3 2009-01-28  20.0 
# 2  63.5 2009-01-29  19.3 
# 3  63.5 2009-01-30  19.3 
# 4  61.8 2009-01-31  19.4 
# 5  61.8 2009-02-01  19.4 
# 6  61.8 2009-02-02  19.4 
# 7  61.8 2009-02-03  19.4 
# 8  61.8 2009-02-04  19.4 
# 9  61.8 2009-02-05  19.4 
# 10 61.6 2009-02-06  21.4 
# 11 61.6 2009-02-07  21.4 
# 12 61.6 2009-02-08  21.4 
# 13 65.4 2009-02-09  22.6 
# 14 65.4 2009-02-10  22.6 
# 15 64.2 2009-02-11  23.3 
# 16 64.2 2009-02-12  23.3 
# 17 64.2 2009-02-13  23.3 
# 18 64.8 2009-02-14  25.4 
# 19 65.5 2009-02-15  25.5 
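
As for the error in the question itself: it most likely comes from passing a dict (the graph's title and color attributes) as a column next to a list-like date column. pandas cannot tell how the dict keys should line up with the dates, so the constructor raises a ValueError. The following is a minimal sketch that reproduces the situation with made-up values taken from the example XML; the variable names are illustrative only.

```python
import pandas as pd

dates = pd.to_datetime(["1/27/2009", "1/28/2009"])

# A dict of graph attributes, shaped like obamaval in the question
graph_attrs = {"title": "Approve", "color": "#000000"}

# Mixing a list-like column with a dict column is rejected by pandas
try:
    pd.DataFrame({"date": dates, "GID1": graph_attrs})
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Passing one value per date as a list (or Series) works as expected
fixed = pd.DataFrame({"date": dates, "Approve": [63.3, 63.3]})
print(fixed)
```

In other words, each column needs the per-date values of the graph, not its attributes; the title attribute is only useful as the column name.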

Wow, thank you so much! I didn't know about xml.etree.ElementTree, thanks for pointing it out to me!
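
As a follow-up, the same ElementTree approach can be generalized so that it does not hard-code gid 1 and 2, returns numeric columns, and sorts by date as the docstring in the question asks. This is only a sketch: the function name rcp_poll_data_general is hypothetical, and it assumes that the xid attribute identifies the same date across <series> and every <graph>, which holds for the sample XML in the question but is not confirmed for every chart.

```python
import xml.etree.ElementTree as et
import pandas as pd

# The sample page from the question, so the sketch runs without network access
SAMPLE_XML = """<chart>
<series>
<value xid="0">1/27/2009</value>
<value xid="1">1/28/2009</value>
</series>
<graphs>
<graph gid="1" color="#000000" balloon_color="#000000" title="Approve">
<value xid="0">63.3</value>
<value xid="1">63.3</value>
</graph>
<graph gid="2" color="#FF0000" balloon_color="#FF0000" title="Disapprove">
<value xid="0">20.0</value>
<value xid="1">20.0</value>
</graph>
</graphs>
</chart>"""


def rcp_poll_data_general(xml):
    """Hypothetical variant: one column per <graph>, aligned on xid."""
    root = et.fromstring(xml)

    # Map each xid to its date string, then parse the dates
    dates = {v.get("xid"): v.text for v in root.iterfind("./series/value")}
    frame = pd.DataFrame({"date": pd.to_datetime(pd.Series(dates))})

    # Add one numeric column per graph, named after its title attribute;
    # assumption: xid lines up with the xid values in <series>
    for graph in root.iterfind("./graphs/graph"):
        values = {v.get("xid"): v.text for v in graph.iterfind("./value")}
        frame[graph.get("title")] = pd.to_numeric(pd.Series(values), errors="coerce")

    return frame.sort_values("date").reset_index(drop=True)


df = rcp_poll_data_general(SAMPLE_XML)
print(df)
```

Aligning on xid instead of relying on zip() means a graph with missing entries yields NaN for those dates rather than silently shifting values, at the cost of trusting the xid attribute.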