2016-08-18
0

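Trouble creating a pandas DataFrame from lists

I'm having some trouble creating a pandas DataFrame from lists while scraping data from the web. Here I'm using BeautifulSoup to pull some information about local farms (farm name, city, and description) from localharvest.org. I'm able to scrape the data without problems, building a list of values on each pass. The trouble I'm having is getting these lists out into a tabular DataFrame.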

My full code is below:

import requests 
from bs4 import BeautifulSoup 
import pandas 

url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6" 
r = requests.get(url) 
soup = BeautifulSoup(r.content) 


data = soup.find_all("div", {'class': 'membercell'}) 

fname = [] 
fcity = [] 
fdesc = [] 

for item in data: 
    name = item.contents[1].text 
    fname.append(name) 
    city = item.contents[3].text 
    fcity.append(city) 
    desc = item.find_all("div", {'class': 'short-desc'})[0].text 
    fdesc.append(desc) 

df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc}) 

print (df) 

df.to_csv('farmdata.csv') 

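Interestingly, print(df) shows that all three lists made it into the DataFrame. But the resulting CSV output contains only one column of values (fcity), with the fname and fdesc column labels present but empty. Also interesting: if I do something crazy like trying to force tab-delimited output with df.to_csv('farmdata.csv', sep='\t'), I get a single column of jumbled output, but at least the other elements of the DataFrame come through.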

Thanks in advance for any input.

Answers

1

Try stripping out the newline and whitespace characters:

import requests 
from bs4 import BeautifulSoup 
import pandas 

url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6" 
r = requests.get(url) 
soup = BeautifulSoup(r.content) 


data = soup.find_all("div", {'class': 'membercell'}) 

fname = [] 
fcity = [] 
fdesc = [] 

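for item in data:
    # .text.split() splits on any run of whitespace (spaces, tabs, newlines);
    # ' '.join(...) then rebuilds the string with single spaces between words.
    name = item.contents[1].text.split()
    fname.append(' '.join(name))
    city = item.contents[3].text.split()
    fcity.append(' '.join(city))
    desc = item.find_all("div", {'class': 'short-desc'})[0].text.split()
    fdesc.append(' '.join(desc))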

df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc}) 

print (df) 

df.to_csv('farmdata.csv') 
+1

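This works perfectly. I think the problem came from the desc field. I've noticed that BeautifulSoup tends to add a lot of newlines. Others have recommended using '.split()' to get rid of them. This was a big help. Thanks. – JeremyD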

+0

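If you were viewing the CSV file in a spreadsheet program, my guess is that the newlines made the cells look empty when the program was really just showing the first (empty) line of each cell. Glad you got it working. Please consider upvoting and/or [accepting](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) the answers you found helpful :) – zarak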
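The effect described in these comments can be reproduced without scraping anything. The sketch below uses made-up farm values (the strings and the `normalize` helper are assumptions for illustration, not data from localharvest.org) and compares the CSV text pandas produces before and after collapsing whitespace.

```python
import pandas as pd

# Hypothetical values padded the way BeautifulSoup's .text output often is.
raw = pd.DataFrame({
    "fname": ["\n  Example Farm \n"],
    "fcity": ["Unity, ME"],
    "fdesc": ["\n\n We grow\n vegetables. \n"],
})


def normalize(text):
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(text.split())


# Apply the helper to every cell, column by column.
clean = raw.apply(lambda col: col.map(normalize))

# With no path argument, to_csv() returns the CSV text as a string.
raw_csv = raw.to_csv(index=False)
clean_csv = clean.to_csv(index=False)

# Embedded newlines spread a single record over several physical lines.
raw_line_count = len(raw_csv.splitlines())
clean_line_count = len(clean_csv.splitlines())
print(raw_line_count, clean_line_count)
```

In the unnormalized file the single record spans several physical lines, which is the layout a spreadsheet may show as a mostly empty cell. After normalizing, the file has one header line and one data line.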

1

This works for me:

# Strip leading/trailing whitespace, then keep the first 20 characters of each string
df['fname'] = df['fname'].str.strip().str.slice(start=0, stop=20) 
df['fdesc'] = df['fdesc'].str.strip().str.slice(start=0, stop=20) 
df.to_csv('farmdata.csv') 
df 

                fcity                 fdesc                 fname
0  South Portland, ME  Gromaine Farm is pro         Gromaine Farm
1         Newport, ME  We are a diversified    Parker Family Farm
2           Unity, ME  The Buckle Farm is a       The Buckle Farm
3      Kenduskeag, ME  Visit wiseacresfarm.       Wise Acres Farm
4      Winterport, ME   Winter Cove Farm is      Winter Cove Farm
5          Albion, ME  MISTY BROOK FARM off      Misty Brook Farm
6  Dover-Foxcroft, ME  We want you to becom           Ripley Farm
7         Madison, ME  Hide and Go Peep Far  Hide and Go Peep Far
8            Etna, ME   Fail Better Farm is      Fail Better Farm
9      Pittsfield, ME  We are a family farm  Snakeroot Organic Fa

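Perhaps there was a lot of blank space that was misinterpreted by the default separator (,), and because the fcity column itself contains a comma (,), the column ordering was affected.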
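One way to test the separator theory is to round-trip a value that contains a comma. The sketch below uses a made-up two-column frame (the farm name is hypothetical) and relies on the default quoting behaviour of `to_csv`, which wraps a field in double quotes when it contains the delimiter.

```python
import io

import pandas as pd

# Hypothetical row; "Unity, ME" contains the default CSV delimiter.
df = pd.DataFrame({"fname": ["Example Farm"], "fcity": ["Unity, ME"]})

# With no path argument, to_csv() returns the CSV text as a string.
csv_text = df.to_csv(index=False)
print(csv_text)

# Read the text back to check that the comma did not split the column.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```

If the city comes back as a single `fcity` value, the comma alone is not what breaks the file, which points back at the embedded newlines as the more likely cause.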

+1

This pandas approach works too. Stripping out the excess newlines and whitespace seems to be the key. Thanks for your help! – JeremyD
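One caveat with the pandas string methods: `.str.strip()` only removes whitespace at the two ends of each string, so newlines in the middle of a description survive. A minimal sketch, using a made-up value, of collapsing interior whitespace as well with `.str.replace` and a regular expression:

```python
import pandas as pd

# Hypothetical scraped string with newlines at the ends and in the middle.
s = pd.Series(["\n  Example Farm\n\n grows vegetables \n"])

# strip() trims only the leading and trailing whitespace.
stripped = s.str.strip()

# Replace every run of whitespace with one space, then trim the ends.
collapsed = s.str.replace(r"\s+", " ", regex=True).str.strip()

print(repr(stripped[0]))
print(repr(collapsed[0]))
```

This is the vectorized counterpart of the `' '.join(text.split())` idiom from the other answer.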

0

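Consider using a list of dictionaries or a dictionary of dictionaries, rather than a separate list for each piece of information about the farms you scrape. For example: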

[{'name': 'farm1', 'city': 'San Jose', ... etc},
 {'name': 'farm2', 'city': 'Oakland', ... etc}]

Now you can call pandas.DataFrame.from_dict() on a list of dictionaries of the type defined above: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html

An answer that may describe this solution in more detail: Convert Python dict into a dataframe
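A minimal sketch of the one-dictionary-per-farm idea, with made-up records standing in for the scraped values (the farm names and descriptions are hypothetical). It passes the list straight to the `pandas.DataFrame` constructor, which accepts a list of dictionaries and uses the keys as column names.

```python
import pandas as pd

# In the scraping loop, each farm would be appended as one dictionary, e.g.
#     records.append({"fname": name, "fcity": city, "fdesc": desc})
records = [
    {"fname": "Example Farm One", "fcity": "Unity, ME", "fdesc": "Vegetables."},
    {"fname": "Example Farm Two", "fcity": "Etna, ME", "fdesc": "Eggs."},
]

# Each dictionary becomes one row; the keys become the column labels.
df = pd.DataFrame(records)
print(df)
```

Keeping each farm's fields together in one record also avoids the three lists drifting out of step if one field is missing for a farm.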