2016-04-24 30 views
1

我是熊貓的新手。我必須創建一個數據框,其中一列是「源」,第二列是「數量」。Python 3.x - 創建數據框並指定列名稱

創建一個新的數據幀

df=[] 

現在我怎麼能一列「源」和「量」到這個數據幀。最終的結果是

打印(DF)

Source Amount 
S1  10 
S2  12 
S3  8 
S4  5 

的數據將來自一個for循環。每次迭代生成源,然後生成數量。我想創建這樣一個數據幀 -

df=[] 

for str in some_variable: 
    df['Source'].append(str[0]) #str[0] will contain source elements 
    df['Amount'].append(str[1]) #str[1] will contain amount elements 

碼 -

import requests 
import pandas as pd 
import bs4 
import string 
import matplotlib.pyplot as plt 
url = "https://www.revisor.mn.gov/laws/?year=2014&type=0&doctype=Chapter&id=294" 

result = requests.get(url) 
soup = bs4.BeautifulSoup(result.content) 
summary = soup.find("div", {"class":"bill_section","id": "laws.1.1.0"}) 
tables = summary.findAll('table') 
data_table = tables[1] 

df=pd.DataFrame(['Source','Amount']) #Trying to incorrectly add columns to df 
for row in data_table.findAll("tr"): 
    cells = row.findAll("td") 
    try : 
     for char in cells[0].findAll("ins"): 
      df['Source'] = df['Source'].append() #This is where the issue is 
     for char in cells[2].findAll("ins"): 
      df['Amount'] = df['Amount'].append() #And here 
    except: 
     pass 
+3

關於[docs](http://pandas.pydata.org/pandas-docs/version/0.18.0/api.html#id1)有什麼不清楚的地方嗎? – EdChum

+0

用於創建新數據幀的源數據是列表,numpy數組,系列,另一個DataFrame? – jezrael

+1

首先查看教程可能會有幫助:http://pandas.pydata.org/pandas-docs/stable/10min.html – ayhan

回答

3

我認爲你可以使用read_htmlreplacestr.stripstr.replace和最後to_numeric

import pandas as pd 
import matplotlib.pyplot as plt 
url = "https://www.revisor.mn.gov/laws/?year=2014&type=0&doctype=Chapter&id=294" 

#read second table in url 
df = pd.read_html(url)[1] 
#replace texts to empty string 
df = df.replace('new text begin','', regex=True).replace('new text end','', regex=True) 
#set new columns names 
df.columns = ['Source','b','Amount'] 
#remove first row 
df = df[1:] 
#remove second column b 
df = df.drop('b', axis=1) 
#strip whitespaces 
df.Source = df.Source.str.strip() 
#strip whitespaces and remove(), 
df.Amount = df.Amount.str.strip().str.replace(r'[(),]','') 
#convert column Amount to numeric 
df.Amount = pd.to_numeric(df.Amount) 
#reset index 
df = df.reset_index(drop=True) 
print df 
              Source  Amount 
0       University of Minnesota 119367000 
1   Minnesota State Colleges and Universities 159812000 
2           Education 7491000 
3       Minnesota State Academies 11354000 
4     Perpich Center for Arts Education 2000000 
5         Natural Resources 63480000 
6       Pollution Control Agency 2625000 
7     Board of Water and Soil Resources 8000000 
8          Agriculture  203000 
9         Zoological Garden 12000000 
10         Administration 127000000 
11    Minnesota Amateur Sports Commission 7973000 
12         Military Affairs 3244000 
13         Public Safety 4030000 
14         Transportation 57263000 
15        Metropolitan Council 45968000 
16         Human Services 86387000 
17         Veterans Affairs 2800000 
18          Corrections 11881000 
19    Employment and Economic Development 92130000 
20      Public Facilities Authority 45993000 
21       Housing Finance Agency 20000000 
22      Minnesota Historical Society 12002000 
23        Bond Sale Expenses  900000 
24         Cancellations 10849000 
25           TOTAL 893054000 
26 Bond Proceeds Fund (General Fund Debt Service) 814745000 
27 Bond Proceeds Fund (User Financed Debt Service) 39104000 
28      State Transportation Fund 36613000 
29     Maximum Effort School Loan Fund 5491000 
30        Trunk Highway Fund 7950000 
31      Bond Proceeds Cancellations 10849000 
print df.dtypes 

Source object 
Amount  int64 
dtype: object 

但是如果你需要分析數據解決方案與BeautifulSoup,第一Source和量數據被追加,然後創建DataFrame

import requests 
import pandas as pd 
import bs4 
import string 
import matplotlib.pyplot as plt 
url = "https://www.revisor.mn.gov/laws/?year=2014&type=0&doctype=Chapter&id=294" 

result = requests.get(url) 
soup = bs4.BeautifulSoup(result.content) 
summary = soup.find("div", {"class":"bill_section","id": "laws.1.1.0"}) 
tables = summary.findAll('table') 
data_table = tables[1] 

Source, Amount = [], [] 
for row in data_table.findAll("tr"): 
    cells = row.findAll("td") 
    try : 
     for char in cells[0].findAll("ins"): 
      Source.append(char.text) #This is where the issue is 
     for char in cells[2].findAll("ins"): 
      Amount.append(char.text) #And here 
    except: 
     pass 
print Source 
[u'SUMMARY', u'University of Minnesota', u'Minnesota State Colleges and Universities', u'Education', u'Minnesota State Academies', u'Perpich Center for Arts Education', u'Natural Resources', u'Pollution Control Agency', u'Board of Water and Soil Resources', u'Agriculture', u'Zoological Garden', u'Administration', u'Minnesota Amateur Sports Commission', u'Military Affairs', u'Public Safety', u'Transportation', u'Metropolitan Council', u'Human Services', u'Veterans Affairs', u'Corrections', u'Employment and Economic Development', u'Public Facilities Authority', u'Housing Finance Agency', u'Minnesota Historical Society', u'Bond Sale Expenses', u'Cancellations', u'TOTAL', u'Bond Proceeds Fund (General Fund Debt Service)', u'Bond Proceeds Fund (User Financed Debt Service)', u'State Transportation Fund', u'Maximum Effort School Loan Fund', u'Trunk Highway Fund', u'Bond Proceeds Cancellations'] 
print Amount 
[u'119,367,000', u'159,812,000', u'7,491,000', u'11,354,000', u'2,000,000', u'63,480,000', u'2,625,000', u'8,000,000', u'203,000', u'12,000,000', u'127,000,000', u'7,973,000', u'3,244,000', u'4,030,000', u'57,263,000', u'45,968,000', u'86,387,000', u'2,800,000', u'11,881,000', u'92,130,000', u'45,993,000', u'20,000,000', u'12,002,000', u'900,000', u'(10,849,000)', u'893,054,000', u'814,745,000', u'39,104,000', u'36,613,000', u'5,491,000', u'7,950,000', u'(10,849,000)'] 
print len(Source) 
#33 
print len(Amount) 
#32 
#remove first element 
Source = Source[1:] 

df=pd.DataFrame({'Source':Source,'Amount':Amount}, columns=['Source','Amount'])  
print df 
              Source  Amount 
0       University of Minnesota 119,367,000 
1   Minnesota State Colleges and Universities 159,812,000 
2           Education  7,491,000 
3       Minnesota State Academies 11,354,000 
4     Perpich Center for Arts Education  2,000,000 
5         Natural Resources 63,480,000 
6       Pollution Control Agency  2,625,000 
7     Board of Water and Soil Resources  8,000,000 
8          Agriculture  203,000 
9         Zoological Garden 12,000,000 
10         Administration 127,000,000 
11    Minnesota Amateur Sports Commission  7,973,000 
12         Military Affairs  3,244,000 
13         Public Safety  4,030,000 
14         Transportation 57,263,000 
15        Metropolitan Council 45,968,000 
16         Human Services 86,387,000 
17         Veterans Affairs  2,800,000 
18          Corrections 11,881,000 
19    Employment and Economic Development 92,130,000 
20      Public Facilities Authority 45,993,000 
21       Housing Finance Agency 20,000,000 
22      Minnesota Historical Society 12,002,000 
23        Bond Sale Expenses  900,000 
24         Cancellations (10,849,000) 
25           TOTAL 893,054,000 
26 Bond Proceeds Fund (General Fund Debt Service) 814,745,000 
27 Bond Proceeds Fund (User Financed Debt Service) 39,104,000 
28      State Transportation Fund 36,613,000 
29     Maximum Effort School Loan Fund  5,491,000 
30        Trunk Highway Fund  7,950,000 
31      Bond Proceeds Cancellations (10,849,000) 
+0

請檢查輸出,這是你想要的嗎? – jezrael

+0

這就是我正在尋找的。謝謝! –

+1

如果我的回答很有幫助,請不要忘記[接受](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work)它。謝謝。 – jezrael

相關問題