2017-06-27 34 views

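Python - Append a str value to certain rows in a DataFrame

I have a value stored in a string. I want to append that value to the rows that meet a certain condition, and to no other rows.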

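The image below shows the tables I need to parse. I can easily parse the file with BeautifulSoup and turn it into a pandas DataFrame, but for both of the tables below I am struggling to capture the Package price and append it to the whole DataFrame. Ideally, the price value would sit side by side with each fish/weight pair, so there would be a single column holding the same price on every row.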

enter image description here

Here is the code I use to parse the tables:

import email

import pandas as pd
from bs4 import BeautifulSoup

with open(file_path) as in_f:
    msg = email.message_from_file(in_f)  # type: <class 'email.message.Message'>

html_msg = msg.get_payload(1)  # second MIME part; type: <class 'email.message.Message'>

body = html_msg.get_payload(decode=True)  # type: <class 'bytes'> or type: 'int'

html = body.decode()  # type: <class 'str'>

# Name the parser explicitly so the result does not depend on what is installed
tablez = BeautifulSoup(html, "html.parser").find_all("table")  # type: <class 'bs4.element.ResultSet'>
data = []
for table in tablez:
    for row in table.find_all("tr"):
        data.append([cell.text.strip() for cell in row.find_all("td")])

fish_frame = pd.DataFrame(data)

This is what data looks like:

data: [['Species', 'Price', 'Weight'], ['GBW Cod', '.55', '8,059'], ['GBE Haddock', '.03', '14,628'], ['GBW Haddock', '.02', '87,451'], ['GB YT', '1.50', '1,818'], ['Witch', '1.25', '1,414'], ['GB Winter', '.40', '23,757'], ['Redfish', '.02', '123'], ['White Hake', '.40', '934'], ['Pollock', '.02', '7,900'], ['Package Price:', '', '$21,151.67'], ['Species', 'Weight'], ['GBE Cod', '820'], ['GBW Cod', '15,279'], ['GBE Haddock', '32,250'], ['GBW Haddock', '192,793'], ['GB YT', '6,239'], ['SNE YT', '2,018'], ['GOM YT', '1,511'], ['Plaice', '2,944'], ['Witch', '1,100'], ['GB Winter', '158,608'], ['White Hake', '31'], ['Pollock', '1,983'], ['SNE Winter', '7,257'], ['Price', '$58,500.00'], ['Species', 'Weight'], ['GBE Cod', '792'], ['GBW Cod', '14,767'], ['GBE Haddock', '29,199'], ['GBW Haddock', '174,556'], ['GB YT', '5,268'], ['SNE YT', '544'], ['GOM YT', '1,957'], ['Plaice', '2,452'], ['Witch', '896'], ['GB Winter', '163,980'], ['White Hake', '8'], ['Pollock', '1,743'], ['SNE Winter', '3,709'], ['Price', '$57,750.00']] 

Then I capture the Package price with this code:

stew = BeautifulSoup(html, 'html.parser')
chunks = stew.find_all('p', {'class': "MsoNormal"})
for line in chunks:
    if 'Package' in line.text:
        package_price = line.text
        print("package_price:", package_price)

But I am now struggling to add the price value to the DataFrame as its own column. Running a command such as fish_frame = pd.DataFrame(package_price) results in:

Traceback (most recent call last):
  File "Z:/Code/NEFS_stock_then_weight_attempt3.py", line 236, in <module>
    fish_frame = pd.DataFrame(package_price)
  File "C:\Users\stephen.mahala\AppData\Local\Programs\Python\Python35-32\lib\site-packages\pandas\core\frame.py", line 345, in __init__
    raise PandasError('DataFrame constructor not properly called!')
pandas.core.common.PandasError: DataFrame constructor not properly called!

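I don't know why this fails. However, converting the string to a list breaks it apart: every character becomes its own list element, so every character ends up in its own cell of the DataFrame.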
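The character splitting described above happens because list() iterates over a string, whereas a list literal keeps the string whole. A minimal sketch, assuming a shortened stand-in value for package_price (the real string comes from the parsed e-mail):

```python
import pandas as pd

# Stand-in value (assumption); the real text comes from the parsed e-mail.
package_price = "Package for $21,151.67"

chars = list(package_price)  # one element per character
whole = [package_price]      # one element: the full string

print(len(chars), len(whole))

# A one-element list gives a 1x1 DataFrame holding the whole string.
one_cell = pd.DataFrame(whole, columns=["package_price"])
print(one_cell)
```

Wrapping the string in a list is only a way to keep it in one piece; it does not by itself attach the price to the rows of fish_frame.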

Is there a method in pandas or BeautifulSoup that I'm not aware of that would simplify adding this single value to my DataFrame?

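You should edit your question to show the full, specific traceback of the error you are getting.

I create 'fish_frame' right after parsing the tables, in my first block of code – theprowler

Yes, I see how you create/initialize it. Can you show the *full* traceback?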

Answers

1

When I create fish_frame with pd.DataFrame(data), I get the following, which is made up of the data from every table:

    0   1   2 
0   Species  Price  Weight 
1   GBW Cod   .55  8,059 
2  GBE Haddock   .03  14,628 
3  GBW Haddock   .02  87,451 
4   GB YT  1.50  1,818 
5   Witch  1.25  1,414 
6  GB Winter   .40  23,757 
7   Redfish   .02   123 
8  White Hake   .40   934 
9   Pollock   .02  7,900 
10 Package Price:    $21,151.67 
11   Species  Weight  None 
12   GBE Cod   820  None 
13   GBW Cod  15,279  None 
14  GBE Haddock  32,250  None 
15  GBW Haddock  192,793  None 
16   GB YT  6,239  None 
17   SNE YT  2,018  None 
18   GOM YT  1,511  None 
19   Plaice  2,944  None 
20   Witch  1,100  None 
21  GB Winter  158,608  None 
22  White Hake   31  None 
23   Pollock  1,983  None 
24  SNE Winter  7,257  None 
25   Price $58,500.00  None 
26   Species  Weight  None 
27   GBE Cod   792  None 
28   GBW Cod  14,767  None 
29  GBE Haddock  29,199  None 
30  GBW Haddock  174,556  None 
31   GB YT  5,268  None 
32   SNE YT   544  None 
33   GOM YT  1,957  None 
34   Plaice  2,452  None 
35   Witch   896  None 
36  GB Winter  163,980  None 
37  White Hake   8  None 
38   Pollock  1,743  None 
39  SNE Winter  3,709  None 
40   Price $57,750.00  None 

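If you get rid of the outer loop, for table in tablez:, and just loop over the rows of tablez[0] (for row in tablez[0].find_all("tr")), I think you'll end up with: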

data = [['Species', 'Price', 'Weight'], ['GBW Cod', '.55', '8,059'], 
     ['GBE Haddock', '.03', '14,628'], ['GBW Haddock', '.02', '87,451'], 
     ['GB YT', '1.50', '1,818'], ['Witch', '1.25', '1,414'], 
     ['GB Winter', '.40', '23,757'], ['Redfish', '.02', '123'], 
     ['White Hake', '.40', '934'], ['Pollock', '.02', '7,900'], 
     ['Package Price:', '', '$21,151.67']] 

Then fish_frame = pd.DataFrame(data) will result in:

    0  1   2 
0   Species Price  Weight 
1   GBW Cod .55  8,059 
2  GBE Haddock .03  14,628 
3  GBW Haddock .02  87,451 
4   GB YT 1.50  1,818 
5   Witch 1.25  1,414 
6  GB Winter .40  23,757 
7   Redfish .02   123 
8  White Hake .40   934 
9   Pollock .02  7,900 
10 Package Price:   $21,151.67 
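The single-table loop described above can be sketched as follows. The HTML snippet here is made up for illustration and only stands in for the real e-mail body; the variable names follow the question's code.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML with two tables, standing in for the e-mail body (assumption).
html = """
<table>
  <tr><td>Species</td><td>Price</td><td>Weight</td></tr>
  <tr><td>GBW Cod</td><td>.55</td><td>8,059</td></tr>
  <tr><td>Package Price:</td><td></td><td>$21,151.67</td></tr>
</table>
<table>
  <tr><td>Species</td><td>Weight</td></tr>
  <tr><td>GBE Cod</td><td>820</td></tr>
</table>
"""

tablez = BeautifulSoup(html, "html.parser").find_all("table")

# Walk only the rows of the first table instead of every table.
data = [
    [cell.text.strip() for cell in row.find_all("td")]
    for row in tablez[0].find_all("tr")
]

fish_frame = pd.DataFrame(data)
print(fish_frame)
```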

Whether or not you make that change, this will add a column to fish_frame:

srs = pd.Series([package_price]*len(fish_frame)) 
fish_frame[3] = pd.Series(srs,index=fish_frame.index) 

And you should then end up with:

    0  1   2 3 
0   Species Price  Weight #891-2: Package for $21,151.67 but willing to sell species individually 
1   GBW Cod .55  8,059 #891-2: Package for $21,151.67 but willing to sell species individually 
2  GBE Haddock .03  14,628 #891-2: Package for $21,151.67 but willing to sell species individually 
3  GBW Haddock .02  87,451 #891-2: Package for $21,151.67 but willing to sell species individually 
4   GB YT 1.50  1,818 #891-2: Package for $21,151.67 but willing to sell species individually 
... 
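As a possible variation (not part of the answer above): pandas repeats a scalar on every row when it is assigned to a new column, so the intermediate Series is optional. The sketch below also pulls just the dollar amount out of the sentence with a regular expression; the pattern and the column name "package_price" are assumptions chosen for illustration.

```python
import re

import pandas as pd

# Shortened copy of the first table's rows from the question.
data = [['Species', 'Price', 'Weight'], ['GBW Cod', '.55', '8,059'],
        ['GBE Haddock', '.03', '14,628']]
fish_frame = pd.DataFrame(data)

package_price = ("#891-2: Package for $21,151.67 but willing to sell "
                 "species individually")

# Grab the first dollar amount; this pattern is an assumption about the text.
match = re.search(r"\$[\d,]+(?:\.\d+)?", package_price)
price_only = match.group(0) if match else None

# Assigning a scalar to a new column repeats it on every row.
fish_frame["package_price"] = price_only
print(fish_frame)
```

If the full sentence is wanted instead of the bare amount, assigning package_price directly works the same way.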
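Okay, that looks like a perfect printout, and I'm trying it in my code now. But am I understanding correctly that, instead of attacking all the tables at once and putting them all straight into the DataFrame, you changed it so that Python captures the data one row at a time? Is that right? Also, I've never fully understood what 'Series' has to do with pandas (I'm a newbie, if that wasn't obvious), but I see now that it's used to edit a DataFrame..? – theprowler

Well, I don't know about the first thing: do all (2) tables have to be in your DataFrame? Only you can answer that. If you need both, I would think about creating two separate DataFrame objects; otherwise, create the DF only from the table you are working with. As for 'Series', IDK, this is actually my first time using pandas.

You were already going "row by row", but you were going "row by row for every table". My suggestion is that you probably only want the data from the *first* table in 'fish_frame', so you can drop the 'for table in tablez' loop and simply iterate over the rows in 'tablez[0]'.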
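The idea of separate DataFrame objects mentioned in the comments could be sketched like this. It assumes, based only on the data shown in the question, that every table starts with a row whose first cell is 'Species' and ends with a row whose first cell contains 'Price'; the column name 'Package Price' is made up for illustration.

```python
import pandas as pd

# Shortened copy of the parsed rows from the question.
data = [['Species', 'Price', 'Weight'], ['GBW Cod', '.55', '8,059'],
        ['Package Price:', '', '$21,151.67'],
        ['Species', 'Weight'], ['GBE Cod', '820'], ['GBW Cod', '15,279'],
        ['Price', '$58,500.00']]

frames = []
header, rows = None, []
for row in data:
    if row[0] == 'Species':      # a header row starts a new table
        header, rows = row, []
    elif 'Price' in row[0]:      # a price row closes the current table
        frame = pd.DataFrame(rows, columns=header)
        frame['Package Price'] = row[-1]  # same price on every row
        frames.append(frame)
    else:
        rows.append(row)         # an ordinary species row

for frame in frames:
    print(frame, end="\n\n")
```

Each resulting frame then carries its own price next to every species/weight pair, which is the layout the question describes as ideal.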
