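Python3, Pandas - New column value based on data in the column to the left (dynamic)
I have a spreadsheet with several columns of survey responses. This spreadsheet will be merged with other spreadsheets, after which I will have duplicate rows similar to the ones below. I will then need to take all questions with the same text and calculate the percentage for each answer based on the entire merged document.
Example Excel data: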
**Poll Question** **Poll Responses**
The content was clear and effectively delivered 37 Total Votes
Strongly Agree 24.30%
Agree 70.30%
Neutral 2.70%
Disagree 2.70%
Strongly Disagree 0.00%
The Instructor(s) were engaging and motivating 37 Total Votes
Strongly Agree 21.60%
Agree 73.00%
Neutral 2.70%
Disagree 2.70%
Strongly Disagree 0.00%
I would attend another training session delivered by this Instructor(s) 37 Total Votes
Strongly Agree 21.60%
Agree 73.00%
Neutral 5.40%
Disagree 0.00%
Strongly Disagree 0.00%
This was a good format for my training 37 Total Votes
Strongly Agree 24.30%
Agree 62.20%
Neutral 8.10%
Disagree 2.70%
Strongly Disagree 2.70%
Any comments/suggestions about this training course? 5 Total Votes
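My approach to calculating the non-percent number of votes would be to convert the percentage back into a count. For example, find and extract 37 from 37 Total Votes, then use the following formula to get the number of users who voted for that particular answer: percent * total / 100.
So 24.30 * 37 / 100 = 8.99, which rounded means that 9 out of 37 people voted "Strongly Agree".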
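The arithmetic above can be sketched as a small function. This is only an illustration: the name `votes_from_percent` is hypothetical, and it assumes the total always appears in the form "N Total Votes" and the percentage in the form "24.30%".

```python
import re


def votes_from_percent(percent_text, total_text):
    """Turn e.g. '24.30%' and '37 Total Votes' into a whole number of votes."""
    # Strip the trailing percent sign and parse the number
    percent = float(percent_text.strip().rstrip('%'))
    # Pull the leading integer out of "N Total Votes" (assumed format)
    total = int(re.search(r'(\d+)\s+Total\s+Votes', total_text).group(1))
    # percent * total / 100, rounded to the nearest whole vote
    return round(percent * total / 100)


print(votes_from_percent('24.30%', '37 Total Votes'))
```

Because the published percentages are already rounded, the recovered counts are estimates; for the first question in the sample they happen to add up to 37 again.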
Here is an example of the spreadsheet I hope to end up with:
**Poll Question** **Poll Responses** **non-percent** **subtotal**
... 37 Total Votes 0 37
... 24.30% 9 37
... 70.30% 26 37
... 2.70% 1 37
... 2.70% 1 37
... 0.00% 0 37
(Note: non-percent and subtotal would be newly created columns.)
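One possible way to produce these two columns in pandas is to extract the total from each "N Total Votes" row and forward-fill it down to the next such row, so the number of answers per question does not need to be fixed. The sketch below assumes the columns are named `Poll Question` and `Poll Responses` as in the sample; the function name `add_vote_columns` is hypothetical.

```python
import pandas as pd


def add_vote_columns(df):
    """Add 'non-percent' and 'subtotal' columns next to 'Poll Responses'."""
    out = df.copy()
    responses = out['Poll Responses'].astype(str).str.strip()

    # Number from rows like "37 Total Votes"; NaN on every other row
    totals = pd.to_numeric(
        responses.str.extract(r'^(\d+)\s+Total\s+Votes$', expand=False),
        errors='coerce',
    )
    # Carry each total down until the next "N Total Votes" row appears
    subtotal = totals.ffill()

    # Number from rows like "24.30%"; NaN on question rows
    percents = pd.to_numeric(
        responses.str.extract(r'^(\d+(?:\.\d+)?)%$', expand=False),
        errors='coerce',
    )

    # percent * total / 100, rounded; question rows get 0
    out['non-percent'] = (percents * subtotal / 100).round().fillna(0).astype(int)
    out['subtotal'] = subtotal.fillna(0).astype(int)
    return out


sample = pd.DataFrame({
    'Poll Question': [
        'The content was clear and effectively delivered',
        'Strongly Agree', 'Agree', 'Neutral', 'Disagree', 'Strongly Disagree',
        'Any comments/suggestions about this training course?',
    ],
    'Poll Responses': [
        '37 Total Votes', '24.30%', '70.30%', '2.70%', '2.70%', '0.00%',
        '5 Total Votes',
    ],
})
result = add_vote_columns(sample)
print(result[['Poll Responses', 'non-percent', 'subtotal']])
```

Forward-filling keys off the "Total Votes" marker rather than a row count, so a comment-box question with no answer rows simply starts a new group.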
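Currently I take a folder full of .xls files, loop through that folder, and save each file in .xlsx format to another folder. Inside that loop I have added a comment block marked # NEW test CODE, which is where I am trying to put the logic. As you can see, I am trying to locate the cell and get its value, then use a regex to extract the number from it (and then add that number to the subtotal column in that row). I then want to keep adding it to each row until I see a new instance of a row containing x Total Votes.
Here is my current code: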
import re

import numpy as np
import pandas as pd

# get_files, get_event_id and get_event_title are helper functions not shown here
files = get_files('/excels/', '.xls')
df_array = []
for i, f in enumerate(files, start=1):
    sheet = pd.read_html(f, attrs={'class': 'reportData'}, flavor='bs4')
    event_id = get_event_id(pd.read_html(f, attrs={'id': 'eventSummary'}))
    event_title = get_event_title(pd.read_html(f, attrs={'id': 'eventSummary'}))
    filename = event_id + '.xlsx'
    rel_path = 'xlsx/' + filename
    writer = pd.ExcelWriter(rel_path)
    for df in sheet:
        # NEW test CODE
        q_total = 0
        df.columns = df.columns.str.strip()
        # Boolean mask of the rows that hold "x Total Votes";
        # a Series cannot be used directly in `if`, so test it with .any()
        is_total = df['Poll Responses'].astype(str).str.contains("Total Votes")
        if is_total.any():
            # Apply the regex to a single cell value, not to the whole Series
            first_total = str(df.loc[is_total, 'Poll Responses'].iloc[0])
            q_total = re.findall(r'.+?(?=\sTotal\sVotes)', first_total)[0]
            print(q_total)
        # df['Question Total'] = np.where(df['Poll Responses'].str.contains("Total Votes"), 'yes', 'no')
        # END NEW test Code
        df.insert(0, 'Event ID', event_id)
        df.insert(1, 'Event Title', event_title)
        df.to_excel(writer, sheet_name='sheet')
    # ExcelWriter.save() was removed in pandas 2.0; close() writes the file
    writer.close()
    # progress of entire list
    if i <= len(files):
        print('\r{:*^10}{:.0f}%'.format('Converting: ', i/len(files)*100), end='')
print('\n')
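TL;DR: This may seem convoluted, but if I can get two new columns, one containing the total number of votes for a question and one containing the number of votes (not the percentage) for each answer, then I can do some VLOOKUP magic on the merged document. Any help or suggestions on approach would be greatly appreciated. Thanks!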
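As an alternative to doing the final step with VLOOKUP, the combined percentages could also be computed in pandas once vote counts are available. The sketch below is only an illustration under assumptions: it presumes the merged data has been reshaped into one row per answer per file with hypothetical column names `question`, `answer`, and `votes`, and the sample numbers are made up.

```python
import pandas as pd


def combined_percentages(merged):
    """Percentage of each answer per question across all merged files."""
    # Sum the votes for every (question, answer) pair across files
    votes = merged.groupby(['question', 'answer'], sort=False)['votes'].sum()
    # Total votes per question, aligned back onto each answer row
    totals = votes.groupby(level='question').transform('sum')
    return (votes / totals * 100).round(2).rename('percent').reset_index()


# Made-up data: the same question appearing in two source files
merged = pd.DataFrame({
    'question': ['Q1', 'Q1', 'Q1', 'Q1'],
    'answer': ['Agree', 'Disagree', 'Agree', 'Disagree'],
    'votes': [3, 1, 1, 3],
})
summary = combined_percentages(merged)
print(summary)
```

Summing vote counts before dividing weights each file by its number of voters, which averaging the per-file percentages would not do.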
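Will you always have the same number of answers for each question? You could read each sheet into a DataFrame and then append them together. The rest is pandas. – Kyle
Sadly, no. There could be "comment box" type questions, which won't be 5 rows apart from the others. Or the user may choose not to run a Likert-style survey. – Kenny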