2017-07-24 17 views
1

目前,我有以下情況。如何使用Pandas庫將一個值與Python中的多個值進行比較?

Excel Data Frame =   SQL Data Frame = 
________       ________ _______ ___________ _________ 
|sector|       |sector| | hour| | value_cs| value_ps| 
--------       -------- ------- ----------- --------- 
AXYZ        AXYZ  0  78.90  87.10 
BYYT        RACH  0  87.12  13.90 
IOPL        IOPL  0  93.10  13.87 
XFTR        AXYZ  1  27.90  12.87 
MANU        IOPL  1  23.09  90.09 
            FRES  2  34.09  12.34 
            YYYT  2  12.43  32.98 
            REWT  3  98.09  99.99 

我有一個Excel文件和一組SQL結果和我想的部門列的每個值從對所有的值Excel文件從SQL結果的部門列進行比對,結果,如果這兩列的值之間存在匹配,則將SQL結果中的列小時,value_csvalue_ps添加到新數據框中。 注意: SQL結果的數據與Excel文件的數據大小不一樣。

期望的結果

New data frame 1 for value cs 
    ________ ____ ___ ___ ___ ___ ___ ___  ____      
    |sector| |0| |1| |2| |3| |4| |5| |6| .... |23| 
    -------- ---- --- ---- --- --- --- ----  ----        
    AXYZ 78.90 27.90 78.89 54.90 98.23 85.0 45.90  68.23 
    BYYT 18.94 67.10 65.69 76.32 76.56 56.03 56.23  87.65 
    IOPL 93.10 23.09 34.29 97.34 34.34 14.54 34.91  23.21 
    ...  ... 

New data frame 2 for value ps 
    ________ ____ ___ ___ ___ ___ ___ ___  ____      
    |sector| |0| |1| |2| |3| |4| |5| |6| .... |23| 
    -------- ---- --- ---- --- --- --- ----  ----        
    AXYZ 87.10 12.87 49.89 84.90 76.23 15.01 12.90  68.23 
    BYYT 28.43 27.11 54.69 57.12 19.56 45.12 45.23  47.15 
    IOPL 13.87 90.09 24.19 47.34 18.34 21.54 67.11  13.61 
    ...  ... 

我遵循的方法是在SQL結果轉換成數據幀,以及從Excel文件中的數據,但我不知道如何在不執行比較一個for循環,但只能使用Pandas(for循環會花費太多時間來執行計算)。

import pandas as pd 
import pypyodbc 
from datetime import date 

def get_and_compare(): 

    start_date = date.today() 

    retrieve_values = "[DEV].[CS].[QA_Export] @start_date='{start_date:%Y-%m-%d}'".format(start_date=start_date) 

    # Connect to the database 
    db_connection = pypyodbc.connect(driver="{SQL Server}", server="xxx.xxx.xxx.xxx", uid="xxx", 
             pwd="xxx", Trusted_Connection="No") 

    # Get the sql result into dataframe 
    data_frame_sql = pd.read_sql(retrieve_values,db_connection) 


    #declare new data frames 
    new_df_one = pd.DataFrame(columns=['sector', 'value cs', 'hour 0', 'hour 1', 'hour 2', 'hour 3', 'hour 4', 
            'hour 5', 'hour 6', 'hour 7', 'hour 8', 'hour 9', 'hour 10', 'hour 11', 
            'hour 12', 'hour 13', 'hour 14', 'hour 15', 'hour 16', 'hour 17', 'hour 18', 
            'hour 19', 'hour 20', 'hour 21', 'hour 22', 'hour 23']) 

    new_df_two = pd.DataFrame(columns=['sector', 'value ps', 'hour 0', 'hour 1', 'hour 2', 'hour 3', 'hour 4', 
            'hour 5', 'hour 6', 'hour 7', 'hour 8', 'hour 9', 'hour 10', 'hour 11', 
            'hour 12', 'hour 13', 'hour 14', 'hour 15', 'hour 16', 'hour 17', 'hour 18', 
            'hour 19', 'hour 20', 'hour 21', 'hour 22', 'hour 23']) 


    # Read the Excel file 
    current_wb = pd.ExcelFile \ 
    ("C:\\U\\dev\\testing\\Main values to compare.xlsx") 

    # Get the specific sheet to compare 
    working_values = current_wb.parse("Main values") 

    #Get the column from Excel 
    sector_from_excel = working_values['sector'] 

    #Comparison to perform 
    #.... unknown part 

所有的建議,意見將不勝感激,以幫助我完成這部分的代碼。

回答

1

試試這個:

def get_and_compare(): 

    start_date = date.today() 

    retrieve_values = "[DEV].[CS].[QA_Export] @start_date='{start_date:%Y-%m-%d}'".format(start_date=start_date) 

    # Connect to the database 
    db_connection = pypyodbc.connect(driver="{SQL Server}", server="xxx.xxx.xxx.xxx", uid="xxx", 
             pwd="xxx", Trusted_Connection="No") 

    # Get the sql result into dataframe 
    data_frame_sql = pd.read_sql(retrieve_values,db_connection) 


    # Read the Excel file 
    current_wb = pd.ExcelFile \ 
    ("C:\\U\\dev\\testing\\Main values to compare.xlsx") 

    # Get the specific sheet to compare 
    working_values = current_wb.parse("Main values") 

    #Get the column from Excel 
    sector_from_excel = working_values['sector'] 

    # perform inner join between DataFrames 
    # note: this requires that "sector" is a column (and not an index) 
    # in both DataFrames, and that it is also named as "sector" in each 
    merged_df = data_frame_sql.merge(sector_from_excel, how="inner", on="sector") 

    # use "pivot" to reshape data from wide to long 
    # first with value_cs 
    cs_value_df = merged_df.pivot(index="sector", columns="hour", values="value_cs") 

    # and then with value_ps 
    ps_value_df = merged_df.pivot(index="sector", columns="hour", values="value_ps") 

    # I'd suggest returning both DataFrames in a single object; 
    # in this case I'm using a dict 
    return {"value cs": cs_value_df, "value ps": ps_value_df} 

對於它的價值,我建議拆分此功能分爲多個功能,也許是一個生成的SQL查詢,一個讀取Excel文件,和一個執行熊貓運營。將這麼多的動作寫入一個函數並不是一個好習慣 - 如果有必要這麼做,調試將非常繁瑣。

相關問題