2015-06-11 83 views

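I am building a Python database layer for running analytics on top of a star-schema database, and I am having trouble integrating pandas and SQLAlchemy because some column keys end up duplicated in the DataFrame. Is it expected to get duplicate columns when querying SQLAlchemy into a pandas DataFrame?

Here are the classes: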

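from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import backref, declarative_base, relationship

Base = declarative_base()

class Student(Base):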
    __tablename__ = 'DimStudent' 

    id = Column('StudentKey', Integer, primary_key=True) 
    srcstudentid = Column('SrcStudentId', Integer)
    firstname = Column('FirstName', String) 
    middlename = Column('MiddleName', String) 
    lastname = Column('LastName', String) 
    lep = Column('LimitedEnglishProficiency', String) 
    frl = Column('FreeReducedLunch', String) 
    sped = Column('SpecialEducation', String) 

class School(Base): 
    __tablename__ = 'DimSchool' 

    id = Column('SchoolKey', Integer, primary_key=True) 
    name = Column('SchoolName', String) 
    district = Column('SchoolDistrict', String) 
    statecode = Column('StateCode', String) 

class StudentScore(Base): 
    __tablename__ = 'FactStudentScore' 

    studentkey = Column('StudentKey', Integer, ForeignKey('DimStudent.StudentKey'), primary_key=True) 
    teacherkey = Column('TeacherKey', Integer, ForeignKey('DimTeacher.TeacherKey'), primary_key=True)  
    schoolkey = Column('SchoolKey', Integer, ForeignKey('DimSchool.SchoolKey'), primary_key=True)
    assessmentkey = Column('AssessmentKey', Integer, ForeignKey('DimAssessment.AssessmentKey'), primary_key=True) 
    subjectkey = Column('SubjectKey', Integer, ForeignKey('DimSubject.SubjectKey'), primary_key=True) 
    yearcyclekey = Column('YearCycleKey', Integer, ForeignKey('DimYearCycle.YearCycleKey'), primary_key=True) 
    pointspossible = Column('PointsPossible', Integer) 
    pointsreceived = Column('PointsReceived', Integer) 

    student = relationship("Student", backref=backref('studentscore')) 
    school = relationship("School", backref=backref('studentscore')) 
    assessment = relationship("Assessment", backref='studentscore') 
    teacher = relationship("Teacher", backref='studentscore') 
    subject = relationship("Subject", backref='studentscore') 
    yearcycle = relationship("YearCycle", backref='studentscore')  

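Whenever I query my data, I keep ending up with duplicate columns in the result. For example, the school key comes back twice in this ORM call, which I then use to build a DataFrame.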

school = (
    session.query(StudentScore, School, Subject)
    .join(StudentScore.school)
    .join(StudentScore.subject)
    .filter(School.name.like('%Dever%'))
    .filter(Subject.code == 'Math')
)

a = pd.read_sql(school.statement, school.session.bind) 
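
One way to avoid the duplicate labels at the source might be to ask the query for individual columns instead of whole entities, labelling anything that would collide. The sketch below is not from the thread. It assumes SQLAlchemy 1.4 or newer and a pandas version compatible with it, and it uses an in-memory SQLite database, trimmed-down versions of the School and StudentScore classes, made-up rows, and a hypothetical label `school_name`.

```python
import pandas as pd
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class School(Base):
    __tablename__ = "DimSchool"
    id = Column("SchoolKey", Integer, primary_key=True)
    name = Column("SchoolName", String)


class StudentScore(Base):
    # Trimmed-down fact table: only the columns needed for the illustration.
    __tablename__ = "FactStudentScore"
    studentkey = Column("StudentKey", Integer, primary_key=True)
    schoolkey = Column(
        "SchoolKey", Integer, ForeignKey("DimSchool.SchoolKey"), primary_key=True
    )
    pointsreceived = Column("PointsReceived", Integer)
    school = relationship("School")


engine = create_engine("sqlite://")  # in-memory database, illustration only
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Made-up rows so the query has something to return.
    session.add_all(
        [
            School(id=1, name="Dever"),
            StudentScore(studentkey=10, schoolkey=1, pointsreceived=7),
        ]
    )
    session.commit()

    # Select explicit columns: the school key is requested once (from the
    # fact table), and the dimension column gets its own label.
    query = (
        session.query(
            StudentScore.studentkey,
            StudentScore.schoolkey,
            StudentScore.pointsreceived,
            School.name.label("school_name"),  # hypothetical label
        )
        .join(StudentScore.school)
        .filter(School.name.like("%Dever%"))
    )
    frame = pd.read_sql(query.statement, engine)

print(list(frame.columns))
```

Because every selected column has a distinct name, the resulting DataFrame needs no de-duplication afterwards. The cost is that the column list has to be spelled out per query.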

An SO thread offers a nice transpose technique for removing the duplicates.

a = a.T.drop_duplicates().T 

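However, I still get an error when I interact with this DataFrame in my IDE's variable explorer. The error is: "Reindexing only valid with uniquely valued Index objects"

Any ideas what the problem is?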
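
A possible diagnostic, offered as a sketch rather than a confirmed explanation of the error above: `a.T.drop_duplicates().T` compares column values, not column labels, so two columns that share a label but hold different values both survive. The toy frame below is made up, and the repeated `Code` column is a hypothetical stand-in for any pair of same-named columns from different tables.

```python
import pandas as pd

# Toy stand-in for a joined result: "SchoolKey" appears twice with identical
# values, while the hypothetical "Code" appears twice with different values.
a = pd.DataFrame(
    [[1, 1, "Math", "M1"], [2, 2, "Math", "M2"]],
    columns=["SchoolKey", "SchoolKey", "Code", "Code"],
)

# The transpose trick removes only columns whose *values* are identical.
transposed = a.T.drop_duplicates().T
print(list(transposed.columns), transposed.columns.is_unique)

# Keeping the first occurrence of each *label* guarantees unique labels.
deduped = a.loc[:, ~a.columns.duplicated()]
print(list(deduped.columns), deduped.columns.is_unique)
```

Checking `a.columns.is_unique` on the real DataFrame would show whether repeated labels are still present after the transpose step.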


As a temporary hack, could you use a.reset_index(inplace=True)? – Gecko


I haven't used sqlalchemy for a while, but do you really need .join(School)? – Gecko


A dirty solution might be to just do a.drop_duplicates(inplace=True) http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop_duplicates.html – PlagTag

Answers


Found the right answer! Instead of the simpler:

a = a.T.drop_duplicates().T 

I used groupby to remove the duplicates:

df.T.groupby(level=0).first().T 

That said, I don't know what drove my original error. The new line of code is 10-100x faster than the old one.
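
For readers who want to see what the groupby line does, here is a minimal sketch on a made-up frame (the column names are borrowed from the question, the values are invented). Two side effects are worth knowing about: grouping sorts the labels, so the column order can change, and the round trip through a transpose can leave columns with object dtype, which `infer_objects()` may restore.

```python
import pandas as pd

# Made-up frame in which the "SchoolKey" label appears twice.
df = pd.DataFrame(
    [["Dever", 1, 1], ["Dever", 2, 2]],
    columns=["SchoolName", "SchoolKey", "SchoolKey"],
)

# Transpose so the labels become the index, keep the first non-null row per
# label, then transpose back.
result = df.T.groupby(level=0).first().T
print(list(result.columns))  # labels come back sorted, not in original order

# Optionally let pandas re-infer column dtypes after the round trip.
restored = result.infer_objects()
print(restored["SchoolKey"].tolist())
```

Note that `first()` takes the first non-null value per label, so if two same-named columns hold different data, the later column is silently discarded.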


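Also found this: http://stackoverflow.com/questions/22115819/handling-duplicate-columns-in-pandas-dataframe-constructor-from-sqlalchemy-join It is a related question, but it seems they are working with a plain query return. If someone could come up with a similarly elegant solution for removing the duplicates from the return, I would love them forever. – AZhao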