Speed up Python w/ sqlalchemy function

I have a function that populates a database table using Python and SQLAlchemy. The function runs quite slowly, taking about 17 minutes. I think the main problem is that I'm looping over two large sets of data to build the new table. I've included the record counts in the code below.
How can I speed this up? Should I try converting the nested for loops into one big SQLAlchemy query? I profiled the function with PyCharm, but I'm not sure I fully understand the results.
def populate(self):
    """Core function to populate positions."""
    # get raw annotations with tag Org
    # returns 11,659 records
    organizations = model.session.query(model.Annotation) \
        .filter(model.Annotation.tag == 'Org')\
        .filter(model.Annotation.organization_id.isnot(None)).all()
    # get raw annotations with tags Support or Oppose
    # returns 2,947 records
    annotations = model.session.query(model.Annotation) \
        .filter((model.Annotation.tag == 'Support') | (model.Annotation.tag == 'Oppose')).all()
    for org in organizations:
        for anno in annotations:
            # Org overlaps with Support or Oppose tag
            # start and end columns are integers
            if org.start >= anno.start and org.end <= anno.end:
                position = model.Position()
                # set to de-duplicated organization
                position.organization_id = org.organization_id
                position.disposition = anno.tag
                # look up bill_id from document_bill table
                document = model.session.query(model.document_bill)\
                    .filter_by(document_id=anno.document_id).first()
                position.bill_id = document.bill_id
                position.document_id = anno.document_id
                model.session.add(position)
                logging.info('org: {}, disposition: {}, bill: {}'.format(
                    position.organization_id, position.disposition, position.bill_id)
                )
                continue
        logging.info('committing to database')
        model.session.commit()
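The "one big query" the question asks about is feasible: the overlap test `org.start >= anno.start and org.end <= anno.end` is just a self-join condition on the annotation table, so the database can produce the Org/Support/Oppose pairs directly instead of a 11,659 × 2,947 Python loop. Below is a minimal runnable sketch using an in-memory SQLite database; the `Annotation` model is a stand-in with only the columns the question mentions, and all names are assumptions, not the asker's actual schema.

```python
# Hypothetical sketch: push the nested-loop overlap check into one
# self-join query. The Annotation model here is a minimal stand-in
# with only the columns visible in the question.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import aliased, declarative_base, sessionmaker

Base = declarative_base()

class Annotation(Base):
    __tablename__ = 'annotation'
    id = Column(Integer, primary_key=True)
    tag = Column(String)
    organization_id = Column(Integer)
    document_id = Column(Integer)
    start = Column(Integer)
    end = Column(Integer)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# sample data: one Org span inside a Support span, one unrelated Oppose span
session.add_all([
    Annotation(tag='Org', organization_id=7, document_id=1, start=5, end=10),
    Annotation(tag='Support', document_id=1, start=0, end=20),
    Annotation(tag='Oppose', document_id=2, start=50, end=60),
])
session.commit()

# two aliases of the same table let the database do the pairing
Org, Anno = aliased(Annotation), aliased(Annotation)
pairs = (session.query(Org, Anno)
         .filter(Org.tag == 'Org', Org.organization_id.isnot(None))
         .filter(Anno.tag.in_(['Support', 'Oppose']))
         .filter(Org.start >= Anno.start, Org.end <= Anno.end)
         .all())

for org, anno in pairs:
    print(org.organization_id, anno.tag)  # the matched (Org, disposition) pairs
```

Each row of `pairs` carries everything needed to build a `Position` without further per-row queries, and joining in the `document_bill` table would fold the `bill_id` lookup into the same statement.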
Questions about improving code that already works are a better fit for [codereview.se] than Stack Overflow. – jwodder
You're running at least 11,659 * 2,947 = 34,359,073 queries just to fetch `bill_id`; how could that *not* be slow? `model.session.commit()` also expires all of the `organizations` and `annotations`, which means that in your inner loop each `anno` gets refreshed once per flush after the first iteration over `org`, adding another 11,658 * 2,947 = 34,356,126 queries, for a total of 68,715,199 queries, most of which is wasted work. You can fetch `document_bill` with a single query outside the loop, and then make sure you don't expire the annotations on commit. Finally, see if you can do a bulk insert. – univerio
Thanks! I'll try moving `model.session.commit()` outside the loops so it runs once at the end. I don't understand how to move the `document_bill` query outside the loop, though, since it depends on the current value of `anno.document_id` supplied inside the loop. – Casey
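univerio's point is that `anno.document_id` is only used as a lookup key, so the whole `document_bill` table can be fetched with one query before the loops and turned into a dict; the per-pair query then becomes a dict lookup. A hypothetical sketch, with plain tuples standing in for the SQLAlchemy rows so it runs without a database:

```python
# Hypothetical sketch: fetch document_bill once, then look bill_id up in
# a dict inside the loop instead of issuing one query per (org, anno) pair.

def bill_lookup(document_bill_rows):
    """Turn (document_id, bill_id) rows into a {document_id: bill_id} dict."""
    return {doc_id: bill_id for doc_id, bill_id in document_bill_rows}

# stand-in for: model.session.query(model.document_bill).all()
rows = [(1, 'HB101'), (2, 'SB202')]
bill_by_document = bill_lookup(rows)

# inside the loop, instead of a query per pair:
#   position.bill_id = bill_by_document[anno.document_id]
print(bill_by_document[1])  # -> HB101
```

Combined with a single `commit()` at the end (and, if it helps, accumulating the rows and inserting them with `Session.bulk_insert_mappings`), this removes both sources of the tens of millions of queries univerio counted.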