re.match takes a long time to finish

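I am new to Python and wrote the following code, which runs very slowly.

I debugged the code and found that the last re.match() call is what makes it so slow. The earlier calls perform the same kind of match against the same DataFrame, yet they return quickly.

Here is the code: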

My_Cells = pd.read_csv('SomeFile',index_col = 'Gene/Cell Line(row)').T 
My_Cells_Others = pd.DataFrame(index=My_Cells.index,columns=[col for col in My_Cells if re.match('.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$',col)]) 
My_Cells_Genes = pd.DataFrame(index=My_Cells.index,columns=[col for col in My_Cells if re.match('.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$',col) is None ]) 
for col in My_Cells.columns: 
    if re.match('.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$',col): 
      My_Cells_Others [col] = pd.DataFrame(My_Cells[col]) 
    if re.match('.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$',col) is None: 
      My_Cells_Genes [col] = pd.DataFrame(My_Cells[col])

I don't think the problem is related to the regular expression. The code below, which uses no regex at all, is still slow.

for col in My_Cells_Others.columns: 
    if (col in lst) or col.endswith(' CN') or col.endswith(' MUT'): 
      My_Cells_Others [col] = My_Cells[col] 
for col in My_Cells_Genes.columns: 
    if not ((col in lst) or col.endswith(' CN') or col.endswith(' MUT')): 
     My_Cells_Genes [col] = My_Cells[col]

2015-04-26 user1050702

if col.endswith('CN') or col.endswith('MUT') or col in ['bladder', 'blood', 'bone', ...]: – jedwards

You can compile the regular expression like this: p = re.compile(ur'.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$') *outside* the loop. Then use if (p.match(col)) ... –
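A minimal sketch of the suggestion in the comment above, written for Python 3. The ur'' prefix is Python 2 only, so a plain raw string is used instead. The names PATTERN and columns are illustrative and do not come from the original code. Note that the re module already caches recently used patterns, so compiling up front may save only a little time.

```python
import re

# Compile the question's pattern once, outside any loop.
PATTERN = re.compile(
    r'.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$'
    r'|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$'
    r'|^thyroid$|^upper aerodigestive$|^uterus$'
)

# Hypothetical column names standing in for My_Cells.columns.
columns = ['TP53 MUT', 'EGFR CN', 'lung', 'BRCA1', 'soft tissue']

# Reuse the compiled pattern for every column.
others = [col for col in columns if PATTERN.match(col)]
genes = [col for col in columns if PATTERN.match(col) is None]
```

With the sample names above, others holds the four annotation columns and genes holds only 'BRCA1'.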

Specifically, the second loop above. The DataFrame is large, about 14000 columns, but I'm not sure that is the cause – user1050702

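A "badly" designed regular expression can be unnecessarily slow.

My guess is that the .*\sCN and .*\sMUT alternatives, applied to a long string that does not match, are what make it slow, because they force the regex engine to try every possible combination before giving up.

As @jedwards said, you can replace this piece of code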

if re.match('.*\sCN$|.*\sMUT$|^bladder$|^blood$|^bone$|^breast$|^CNS$|^GI tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$',col): 
      My_Cells_Others [col] = pd.DataFrame(My_Cells[col])

with:

lst = ['bladder', 'blood', 'bone', 'breast', 'CNS', 'GI tract', 'kidney', 'lung', 'other', 'ovary', 'pancreas', 'skin', 
     'soft tissue', 'thyroid', 'upper aerodigestive', 'uterus'] 

if (col in lst) or col.endswith(' CN') or col.endswith(' MUT'): 
    # Do stuff
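The check above can also be used to split the column names in a single pass. The following is only a sketch: TISSUES, SUFFIXES, is_other, and split_columns are hypothetical names, and the sample column names are made up. One behavioural difference is worth noting: \s in the original regex matches any whitespace character, while endswith(' CN') accepts only a plain space.

```python
# Tissue names from the answer, stored in a set for constant-time lookup.
TISSUES = {
    'bladder', 'blood', 'bone', 'breast', 'CNS', 'GI tract', 'kidney',
    'lung', 'other', 'ovary', 'pancreas', 'skin', 'soft tissue',
    'thyroid', 'upper aerodigestive', 'uterus',
}

# str.endswith accepts a tuple and returns True if any suffix matches.
SUFFIXES = (' CN', ' MUT')


def is_other(col):
    """Return True for annotation columns (tissue, copy number, mutation)."""
    return col in TISSUES or col.endswith(SUFFIXES)


def split_columns(columns):
    """Partition column names into (others, genes) in one pass."""
    others, genes = [], []
    for col in columns:
        (others if is_other(col) else genes).append(col)
    return others, genes


# Hypothetical column names standing in for My_Cells.columns.
sample = ['TP53 MUT', 'EGFR CN', 'lung', 'BRCA1', 'soft tissue', 'MUT']
others, genes = split_columns(sample)
```

The two lists could then be used to select all matching columns at once, for example My_Cells[others], rather than assigning roughly 14000 columns one at a time. Whether the per-column assignment is the real bottleneck here is an assumption, not something the thread confirms.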

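Alternatively, if you want to use re for some reason, moving .*\sCN and .*\sMUT to the end of the regular expression might help, depending on your data, because the engine will not be forced to check all those combinations unless it really has to.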
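One possible way to write the reordered pattern is sketched below. It is a variant, not the answer's exact code: it uses fullmatch in place of the explicit ^ and $ anchors and merges the two wildcard alternatives into one. NAMES, REORDERED, and matches are hypothetical names. Whether this is measurably faster depends on the data, and since the endswith version in the question was also slow, the regex may not be the bottleneck at all.

```python
import re

NAMES = ['bladder', 'blood', 'bone', 'breast', 'CNS', 'GI tract', 'kidney',
         'lung', 'other', 'ovary', 'pancreas', 'skin', 'soft tissue',
         'thyroid', 'upper aerodigestive', 'uterus']

# Literal names come first; the wildcard alternative is tried last.
REORDERED = re.compile(
    '|'.join(re.escape(name) for name in NAMES) + r'|.*\s(?:CN|MUT)'
)


def matches(col):
    """Return True if the whole column name matches the pattern."""
    return REORDERED.fullmatch(col) is not None
```

For example, matches('TP53 MUT') and matches('GI tract') are True, while matches('BRCA1') is False.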

2015-04-26 11:25:25

I edited the original question above – user1050702
