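I'm trying to parse a huge file (2 GB), and the code below takes far too long to get through it. I'm hoping someone can help me speed it up. How can I make the following Python 3 code faster?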
import os, shlex

def extractLinkField(inDir, outDir):
    fileList = os.listdir(inDir)
    links = set()
    for File in fileList:
        print(File)
        with open(os.path.join(inDir, File), encoding="utf8") as inFile:
            for line in inFile:
                try:
                    # Index 2 of the split line is the second quoted string
                    links.add(shlex.split(line)[2])
                except Exception:
                    # Skip lines that are malformed or have too few fields
                    continue
    # Append the unique values to the output file, one per line
    with open(os.path.join(outDir, 'extractedLinks.txt'), 'a+', encoding="utf8") as outFile:
        for link in links:
            outFile.write(link + '\n')

Path = os.path.join(os.getcwd(), 'logs')
extractLinkField(Path, os.getcwd())
The file format looks like this:
90 "m z pd gk y xr vo" "n l v ogtc dj wzb" "d zi pfgyo b tmhek" "df qu venr ls hzw j"
82 "p wgd lv f kt eb uq" " ij cw v a r y qp" " pf qdlcgm jz os y" "f xm n cr ublzig"
89 "c pgib a ost whk" "ria m h fvcb es z" "qzoy g xbr makc" "ms lqc v ektb w "
66 "zxm pe hb vi dj " "rg ebfwp y zv oakm" "b nut ko je m crsh" " imsxtzfw g ka j l "
2 "uyhnpt l dj qak " "o hned j pqub t a " "v hlyc afwi sgr p" "h wtvi g o nc sujqx"
17 "apo ufliz qctbd xh " "k lxgbrcwzf mnhtq p" "z gk m rsbu l" " ds m au w cior "
9 " h t ac jpn ok mz" "aty rs w box vk zefp" "nm fbc x egt zruap " "xg oi j z wyf v dqp"
82 "xs q ve k oi c " " z lfa dwiprxb ku g" "kua p f b oqz jrt " " t wlvy d po qrx e"
51 "cx iq wuvhb gkmo y" " u p yx bv mjz r" "oatc wuxd yfgjs ri " "vbg w h ife myl"
91 "cdqkp rn u ow h f" "ko rt y c eis d q jl" " lv fe r zpju yw " " wz vtxa jn lg s"
83 "bts dl kjycre ozv " " k i q m r ypsu lh " "pr exw sznqa yvu i " " uq tzk nomrx e "
Note that the quoted strings in the file must not be split; each one has to be extracted as a whole (still wrapped in its quotes).
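To make that requirement concrete, here is a minimal sketch of pulling out the same field (index 2 of the split line, i.e. the second quoted string) with a precompiled regular expression instead of `shlex.split`. It assumes every line starts with one unquoted token followed by double-quoted fields, and that the fields never contain embedded or escaped quotes. The names `THIRD_FIELD_RE` and `third_field` are made up for illustration. This may well be faster than general shell-style tokenizing, but that is worth measuring on a sample of the real data.

```python
import re

# Assumption: <token> "<field>" "<field>" ... with no escaped quotes inside fields.
# Only the start of the line is matched; the remaining fields are never scanned.
THIRD_FIELD_RE = re.compile(r'\S+\s+"[^"]*"\s+("[^"]*")')

def third_field(line):
    """Return the second quoted string with its quotes kept, or None if the line does not fit."""
    match = THIRD_FIELD_RE.match(line)
    return match.group(1) if match else None

sample = '90 "m z pd gk y xr vo" "n l v ogtc dj wzb" "d zi pfgyo b tmhek" "df qu venr ls hzw j"'
print(third_field(sample))  # prints "n l v ogtc dj wzb" (including the quotes)
```

Unlike `shlex.split`, this keeps the surrounding quotes on the extracted value. Use `match.group(1)[1:-1]` if the bare string is wanted instead.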
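If you downvote my question, please explain why. Thanks. – KingMak
Could you include about 10 lines of the file you're trying to parse, so I can try writing you a quick Pandas parser? – Matt
Yes, I'll do that now @Matt – KingMak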