Finding non-consecutive duplicates in a large text file

I have several GB of web application logs, and I need to extract customer data from them for a client (who didn't keep proper backups).

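So far I've cleaned the logs up a fair bit and I can see the light at the end of the tunnel. However, I've realized there are a lot of duplicate entries: it seems that every time a customer used the site's application, the same data was stored in the log. Here's a simple example: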

initial_date=Jul-26-2015&report_center=0&last_name=bar&first_name=foo&sex=M&birthday=Sep-26-1985&sin=123456789&drivers_license=&address1=414+stackoverflow+Street&residence_type=1&address2=Apartment+103&datemovein=Feb-02-2013&postal=a1a1a1&city=townsville&prov=ontario&country=Canada&telephone=5555555555&cell_phone=5555556666 

initial_date=Jan-24-2014&report_center=0&last_name=blah&first_name=steve&sex=M&birthday=aug-11-1983&sin=987654321&drivers_license=&address1=12+stackoverflow+Street&residence_type=1&address2=&datemovein=Jun-02-2011&postal=a9a9a9&city=cityville&prov=ontario&country=Canada&telephone=5551111111&cell_phone=5552222222 

initial_date=Jul-26-2015&report_center=0&last_name=bar&first_name=foo&sex=M&birthday=Sep-26-1985&sin=123456789&drivers_license=&address1=414+stackoverflow+Street&residence_type=1&address2=Apartment+103&datemovein=Feb-02-2013&postal=a1a1a1&city=townsville&prov=ontario&country=Canada&telephone=5555555555&cell_phone=5555556666

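I want to match the unique entries and eventually delete the rest. I tried to do this with a positive lookahead, but from the articles I've read, that seems to work only when the duplicates are consecutive. Some of them are, but many are not. Is there a way to accomplish this using regex alone?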

Source

2016-08-13 Phreedom

You can do this with a lookahead, but it will probably be too slow. What tool/language are you using?

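Sorting the lines will lose their positional relationship. If you don't care about that, a simple string comparison will be fastest. But for a one-off, you could use a line-oriented regex: find '(?m)^(.*)\n([\S\s]*?^\1)' and replace with '$2', in a loop such as while (oldlength != newlength) { oldlength = newlength; str = str.replace(regex, "$2"); newlength = str.length; }. It will be slow, but it effectively cuts out a large layer of _slag_. – sln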
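To illustrate the "simple string comparison" route while keeping the original order, here is a minimal Python sketch. It is not from the thread: the function names `unique_lines` and `dedupe_file` are hypothetical, and it assumes one record per line, that trailing whitespace is not significant, and that blank lines are only separators. It streams the file and remembers a 16-byte hash of each line rather than the line itself, so memory grows with the number of distinct records rather than the file size.

```python
import hashlib
from typing import Iterable, Iterator


def unique_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct record once, in order of first appearance."""
    seen = set()  # holds 16-byte digests, not the full lines
    for line in lines:
        record = line.strip()  # assumption: surrounding whitespace is not significant
        if not record:
            continue  # assumption: blank lines are just separators
        digest = hashlib.blake2b(record.encode("utf-8"), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            yield record


def dedupe_file(src_path: str, dst_path: str) -> None:
    """Stream src_path line by line and write the unique records to dst_path."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for record in unique_lines(src):
            dst.write(record + "\n")
```

Something like `dedupe_file("cleaned.log", "unique.log")` would then do the whole job in one pass (both file names are placeholders). A hash collision could in principle drop a distinct line, but with 128-bit digests that is vanishingly unlikely.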

I would put it into a database first. That will make it easier to clean up and extract other data later. – charsi
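One way the database suggestion could look, sketched with Python's built-in sqlite3 module. The table name `entries`, the column names, and the helper functions are all made up for illustration, and the sketch again assumes one record per line. A UNIQUE constraint plus INSERT OR IGNORE lets the database discard duplicates wherever they occur in the file.

```python
import sqlite3
from typing import Iterable, List


def load_unique(conn: sqlite3.Connection, lines: Iterable[str]) -> None:
    """Insert every non-blank line; rows that already exist are silently skipped."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        " id INTEGER PRIMARY KEY,"      # grows in insertion order
        " raw TEXT NOT NULL UNIQUE)"    # UNIQUE is what rejects duplicates
    )
    records = ((line.strip(),) for line in lines if line.strip())
    with conn:  # one transaction, committed on success
        conn.executemany("INSERT OR IGNORE INTO entries (raw) VALUES (?)", records)


def unique_entries(conn: sqlite3.Connection) -> List[str]:
    """Return the stored records in order of first appearance."""
    return [row[0] for row in conn.execute("SELECT raw FROM entries ORDER BY id")]
```

For a real run you would connect to a file, for example `sqlite3.connect("customers.db")` (a placeholder name), and pass the open log file as `lines`. From there the individual fields could be split into their own columns for further cleanup.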

There is no reason to use a regex for this; sort -u will do what you specified with your example.

Source

2016-08-18 07:23:14 Armali

Why do you assume it's Linux/Unix? It could be Windows, or some tool, a text editor... – ClasG

Why do you assume Windows has no 'sort'? ... – Armali

Well, it does, but not with the '-u' option. – ClasG
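Where sort -u is not available, roughly the same result can be had in a few lines of Python. This is only a sketch: `sorted_unique` is a hypothetical name, it holds all distinct lines in memory, and it orders by code point rather than by locale collation, so the ordering may differ from what sort produces.

```python
from typing import Iterable, List


def sorted_unique(lines: Iterable[str]) -> List[str]:
    """Return the distinct lines in sorted order, without line terminators."""
    # The set drops duplicates wherever they occur; sorted() then orders the rest.
    return sorted({line.rstrip("\r\n") for line in lines})
```

As with sort -u, the original order of the entries is lost.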
