2016-10-13 118 views

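Compare two files for differences in Python

I want to compare two files (take a line from the first file and look for it anywhere in the second file) to see the differences between them, and then append the lines of fileA.txt that are missing from fileB.txt to the end of fileB.txt. I am new to Python, so at first I thought of a simple program like this: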

import difflib 

file1 = "fileA.txt" 
file2 = "fileB.txt" 

diff = difflib.ndiff(open(file1).readlines(),open(file2).readlines()) 
print(''.join(diff), end='')

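But the result is a combination of both files, with a tag in front of each line. I know I can look for lines that start with the "-" tag and write them to the end of fileB.txt, but for large files (~100 MB) this method is inefficient. Can somebody help me improve the program?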

The file structure will look like this:

Input:

fileA.txt

Oct 9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2 
Oct 9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2 
Oct 9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user 
Oct 9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root 
Oct 9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root 
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2 
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0) 

fileB.txt

Oct 9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2 
Oct 9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user 
Oct 9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root 
Oct 9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2 
Oct 9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2 
Oct 9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0) 

Output:

fileB_after.txt

Oct 9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2 
Oct 9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user 
Oct 9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root 
Oct 9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2 
Oct 9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2 
Oct 9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user 
Oct 9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root 
Oct 9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root 
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2 
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0) 

So basically you want to merge two text files but not keep duplicates? – MooingRawr

Answers

1

Try this in bash:

cat fileA.txt fileB.txt | sort -M | uniq > new_file.txt 

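sort -M: sorts based on an initial string consisting of any amount of whitespace followed by a month name abbreviation, which is folded to upper case and compared in the order 'JAN' < 'FEB' < ... < 'DEC'. Invalid names compare low to valid names. The 'LC_TIME' locale determines the month spellings.

uniq: filters out adjacent duplicate lines, which is why the input is sorted first.

|: passes the output of one command to another command for further processing.

What this does is take the two files, sort them in the manner described above, keep only the unique lines, and store them in new_file.txt.

Note: this is not a Python solution, but you tagged the question linux, so I thought it might interest you. You can also find more details about the commands used here.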


I'm not a bash expert. I'd like to know how this works. – galaxyan


I mean that the sorted result may not be ordered by timestamp. – galaxyan


@galaxyan, actually sort has a lot of options: http://ss64.com/bash/sort.html – coder
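
To make the ordering concern in these comments concrete, here is a rough Python model of what `sort -M | uniq` does with the log lines. It is only a sketch and not how sort is implemented: the helper names `month_rank` and `sort_m_uniq` are made up for illustration, the sample lines are shortened, and it assumes that lines with the same month fall back to a plain string comparison of the whole line.

```python
# Rough model of `sort -M | uniq`; NOT how GNU sort is implemented.
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def month_rank(line):
    """Rank of the leading month abbreviation; 0 for invalid names (sorts low)."""
    abbr = line.lstrip()[:3].upper()
    return MONTHS.index(abbr) + 1 if abbr in MONTHS else 0


def sort_m_uniq(lines):
    """Sort by month, break ties by string comparison, drop adjacent repeats."""
    # Assumption: ties within one month are resolved by comparing whole lines.
    ordered = sorted(lines, key=lambda line: (month_rank(line), line))
    result = []
    for line in ordered:
        if not result or result[-1] != line:  # uniq only drops adjacent repeats
            result.append(line)
    return result


# Shortened sample lines in the same shape as the logs in the question.
sample = [
    "Oct 9 13:25:31 user sshd[12844]: session opened",
    "Oct 11 22:17:31 user sshd[2655]: session opened",
    "Oct 9 13:25:31 user sshd[12844]: session opened",
]
for line in sort_m_uniq(sample):
    print(line)
```

Under that assumption the Oct 11 line comes out ahead of the Oct 9 line, because "1" sorts before "9" as text. The month is the only part of the timestamp that is compared as a date.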

1

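Read both files and convert each one into a set of lines
Take the union of the two sets
Sort the union by timestamp

Join the sorted lines into a single string, separated by newlines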

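import datetime

file1 = "fileA.txt"
file2 = "fileB.txt"

# Read each file into a set of lines, dropping the trailing newline
with open(file1) as f:
    sa = set(line.rstrip('\n') for line in f)
with open(file2) as f:
    sb = set(line.rstrip('\n') for line in f)

# Sort key: the leading "Mon D HH:MM:SS" timestamp (the logs carry no year)
def timestamp(line):
    return datetime.datetime.strptime(' '.join(line.split()[:3]), '%b %d %H:%M:%S')

print('\n'.join(sorted(sa.union(sb), key=timestamp)))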



Oct 9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2 
Oct 9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root 
Oct 9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user 
Oct 9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2 
Oct 9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2 
Oct 9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root 
Oct 9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user 
Oct 9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root 
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2
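
Note that the sorted union reorders lines that share a timestamp. If the goal is exactly what the question asks for (leave fileB.txt as it is and append only the lines from fileA.txt that it lacks, in their original order), a minimal sketch could look like the following. It targets Python 3 and assumes the lines of fileB.txt fit in memory as a set; the function name `append_missing` and the short demo lines are made up for illustration.

```python
import os
import tempfile


def append_missing(src_path, dst_path):
    """Append lines of src_path that are absent from dst_path; return the count."""
    seen = set()
    needs_newline = False
    with open(dst_path) as dst:
        for line in dst:
            seen.add(line.rstrip("\n"))
            # True only if the last line of dst_path lacks a newline
            needs_newline = not line.endswith("\n")

    added = 0
    with open(src_path) as src, open(dst_path, "a") as dst:
        for line in src:  # stream the source rather than loading it whole
            text = line.rstrip("\n")
            if text and text not in seen:  # set lookup instead of rescanning
                if needs_newline:
                    dst.write("\n")
                    needs_newline = False
                dst.write(text + "\n")
                seen.add(text)  # also skips repeats inside src_path
                added += 1
    return added


# Small demonstration on temporary files with shortened log lines.
with tempfile.TemporaryDirectory() as tmp:
    file_a = os.path.join(tmp, "fileA.txt")
    file_b = os.path.join(tmp, "fileB.txt")
    with open(file_a, "w") as f:
        f.write("Oct 9 13:25:31 opened\nOct 9 13:46:58 closed\n")
    with open(file_b, "w") as f:
        f.write("Oct 9 12:19:16 opened\nOct 9 13:25:31 opened\n")
    print(append_missing(file_a, file_b), "line(s) appended")
    with open(file_b) as f:
        print(f.read(), end="")
```

Each file is read once and membership is checked against a set, so the work grows roughly linearly with file size instead of comparing every line of one file against every line of the other.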