2015-05-11 32 views
0

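Computing set A - set B in Python

I have 2 files: file A contains 1174674 tweets and file B contains 704060 tweets. I want to compute the tweets in file A that do not exist in file B, i.e. 1174674 - 704060 = 470614. Please find the program below. MatchWise-Tweets.zip contains a list of 49 files; the tweets are stored in these 49 separate files. The intent is to get the file names and pass each one in turn to obtain the list of tweets present in each of the 49 files.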

import csv 
import zipfile 

totTweets = 1174674 
matchTweets = 704060 
remaining = totTweets - matchTweets  
lst = [] 
store = [] 
total = 0  
#opFile = csv.writer(open('leftover.csv', "wb")) 
mainFile = csv.reader(open('final_tweet_set.csv', 'rb'), delimiter=',', quotechar='|') 
with zipfile.ZipFile('MatchWise-Tweets.zip', 'r') as zfile:
    for name in zfile.namelist():
        lst.append(name)

for getName in lst:
    inFile = csv.reader(open(getName, 'rb'), delimiter=',', quotechar='|')
    for row in inFile:
        store.append(row)

length = len(store)
print length

count=0
for main_row in mainFile:
    flag=0
    main_tweetID = main_row[0]
    for getTweet in store:
        get_tweetID = getTweet[0]
        if main_tweetID == get_tweetID:
            flag = 1
            #print "Flag == 1 condition--",flag
            break
    if flag ==1:
        continue
    elif flag == 0:
        count+=1
        remaining-=1
        #print "Flag == 0 condition--"
        #print flag
        opFile.writerow(main_row)
        print remaining

Actual result - 573655

Expected result - 470614

File structure -

566813957629808000,saddest thing about this world cup is that we won't see mo irfan bowling at the waca in perth :(#pakvind #indvspak #cwc15 @waca_cricket,15/02/2015 15:19 
566813959076855000,"#pakvsind 50,000 tickets for the game were sold out in 20 minutes #cwc15 #phordey #indvspak",15/02/2015 15:19 
566813961505366000,think india will give sohail his first 5 for.. smh.. #indvspak #cwc15,15/02/2015 15:19 

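The first column is the tweet ID, the second column is the tweet text, and the third column is the time the tweet was posted. I just want to know whether there is anything wrong with this program, because I am not getting the desired result.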

+1

You can create sets from the tweet IDs and then use Python's built-in set difference operation. See https://docs.python.org/2/library/stdtypes.html#set – Dinesh

+0

@Dinesh - Thanks for the suggestion. But before I try something entirely new, what is wrong with my code? – coder05

+0

Is it possible that some tweets are present in both sources? – Dinesh
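The set-based suggestion in the comments above can be sketched roughly as follows. This is a minimal illustration in Python 3 syntax, not the asker's program: the helper `read_ids` and the three-row in-memory "files" are hypothetical stand-ins for file A and file B, and it assumes the tweet ID is always the first column.

```python
import csv
import io


def read_ids(lines):
    """Return the set of first-column values (tweet IDs) from CSV lines."""
    return {row[0] for row in csv.reader(lines) if row}


# Tiny in-memory stand-ins for file A and file B (hypothetical data).
file_a = io.StringIO(
    "1,first tweet,15/02/2015 15:19\n"
    "2,second tweet,15/02/2015 15:19\n"
    "3,third tweet,15/02/2015 15:20\n"
)
file_b = io.StringIO("2,second tweet,15/02/2015 15:19\n")

ids_a = read_ids(file_a)
ids_b = read_ids(file_b)

# IDs present in A but not in B; equivalent to ids_a.difference(ids_b).
leftover = ids_a - ids_b
print(sorted(leftover))
```

With real data the same helper could be given open file objects instead of `io.StringIO`, since `csv.reader` accepts any iterable of lines.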

Answers

0

import difflib

path1 = "PATH OF FILE 1"
path2 = "PATH OF FILE 2"

# Read both files completely; ndiff compares them line by line.
with open(path1, "r") as file1, open(path2, "r") as file2:
    diff = difflib.ndiff(file1.readlines(), file2.readlines())

# Lines prefixed with '- ' occur only in the first file.
delta = ''.join(x[2:] for x in diff if x.startswith('- '))

print delta

+0

I have posted what I think you meant as an answer. Please let me know whether it is okay. – coder05

+0

@coder05 Sorry for the late reply. Did you get your work done? –

+0

Yes. Thanks for the help :) – coder05

0
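import difflib
import csv

# Read both CSV files as raw lines (Python 2: binary mode for the csv module).
with open('final_tweet_set.csv', 'rb') as file1, open("matchTweets_combined.csv", "rb") as file2:
    diff = difflib.ndiff(file1.readlines(), file2.readlines())

# Keep the lines that occur only in the first file, as a list of lines.
# Joining them into one string would make the loop below iterate over characters.
delta = [x[2:] for x in diff if x.startswith('- ')]

#print delta
# Parse the leftover lines as CSV rows and write them out.
with open("leftover_new.csv", "wb") as out:
    fout = csv.writer(out)
    for eachrow in csv.reader(delta):
        fout.writerow(eachrow)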
+0

Sorry for the late reply. Did it do the job for you? –

0

Your code assumes that your files contain no duplicates. That may not be the case, which would explain why your result is off.
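To illustrate why the subtraction in the question only works under that assumption, here is a small sketch in Python 3 syntax with made-up IDs (none of them come from the asker's data). It assumes the match files may contain a duplicate ID and an ID that is missing from the main file; in either case `len(A) - len(B)` undercounts the rows of A that have no match.

```python
# Hypothetical tweet IDs, for illustration only.
a_ids = ["1", "2", "3", "4", "5"]  # main file: 5 rows
b_ids = ["2", "3", "3", "9"]       # match files: 4 rows, one duplicate, one ID not in A

# What "totTweets - matchTweets" computes.
naive = len(a_ids) - len(b_ids)

# Rows of A that really have no match in B.
actual = len(set(a_ids) - set(b_ids))

print(naive, actual)
```

The two numbers agree only when every ID in B is unique and also occurs in A, which is the same direction of mismatch as in the question (the actual count is larger than the expected one).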

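Using sets instead of lists makes it easier to get the correct result and is much faster (membership tests on a set are hash lookups rather than a scan of the whole list, and only the tweet IDs are compared rather than whole tweets with their metadata).

The following uses sets and is more compact and readable. It is not complete: you have to add the bits that open the zip file and opFile (and close them).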

tweet_superset = set() # IDs from the 49 match files (your store)
for getName in lst:
    inFile = csv.reader(open(getName, 'rb'), delimiter=',', quotechar='|')
    tweet_superset.update(entry[0] for entry in inFile)
    # using a set means we ignore any duplicate tweets in the 49 source files.

length = len(tweet_superset)
print length

seen_tweets = set()
for entry in mainFile:
    id_ = entry[0]
    # keep only tweets that are absent from the match files (set A - set B)
    if id_ not in tweet_superset:
        if id_ in seen_tweets:
            print "Error, this tweet appears more than once in mainFile:", entry
        else:
            opFile.writerow(entry)
            seen_tweets.add(id_)

count = len(seen_tweets)
print count
+0

There are no duplicates in the files. I collected this data myself and made sure there were no duplicates. – coder05

+0

I am getting this error: tweet_superset.update(id_ for id_, text, datetime in inFile) ValueError: too many values to unpack – coder05

+0

Sorry, I assumed every row consisted of exactly 'id, tweet_text, datetime' with no other fields. I have changed the code to remove the unpacking. – Dunes
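One possible reason some rows have more than three fields, assuming the real files contain rows like the second sample row in the question: the tweet text is wrapped in double quotes and contains a comma, but the readers are created with `quotechar='|'`, so the double quotes are treated as ordinary characters and the text is split at the comma. The sketch below (Python 3 syntax) parses that sample row both ways; the first column, which is all the ID comparison needs, comes out the same in either case.

```python
import csv

# The second sample row from the question, copied verbatim.
line = ('566813959076855000,"#pakvsind 50,000 tickets for the game were sold out '
        'in 20 minutes #cwc15 #phordey #indvspak",15/02/2015 15:19')

# Reader configured as in the question: '|' is the quote character.
pipe_row = next(csv.reader([line], delimiter=',', quotechar='|'))

# Reader with the csv module's default quote character '"'.
default_row = next(csv.reader([line]))

print(len(pipe_row))     # number of fields with quotechar='|'
print(len(default_row))  # number of fields with the default quotechar
print(pipe_row[0] == default_row[0])
```

Indexing `entry[0]`, as the revised answer does, sidesteps the problem because it never depends on the number of fields.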
