Computing set A - set B in Python

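I have 2 files: file A contains 1174674 tweets and file B contains 704060 tweets. I want to compute the tweets in file A that are not present in file B, i.e. 1174674 - 704060 = 470614. The program is below. MatchWise-Tweets.zip contains a list of 49 files; the tweets are stored in those 49 separate files. The intent is to get the file names and pass each one in turn to build the list of tweets present in each of the 49 files.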

import csv
import zipfile

totTweets = 1174674
matchTweets = 704060
remaining = totTweets - matchTweets
lst = []
store = []
total = 0
# output writer; it must be defined because writerow() is called on it below
opFile = csv.writer(open('leftover.csv', "wb"))
mainFile = csv.reader(open('final_tweet_set.csv', 'rb'), delimiter=',', quotechar='|')
with zipfile.ZipFile('MatchWise-Tweets.zip', 'r') as zfile:
    for name in zfile.namelist():
        lst.append(name)

# load every row of the 49 files into one list
for getName in lst:
    inFile = csv.reader(open(getName, 'rb'), delimiter=',', quotechar='|')
    for row in inFile:
        store.append(row)

length = len(store)
print length

count = 0
for main_row in mainFile:
    flag = 0
    main_tweetID = main_row[0]
    # linear scan of store for a matching tweet ID
    for getTweet in store:
        get_tweetID = getTweet[0]
        if main_tweetID == get_tweetID:
            flag = 1
            #print "Flag == 1 condition--",flag
            break
    if flag == 1:
        continue
    elif flag == 0:
        count += 1
        remaining -= 1
        #print "Flag == 0 condition--"
        #print flag
        opFile.writerow(main_row)
        print remaining

Actual result - 573655

Expected result - 470614

File structure -

566813957629808000,saddest thing about this world cup is that we won't see mo irfan bowling at the waca in perth :(#pakvind #indvspak #cwc15 @waca_cricket,15/02/2015 15:19 
566813959076855000,"#pakvsind 50,000 tickets for the game were sold out in 20 minutes #cwc15 #phordey #indvspak",15/02/2015 15:19 
566813961505366000,think india will give sohail his first 5 for.. smh.. #indvspak #cwc15,15/02/2015 15:19

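The first column is the tweet ID, the second column is the tweet text, and the third column is the time the tweet was posted. I just want to know whether there is something wrong with this program, because I am not getting the desired result.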


2015-05-11 coder05

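You can create sets from the tweet IDs and then use Python's built-in set difference functions. See https://docs.python.org/2/library/stdtypes.html#set – Dinesh

@Dinesh - Thanks for the suggestion. But before I try something entirely new, what is wrong with my code? – coder05

Is it possible that some tweets are present in both sources? – Dinesh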
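As a rough sketch of the set-difference idea suggested in the comments above: read only the first column (the tweet ID) of each file into a set and subtract. This is written for Python 3, unlike the Python 2 code in this thread, and the two tiny CSV strings and the helper name `read_ids` are made up for illustration; they are not the asker's data or code.

```python
import csv
import io

def read_ids(csv_text):
    """Return the set of first-column values (tweet IDs) found in CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0] for row in reader if row}

# Hypothetical miniature stand-ins for file A and file B.
file_a = (
    "1,first tweet,15/02/2015 15:19\n"
    "2,second tweet,15/02/2015 15:19\n"
    "3,third tweet,15/02/2015 15:20\n"
)
file_b = "2,second tweet,15/02/2015 15:19\n"

ids_a = read_ids(file_a)
ids_b = read_ids(file_b)

# Set difference: IDs in A that do not occur in B.
leftover = ids_a - ids_b  # equivalent to ids_a.difference(ids_b)
print(sorted(leftover))   # ['1', '3']
```

Because membership tests on a set take roughly constant time, this avoids rescanning all of file B for every row of file A.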

import difflib

file1 = "PATH OF FILE 1"
file1 = open(file1, "r")
file2 = "PATH OF FILE 2"
file2 = open(file2, "r")

diff = difflib.ndiff(file1.readlines(), file2.readlines())

file1.close()
file2.close()

# keep only the lines that ndiff marks as unique to file 1
delta = ''.join(x[2:] for x in diff if x.startswith('- '))

print delta


2015-05-11 09:31:39

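I have posted what you meant as an answer. Please let me know whether it is okay. – coder05

@coder05 Sorry for the late reply. Did it do the job for you? –

Yes. Thanks for your help :) – coder05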

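import difflib
import csv

file1 = open('final_tweet_set.csv', 'rb')
file2 = open("matchTweets_combined.csv", "rb")
diff = difflib.ndiff(file1.readlines(), file2.readlines())
file1.close()
file2.close()

# keep the lines unique to file 1 as a list of lines; joining them into one
# string would make the loop below iterate over single characters
delta = [x[2:] for x in diff if x.startswith('- ')]

#print delta
fout = csv.writer(open("leftover_new.csv", "wb"))
# parse each leftover line back into fields before writing it out
for eachrow in csv.reader(delta):
    fout.writerow(eachrow)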


2015-05-11 11:45:41 coder05

Sorry for the late reply. Did it do the job for you? –

Your code assumes that your files contain no duplicates. That may not be the case, which would explain why your result is off.
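One way to test that assumption is to count how often each ID occurs. The sketch below is Python 3 and uses invented IDs and a hypothetical helper, `duplicate_ids`; it only illustrates how a repeated ID makes "rows in A minus rows in B" differ from the number of tweets actually left over.

```python
from collections import Counter

def duplicate_ids(ids):
    """Return {id: occurrences} for every ID that appears more than once."""
    counts = Counter(ids)
    return {id_: n for id_, n in counts.items() if n > 1}

# Hypothetical IDs; in practice these would be the first column of each file.
ids_a = ["101", "102", "103", "104"]
ids_b = ["102", "102"]  # the same tweet stored twice

print(duplicate_ids(ids_b))          # {'102': 2}
print(len(ids_a) - len(ids_b))       # 2, the figure expected from row counts
print(len(set(ids_a) - set(ids_b)))  # 3, the IDs really missing from B
```

The row-count subtraction is only reliable when every ID in B is unique and also occurs in A.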

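Using sets instead of lists makes it easier to get correct results and is faster (because only the tweet IDs are compared rather than whole tweets and their metadata).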

The following uses sets, and is more compact and readable. It is incomplete: you must add the bits that open the zip file and opFile (and close them).

tweet_superset = set()  # replaces your store
for getName in lst:
    inFile = csv.reader(open(getName, 'rb'), delimiter=',', quotechar='|')
    tweet_superset.update(entry[0] for entry in inFile)
    # using a set means we ignore any duplicate tweets in the 49 source files.

length = len(tweet_superset)
print length

seen_tweets = set()
for entry in mainFile:
    id_ = entry[0]
    # keep only tweets that are absent from the 49 files (A - B)
    if id_ not in tweet_superset:
        if id_ in seen_tweets:
            print "Error, this tweet appears more than once in mainFile:", entry
        else:
            opFile.writerow(entry)
            seen_tweets.add(id_)

count = len(seen_tweets)
print count
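For the parts left out above (opening the zip file and the output file), here is one possible sketch in Python 3. It rests on several assumptions not stated in the thread: every member of the archive is a UTF-8 CSV with the tweet ID in the first column, the fields use the standard double-quote quoting rather than `quotechar='|'`, and the function names `ids_in_zip` and `write_leftover` are invented for this example.

```python
import csv
import io
import zipfile

def ids_in_zip(zip_path):
    """Collect first-column IDs from every CSV member of a zip archive."""
    ids = set()
    with zipfile.ZipFile(zip_path) as zfile:
        for name in zfile.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            with zfile.open(name) as raw:
                # Members are opened as bytes; wrap them so csv sees text.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                ids.update(row[0] for row in csv.reader(text) if row)
    return ids

def write_leftover(main_path, zip_path, out_path):
    """Write rows of main_path whose ID is absent from the zip; return count."""
    matched = ids_in_zip(zip_path)
    written = 0
    with open(main_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if row and row[0] not in matched:
                writer.writerow(row)
                written += 1
    return written
```

With the file names from the question, a call would look like `write_leftover('final_tweet_set.csv', 'MatchWise-Tweets.zip', 'leftover.csv')`. Reading the members straight from the archive avoids having to extract the 49 files first, and the `with` blocks take care of closing everything.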


2015-05-11 12:35:00 Dunes

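There are no duplicates in the files. I collected this data and made sure there weren't any. – coder05

Getting this error: tweet_superset.update(id_ for id_, text, datetime in inFile) ValueError: too many values to unpack – coder05

Sorry, I assumed that every row was 'id, tweet_text, datetime' with no other fields. I have changed the code to remove the unpacking. – Dunes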
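A plausible cause of that ValueError, offered as a guess rather than something confirmed in the thread: the sample rows wrap tweet text in double quotes, but the reader is created with `quotechar='|'`, so a comma inside a quoted tweet is treated as a field separator. The Python 3 sketch below parses the second sample row from the question both ways.

```python
import csv

# Sample row from the question; its tweet text is wrapped in double quotes.
line = ('566813959076855000,"#pakvsind 50,000 tickets for the game were sold '
        'out in 20 minutes #cwc15 #phordey #indvspak",15/02/2015 15:19')

# Same settings as the code in this thread: '|' is the quote character.
pipe_quoted = next(csv.reader([line], delimiter=',', quotechar='|'))

# Default settings: '"' is the quote character.
default_quoted = next(csv.reader([line]))

print(len(pipe_quoted))     # 4, the comma in "50,000" splits the tweet text
print(len(default_quoted))  # 3
print(pipe_quoted[0] == default_quoted[0])  # True, the ID column is unaffected
```

The first column comes out the same either way, which is why indexing with `entry[0]` works even though three-way unpacking does not.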
