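Calculating set A - set B in Python

I have 2 files: file A contains 1174674 tweets and file B contains 704060 tweets. I want to find the tweets in file A that are not present in file B, i.e. 1174674 - 704060 = 470614 tweets. Please find the program below. MatchWise-Tweets.zip contains a list of 49 files, with the tweets stored in 49 separate files. The intent is to get the file names and pass each one in turn to collect the tweets present in each of the 49 files.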
import csv
import zipfile

totTweets = 1174674
matchTweets = 704060
remaining = totTweets - matchTweets
lst = []
store = []
total = 0

# Output file for the rows of file A that have no match
opFile = csv.writer(open('leftover.csv', "wb"))
mainFile = csv.reader(open('final_tweet_set.csv', 'rb'), delimiter=',', quotechar='|')

# Collect the names of the 49 files listed in the archive
with zipfile.ZipFile('MatchWise-Tweets.zip', 'r') as zfile:
    for name in zfile.namelist():
        lst.append(name)

# Load every row of every match file into memory
# (open() reads from disk, so the files must already be extracted)
for getName in lst:
    inFile = csv.reader(open(getName, 'rb'), delimiter=',', quotechar='|')
    for row in inFile:
        store.append(row)

length = len(store)
print length

count = 0
for main_row in mainFile:
    flag = 0
    main_tweetID = main_row[0]
    # Linear scan over all matched tweets for this ID
    for getTweet in store:
        get_tweetID = getTweet[0]
        if main_tweetID == get_tweetID:
            flag = 1
            #print "Flag == 1 condition--",flag
            break
    if flag == 1:
        continue
    elif flag == 0:
        count += 1
        remaining -= 1
        #print "Flag == 0 condition--"
        #print flag
        opFile.writerow(main_row)

print remaining
Actual result - 573655
Expected result - 470614
File structure -
566813957629808000,saddest thing about this world cup is that we won't see mo irfan bowling at the waca in perth :(#pakvind #indvspak #cwc15 @waca_cricket,15/02/2015 15:19
566813959076855000,"#pakvsind 50,000 tickets for the game were sold out in 20 minutes #cwc15 #phordey #indvspak",15/02/2015 15:19
566813961505366000,think india will give sohail his first 5 for.. smh.. #indvspak #cwc15,15/02/2015 15:19
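The first column is the tweet ID, the second column is the tweet text, and the third column is the time the tweet was posted. I just want to know whether there is something wrong with this program, because I am not getting the desired result.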
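For reference, the second sample row wraps its text in double quotes because the text contains a comma, while the program reads the files with `quotechar='|'`. The following is a minimal sketch in Python 3 (the program above is Python 2), assuming the files use ordinary double-quote quoting. It only shows how the `quotechar` choice changes the parsed columns for that row; the helper name `parse` is made up for illustration.

```python
import csv
import io

# Second sample row from the question, with its double-quoted text field
SAMPLE = (
    '566813959076855000,"#pakvsind 50,000 tickets for the game were sold out '
    'in 20 minutes #cwc15 #phordey #indvspak",15/02/2015 15:19\n'
)

def parse(text, quotechar):
    """Parse CSV text and return the list of rows (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text), delimiter=',', quotechar=quotechar)
    return list(reader)

pipe_row = parse(SAMPLE, '|')[0]    # quotechar used in the question
quote_row = parse(SAMPLE, '"')[0]   # csv module default

print(len(pipe_row), len(quote_row))   # number of columns under each setting
print(pipe_row[0] == quote_row[0])     # is the tweet ID column affected?
```

For this row the tweet ID in column 0 is the same under both settings, so the quoting choice affects the text column rather than the ID comparison.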
You can create sets from the tweet IDs and then use Python's built-in set difference function. See https://docs.python.org/2/library/stdtypes.html#set – Dinesh
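A minimal sketch of that suggestion, written in Python 3 and assuming the tweet ID is always the first column. The helper `read_ids` and the tiny in-memory files with made-up IDs are illustrative only; real code would pass open file objects instead.

```python
import csv
import io

def read_ids(lines):
    """Return the set of tweet IDs (column 0) from an iterable of CSV lines."""
    return {row[0] for row in csv.reader(lines) if row}

# In-memory stand-ins for file A and the combined match files (made-up IDs)
file_a = io.StringIO("1,first tweet,t\n2,second tweet,t\n3,third tweet,t\n")
file_b = io.StringIO("2,second tweet,t\n")

ids_a = read_ids(file_a)
ids_b = read_ids(file_b)

# Built-in set difference: IDs present in A but not in B
leftover = ids_a - ids_b
print(sorted(leftover))
```

Set membership tests take constant time on average, so this avoids rescanning the whole `store` list for every row of file A.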
@Dinesh - Thanks for the suggestion. But before I try something entirely new, what is wrong with my code? – coder05
Is it possible that some tweets appear in both of these sources? – Dinesh
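The subtraction 1174674 - 704060 only equals the number of leftover rows if every ID in file B also occurs in file A and neither file repeats an ID. One way to check those assumptions is sketched below in Python 3; the function name `summarize` and the toy ID lists are hypothetical, and this is a diagnostic aid rather than a confirmed explanation of the 573655 result.

```python
from collections import Counter

def summarize(ids_a, ids_b):
    """Compare two lists of tweet IDs and report counts that affect A - B."""
    count_a, count_b = Counter(ids_a), Counter(ids_b)
    set_a, set_b = set(count_a), set(count_b)
    return {
        "rows_a": len(ids_a),
        "unique_a": len(set_a),
        "rows_b": len(ids_b),
        "unique_b": len(set_b),
        "b_not_in_a": len(set_b - set_a),   # IDs in B that never occur in A
        "a_minus_b": len(set_a - set_b),    # unique IDs only in A
    }

# Toy data: '2' is duplicated in both lists and '5' occurs only in B
report = summarize(['1', '2', '2', '3', '4'], ['2', '2', '5'])
for key, value in report.items():
    print(key, value)
```

If `unique_b` is smaller than `rows_b`, or `b_not_in_a` is greater than zero, the leftover count will be larger than the plain row-count difference.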