2016-10-07 108 views

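Remove duplicate tweets from a CSV file

In the code shown below I fetch tweets from Twitter and initially store them in "backup.txt". I also create a file "tweets3.csv" and save some specific fields of each tweet. But I realized that some tweets have exactly the same text (duplicates). How can I remove these from my CSV file?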

from tweepy import Stream 
from tweepy import OAuthHandler 
from tweepy.streaming import StreamListener 
import time 
import json 
import csv 


ckey = '' 
csecret = '' 
atoken = '' 
asecret = '' 

class listener(StreamListener):
    def on_data(self, data):
        try:
            all_data = json.loads(data)
            with open("backup.txt", 'a') as backup:
                backup.write(str(all_data) + "\n")
                backup.close()

            text = str(all_data["text"]).encode("utf-8")
            id = str(all_data["id"]).encode("utf-8")
            timestamp = str(all_data["timestamp_ms"]).encode("utf-8")
            sn = str(all_data["user"]["screen_name"]).encode("utf-8")
            user_id = str(all_data["user"]["id"]).encode("utf-8")
            create = str(all_data["created_at"]).encode("utf-8")
            follower = str(all_data["user"]["followers_count"]).encode("utf-8")
            following = str(all_data["user"]["following"]).encode("utf-8")
            status = str(all_data["user"]["statuses_count"]).encode("utf-8")

            # text = data.split(',"text":"')[1].split('","source')[0]
            # name = data.split(',"screen_name":"')[1].split('","location')[0]
            contentlist = []
            contentlist.append(text)
            contentlist.append(id)
            contentlist.append(timestamp)
            contentlist.append(sn)
            contentlist.append(user_id)
            contentlist.append(create)
            contentlist.append(follower)
            contentlist.append(following)
            contentlist.append(status)
            print contentlist
            f = open("tweets3.csv", 'ab')
            wrt = csv.writer(f, dialect='excel')
            try:
                wrt.writerow(contentlist)
            except UnicodeEncodeError, UnicodeEncodeError:
                return True
            return True
        except BaseException, e:

            print 'failed on data',type(e),str(e)
            time.sleep(3)

    def on_error(self, status):
        print "Error status:" + str(status)


auth = OAuthHandler(ckey, csecret) 
auth.set_access_token(atoken, asecret) 
twitterStream = Stream(auth, listener()) 
twitterStream.filter(track=["zikavirus"], languages=['en']) 

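I think you could keep a list variable: every time you process a tweet, go through the list and check whether this ID already exists. If it does, do nothing. If not, add the ID to the list. – Fusseldieb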

Answers

1

I wrote this code, which builds a list and checks it every time it goes through a tweet. If the text is not in the list yet, it is added to the list.

# Defines a list - It stores all unique tweets
tweetChecklist = []

# All your tweets. I represent them as a list to test the code
AllTweets = ["Hello", "HelloFoo", "HelloBar", "Hello", "hello", "Bye"]

# Goes over all "tweets"
for current_tweet in AllTweets:
    # If tweet doesn't exist in the list
    if current_tweet not in tweetChecklist:
        tweetChecklist.append(current_tweet)
        # Do what you want with this tweet, it won't appear two times...

# Prints ['Hello', 'HelloFoo', 'HelloBar', 'hello', 'Bye']
# Note that the second Hello doesn't show up - It's what you want
# However, it's case sensitive.
print(tweetChecklist)
# Clear the list
tweetChecklist = []
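
The check above is case sensitive, so "Hello" and "hello" count as different tweets. If that is not what you want, one option is to compare a normalized key instead of the raw text. This is only a minimal sketch, assuming that lowercasing and collapsing whitespace is an acceptable definition of "same text"; `normalize` and `unique_by_text` are hypothetical helper names, not part of the answer above.

```python
def normalize(text):
    """Return a comparison key: lowercased, with runs of whitespace collapsed."""
    return " ".join(text.lower().split())


def unique_by_text(tweets):
    """Keep the first tweet for each normalized text, preserving order."""
    seen_keys = []  # normalized texts already encountered
    unique = []     # original texts that were kept
    for tweet in tweets:
        key = normalize(tweet)
        if key not in seen_keys:
            seen_keys.append(key)
            unique.append(tweet)
    return unique


all_tweets = ["Hello", "HelloFoo", "HelloBar", "Hello", "hello", "Bye"]
print(unique_by_text(all_tweets))
```

With this key, "hello" is treated as a repeat of "Hello" and only the first spelling is kept.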

I think that after implementing my solution, your code should look like this:

from tweepy import Stream 
from tweepy import OAuthHandler 
from tweepy.streaming import StreamListener 
import time 
import json 
import csv 

# Define a list - It stores all unique tweets 
# Clear this list after completion of fetching all tweets 
tweetChecklist = []

ckey = '' 
csecret = '' 
atoken = '' 
asecret = '' 

class listener(StreamListener):
    def on_data(self, data):
        try:
            all_data = json.loads(data)
            # The with block closes the file, no explicit close() needed
            with open("backup.txt", 'a') as backup:
                backup.write(str(all_data) + "\n")

            text = str(all_data["text"]).encode("utf-8")
            id = str(all_data["id"]).encode("utf-8")
            timestamp = str(all_data["timestamp_ms"]).encode("utf-8")
            sn = str(all_data["user"]["screen_name"]).encode("utf-8")
            user_id = str(all_data["user"]["id"]).encode("utf-8")
            create = str(all_data["created_at"]).encode("utf-8")
            follower = str(all_data["user"]["followers_count"]).encode("utf-8")
            following = str(all_data["user"]["following"]).encode("utf-8")
            status = str(all_data["user"]["statuses_count"]).encode("utf-8")

            # If the text does not exist in the list that stores all unique tweets
            if text not in tweetChecklist:
                # Store it, so that later tweets with the same text
                # do not reach this code
                tweetChecklist.append(text)

                # Now, do your unique stuff
                contentlist = []
                contentlist.append(text)
                contentlist.append(id)
                contentlist.append(timestamp)
                contentlist.append(sn)
                contentlist.append(user_id)
                contentlist.append(create)
                contentlist.append(follower)
                contentlist.append(following)
                contentlist.append(status)
                print contentlist
                with open("tweets3.csv", 'ab') as f:
                    wrt = csv.writer(f, dialect='excel')
                    try:
                        wrt.writerow(contentlist)
                    except UnicodeEncodeError:
                        return True
            # Keep the stream running, for duplicates too
            return True
        except BaseException, e:
            print 'failed on data',type(e),str(e)
            time.sleep(3)

    def on_error(self, status):
        print "Error status:" + str(status)


auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track=["zikavirus"], languages=['en'])
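
The listener above only keeps new duplicates from being written. Rows that are already in "tweets3.csv" could be cleaned up with a one-off pass over the file. The following is a sketch under stated assumptions, not a definitive implementation: it is written for Python 3 (the listener itself is Python 2, where the files would be opened in 'rb'/'wb' mode instead), it assumes the tweet text is the first column as in `contentlist`, and `dedupe_csv` and the output name "tweets3_unique.csv" are hypothetical. The demonstration runs on a temporary file with made-up rows.

```python
import csv
import os
import tempfile


def dedupe_csv(src_path, dst_path, text_column=0):
    """Copy src_path to dst_path, keeping the first row for each distinct text.

    text_column is the index of the tweet text; 0 matches the column order
    of contentlist (an assumption about the file layout).
    Returns the number of rows dropped as duplicates.
    """
    seen = set()
    dropped = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, dialect="excel")
        for row in csv.reader(src, dialect="excel"):
            if len(row) <= text_column:
                continue  # skip blank or truncated rows
            text = row[text_column]
            if text in seen:
                dropped += 1
                continue
            seen.add(text)
            writer.writerow(row)
    return dropped


# Demonstration on a temporary file with hypothetical sample rows.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "tweets3.csv")
dst = os.path.join(workdir, "tweets3_unique.csv")
with open(src, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([
        ["zika update", "1", "alice"],
        ["zika update", "2", "bob"],
        ["other news", "3", "carol"],
    ])
print(dedupe_csv(src, dst))  # number of duplicate rows dropped
```

Writing to a second file rather than editing in place leaves the original untouched until you have checked the result.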

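Alternatively, you could use a set of tweets. Since lookups in a set are usually (much) faster than in a list, this could be quite beneficial for large files. –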

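@N.Wouda Have a look: http://stackoverflow.com/questions/2831212/python-sets-vs-lists. [...] (sets) are slower than lists when it comes to iterating over their contents. Remember: he doesn't want to check the tweet ID, but their _text/content_. – Fusseldieb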

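The set could contain just the content; I never mentioned IDs. Iteration is not necessary, because a simple membership check is enough. Quoting your source: "sets are significantly faster when it comes to determining if an object is present in the set". –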
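
To make the set suggestion from these comments concrete, here is a small sketch that stores tweet texts in a set and uses only membership checks, followed by a rough timing comparison against a list. The helper `is_new`, the sample texts, and the sizes are assumptions chosen for illustration; actual timings depend on the machine and the data.

```python
import timeit

seen_texts = set()  # texts already written; a set gives hash-based lookups


def is_new(text):
    """Return True the first time a text is seen, False for repeats."""
    if text in seen_texts:
        return False
    seen_texts.add(text)
    return True


print([t for t in ["Hello", "HelloFoo", "Hello", "Bye"] if is_new(t)])

# Rough timing of a membership test for a text that is NOT stored,
# which is the worst case for a list (every element gets compared).
n = 20000
stored = ["tweet %d" % i for i in range(n)]
as_list = list(stored)
as_set = set(stored)
list_seconds = timeit.timeit(lambda: "missing" in as_list, number=200)
set_seconds = timeit.timeit(lambda: "missing" in as_set, number=200)
print("list: %.5f s  set: %.5f s" % (list_seconds, set_seconds))
```

In the listener, the same idea would mean replacing the list with a set and `append` with `add`; the `not in` check stays the same.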