2015-08-20 102 views

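Remove duplicates in a two-dimensional array in Python

I'm new to Python, so please bear with me. I would like to remove duplicate values in a dictionary, but based only on duplicate text.

So, for example, I want to remove the duplicates from this list of tweets: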

{'text': 'Dear Conservatives: comprehend, if you can RT Iran deal opponents have their "death panels" lie, and it\'s a whopper http://t.co/EcSHCAm9Nn', 'id': 634092907243393024L} 
{'text': 'RT Iran deal opponents now have their "death panels" lie, and it\'s a whopper http://t.co/ntECOXorvK via @voxdotcom #IranDeal', 'id': 634068454207791104L} 
{'text': 'RT : Iran deal quietly picks up some GOP backers via https://t.co/65DRjWT6t8 catoletters: Iran deal quietly picks up some GOP backers \xe2\x80\xa6', 'id': 633631425279991812L} 
{'text': 'RT : Iran deal quietly picks up some GOP backers via https://t.co/QD43vbJft6 catoletters: Iran deal quietly picks up some GOP backers \xe2\x80\xa6', 'id': 633495091584323584L} 
{'text': "RT : Iran Deal's Surprising Supporters: https://t.co/pUG7vht0fE catoletters: Iran Deal's Surprising Supporters: http://t.co/dhdylTNgoG", 'id': 633083989180448768L} 
{'text': "RT : Iran Deal's Surprising Supporters - Today on the Liberty Report: https://t.co/PVHuVTyuAG RonPaul: Iran Deal'\xe2\x80\xa6 https://t.co/sTBhL12llF", 'id': 632525323733729280L} 
{'text': "RT : Iran Deal's Surprising Supporters - Today on the Liberty Report: https://t.co/PVHuVTyuAG RonPaul: Iran Deal'\xe2\x80\xa6 https://t.co/sTBhL12llF", 'id': 632385798277595137L} 
{'text': "RT : Iran Deal's Surprising Supporters: https://t.co/hOUCmreHKA catoletters: Iran Deal's Surprising Supporters: http://t.co/bJSLhd9dqA", 'id': 632370745088323584L} 
{'text': '#News #RT Iran deal debate devolves into clash over Jewish stereotypes and survival - W... http://t.co/foU0Sz6Jej http://t.co/WvcaNkMcu3', 'id': 631952088981868544L} 
{'text': '"@JeffersonObama: RT Iran deal support from Democratic senators is 19-1 so far....but...but Schumer...."', 'id': 631951056189149184L}

to get this:

{'text': 'Dear Conservatives: comprehend, if you can RT Iran deal opponents have their "death panels" lie, and it\'s a whopper http://t.co/EcSHCAm9Nn', 'id': 634092907243393024L} 
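{'text': '"@JeffersonObama: RT Iran deal support from Democratic senators is 19-1 so far....but...but Schumer...."', 'id': 631951056189149184L}

So far I have mostly found answers based on 'normal' dictionaries, where the duplicate key/value pairs are identical. In my case it is a merged dictionary. Because of retweets, the text key is the same, but the corresponding tweet ID is different.

Here is the whole code. Any tips for writing the tweets to the CSV file in a more efficient way (one that makes removing duplicates easier) are more than welcome.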

import csv

from TwitterSearch import TwitterSearchOrder, TwitterSearchException, TwitterSearch

tweet_text_id = []

try:
    tso = TwitterSearchOrder()
    tso.set_keywords(["Iran Deal"])
    tso.set_language('en')
    tso.set_include_entities(False)

    ts = TwitterSearch(
        consumer_key = "aaaaa",
        consumer_secret = "bbbbb",
        access_token = "cccc",
        access_token_secret = "dddd"
    )

    # Collect the id and the UTF-8 encoded text of every matching tweet
    for tweet in ts.search_tweets_iterable(tso):
        tweet_text_id.append({'id': tweet['id'], 'text': tweet['text'].encode('utf8')})

    # Write a header row, then one CSV row per tweet
    # (Python 2: the csv module expects the file in binary mode)
    fieldnames = ['id', 'text']
    tweet_file = open('tweets.csv', 'wb')
    csvwriter = csv.DictWriter(tweet_file, delimiter=',', fieldnames=fieldnames)
    csvwriter.writerow(dict((fn, fn) for fn in fieldnames))
    for row in tweet_text_id:
        csvwriter.writerow(row)
    tweet_file.close()

except TwitterSearchException as e:
    print(e)

Thanks for your help!


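What you have is a list of dictionaries, not a "merged dictionary". In any case, your example is not clear. You want to keep only the first and last entries, but the other entries are not all exact duplicates.

Answers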
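
As a minimal sketch of the exact-match case on a list of dictionaries: keep the first tweet seen for each distinct 'text' value and skip later ones, whatever their 'id'. The function name dedupe_by_text and the sample data are made up for illustration, and this only catches texts that are identical character for character.

```python
def dedupe_by_text(tweets):
    """Keep the first tweet for each distinct 'text' value."""
    seen = set()      # texts that have already been kept
    unique = []
    for tweet in tweets:
        if tweet['text'] not in seen:
            seen.add(tweet['text'])
            unique.append(tweet)
    return unique


# Hypothetical sample: ids 1 and 2 share the same text (a retweet)
sample = [
    {'text': 'same text', 'id': 1},
    {'text': 'same text', 'id': 2},
    {'text': 'other text', 'id': 3},
]
print(dedupe_by_text(sample))
```

Using a set for the seen texts keeps each membership check cheap even for long tweet lists.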

0

I made a module that filters out the duplicates and removes the hashtags along the way:

__all__ = ['filterDuplicates']
import re

hashRegex = re.compile(r'#[a-z0-9]+', re.IGNORECASE)


def filterDuplicates(tweets):
    """Strip hashtags and keep only the first tweet for each resulting text."""
    seen = set()    # cleaned texts that have already been kept
    unique = []

    for dic in tweets:
        # Remove hashtags, then trim leading and trailing whitespace
        new_txt = hashRegex.sub('', dic['text']).strip()

        if new_txt in seen:
            continue

        seen.add(new_txt)
        # Store a copy so the caller's dictionaries are left untouched
        unique.append(dict(dic, text=new_txt))

    return unique


if __name__ == '__main__':

    the_tweets = [
        {'text': '#yolo #swag something really annoying', 'id': 1},
        {'text': 'something really annoying', 'id': 2},
        {'text': 'thing thing thing haha', 'id': 3},
        {'text': '#RF thing thing thing haha', 'id': 4},
        {'text': 'thing thing thing haha', 'id': 5}
    ]

    # Tweets pre-filter
    for dic in the_tweets:
        print(dic)

    # Tweets post-filter
    for dic in filterDuplicates(the_tweets):
        print(dic)

Just import this module in your script and run it to filter the tweets; it also removes the hashtags.
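
On the CSV side of the question, here is one possible tidy-up once the list has been filtered. It is only a sketch and assumes Python 3, whereas the question's code targets Python 2; the function name write_tweets_csv and the sample row are hypothetical. It relies on DictWriter.writeheader() and writerows() instead of building the header row by hand.

```python
import csv
import io


def write_tweets_csv(tweets, fileobj, fieldnames=('id', 'text')):
    """Write a list of tweet dicts to an already-open text-mode file object."""
    writer = csv.DictWriter(fileobj, fieldnames=list(fieldnames))
    writer.writeheader()      # header row built from fieldnames
    writer.writerows(tweets)  # one row per tweet


# Demo against an in-memory buffer. For a real file under Python 3, open it
# with open('tweets.csv', 'w', newline='', encoding='utf-8') instead.
buffer = io.StringIO()
write_tweets_csv([{'id': 1, 'text': 'something really annoying'}], buffer)
print(buffer.getvalue())
```

Passing the file object in, rather than a filename, keeps the function easy to try out without touching the disk.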

0

You could try comparing the tweets based on the "edit distance" between them. Here is my crack at it, using fuzzywuzzy [1] to compare tweets:

from fuzzywuzzy import fuzz 


def clean_tweet(tweet):
    """very crude. You can improve on this!"""
    # Return a cleaned copy instead of modifying the caller's tweet
    return dict(tweet, text=tweet['text'].replace("RT :", ""))


def is_unique(tweet, seen_tweets, threshold):
    """Return True if no tweet kept so far scores above threshold."""
    for seen_tweet in seen_tweets:
        ratio = fuzz.ratio(tweet['text'], seen_tweet['text'])
        if ratio > threshold:
            return False
    return True


def dedup(tweets, threshold=50):
    deduped = []
    for tweet in tweets:
        cleaned = clean_tweet(tweet)
        # Pass the threshold along rather than relying on a global
        if is_unique(cleaned, deduped, threshold):
            deduped.append(cleaned)

    return deduped


if __name__ == "__main__": 
    DUP_THRESHOLD = 30 

    tweets = [ 
     {'text': 'Dear Conservatives: comprehend, if you can RT Iran deal opponents have their "death panels" lie, and it\'s a whopper http://t.co/EcSHCAm9Nn', 'id': 634092907243393024}, 
     {'text': 'RT Iran deal opponents now have their "death panels" lie, and it\'s a whopper http://t.co/ntECOXorvK via @voxdotcom #IranDeal', 'id': 634068454207791104}, 
     {'text': 'RT : Iran deal quietly picks up some GOP backers via https://t.co/65DRjWT6t8 catoletters: Iran deal quietly picks up some GOP backers \xe2\x80\xa6', 'id': 633631425279991812}, 
     {'text': 'RT : Iran deal quietly picks up some GOP backers via https://t.co/QD43vbJft6 catoletters: Iran deal quietly picks up some GOP backers \xe2\x80\xa6', 'id': 633495091584323584}, 
     {'text': "RT : Iran Deal's Surprising Supporters: https://t.co/pUG7vht0fE catoletters: Iran Deal's Surprising Supporters: http://t.co/dhdylTNgoG", 'id': 633083989180448768}, 
     {'text': "RT : Iran Deal's Surprising Supporters - Today on the Liberty Report: https://t.co/PVHuVTyuAG RonPaul: Iran Deal'\xe2\x80\xa6 https://t.co/sTBhL12llF", 'id': 632525323733729280}, 
     {'text': "RT : Iran Deal's Surprising Supporters - Today on the Liberty Report: https://t.co/PVHuVTyuAG RonPaul: Iran Deal'\xe2\x80\xa6 https://t.co/sTBhL12llF", 'id': 632385798277595137}, 
     {'text': "RT : Iran Deal's Surprising Supporters: https://t.co/hOUCmreHKA catoletters: Iran Deal's Surprising Supporters: http://t.co/bJSLhd9dqA", 'id': 632370745088323584}, 
     {'text': '#News #RT Iran deal debate devolves into clash over Jewish stereotypes and survival - W... http://t.co/foU0Sz6Jej http://t.co/WvcaNkMcu3', 'id': 631952088981868544}, 
     {'text': '"@JeffersonObama: RT Iran deal support from Democratic senators is 19-1 so far....but...but Schumer...."', 'id': 631951056189149184}, 
    ] 

    deduped = dedup(tweets, threshold=DUP_THRESHOLD) 
    print(deduped)

This gives the output:

[ 
    {'text': 'Dear Conservatives: comprehend, if you can RT Iran deal opponents have their "death panels" lie, and it\'s a whopper http://t.co/EcSHCAm9Nn', 'id': 634092907243393024L}, 
    {'text': ' Iran deal quietly picks up some GOP backers via https://t.co/65DRjWT6t8 catoletters: Iran deal quietly picks up some GOP backers \xe2\x80\xa6', 'id': 633631425279991812L} 
] 
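
If installing fuzzywuzzy is not an option, a similar score can be sketched with the standard library's difflib. This is a swapped-in technique, not fuzzywuzzy itself, so the scores and a suitable threshold may differ. The names similarity and dedup_similar and the sample tweets are invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score between two strings."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def dedup_similar(tweets, threshold=50):
    """Keep a tweet only if it scores <= threshold against every kept tweet."""
    kept = []
    for tweet in tweets:
        if all(similarity(tweet['text'], k['text']) <= threshold for k in kept):
            kept.append(tweet)
    return kept


# Hypothetical sample: id 2 is id 1 with a retweet prefix, id 3 is unrelated
sample = [
    {'text': 'Iran deal quietly picks up some GOP backers', 'id': 1},
    {'text': 'RT : Iran deal quietly picks up some GOP backers', 'id': 2},
    {'text': 'completely unrelated sentence about cats', 'id': 3},
]
print([t['id'] for t in dedup_similar(sample, threshold=80)])
```

Every tweet is compared against every tweet kept so far, so the cost grows roughly quadratically with the number of distinct tweets; that is fine for a few hundred tweets but worth keeping in mind for larger collections.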

[1] https://github.com/seatgeek/fuzzywuzzy