How do I print organized messages from my email?

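There are two things I need to do at this point, and I need your help with both:

What is the best practice for cleaning the data: programmatically removing the extra labels and the ">>>>>>>" markers, plus the other non-meaningful communication flotsam and jetsam?
Once it is cleaned up, how do I package it so that it runs well in Django and SQLite?
- Could I turn it into CSV by date, person, subject and words, and then load the rows into data classes in my database?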

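Well, before I get into the database, I would like to be able to sort and display the data cleanly. I have very little experience putting things into a database; the closest I have come is working with XML, CSV and JSON.

I need to get n-grams by rank, for example how many times a given word shows up from one person across a series of emails. I am trying to get a closer look at how people talk to me about various subjects. It is a very basic version of Jon Kleinberg's work analyzing his own emails.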
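
A minimal sketch of the ranking described above, assuming plain whitespace tokenization and lowercasing. The name `ranked_ngrams` and its parameters are hypothetical and not part of the code in this post. It builds n-grams as tuples and lets `collections.Counter` do the counting and ordering.

```python
from collections import Counter


def ranked_ngrams(text, n=1, top=10):
    """Return the `top` most frequent n-grams in `text` as (tuple, count) pairs."""
    # Assumption: lowercase and split on any whitespace (spaces, \r, \n, tabs).
    tokens = text.lower().split()
    # Zip n staggered copies of the token list to form consecutive n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(top)


sample = "see you at lunch see you at noon see you soon"
for gram, count in ranked_ngrams(sample, n=2, top=3):
    print(" ".join(gram), count)
```

Running it once per sender, on the text of that sender's messages joined together, would give a per-person ranking.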

Be gentle or be rough, but please help :)!

> My output currently looks like this: : 1, 'each': 1, 'I': 1, 'IN\r\n\r\n2012/1/31!': 1, 'calculator.\r\n>>>>>>\r\n>>>>>>': 1, 'people': 1, '=97MB\r\n>\r\n>': 1, 'we': 2, 'wrote:\r\n>>>>>>\r\n>>>>>>': 1, '=\r\nwrote:\r\n>>>>>\r\n>>>>>>': 1, '2012/1/31': 2, 'are': 1, '31,': 5, '=97MB\r\n>>>>\r\n>>>>': 1, '1:45': 1, 'is\r\n>>>>>': 1, 'sent':

import email
import imaplib


# NGramCounter builds a dictionary relating each token (a 1-gram) to the
# number of times that token occurs in a text (as integers)
class NGramCounter(object):

    # parameter text is the string to analyze
    def __init__(self, text):
        self.text = text
        self.ngrams = dict()

    # tokenize breaks the stored string up into units on single spaces
    def tokenize(self):
        return self.text.split(" ")

    # parse tokenizes the text and visits every token in turn,
    # adding it to self.ngrams or incrementing its count there
    def parse(self):
        tokens = self.tokenize()
        # Moves through every individual word in the text, increments the
        # counter if already found, else sets the count to 1
        for word in tokens:
            if word in self.ngrams:
                self.ngrams[word] += 1
            else:
                self.ngrams[word] = 1

    def get_ngrams(self):
        return self.ngrams


# loading profile for login
M = imaplib.IMAP4_SSL('imap.gmail.com')
M.login("EMAIL", "PASS")
M.select()
new = open('liamartinez.txt', 'w')
typ, data = M.search(None, 'FROM', 'SEARCHGOES_HERE')  # gets the messages from the given sender


def get_first_text_part(msg):  # where should this be nested?
    maintype = msg.get_content_maintype()
    if maintype == 'multipart':
        for part in msg.get_payload():
            if part.get_content_maintype() == 'text':
                return part.get_payload()
    elif maintype == 'text':
        return msg.get_payload()


for num in data[0].split():  # loops through all matching messages
    typ, msg_data = M.fetch(num, '(RFC822)')  # pulls message
    # the raw message is the second item of the first response tuple
    msg = email.message_from_bytes(msg_data[0][1])  # puts message into easy to use python objects
    _from = msg['from']  # pull from
    _to = msg['to']  # pull to
    _subject = msg['subject']  # pull subject
    _body = get_first_text_part(msg)  # pull body
    if _body:
        ngrams = NGramCounter(_body)
        ngrams.parse()
        _feed = ngrams.get_ngrams()
        print(_feed)
    # print('Content-Type:', msg.get_content_type())
    # print(_from)
    # print(_to)
    # print(_subject)
    # print(_body)

    if _from:  # the header may be missing
        new.write(_from + '\n')

    print('---------------------------------')

new.close()
M.close()
M.logout()

Source

2012-04-12 Will J

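No, I'm not, thanks for asking. Oh, wait... – 2012-04-12 07:10:15

What Ignacio means is that your title should describe your actual problem (instead of burying it so deep in the post), rather than asking us whether we are trying to write a program. – agf 2012-04-12 07:11:43

Thanks! Edited to be clearer. Any suggestions? – 2012-04-12 19:49:54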

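There is nothing wrong with your main loop, although the process is somewhat slow because you have to retrieve every email from a remote server. I suggest downloading all the messages to the client once, saving them to a database (SQLite, ZODB, MongoDB... whichever you prefer), and then running all the analysis you want against the database objects. The two processes (downloading and analysis) are better kept separate from each other; otherwise tuning either one gets complicated and the code complexity grows.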
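
The split suggested above could be sketched with the standard-library `sqlite3` module. This is only an illustration under assumptions: the `messages` table, its columns, and the helper names `open_store`, `save_message` and `messages_per_sender` are hypothetical, and the download step is stood in for by hand-written sample rows.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the message store. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY,
               sent TEXT, sender TEXT, recipient TEXT,
               subject TEXT, body TEXT)"""
    )
    return conn


def save_message(conn, sent, sender, recipient, subject, body):
    """Download step: store one message. Parameters are bound, not formatted."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO messages (sent, sender, recipient, subject, body) "
            "VALUES (?, ?, ?, ?, ?)",
            (sent, sender, recipient, subject, body),
        )


def messages_per_sender(conn):
    """Analysis step: runs against the local store only, never the mail server."""
    cur = conn.execute(
        "SELECT sender, COUNT(*) FROM messages "
        "GROUP BY sender ORDER BY COUNT(*) DESC, sender"
    )
    return cur.fetchall()


conn = open_store()
save_message(conn, "2012-01-31T13:45:00", "liam@example.com",
             "me@example.com", "Lunch", "See you at noon")
save_message(conn, "2012-02-01T09:00:00", "liam@example.com",
             "me@example.com", "Re: Lunch", "Running late")
print(messages_per_sender(conn))
```

Passing a file path instead of `":memory:"` keeps the messages between runs, so the slow IMAP download only has to happen once.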

Source

2012-04-12 07:20:48 luke14free

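Thanks Luke, I just set up a polls app by following one of the Django tutorials (I think it is their first one). I guess my next step is to actually define a database from there. As I said in my revised post, the important things to organize by are probably time, from, to, subject and body (as n-grams). – 2012-04-12 19:15:46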
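
One way the fields listed in this comment might be pulled out of a message and written as CSV is sketched below. It assumes a simple, non-multipart message; the `RAW` sample and the names `message_to_row` and `rows_to_csv` are made up for illustration.

```python
import csv
import email
import io
from email.utils import parsedate_to_datetime

# Hypothetical sample message standing in for one fetched over IMAP.
RAW = (
    "From: Liam <liam@example.com>\r\n"
    "To: me@example.com\r\n"
    "Subject: Lunch\r\n"
    "Date: Tue, 31 Jan 2012 13:45:00 +0000\r\n"
    "\r\n"
    "See you at noon.\r\n"
)


def message_to_row(raw):
    """Return [sent, from, to, subject, body] for a non-multipart message."""
    msg = email.message_from_string(raw)
    # Convert the Date header to ISO 8601 so that text order is time order.
    sent = parsedate_to_datetime(msg["date"]).isoformat()
    return [sent, msg["from"], msg["to"], msg["subject"],
            msg.get_payload().strip()]


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["sent", "from", "to", "subject", "body"])
    writer.writerows(rows)
    return buf.getvalue()


print(rows_to_csv([message_to_row(RAW)]))
```

The same five values map directly onto the columns of a database table, so the CSV step is optional if the rows go straight into SQLite.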

Replace

if _body:
    ngrams = NGramCounter(_body)
    ngrams.parse()
    _feed = ngrams.get_ngrams()
    print(_feed)

with

if _body:
    # strip ">" from both ends, then collapse every run of whitespace
    # (including \r\n) into a single space before counting
    ngrams = NGramCounter(" ".join(_body.strip(">").split()))
    ngrams.parse()
    _feed = ngrams.get_ngrams()
    print(_feed)
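
Note that `str.strip(">")` only removes `>` characters from the two ends of the whole string, so quote markers at the start of inner lines survive as separate tokens. A slightly more thorough cleanup might look like the sketch below. The two regular expressions and the name `clean_body` are assumptions for illustration, and the attribution pattern only catches single-line "... wrote:" headers.

```python
import re

# Runs of ">" (optionally separated by spaces) at the start of any line.
QUOTE_PREFIX = re.compile(r"^[ \t]*(?:>[ \t]*)+", re.MULTILINE)
# Assumption: reply attributions are single lines ending in "wrote:".
ATTRIBUTION = re.compile(r"^.*\bwrote:[ \t]*$", re.MULTILINE)


def clean_body(body):
    """Drop quote markers and attribution lines, then normalize whitespace."""
    text = body.replace("\r\n", "\n")
    text = QUOTE_PREFIX.sub("", text)
    text = ATTRIBUTION.sub("", text)
    return " ".join(text.split())


body = (
    "Sounds good.\r\n\r\n"
    "On 2012/1/31 Liam wrote:\r\n"
    ">>>>>> See you at noon\r\n"
    ">>>>>>\r\n"
    "> > ok\r\n"
)
print(clean_body(body))
```

The cleaned string can then be passed to `NGramCounter` in place of the raw `_body`.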

Source

2012-04-14 18:35:02 Duke

Thanks for helping me clean this up! – 2012-04-15 15:07:18
