2016-07-14 25 views

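Grouping similar URLs / finding common URL patterns (Python)

I have about 100,000 URLs, each labelled as positive or negative. I want to know which kinds of URL correspond to the positive label (and likewise for the negative label).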

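I started by grouping the URLs by subdomain and identified the subdomains that are most often positive and most often negative.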
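The subdomain grouping described above could look roughly like the following. This is only a sketch: the function name `count_labels_by_host`, the `(url, label)` tuple format, the labels attached to the sample URLs, and the `example.org` entry are all assumptions made for illustration, not part of my actual data.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def count_labels_by_host(labelled_urls):
    """Return {hostname: Counter of labels} for (url, label) pairs."""
    counts = defaultdict(Counter)
    for url, label in labelled_urls:
        # urlsplit().hostname is already lower-cased; it is None for
        # strings without a network location, hence the fallback.
        host = urlsplit(url).hostname or ""
        counts[host][label] += 1
    return counts


# Hypothetical sample; the labels are made up for the example.
sample = [
    ("http://www.clarin.com/politica/", "positive"),
    ("http://www.clarin.com/buscador?q=protesta", "negative"),
    ("http://example.org/news/1", "negative"),
]
by_host = count_labels_by_host(sample)
for host, c in sorted(by_host.items()):
    print(host, c["positive"], c["negative"])
```

Hosts whose positive and negative counts come out roughly equal are the ones that need the finer-grained pattern search.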

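Now, for subdomains where positives and negatives are roughly balanced, I want to dig deeper and look for patterns inside the URLs. Example patterns: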

http://www.clarin.com/politica/ (pattern: domain/section) 
http://www.clarin.com/tema/manifestaciones.html (pattern: domain/tag/tag_name) 
http://www.clarin.com/buscador?q=protesta (pattern: domain/search?=search_term) 

The links are not limited to clarin.com.

Any suggestions on how to discover such patterns?

Answer


Solved this problem: I ended up taking hints from the "finding largest common substring" problem.

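The solution builds a prefix tree from the characters of each URL. Each node in the tree stores the number of positive URLs, the number of negative URLs, and the total number of URLs that pass through it. Finally, the tree is pruned so that only the most common patterns are returned.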
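To make the tree idea concrete, here is a minimal sketch with explicit nodes (my code below uses a flat dictionary keyed by prefix instead, which amounts to the same counts). The names `Node`, `build_tree` and `lookup` are made up for this illustration, and the labels on the three clarin.com URLs are assumed, not real data.

```python
class Node:
    """One character position in the prefix tree, with per-class counts."""

    def __init__(self):
        self.children = {}
        self.positive = 0
        self.negative = 0

    @property
    def total(self):
        return self.positive + self.negative


def build_tree(labelled_urls, max_depth=100):
    """Insert each (url, label) pair character by character."""
    root = Node()
    for url, label in labelled_urls:
        node = root
        for ch in url.lower()[:max_depth]:
            node = node.children.setdefault(ch, Node())
            if label == "positive":
                node.positive += 1
            else:
                node.negative += 1
    return root


def lookup(root, prefix):
    """Return the node reached by following prefix, or None if absent."""
    node = root
    for ch in prefix.lower():
        node = node.children.get(ch)
        if node is None:
            return None
    return node


tree = build_tree([
    ("http://www.clarin.com/politica/", "positive"),
    ("http://www.clarin.com/tema/manifestaciones.html", "negative"),
    ("http://www.clarin.com/buscador?q=protesta", "negative"),
])
node = lookup(tree, "http://www.clarin.com/")
print(node.positive, node.negative, node.total)  # 1 2 3
```

Every node on the path of a URL is incremented, so the counts at a node describe all URLs that share that prefix.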

Code:

def find_patterns(incoming_urls):
    # Each entry of incoming_urls is a string "<url>____<class>",
    # where <class> is either "positive" or "negative".
    urls = {}
    # Build the tree: one dictionary entry per URL prefix.
    for line in incoming_urls:
        url, atype = line.strip().rsplit("____", 1)
        url = url.lower()
        # Use only the first 100 characters to avoid building a sparse tree.
        bound = min(len(url), 100) + 1
        for x in range(1, bound):
            prefix = url[:x]
            if prefix not in urls:
                urls[prefix] = {'positive': 0, 'negative': 0, 'total': 0}
            urls[prefix][atype] += 1
            urls[prefix]['total'] += 1

    new_urls = {}
    # Prune the tree.
    for url in urls:
        # A common pattern needs at least 5 occurrences.
        if urls[url]['total'] < 5:
            continue
        urls[url]['negative_percentage'] = urls[url]['negative'] * 100.0 / urls[url]['total']
        # Assuming we are interested in URL patterns for the negative class.
        if urls[url]['negative_percentage'] < 85.0:
            continue
        length = len(url)
        found = False
        # Check whether a longer prefix exists with the same total count;
        # if so, this prefix is redundant.
        for second in urls:
            if len(second) <= length:
                continue
            if url == second[:length] and urls[url]['total'] == urls[second]['total']:
                found = True
                break
        # Keep only non-redundant prefixes longer than 20 characters.
        if not found and len(url) > 20:
            new_urls[url] = urls[url]

    print("URL Pattern; Positive; Negative; Total; Negative (%)")
    for url in new_urls:
        print("%s; %d; %d; %d; %.2f" % (
            url, new_urls[url]['positive'], new_urls[url]['negative'],
            new_urls[url]['total'], new_urls[url]['negative_percentage']))
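The inner loop in the pruning step compares every prefix against every other prefix, which gets slow on 100,000 URLs. A possible shortcut, sketched below, is to compare each prefix only with its one-character-shorter parent. This should give the same result, because counts never increase as a prefix grows and every intermediate prefix is stored. The helper names `prefix_totals` and `maximal_prefixes` are hypothetical, and the sketch leaves out the class percentage and the minimum-length filter to keep the rule itself visible.

```python
from collections import Counter


def prefix_totals(urls, max_len=100):
    """Count how many URLs start with each prefix (up to max_len chars)."""
    totals = Counter()
    for url in urls:
        url = url.lower()
        for end in range(1, min(len(url), max_len) + 1):
            totals[url[:end]] += 1
    return totals


def maximal_prefixes(totals, min_total=5):
    """Keep frequent prefixes that lose at least one URL when extended."""
    redundant = set()
    for prefix, total in totals.items():
        parent = prefix[:-1]
        # If the parent covers exactly the same URLs, the longer prefix
        # says the same thing more precisely, so the parent is dropped.
        if parent and totals.get(parent) == total:
            redundant.add(parent)
    return sorted(
        p for p, t in totals.items() if t >= min_total and p not in redundant
    )


# Toy strings standing in for URLs.
toy = ["ab/x1", "ab/x2", "ab/y1"]
print(maximal_prefixes(prefix_totals(toy), min_total=2))  # ['ab/', 'ab/x']
```

Here "a" and "ab" are dropped because "ab/" covers the same three strings, while "ab/x" survives as a separate pattern because it covers only two of them.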