2013-09-21

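Determining length variance

I am trying to determine the variance of the lengths of parameter values, and to print that variance after each group of parameter/value combinations.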

For example, the variance for date with the values date=2007-04-14 and date=2007-08-19 would be 0, and for id_eve with the values id_eve=479989, id_eve=47, and id_eve=479 it would be 2.88.
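As a worked check of the second figure, assuming the population variance (dividing by n rather than n - 1) is what is meant: the values 479989, 47, and 479 have lengths 6, 2, and 3.

```latex
\bar{x} = \frac{6 + 2 + 3}{3} = \frac{11}{3}, \qquad
\sigma^2 = \frac{1}{3}\left[\left(\tfrac{7}{3}\right)^2 + \left(-\tfrac{5}{3}\right)^2 + \left(-\tfrac{2}{3}\right)^2\right]
         = \frac{1}{3}\cdot\frac{78}{9} = \frac{26}{9} \approx 2.889
```

Under that assumption, 2.88 is this value truncated to two decimals. Both date values have length 10, so every deviation is zero and the variance is 0.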

Building on "Group values with common domain and page values", we have a set of URLs that are parsed to give the parameters and values for each URL.

Sample data set:

www.domain.com/page?id_eve=479989&adm=no 
www.domain.com/page?id_eve=47&adm=yes 
www.domain.com/page?id_eve=479 
domain.com/cal?view=month 
domain.com/cal?view=day 
ww2.domain.com/cal?date=2007-04-14 
ww2.domain.com/cal?date=2007-08-19 
www.domain.edu/some/folder/image.php?l=adm&y=5&id=2&page=http%3A//support.domain.com/downloads/index.asp&unique=12345 
blog.news.org/news/calendar.php?view=day&date=2011-12-10 
www.domain.edu/some/folder/image.php?l=adm&y=5&id=2&page=http%3A//.domain.com/downloads/index.asp&unique=12345 
blog.news.org/news/calendar.php?view=month&date=2011-12-10 

It is parsed by the following Python code:

from collections import defaultdict
from urllib import quote  # Python 2 module layout
from urlparse import parse_qsl, urlparse

# path -> list of "key=value" strings
urls = defaultdict(list)
with open('links.txt') as f:
    for url in f:
        parsed_url = urlparse(url.strip())
        params = parse_qsl(parsed_url.query, keep_blank_values=True)
        for key, value in params:
            urls[parsed_url.path].append("%s=%s" % (key, quote(value)))

# printing results
for url, params in urls.iteritems():
    print url
    for param in params:
        print param

This produces:

ww2.domain.com/cal 
date=2007-04-14 
date=2007-08-19 
www.domain.edu/some/folder/image.php 
l=adm 
y=5 
id=2 
page=http%3A//support.domain.com/downloads/index.asp 
unique=12345 
l=adm 
y=5 
id=2 
page=http%3A//.domain.com/downloads/index.asp 
unique=12345 
domain.com/cal 
view=month 
view=day 
www.domain.com/page 
id_eve=479989 
adm=no 
id_eve=47 
adm=yes 
id_eve=479 
blog.news.org/news/calendar.php 
view=day 
date=2011-12-10 
view=month 
date=2011-12-10 
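
The group headers above contain the host name even though the code keys on parsed_url.path. A minimal sketch of why, written for Python 3 (where these functions live in urllib.parse) and using a shortened version of one sample URL: without a scheme or a leading //, urlparse treats the host and path together as the path. It also shows the decode and re-encode round trip that parse_qsl and quote perform on the page value.

```python
from urllib.parse import parse_qsl, quote, urlparse

# A scheme-less URL from the sample data, shortened to two parameters.
url = ("www.domain.edu/some/folder/image.php"
       "?l=adm&page=http%3A//support.domain.com/downloads/index.asp")

parsed = urlparse(url)
# No "scheme://" prefix, so nothing is recognised as the network location.
print(repr(parsed.netloc))
print(parsed.path)

# parse_qsl decodes percent-escapes; quote() re-encodes them, leaving "/" alone.
pairs = parse_qsl(parsed.query, keep_blank_values=True)
requoted = [(key, quote(value)) for key, value in pairs]
print(pairs)
print(requoted)
```

If the host and path need to be separated, one option is to prepend "//" to each line before parsing, which is a change to the input handling rather than something the original code does.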

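The addition I need is to print, after each parameter/value group, the variance of the parameter value lengths, grouping parameters that belong to the same URL as in the output above (hopefully that reads clearly).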

  • Group the parameters by URL
  • Compute the length of each parameter value
  • Determine the variance of those lengths

So the desired output would be:

ww2.domain.com/cal 
date=2007-04-14 
date=2007-08-19 
0 
www.domain.edu/some/folder/image.php 
l=adm 
l=adm 
0 
y=5 
y=5 
0 
id=2 
id=2 
0 
page=http%3A//support.domain.com/downloads/index.asp
page=http%3A//.domain.com/downloads/index.asp
12.25
unique=12345
unique=12345
0
domain.com/cal
view=month 
view=day 
1 
www.domain.com/page 
id_eve=479989 
id_eve=47 
id_eve=479 
2.88 
adm=no 
adm=yes 
0.25 
blog.news.org/news/calendar.php 
view=day 
view=month 
1 
date=2011-12-10 
date=2011-12-10 
0 

Can you explain why your example gives 2.88?

Answer

from collections import defaultdict 
from urllib import quote 
from urlparse import parse_qsl, urlparse 

We need to be able to compute the variance:

def variance(values): 
    mean = sum(values)/float(len(values)) 
    return sum((elem - mean)**2 for elem in values)/float(len(values)) 
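
As a quick check against the question's numbers, here is a sketch that applies the same formula to the value lengths from the sample data and compares it with statistics.pvariance from the standard library (Python 3.4+). The function is repeated so the snippet runs on its own; using pvariance is a suggestion, not part of the original answer.

```python
from statistics import pvariance

def variance(values):
    # Population variance: mean squared deviation from the mean.
    mean = sum(values) / float(len(values))
    return sum((elem - mean) ** 2 for elem in values) / float(len(values))

id_eve_lengths = [len(v) for v in ("479989", "47", "479")]      # 6, 2, 3
adm_lengths = [len(v) for v in ("no", "yes")]                   # 2, 3
date_lengths = [len(v) for v in ("2007-04-14", "2007-08-19")]   # 10, 10

print(round(variance(id_eve_lengths), 2))   # 2.89
print(variance(adm_lengths))                # 0.25
print(variance(date_lengths))               # 0.0

# The hand-written formula agrees with the standard library's population variance.
print(abs(variance(id_eve_lengths) - pvariance(id_eve_lengths)) < 1e-9)
```

The id_eve result is 26/9, about 2.889, so the 2.88 in the question looks like that value truncated rather than rounded.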

We want to group by key, so instead of storing "%s=%s" strings we add another layer to the defaultdict:

urls = defaultdict(lambda: defaultdict(list))
with open('links.txt') as f:
    for url in f:
        parsed_url = urlparse(url.strip())
        params = parse_qsl(parsed_url.query, keep_blank_values=True)
        for key, value in params:
            urls[parsed_url.path][key].append(quote(value))
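
To make the shape of that structure concrete, here is a small Python 3 sketch that builds the same two-level mapping from three of the sample URLs. Holding them in a list named sample_urls is an assumption made so the snippet runs without links.txt.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, quote, urlparse

# Three lines from the sample data set (assumed in-memory stand-in for links.txt).
sample_urls = [
    "www.domain.com/page?id_eve=479989&adm=no",
    "www.domain.com/page?id_eve=47&adm=yes",
    "www.domain.com/page?id_eve=479",
]

# path -> parameter name -> list of re-quoted values
urls = defaultdict(lambda: defaultdict(list))
for url in sample_urls:
    parsed_url = urlparse(url.strip())
    for key, value in parse_qsl(parsed_url.query, keep_blank_values=True):
        urls[parsed_url.path][key].append(quote(value))

# Convert to plain dicts purely for readable printing.
grouped = {path: dict(keys) for path, keys in urls.items()}
print(grouped)
```

Each parameter name now owns a single list of values, which is what allows one variance to be computed per parameter.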

Then we can go through and print everything:

for domain, keys in urls.items():
    print domain
    for key, values in keys.items():
        for value in values:
            print "%s=%s" % (key, value)

        if len(values) > 1:
            print variance(map(len, values))
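
For readers on Python 3, a hedged port of the whole answer might look like the sketch below. The helper name report and the inline sample_urls list are assumptions introduced for illustration; the original reads from links.txt and prints directly. One porting detail matters: in Python 3, map returns an iterator, which has no len, so the lengths are built as a list before being passed to variance.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, quote, urlparse

def variance(values):
    # Population variance of a non-empty list of numbers.
    mean = sum(values) / len(values)
    return sum((elem - mean) ** 2 for elem in values) / len(values)

def report(lines):
    """Group values by path and parameter name; return the report lines."""
    urls = defaultdict(lambda: defaultdict(list))
    for url in lines:
        parsed_url = urlparse(url.strip())
        for key, value in parse_qsl(parsed_url.query, keep_blank_values=True):
            urls[parsed_url.path][key].append(quote(value))

    out = []
    for path, keys in urls.items():
        out.append(path)
        for key, values in keys.items():
            out.extend("%s=%s" % (key, value) for value in values)
            if len(values) > 1:  # as in the original, skip single-value groups
                out.append(str(variance([len(v) for v in values])))
    return out

# A few lines from the sample data set (assumed stand-in for links.txt).
sample_urls = [
    "domain.com/cal?view=month",
    "domain.com/cal?view=day",
    "ww2.domain.com/cal?date=2007-04-14",
    "ww2.domain.com/cal?date=2007-08-19",
]
print("\n".join(report(sample_urls)))
```

On Python 3.7 and later, dictionaries keep insertion order, so groups come out in the order they were first seen. In the Python 2 original the order of the groups is arbitrary.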