2013-07-09 34 views

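Extracting directory structure from URLs

I want to extract the directory hierarchy from a website's URLs. Not all websites follow a directory structure. For sites that do (see below), I would like to create a Python dictionary that reflects the directory hierarchy (as shown below). How can I write a Python script that extracts the structure from the URLs into a dictionary?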

Raw data: 
http://www.ex.com 
http://www.ex.com/product_cat_1/ 
http://www.ex.com/product_cat_1/item_1 
http://www.ex.com/product_cat_1/item_2 
http://www.ex.com/product_cat_2/ 
http://www.ex.com/product_cat_2/item_1 
http://www.ex.com/product_cat_2/item_2 
http://www.ex.com/terms_and_conditions/ 
http://www.ex.com/Media_Center 



Example output: 
{'url':'http://www.ex.com', 'sub_dir':[ 
{'url':'http://www.ex.com/product_cat_1/', 'sub_dir':[ 
         {'url':'http://www.ex.com/product_cat_1/item_1'}, {'url':'http://www.ex.com/product_cat_1/item_2'}]}, 
{'url':'http://www.ex.com/product_cat_2/', 'sub_dir':[ 
         {'url':'http://www.ex.com/product_cat_2/item_1'}, 
         {'url':'http://www.ex.com/product_cat_2/item_2'}]}, 
{'url':'http://www.ex.com/terms_and_conditions/'}, 
{'url':'http://www.ex.com/Media_Center'}, 
]} 

What have you tried? http://mattgemmell.com/2008/12/08/what-have-you-tried/ – FlavorScape

Answers

0
For each item: 
    if it is a subdir of something else: 
     add it to the subdirectory list of that item 
    otherwise: 
     add it to the main list. 
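
The pseudocode above could be sketched in Python roughly as follows. This is only one reading of it, not the answerer's own code: the function name `group_urls` is made up for illustration, and it assumes that a URL counts as a directory only when it ends with "/", and that the closest ancestor is the longest matching prefix.

```python
def group_urls(urls):
    """Attach each URL to the closest directory URL that prefixes it."""
    top_level = []  # the "main list" from the pseudocode
    nodes = []      # every node created so far
    # Sorting guarantees that a directory is seen before its contents.
    for url in sorted(set(urls)):
        node = {"url": url}
        # Assumption: only URLs ending in "/" can act as parents.
        parents = [n for n in nodes
                   if n["url"].endswith("/") and url.startswith(n["url"])]
        if parents:
            # The longest matching prefix is the closest ancestor.
            parent = max(parents, key=lambda n: len(n["url"]))
            parent.setdefault("sub_dir", []).append(node)
        else:
            top_level.append(node)
        nodes.append(node)
    return top_level
```

Because the check scans every earlier node, this is quadratic in the number of URLs, which is fine for a small site map but not for a large crawl.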
0

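Here is a solution with a slightly different output format. Instead of using sub_dir and url keys, the directory structure can be represented as nested dictionaries, where an empty dictionary is either an empty directory or a file (a leaf of the tree).

For example, the input strings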

"www.foo.com/images/hellokitty.jpg" 
"www.foo.com/images/t-rex.jpg" 
"www.foo.com/videos/" 

would map to a directory structure like this:

{ 
    "www.foo.com": { 
     "images": { 
      "hellokitty.jpg": {}, 
      "t-rex.jpg": {} 
     }, 
     "videos": {} 
    } 
} 

With this model, parsing your data strings is a simple combination of a for loop, an if statement, and a few string functions.

Code:

raw_data = [ 
    "http://www.ex.com", 
    "http://www.ex.com/product_cat_1/", 
    "http://www.ex.com/product_cat_1/item_1", 
    "http://www.ex.com/product_cat_1/item_2", 
    "http://www.ex.com/product_cat_2/", 
    "http://www.ex.com/product_cat_2/item_1", 
    "http://www.ex.com/product_cat_2/item_2", 
    "http://www.ex.com/terms_and_conditions/", 
    "http://www.ex.com/Media_Center" 
] 

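root = {} 

for url in raw_data: 
    # Drop the scheme ("http://") if present. lstrip("htp:/") strips a set of 
    # characters, so it would also eat a leading "p" or "t" of the host name. 
    path = url.split("://", 1)[-1].rstrip("/") 
    last_dir = root 
    for dir_name in path.split("/"): 
        # Descend into the entry, creating it first if it does not exist yet. 
        last_dir = last_dir.setdefault(dir_name, {}) 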

Output:

{ 
    "www.ex.com": { 
        "Media_Center": {}, 
        "terms_and_conditions": {}, 
        "product_cat_1": { 
            "item_2": {}, 
            "item_1": {} 
        }, 
        "product_cat_2": { 
            "item_2": {}, 
            "item_1": {} 
        } 
    } 
} 
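
The answer does not say how the output above was printed. One possible way to get a similar indented view of a nested dictionary is `json.dumps` from the standard library; the small `root` below is a stand-in for the tree built by the loop, and key order may differ from the listing above.

```python
import json

# Stand-in for the tree produced by the loop (shortened for illustration).
root = {
    "www.ex.com": {
        "product_cat_1": {"item_1": {}, "item_2": {}},
        "Media_Center": {},
    }
}

# indent=4 gives one nesting level per four spaces; sort_keys makes the
# output stable regardless of insertion order.
text = json.dumps(root, indent=4, sort_keys=True)
print(text)
```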
0

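Here is a script that directly produces the requested output. (NB: it reads its input from a file; give the file name as the first, and only, command-line argument of the script.) NB: using butch's solution and then converting to this format would probably be cleaner and faster.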

#!/usr/bin/env python3 

import sys 
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse 


def match(init, path): 
    # init is a parent of path only if it is a directory (ends with "/") 
    # and path starts with it. 
    return init.endswith("/") and path.startswith(init) 

def add_url(tree, url): 
    while True: 
        if tree["url"] == url: 
            return  # already present 
        # Look for a child directory that contains this URL. 
        f = [t for t in tree.get("sub_dir", []) if match(t["url"], url)] 
        if f: 
            tree = f[0]  # descend one level and try again 
            continue 
        tree.setdefault("sub_dir", []).append({"url": url}) 
        return 

def build_tree(urls): 
    # Sorting puts each directory before its contents; the first URL is the root. 
    urls = sorted(urls) 
    tree = {"url": urls[0]} 
    for url in urls[1:]: 
        add_url(tree, url) 
    return tree 

def read_urls(filename): 
    urls = [] 
    with open(filename) as fd: 
        for line in fd: 
            if not line.strip(): 
                continue  # skip blank lines 
            url = urlparse(line.strip()) 
            # Keep scheme, host and path; drop query string and fragment. 
            urls.append("".join([url.scheme, "://", url.netloc, url.path])) 
    return urls 


if __name__ == "__main__": 
    urls = read_urls(sys.argv[1]) 
    tree = build_tree(urls) 
    print("%r" % tree)
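
The remark above about converting the nested-dictionary format into the requested one could look something like the sketch below. The helper name `to_url_tree` and the "http://" default prefix are assumptions made for illustration. The nested dictionaries do not record trailing slashes, so this sketch appends "/" to every node that has children and to nothing else; an empty directory such as terms_and_conditions/ would therefore come out without its slash, and the root would gain one.

```python
def to_url_tree(name, children, prefix="http://"):
    """Convert one nested-dict entry into the {'url', 'sub_dir'} form."""
    url = prefix + name
    # Assumption: a node with children is a directory and gets a trailing "/".
    node = {"url": url + "/" if children else url}
    if children:
        node["sub_dir"] = [
            to_url_tree(child, sub, url + "/")
            for child, sub in sorted(children.items())
        ]
    return node


nested = {"www.ex.com": {"a": {"x": {}}, "b": {}}}
tree = to_url_tree("www.ex.com", nested["www.ex.com"])
```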