2013-07-26 47 views
0

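I have a lot of links like http://example.com/2013/1520/i2013i1520p100049.html and http://example.com/2013/89/i2013i89p60003.html. How can I extract the file name from the URL?

I need to save the HTML files into separate folders: i2013i1520p100049.html goes into the folder "1520", and i2013i89p60003.html goes into the folder "89".

I can slice the string at fixed positions, but other URLs have different lengths.

P.S. I am using Python.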

Answers

0

With a format this regular, the fastest approach is rfind plus slicing :). A regular expression is not worth it.

For example,

>>> a = "http://example.com/2013/1520/i2013i1520p100049.html or http://example.com/2013/89/i2013i89p60003.html" 
>>> lastindex = a.rfind('/') 
>>> a[lastindex+1:] 
'i2013i89p60003.html' 
>>> a[a.rfind('/',0,lastindex)+1:lastindex] 
'89' 
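
The same slicing can be wrapped in a small helper and applied to each URL on its own. This is only a minimal sketch, written for Python 3; the name `folder_and_file` is made up for illustration, and it assumes the URL has no query string and no trailing slash:

```python
def folder_and_file(url):
    """Return (folder, filename) from the last two path segments of url."""
    last = url.rfind('/')           # slash just before the file name
    prev = url.rfind('/', 0, last)  # slash just before the folder name
    return url[prev + 1:last], url[last + 1:]


urls = [
    'http://example.com/2013/1520/i2013i1520p100049.html',
    'http://example.com/2013/89/i2013i89p60003.html',
]
for url in urls:
    print(folder_and_file(url))
```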

split vs. rfind on a huge URL (these exist, but URLs are usually not this big):

>>> import time
>>> from random import randint
>>> a = list(range(10000))
>>> for x in range(100): a.insert(randint(0, 10000), '/')
...
>>> a = str(a)
>>> b = time.time(); a.rfind('/'); time.time()-b
58493
1.8835067749023438e-05
>>> b = time.time(); d=a.split('/'); time.time()-b
0.00012683868408203125

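More importantly, rfind does not allocate and copy every segment into a new list the way split does, which is no fun when you have thousands of URLs.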
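
A single `time.time()` reading is noisy, so the comparison can be repeated with `timeit`. The sketch below is for Python 3 and assumes a synthetic URL with a hundred path segments, invented purely for the benchmark; the helper names are illustrative. Timings depend on the machine, so no numbers are quoted here:

```python
import timeit

# Synthetic URL with 100 path segments (hypothetical, for benchmarking only).
long_url = 'http://example.com/' + '/'.join(str(n) for n in range(100)) + '/page.html'


def name_by_rfind(url):
    # Scans from the right and builds a single new string.
    return url[url.rfind('/') + 1:]


def name_by_split(url):
    # Builds a list holding every segment, then keeps only the last one.
    return url.split('/')[-1]


# Both approaches must agree before their speed is worth comparing.
assert name_by_rfind(long_url) == name_by_split(long_url) == 'page.html'

t_rfind = timeit.timeit(lambda: name_by_rfind(long_url), number=10000)
t_split = timeit.timeit(lambda: name_by_split(long_url), number=10000)
print('rfind: %.6f s, split: %.6f s' % (t_rfind, t_split))
```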

1

You can use something like the following (if you want to do more complex work):

s = 'http://example.com/2013/1520/i2013i1520p100049.html' 

from operator import itemgetter 
from urlparse import urlsplit 

split_url = urlsplit(s) 
path, fname = itemgetter(2, -1)(split_url.path.split('/')) 
print path, fname 
# 1520 i2013i1520p100049.html 

Otherwise:

path, fname = s.rsplit('/', 2)[1:] 
2

Use split():

url = 'http://example.com/2013/1520/i2013i1520p100049.html' 
parts = url.split('/') 

fn = parts[-1] 
dir = parts[-2] 

Then fetch the page and save its source:

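import os
import urllib2

html = urllib2.urlopen(url).read()  # download the page source

if not os.path.isdir(dir):  # create the target folder on first use
    os.makedirs(dir)

fullpath_fn = os.path.join(dir, fn)
with open(fullpath_fn, 'wb') as htmlfile:
    htmlfile.write(html)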
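
On Python 3, `urllib2` was merged into `urllib.request`. The sketch below puts the pieces together under that assumption: it derives the folder and file name, creates the folder if it is missing, and writes the bytes. The helper names `target_path`, `save_html` and `download`, and the `root` parameter, are illustrative and not part of the answer above. The download step is kept in its own function so the saving logic can be tried without network access.

```python
import os
from urllib.request import urlopen


def target_path(url, root='.'):
    """Map .../<folder>/<file> to <root>/<folder>/<file>."""
    folder, filename = url.rsplit('/', 2)[1:]
    return os.path.join(root, folder, filename)


def save_html(url, html, root='.'):
    """Write the bytes in html to the path derived from url; return that path."""
    path = target_path(url, root)
    os.makedirs(os.path.dirname(path), exist_ok=True)  # create the folder if needed
    with open(path, 'wb') as htmlfile:
        htmlfile.write(html)
    return path


def download(url):
    """Fetch the raw page bytes (requires network access)."""
    with urlopen(url) as response:
        return response.read()


# Usage, for each URL in the list:
#     save_html(url, download(url))
```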
0
>>> 'http://example.com/2013/1520/i2013i1520p100049.html'.split('/')[-1] 
'i2013i1520p100049.html' 
0

You can use the split() method:

url = 'http://example.com/2013/1520/i2013i1520p100049.html' 
tokens = url.split('/') 
file = tokens[-1] 
folder = tokens[-2] 
2

You can use urlparse.urlsplit and os.path.split:

import os 
import urlparse 
s = 'http://example.com/2013/1520/i2013i1520p100049.html' 

path = urlparse.urlsplit(s).path 
print(path) 
# /2013/1520/i2013i1520p100049.html 

dirname, basename = os.path.split(path) 
dirname, basedir = os.path.split(dirname) 
print(basedir) 
# 1520 
print(basename) 
# i2013i1520p100049.html 
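
In Python 3 the `urlparse` module became `urllib.parse`. URL paths always use forward slashes, so `posixpath` may be a safer choice than `os.path`, whose separator handling depends on the platform. A sketch of the same approach under those assumptions; the query string in the example URL is made up to show that `urlsplit` keeps it out of the path:

```python
import posixpath
from urllib.parse import urlsplit

# The '?ref=index' part is hypothetical, added only for illustration.
s = 'http://example.com/2013/1520/i2013i1520p100049.html?ref=index'

path = urlsplit(s).path                    # query string is not part of .path
dirname, basename = posixpath.split(path)  # split off the file name
basedir = posixpath.basename(dirname)      # last folder in the remaining path
print(basedir)
print(basename)
```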
0

Just for the sake of completeness, a regex-based answer:

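import re

string = 'http://example.com/2013/1520/i2013i1520p100049.html'
match = re.search(r'([0-9]+)/([a-z0-9]+\.html)$', string) 
if match: 
    folder = match.group(1)  # '1520'
    file = match.group(2)    # 'i2013i1520p100049.html'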
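
One thing the regex gives for free is validation: links that do not fit the expected shape simply fail to match. A rough sketch for Python 3 that groups file names by folder across a list of URLs; the third URL and the names `PATTERN` and `files_by_folder` are invented for illustration:

```python
import re
from collections import defaultdict

# Same pattern as above, compiled once because it is reused in a loop.
PATTERN = re.compile(r'([0-9]+)/([a-z0-9]+\.html)$')

urls = [
    'http://example.com/2013/1520/i2013i1520p100049.html',
    'http://example.com/2013/89/i2013i89p60003.html',
    'http://example.com/about',  # hypothetical URL that does not fit the pattern
]

files_by_folder = defaultdict(list)
for url in urls:
    match = PATTERN.search(url)
    if match is None:
        continue  # skip links without a numeric folder and an .html file
    folder, filename = match.groups()
    files_by_folder[folder].append(filename)

print(dict(files_by_folder))
```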