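Python regex list into another list
I currently have a piece of code that mostly runs as I expect, except that it prints out both the original list and the filtered one. Basically, what I am trying to do is read the URLs from a webpage and store them in a list (called match; this part works fine), and then filter that list into a new list (called fltrmtch), because the original contains all of the extra href tags and so on.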
For example, at the moment it prints out both A and B, when I am only after B:
B: 'http://docs.python.org/devguide/'),
Here's the code:
import urllib  # Python 2 module that provides urlopen
import re  # Imports the re module, which allows searching for matches.
import pprint  # This import allows all list items to be printed on separate lines.

url = "URL WOULD BE IN HERE BUT NOT ALLOWED TO POST MULTIPLE LINKS"  # Name of the url being searched
webpage = urllib.urlopen(url)
content = webpage.read()  # places the read url contents into variable content

# matches from "<a" through "href=" and "http:" up to the end of that line
match = re.findall(r'\<a.*href\=.*http\:.+', content)

def filterPick(list, filter):
    return [(l, m.group(1)) for l in match for m in (filter(l),) if m]

regex = re.compile(r'\"(.+?)\"').search
fltrmtch = filterPick(match, regex)

try:
    if match:  # if there is a match, the lines below are run.
        print "The number of URL's found is:", len(match)
        match.sort()
        print "\nAnd here are the URL's found: "
        pprint.pprint(fltrmtch)
except:
    print "No URL matches have been found, please try again!"
Any help would be much appreciated.
Thanks in advance.
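UPDATE: Thanks for the answers posted; however, I managed to find the flaw myself:
return [(l, m.group(1)) for l in match for m in (filter(l),) if m]
I just had to remove the l from [(l, m.group(1)), so that each entry is only m.group(1) rather than a tuple of the whole line and the captured URL. Thanks again.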
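For anyone who lands here later, this is a minimal sketch of what the corrected filter might look like. It is written for Python 3 (the code in the question is Python 2), and it makes a few assumptions: the sample lines and the second URL are invented stand-ins for what re.findall would return, and the names filter_pick, lines and search are my own renamings for illustration.

```python
import re

# Hypothetical sample lines standing in for the `match` list from the question.
match = [
    '<a href="http://docs.python.org/devguide/">Developer guide</a>',
    '<a href="http://docs.python.org/library/">Library</a>',
    '<a href=http://docs.python.org/unquoted>no quotes, so no capture</a>',
]

def filter_pick(lines, search):
    # Keep only the captured group, not a (line, group) tuple.
    # Lines where the search finds nothing are skipped.
    return [m.group(1) for line in lines for m in (search(line),) if m]

# Same pattern as in the question: the first double-quoted string on the line.
quoted = re.compile(r'"(.+?)"').search

fltrmtch = filter_pick(match, quoted)
print(fltrmtch)
```

The only real change from the question's version is the expression at the front of the comprehension: m.group(1) instead of (l, m.group(1)). The sketch also iterates over the function's own argument instead of the global match, which is optional but avoids surprises if the function is reused on another list.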