2016-08-10 145 views

Python web scraping: what do the symbols mean?

In the code below, what does each element of the pattern string in re.sub('<[^>]*>|[\n]|\[[0-9]*\]', '', htmlread) mean?

import urllib2 
import re 

htmltext = urllib2.urlopen("https://en.wikipedia.org/wiki/Linkin_Park") 
htmlread = htmltext.read() 
htmlread = re.sub('<[^>]*>|[\n]|\[[0-9]*\]', '', htmlread) 
regex = '(?<=Linkin Park was founded)(.*)(?=the following year.)' 
pattern = re.compile(regex) 
htmlread = re.findall(pattern, htmlread) 
print "Linkin Park was founded" + htmlread[0] + "the following year." 

Comment: see http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean

Answers


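htmlread = re.sub('<[^>]*>|[\n]|\[[0-9]*\]', '', htmlread) removes any of the following from htmlread:

  • anything between < and >, the angle brackets included (<[^>]*>), OR
  • newline characters ([\n]), OR
  • square brackets that are empty or contain only digits, such as [12] or [] (\[[0-9]*\])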
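As a rough illustration of the three alternatives, the same pattern can be written with re.VERBOSE so that each piece carries its own comment. This is only a sketch: the names TOKEN_PATTERN and sample are made up for the example, and the sample string is a hypothetical stand-in for the downloaded page.

```python
import re

# Same pattern as in the question, split into its three alternatives.
TOKEN_PATTERN = re.compile(r"""
    <[^>]*>       # '<', then any run of characters other than '>', then '>'
  | [\n]          # a single newline character
  | \[[0-9]*\]    # '[', zero or more digits, then ']'
""", re.VERBOSE)

# Hypothetical sample text, not taken from the real page.
sample = '<a href="/wiki/Rock">rock</a> band[12]\nfrom California[]'

# Every piece of text that one of the three alternatives matches.
matches = TOKEN_PATTERN.findall(sample)
print(matches)

# Replacing each match with '' deletes it.
cleaned = TOKEN_PATTERN.sub('', sample)
print(cleaned)
```

Note that removing the newline without putting anything in its place joins "band" and "from" into one word.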

There is a useful wiki post here: Reference - What does this regex mean?


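It replaces every match of the pattern with '' (the empty string), which means the matched text is deleted from the htmlread variable. Please read more about regular expressions.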
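A small sketch of what replacing with the empty string does, compared with replacing with a space. The text value is invented for the example, and the space variant is only an assumption about what one might prefer when words should stay separated. It is not part of the original code.

```python
import re

pattern = r'<[^>]*>|[\n]|\[[0-9]*\]'

# Hypothetical input, chosen to contain all three kinds of match.
text = 'first line[3]\n<i>second</i> line'

# Empty replacement: every match simply disappears.
deleted = re.sub(pattern, '', text)
print(deleted)

# Space replacement, then collapse runs of whitespace into single spaces.
spaced = re.sub(pattern, ' ', text)
normalized = ' '.join(spaced.split())
print(normalized)
```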