I am using this code to extract the information:
# _*_ coding:utf-8 _*_
# Python 2 script (urllib2, reload and unicode do not exist in Python 3)
import urllib2
import urllib
import re
from bs4 import BeautifulSoup
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def grabHref(url, localfile):
    html = urllib2.urlopen(url).read()
    html = unicode(html, 'gb2312', 'ignore').encode('utf-8', 'ignore')
    soup = BeautifulSoup(html)
    myfile = open(localfile, 'wb')
    # Follow every brand link found on the listing page
    for link in soup.select("div > a[href^='http://www.karmaloop.com/kazbah/browse']"):
        # On each brand page, pick out the mailto links
        for item in BeautifulSoup(urllib2.urlopen(link['href']).read()).select("div > a[href^='mailto']"):
            contactInfo = item.get_text()
            print link['href']
            print contactInfo
            myfile.write(link['href'])
            myfile.write('\r\n')
            myfile.write(contactInfo)
            myfile.write('\r\n')
    myfile.close()

def main():
    url = "http://www.karmaloop.com/brands"
    localfile = 'Contact.txt'
    grabHref(url, localfile)

if __name__ == "__main__":
    main()
But with this I still only get the email address. How can I get the phone number and the address as well? Thanks.
Take a look at this: http://stackoverflow.com/questions/11709079/parsing-html-python – RafaelC
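@RafaelCardoso I have read it. But how do I get the information after the "|"? I mean, getting [email protected] is easy, but getting the phone number and the address is hard.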
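Perhaps the documentation for [`split`](https://docs.python.org/3/library/stdtypes.html#str.split) will show you how to extract those "hard" parts. Also, if you show some form of code that you have tried yourself, you will get (better) answers in the future. If you go out of your way to write that getting the email address is easy, why did you not copy the code you used into the question? Take a look at [Writing the perfect question](http://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) and [How to Ask](https://stackoverflow.com/help/how-to-ask).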
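As a rough illustration of the `split` suggestion, here is a minimal sketch. It assumes the contact details appear as one line of text with the fields separated by "|" in the order email, phone, address; that layout, the sample string, and the helper names `split_contact_line` and `parse_contact` are all hypothetical, not taken from the actual page. In the question's code, `item.get_text()` returns only the text of the mailto link, so the full line would probably have to come from an enclosing element (for example `item.parent.get_text()`), which is also an assumption about the page structure.

```python
def split_contact_line(text, sep="|"):
    """Split a separator-delimited line into stripped, non-empty fields."""
    return [part.strip() for part in text.split(sep) if part.strip()]


def parse_contact(text):
    """Name the fields, ASSUMING the order is email, phone, address."""
    fields = split_contact_line(text)
    keys = ("email", "phone", "address")
    # Pad with None so a missing trailing field does not raise an error.
    fields += [None] * (len(keys) - len(fields))
    return dict(zip(keys, fields))


# Hypothetical sample line; the real page layout may differ.
sample = "someone@example.com | 555-0100 | 123 Example St, Springfield"
contact = parse_contact(sample)
```

With the sample above, `contact["phone"]` and `contact["address"]` hold the second and third fields. If the real page orders or separates the fields differently, the `keys` tuple and `sep` argument would need to change accordingly.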