How do I split a URL and remove an unwanted string from it? I have code like this:
    import urllib
    from bs4 import BeautifulSoup

    f = open('log1.txt', 'w')
    url = 'http://www.brothersoft.com/tamil-font-513607.html'
    pageUrl = urllib.urlopen(url)
    soup = BeautifulSoup(pageUrl)

    # First level: links on the product page
    for a in soup.select("div.class1.coLeft a[href]"):
        try:
            suburl = ('http://www.brothersoft.com'+a['href']).encode('utf-8','replace')
            f.write('http://www.brothersoft.com'+a['href']+'\n')
        except:
            print 'cannot read'
            f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n')
            pass

        # Second level: links on the download page
        content = urllib.urlopen(suburl)
        soup = BeautifulSoup(content)
        for a in soup.select("div.Sever1.coLeft a[href]"):
            try:
                suburl2 = ('http://www.brothersoft.com'+a['href']).encode('utf-8','replace')
                f.write('http://www.brothersoft.com'+a['href']+'\n')
            except:
                print 'cannot read'
                f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n')
                pass

            # Third level: the final URL sits in the body's onload attribute
            content = urllib.urlopen(suburl2)
            soup = BeautifulSoup(content)
            try:
                suburl3 = soup.find('body')['onload'][10:-2]
                print suburl3.replace("&" + url.split('&')[-1],"")
                #f.write (soup.find('body')['onload'][10:-2]+'\n')
            except:
                print 'cannot read'
                f.write(soup.find('body')['onload'][10:-2]+'\n')
                pass

    f.close()
The output I want should look like this:
It didn't work for me. –
What output do you get? Or what error are you seeing? – ton1c
I edited my question. I tried your code, but nothing changed. –
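
One likely reason nothing changes: `url.split('&')[-1]` splits the start page `url`, which contains no `&`, so it returns the whole start URL. That string never occurs inside `suburl3`, and `replace` leaves it untouched. Below is a minimal sketch of one way to drop the trailing parameter, assuming the goal is to remove everything from the last `&` onward in the extracted link. The helper name `strip_last_param` and the example URL are hypothetical and used for illustration only.

```python
def strip_last_param(link):
    """Return link without its final '&...' segment, if it has one."""
    # rpartition splits on the LAST '&'; when there is none it
    # returns ('', '', link), so sep is empty and link is kept as-is.
    head, sep, _tail = link.rpartition('&')
    return head if sep else link


# Hypothetical URL, not taken from the site in the question.
example = 'http://example.com/download?id=513607&name=font.zip&ref=abc'
print(strip_last_param(example))
```

In the loop from the question this would be applied to the extracted value, for example `strip_last_param(suburl3)`, rather than to `url`. The single-argument `print(...)` form also runs under Python 2.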