2013-08-30

How do I split a URL and remove the unwanted part of the string?

I have this code:

import urllib 
from bs4 import BeautifulSoup 

f = open('log1.txt', 'w') 

url ='http://www.brothersoft.com/tamil-font-513607.html' 
pageUrl = urllib.urlopen(url) 
soup = BeautifulSoup(pageUrl) 

for a in soup.select("div.class1.coLeft a[href]"): 
    try: 
     suburl = ('http://www.brothersoft.com'+a['href']).encode('utf-8','replace') 
     f.write ('http://www.brothersoft.com'+a['href']+'\n') 
    except: 
     print 'cannot read' 
     f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n') 

     pass 

    content = urllib.urlopen(suburl) 
    soup = BeautifulSoup(content) 
    for a in soup.select("div.Sever1.coLeft a[href]"): 
     try: 
      suburl2 = ('http://www.brothersoft.com'+a['href']).encode('utf-8','replace') 
      f.write ('http://www.brothersoft.com'+a['href']+'\n') 
     except: 
      print 'cannot read' 
      f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n') 

      pass 

     content = urllib.urlopen(suburl2) 
     soup = BeautifulSoup(content) 
     try: 
      suburl3 = soup.find('body')['onload'][10:-2] 
      print suburl3.replace("&" + url.split('&')[-1],"") 
      #f.write (soup.find('body')['onload'][10:-2]+'\n') 
     except: 
      print 'cannot read' 
      f.write(soup.find('body')['onload'][10:-2]+'\n') 

      pass 
f.close() 

The output I want should look like this:

http://www.brothersoft.com/d.php?soft_id=159403&url=http%3A%2F%2Ffiles.brothersoft.com%2Fmp3_audio%2Fmidi_tools%2FSynthFontSetup.exe

Answers


Try this:

url = "http://www.brothersoft.com/d.php?soft_id=159403&url=http%3A%2F%2Ffiles.brothersoft.com%2Fmp3_audio%2Fmidi_tools%2FSynthFontSetup.exe&name=SynthFont" 
print url.replace("&" + url.split('&')[-1],"") 

Output:

http://www.brothersoft.com/d.php?soft_id=159403&url=http%3A%2F%2Ffiles.brothersoft.com%2Fmp3_audio%2Fmidi_tools%2FSynthFontSetup.exe 
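
The one-liner above drops whatever follows the last `&`, regardless of what that parameter is called. If the goal is specifically to remove the `name` parameter, a key-based variant may be safer. The following is a minimal sketch in Python 3 syntax (the code in this thread is Python 2); the helper name `drop_query_param` is hypothetical and not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_query_param(url, key):
    """Return url without any query parameter named key (hypothetical helper)."""
    parts = urlsplit(url)
    # Filter the raw 'key=value' pairs so the remaining values keep
    # their original percent-encoding untouched.
    kept = [pair for pair in parts.query.split('&')
            if pair.split('=', 1)[0] != key]
    return urlunsplit(parts._replace(query='&'.join(kept)))


full_url = ("http://www.brothersoft.com/d.php?soft_id=159403"
            "&url=http%3A%2F%2Ffiles.brothersoft.com%2Fmp3_audio"
            "%2Fmidi_tools%2FSynthFontSetup.exe&name=SynthFont")
cleaned = drop_query_param(full_url, 'name')
print(cleaned)
```

Because the pairs are filtered as raw text rather than decoded and re-encoded, the nested `url=` value is left exactly as it appeared in the input.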

Your code (with changes):

import urllib 
from bs4 import BeautifulSoup 

f = open('log1.txt', 'w') 

url ='http://www.brothersoft.com/tamil-font-513607.html' 
pageUrl = urllib.urlopen(url) 
soup = BeautifulSoup(pageUrl) 

for a in soup.select("div.class1.coLeft a[href]"): 
    try: 
     suburl = ('http://www.brothersoft.com'+a['href']).encode('utf-8','replace') 
     f.write ('http://www.brothersoft.com'+a['href']+'\n') 
    except: 
     print 'cannot read' 
     f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n') 

     pass 

    content = urllib.urlopen(suburl) 
    soup = BeautifulSoup(content) 
    for a in soup.select("div.Sever1.coLeft a[href]"): 
     try: 
      suburl2 = ('http://www.brothersoft.com'+a['href']).encode('utf-8','replace') 
      f.write ('http://www.brothersoft.com'+a['href']+'\n') 
     except: 
      print 'cannot read' 
      f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n') 

      pass 

     content = urllib.urlopen(suburl2) 
     soup = BeautifulSoup(content) 
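     try:
      # the download URL sits inside the <body onload="..."> attribute
      suburl3 = soup.find('body')['onload'][10:-2]
      print suburl3
      # strip the last '&key=value' pair, taken from suburl3 itself (not from url)
      print suburl3.replace("&" + suburl3.split('&')[-1],"")
      #f.write (soup.find('body')['onload'][10:-2]+'\n')
     except:
      print 'cannot read'
      # log the page that failed; re-reading 'onload' here would raise again
      f.write('cannot read:'+suburl2+'\n')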
f.close() 

Output:

http://www.brothersoft.com/d.php?soft_id=513607&url=http%3A%2F%2Ffiles.brothersoft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe&name=Tamil%20Font 
http://www.brothersoft.com/d.php?soft_id=513607&url=http%3A%2F%2Ffiles.brothersoft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe 
http://www.brothersoft.com/d.php?soft_id=513607&url=http%3A%2F%2Fusfiles.brothersoft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe&name=Tamil%20Font 
http://www.brothersoft.com/d.php?soft_id=513607&url=http%3A%2F%2Fusfiles.brothersoft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe 

Is this what you want?
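
For reference, the change that matters is on the `replace` line. The question's version builds the suffix from `url` (the page address, which contains no `&`), so that suffix never occurs in `suburl3` and `replace` returns the string unchanged. Here is a small sketch of the difference in Python 3 syntax, using one of the URLs from the output above; the variable names are illustrative only.

```python
page_url = 'http://www.brothersoft.com/tamil-font-513607.html'
download_url = ('http://www.brothersoft.com/d.php?soft_id=513607'
                '&url=http%3A%2F%2Ffiles.brothersoft.com%2Fphotograph_graphics'
                '%2Ffont_tools%2Fkeyman.exe&name=Tamil%20Font')

# Question's version: the suffix comes from the page URL. With no '&' in it,
# split('&')[-1] is the whole page URL, which is not a substring of download_url.
wrong_suffix = "&" + page_url.split('&')[-1]
unchanged = download_url.replace(wrong_suffix, "")

# Answer's version: the suffix comes from the download URL itself.
right_suffix = "&" + download_url.split('&')[-1]
trimmed = download_url.replace(right_suffix, "")

# An equivalent alternative: cut once at the last '&'.
trimmed_rsplit = download_url.rsplit('&', 1)[0]

print(unchanged == download_url)
print(trimmed)
```

The `rsplit('&', 1)[0]` form gives the same result here and avoids searching for the suffix a second time.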


It didn't work for me. –


What output do you get? Or what error are you seeing? – ton1c


I edited my question. I tried your code, but nothing changed. –