How do I split a URL and remove the part of the string I don't need?

import urllib 
from bs4 import BeautifulSoup 

f = open('log1.txt', 'w') 

url ='http://www.brothersoft.com/tamil-font-513607.html' 
pageUrl = urllib.urlopen(url) 
soup = BeautifulSoup(pageUrl) 

for a in soup.select("div.class1.coLeft a[href]"): 
    try: 
     suburl = ('http://www.brothersoft.com'+a['href']).encode('utf-8','replace') 
     f.write ('http://www.brothersoft.com'+a['href']+'\n') 
    except: 
     print 'cannot read' 
     f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n') 

     pass 

    content = urllib.urlopen(suburl) 
    soup = BeautifulSoup(content) 
    for a in soup.select("div.Sever1.coLeft a[href]"): 
     try: 
      suburl2 = ('http://www.brothersoft.com'+a['href']).encode('utf-8','replace') 
      f.write ('http://www.brothersoft.com'+a['href']+'\n') 
     except: 
      print 'cannot read' 
      f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n') 

      pass 

     content = urllib.urlopen(suburl2) 
     soup = BeautifulSoup(content) 
     try: 
      suburl3 = soup.find('body')['onload'][10:-2] 
      print suburl3.replace("&" + url.split('&')[-1],"") 
      #f.write (soup.find('body')['onload'][10:-2]+'\n') 
     except: 
      print 'cannot read' 
      f.write(soup.find('body')['onload'][10:-2]+'\n') 

      pass 
f.close()

The output I want should look like this:

http://www.brothersoft.com/d.php?soft_id=159403&url=http%3A%2F%2Ffiles.brothersoft.com%2Fmp3_audio%2Fmidi_tools%2FSynthFontSetup.exe

Source

2013-08-30 wan mohd payed

Try this:

url = "http://www.brothersoft.com/d.php?soft_id=159403&url=http%3A%2F%2Ffiles.brothersoft.com%2Fmp3_audio%2Fmidi_tools%2FSynthFontSetup.exe&name=SynthFont" 
print url.replace("&" + url.split('&')[-1],"")

Output:

http://www.brothersoft.com/d.php?soft_id=159403&url=http%3A%2F%2Ffiles.brothersoft.com%2Fmp3_audio%2Fmidi_tools%2FSynthFontSetup.exe
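The one-liner above removes whatever follows the last "&", so it only works when the unwanted parameter comes last. As a hedged alternative, here is a minimal sketch that drops a query parameter by name instead. It is written in Python 3 syntax (the code in this thread is Python 2), and the helper name drop_query_param is hypothetical, not something from the thread or from any library. It only splits the query string and never decodes it, so the percent-encoding of the remaining parameters is left untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_query_param(url, key):
    """Return url without any query parameter named key (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep every "key=value" pair whose key differs from the one to drop.
    kept = [pair for pair in parts.query.split('&')
            if pair.split('=', 1)[0] != key]
    return urlunsplit(parts._replace(query='&'.join(kept)))


url = ("http://www.brothersoft.com/d.php?soft_id=159403"
       "&url=http%3A%2F%2Ffiles.brothersoft.com%2Fmp3_audio%2Fmidi_tools%2FSynthFontSetup.exe"
       "&name=SynthFont")
print(drop_query_param(url, "name"))
```

Unlike the replace-based version, this does not depend on the position of the name parameter in the URL.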

Your code (with changes):

import urllib
from bs4 import BeautifulSoup

base = 'http://www.brothersoft.com'
url = base + '/tamil-font-513607.html'

f = open('log1.txt', 'w')
soup = BeautifulSoup(urllib.urlopen(url))

for a in soup.select("div.class1.coLeft a[href]"):
    # Build the absolute URL of the download page and log it.
    suburl = (base + a['href']).encode('utf-8', 'replace')
    f.write(suburl + '\n')

    soup2 = BeautifulSoup(urllib.urlopen(suburl))
    for a2 in soup2.select("div.Sever1.coLeft a[href]"):
        # Build the absolute URL of each mirror page and log it.
        suburl2 = (base + a2['href']).encode('utf-8', 'replace')
        f.write(suburl2 + '\n')

        soup3 = BeautifulSoup(urllib.urlopen(suburl2))
        try:
            # The download link sits inside the body's onload handler.
            suburl3 = soup3.find('body')['onload'][10:-2]
        except (TypeError, KeyError):
            # No <body> tag, or no onload attribute on it.
            print 'cannot read'
            f.write('cannot read:' + suburl2 + '\n')
            continue
        print suburl3
        # Split suburl3 itself (not url) and drop its last "&key=value" pair.
        print suburl3.replace("&" + suburl3.split('&')[-1], "")
f.close()

Output:

http://www.brothersoft.com/d.php?soft_id=513607&url=http%3A%2F%2Ffiles.brothersoft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe&name=Tamil%20Font 
http://www.brothersoft.com/d.php?soft_id=513607&url=http%3A%2F%2Ffiles.brothersoft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe 
http://www.brothersoft.com/d.php?soft_id=513607&url=http%3A%2F%2Fusfiles.brothersoft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe&name=Tamil%20Font 
http://www.brothersoft.com/d.php?soft_id=513607&url=http%3A%2F%2Fusfiles.brothersoft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe

Is this what you want?
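The only functional difference from the code in the question is which string gets split: the question splits url (the page address, which contains no "&"), while the version here splits suburl3 (the download link). The following sketch, in Python 3 syntax and using the URLs from this thread, illustrates why the first variant leaves the link unchanged. The variable names are made up for the illustration, and the str.rpartition line is an alternative added here, not part of the original answer.

```python
page_url = 'http://www.brothersoft.com/tamil-font-513607.html'
download_url = ('http://www.brothersoft.com/d.php?soft_id=513607'
                '&url=http%3A%2F%2Ffiles.brothersoft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe'
                '&name=Tamil%20Font')

# Question's version: page_url has no '&', so split() returns the whole
# page URL and the replace() finds nothing to remove.
wrong_suffix = "&" + page_url.split('&')[-1]
unchanged = download_url.replace(wrong_suffix, "")

# Answer's version: the suffix is taken from the download link itself.
right_suffix = "&" + download_url.split('&')[-1]
trimmed = download_url.replace(right_suffix, "")

# Alternative: keep everything before the last '&'. Note that this
# yields an empty string if the URL contains no '&' at all.
trimmed_alt = download_url.rpartition('&')[0]

print(unchanged == download_url)
print(trimmed)
print(trimmed == trimmed_alt)
```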

Source

2013-08-30 08:22:18 ton1c

It didn't work for me. –

What output do you get? Or what error are you seeing? – ton1c

I've edited my question. I tried your code, but nothing changed. –
