2012-08-03 67 views

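How to replace a specific part of a string in Python

I am trying to scrape Good.is. My code so far gives me the regular-size image (change the if statement to True to download it), but I want the higher-resolution picture. How can I replace part of a string so that I can download the high-res version? I want to change the URL http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html to http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flat.html (only the ending differs). My code is: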

import os, urllib, urllib2 
from BeautifulSoup import BeautifulSoup 
import HTMLParser 

parser = HTMLParser.HTMLParser() 

# make folder. 
folderName = 'Good.is' 
if not os.path.exists(folderName): 
    os.makedirs(folderName) 


list = [] 
# Python ranges start from the first argument and iterate up to one 
# less than the second argument, so we need 36 + 1 = 37 
for i in range(1, 37): 
    list.append("http://www.good.is/infographics/page:" + str(i) + "/sort:recent/range:all") 


listIterator1 = []
# list holds 36 URLs, so the valid indexes are 0..35
listIterator1[:] = range(0, len(list))
counter = 0 


for x in listIterator1: 


    soup = BeautifulSoup(urllib2.urlopen(list[x]).read()) 

    body = soup.findAll("ul", attrs = {'id': 'gallery_list_elements'}) 

    number = len(body[0].findAll("p")) 
    listIterator = [] 
    listIterator[:] = range(0,number)   

    for i in listIterator: 
        paragraphs = body[0].findAll("p")
        nextArticle = body[0].findAll("a")[2]
        text = body[0].findAll("p")[i]

        if len(paragraphs) > 0:
            #print image['src']
            counter += 1
            print counter
            print parser.unescape(text.getText())
            print "http://www.good.is" + nextArticle['href']
            originalArticle = "http://www.good.is" + nextArticle['href']
            article = BeautifulSoup(urllib2.urlopen(originalArticle).read())
            title = article.findAll("div", attrs = {'class': 'title_and_image'})
            getTitle = title[0].findAll("h1")
            article1 = article.findAll("div", attrs = {'class': 'body'})
            articleImage = article1[0].find("p")
            betterImage = articleImage.find("a")
            articleImage1 = articleImage.find("img")
            paragraphsWithinSection = article1[0].findAll("p")
            print betterImage['href']
            if len(paragraphsWithinSection) > 1:
                articleText = article1[0].findAll("p")[1]
            else:
                articleText = article1[0].findAll("p")[0]
            print articleImage1['src']
            print parser.unescape(getTitle)
            if not articleText is None:
                print parser.unescape(articleText.getText())
            print '\n'
            link = articleImage1['src']
            x += 1


            actually_download = False
            if actually_download:
                filename = link.split('/')[-1]
                urllib.urlretrieve(link, filename)

Answers

3

Take a look at str.replace. If that isn't flexible enough, you'll need regular expressions (the re module, probably re.sub).

>>> str1="http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html" 
>>> str1.replace("flash","flat") 
'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flat.html' 
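A minimal sketch of how this could be wired into the scraper, assuming the link to convert is the href printed by the question's code. The helper name to_flat_url is hypothetical. Matching on "/flash.html" rather than just "flash" is an extra precaution that narrows what gets replaced; it is not part of the answer above.

```python
def to_flat_url(href):
    """Swap the '/flash.html' file name for '/flat.html' (hypothetical helper)."""
    # Including the slash and the extension keeps an unrelated "flash"
    # elsewhere in the URL from being rewritten.
    return href.replace("/flash.html", "/flat.html")


url = "http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html"
print(to_flat_url(url))

# str.replace also takes an optional count, applied from the left:
print("flash/flash.html".replace("flash", "flat", 1))  # flat/flash.html
```

Note that the count argument limits replacements starting from the left, so on its own it cannot target only the last occurrence.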
0

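@mgilson has a good solution, but the problem is that it replaces every occurrence of the string. If the word "flash" appears elsewhere in the URL (not just in the trailing file name), you will get multiple replacements: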

>>> s = 'hello there hello'
>>> s.replace('hello','world')
'world there world' 

Another solution is to replace everything after the last / with flat.html:

>>> url = 'http://www.google.com/this/is/sample/url/flash.html' 
>>> url[:url.rfind('/')+1]+'flat.html' 
'http://www.google.com/this/is/sample/url/flat.html' 
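The same idea can be wrapped in a small function. This is only a sketch: the name replace_last_segment and the example.com URL are made up for illustration, and it uses str.rpartition instead of rfind plus slicing.

```python
def replace_last_segment(url, new_name):
    """Return url with everything after its final '/' replaced by new_name."""
    # rpartition splits on the LAST '/', so earlier path parts are untouched.
    head, sep, _old_name = url.rpartition("/")
    return head + sep + new_name


# "flash" also appears as a directory name here; only the file name changes.
print(replace_last_segment("http://example.com/flash/games/flash.html", "flat.html"))
# http://example.com/flash/games/flat.html
```

This treats the URL as a plain string, so a query string that itself contains a "/" would be cut in the wrong place. If there is no "/" at all, the function simply returns new_name.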
0

Using urlparse you can do a few bits and bobs:

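from urlparse import urlsplit, urlunsplit

s = 'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html'

url = urlsplit(s)
# split the path into the directory part and the file name
head, tail = url.path.rsplit('/', 1)
# join by hand: urljoin(head, 'flat.html') would drop the last directory,
# because head has no trailing slash
new_path = head + '/flat.html'
print urlunsplit(url._replace(path=new_path))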
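For reference, here is a sketch of the same approach on Python 3, where the urlparse module lives at urllib.parse. The helper name swap_filename and the example.com URL are assumptions for illustration. Using posixpath for the path manipulation is one possible choice, not something the answer prescribes.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit  # Python 2: from urlparse import ...


def swap_filename(url, new_name):
    """Replace the last path component of url, keeping query and fragment."""
    parts = urlsplit(url)
    # dirname drops the file name; join puts the new one in its place.
    directory = posixpath.dirname(parts.path)
    new_path = posixpath.join(directory, new_name)
    return urlunsplit(parts._replace(path=new_path))


print(swap_filename("http://example.com/a/flash.html?ref=/x/y#top", "flat.html"))
# http://example.com/a/flat.html?ref=/x/y#top
```

Because only the path component is rewritten, slashes inside the query string or fragment are left alone, which is the main advantage over plain string slicing.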
1

I think the safest and easiest way is to use a regular expression:

import re
url = 'http://www.google.com/this/is/sample/url/flash.html'
# raw string so that \. reaches the regex engine unchanged
newUrl = re.sub(r'flash\.html$', 'flat.html', url)

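The "$" means the pattern only matches at the end of the string. This solution behaves correctly even if your URL contains the substring "flash.html" somewhere other than the end, and it leaves the string unchanged if it does not end with 'flash.html' (which I think is the correct behaviour).

See also: http://docs.python.org/library/re.html#re.sub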
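The three cases described here can be checked with a short sketch. The function name to_flat, the compiled pattern FLASH_SUFFIX, and the example.com URLs are illustrative assumptions.

```python
import re

# Anchored at the end of the string; the dot is escaped so it matches literally.
FLASH_SUFFIX = re.compile(r"flash\.html$")


def to_flat(url):
    """Rewrite a trailing 'flash.html' to 'flat.html'; otherwise return url as-is."""
    return FLASH_SUFFIX.sub("flat.html", url)


# Ends with flash.html: rewritten.
print(to_flat("http://example.com/a/flash.html"))
# Contains flash.html earlier as well: only the trailing one changes.
print(to_flat("http://example.com/flash.html/archive/flash.html"))
# Does not end with flash.html: returned unchanged.
print(to_flat("http://example.com/a/index.html"))
```

One caveat: "$" also matches just before a trailing newline, so strings read from a file may need stripping first, or the pattern can end in \Z for a strict end-of-string match.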