2014-12-28

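Using regular expressions in BeautifulSoup to parse strings in Python

I have a series of strings that look like "Saturday, December 27th, 2014". I want to drop the "Saturday" and save each file under a name like "141227", which is year + month + day. So far everything works except the regular expressions for daypos and yearpos. Both give the same error: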

Traceback (most recent call last):
  File "scrapewaybackblog.py", line 17, in <module>
    daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
TypeError: expected a character buffer object

What is a character buffer object? Does this mean something is wrong with my expression? Here is my script:

for i in xrange(3, 1, -1):
    page = urllib2.urlopen("http://web.archive.org/web/20090204221349/http://www.americansforprosperity.org/nationalblog?page={}".format(i))
    soup = BeautifulSoup(page.read())
    snippet = soup.find_all('div', attrs={'class': 'blog-box'})
    for div in snippet:
        byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
        text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')

        monthpos = byline.find(",")
        daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
        yearpos = byline.find(re.compile("[A-Z][a-z]*\D\d*\w*\s"))
        endpos = monthpos + len(byline)

        month = byline[monthpos+1:daypos]
        day = byline[daypos+0:yearpos]
        year = byline[yearpos+2:endpos]

        output_files_pathname = 'Data/' # path where output will go
        new_filename = year + month + day + ".txt"
        outfile = open(output_files_pathname + new_filename,'w')
        outfile.write(date)
        outfile.write("\n")
        outfile.write(text)
        outfile.close()
    print "finished another url from page {}".format(i)

I haven't figured out how to turn December into 12 yet, but that is for another time. Please just help me find the right positions.

Answer
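
On the error itself: str.find() only accepts a plain string to search for, not a compiled pattern, and "character buffer object" is Python 2's wording for that string-like argument. The regular expression is not what it is complaining about. To get the position of a pattern, use re.search() and the match object's start(). A minimal sketch, assuming a sample byline in the format described in the question (the sample string and variable names are illustrative only):

```python
import re

# Assumed sample byline; the real pages may punctuate the date differently.
byline = "Saturday, December 27th, 2014"

# str.find() takes a plain substring and returns its index (or -1).
comma_pos = byline.find(",")

# Passing a compiled pattern to str.find() raises TypeError.
try:
    byline.find(re.compile(r"[A-Z][a-z]*\s"))
    find_error = None
except TypeError as exc:
    find_error = exc

# For a pattern, search with re and read the position off the match object.
match = re.search(r"\d+", byline)
day_start = match.start() if match else -1
day_text = match.group() if match else ""
```

With positions from match.start() and match.end(), the slicing in the question could be made to work, although parsing the date as a whole is less fragile.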

Instead of parsing the date string with regular expressions, parse it with dateutil:

from dateutil.parser import parse 

for div in soup.select('div.blog-box'): 
    byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8') 
    text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8') 

    dt = parse(byline) 
    new_filename = dt.strftime("%y%m%d") + ".txt"  # zero-padded YYMMDD
    ... 

Alternatively, you can parse the string with datetime.strptime(), but you need to take care of the ordinal suffixes first:

import re
from datetime import datetime
byline = re.sub(r"(?<=\d)(st|nd|rd|th)", "", byline)
dt = datetime.strptime(byline, '%A, %B %d %Y')

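re.sub() here finds an st, nd, rd, or th string that comes right after a digit and replaces that suffix with an empty string. After that, the date string matches the '%A, %B %d %Y' format. If your bylines keep a comma after the day, the format needs it too ('%A, %B %d, %Y'). For the format codes, see the "strftime() and strptime() Behavior" section of the datetime documentation.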
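
The two steps can be wrapped into one helper that also produces the two-digit year + month + day name from the question. This is only a sketch: the function name, the default format string, and the sample bylines are assumptions, and %A/%B expect English day and month names (the default C locale).

```python
import re
from datetime import datetime

# Ordinal suffix directly after a digit: 1st, 2nd, 3rd, 4th ...
SUFFIX_RE = re.compile(r"(?<=\d)(st|nd|rd|th)")


def byline_to_stamp(byline, fmt="%A, %B %d, %Y"):
    """Return a YYMMDD string for a byline such as 'Saturday, December 27th, 2014'.

    `fmt` is an assumption about the byline layout; adjust it to the real pages.
    """
    cleaned = SUFFIX_RE.sub("", byline.strip())
    dt = datetime.strptime(cleaned, fmt)
    # %y, %m and %d are zero-padded, so December becomes 12 and the 4th becomes 04.
    return dt.strftime("%y%m%d")


stamp = byline_to_stamp("Saturday, December 27th, 2014")
no_comma = byline_to_stamp("Wednesday, February 4th 2009", fmt="%A, %B %d %Y")
```

Because strptime() already maps the month name to a number, no separate "December = 12" lookup is needed.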


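Some other notes, all reflected in the fixed version below: use the with context manager when dealing with files, build the output path with os.path.join() instead of string concatenation, and select the blog boxes directly with a CSS selector (soup.select('div.blog-box')).

Fixed version: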

import os 
import urllib2 

from bs4 import BeautifulSoup 
from dateutil.parser import parse 


for i in xrange(3, 1, -1): 
    page = urllib2.urlopen("http://web.archive.org/web/20090204221349/http://www.americansforprosperity.org/nationalblog?page={}".format(i)) 
    soup = BeautifulSoup(page) 

    for div in soup.select('div.blog-box'):
        byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
        text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')

        dt = parse(byline)

        # zero-padded two-digit year + month + day, as asked in the question
        new_filename = dt.strftime("%y%m%d") + ".txt"
        with open(os.path.join('Data', new_filename), 'w') as outfile:
            outfile.write(byline)
            outfile.write("\n")
            outfile.write(text)

    print "finished another url from page {}".format(i) 
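
One detail to check when building the filename: unpadded fields such as {dt.month} and {dt.day} drop leading zeros, so different dates can produce the same name, and the result is not the two-digit year + month + day form ("141227") asked for in the question. A small sketch of the difference, using strftime('%y%m%d'); the helper names and sample dates are made up for illustration:

```python
from datetime import datetime


def unpadded_name(dt):
    # Plain attribute formatting: no zero padding, four-digit year.
    return "{dt.year}{dt.month}{dt.day}.txt".format(dt=dt)


def padded_name(dt):
    # strftime pads each field to two digits: YYMMDD.
    return dt.strftime("%y%m%d") + ".txt"


jan_12 = datetime(2009, 1, 12)
nov_2 = datetime(2009, 11, 2)

# Both dates collapse to the same unpadded name ...
collision = unpadded_name(jan_12) == unpadded_name(nov_2)
# ... while the padded names stay distinct and sort chronologically.
names = [padded_name(jan_12), padded_name(nov_2)]
```

With colliding names, a later post would silently overwrite an earlier one, since the files are opened in 'w' mode.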

You are awesome! Thank you.
