Using regular expressions with BeautifulSoup to parse a string in Python

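I have a series of strings like "Saturday, December 27th 2014". I want to drop the "Saturday" and save the file under the name "141227", which is year + month + day. So far everything works except that I can't get the regular expressions for daypos or yearpos to work. Both give the same error: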

Traceback (most recent call last):
  File "scrapewaybackblog.py", line 17, in <module>
    daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
TypeError: expected a character buffer object

What is a character buffer object? Does it mean something is wrong with my expression? Here is my script:

for i in xrange(3, 1, -1):
    page = urllib2.urlopen("http://web.archive.org/web/20090204221349/http://www.americansforprosperity.org/nationalblog?page={}".format(i))
    soup = BeautifulSoup(page.read())
    snippet = soup.find_all('div', attrs={'class': 'blog-box'})
    for div in snippet:
        byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
        text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')

        monthpos = byline.find(",")
        daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
        yearpos = byline.find(re.compile("[A-Z][a-z]*\D\d*\w*\s"))
        endpos = monthpos + len(byline)

        month = byline[monthpos+1:daypos]
        day = byline[daypos+0:yearpos]
        year = byline[yearpos+2:endpos]

        output_files_pathname = 'Data/' # path where output will go
        new_filename = year + month + day + ".txt"
        outfile = open(output_files_pathname + new_filename,'w')
        outfile.write(date)
        outfile.write("\n")
        outfile.write(text)
        outfile.close()
    print "finished another url from page {}".format(i)

I haven't figured out how to make December = 12 yet, but that is for another time. Please help me find the right positions.

Source

2014-12-28 Jolijt Tamanaha

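The TypeError occurs because byline is a plain string by that point, so byline.find() is str.find(), which expects a substring rather than a compiled pattern (BeautifulSoup's find() is not involved). Rather than parsing the date string with regular expressions, though, parse it with dateutil: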

from dateutil.parser import parse 

for div in soup.select('div.blog-box'): 
    byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8') 
    text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8') 

    dt = parse(byline) 
    new_filename = dt.strftime("%y%m%d") + ".txt"  # zero-padded, e.g. 141227.txt
    ...
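
For the file name itself, zero-padding matters if the names are to stay unambiguous. The following is a minimal sketch using only the standard library; the helper name `date_filename` is hypothetical and not part of the original answer. It produces the `141227` style the question asks for and shows how unpadded fields can collide:

```python
from datetime import datetime


def date_filename(dt, ext=".txt"):
    """Build a YYMMDD file name (hypothetical helper, for illustration only)."""
    # %y = two-digit year, %m and %d = zero-padded month and day
    return dt.strftime("%y%m%d") + ext


# Unpadded fields can collide: Jan 11 and Nov 1 both become "2014111".
jan_11 = datetime(2014, 1, 11)
nov_1 = datetime(2014, 11, 1)
unpadded = "{dt.year}{dt.month}{dt.day}"
print(unpadded.format(dt=jan_11) == unpadded.format(dt=nov_1))

# The padded names stay distinct.
print(date_filename(jan_11))
print(date_filename(nov_1))
```

Any `datetime` returned by `parse()` can be passed to such a helper in place of the hand-built values used here.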

Alternatively, you can parse the string with datetime.strptime(), but you need to take care of the suffixes:

byline = re.sub(r"(?<=\d)(st|nd|rd|th)", "", byline) 
dt = datetime.strptime(byline, '%A, %B %d %Y')

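Here re.sub() finds an st, nd, rd or th string that comes directly after a digit and replaces that suffix with an empty string. After that, the date string matches the '%A, %B %d %Y' format; see: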

strftime() and strptime() Behavior
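
Put together, the strptime() route might look like the sketch below. The sample string and the helper name `parse_byline` are assumptions for illustration; the pattern and format string are the ones given above. Note that `%A` and `%B` match English day and month names only under an English (or default C) locale.

```python
import re
from datetime import datetime

# Ordinal suffix directly after a digit. The lookbehind keeps the "rd"
# in "Saturday" and the "st" in "August" from being stripped.
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)")


def parse_byline(byline):
    """Parse a byline shaped like 'Saturday, December 27th 2014' (assumed shape)."""
    cleaned = ORDINAL_SUFFIX.sub("", byline.strip())
    return datetime.strptime(cleaned, "%A, %B %d %Y")


dt = parse_byline("Saturday, December 27th 2014")
print(dt.strftime("%y%m%d"))
```

Without the lookbehind, a plain `(st|nd|rd|th)` pattern would also eat letters inside the weekday and month names, and strptime() would then reject the string.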

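Some other notes:

you can pass the urlopen() result directly to the BeautifulSoup constructor
instead of find_all() by class name, use the CSS selector div.blog-box
to join system paths, use os.path.join()
use the with context manager when dealing with files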
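
The last two notes can be sketched in isolation. The helper name `save_post` and the sample values are hypothetical, and a temporary directory stands in for the `Data` directory:

```python
import os
import tempfile


def save_post(directory, filename, byline, text):
    """Write byline and text to directory/filename (hypothetical helper)."""
    path = os.path.join(directory, filename)  # portable path separator
    # The with block closes the file even if one of the writes raises.
    with open(path, "w") as outfile:
        outfile.write(byline)
        outfile.write("\n")
        outfile.write(text)
    return path


out_dir = tempfile.mkdtemp()  # stand-in for the 'Data' directory
saved = save_post(out_dir, "141227.txt", "Saturday, December 27th 2014", "post body")
with open(saved) as infile:
    print(infile.read())
```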

Fixed version:

import os
import urllib2

from bs4 import BeautifulSoup
from dateutil.parser import parse


for i in xrange(3, 1, -1):
    page = urllib2.urlopen("http://web.archive.org/web/20090204221349/http://www.americansforprosperity.org/nationalblog?page={}".format(i))
    soup = BeautifulSoup(page)  # the response object is passed directly

    for div in soup.select('div.blog-box'):
        byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
        text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')

        dt = parse(byline)  # no manual position slicing needed

        new_filename = dt.strftime("%y%m%d") + ".txt"  # zero-padded, e.g. 141227.txt
        with open(os.path.join('Data', new_filename), 'w') as outfile:
            outfile.write(byline)
            outfile.write("\n")
            outfile.write(text)

    print "finished another url from page {}".format(i)

Source

2014-12-28 00:23:27 alecxe

You're awesome! Thanks.
