BeautifulSoup: exception when parsing HTML tags

I am extracting some information from HTML files. However, some of the files do not contain the tag <p class="p p1"> date </p>, and in that case I get:

AttributeError: 'NoneType' object has no attribute 'strip'

Also, in some files the date is not inside that tag at all. One example I found is:

<time content="2005-11-11T19:09:08Z" itemprop="datePublished"> 
Nov. 11, 2005 2:09 PM ET 
</time>

How can I solve these two problems?

My code:

month_list = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December', 'Jan', 'Feb', 'Aug', 'Oct', 'Dec']


def first_date_p():
    for p in soup.find_all('p', {"class": "p p1"}):
        for month in month_list:
            if month in p.get_text():
                first_date_p = p.get_text()
                date_start = first_date_p.index(month)
                date_text = first_date_p[date_start:]
                return date_text
            else:
                # If the tag exists but does not contain a date.
                month = 'No Date/Error'
                return month.strip()

Source

2017-05-30 Michael Lin

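It seems to me that you should start by identifying the characteristics of the date that you expect to hold for *all* of the HTML files. In practice there will probably be more date formats, and you will need to handle each one separately. How many different formats do you have?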
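One way to handle each format separately, as the comment suggests, is to try a list of known layouts in turn. This is only a rough sketch: the `DATE_FORMATS` list and the `parse_date` helper are hypothetical names, the three layouts are guesses based on the sample "Nov. 11, 2005", and month names are assumed to be in English (`strptime` depends on the current locale).

```python
from datetime import datetime

# Hypothetical list of layouts; extend it with the formats the files really use.
DATE_FORMATS = ["%b. %d, %Y", "%b %d, %Y", "%B %d, %Y"]


def parse_date(text):
    """Try each known layout in turn; return a date, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # this layout did not fit, try the next one
    return None
```

Returning None for an unknown layout lets the caller decide what to do, instead of failing later with an AttributeError.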

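If you want to make sure that the selected 'p' tags always contain some text, you can set the text argument to True (in Beautiful Soup 4.4.0 and later this argument is named string; text is the older name), i.e.: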

soup.find_all('p', {"class": "p p1"}, text=True)

Otherwise, if you want to get all of the 'p' tags even when they contain no text, you can convert None to a string, for example:

str(p.get_text()).strip()
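Note that get_text() itself returns a string, so the None in the traceback most likely comes from the function falling off the end without reaching a return when no matching 'p' exists. A minimal sketch of a version that makes this case explicit follows. The name `first_date_text`, the `DATE_RE` pattern, and passing `soup` as a parameter are assumptions for illustration, not part of the original code; the pattern assumes dates such as "Nov. 11, 2005" or "November 11, 2005".

```python
import re

from bs4 import BeautifulSoup

# Assumption: dates look like "Nov. 11, 2005" or "November 11, 2005".
DATE_RE = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{1,2},\s+\d{4}"
)


def first_date_text(soup):
    """Return the text starting at the first date found in a <p class="p p1">, or None."""
    for p in soup.find_all("p", {"class": "p p1"}):
        text = p.get_text()
        match = DATE_RE.search(text)
        if match:
            return text[match.start():].strip()
    # No matching tag, or no tag containing a date: say so explicitly.
    return None
```

A caller can then supply its own fallback, for example `date_text = first_date_text(soup) or 'No Date/Error'`, so .strip() is never called on None.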

As for your second question, you can read the 'content' attribute of the 'time' tag, like this:

soup.find('time').get('content')
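soup.find returns None when a file has no 'time' tag, which would raise the same kind of AttributeError, so it may be worth guarding that call too. The sketch below uses a hypothetical helper `published_at` and assumes the attribute always has the layout of the sample value "2005-11-11T19:09:08Z", with the trailing "Z" meaning UTC.

```python
from datetime import datetime, timezone

from bs4 import BeautifulSoup


def published_at(soup):
    """Return the datePublished timestamp as an aware datetime, or None if absent."""
    tag = soup.find("time", itemprop="datePublished")
    if tag is None or not tag.get("content"):
        return None
    # Assumption: the value matches the sample layout and is in UTC ("Z").
    parsed = datetime.strptime(tag["content"], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)
```

Reading the machine-readable attribute avoids having to parse the visible text "Nov. 11, 2005 2:09 PM ET" at all.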

Source

2017-05-30 06:07:18
