Using BeautifulSoup on a completely flat HTML hierarchy

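So I'm a web scraping noob and have run into some HTML formatting I've never seen before. All the information I need sits in a completely flat hierarchy. I need to grab the date / movie name / location / amenities.

It's laid out like this (literally like this):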

<div class="caption"> 
    <strong>July 1</strong> 
    <br> 
    <em>Top Gun</em> 
    <br> 
    "Location: Millennium Park" 
    <br> 
    "Amenities: Please be a volleyball tournament..." 
    <br> 
    <em>Captain Phillips</em> 
    <br> 
    "Location: Montgomery Ward Park" 
    <br> 
    <br> 
    <strong>July 2</strong> 
    <br> 
    <em>The Fantastic Mr. Fox </em>

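I'd like to end up with a dictionary or list format so I can write it out as a CSV file with csv.writer or DictWriter; so something like

Output

[July 1, Top Gun, Millennium Park, "Please be a volleyball tournament..."], [July 1, Captain Phillips, Montgomery Ward Park], etc.

The annoying part is that when two movies are shown on the same date, the date appears only before the first movie; every movie listed after that, up to the next date, falls under that initial date.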

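Any suggestions, guys? How do I get multiple movies to fall under the date given in the <strong> tag above them? Maybe something with find_next_siblings, including a check on whether the tag is a <strong> tag?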

Source

2015-05-21 SpicyClubSauce

This is a pretty ugly solution, and you should make it more robust before using it, but something like this should work:

from bs4 import BeautifulSoup 
import re 
import csv 

doc = """<div class="caption"> 
    <strong>July 1</strong> 
    <br> 
    <em>Top Gun</em> 
    <br> 
    "Location: Millennium Park" 
    <br> 
    "Amenities: Please be a volleyball tournament..." 
    <br> 
    <em>Captain Phillips</em> 
    <br> 
    "Location: Montgomery Ward Park" 
    <br> 
    <br> 
    <strong>July 2</strong> 
    <br> 
    <em>The Fantastic Mr. Fox </em> 
    <br> 
    "Location: Somewhere" 
    <br> 
    "Amenities: Something something" 
    <br>""" 

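# Self-close the <br> tags so html.parser keeps everything at one level.
soup = BeautifulSoup(doc.replace("<br>", "<br/>"), "html.parser")

data = []

for date in soup.find_all("strong"):
    # Walk the siblings after this date until the next date starts.
    sibling = date.next_sibling
    while sibling and sibling.name != "strong":
        if sibling.name == "em":
            title = sibling
            # The text node right after each <br/> holds the quoted field.
            location = title.find_next("br").next_element
            extra = location.find_next("br").next_element

            row = [date.text.strip(), title.text.strip()]
            row.append(re.findall('(?<=:)[^"]*', location)[0].strip())
            # Amenities are optional; skip them when the pattern is absent.
            extra_val = re.findall('(?<=:)[^"]*', extra)
            if extra_val:
                row.append(extra_val[0].strip())

            data.append(row)

        sibling = sibling.next_sibling

# Python 3: open in text mode with newline='' as the csv module expects.
with open('foo.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(data)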
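Since the question also mentions DictWriter, here is a minimal sketch of writing the same kind of rows with csv.DictWriter. The column names in FIELDNAMES and the helper names rows_to_dicts and write_csv are my own choices for illustration, not something from the original page, and the sample rows are just the two from the question. It writes to an in-memory buffer so it can be run without touching the disk.

```python
import csv
import io

# Hypothetical column names; pick whatever suits your CSV.
FIELDNAMES = ["date", "title", "location", "amenities"]

def rows_to_dicts(rows):
    """Pair each row's values with column names; short rows lack 'amenities'."""
    return [dict(zip(FIELDNAMES, row)) for row in rows]

def write_csv(rows, handle):
    """Write a header plus one line per row; missing fields become ''."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, restval="")
    writer.writeheader()
    writer.writerows(rows_to_dicts(rows))

sample = [
    ["July 1", "Top Gun", "Millennium Park", "Please be a volleyball tournament..."],
    ["July 1", "Captain Phillips", "Montgomery Ward Park"],
]

buffer = io.StringIO()
write_csv(sample, buffer)
print(buffer.getvalue())
```

The restval="" argument is what lets a movie without amenities still produce a full-width CSV line.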

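Note the doc.replace("<br>", "<br/>"): without it, BeautifulSoup interprets each <br> tag as wrapping all the rest of the document.

To explain the <br> vs. <br/> part further: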

<p></p><em></em>

In the HTML above, em is a sibling of p.

<p><em></em></p>

In this HTML, em is a child of p. Now let's look at how BeautifulSoup parses some HTML:

>>> from bs4 import BeautifulSoup 
>>> BeautifulSoup('<br><p>Hello<br></p>', 'html.parser') 
<br><p>Hello<br/></p></br> 
>>> BeautifulSoup('<br><p>Hello<br></p>', 'html5lib') 
<html><head></head><body><br/><p>Hello<br/></p></body></html>

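html.parser is Python's built-in HTML parser, and it is the one you get by default when no other parser is installed. As you can see, it adds a closing </br> tag and converts one of the <br> tags to <br/>. In short, it doesn't do a good job without closing tags, and that messes up which elements should be siblings.

html5lib, on the other hand, tries to match what a browser would do, and using it instead of doc.replace("<br>", "<br/>") would work as well. However, it is a lot slower, and it doesn't ship with Python or BeautifulSoup, so it needs a separate pip install html5lib to work.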
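If you want to see what the parser did on your own setup, a quick check like the one below can help. It is only a sketch: sibling_names is a made-up helper and the markup is a trimmed-down version of the page in the question. The first print depends on which BeautifulSoup version you have, so I have not written an expected result for it; the second one uses the self-closed tags and should be stable.

```python
from bs4 import BeautifulSoup

html = '<div><strong>July 1</strong><br><em>Top Gun</em><br></div>'

def sibling_names(markup, parser="html.parser"):
    """Return the names of the tags that follow <strong> at the same level."""
    soup = BeautifulSoup(markup, parser)
    return [tag.name for tag in soup.strong.find_next_siblings(True)]

# Raw <br>: the result depends on how your bs4 version treats unclosed tags.
print(sibling_names(html))

# Self-closed <br/>: em stays a sibling of strong -> ['br', 'em', 'br']
print(sibling_names(html.replace("<br>", "<br/>")))
```

If em is missing from the first list, the parser has nested it inside a br, which is exactly the sibling-vs-child mix-up described above.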
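One way to combine the two options is to try html5lib first and fall back to the replace trick when it isn't installed. This is a sketch under that assumption; parse_flat is a name I made up, and the snippet is a shortened stand-in for the real page.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def parse_flat(markup):
    """Parse with html5lib when installed, else self-close <br> for html.parser."""
    try:
        return BeautifulSoup(markup, "html5lib")
    except FeatureNotFound:
        # html5lib is missing: fall back to the built-in parser plus the replace.
        return BeautifulSoup(markup.replace("<br>", "<br/>"), "html.parser")

snippet = '<div class="caption"><strong>July 1</strong><br><em>Top Gun</em><br></div>'
soup = parse_flat(snippet)

# Either way, the title is reachable as a sibling of the date.
print(soup.strong.find_next_sibling("em").get_text())  # Top Gun
```

The trade-off is the one mentioned above: html5lib is slower, so for a large number of pages the replace route may still be preferable.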

Source

2015-05-21 09:43:22

Hey @Erik Vesteraas, I don't really understand the purpose of doc.replace. Could you elaborate a bit? Thanks! – SpicyClubSauce
