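Using BeautifulSoup findAll with multiple classes/tags to combine multi-line output into a single row

I'm trying to build a scraper that collects text from a web page. I'm looking at two specific divs with different class names ("product-image" and "product-details"). I loop through them and grab the text of every "a" and "dd" tag inside each div.

It's worth mentioning that this is the first Python program I've ever written...

Here is my code: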

list_of_rows = []
for row in soup.findAll(True, {"class": ["product-image", "product-details"]}):
    list_of_cells = []
    for cell in row.findAll(['a', 'dd']):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

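When I print list_of_rows, I get the following output on each pass through the loop:

[price]

[title], [author], [publisher], [blah], [blah], [blah]

[price] comes from the "product-image" div block. [title] and the rest come from the "product-details" div block.

So basically, the findAll and the loop I've written produce a separate row for each div block I loop over. The result I want is the output of both blocks on one row, like this:

[price], [title], [author], [publisher], [blah], [blah], [blah]

Is there a way to do this within my existing process, or do I need to break it into multiple loops, extract the data separately, and then combine it? I've looked through all the Q&A on StackOverflow and other sites, and I can find examples of findAll loops with multiple classes, but I can't find any example of how to reduce the output to a single row.

Below is a snippet of the web page I'm parsing. This snippet appears 1 to x times in the HTML I'm parsing, where x is the number of products on the page: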

<div class="product-image"> 
    <a class="thumb" href="/Store/Details/life-on-the-screen/_/R-9780684833484B"><img src="http://images.bookdepot.com/covers/large/isbn978068/9780684833484-l.jpg" alt="" class="cover" /> 
     <div class="price "><span>$</span>2.25 
     </div> 
    </a> 
</div> 

<div class="product-details"> 
    <dl> 
     <dt><div class="nowrap"><span><a href="/Store/Details/life-on-the-screen/_/R-9780684833484B" title="Life On The Screen">Life On The Screen</a></span></div></dt> 
     <dd class="type"><div class="nowrap"><span><a href="/Store/Browse/turkle-sherry/_/N-4294697489/Ne-4">Turkle, Sherry</a></span></div></dd> 
     <dd class="type"><div class="nowrap"><a href="/Store/Browse/simon-and-schuster/_/N-4294151338/Ne-5">Simon and Schuster</a></div></dd> 
     <dd class="type">(Paperback)</dd> 
     <dd class="type">Computers &amp; Internet</dd> 
     <dd class="type">ISBN: 9780684833484</dd> 
     <dd>List $15.00 - Qty: 9</dd> 
      </dl> 
</div>

Any pointers or help are greatly appreciated!

2017-01-26 Chad Whitney

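From your question, I came up with two possible results. I'm not sure which one you're looking for, so I'm posting both cases.

Case 1: use extend instead of append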

from bs4 import BeautifulSoup 
data = """<div class="product-image"> 
    <a class="thumb" href="/Store/Details/life-on-the-screen/_/R-9780684833484B"><img src="http://images.bookdepot.com/covers/large/isbn978068/9780684833484-l.jpg" alt="" class="cover" /> 
     <div class="price "><span>$</span>2.25 
     </div> 
    </a> 
</div> 

<div class="product-details"> 
    <dl> 
     <dt><div class="nowrap"><span><a href="/Store/Details/life-on-the-screen/_/R-9780684833484B" title="Life On The Screen">Life On The Screen</a></span></div></dt> 
     <dd class="type"><div class="nowrap"><span><a href="/Store/Browse/turkle-sherry/_/N-4294697489/Ne-4">Turkle, Sherry</a></span></div></dd> 
     <dd class="type"><div class="nowrap"><a href="/Store/Browse/simon-and-schuster/_/N-4294151338/Ne-5">Simon and Schuster</a></div></dd> 
     <dd class="type">(Paperback)</dd> 
     <dd class="type">Computers &amp; Internet</dd> 
     <dd class="type">ISBN: 9780684833484</dd> 
     <dd>List $15.00 - Qty: 9</dd> 
      </dl> 
</div>""" 

soup = BeautifulSoup(data,'lxml') 

list_of_rows = []
for row in soup.findAll(True, {"class": ["product-image", "product-details"]}):
    list_of_cells = []
    for cell in row.findAll(['a', 'dd']):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    # extend() adds the cells themselves, so both divs end up in one flat list
    list_of_rows.extend(list_of_cells)
print(list_of_rows)

Output

[u'\n$2.25\n  \n', u'Life On The Screen', u'Turkle, Sherry', u'Turkle, Sherry', u'Simon and Schuster', u'Simon and Schuster', u'(Paperback)', u'Computers & Internet', u'ISBN: 9780684833484', u'List $15.00 - Qty: 9']
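The author and publisher appear twice in the output above because findAll(['a', 'dd']) matches both a "dd" and the "a" nested inside it. Here is a minimal sketch of one way to avoid that, assuming Python 3 with bs4 installed. The markup is a trimmed-down copy of the snippet from the question, the built-in html.parser is used instead of lxml, and the hrefs are placeholders:

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the "product-details" markup from the question
# (hrefs shortened to placeholders).
html = """
<div class="product-details">
  <dl>
    <dt><a href="/t" title="Life On The Screen">Life On The Screen</a></dt>
    <dd class="type"><span><a href="/a">Turkle, Sherry</a></span></dd>
    <dd class="type"><a href="/p">Simon and Schuster</a></dd>
    <dd class="type">(Paperback)</dd>
  </dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

cells = []
for cell in soup.find_all(["a", "dd"]):
    # An <a> nested inside a <dd> would repeat the text of that <dd>, so skip it.
    if cell.name == "a" and cell.find_parent("dd") is not None:
        continue
    cells.append(cell.get_text(strip=True))

print(cells)
```

The title is still picked up, because its "a" sits inside a "dt" rather than a "dd".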

Case 2: a list of lists. You need to strip the newline characters from the HTML text

list_of_rows = []
for row in soup.findAll(True, {"class": ["product-image", "product-details"]}):
    list_of_cells = []
    for cell in row.findAll(['a', 'dd']):
        text = cell.text.replace('&nbsp;', '')
        # strip() removes the surrounding newlines and spaces
        list_of_cells.append(text.strip())
    list_of_rows.append(list_of_cells)
print(list_of_rows)

Output

[[u'$2.25'], [u'Life On The Screen', u'Turkle, Sherry', u'Turkle, Sherry', u'Simon and Schuster', u'Simon and Schuster', u'(Paperback)', u'Computers & Internet', u'ISBN: 9780684833484', u'List $15.00 - Qty: 9']]
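If the page has several products and you want exactly one row per product, another option is to loop over the "product-image" divs only and pull in the "product-details" div that follows each one. This is only a sketch, not what the code above does: it assumes Python 3 with bs4, it assumes the two divs are siblings as in the snippet from the question (the real page may nest them differently), and it reads "dt" and "dd" rather than "a" and "dd". The second product is made up for illustration.

```python
from bs4 import BeautifulSoup

# Two products; the first is trimmed from the question, the second is hypothetical.
html = """
<div class="product-image"><a class="thumb" href="/1"><div class="price "><span>$</span>2.25
</div></a></div>
<div class="product-details"><dl>
  <dt><a href="/1">Life On The Screen</a></dt>
  <dd class="type">(Paperback)</dd>
  <dd>List $15.00 - Qty: 9</dd>
</dl></div>
<div class="product-image"><a class="thumb" href="/2"><div class="price "><span>$</span>3.50
</div></a></div>
<div class="product-details"><dl>
  <dt><a href="/2">Hypothetical Second Title</a></dt>
  <dd class="type">(Hardcover)</dd>
  <dd>List $20.00 - Qty: 2</dd>
</dl></div>
"""

soup = BeautifulSoup(html, "html.parser")

list_of_rows = []
for image_div in soup.find_all("div", class_="product-image"):
    # Assumption: the matching details div is the next sibling with that class.
    details_div = image_div.find_next_sibling("div", class_="product-details")
    row = [image_div.get_text(strip=True)]  # the price
    if details_div is not None:
        for cell in details_div.find_all(["dt", "dd"]):
            row.append(cell.get_text(strip=True))
    list_of_rows.append(row)

for row in list_of_rows:
    print(row)
```

Each inner list then holds the price followed by the details for a single product.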

2017-01-26 20:01:23 Shijo

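Thanks for your answer. I've actually removed that line entirely from my code sample; it was only there for debugging and I left it in by accident. What I ended up doing was writing list_of_rows out to a .csv file (I didn't bother including that code, because the problem isn't in writing to the csv, it's in the structure of list_of_rows, where multiple rows are written to the output on each loop instead of a single one). I hope that clarifies the question. Sorry for accidentally leaving the debugging code in.

The question is a bit confusing, but I've still proposed a couple of solutions. Let me know whether this works for you. - Shijo

This works perfectly! Thank you for your help, I really appreciate it!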
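To show why the structure of list_of_rows matters for the .csv output, here is a small sketch using the standard csv module (Python 3). The asker's actual CSV code was not posted, so to_csv and the sample values are hypothetical, and an in-memory buffer stands in for a real file:

```python
import csv
import io

def to_csv(rows):
    """Write rows with csv.writer into an in-memory buffer and return the text."""
    buffer = io.StringIO()  # stand-in for a file opened with newline=""
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()

# One inner list per div block: the price lands on its own CSV line.
separate = [["$2.25"], ["Life On The Screen", "Turkle, Sherry"]]
# One inner list per product: everything lands on a single CSV line.
combined = [["$2.25", "Life On The Screen", "Turkle, Sherry"]]

print(to_csv(separate))
print(to_csv(combined))
```

Every inner list becomes one line in the file, so the fix has to happen in how the rows are built, not in the CSV code.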
