With BeautifulSoup, extract tags of elements except those specified

I'm using BeautifulSoup 4 and Python 3.5+ to extract web data. I have the following HTML, from which I am extracting:

<div class="the-one-i-want"> 
    <p> 
     content 
    </p> 
    <p> 
     content 
    </p> 
    <p> 
     content 
    </p> 
    <p> 
     content 
    </p> 
    <ol> 
     <li> 
      list item 
     </li> 
     <li> 
      list item 
     </li> 
    </ol> 
    <div class="something-i-dont-want"> 
     content 
    </div> 
    <script class="something-else-i-dont-want"> 
     script 
    </script> 
    <p> 
     content 
    </p> 
</div>

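All of the content that I want to extract is found within the <div class="the-one-i-want"> element. Right now, I'm using the following method, which works most of the time: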

from bs4 import BeautifulSoup

soup = BeautifulSoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll('p')

This excludes scripts, weird inserted divs, and otherwise unpredictable content such as ads or 'recommended content' type stuff.

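Now, there are some instances in which elements other than the <p> tags hold content that is contextually important to the main content, such as lists.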

Is there a way to get the content from the <div class="the-one-i-want"> in a manner such as:

soup = BeautifulSoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll(desired_content_elements)

Where desired_content_elements would include every element that I deem fit for that particular content? For example, all <p> tags, all <ol> and <li> tags, but no <div> or <script> tags.

Perhaps noteworthy is my method of saving the content:

content_string = '' 
for p in content: 
    content_string += str(p)

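This approach collects the data in order of occurrence, which would prove difficult to manage if I simply found different element types through different iteration processes. If possible, I'd like NOT to have to manage reconstructing split lists to reassemble the order in which each element originally occurred in the content.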

Source

2016-07-21 theeastcoastwest

You can pass a list of the tags that you want:

content = soup.find('div', class_='the-one-i-want').find_all(["p", "ol", "whatever"])

If we run something similar on the URL of your question, looking for p and pre tags, you can see that we get both:

for ele in soup.select_one("td.postcell").find_all(["pre", "p"]):
    print(ele)

<p>I'm using Beutifulsoup 4 and Python 3.5+ to extract webdata. I have the following html, from which I am extracting:</p> 
<pre><code>&lt;div class="the-one-i-want"&gt; 
    &lt;p&gt; 
     content 
    &lt;/p&gt; 
    &lt;p&gt; 
     content 
    &lt;/p&gt; 
    &lt;p&gt; 
     content 
    &lt;/p&gt; 
    &lt;p&gt; 
     content 
    &lt;/p&gt; 
    &lt;ol&gt; 
     &lt;li&gt; 
      list item 
     &lt;/li&gt; 
     &lt;li&gt; 
      list item 
     &lt;/li&gt; 
    &lt;/ol&gt; 
    &lt;div class='something-i-don't-want&gt; 
     content 
    &lt;/div&gt; 
    &lt;script class="something-else-i-dont-want'&gt; 
     script 
    &lt;/script&gt; 
    &lt;p&gt; 
     content 
    &lt;/p&gt; 
&lt;/div&gt; 
</code></pre> 
<p>All of the content that I want to extract is found within the <code>&lt;div class="the-one-i-want"&gt;</code> element. Right now, I'm using the following methods, which work most of the time:</p> 
<pre><code>soup = Beautifulsoup(html.text, 'lxml') 
content = soup.find('div', class_='the-one-i-want').findAll('p') 
</code></pre> 
<p>This excludes scripts, weird insert <code>div</code>'s and otherwise un-predictable content such as ads or 'recommended content' type stuff.</p> 
<p>Now, there are some instances in which there are elements other than just the <code>&lt;p&gt;</code> tags, which has content that is contextually important to the main content, such as lists.</p> 
<p>Is there a way to get the content from the <code>&lt;div class="the-one-i-want"&gt;</code> in a manner as such:</p> 
<pre><code>soup = Beautifulsoup(html.text, 'lxml') 
content = soup.find('div', class_='the-one-i-want').findAll(desired-content-elements) 
</code></pre> 
<p>Where <code>desired-content-elements</code>would be inclusive of every element that I deemed fit for that particular content? Such as, all <code>&lt;p&gt;</code> tags, all <code>&lt;ol&gt;</code> and <code>&lt;li&gt;</code> tags, but no <code>&lt;div&gt;</code> or <code>&lt;script&gt;</code> tags.</p> 
<p>Perhaps noteworthy, is my method of saving the content:</p> 
<pre><code>content_string = '' 
for p in content: 
    content_string += str(p) 
</code></pre> 
<p>This approach collects the data, in order of occurrence, which would prove difficult to manage if I simply found different element types through different iteration processes. I'm looking to NOT have to manage re-construction of split lists to re-assemble the order in which each element originally occurred in the content, if possible.</p>
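
As a self-contained illustration of passing a tag list to find_all, here is a minimal sketch. The sample markup and the html_doc name are made up for the example, and it uses the built-in "html.parser" instead of 'lxml' only so that it runs without an extra dependency. The matched elements come back in document order, so the existing join loop keeps working.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the question's snippet.
html_doc = (
    '<div class="the-one-i-want">'
    "<p>first</p>"
    "<ol><li>item</li></ol>"
    '<div class="something-i-dont-want">ad</div>'
    '<script class="something-else-i-dont-want">track()</script>'
    "<p>last</p>"
    "</div>"
)

soup = BeautifulSoup(html_doc, "html.parser")
container = soup.find("div", class_="the-one-i-want")

# A list of names matches any of those tags, in document order.
content = container.find_all(["p", "ol"])

names = [el.name for el in content]
content_string = "".join(str(el) for el in content)
print(names)
print(content_string)
```

Note that "li" is left out of the list on purpose: str() of an <ol> already includes its <li> children, so listing both would likely repeat the items in content_string.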

Source

2016-07-22 07:35:01

Works like a charm, thanks for the help @padraic – theeastcoastwest

-1

Would this work for you? It should loop through the content, adding the desired text while ignoring the div and script tags.

content_string = ''
for p in content:
    # Skip any element that contains a nested <div> or <script>.
    if p.find('div') or p.find('script'):
        continue
    content_string += str(p)
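
A different technique with the same goal, offered only as a sketch: remove the unwanted elements from the tree first with decompose(), then serialize whatever is left inside the container. The sample markup and the html_doc name are assumptions for illustration, and "html.parser" is used here instead of 'lxml' to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the unwanted tags are not nested in each other.
html_doc = (
    '<div class="the-one-i-want">'
    "<p>a</p>"
    '<div class="ad">x</div>'
    "<script>s()</script>"
    "<ol><li>b</li></ol>"
    "</div>"
)

soup = BeautifulSoup(html_doc, "html.parser")
container = soup.find("div", class_="the-one-i-want")

# Drop every nested <div> and <script> from the parsed tree.
for unwanted in container.find_all(["div", "script"]):
    unwanted.decompose()

# Serialize the remaining children, still in their original order.
content_string = container.decode_contents()
print(content_string)
```

This is a blocklist rather than an allowlist, so any tag you did not anticipate stays in the output. It also modifies the soup in place.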

Source

2016-07-21 17:30:19 jinksPadlock

You can do this easily with:

soup = BeautifulSoup(html.text, 'lxml')
desired_tags = {'div', 'ol'}  # add what you need
# Keep only the direct children whose tag name is in desired_tags.
content = filter(lambda x: x.name in desired_tags,
                 soup.find('div', class_='the-one-i-want').children)

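This will go through all direct children of the div tag. If you want this to happen recursively (you said you wanted to include li tags), you should use .descendants instead of .children. Happy crawling!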
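
To make the difference between direct children and all descendants concrete, here is a small sketch. The markup, the html_doc name, and the isinstance check against Tag are assumptions added for the example, and "html.parser" stands in for 'lxml' so it runs without an extra dependency.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup with a <p> nested inside an unwanted <div>.
html_doc = (
    '<div class="the-one-i-want">'
    "<p>intro</p>"
    "<ol><li>one</li><li>two</li></ol>"
    '<div class="ad"><p>sponsored</p></div>'
    "<script>track()</script>"
    "</div>"
)

soup = BeautifulSoup(html_doc, "html.parser")
container = soup.find("div", class_="the-one-i-want")
desired_tags = {"p", "ol", "li"}

def wanted(el):
    # Text nodes are not Tag instances, so they are filtered out here.
    return isinstance(el, Tag) and el.name in desired_tags

# Only the elements directly under the container.
direct = [el.name for el in container.children if wanted(el)]

# Every element at any depth, in document order.
nested = [el.name for el in container.descendants if wanted(el)]

print(direct)
print(nested)
```

With the descendant walk, the <p> inside the unwanted <div> is picked up too, and each <li> appears both on its own and inside the string form of its <ol>. If only top-level content matters, iterating over the direct children is probably the safer choice.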

Source

2016-07-22 07:55:31 dirkster
