Python text parsing between two words

I am using BeautifulSoup and want to extract all the text between two words on a web page.

For example, imagine the following website text:

This is the text of the webpage. It is just a string of a bunch of stuff and maybe some tags in between.

I want to pull out everything on the page that starts with text and ends with bunch.

In this case, I would want only:

text of the webpage. It is just a string of a bunch

However, there may be multiple instances of this on the page.

What is the best way to do this?

This is my current setup:

#!/usr/bin/env python
import re

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()
urls = [
    'http://ca.news.yahoo.com/forget-phoning-business-app-sends-text-instead-100143774--sector.html',
]


def visible(element):
    # Ignore text whose parent is an element that is not rendered.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Ignore HTML comments.
    elif re.match('<!--.*-->', str(element)):
        return False
    # Otherwise keep it: these are the elements we need.
    else:
        return True


for url in urls:
    page = mech.open(url)
    html = page.read()
    soup = BeautifulSoup(html)
    text = soup.prettify()
    texts = soup.findAll(text=True)

    # filter() keeps only the items of texts for which visible() returns True.
    # We use those to build our final list.
    visible_texts = filter(visible, texts)

    for line in visible_texts:
        print(line)


2012-11-22 user1328021

Since you are only dealing with text, all you need is a regular expression:

import re 
result = re.findall("text.*?bunch", text_from_web_page)
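As a minimal sketch of how this pattern behaves, the snippet below runs it on the sample sentence from the question, extended with a second occurrence. The variable `sample` and its extra sentence are made up for illustration, and the `re.DOTALL` flag is an addition not present in the answer above; it lets `.` match newlines in case a match spans several lines.

```python
import re

# Sample text from the question, extended so that it contains two matches
# (the second sentence is hypothetical input, not from the original page).
sample = (
    "This is the text of the webpage. It is just a string of a bunch "
    "of stuff and maybe some tags in between.\n"
    "More text here, followed by\nanother bunch of words."
)

# The non-greedy .*? stops at the first "bunch" after each "text".
# re.DOTALL (an assumption added here) allows a match to cross line breaks.
matches = re.findall(r"text.*?bunch", sample, re.DOTALL)

for m in matches:
    print(repr(m))
```

Note that the pattern also matches "text" inside longer words such as "context"; if whole-word matching is wanted, `r"\btext\b.*?\bbunch\b"` is one option.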


2012-11-22 03:44:17 scripts

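Thanks, but I don't know what type of tag this will be in. What I have done is parse out all the text, so now this is more of a text parsing problem. I have updated my code to show this. – user1328021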

Edited to fit your needs – scripts

What should the 'text_from_web_page' variable be in my example? – user1328021
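One possible reading, offered only as a sketch: `text_from_web_page` could be the visible strings from the question joined into a single string, so that a match can span several elements. The hard-coded `visible_texts` list below is a stand-in (an assumption) for the list the question's code builds, and `re.DOTALL` is likewise an addition for illustration.

```python
import re

# Stand-in for the visible_texts list built in the question (hypothetical data).
visible_texts = [
    "This is the text of the webpage. ",
    "It is just a string of a bunch of stuff ",
    "and maybe some tags in between.",
]

# Join the fragments into one string so a match can cross element boundaries.
text_from_web_page = "".join(visible_texts)

# Same pattern as in the answer; re.DOTALL is an assumption added here.
result = re.findall(r"text.*?bunch", text_from_web_page, re.DOTALL)
print(result)
```

Joining with an empty string keeps the original spacing of the fragments; a separator such as `" "` may suit pages where adjacent elements carry no whitespace of their own.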
