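Python text parsing between two words
I am using BeautifulSoup and want to extract all of the text between two words on a web page.
For example, imagine the following website text: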
This is the text of the webpage. It is just a string of a bunch of stuff and maybe some tags in between.
I want to pull out everything on the page that starts with "text" and ends with "bunch".
In this case, I would only want:
text of the webpage. It is just a string of a bunch
However, there is a chance that there are multiple instances of this on the page.
What is the best way to do this?
This is my current setup:
#!/usr/bin/env python
import re

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()

urls = [
    "http://ca.news.yahoo.com/forget-phoning-business-app-sends-text-instead-100143774--sector.html",
]


def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        # If the parent of the element is any of these, ignore it
        return False
    elif re.match('<!--.*-->', str(element)):
        # If the element is an HTML comment, ignore it
        return False
    else:
        # Otherwise, return True, as these are the elements we need
        return True


for url in urls:
    page = mech.open(url)
    html = page.read()
    soup = BeautifulSoup(html)
    text = soup.prettify()
    texts = soup.findAll(text=True)

    # filter() keeps only the items of texts for which visible() returns True.
    # We use those to build our final list.
    visible_texts = filter(visible, texts)

    for line in visible_texts:
        print(line)
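Once the visible text is available as one string, the remaining step is plain text parsing. A minimal sketch, assuming a regular expression is acceptable here: a non-greedy pattern captures each span that starts with the first word and stops at the next occurrence of the second, so multiple instances on a page come back as separate matches. The function name `extract_between` and the `sample` string are illustrative and not part of the setup above.

```python
import re


def extract_between(text, start, end):
    """Return every shortest span in text that begins with start and ends with end."""
    # re.escape guards against regex metacharacters in the two words.
    # ".*?" is non-greedy, so each match stops at the first `end` after `start`.
    # re.DOTALL lets a match run across newlines.
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    return pattern.findall(text)


sample = ("This is the text of the webpage. It is just a string of "
          "a bunch of stuff and maybe some tags in between.")
matches = extract_between(sample, "text", "bunch")
for match in matches:
    print(match)
```

Note that this matches substrings, so "text" would also match inside "context". Wrapping each word in `\b` word boundaries would be one way to restrict it to whole words, if that is what is wanted.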
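Thanks, but I don't know what type of tag it will be. What I've done is parse out all of the text, so now this is more of a text parsing problem. I've updated my code to show this. – user1328021
Edited to fit your needs – scripts
What should the 'text_from_web_page' variable be in my example? – user1328021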
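The answer that introduces `text_from_web_page` is not shown here, so the following is only a guess at what it is meant to hold: a single string built from the visible text nodes. Under that assumption, the filtered list could be joined as below. The helper name `join_visible_text` is hypothetical, and the hard-coded `fragments` list stands in for `visible_texts`.

```python
import re


def join_visible_text(fragments):
    """Join text fragments into one string with whitespace collapsed."""
    # Join with spaces so words from neighbouring nodes do not run together,
    # then collapse any run of whitespace (including newlines) to one space.
    joined = " ".join(fragments)
    return re.sub(r"\s+", " ", joined).strip()


# Stand-in for visible_texts; real input would be the filtered text nodes.
fragments = [
    "This is the text of the ",
    "\n",
    "webpage. It is just a string of a bunch",
    " of stuff.",
]
text_from_web_page = join_visible_text(fragments)
print(text_from_web_page)
```

Joining first means the string can be searched as one unit, so a match may begin in one text node and end in another.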