Reading page source saved in text files and extracting text

I have multiple text files that store the source pages of a website, so each text file is one page source.

I need to extract the text of one div class from a saved text file, using the code below:

from bs4 import BeautifulSoup 
soup = BeautifulSoup(open("zing.internet.accelerator.plus.txt")) 
txt = soup.find('div' , attrs = { 'class' : 'id-app-orig-desc' }).text 
print txt

I checked the type of my soup object to make sure it is not using the string find method when looking for the div class. Type of the soup object:

print type(soup) 
<class 'bs4.BeautifulSoup'>

I took a reference from the previous post and wrote the open statement inside the BeautifulSoup call.

Error:

Traceback (most recent call last): 
    File "html_desc_cleaning.py", line 13, in <module> 
    txt2 = soup.find('div' , attrs = { 'class' : 'id-app-orig-desc' }).text 
AttributeError: 'NoneType' object has no attribute 'text'

Source from the page:


2015-10-14 Pappu Jha

Please don't upload images of text, because images are not useful. – styvane

I have solved this problem.

In my case, BeautifulSoup's default parser was 'lxml', and it could not read the complete page source.

Changing the parser to 'html.parser' worked for me.

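from bs4 import BeautifulSoup

# Read the saved page source, closing the file when done
with open("zing.internet.accelerator.plus.txt") as f:
    html = f.read()
# Name the parser explicitly instead of relying on the default
bs = BeautifulSoup(html, "html.parser")
print(bs.find('div', {'class': 'id-app-orig-desc'}).text)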
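
Since the result depended on the parser here, a more defensive version could try several parsers in turn and check for None before reading .text, which is what raised the AttributeError in the question. This is only a rough sketch: the helper name extract_div_text, the parser order, and the sample markup are assumptions for illustration, not part of the original code.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def extract_div_text(html, class_name, parsers=("lxml", "html.parser")):
    """Return the text of the first matching div, or None if no parser finds it."""
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # This parser is not installed; move on to the next one.
            continue
        div = soup.find("div", attrs={"class": class_name})
        if div is not None:
            return div.get_text()
    return None


# Hypothetical stand-in for the saved page source.
sample = '<html><body><div class="id-app-orig-desc">App description</div></body></html>'
print(extract_div_text(sample, "id-app-orig-desc"))
print(extract_div_text(sample, "no-such-class"))
```

Returning None instead of raising lets the caller decide what to do with a page where the div is missing or was dropped by the parser.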


2015-10-14 14:04:59

Try replacing this:

soup = BeautifulSoup(open("zing.internet.accelerator.plus.txt"))

with this:

soup = BeautifulSoup(open("zing.internet.accelerator.plus.txt").read())

By the way, it is a good idea to close the file once you have finished reading it. You can use with, like this:

with open("zing.internet.accelerator.plus.txt") as f: 
    soup = BeautifulSoup(f.read())

with will close the file automatically.
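
A quick way to see this for yourself is to check the file object's closed attribute inside and after the with block. The temporary file below is a hypothetical stand-in for the saved page source.

```python
import os
import tempfile

# Hypothetical stand-in for the saved page source file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("<div class='id-app-orig-desc'>Hey there.</div>")
    path = tmp.name

with open(path) as f:
    html = f.read()
    open_inside = not f.closed   # the file is still open inside the block

closed_after = f.closed          # the file is closed once the block has exited

print(open_inside, closed_after)
os.remove(path)                  # clean up the temporary file
```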

Here is an example of why you need the .read() function:

>>> a = open('test.txt') 
>>> type(a) 
<class '_io.TextIOWrapper'> 

>>> print(a) 
<_io.TextIOWrapper name='test.txt' mode='r' encoding='UTF-8'> 

>>> b = a.read() 
>>> type(b) 
<class 'str'> 

>>> print(b) 
Hey there. 

>>> print(open('test.txt')) 
<_io.TextIOWrapper name='test.txt' mode='r' encoding='UTF-8'> 

>>> print(open('test.txt').read()) 
Hey there.
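
As a side note, BeautifulSoup also accepts an open file object directly, which may be why adding .read() did not change the error reported in the comments below. The sketch here compares the two inputs; io.StringIO and the sample markup are assumptions standing in for the real file.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved page source.
markup = '<div class="id-app-orig-desc">Hey there.</div>'

# io.StringIO behaves like an open text file for this purpose.
from_file = BeautifulSoup(io.StringIO(markup), "html.parser")
from_string = BeautifulSoup(markup, "html.parser")

text_from_file = from_file.find("div", {"class": "id-app-orig-desc"}).text
text_from_string = from_string.find("div", {"class": "id-app-orig-desc"}).text
print(text_from_file == text_from_string)
```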


2015-10-14 06:11:06

Hey, thanks. I tried the code above with read included, but I still get the same error :(

Hmm... try 'open("zing.internet.accelerator.plus.txt").read()'

It prints the entire page source.
