Extracting text from an HTML file using bs4

I want to extract the text from my HTML file. If I hard-code a specific file, as below:

import bs4, sys 
from urllib import urlopen 
#filin = open(sys.argv[1], 'r') 
filin = '/home/iykeln/Desktop/R_work/file1.html' 
webpage = urlopen(filin).read().decode('utf-8') 
soup = bs4.BeautifulSoup(webpage) 
for node in soup.findAll('html'): 
    print u''.join(node.findAll(text=True)).encode('utf-8')

it works. But when I try to handle an arbitrary file with open(sys.argv[1], 'r'), as below:

import bs4, sys 
from urllib import urlopen 
filin = open(sys.argv[1], 'r') 
#filin = '/home/iykeln/Desktop/R_work/file1.html' 
webpage = urlopen(filin).read().decode('utf-8') 
soup = bs4.BeautifulSoup(webpage) 
for node in soup.findAll('html'): 
    print u''.join(node.findAll(text=True)).encode('utf-8')

or, using a with block:

import bs4, sys 
from urllib import urlopen 
with open(sys.argv[1], 'r') as filin: 
    webpage = urlopen(filin).read().decode('utf-8') 
    soup = bs4.BeautifulSoup(webpage) 
    for node in soup.findAll('html'): 
        print u''.join(node.findAll(text=True)).encode('utf-8')

I get the following error:

Traceback (most recent call last): 
    File "/home/iykeln/Desktop/py/clean.py", line 5, in <module> 
    webpage = urlopen(filin).read().decode('utf-8') 
    File "/usr/lib/python2.7/urllib.py", line 87, in urlopen 
    return opener.open(url) 
    File "/usr/lib/python2.7/urllib.py", line 180, in open 
    fullurl = unwrap(toBytes(fullurl)) 
    File "/usr/lib/python2.7/urllib.py", line 1057, in unwrap 
    url = url.strip() 
AttributeError: 'file' object has no attribute 'strip'


2013-08-04 Iykeln

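You shouldn't call open here. urlopen expects a URL or filename string, not a file object, so just pass the filename to urlopen: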

import bs4, sys 
from urllib import urlopen 

webpage = urlopen(sys.argv[1]).read().decode('utf-8') 
soup = bs4.BeautifulSoup(webpage) 
for node in soup.findAll('html'): 
    print u''.join(node.findAll(text=True)).encode('utf-8')
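
A side note for anyone on Python 3, which is an assumption beyond the question (the traceback shows Python 2.7): there urlopen lives in urllib.request and will not accept a bare filesystem path, so the path has to be turned into a file:// URL first. A minimal sketch of that idea; read_local_html is a hypothetical helper name, not something from the question:

```python
from pathlib import Path
from urllib.request import urlopen


def read_local_html(path, encoding="utf-8"):
    """Read a local HTML file through urlopen and return it as text.

    Hypothetical helper for illustration; assumes Python 3.
    """
    # Python 3's urlopen needs a full URL, so convert the path to file://...
    url = Path(path).resolve().as_uri()
    with urlopen(url) as response:
        # read() returns bytes; decode them with the assumed encoding.
        return response.read().decode(encoding)
```

It would be called as read_local_html(sys.argv[1]), and the returned string can then be handed to BeautifulSoup as in the code above.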

FYI, you don't need urllib at all to open a local file:

import bs4, sys 

with open(sys.argv[1], 'r') as f: 
    webpage = f.read().decode('utf-8') 

soup = bs4.BeautifulSoup(webpage) 
for node in soup.findAll('html'): 
    print u''.join(node.findAll(text=True)).encode('utf-8')
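
If you are on Python 3 (again an assumption, since the question uses Python 2.7), the same approach might look roughly like the sketch below. It passes the open file straight to BeautifulSoup and uses get_text() in place of joining the text nodes by hand. extract_text is a hypothetical helper name, and "html.parser" is just the parser I picked because it ships with Python:

```python
from bs4 import BeautifulSoup


def extract_text(path, encoding="utf-8"):
    """Return the text content of the HTML file at `path` as one string.

    Hypothetical helper for illustration; assumes Python 3 and bs4 installed.
    """
    # Text mode with an explicit encoding replaces the manual .decode('utf-8').
    with open(path, "r", encoding=encoding) as f:
        # BeautifulSoup accepts an open file handle as well as a string.
        soup = BeautifulSoup(f, "html.parser")
    # get_text() concatenates the document's text nodes, roughly what
    # u''.join(node.findAll(text=True)) does in the Python 2 version.
    return soup.get_text()
```

Usage would be print(extract_text(sys.argv[1])) after importing sys.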

Hope that helps.


2013-08-04 12:01:38 alecxe

Yes! You are right. Thanks, alecxe. – Iykeln

Yes! It helped, @alecxe. Thanks. – Iykeln
