2013-08-04

Extracting text from an HTML file using bs4

I want to extract the text from my HTML file. If I hard-code a specific file path, as below:

import bs4, sys 
from urllib import urlopen 
#filin = open(sys.argv[1], 'r') 
filin = '/home/iykeln/Desktop/R_work/file1.html' 
webpage = urlopen(filin).read().decode('utf-8') 
soup = bs4.BeautifulSoup(webpage) 
for node in soup.findAll('html'): 
    print u''.join(node.findAll(text=True)).encode('utf-8') 

it works. However, when I instead try to open an arbitrary file with open(sys.argv[1], 'r'), as below:

import bs4, sys 
from urllib import urlopen 
filin = open(sys.argv[1], 'r') 
#filin = '/home/iykeln/Desktop/R_work/file1.html' 
webpage = urlopen(filin).read().decode('utf-8') 
soup = bs4.BeautifulSoup(webpage) 
for node in soup.findAll('html'): 
    print u''.join(node.findAll(text=True)).encode('utf-8') 

OR

import bs4, sys 
from urllib import urlopen 
with open(sys.argv[1], 'r') as filin: 
    webpage = urlopen(filin).read().decode('utf-8') 
    soup = bs4.BeautifulSoup(webpage) 
    for node in soup.findAll('html'): 
        print u''.join(node.findAll(text=True)).encode('utf-8')

I get the following error:

Traceback (most recent call last):
  File "/home/iykeln/Desktop/py/clean.py", line 5, in <module>
    webpage = urlopen(filin).read().decode('utf-8')
  File "/usr/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.7/urllib.py", line 180, in open
    fullurl = unwrap(toBytes(fullurl))
  File "/usr/lib/python2.7/urllib.py", line 1057, in unwrap
    url = url.strip()
AttributeError: 'file' object has no attribute 'strip'

Answers

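You should not call open yourself; just pass the filename to urlopen. urlopen expects a string (a URL or, in Python 2, a local path) and calls strip() on it, which is why passing a file object raises the AttributeError: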

import bs4, sys 
from urllib import urlopen 

webpage = urlopen(sys.argv[1]).read().decode('utf-8') 
soup = bs4.BeautifulSoup(webpage) 
for node in soup.findAll('html'): 
    print u''.join(node.findAll(text=True)).encode('utf-8') 
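
For readers on Python 3, here is a minimal sketch of the same idea. It assumes Python 3, where urlopen lives in urllib.request and, unlike the Python 2 urllib used above, does not accept a bare local path. The helper name read_html and the "://" check are illustrative assumptions, not part of the original answer.

```python
from pathlib import Path
from urllib.request import urlopen


def read_html(source):
    """Return the decoded text of a local path or a URL (hypothetical helper)."""
    # urlopen expects a string URL, never an already opened file object.
    if "://" not in source:
        # Assumption: a string without a scheme is a local path,
        # so convert it to an absolute file:// URL first.
        source = Path(source).resolve().as_uri()
    with urlopen(source) as response:
        # Assumption: the document is UTF-8 encoded, as in the question.
        return response.read().decode("utf-8")
```

Calling read_html(sys.argv[1]) would then accept either a path such as file1.html or a full URL.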

FYI, you don't need urllib to open a local file:

import bs4, sys 

with open(sys.argv[1], 'r') as f: 
    webpage = f.read().decode('utf-8') 

soup = bs4.BeautifulSoup(webpage) 
for node in soup.findAll('html'): 
    print u''.join(node.findAll(text=True)).encode('utf-8') 
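
The local-file approach can also be sketched in a form that runs on Python 3. This is only an illustration under a few assumptions: bs4 is installed, the file is UTF-8, and the built-in "html.parser" backend is acceptable. The function name extract_text is hypothetical, and get_text() is swapped in for the u''.join(node.findAll(text=True)) idiom used above.

```python
import io

from bs4 import BeautifulSoup


def extract_text(path, encoding="utf-8"):
    """Return the text content of the HTML file at `path` (hypothetical helper)."""
    # io.open decodes while reading, so no separate .decode() call is needed.
    with io.open(path, "r", encoding=encoding) as f:
        # BeautifulSoup accepts an open file handle; the parser is named
        # explicitly here (an assumption, the original relies on the default).
        soup = BeautifulSoup(f, "html.parser")
    # get_text() concatenates the text nodes of the whole document.
    return soup.get_text()
```

Usage would be along the lines of print(extract_text(sys.argv[1])).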

Hope that helps.

Yes! You are right. Thanks, alecxe. – Iykeln

Yes! It helped, @alecxe. Thanks. – Iykeln