Extracting information across multiple HTML documents

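I have about 700 HTML documents. Each one contains a single letter inside a span, and all of those spans share the same class.

Is there a way to get all the letters and put them together? Perhaps using BeautifulSoup or another method?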

2013-03-20 Quazum

Sure there is. Try something like this:

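import os
from bs4 import BeautifulSoup  # BeautifulSoup 4; on the old BeautifulSoup 3 this was `from BeautifulSoup import BeautifulSoup`

directory = 'path/to/dir'
letter_list = []
for filename in os.listdir(directory):
    # os.listdir() returns bare file names, so build the full path before opening
    with open(os.path.join(directory, filename), 'r') as html_file:
        html = html_file.read()  # Read the whole file into a single string
    soup = BeautifulSoup(html, 'html.parser')

    # Take the first span with the given class; its first child is the letter
    letter = soup('span', {'class': 'someclass'})[0].contents[0]
    letter_list.append(letter)

my_string = ''.join(str(x) for x in letter_list)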

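This iterates over the directory, opens each HTML file, and parses its contents. Each extracted letter is appended to a list, and the list is joined into a single string once all the files have been parsed.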
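One thing to watch for: os.listdir() returns names in arbitrary order, so if the letters have to come out in a particular sequence, the file names should probably be sorted first. Below is a minimal sketch of the same idea using only the standard library's html.parser instead of BeautifulSoup. It assumes the files end in .html and that sorting by file name gives the order you want; the names FirstSpanText and collect_letters are made up for illustration.

```python
import os
from html.parser import HTMLParser


class FirstSpanText(HTMLParser):
    """Captures the first text chunk inside the first <span> with a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.in_target = False  # True while inside the matching span
        self.text = None        # First text found in that span

    def handle_starttag(self, tag, attrs):
        if tag == 'span' and self.text is None:
            # The class attribute may hold several space-separated names
            classes = (dict(attrs).get('class') or '').split()
            if self.css_class in classes:
                self.in_target = True

    def handle_data(self, data):
        if self.in_target and self.text is None:
            self.text = data

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_target = False


def collect_letters(directory, css_class):
    """Hypothetical helper: join the letter from every .html file in directory."""
    letters = []
    # sorted() makes the order deterministic; os.listdir() order is arbitrary
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith('.html'):
            continue
        with open(os.path.join(directory, filename), encoding='utf-8') as f:
            parser = FirstSpanText(css_class)
            parser.feed(f.read())
        if parser.text is not None:  # Skip files with no matching span
            letters.append(parser.text.strip())
    return ''.join(letters)
```

Calling collect_letters('path/to/dir', 'someclass') would then return the combined string. Note that sorted() compares names as text, so '10.html' comes before '2.html'; if the files are numbered, a numeric sort key may be needed.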

2013-03-20 17:39:11 That1Guy
