Scraping data from locally stored HTML content - using Python

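I have a large Excel file, and each cell contains assorted HTML content holding comments made by users of a database. The content of each cell is unique and varies in length. I need to get rid of all the HTML syntax/tags so that I can upload this content to a database table. How can I scrape this data using Python (or Java, if there is no Python answer)? Can you provide a code example?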

Source

2016-10-13 andi m

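What have you tried? Show us the code you have written. If you have not tried anything yet, you could consider using the [lxml](http://lxml.de/) library to parse the HTML and then pull the text out from there.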

Yes, you may want to show an example of one of the content strings.

Excel cell 1: The indicator lights on the control cabinet are to be replaced with 24Vdc LED's. 3 Red & 3 Green. Excel cell 2: "Close the Monthly LAD and Lanyard Work orders to show they were executed."

In a terminal, run pip install bs4. Then you can extract the text in Python like this:

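import bs4

# Two sample cell values; in practice these would come from the Excel file.
for cell in [
    '<html>The indicator lights on the control cabinet&nbsp;are to be replaced with 24Vdc&nbsp;LED\'s. 3 Red &amp;&nbsp;3 Green.</html>',
    '<html><div> <span style="FONT-SIZE: 18pt">Close the Monthly LAD and Lanyard Work orders to show they were executed. </span></div>']:
    # Parse the fragment and keep only its text. Naming the parser
    # explicitly avoids the "no parser was explicitly specified" warning.
    print(bs4.BeautifulSoup(cell, 'html.parser').text.strip())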

Result:

The indicator lights on the control cabinet are to be replaced with 24Vdc LED's. 3 Red & 3 Green. 
Close the Monthly LAD and Lanyard Work orders to show they were executed.
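
If installing a third-party package is not an option, roughly the same result can probably be had with the standard library alone. The following is only a minimal sketch, not part of the original answer: the names `TextExtractor` and `strip_html` are hypothetical helpers introduced here for illustration, and it assumes the cell values have already been read into Python strings.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; and &nbsp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_html(cell):
    """Return the text of one cell with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(cell)
    parser.close()  # flush any text still buffered by the parser
    # &nbsp; decodes to "\xa0"; turn it into a plain space before collapsing.
    text = "".join(parser.parts).replace("\xa0", " ")
    return " ".join(text.split())


if __name__ == "__main__":
    cells = [
        "<html>3 Red &amp;&nbsp;3 Green.</html>",
        '<div> <span style="FONT-SIZE: 18pt">Close the work orders. </span></div>',
    ]
    for cell in cells:
        print(strip_html(cell))
```

Unlike the `bs4` version, this sketch also collapses runs of whitespace and replaces non-breaking spaces with ordinary ones, which may or may not be what the target database column needs.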

Source

2016-10-13 21:16:08
