Beautiful Soup 4: parsing a directory of HTML documents

I am working with this code:

from bs4 import BeautifulSoup 
import glob 
import os 
import re 

def trade_spider():
    os.chdir(r"C:\Users\6930p\FLO'S DATEIEN\Master FAU\Sommersemester 2016\02_Masterarbeit\04_Testumgebung\01_Probedateien für Analyseaspekt\Independent Auditors Report")
    for file in glob.glob('*.html'):
        with open(file, encoding="utf8") as f:
            contents = f.read()
            soup = BeautifulSoup(contents, "html.parser")
            results = [item for item in soup.findAll("ix:nonfraction") if re.match("^[^:]:AuditFeesExpenses", item['name'])]
            print(results)
            #print(file, end="| ")
            #print(item['name'], end="| ")
            #print(item.get_text())

trade_spider()

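I want to parse multiple HTML documents in a directory on my computer with BS4. My goal is to find the tags that begin with "ix:NonFraction ..." and carry a name attribute in which 'AuditFeesExpenses' can be preceded by several different prefixes, for example name="aurep:AuditFeesExpenses", name="bus:AuditFeesExpenses" and so on (which is why I am using a regular expression). Whenever BS4 finds such a tag, I want to extract its text with soup.get_text(Value).

Does anyone have an idea what I have missed?

UPDATE: An example tag is: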

<td style=" width:12.50%; text-align:right; " class="ta_60">
<ix:nonFraction contextRef="ThirdPartyAgentsHypercube_FY_31_12_2012_Set1"
name="ns19:AuditFeesExpenses" unitRef="GBP" decimals="0"
format="ixt2:numdotdecimal" scale="0"
xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">3,600</ix:nonFraction></td>

Normally this tag appears on a single line; I inserted a few line breaks for clarity.

My final code looks like this:

from bs4 import BeautifulSoup 
import glob 
import os 
import re 

def trade_spider():
    os.chdir(r"C:\Users\6930p\FLO'S DATEIEN\Master FAU\Sommersemester 2016\02_Masterarbeit\04_Testumgebung\01_Probedateien für Analyseaspekt\Independent Auditors Report")
    for file in glob.glob('*.html'):
        with open(file, encoding="utf8") as f:
            contents = f.read()
            soup = BeautifulSoup(contents, "html.parser")
            for item in soup.findAll("ix:nonfraction"):
                if re.match(".*AuditFeesExpenses", item['name']):
                    print(file, end="| ")
                    print(item['name'], end="| ")
                    print(item.get_text())

trade_spider()

and it gives me output like this:

Prod224_0010_00079350_20140331.html| uk-aurep:AuditFeesExpenses| 2,000


2016-05-10 Florian Schramm

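The findAll() function takes name as its first argument. When you call

`soup.findAll('ix:NonFraction', name=re.compile("^[^:]:AuditFeesExpenses"))`,

you are in effect calling it with both name='ix:NonFraction' and name=re.compile("^[^:]:AuditFeesExpenses"). The name parameter can only be set to one of these two inputs, which is what produces the error.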
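The clash can be reproduced in a few lines. This is only a sketch, assuming the html.parser builder and a sample tag modelled on the one in the question; the variable names are made up for illustration. It also shows the `attrs` dictionary, which Beautiful Soup accepts for filtering on an HTML attribute that happens to be called `name`.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modelled on the tag from the question (illustrative only).
html = (
    '<td><ix:nonFraction name="ns19:AuditFeesExpenses" unitRef="GBP">'
    "3,600</ix:nonFraction></td>"
)
soup = BeautifulSoup(html, "html.parser")

pattern = re.compile(r"^[^:]+:AuditFeesExpenses$")

# Passing the tag name positionally AND name=... as a keyword supplies the
# same parameter twice, so Python rejects the call with a TypeError.
try:
    soup.find_all("ix:nonfraction", name=pattern)
    clash = None
except TypeError as exc:
    clash = exc

# The HTML attribute called "name" goes through the attrs dictionary instead.
matches = soup.find_all("ix:nonfraction", attrs={"name": pattern})
texts = [tag.get_text() for tag in matches]

print(clash)
print(texts)
```

With `attrs`, the tag filter and the attribute filter run in a single call, so no second filtering pass is needed.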

The error message mentions find_all() rather than findAll(). The docs show that findAll is the old name of the find_all method, so find_all is the one to use.

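The confusion probably comes from the name attribute. It is important to distinguish between the BeautifulSoup attribute name and the HTML attribute name. To demonstrate, I will assume a tag of the following form: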

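<body>
    <ix:NonFraction name="AuditFeesExpenses">stuff</ix:NonFraction>
</body>

We can find all <ix:NonFraction> tags with soup.find_all("ix:nonfraction"). The tag name is searched in lower case because html.parser lower-cases tag names. This returns a list containing the following result:

[<ix:nonfraction name="AuditFeesExpenses">stuff</ix:nonfraction>]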

Iterating over this one-item list shows the two different name attributes. First, we access the BeautifulSoup name attribute as an attribute of the object:

for item in soup.find_all("ix:nonfraction"): 
    print(item.name) 

Out: 'ix:nonfraction'

To see the HTML name attribute, access name as a dictionary key:

for item in soup.find_all("ix:nonfraction"): 
    print(item['name']) 

Out: 'AuditFeesExpenses'
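The two lookups can be put side by side in one self-contained snippet. This is a sketch, assuming the html.parser builder and the illustrative tag from above; the `.get("id")` call is included only to show how a missing attribute behaves.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one tag whose HTML attribute is also called "name".
html = '<body><ix:NonFraction name="AuditFeesExpenses">stuff</ix:NonFraction></body>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("ix:nonfraction")

tag_name = tag.name          # Beautiful Soup's name of the tag itself
attr_name = tag["name"]      # value of the HTML attribute called "name"
missing = tag.get("id")      # .get() returns None for an absent attribute

print(tag_name)
print(attr_name)
print(missing)
```

Indexing with `tag["id"]` would raise a KeyError here, so `.get()` is the safer choice when some tags may lack the attribute.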

Combining these two searches narrows down the results:

results = [item for item in soup.find_all("ix:nonfraction") if re.match(r"^[^:]+:AuditFeesExpenses", item['name'])]

Out: [<ix:nonfraction name="ns19:AuditFeesExpenses">3,600</ix:nonfraction>]

Or, if we want the text of each match:

Full output

results = [item.get_text() for item in soup.find_all("ix:nonfraction") if re.match(r"^[^:]+:AuditFeesExpenses", item['name'])]

Out: ['3,600']
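One detail of the pattern is worth checking on its own: `[^:]` without a quantifier matches exactly one character, so a multi-character prefix such as `ns19` only matches when the class is followed by `+`. The sketch below uses only the standard `re` module; the `x:` entry is a made-up one-character prefix added for contrast.

```python
import re

# Prefixed names as they appear in the question, plus a hypothetical
# one-character prefix ("x") for comparison.
names = [
    "ns19:AuditFeesExpenses",
    "aurep:AuditFeesExpenses",
    "bus:AuditFeesExpenses",
    "x:AuditFeesExpenses",
]

single_char = re.compile(r"^[^:]:AuditFeesExpenses")   # prefix of exactly one character
any_prefix = re.compile(r"^[^:]+:AuditFeesExpenses")   # prefix of one or more characters

matched_single = [n for n in names if single_char.match(n)]
matched_any = [n for n in names if any_prefix.match(n)]

print(matched_single)
print(matched_any)
```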

Suggested code:

from bs4 import BeautifulSoup
import glob
import os
import re

def trade_spider():
    os.chdir(r"C:\Independent Auditors Report")
    for file in glob.glob('*.html'):
        with open(file, encoding="utf8") as f:
            contents = f.read()
        soup = BeautifulSoup(contents, "html.parser")
        for item in soup.find_all("ix:nonfraction"):
            # Keep only tags whose name attribute is "<prefix>:AuditFeesExpenses"
            if re.match(r"^[^:]+:AuditFeesExpenses", item['name']):
                print(file, end="| ")
                print(item['name'], end="| ")
                print(item.get_text())

trade_spider()
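For anyone who prefers to collect the values rather than print them, the same two-step filter can be wrapped in a function that returns rows. This is a sketch under assumptions, not the code from the thread: `collect_audit_fees` and `AUDIT_FEES` are hypothetical names, and the directory is passed in instead of being set with os.chdir.

```python
import re
from pathlib import Path

from bs4 import BeautifulSoup

# Any prefix, then ":AuditFeesExpenses" (hypothetical helper constant).
AUDIT_FEES = re.compile(r"^[^:]+:AuditFeesExpenses$")


def collect_audit_fees(directory):
    """Return (file name, name attribute, text) for every matching tag."""
    rows = []
    for path in sorted(Path(directory).glob("*.html")):
        soup = BeautifulSoup(path.read_text(encoding="utf8"), "html.parser")
        # Step 1: every ix:nonFraction tag (html.parser lower-cases tag names).
        for item in soup.find_all("ix:nonfraction"):
            # Step 2: keep only the AuditFeesExpenses names; .get() avoids
            # a KeyError on tags that have no name attribute.
            if AUDIT_FEES.match(item.get("name", "")):
                rows.append((path.name, item["name"], item.get_text(strip=True)))
    return rows
```

Calling `collect_audit_fees(r"C:\Independent Auditors Report")` and printing each row with `print(*row, sep="| ")` would give one line per matching tag instead of one line per NonFraction tag.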


2016-05-10 17:59:54 SNygard

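I updated my question so you can see what I want to do with my code. –

Updated the answer with example code. I think the problem comes from the two different 'name' attributes. The final solution will probably need two steps: get all 'NonFraction' tags, then filter them to get all 'AuditFeesExpenses' names. – SNygard

This works almost perfectly, but Python now prints the name of every NonFraction tag in the document (about 100-200 per document). Is there a way to filter for "AuditFeesExpenses" only and, at the same time, tell Python to collect the text between the tags, >3,600<? If I can solve this, the code will work perfectly! –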
