Python 3.5: Stripping HTML tags

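I am scraping a web page and am stuck on one problem. After a series of steps that narrow the results down to the range I want, I cannot strip the HTML tags so that the list shows plain text. I have tried replace, re.compile and join (the last one to turn the list into a string I could strip). None of these work: they are designed for strings, not for the elements in the list, or they raise an error at run time.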

Can anyone give me some hints on how to do this? For example, in the code below I want the output to change from

<p class="course-d-title">Instructor</p>

to Instructor.

import tkinter as tk 
import re 

def test(): 
    from bs4 import BeautifulSoup 
    import urllib.request 
    from urllib.parse import urljoin 

    '''for layer 0'''   
    url_text = 'http://www.scs.cuhk.edu.hk/en/part-time/accounting-and-finance/accounting-and-finance/fundamental-accounting/162-610441-01' 
    resp = urllib.request.urlopen(url_text) 
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset')) 
    a = soup.find_all('p') 

    k = 0
    for item in a[:]:
        if 'Instructor' in item:
            a = a[k:]
            break
        k += 1

    j = 0
    for item in a[:]:
        if 'Enquiries' in item:
            a = a[:j-1]
            break
        j += 1

    for i in range(len(a)):
        print(a[i])

if __name__ == '__main__': 
    test()

Source

2016-08-24 CL. L

Use .text on the BS4 element:

>>> a = soup.find_all('p')
>>> data = [item for item in a if 'Instructor' in item]
>>> data
[<p class="course-d-title">Instructor</p>]

>>> data[0].text
'Instructor'
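
To turn the whole list into plain text, as the question asks, the same idea can be applied to every element. This is a minimal sketch, assuming a small inline HTML snippet (made up here to stand in for the scraped page) and the built-in html.parser backend:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the scraped page; only the class
# names and the instructor's name come from the thread.
html = """
<p class="course-d-title">Instructor</p>
<p class="course-d-val">Ms. MACK Shui San</p>
<p class="course-d-title">Enquiries</p>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() returns the text inside each tag; strip=True trims whitespace.
plain = [p.get_text(strip=True) for p in soup.find_all("p")]
print(plain)
```

The list elements are Tag objects rather than strings, which is why string methods such as replace do not apply to them directly.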

Source

2016-08-24 07:13:48

Cool! You nailed it!

Thanks, and good luck :)

If you also want to extract the instructor's name and the enquiries phone number, the following code may help.

import requests
from bs4 import BeautifulSoup

url_text = 'http://www.scs.cuhk.edu.hk/en/part-time/accounting-and-finance/accounting-and-finance/fundamental-accounting/162-610441-01'
resp = requests.get(url_text)
soup = BeautifulSoup(resp.text, 'html.parser')
a = soup.find_all('p')
for i in a:
    if 'Instructor' in i:
        # find() returns the first matching element on the page
        print(i.get_text(), "Name is " + soup.find('p', {'class': 'course-d-val'}).get_text())
    elif 'Enquiries' in i:
        print(i.get_text(), "The number is " + soup.find('span', {'class': 'enq-phone'}).get_text(),
              "The Fax is " + soup.find('span', {'class': 'enq-fax'}).get_text())

This code prints the following output:

Instructor Name is Ms. MACK Shui San 
Enquiries The number is 3943 9046 The Fax is 2770 8275

Source

2016-08-24 13:37:47 thebadguy

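Hi, thanks a lot. I like your idea of searching by keyword. Could you explain a bit more how this works in the 'Instructor' case? It looks like 'course-d-val' is not unique to the instructor; it occurs in several other places. How does the system know you mean the 'course-d-val' that comes after Instructor?

@CL.L It's simple: I am picking up the first 'course-d-val', because I use the find function rather than findAll. Basically, this kind of thing depends on the site's HTML. – thebadguy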
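
The difference between find and find_all described in this comment can be seen on a small example. This is only a sketch: the markup below is hypothetical (the "Course Code" row is invented for illustration), and it assumes the html.parser backend.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two title/value pairs that share the same classes.
html = """
<p class="course-d-title">Course Code</p>
<p class="course-d-val">162-610441-01</p>
<p class="course-d-title">Instructor</p>
<p class="course-d-val">Ms. MACK Shui San</p>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match in document order...
first_val = soup.find("p", class_="course-d-val").get_text()

# ...while find_all() returns every match.
all_vals = [p.get_text() for p in soup.find_all("p", class_="course-d-val")]

print(first_val)
print(all_vals)
```

With markup like this, find would return the course code rather than the instructor, so the approach only works when the wanted value happens to be the first 'course-d-val' on the page.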

I found that the code below achieves the same goal in a simpler and more flexible way, while still allowing a keyword search:

from bs4 import BeautifulSoup 
import requests 

url_text = 'http://www.scs.cuhk.edu.hk/en/part-time/accounting-and-finance/accounting-and-finance/fundamental-accounting/162-610441-01' 

resp = requests.get(url_text) 
soup = BeautifulSoup(resp.text, 'html.parser')  
# find the text node "Instructor", then take the text of the next <p> after it
print(soup.find(string="Instructor").find_next('p').text)
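
Note that find returns None when the keyword is not on the page, so the one-liner above would then raise an AttributeError. A possible guard is sketched here; the helper name value_after and the inline snippet are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup


def value_after(soup, keyword, tag="p"):
    """Return the text of the first <tag> following `keyword`, or None."""
    label = soup.find(string=keyword)  # exact match on a text node
    if label is None:
        return None
    nxt = label.find_next(tag)
    return nxt.get_text(strip=True) if nxt is not None else None


# Hypothetical snippet standing in for the real page.
html = """
<p class="course-d-title">Instructor</p>
<p class="course-d-val">Ms. MACK Shui San</p>
"""
soup = BeautifulSoup(html, "html.parser")

print(value_after(soup, "Instructor"))
print(value_after(soup, "Venue"))  # keyword absent, so this is None
```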

Source

2016-08-25 03:18:13
