2016-08-24

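Python 3.5: stripping HTML code when web scraping

I am scraping web content but am stuck on one problem. After a series of steps that narrow the results down to the range I want, I cannot strip the HTML code so that the items in the list show up as plain text. I have tried replace, re.compile and join (to turn the list into stripped text). None of them work: they are designed for strings, so on this list they either do nothing useful or raise errors at run time.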

Can anyone give me some hints on how to do this? For example, I want the output to be Instructor instead of the following:

<p class="course-d-title">Instructor</p> 

import tkinter as tk
import re

def test():
    from bs4 import BeautifulSoup
    import urllib.request
    from urllib.parse import urljoin

    # layer 0
    url_text = 'http://www.scs.cuhk.edu.hk/en/part-time/accounting-and-finance/accounting-and-finance/fundamental-accounting/162-610441-01'
    resp = urllib.request.urlopen(url_text)
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
    a = soup.find_all('p')

    # Drop every paragraph before the 'Instructor' one.
    k = 0
    for item in a[:]:
        if 'Instructor' in item:
            a = a[k:]
            break
        k += 1

    # Cut the list off just before the 'Enquiries' paragraph.
    j = 0
    for item in a[:]:
        if 'Enquiries' in item:
            a = a[:j-1]
            break
        j += 1

    # Each item is still a bs4 Tag, so this prints the HTML markup.
    for i in range(len(a)):
        print(a[i])

if __name__ == '__main__':
    test()

Answers

1

Use .text on the BS4 element to get its text:

>>> a = soup.find_all('p')
>>> data = [item for item in a if 'Instructor' in item]
>>> data
[<p class="course-d-title">Instructor</p>]

>>> data[0].text
'Instructor'
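
Applied to the whole list rather than a single element, the same idea can be sketched as below. The HTML snippet is a hypothetical stand-in for the course page (only the first paragraph comes from the question), used so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the course page, so the example runs offline.
SAMPLE_HTML = """
<p class="course-d-title">Instructor</p>
<p class="course-d-val">Ms. MACK Shui San</p>
<p class="course-d-title">Enquiries</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
paragraphs = soup.find_all("p")  # a list-like ResultSet of Tag objects

# get_text() (or the .text property) drops the markup and keeps the text.
# It is called on each Tag, not on the list, which is why string methods
# such as replace() fail when applied to the list itself.
plain = [p.get_text(strip=True) for p in paragraphs]
print(plain)
```

Here strip=True only trims surrounding whitespace; omit it if the original spacing matters.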
+0

Cool! You nailed it! –

+0

Thanks, good luck :) –

0

If you also want to extract the instructor's name and the enquiry phone number, the following code can help.

import requests
from bs4 import BeautifulSoup

url_text = 'http://www.scs.cuhk.edu.hk/en/part-time/accounting-and-finance/accounting-and-finance/fundamental-accounting/162-610441-01'
resp = requests.get(url_text)
soup = BeautifulSoup(resp.text, 'html.parser')
a = soup.find_all('p')
for i in a:
    if 'Instructor' in i:
        # find() returns the first element with class 'course-d-val' on the page
        print(i.get_text(), "Name is " + soup.find('p', {'class': 'course-d-val'}).get_text())
    elif 'Enquiries' in i:
        print(i.get_text(),
              "The number is " + soup.find('span', {'class': 'enq-phone'}).get_text(),
              "The Fax is " + soup.find('span', {'class': 'enq-fax'}).get_text())

This code prints the following output:

Instructor Name is Ms. MACK Shui San 
Enquiries The number is 3943 9046 The Fax is 2770 8275 
+0

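Hi, thanks a lot. I like your idea of searching by keyword. Could you explain further how it works in the 'Instructor' case? It looks like 'course-d-val' is not unique to the instructor; it occurs in more places. How does the system know you mean the 'course-d-val' that comes after Instructor? –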

+0

@CL.L It's simple: I am looking for the first 'course-d-val', because I use the find function rather than findAll. Basically, this kind of thing depends on the site's HTML. – thebadguy
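
To illustrate the point raised in this exchange, here is a minimal sketch of how find() picks the first match on the page, and how the lookup can instead be started from the label element. The HTML below is hypothetical (the 'Course Code' row is invented to put another 'course-d-val' ahead of the instructor) and may not reflect the real page's layout.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a 'course-d-val' appears BEFORE the instructor's one.
SAMPLE_HTML = """
<p class="course-d-title">Course Code</p>
<p class="course-d-val">162-610441-01</p>
<p class="course-d-title">Instructor</p>
<p class="course-d-val">Ms. MACK Shui San</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Searching from the top of the document returns the first match,
# whatever label happens to precede it.
first_val = soup.find("p", {"class": "course-d-val"}).get_text()

# Searching forward from the 'Instructor' label ties the value to that label.
label = soup.find("p", string="Instructor")
scoped_val = label.find_next("p", {"class": "course-d-val"}).get_text()

print(first_val)
print(scoped_val)
```

On a page where the instructor's value happens to be the first 'course-d-val', both lookups return the same text, which is consistent with the output shown in the answer.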

0

I found that the following code achieves the same thing in a simpler and more flexible way, while still allowing a keyword search:

from bs4 import BeautifulSoup
import requests

url_text = 'http://www.scs.cuhk.edu.hk/en/part-time/accounting-and-finance/accounting-and-finance/fundamental-accounting/162-610441-01'

resp = requests.get(url_text)
soup = BeautifulSoup(resp.text, 'html.parser')
# Find the text node 'Instructor', then print the text of the next <p> after it.
print(soup.find(string="Instructor").find_next('p').text)
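
One caveat with this one-liner is that find() returns None when the keyword is not on the page, so the chained call raises AttributeError. A small wrapper can guard against that. The helper name value_after and the sample HTML are hypothetical, introduced only for illustration; the sketch assumes the keyword matches a text node exactly.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the course page, so the example runs offline.
SAMPLE_HTML = """
<p class="course-d-title">Instructor</p>
<p class="course-d-val">Ms. MACK Shui San</p>
<p class="course-d-title">Enquiries</p>
<p><span class="enq-phone">3943 9046</span></p>
"""

def value_after(soup, keyword, tag="p"):
    """Return the text of the first `tag` element that follows the text
    node equal to `keyword`, or None if either one is missing."""
    node = soup.find(string=keyword)   # exact match on a text node
    if node is None:
        return None
    following = node.find_next(tag)    # next matching element in document order
    if following is None:
        return None
    return following.get_text(strip=True)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(value_after(soup, "Instructor"))
print(value_after(soup, "Enquiries"))
print(value_after(soup, "Tuition Fee"))  # keyword absent from the sample
```

Returning None for a missing keyword lets the caller decide whether to skip the field or report an error.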