2016-11-29 66 views

Scraping a line with BeautifulSoup

I'm a Python beginner, and what I'm trying to do is scrape a website with BeautifulSoup. Here is a small part of the page's HTML source:

<table class="swift" width="100%"> 
    <tr> 
    <th class="no">ID</th> 
    <th>Bank or Institution</th> 
    <th>City</th> 
    <th class="branch">Branch</th> 
    <th>Swift Code</th> 
    </tr> <tr> 
    <td align="center">101</td> 
    <td>BANK LEUMI ROMANIA S.A.</td> 
    <td>CONSTANTA</td> 
    <td>(CONSTANTA BRANCH)</td> 
    <td align="center"><a href="/romania/dafbro22cta/">DAFBRO22CTA</a></td> 
    </tr> 
    <tr> 
    <td align="center">102</td> 
    <td>BANK LEUMI ROMANIA S.A.</td> 
    <td>ORADEA</td> 
    <td>(ORADEA BRANCH)</td> 
    <td align="center"><a href="/romania/dafbro22ora/">DAFBRO22ORA</a></td> 
    </tr> 

I managed to extract the text, but the output looks like this:

ID 
Bank or Institution 
City 
Branch 
Swift Code 

101 
BANK LEUMI ROMANIA S.A. 
CONSTANTA 
(CONSTANTA BRANCH) 
DAFBRO22CTA 


102 
BANK LEUMI ROMANIA S.A. 
ORADEA 
(ORADEA BRANCH) 
DAFBRO22ORA 

What I actually want is this:

ID, Bank or Institution, City, Branch, Swift Code 

101, BANK LEUMI ROMANIA S.A., CONSTANTA, (CONSTANTA BRANCH), DAFBRO22CTA 

102, BANK LEUMI ROMANIA S.A., ORADEA, (ORADEA BRANCH), DAFBRO22ORA 

Here is my code:

import requests
from bs4 import BeautifulSoup

base_url = "https://www.theswiftcodes.com/"
nr = 0
page = 'page'
country = 'Romania'
while nr < 4:
    # Build the URL of the current results page
    url_country = base_url + country + '/' + page + "/" + str(nr) + "/"
    pages = requests.get(url_country)
    soup = BeautifulSoup(pages.text, 'html.parser')

    # Drop <script> tags so their contents do not end up in the text
    for script in soup.find_all('script'):
        script.extract()

    tabel = soup.find_all("table")
    text = ("".join([p.get_text() for p in tabel]))
    nr += 1
    print(text)

    file = open('swiftcodes.txt', 'a')
    file.write(text)
    file.close()

    file = open('swiftcodes.txt', 'r')
    for item in file:
        print(item)
    file.close()

Answers

2

This should do the trick:

from bs4 import BeautifulSoup

# Named `html` rather than `str` so the built-in str type is not shadowed
html = """<table class="swift" width="100%">
    <tr>
    <th class="no">ID</th>
    <th>Bank or Institution</th>
    <th>City</th>
    <th class="branch">Branch</th>
    <th>Swift Code</th>
    </tr> <tr>
    <td align="center">101</td>
    <td>BANK LEUMI ROMANIA S.A.</td>
    <td>CONSTANTA</td>
    <td>(CONSTANTA BRANCH)</td>
    <td align="center"><a href="/romania/dafbro22cta/">DAFBRO22CTA</a></td>
    </tr>
    <tr>
    <td align="center">102</td>
    <td>BANK LEUMI ROMANIA S.A.</td>
    <td>ORADEA</td>
    <td>(ORADEA BRANCH)</td>
    <td align="center"><a href="/romania/dafbro22ora/">DAFBRO22ORA</a></td>
    </tr>"""

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

for i in soup.find_all("tr"):
    result = ""
    for j in i.find_all("th"):  # find all the header tags
        result += j.text + ", "
    for j in i.find_all("td"):  # find the cell tags
        result += j.text + ", "
    print(result.rstrip(', '))  # drop the trailing ", "

Output:

ID, Bank or Institution, City, Branch, Swift Code 
101, BANK LEUMI ROMANIA S.A., CONSTANTA, (CONSTANTA BRANCH), DAFBRO22CTA 
102, BANK LEUMI ROMANIA S.A., ORADEA, (ORADEA BRANCH), DAFBRO22ORA 

Could you try updating it in my code? It's a bit hard to understand like this.


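Well, the code only does two things. It iterates over all the 'tr' tags, and inside each 'tr' tag it iterates over the 'th' or 'td' tags and stores their text in the 'result' variable. At the end of each 'tr' iteration it prints the result. 'rstrip' is just a string operation that removes the trailing comma.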


So your code should go between print(text) and file = open('swiftcodes.txt','a').
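
The two steps described in the comments above can be wrapped in a small helper, so that a page loop like the one in the question only has to call it on each page's HTML and write the returned lines to the file. This is a minimal sketch, not the answerer's exact code: the function name table_to_lines and the shortened sample HTML are assumptions made for illustration, and it uses ", ".join(...) in place of the manually built 'result' string.

```python
from bs4 import BeautifulSoup


def table_to_lines(html):
    """Return one comma-separated line per <tr> found in the HTML text."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for row in soup.find_all("tr"):
        # Header rows contain <th> cells, data rows contain <td> cells.
        cells = row.find_all(["th", "td"])
        texts = [cell.get_text(strip=True) for cell in cells]
        if texts:  # skip rows without any cells
            lines.append(", ".join(texts))
    return lines


# Hypothetical sample input, shortened from the table in the question.
sample = """<table class="swift">
<tr><th>ID</th><th>City</th><th>Swift Code</th></tr>
<tr><td>101</td><td>CONSTANTA</td><td><a href="/romania/dafbro22cta/">DAFBRO22CTA</a></td></tr>
</table>"""

for line in table_to_lines(sample):
    print(line)
```

In the question's loop, the get_text() join could then be replaced by something like text = "\n".join(table_to_lines(pages.text)) + "\n" before the text is written to swiftcodes.txt.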

0
from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.theswiftcodes.com/united-states/')
soup = BeautifulSoup(r.text, 'lxml')  # requires the lxml package
rows = soup.find(class_="swift").find_all('tr')
# The first row holds the column headers
headers = [th.text for th in rows[0].find_all('th')]
print(headers)
for row in rows[1:]:
    # colspan=False keeps only <td> cells that have no colspan attribute
    cell = [i.text for i in row.find_all('td', colspan=False)]
    print(cell)

Output:

['ID', 'Bank or Institution', 'City', 'Branch', 'Swift Code'] 
['1', '1ST CENTURY BANK, N.A.', 'LOS ANGELES,CA', '', 'CETYUS66'] 
['2', '1ST PMF BANCORP', 'LOS ANGELES,CA', '', 'PMFAUS66'] 
['3', '1ST PMF BANCORP', 'LOS ANGELES,CA', '', 'PMFAUS66HKG'] 
['4', '3M COMPANY', 'ST. PAUL,MN', '', 'MMMCUS44'] 
['5', 'ABACUS FEDERAL SAVINGS BANK', 'NEW YORK,NY', '', 'AFSBUS33'] 
[] 
['6', 'ABBEY NATIONAL TREASURY SERVICES LTD US BRANCH', 'STAMFORD,CT', '', 'ANTSUS33'] 
['7', 'ABBOTT LABORATORIES', 'ABBOTT PARK,IL', '', 'ABTTUS44'] 
['8', 'ABBVIE, INC.', 'CHICAGO,IL', '', 'ABBVUS44'] 
['9', 'ABEL/NOSER CORP', 'NEW YORK,NY', '', 'ABENUS3N']
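
Several of the cells above contain commas themselves (for example 'LOS ANGELES,CA'), so joining them with ", " would produce ambiguous lines. One option is to let Python's csv module quote such fields. The sketch below assumes the rows have already been collected as lists like the ones printed above. The helper name rows_to_csv is hypothetical, the sample rows are copied from the output, and empty lists (such as the [] in the output) are skipped.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of row lists to CSV text, skipping empty rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        if row:  # rows with no matching <td> cells come back as []
            writer.writerow(row)
    return buffer.getvalue()


# Sample rows copied from the output above.
rows = [
    ['ID', 'Bank or Institution', 'City', 'Branch', 'Swift Code'],
    ['1', '1ST CENTURY BANK, N.A.', 'LOS ANGELES,CA', '', 'CETYUS66'],
    [],
    ['4', '3M COMPANY', 'ST. PAUL,MN', '', 'MMMCUS44'],
]

print(rows_to_csv(rows))
```

Fields that contain a comma are wrapped in double quotes, so csv.reader can later split each line back into the original five columns.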