2013-11-20 50 views

I am using the following code to scrape data from a website, using BS4 (BeautifulSoup 4) in Python.

from bs4 import BeautifulSoup
import urllib2
import re
for i in xrange(1,461,10):
    try:
        page = urllib2.urlopen("http://cms.onlinedemos.in/directory.php?click=n&startline={}#lst".format(i))
    except urllib2.HTTPError:
        continue
    else:
        pass
    finally:
        soup = BeautifulSoup(page)
        td1=soup.findAll('td', {'class':'comtext'})
        td2 = soup.findAll('td',{'class':'comuser'})
        td3 = soup.findAll('td',{'class':'com'})
        for td1s, td2s, td3s in zip(td1,td2,td3):
            data = [re.sub('\s+', '', text).strip().encode('utf8') for text in td1s.find_all(text=True) + td2s.find_all(text=True) + td3s.find_all(text=True) if text.strip()]
            print ','.join(data)

My output is:

A.T.E.EnterprisesPvt.Ltd.,,AnujBhagwati 
A.T.E.Pvt.Ltd.,,AtulBhagwati 
AalidhraTextileEngineersLtd.,,HansrajGondalia,Mumbai 
AarBeeAssociates,Mr.Gopalsamy,022-22872245 
ABCarterIndiaPvt.Ltd.,,B.B.Shetty,[email protected] 
ABCCorporation,MittalPatel,Mumbai 
ABCIndustrialFasteners,S.R.Sheth,022-22872245 

But it should look like this:

A.T.E. Enterprises Pvt. Ltd., Anuj Bhagwati Mumbai 022-22872245 [email protected]
A.T.E. Pvt. Ltd., Atul Bhagwati Mumbai 022-22872245 [email protected]
Aalidhra Textile Engineers Ltd., Hansraj Gondalia Surat 0261-2279520/30/40 [email protected]
Aar Bee Associates Mr. Gopalsamy Coimbatore 0422-2236250/2238560 [email protected]

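As you can see, the first row's values (Mumbai 022-22872245 [email protected]) end up spread across the third, fourth and fifth rows, and the same drift continues for every row after that. I don't know where I went wrong.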

Do you need tab-separated columns? – itdxer

Comma-separated. –

Answers

1

@VooDooNOFX is right. Modify your code accordingly; try something like this:

from bs4 import BeautifulSoup
import urllib2
import re

for i in xrange(1, 461, 10):
    try:
        page = urllib2.urlopen("http://cms.onlinedemos.in/directory.php?click=n&startline={}#lst".format(i))
    except urllib2.HTTPError:
        # Skip pages that fail to load instead of parsing a stale or undefined page
        continue
    soup = BeautifulSoup(page)
    td1 = soup.findAll('td', {'class': 'comtext'})
    td2 = soup.findAll('td', {'class': 'comuser'})
    # Every table row contains three cells with class 'com', all returned in one flat list
    td345 = soup.findAll('td', {'class': 'com'})
    # For td3, td4 and td5, use slicing: s[i:j:k] is the slice of s from i to j with step k
    td3 = td345[0::3]
    td4 = td345[1::3]
    td5 = td345[2::3]
    for td1s, td2s, td3s, td4s, td5s in zip(td1, td2, td3, td4, td5):
        # Collapse whitespace runs to one space and drop commas so they cannot clash with the separator
        data = [re.sub(r'\s+', ' ', text).strip().encode('utf8').replace(",", "") for text in td1s.find_all(text=True) + td2s.find_all(text=True) + td3s.find_all(text=True) + td4s.find_all(text=True) + td5s.find_all(text=True) if text.strip()]
        print ', '.join(data)

Output for the first page:

A.T.E. Enterprises Pvt. Ltd., Anuj Bhagwati, Mumbai, 022-22872245, [email protected] 
A.T.E. Pvt. Ltd., Atul Bhagwati, Mumbai, 022-22872245, [email protected] 
Aalidhra Textile Engineers Ltd., Hansraj Gondalia, Surat, 0261-2279520/30/40, [email protected] 
Aar Bee Associates, Mr. Gopalsamy, Coimbatore, 0422-2236250/2238560, [email protected] 
AB Carter India Pvt. Ltd., B.B. Shetty, Mumbai, 022-66662961/62, [email protected] 
ABC Corporation, Mittal Patel, Ahmedabad, 079-40068999/26582333, [email protected] 
ABC Industrial Fasteners, S.R. Sheth, Mumbai, 022-28470806/66923987, [email protected] 
Abhishek Enterprises, N.C. Jain, Bhilwara, 01482-264250, [email protected] 
Accurate Trans Heat Pvt. Ltd., Kedarmal Dargar, Surat, 0261-2397268, [email protected] 
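
The code above removes commas from every field with replace(",", "") so that they cannot be confused with the separator. A possible alternative, shown here only as a rough sketch, is to let the standard csv module quote such fields instead. The sketch assumes Python 3 (the thread's code is Python 2); rows_to_csv is a hypothetical helper name, and the second row is made-up sample data that exists only to show the quoting.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows as CSV text; fields containing a comma get quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    ["A.T.E. Enterprises Pvt. Ltd.", "Anuj Bhagwati", "Mumbai", "022-22872245"],
    # Hypothetical row with a comma inside the first field
    ["Example Traders, Inc.", "Jane Doe", "Pune", "020-00000000"],
]

csv_text = rows_to_csv(rows)
print(csv_text)
```

Reading csv_text back with csv.reader should return the original fields, comma included, so nothing has to be stripped from the scraped text.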
2

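Looking at this page's HTML, each row has 3 columns with the class com. Zipping one list of 10 items with another list of 10 items and a third list of 30 items produces exactly the kind of output you are getting: zip pairs row i with the i-th com cell rather than with that row's own three cells.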

>>> len(td3) 
30 
>>> td3[0:3] 
[<td class="com" width="100"></td>, <td class="com" width="160"></td>, <td class="com" width="185"></td>] 
>>> td3[3:6] 
[<td class="com" width="100">Mumbai</td>, <td class="com" width="160">022-22872245</td>, <td class="com" width="185">[email protected]</td>]
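
The mismatch described above can be reproduced without any network access. The following is a minimal sketch using plain lists as stand-ins for the findAll results; the variable names and the example.com addresses are placeholders chosen for illustration and are not taken from the page.

```python
# Stand-ins for the per-row cells (one 'comtext' and one 'comuser' cell per row)
names = ["A.T.E. Enterprises Pvt. Ltd.", "A.T.E. Pvt. Ltd.", "Aalidhra Textile Engineers Ltd."]
users = ["Anuj Bhagwati", "Atul Bhagwati", "Hansraj Gondalia"]

# Three 'com' cells per row, flattened in document order (emails are placeholders)
com = [
    "Mumbai", "022-22872245", "a@example.com",
    "Mumbai", "022-22872245", "b@example.com",
    "Surat", "0261-2279520/30/40", "c@example.com",
]

# Zipping directly pairs row i with com[i], so one row's cells drift across several rows
misaligned = list(zip(names, users, com))

# Slicing with step 3 splits the flat list into one list per column
cities, phones, emails = com[0::3], com[1::3], com[2::3]
aligned = list(zip(names, users, cities, phones, emails))

for row in aligned:
    print(", ".join(row))
```

In misaligned, the second row is paired with the first row's phone number and the third row with the first row's email, which matches the drift seen in the question. After slicing, each tuple in aligned holds the five cells of a single row.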