I am using the following Python code with BS4 (Beautiful Soup 4) to scrape data from a website:
from bs4 import BeautifulSoup
import urllib2
import re

for i in xrange(1,461,10):
    try:
        page = urllib2.urlopen("http://cms.onlinedemos.in/directory.php?click=n&startline={}#lst".format(i))
    except urllib2.HTTPError:
        continue
    else:
        pass
    finally:
        soup = BeautifulSoup(page)
        td1 = soup.findAll('td', {'class':'comtext'})
        td2 = soup.findAll('td',{'class':'comuser'})
        td3 = soup.findAll('td',{'class':'com'})
        for td1s, td2s, td3s in zip(td1,td2,td3):
            data = [re.sub('\s+', '', text).strip().encode('utf8') for text in td1s.find_all(text=True) + td2s.find_all(text=True) + td3s.find_all(text=True) if text.strip()]
            print ','.join(data)
My output is:
A.T.E.EnterprisesPvt.Ltd.,,AnujBhagwati
A.T.E.Pvt.Ltd.,,AtulBhagwati
AalidhraTextileEngineersLtd.,,HansrajGondalia,Mumbai
AarBeeAssociates,Mr.Gopalsamy,022-22872245
ABCarterIndiaPvt.Ltd.,,B.B.Shetty,[email protected]
ABCCorporation,MittalPatel,Mumbai
ABCIndustrialFasteners,S.R.Sheth,022-22872245
But it should look like this:
A.T.E. Enterprises Pvt. Ltd., Anuj Bhagwati Mumbai 022-22872245 [email protected]
A.T.E. Pvt. Ltd., Atul Bhagwati Mumbai 022-22872245 [email protected]
Aalidhra Textile Engineers Ltd., Hansraj Gondalia Surat 0261-2279520/30/40 [email protected]
Aar Bee Associates Mr. Gopalsamy Coimbatore 0422-2236250/2238560 [email protected]
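So as you can see, the first row's values, Mumbai 022-22872245 [email protected],
start falling into the third, fourth, and fifth rows respectively, and the same drift continues for every row after that. I know I am going wrong somewhere.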
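The drift described above is consistent with each table row holding several `com` cells (city, phone, email) but only one `comtext` and one `comuser` cell. In that case `zip` would pair the i-th `com` cell on the page with the i-th row, instead of pairing each row with its own cells. The following is a minimal sketch of that effect using plain lists. The cell layout, the count of three `com` cells per row, and the `<email N>` placeholders are assumptions made for illustration, not details confirmed by the page.

```python
# Hypothetical sample data modelled on the rows in the question.
# Assumption: every row has one name cell, one contact cell and three 'com' cells.
names = [
    "A.T.E. Enterprises Pvt. Ltd.",
    "A.T.E. Pvt. Ltd.",
    "Aalidhra Textile Engineers Ltd.",
]
contacts = ["Anuj Bhagwati", "Atul Bhagwati", "Hansraj Gondalia"]

# A page-wide search returns every matching cell as one flat list.
com_cells = [
    "Mumbai", "022-22872245", "<email 1>",
    "Mumbai", "022-22872245", "<email 2>",
    "Surat", "0261-2279520/30/40", "<email 3>",
]

# zip pairs the i-th element of each list, so row i only receives com_cells[i]:
# the first row's phone lands on row 2 and its email on row 3.
drifted = [list(t) for t in zip(names, contacts, com_cells)]

CELLS_PER_ROW = 3  # assumption about the table layout


def chunk(cells, size):
    """Split a flat list of cells into consecutive groups of `size`."""
    return [cells[i:i + size] for i in range(0, len(cells), size)]


# Grouping the flat list per row before zipping keeps each row's cells together.
aligned = [
    [name, contact] + extra
    for name, contact, extra in zip(names, contacts, chunk(com_cells, CELLS_PER_ROW))
]

for row in aligned:
    print(", ".join(row))
```

In the scraper itself, a comparable approach would probably be to loop over each table row and search for the cells inside that row rather than across the whole page, assuming the page puts one company per `tr`.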
Do you need tab-separated columns? – itdxer
Comma-separated. –
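For comma-separated output, one thing to watch is that some company names in the sample appear to end with a comma themselves ("A.T.E. Enterprises Pvt. Ltd.,"), so joining fields with `','.join(...)` produces an ambiguous extra separator. A small sketch using the standard `csv` module, which quotes such fields, is shown here. It is written for Python 3, and the `rows` list is hypothetical sample data taken from the expected output in the question.

```python
import csv
import io

# Hypothetical rows, already grouped per company (values copied from the question).
rows = [
    ["A.T.E. Enterprises Pvt. Ltd.,", "Anuj Bhagwati", "Mumbai", "022-22872245"],
    ["Aar Bee Associates", "Mr. Gopalsamy", "Coimbatore", "0422-2236250/2238560"],
]

# Write to an in-memory buffer; a real script could pass an open file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)

# Fields that contain the delimiter are wrapped in double quotes by default.
output = buffer.getvalue()
print(output)
```

Reading the result back with `csv.reader` would recover the original four fields per row, including the trailing comma in the first name.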