Scraping proxy IPs from a web page using BeautifulSoup


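Below is my code. I am trying to scrape proxies from a website. It writes the proxies to a text file, but after writing the last proxy it also raises the error shown below. What am I doing wrong here?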

import io
import requests
from bs4 import BeautifulSoup

def new():
    url = 'https://free-proxy-list.net'
    page = requests.get(url)
    # Turn the HTML into a Beautiful Soup object
    soup = BeautifulSoup(page.text, 'lxml')

    with io.open('C:\\Users\\Desktop\\' + 'proxy.txt', 'a', encoding='utf8') as logfile:
        for tr in soup.find_all('tr')[1:]:
            tds = tr.find_all('td')
            logfile.write(u"%s:%s\n" % (tds[0].text, tds[1].text))
            print(u"\n%s:%s\n" % (tds[0].text, tds[1].text))

Error:

Traceback (most recent call last):
  File "C:\Users\AppData\Local\Programs\Python\Python36\yt bot.py", line 69, in <module>
    new()
  File "C:\Users\AppData\Local\Programs\Python\Python36\yt bot.py", line 64, in new
    logfile.write(u"%s:%s\n"%(tds[0].text,tds[1].text))
IndexError: list index out of range


2017-08-27 Rahul

The last tr evidently contains fewer than 2 tds ... –

Why the downvote? I only asked this question after trying to solve it myself. – Rahul

It looks like the last <tr> on the page is something like this:

<tr> 
    <th class="input"> 
     <input type="text" /> 
    </th> 
    ... 
</tr>

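It has nothing to do with the other entries and contains no <td> cells, so tds[0] raises the IndexError. You can simply skip it: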

for tr in soup.find_all('tr')[1:-1]: 
    ...
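To see why the slice helps, here is a minimal sketch run against a small inline table. The HTML in SAMPLE_HTML is a hypothetical stand-in for the real page (the addresses and the exact layout are assumptions), and it uses the built-in 'html.parser' rather than 'lxml' so it runs without extra dependencies beyond bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the table on the real page.
SAMPLE_HTML = """
<table>
  <tr><th>IP Address</th><th>Port</th></tr>
  <tr><td>10.0.0.1</td><td>8080</td></tr>
  <tr><td>10.0.0.2</td><td>3128</td></tr>
  <tr><th class="input"><input type="text" /></th></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = soup.find_all('tr')

# Count the <td> cells in each row: the header and footer rows have none,
# which is what makes tds[0] fail on them.
td_counts = [len(tr.find_all('td')) for tr in rows]

# Dropping the first and last rows leaves only the data rows.
proxies = []
for tr in rows[1:-1]:
    tds = tr.find_all('td')
    proxies.append("%s:%s" % (tds[0].text, tds[1].text))

print(td_counts)
print(proxies)
```

Note that the slice only works as long as exactly one trailing row is the odd one out; if the page layout changes, it may skip a real entry or let a bad row through.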

Alternatively, wrapping the loop body in a try-except block that catches the IndexError also works:

with io.open(...) as logfile:
    for tr in soup.find_all('tr'):
        try:
            tds = tr.find_all('td')
            logfile.write(u"%s:%s\n" % (tds[0].text, tds[1].text))
        except IndexError:
            pass
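A variation on the same idea, not from the original answer, is to check the number of cells explicitly instead of catching the exception. The sketch below assumes a hypothetical helper named extract_proxies and a made-up sample table; it uses 'html.parser' so it runs without lxml installed.

```python
from bs4 import BeautifulSoup

def extract_proxies(html):
    """Return 'ip:port' strings for every row with at least two <td> cells.

    Hypothetical helper for illustration; rows without enough cells
    (header rows, filter rows) are skipped rather than raising IndexError.
    """
    soup = BeautifulSoup(html, 'html.parser')
    proxies = []
    for tr in soup.find_all('tr'):
        tds = tr.find_all('td')
        if len(tds) < 2:
            continue  # not a data row
        proxies.append("%s:%s" % (tds[0].text.strip(), tds[1].text.strip()))
    return proxies

# Made-up sample input standing in for the real page.
sample = """
<table>
  <tr><th>IP Address</th><th>Port</th></tr>
  <tr><td>10.0.0.1</td><td>8080</td></tr>
  <tr><th class="input"><input type="text" /></th></tr>
  <tr><td>10.0.0.2</td><td>3128</td></tr>
</table>
"""
result = extract_proxies(sample)
print(result)
```

Unlike the slice, this does not depend on where the odd rows appear in the table, and unlike a broad try-except it cannot hide an unrelated IndexError.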

Also, I suggest using os.path.join to build the file path:

os.path.join('C:\\Users\\Desktop\\', 'proxy.txt')

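This is safer than plain concatenation with +, because os.path.join inserts the path separator only when it is missing.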
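A short sketch of the difference follows. It uses ntpath, the Windows implementation of os.path, so that the result is the same on any operating system; on Windows itself os.path.join behaves identically. The folder value is the path from the question with the trailing backslash deliberately left off.

```python
import ntpath  # Windows flavour of os.path, importable on any platform

folder = 'C:\\Users\\Desktop'  # note: no trailing backslash

# Plain concatenation silently glues the folder and file names together.
concatenated = folder + 'proxy.txt'

# join() adds the separator when it is missing...
joined = ntpath.join(folder, 'proxy.txt')

# ...and does not duplicate it when it is already there.
joined_trailing = ntpath.join(folder + '\\', 'proxy.txt')

print(concatenated)
print(joined)
print(joined_trailing)
```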


2017-08-27 06:58:37
