Scraping text from multiple web pages in Python

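I have been tasked with scraping the text from every web page that one of our clients hosts. I have managed to write a script that scrapes the text from a single web page, and you can manually replace the URL in the code each time you want to scrape a different page. But obviously this is very inefficient. Ideally, I could have Python connect to some list containing all of the URLs I need, and it would iterate through the list and print all of the scraped text into a single CSV. I tried to write a "test" version of this by creating a list of 2 URLs and having my code scrape both of them. However, as you can see, my code only scrapes the most recent URL in the list and does not hold on to the first page it scraped. I think this is due to a flaw in my print statement, since it always overwrites itself. Is there a way to keep everything I have scraped stored somewhere until the loop has gone through the entire list, and then print everything?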

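Feel free to tear my code apart completely. I know nothing about programming languages. I just keep getting assigned these tasks and use Google to do the best I can.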

import urllib.request
import re 
from bs4 import BeautifulSoup 

data_file_name = 'C:\\Users\\confusedanalyst\\Desktop\\python_test.csv' 
urlTable = ['url1','url2'] 

def extractText(string): 
    page = urllib.request.urlopen(string) 
    soup = BeautifulSoup(page, 'html.parser') 

##Extracts all paragraph and header tags from the page as ResultSet objects
    text = soup.find_all("p") 
    headers1 = soup.find_all("h1") 
    headers2 = soup.find_all("h2") 
    headers3 = soup.find_all("h3") 

##Forces the ResultSet objects into str
    text = str(text) 
    headers1 = str(headers1) 
    headers2 = str(headers2) 
    headers3 = str(headers3) 

##Strips HTML tags and brackets from extracted strings 
    text = text.strip('[') 
    text = text.strip(']') 
    text = re.sub('<[^<]+?>', '', text) 

    headers1 = headers1.strip('[') 
    headers1 = headers1.strip(']') 
    headers1 = re.sub('<[^<]+?>', '', headers1) 

    headers2 = headers2.strip('[') 
    headers2 = headers2.strip(']') 
    headers2 = re.sub('<[^<]+?>', '', headers2) 

    headers3 = headers3.strip('[') 
    headers3 = headers3.strip(']') 
    headers3 = re.sub('<[^<]+?>', '', headers3) 

    print_to_file = open (data_file_name, 'w' , encoding = 'utf') 
    print_to_file.write(text + headers1 + headers2 + headers3) 
    print_to_file.close() 


for i in urlTable: 
    extractText (i)


2016-08-04 confusedanalyst

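Try this. Opening the file with 'w' truncates it and puts the pointer at the beginning of the file, so every call to extractText overwrites what the previous call wrote. You want the pointer at the end of the file, which is what append mode 'a' gives you: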

print_to_file = open (data_file_name, 'a' , encoding = 'utf')
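To see the difference between the two modes concretely, here is a minimal sketch that does not touch the network. The helper names `write_pages` and `read_back`, the placeholder page strings, and the temporary file names are all made up for illustration; the only point is that the file is reopened once per page, as in the question's loop.

```python
import os
import tempfile

def write_pages(path, pages, mode):
    # Reopen the file for every page, the way the loop in the question
    # calls open() once per URL.
    for text in pages:
        with open(path, mode, encoding='utf-8') as f:
            f.write(text + '\n')

def read_back(path):
    # Return the full contents of the file as one string.
    with open(path, encoding='utf-8') as f:
        return f.read()

pages = ['text from page one', 'text from page two']  # placeholder data

with tempfile.TemporaryDirectory() as tmp_dir:
    w_path = os.path.join(tmp_dir, 'w_mode.txt')
    a_path = os.path.join(tmp_dir, 'a_mode.txt')

    write_pages(w_path, pages, 'w')  # each open() truncates the file
    write_pages(a_path, pages, 'a')  # each open() continues at the end

    w_result = read_back(w_path)
    a_result = read_back(a_path)

print(repr(w_result))  # only the last page survives
print(repr(a_result))  # both pages are kept
```

One caveat with 'a': if you run the script again, the new text is added after whatever the previous run left in the file, so delete the old file first if you want a fresh result.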

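For future reference, here are all the different read and write modes. The descriptions are quoted from the C fopen(3) manual page; Python's open() accepts the same mode strings and they behave in much the same way.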

The argument mode points to a string beginning with one of the following 
sequences (Additional characters may follow these sequences.): 

``r'' Open text file for reading. The stream is positioned at the 
     beginning of the file. 

``r+'' Open for reading and writing. The stream is positioned at the 
     beginning of the file. 

``w'' Truncate file to zero length or create text file for writing. 
     The stream is positioned at the beginning of the file. 

``w+'' Open for reading and writing. The file is created if it does not 
     exist, otherwise it is truncated. The stream is positioned at 
     the beginning of the file. 

``a'' Open for writing. The file is created if it does not exist. The 
     stream is positioned at the end of the file. Subsequent writes 
     to the file will always end up at the then current end of file, 
     irrespective of any intervening fseek(3) or similar. 

``a+'' Open for reading and writing. The file is created if it does not
     exist. The stream is positioned at the end of the file. Subsequent
     writes to the file will always end up at the then current end of
     file, irrespective of any intervening fseek(3) or similar.
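The question also asks whether the scraped text can be held somewhere until the loop has finished and then written out in one go. One possible sketch, using the standard csv module, is shown here. The names `scrape_all`, `write_csv`, and `fake_extract` are hypothetical, the `url` and `text` column headers are an assumption about the desired layout, and `fake_extract` merely stands in for the question's real extraction logic so that the example runs without network access.

```python
import csv
import io

def scrape_all(urls, extract):
    # Call extract(url) for every URL and keep the results in memory
    # as (url, text) rows until the whole list has been processed.
    rows = []
    for url in urls:
        rows.append((url, extract(url)))
    return rows

def write_csv(out, rows):
    # Write a header line followed by one CSV row per scraped page.
    # csv.writer takes care of quoting commas and newlines in the text.
    writer = csv.writer(out)
    writer.writerow(['url', 'text'])
    writer.writerows(rows)

def fake_extract(url):
    # Stand-in for a real function that downloads and cleans one page.
    return 'text scraped from ' + url

rows = scrape_all(['url1', 'url2'], fake_extract)

# An in-memory buffer is used here so nothing is written to disk.
buffer = io.StringIO()
write_csv(buffer, rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to a real file instead of the buffer, open it once after the loop, for example with `open(data_file_name, 'w', newline='', encoding='utf-8')`, and pass the file object to `write_csv`. Because the file is opened only one time, 'w' is safe in that arrangement, and `newline=''` is what the csv module documentation recommends for files handed to `csv.writer`.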


2016-08-04 19:52:25

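Thank you so much! That is exactly what I was looking for. I think I can apply the same principle once I get the real list of URLs from the client. Thanks again! – confusedanalyst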
