
BeautifulSoup - scraping a forum page

I'm trying to scrape a forum discussion and export it to a csv file, with rows such as "thread title", "user", and "post", where the latter is the actual forum post from each person.

I'm a complete beginner with Python and BeautifulSoup, so I'm really struggling with this!

My current problem is that all of the text in the csv file gets split into one character per row. Is there anyone out there who could give me a hand? It would be fantastic!

This is the code I have been using:

from bs4 import BeautifulSoup 
import csv 
import urllib2 

f = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0") 

soup = BeautifulSoup(f) 

b = soup.get_text().encode("utf-8").strip() #the posts contain non-ascii words, so I had to do this 

writer = csv.writer(open('silkroad.csv', 'w')) 
writer.writerows(b) 

I'm sure you already know this, but just in case you don't: .onion.to works fine for programs that access .onion sites, but you shouldn't just browse to them like that, because of the lack of security. –

Answer


OK, here goes. Not really sure what I'm helping you do here, but hopefully you have a good reason to be analyzing Silk Road posts.

There are a few problems here, the biggest of which is that you are not parsing the data at all. What you are essentially doing with .get_text() is going to the page, highlighting the entire thing, and then copying and pasting the whole lot into a csv file.
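
To see the symptom concretely: csv's writerows() expects an iterable of rows, and iterating a bare string yields one character at a time, which is exactly the one-character-per-row output you are getting. A minimal sketch (the string and the demo.csv file name are just for illustration):

import csv

# writerows() treats its argument as an iterable of rows;
# iterating a plain string yields single characters, so each
# character ends up as its own row
demo = open('demo.csv', 'w')
csv.writer(demo).writerows("hello")
demo.close()
# demo.csv now contains five rows: h, e, l, l, o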

So this is what you should try to do:

  1. Read in the page source
  2. Use soup to break it into the sections you want
  3. Save the sections in parallel arrays for author, date, time, post, etc.
  4. Write the data to the csv file row by row

I wrote some code to show you what that looks like; it should do the job:

from bs4 import BeautifulSoup 
import csv 
import urllib2 

# get page source and create a BeautifulSoup object based on it 
print "Reading page..." 
page = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0") 
soup = BeautifulSoup(page) 

# if you look at the HTML all the titles, dates, 
# and authors are stored inside of <dt ...> tags 
metaData = soup.find_all("dt") 

# likewise the post data is stored 
# under <dd ...> 
postData = soup.find_all("dd") 

# define where we will store info 
titles = [] 
authors = [] 
times = [] 
posts = [] 

# now we iterate through the metaData and parse it 
# into titles, authors, and dates 
print "Parsing data..." 
for html in metaData: 
    text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "") # convert the html to text 
    titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title: 
    authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by: 
    times.append(text.split(" on ")[1].strip()) # get date 

# now we go through the actual post data and extract it 
for post in postData: 
    posts.append(BeautifulSoup(str(post)).get_text().encode("utf-8").strip()) 

# now we write data to csv file 
# ***csv files MUST be opened with the 'b' flag*** 
csvfile = open('silkroad.csv', 'wb') 
writer = csv.writer(csvfile) 

# create template 
writer.writerow(["Time", "Author", "Title", "Post"]) 

# iterate through and write all the data 
for time, author, title, post in zip(times, authors, titles, posts): 
    writer.writerow([time, author, title, post]) 


# close file 
csvfile.close() 

# done 
print "Operation completed successfully." 

EDIT: included a solution that can read files from a directory and use the data from them

Okay, so you have your HTML files in a directory. You need to get a list of the files in the directory, iterate through them, and append the data from each file in the directory to your csv file.

That is the basic logic of our new program.

If we had a function called processData() that took a file path as an argument and appended the data from that file to our csv file, this is what it would look like:

# the directory where we have all our HTML files 
dir = "myDir" 

# our csv file 
csvFile = "silkroad.csv" 

# insert the column titles to csv 
csvfile = open(csvFile, 'wb') 
writer = csv.writer(csvfile) 
writer.writerow(["Time", "Author", "Title", "Post"]) 
csvfile.close() 

# get a list of files in the directory 
fileList = os.listdir(dir) 

# define variables we need for status text 
totalLen = len(fileList) 
count = 1 

# iterate through files and read all of them into the csv file 
for htmlFile in fileList: 
    path = os.path.join(dir, htmlFile) # get the file path 
    processData(path) # process the data in the file 
    print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status 
    count = count + 1 # increment counter 

As it happens, our processData() function is more or less what we did before, with a few changes.

So it is very similar to our last program, with a few small changes:

  1. We write the column headers as the very first thing
  2. We now open the csv with the 'ab' flag to append
  3. We import os to get the list of files

Here is what that looks like:

from bs4 import BeautifulSoup 
import csv 
import urllib2 
import os # added this import to process files/dirs 

# ** define our data processing function 
def processData(pageFile): 
    ''' take the data from an html file and append to our csv file ''' 
    f = open(pageFile, "r") 
    page = f.read() 
    f.close() 
    soup = BeautifulSoup(page) 

    # if you look at the HTML all the titles, dates, 
    # and authors are stored inside of <dt ...> tags 
    metaData = soup.find_all("dt") 

    # likewise the post data is stored 
    # under <dd ...> 
    postData = soup.find_all("dd") 

    # define where we will store info 
    titles = [] 
    authors = [] 
    times = [] 
    posts = [] 

    # now we iterate through the metaData and parse it 
    # into titles, authors, and dates 
    for html in metaData: 
     text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "") # convert the html to text 
     titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title: 
     authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by: 
     times.append(text.split(" on ")[1].strip()) # get date 

    # now we go through the actual post data and extract it 
    for post in postData: 
     posts.append(BeautifulSoup(str(post)).get_text().encode("utf-8").strip()) 

    # now we write data to csv file 
    # ***csv files MUST be opened with the 'b' flag*** 
    csvfile = open('silkroad.csv', 'ab') 
    writer = csv.writer(csvfile) 

    # iterate through and write all the data 
    for time, author, title, post in zip(times, authors, titles, posts): 
     writer.writerow([time, author, title, post]) 

    # close file 
    csvfile.close() 
# ** start our process of going through files 

# the directory where we have all our HTML files 
dir = "myDir" 

# our csv file 
csvFile = "silkroad.csv" 

# insert the column titles to csv 
csvfile = open(csvFile, 'wb') 
writer = csv.writer(csvfile) 
writer.writerow(["Time", "Author", "Title", "Post"]) 
csvfile.close() 

# get a list of files in the directory 
fileList = os.listdir(dir) 

# define variables we need for status text 
totalLen = len(fileList) 
count = 1 

# iterate through files and read all of them into the csv file 
for htmlFile in fileList: 
    path = os.path.join(dir, htmlFile) # get the file path 
    processData(path) # process the data in the file 
    print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status 
    count = count + 1 # increment counter 

Awesome! Thank you so much for your help! And don't worry, my intentions are good. :) – Isak


Do you happen to know how to do this with several locally stored files? I know how to scrape one file, but I have a folder containing 20k stripped-down HTML files like the one in the example (each file is <100kb). – Isak


@user3343907 I'm glad it helped! As for your question, it depends. If you have the original HTML files it should be easy! Instead of defining page from the urllib2 call, you would just open the file and read it into the variable page. On the other hand, if you used your original method (.get_text()), it won't work, because you will have stripped out all of the HTML tags that let us parse out the different values. –
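
For anyone following along, here is a minimal sketch of the swap that last comment describes (the file name saved_page.html is hypothetical; everything after the soup line stays the same as in the answer above):

from bs4 import BeautifulSoup

# read a locally saved copy of the page instead of fetching it with urllib2
f = open("saved_page.html", "r")
page = f.read()
f.close()

soup = BeautifulSoup(page) # from here on, the parsing code is unchanged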