2015-10-19 25 views

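I am trying to build a list of posts from Scrapy crawl output for debugging purposes. I can't add multi-line text to the list as a single item.

Here is my code: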

import bisect
import re

post_list = []

with open('last_crawl_output.txt', 'r') as f:
    crawl_output = f.read()

# Find each 'referer' that indicates start of scrapy crawl AFTER initial crawl of search results page
referer_list = [m.start(0) for m in re.finditer("referer", crawl_output)]

# Find indicator of crawl finished.
closing_list = [m.start(0) for m in re.finditer("scrapy", crawl_output)]

# Drop the first 'referer', which belongs to the initial crawl.
del referer_list[0]

for pos1 in referer_list:
    # Index of the first 'scrapy' position after this referer.
    pos2_index = bisect.bisect(closing_list, pos1)
    # Get post from positions, ending 21 characters before the second 'scrapy'.
    pos2 = closing_list[pos2_index+1]
    post = crawl_output[pos1:pos2-21]

I also tried using post_list.append(post), and that did not work either.

Here is some sample output.
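For reference, here is a self-contained sketch of the slicing approach with the append included. The `crawl_output` string is a made-up stand-in for `last_crawl_output.txt` (real Scrapy logs look different), and it cuts at the second marker itself rather than 21 characters before it, since that offset depends on the real log format.

```python
import bisect
import re

# Hypothetical sample text standing in for last_crawl_output.txt.
crawl_output = (
    "referer: search page\n"
    "[scrapy] search results\n"
    "referer: post 1\n"
    "first line of post 1\n"
    "second line of post 1\n"
    "[scrapy] scraped item\n"
    "[scrapy] next request\n"
    "referer: post 2\n"
    "only line of post 2\n"
    "[scrapy] scraped item\n"
    "[scrapy] spider closed\n"
)

# Start offsets of every marker; [1:] skips the first 'referer'.
referer_list = [m.start() for m in re.finditer("referer", crawl_output)][1:]
closing_list = [m.start() for m in re.finditer("scrapy", crawl_output)]

post_list = []
for pos1 in referer_list:
    # Index of the first 'scrapy' offset that comes after this referer.
    first_after = bisect.bisect(closing_list, pos1)
    # The post ends at the second 'scrapy' after the referer.
    pos2 = closing_list[first_after + 1]
    post_list.append(crawl_output[pos1:pos2])

print(len(post_list))
```

In this sketch each appended post is a single list item, even though it spans several lines.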

What I want to add to post_list: here

Instead of one string, this is what I get. Here is post_list with the posts: output

When I use insert, the text gets split on \n.

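Can you provide an example of 'referer_list' and 'closing_list'? I'm also a bit confused about why you don't write a regular expression that finds the start and end indicators in one go (e.g. 'post_list = re.findall("referrer.*?scrapy", crawl_output)'). – Blckknght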

@Blckknght I'm a complete noob, so I just did this the way I know how. I have updated the question. Can a regular expression really do it in one line, like yours? – Manix

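I'm sure you can come up with a regular expression that matches what you're looking for, though I doubt the one I gave in my comment is it (it ends at the first 'scrapy' found after 'referrer', not the second). – Blckknght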
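A rough sketch of what such a regular expression could look like, assuming the posts really do run from a 'referer' to just before the second 'scrapy' that follows it. The sample text is invented, and the pattern is only one possibility. Note the `re.DOTALL` flag: without it `.` does not match newlines, so a multi-line post would never match.

```python
import re

# Hypothetical sample text; real Scrapy output will differ.
crawl_output = (
    "referer: search page\n"
    "[scrapy] search results\n"
    "referer: post 1\n"
    "first line of post 1\n"
    "[scrapy] scraped item\n"
    "[scrapy] next request\n"
    "referer: post 2\n"
    "only line of post 2\n"
    "[scrapy] scraped item\n"
    "[scrapy] spider closed\n"
)

# From 'referer', lazily past one 'scrapy', stopping right before the next one.
pattern = re.compile(r"referer.*?scrapy.*?(?=scrapy)", re.DOTALL)

# Begin searching just past the first 'referer' so the initial crawl is skipped.
start = crawl_output.find("referer") + len("referer")
post_list = pattern.findall(crawl_output, start)

for post in post_list:
    print(post)
    print("Next Post")
```

Unlike the bisect version, `findall` returns non-overlapping matches, so this only behaves the same way when each post contains exactly one 'referer'.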

Answers

0

I decided to work around the list problem my own way, like this:

# Splits post by newline, adds to list 
post_lines = post.split('\n') 

# Add the words "Next Post" to differentiate each post. 
post_lines.append('Next Post') 

# Print each line, and get perfect formatting. 
for line in post_lines: 
    print(line)
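One likely reason this looks better, offered as a guess since the original output is not shown: printing a whole list displays the repr of each string, where line breaks appear as a literal backslash followed by n, while printing a single string renders the line breaks. A minimal sketch with a made-up post:

```python
# Hypothetical multi-line post, stored as ONE list item.
post = "referer: post 1\nfirst line\nsecond line"
post_list = [post]

# What print(post_list) displays: the list repr, with escaped newlines.
as_list = str(post_list)
# What print(post_list[0]) displays: the text with real line breaks.
as_item = str(post_list[0])

print(len(post_list))
print(as_list)
print(as_item)
```

If that is what happened, the string was stored correctly all along, and looping over post_list and printing each item would give the same formatting without splitting on '\n' first.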
0

A better solution is to add the posts to a dictionary. This keeps the formatting and uses less code.

post_count = 0 
post_dict = {} 

for pos1 in referer_list: 

    post_count += 1 

    pos2_index = bisect.bisect(closing_list, pos1) 
    pos2 = closing_list[pos2_index+1] 

    post = crawl_output[pos1:pos2-21] 

    post_dict[post_count] = post 
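As a small variation, the counter can be left to `enumerate`. This sketch assumes the posts have already been extracted into a list. The `posts` values are invented for illustration.

```python
# Hypothetical posts, already sliced out of the crawl output.
posts = [
    "referer: post 1\nline a",
    "referer: post 2\nline b",
]

# Number the posts from 1, like post_count in the code above.
post_dict = dict(enumerate(posts, start=1))

for number, post in sorted(post_dict.items()):
    print(number)
    print(post)
```

The formatting is kept here because each value is printed on its own, not because of the dictionary itself. A plain list printed item by item would behave the same way.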