2016-05-16 53 views

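I am trying to convert a PDF to a text file using scraperwiki and bs4, but I get a TypeError. I am very keen on Python and would really appreciate your help. The error is: TypeError: must be convertible to a buffer, not ResultSet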

Where the error occurs:

File "scraper_wiki_download.py", line 53, in write_file 
f.write(soup) 

Here is my code:

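# Assumed setup (not shown in the original post): `http` is taken to be a urllib3 PoolManager
import urllib3
import scraperwiki
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

# Get content, regardless of whether it is an HTML, XML or PDF file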
def send_Request(url):   
    response = http.urlopen('GET', url, preload_content=False) 
    return response 

# Use this to get the PDF and convert it to XML
def process_PDF(fileLocation): 
    pdfToProcess = send_Request(fileLocation) 
    pdfToObject = scraperwiki.pdftoxml(pdfToProcess.read()) 
    return pdfToObject 

# Returns a navigable tree, which you can iterate through
def parse_HTML_tree(contentToParse): 
    soup = BeautifulSoup(contentToParse, 'lxml') 
    return soup 

pdf = process_PDF('http://www.sfbos.org/Modules/ShowDocument.aspx?documentid=54790') 
pdfToSoup = parse_HTML_tree(pdf) 
soupToArray = pdfToSoup.findAll('text') 

def write_file(soup_array): 
    with open('test.txt', "wb") as f: 
        f.write(soup_array)

write_file(soupToArray) 

It would help to know which line throws the exception. – polku

Answers


I have not used scraperwiki before, but this gets the text:

import scraperwiki 
import requests 
from bs4 import BeautifulSoup 

pdf_xml = scraperwiki.pdftoxml(requests.get('http://www.sfbos.org/Modules/ShowDocument.aspx?documentid=54790').content) 
print(BeautifulSoup(pdf_xml, "lxml").find_all("text")) 

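I think soupToArray = pdfToSoup.findAll('text') returns some kind of list, but f.write() only works with strings, so you have to iterate over it and convert each element to a string in some way. Print soupToArray to see what it looks like.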
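The idea above can be sketched roughly as follows. This is a minimal illustration rather than the asker's actual fix: it assumes Python 3, uses a small made-up XML string (SAMPLE_XML) in place of real pdftoxml output, parses with the built-in "html.parser" instead of lxml, and the helper name elements_to_text and the temporary output path are hypothetical.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample standing in for the XML that scraperwiki.pdftoxml returns.
SAMPLE_XML = (
    '<pdf2xml><page number="1">'
    '<text top="10" left="20">First line</text>'
    '<text top="30" left="20">Second <b>line</b></text>'
    '</page></pdf2xml>'
)


def elements_to_text(elements):
    """Join the text of each element into one newline-separated string."""
    return "\n".join(el.get_text() for el in elements)


def write_file(elements, path):
    """Write the extracted text (a str, not the ResultSet itself) to path."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(elements_to_text(elements))


soup = BeautifulSoup(SAMPLE_XML, "html.parser")
text_elements = soup.find_all("text")  # a ResultSet, which behaves like a list

# Assumed output location for the demo; any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), "pdf_text_demo.txt")
write_file(text_elements, out_path)
```

Opening the file in text mode ("w") and passing a single string to f.write() avoids handing the ResultSet itself to write(), which is what the traceback complains about.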


It looks like you are right. Unfortunately, I am getting an empty list. It seems that pdfToSoup is not doing its job. – tonestrike
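One possible reason for an empty list, offered only as a guess and not confirmed anywhere in this thread, is that the downloaded bytes are not a PDF at all (an HTML error page, for instance). A quick check before conversion might look like the sketch below. The helper name looks_like_pdf and the sample byte strings are hypothetical; the check relies only on the fact that PDF files carry a "%PDF-" marker near the start.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if a PDF header marker appears near the start of the data."""
    return b"%PDF-" in data[:1024]


# Made-up byte strings for illustration; real data would come from the HTTP response body.
sample_pdf = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
sample_html = b"<!DOCTYPE html><html><body>Not found</body></html>"

print(looks_like_pdf(sample_pdf))
print(looks_like_pdf(sample_html))
```

If the check fails on the real response body, the problem lies in the download step and not in BeautifulSoup.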
