2017-06-03 147 views

News website crawler not working?

Hello, I'm a Python newbie. Sorry for asking such a specific question, but I can't tell what is wrong.

I am trying to crawl news articles from a Korean news site. When I run this code

    import sys
    from bs4 import BeautifulSoup
    import urllib.request
    from urllib.parse import quote

    target_url_b4_pn="http://news.donga.com/search?p="
    target_url_b4_keyword='&query='

    target_url_rest="&check_news1&more=1&sorting1&search_date1&v1=&v2=&range=1"


    def get_text(URL, output_file):
        source_code_from_URL=urllib.request.urlopen(URL)
        soup=BeautifulSoup(source_code_from_URL, 'lxml', from_encoding='UTF-8')
        content_of_article=soup.select('div.article')
        for item in content_of_article:
            string_item=str(item.find_all(text=True))
            output_file.write(string_item)

    def get_link_from_news_title(page_num, URL, output_file):
        for i in range(page_num):
            current_page_num=1+i*15
            position=URL.index('=')
            URL_with_page_num=URL[:position+1]+str(current_page_num)+URL[position+1:]
            source_code_from_URL=urllib.request.urlopen(URL_with_page_num)
            soup=BeautifulSoup(source_code_from_URL, 'lxml',from_encoding='UTF-8')

            for title in soup.find_all('p','tit'):
                title_link=title.select('a')
                article_URL=title_link[0]['href']
                get_text(article_URL, output_file)

    def main():
        keyword="노무현"
        page_num=1
        output_file_name="output.txt"
        target_url=target_url_b4_pn+target_url_b4_keyword+quote(keyword)+target_url_rest
        output_file=open(output_file_name, "w", -1, "utf-8")
        get_link_from_news_title(page_num, target_url, output_file)
        output_file.close()


    if __name__=='__main__':
        main()
    print(target_url)
    print(11111)

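the Jupyter notebook stops responding to input, and the cells below it produce no output even for a simple command (nothing is printed).

I think the code is freezing somehow. Could you tell me where it might be going wrong?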

[Screenshot: the notebook where it is not responding]

Answer

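  1. In the first line of get_text, urllib.request.urlopen(URL) only opens the URL. As with a file you have opened, you still have to read it.
    So add a read() after it:
    urllib.request.urlopen(URL).read()
    Otherwise you are handing BeautifulSoup a response object rather than the HTML itself.

  2. Your CSS selector soup.select('div.article') matches nothing, because the page has no such element. I guess what you want is soup.select('div.article_txt'), which matches the paragraphs of the article.

  3. print(target_url) should go inside your main function, because target_url is only defined in main.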
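
To illustrate point 1, here is a minimal sketch of the difference between opening a URL and reading it. The helper name fetch_bytes and the data: URL are assumptions made for illustration; the data: URL stands in for a real article page so the snippet runs without network access.

```python
import urllib.request
from urllib.parse import quote

# Hypothetical stand-in for an article page: a data: URL that carries
# a small HTML snippet, so no network access is needed.
SAMPLE_URL = "data:text/html," + quote("<p>hello</p>")


def fetch_bytes(url):
    """Open the URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url) as response:
        # urlopen() returns a response object; read() returns its content.
        return response.read()


html = fetch_bytes(SAMPLE_URL)
print(type(html).__name__)   # the type of what read() returned
print(html.decode("utf-8"))  # the decoded HTML text
```

The bytes returned here are what you would pass to BeautifulSoup in get_text.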

Code

import sys
from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import quote

target_url_b4_pn="http://news.donga.com/search?p="
target_url_b4_keyword='&query='

target_url_rest="&check_news1&more=1&sorting1&search_date1&v1=&v2=&range=1"


def get_text(URL, output_file):
    # read() returns the raw bytes of the page (point 1)
    source_code_from_URL=urllib.request.urlopen(URL).read()
    soup=BeautifulSoup(source_code_from_URL, 'lxml', from_encoding='UTF-8')
    # changed the CSS selector so that it matches an element on the page (point 2)
    content_of_article=soup.select('div.article_txt')
    for item in content_of_article:
        string_item=item.find_all(text=True)
        # join the text nodes and write them to the file
        output_file.write(" ".join(string_item))

def get_link_from_news_title(page_num, URL, output_file):
    for i in range(page_num):
        # the search results are paged in steps of 15: p=1, 16, 31, ...
        current_page_num=1+i*15
        # insert the page number right after the first '=' ("...search?p=")
        position=URL.index('=')
        URL_with_page_num=URL[:position+1]+str(current_page_num)+URL[position+1:]
        source_code_from_URL=urllib.request.urlopen(URL_with_page_num).read()
        soup=BeautifulSoup(source_code_from_URL, 'lxml',from_encoding='UTF-8')

        # each search result title is a <p class="tit"> that contains a link
        for title in soup.find_all('p','tit'):
            title_link=title.select('a')
            article_URL=title_link[0]['href']
            get_text(article_URL, output_file)

def main():
    keyword="노무현"
    page_num=1
    output_file_name="output.txt"
    target_url=target_url_b4_pn+target_url_b4_keyword+quote(keyword)+target_url_rest
    # print(target_url) moved here, where target_url is defined (point 3)
    print(target_url)

    output_file=open(output_file_name, "w", -1, "utf-8")
    get_link_from_news_title(page_num, target_url, output_file)
    output_file.close()


if __name__=='__main__':
    main()
    print(11111)
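
If the script still appears to hang, one possible cause (an assumption, not something confirmed in this thread) is that urlopen is called without a timeout, so a stalled connection can block indefinitely. A minimal sketch of a fetch helper with a timeout follows. The name fetch and the 10-second default are hypothetical choices, and the data: URL only stands in for a real page.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=10.0):
    """Return the response body as bytes, or None if the request fails or stalls."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        # Report the problem and let the caller skip this URL.
        print("skipping", url, "->", exc)
        return None


# Offline demonstration: a data: URL stands in for an article page.
print(fetch("data:text/plain,ok"))
```

In get_text you could call such a helper instead of urlopen directly and skip the article when it returns None, so a single slow page does not stall the whole crawl.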

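Thank you so much! You're very kind! I changed the code, but it still hasn't produced any results after an hour... It would be great if you could suggest any improvements.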


The results are written to 'output.txt' in your current directory. Did you check that file? – Aaron
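
A quick way to check the file from a notebook cell is sketched below. The helper name describe_output is hypothetical. Note that written text is buffered, so the file may look empty until output_file.close() has run.

```python
from pathlib import Path


def describe_output(path="output.txt"):
    """Report where the output file is expected and how large it is."""
    resolved = Path(path).resolve()
    if not resolved.exists():
        return f"{resolved} does not exist yet"
    return f"{resolved} exists, {resolved.stat().st_size} bytes"


# The relative name is resolved against the current working directory.
print(describe_output())
```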