Extracting specific fields from forum threads

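I am working on a data mining project for which I need to analyze how discussions progress in forum threads. I am interested in extracting information such as the time of each post, statistics about the post's author (number of posts, join date, etc.), the text of each post, and so on.

However, when using standard scraping tools (such as Scrapy in Python), I need to write regular expressions to detect these fields in the HTML source of the page. Since the tags vary with the type of forum, maintaining regular expressions for each forum is becoming a major problem. Is there a standard bank of regular expressions available, so that they can be picked according to the forum type?

Or are there any other techniques for extracting these fields from forum pages?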

2011-04-01 vijay

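I wrote some configuration files for a few of the major forum engines. Hopefully you can decipher them and infer how to parse each one.

For vBulletin: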

enclosed_section=tag:table,attributes:id;threadslist
thread=tag:a,attributes:id;REthread_title_
list_next_page=type:next_page,attributes:anchor_text;>
post=tag:div,attributes:id;REpost_message_
thread_next_page=type:next_page,attributes:anchor_text;>

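enclosed_section is the element that contains the links to all the threads. thread is where you will find the link to each individual thread. list_next_page is the link to the next page of the thread list. post is the div holding the text of a post. thread_next_page is the link to the next page of a thread.

For InVision: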

enclosed_section=tag:table,attributes:id;forum_table 
thread=tag:a,attributes:class;topic_title 
list_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href 
post=tag:div,attributes:class;post entry-content | 
thread_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href 
post_count_section=tag:td,attributes:class;stats 
post_count=tag:li,attributes:,reg_exp:(\d+) Repl
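
One way to read rule lines like these is sketched below, using only the Python standard library. This is a guess at the format, not the answer author's actual parser: it assumes each line is `name=key:value,...`, that `attributes` holds `attr_name;attr_value`, and that an `RE` prefix on the value means "match as a regular expression". The names `parse_rule`, `matches`, and `RuleCollector` are hypothetical.

```python
import re
from html.parser import HTMLParser


def parse_rule(line):
    """Parse one 'name=key:value,key:value' line into (name, rule dict)."""
    name, _, spec = line.strip().partition("=")
    rule = {}
    for part in spec.split(","):
        key, _, value = part.partition(":")
        rule[key] = value
    # Assumption: 'attributes' is 'attr_name;attr_value', and a leading 'RE'
    # on the value marks the remainder as a regular expression.
    attr_name, _, attr_value = rule.pop("attributes", "").partition(";")
    rule["attr_name"] = attr_name
    rule["is_regex"] = attr_value.startswith("RE")
    rule["attr_value"] = attr_value[2:] if rule["is_regex"] else attr_value
    return name, rule


def matches(rule, tag, attrs):
    """Return True if an opening tag satisfies the rule."""
    if rule.get("tag") != tag:
        return False
    actual = dict(attrs).get(rule["attr_name"])
    if actual is None:
        return False
    if rule["is_regex"]:
        return re.match(rule["attr_value"], actual) is not None
    return actual == rule["attr_value"]


class RuleCollector(HTMLParser):
    """Collect the text inside every element that matches one rule."""

    def __init__(self, rule):
        super().__init__()
        self.rule = rule
        self.depth = 0      # > 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements of the same tag so we close correctly.
            if tag == self.rule["tag"]:
                self.depth += 1
        elif matches(self.rule, tag, attrs):
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.rule["tag"]:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


# Usage with the vBulletin 'post' rule and a made-up HTML fragment.
_, post_rule = parse_rule("post=tag:div,attributes:id;REpost_message_")
collector = RuleCollector(post_rule)
collector.feed(
    '<div id="post_message_1">Hello <b>world</b></div>'
    '<div id="sidebar">ignored</div>'
    '<div id="post_message_2">Second post</div>'
)
posts = [text.strip() for text in collector.results]
```

The sketch only handles rules that name a `tag`; the `type:next_page`, `inside_tag_attribute`, and `reg_exp` keys are parsed into the dictionary but not acted on.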

2011-04-02 04:15:39

You will still have to write several methods for each forum. But as Henry said, a lot of forums share the same structure.
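
Because many sites run the same forum engine, the per-forum work can be keyed on the engine rather than on the site. The sketch below is a hypothetical illustration of that idea: `ENGINE_MARKERS`, `POST_RULES`, and `detect_engine` are invented names, the markers are taken from the `enclosed_section` ids in the configuration above, and the plain substring check is a crude assumption that real pages may not satisfy.

```python
# Hypothetical registry: one entry per forum engine, reused across sites.
# Markers come from the 'enclosed_section' ids in the config files above.
ENGINE_MARKERS = {
    "vbulletin": 'id="threadslist"',
    "invision": 'id="forum_table"',
}

# The 'post' rule for each engine, in the same notation as the config files.
POST_RULES = {
    "vbulletin": "post=tag:div,attributes:id;REpost_message_",
    "invision": "post=tag:div,attributes:class;post entry-content",
}


def detect_engine(html):
    """Guess the forum engine of a thread-list page, or return None."""
    for engine, marker in ENGINE_MARKERS.items():
        if marker in html:
            return engine
    return None


def post_rule_for(html):
    """Return the 'post' rule line for the detected engine, if any."""
    engine = detect_engine(html)
    return POST_RULES.get(engine)


# Usage with made-up page fragments.
vb_page = '<table id="threadslist"><tr><td>...</td></tr></table>'
ipb_page = '<table id="forum_table"><tr><td>...</td></tr></table>'
unknown_page = "<html><body>plain page</body></html>"
```

A forum whose engine is not in the registry returns `None`, which is the point at which a new set of rules would have to be written by hand.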

As for easily parsing the dates found in forum threads, dateparser was born out of this specific need and could be a great help.

2016-12-20 03:36:13 eLRuLL
