Getting a unique word count from a PDF document

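I've already looked at PDFSharp, but it is far too heavyweight for what I want to do. I don't have access to the server, so I can't install Acrobat to get at its API or anything like that. I'm open to doing this with iTextSharp or another tool.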

2011-07-18 Rikon

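What kind of success rate are you looking for? I've run into problems before with PDFs created from scanned images, which at some point basically need OCR, and that comes with its own set of problems. –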

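The success rate isn't critical... this is a translation site that estimates translation quotes. The ToU has all sorts of language saying that a quote is not a contract. Also, images are generally not considered "translatable" in this industry (it's a very unforgiving industry). :) – Rikon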

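iTextSharp has a wonderful PdfTextExtractor object that will get you all of the text (assuming, as @Rob A pointed out, that it is actually stored as text and not as images or pure vectors). Once you have all of the text, a simple regular expression will give you the word count.

The code below should do it for you. (Tested on iText 5.1.1.0.)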

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Windows.Forms;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace WindowsFormsApplication1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            string inputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Input.pdf");

            // Get all the text
            string text = ExtractAllTextFromPdf(inputFile);
            // Count the words
            int wordCount = GetWordCountFromString(text);
        }

        public static string ExtractAllTextFromPdf(string inputFile)
        {
            // Sanity checks
            if (string.IsNullOrEmpty(inputFile))
                throw new ArgumentNullException("inputFile");
            if (!File.Exists(inputFile))
                throw new FileNotFoundException("Cannot find inputFile", inputFile);

            // Open the file ourselves (not necessary, but I like to control locks and permissions)
            using (FileStream stream = new FileStream(inputFile, FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                // Create a reader to read the PDF
                PdfReader reader = new PdfReader(stream);
                try
                {
                    // Create a buffer to store the text
                    StringBuilder buffer = new StringBuilder();

                    // Use the PdfTextExtractor to get all of the text, one page at a time (pages are 1-based)
                    for (int i = 1; i <= reader.NumberOfPages; i++)
                    {
                        buffer.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i));
                    }

                    return buffer.ToString();
                }
                finally
                {
                    // Release the reader's resources even if extraction throws
                    reader.Close();
                }
            }
        }

        public static int GetWordCountFromString(string text)
        {
            // Sanity check
            if (string.IsNullOrEmpty(text))
                return 0;

            // Count the words: every run of non-whitespace characters counts as one word
            return Regex.Matches(text, @"\S+").Count;
        }
    }
}
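
If the count is meant to cover only distinct words, as the question title suggests, the same regex approach can feed a set instead of a plain counter. The following is a minimal sketch and not part of the tested code above: the class name `WordCounter` and the method `GetUniqueWordCount` are hypothetical, and comparing words case-insensitively is an assumption about what "unique" should mean here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp
public static class WordCounter
{
    public static int GetUniqueWordCount(string text)
    {
        // Sanity check
        if (string.IsNullOrEmpty(text))
            return 0;

        // Assumption: "The" and "the" count as the same word
        HashSet<string> unique = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // Same tokenization as above: every run of non-whitespace characters is one word
        foreach (Match m in Regex.Matches(text, @"\S+"))
        {
            unique.Add(m.Value);
        }

        return unique.Count;
    }
}
```

You would call it with the extracted text, for example `WordCounter.GetUniqueWordCount(ExtractAllTextFromPdf(inputFile))`. Because `\S+` keeps punctuation attached, "cat," and "cat" are counted as two different words; trimming punctuation first may be worth it if that matters for the quote.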


2011-07-18 16:06:25

Haas: Awesome, I'll plug it in tonight and mark it as the answer. @peer: If it were up to me on this one, I'd open-source it. – Rikon