Extracting text from a PDF in C#

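I'm looking for a way to extract text from a PDF and use it in my program. I did some research online and found a few libraries that do the job. They are not free, however, and they come with limitations.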

So I'm looking for a free library. I came across iTextSharp, but I don't know where to start. Can you help me out?

2012-02-29 jorne

Note that iTextSharp is not free software either. – Bobrovsky 2012-03-01 17:05:16

Take a look at the documentation and resources: - http://api.itextpdf.com/ - http://stackoverflow.com/questions/3365986/documentation-for-itextsharp – 2012-02-29 14:23:42

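Something like this should work for you. You will have to keep an eye on it, though: iTextSharp tends to rename functions between releases, which is a bit annoying. Lol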

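// Requires: using System.IO; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public static string GetPDFText(String pdfPath)
{
    PdfReader reader = new PdfReader(pdfPath);
    try
    {
        StringWriter output = new StringWriter();

        // Page numbers start at 1, not 0
        for (int i = 1; i <= reader.NumberOfPages; i++)
            output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));

        return output.ToString();
    }
    finally
    {
        reader.Close(); // release the underlying file
    }
}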

2012-02-29 14:32:01 Dave

Nice, great! One more question: if the PDF contains images, will that be a problem, or will it try to read them as well? – jorne 2012-03-01 10:52:53

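It should be fine if the document contains images. To extract the images, you would need to inspect every PdfObject in the object collection. This code only extracts text :) – Dave 2012-03-01 19:25:27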
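As a rough illustration of what "inspecting every object" could look like, here is a minimal sketch that only counts image streams rather than decoding them. It assumes iTextSharp 5.x; the class and method names (PdfImageCounter, CountImageStreams) are made up for this example and are not part of the library.

```csharp
using iTextSharp.text.pdf;

public static class PdfImageCounter
{
    // Hypothetical helper: counts stream objects whose /Subtype is /Image.
    // Assumes an already opened PdfReader (iTextSharp 5.x).
    public static int CountImageStreams(PdfReader reader)
    {
        int count = 0;

        // Walk the cross-reference table; object 0 is reserved, so start at 1
        for (int i = 1; i < reader.XrefSize; i++)
        {
            PdfObject obj = reader.GetPdfObject(i);

            // Unused slots come back as null; images are always stream objects
            if (obj == null || !obj.IsStream())
                continue;

            PdfStream stream = (PdfStream)obj;
            PdfObject subtype = stream.Get(PdfName.SUBTYPE);

            // PdfName.IMAGE.Equals(null) is false, so a missing /Subtype is skipped
            if (PdfName.IMAGE.Equals(subtype))
                count++;
        }

        return count;
    }
}
```

Calling PdfImageCounter.CountImageStreams(reader) on the same reader used for text extraction would tell you whether the file has any image objects at all. Getting the actual pixel data out is more involved, since each image stream may use a different filter and color space.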

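iTextSharp is open source, but the licensing model changed after version 4.1.6. The older license was much less strict, while the newer license requires payment if you use the library commercially and do not want to release your source code. This may or may not affect you.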

Below is the most basic version of text extraction, using version 5.1.2.0:

//Full path to the file to read 
string fileToRead = System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), @"file1.pdf"); 
//Bind a PdfReader to our file 
iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(fileToRead); 
//Extract all of the text from the first page 
string allPage1Text = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1); 
//That's it! 
Console.Write(allPage1Text);
//Release the file when done 
reader.Close();

2012-02-29 14:29:33