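Writing to and reading from a PDF using PDFsharp and MigraDoc
I am trying to write verification code for our PDF generation routines, and I am having trouble getting PDFsharp to extract text from files created with MigraDoc. The ExtractText code works with other PDFs, but not with the PDFs I generate via MigraDoc (see the code below).
Any hints on what I am doing wrong?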
//Create the Doc
var doc = new MigraDoc.DocumentObjectModel.Document();
doc.Info.Title = "VerifyReadWrite";
var section = doc.AddSection();
section.AddParagraph("ABCDEF abcdef");
//Render the PDF
var renderer = new PdfDocumentRenderer(true);
var pdf = new PdfDocument();
renderer.PdfDocument = pdf;
renderer.Document = doc;
renderer.RenderDocument();
var msOut = new MemoryStream();
pdf.Save(msOut, true);
var pdfBytes = msOut.ToArray();
//Read the PDF into PdfSharp
var ms = new MemoryStream(pdfBytes);
var pdfRead = PdfSharp.Pdf.IO.PdfReader.Open(ms, PdfDocumentOpenMode.ReadOnly);
var segments = pdfRead.Pages[0].ExtractText().ToList();
The result looks like this:
segments[0] = "\0$\0%\0&\0'\0(\0)"
segments[1] = "\0D\0E\0F\0G\0H\0I"
I expected to see:
segments[0] = "ABCDEF"
segments[1] = "abcdef"
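Comparing the two outputs, every character seems to come back as a two-byte code that sits exactly 0x1D below its ASCII value ('$' is 0x24 and 'A' is 0x41; 'D' is 0x44 and 'a' is 0x61). A likely explanation, which is an assumption and not something confirmed here, is that the strings hold glyph indices of the embedded font rather than character codes. The sketch below only illustrates that observation: `GlyphCodeSketch` and `DecodeWithOffset` are hypothetical names, and the fixed offset is read off these two lines of output, so it should not be expected to hold for other fonts or other characters.

```csharp
using System;
using System.Text;

public static class GlyphCodeSketch
{
    // Reads the extracted string as big-endian 2-byte codes and adds a fixed
    // offset to each code. The offset is an assumption taken from the output
    // in this post; it is not a general glyph-to-character mapping.
    public static string DecodeWithOffset(string raw, int offset)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 1 < raw.Length; i += 2)
        {
            int code = (raw[i] << 8) | raw[i + 1];
            sb.Append((char)(code + offset));
        }
        return sb.ToString();
    }
}
```

Calling `GlyphCodeSketch.DecodeWithOffset(segment, 0x1D)` on the two segments above gives back "ABCDEF" and "abcdef". This is a diagnostic aid only; a real extractor would need the font's own mapping information.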
I am using the ExtractText code from: C# Extract text from PDF using PdfSharp
It works well for all PDFs except those generated by MigraDoc.
public static IEnumerable<string> ExtractText(this PdfPage page)
{
    var content = ContentReader.ReadContent(page);
    var text = content.ExtractText();
    return text.Select(x => x.Trim());
}

public static IEnumerable<string> ExtractText(this CObject cObject)
{
    if (cObject is COperator)
    {
        var cOperator = (COperator) cObject;
        if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
            cOperator.OpCode.Name == OpCodeName.TJ.ToString())
        {
            foreach (var cOperand in cOperator.Operands)
                foreach (var txt in ExtractText(cOperand))
                    yield return txt;
        }
    }
    else
    {
        var sequence = cObject as CSequence;
        if (sequence != null)
        {
            var cSequence = sequence;
            foreach (var element in cSequence)
                foreach (var txt in ExtractText(element))
                    yield return txt;
        }
        else if (cObject is CString)
        {
            var cString = (CString) cObject;
            yield return cString.Value;
        }
    }
}
Thanks! I found this: "If true, Unicode encoding is used for all text; if false, WinAnsi encoding is used." here: http://www.nudoq.org/#!/Packages/PDFsharp-MigraDoc-GDI/MigraDoc.Rendering/PdfDocumentRenderer/ctor
Thanks for your answer. – WendellJ
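For anyone landing here later, the change implied by that constructor documentation can be sketched as follows. This assumes a MigraDoc 1.x release where `PdfDocumentRenderer(bool unicode)` is available, and it assumes the `ExtractText` extension methods from the question are in scope; `WinAnsiRoundTripSketch` and `RenderAndExtract` are hypothetical names used only for illustration.

```csharp
using System.IO;
using System.Linq;
using MigraDoc.DocumentObjectModel;
using MigraDoc.Rendering;
using PdfSharp.Pdf.IO;

public static class WinAnsiRoundTripSketch
{
    public static string[] RenderAndExtract()
    {
        var doc = new Document();
        doc.AddSection().AddParagraph("ABCDEF abcdef");

        // unicode: false requests WinAnsi encoding, per the constructor
        // documentation quoted in the comment above (the question passed true).
        var renderer = new PdfDocumentRenderer(false) { Document = doc };
        renderer.RenderDocument();

        using (var msOut = new MemoryStream())
        {
            // false: leave the stream open so it can be copied afterwards.
            renderer.PdfDocument.Save(msOut, false);
            using (var msIn = new MemoryStream(msOut.ToArray()))
            {
                var pdfRead = PdfReader.Open(msIn, PdfDocumentOpenMode.ReadOnly);
                // ExtractText is the extension method posted in the question.
                return pdfRead.Pages[0].ExtractText().ToArray();
            }
        }
    }
}
```

The trade-off is that WinAnsi covers only a Western European character set, so this suits a verification test like the one in the question better than documents that need other scripts.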