Extracting Arabic text in C# using iTextSharp

I have this code, which I use to get the text of a PDF. It works well for PDFs in English, but when I try to extract text in Arabic, it shows me something like this:

「）+ N 9 N < +，+）+ $＃$ + $ F％9 & < $：;」。

using (PdfReader reader = new PdfReader(path))
{
    // The same strategy instance is reused for every page, so the
    // returned text accumulates the pages processed so far.
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    String text = "";
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        text = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
    }
}


2016-11-14 Ahmad Tarabeshi

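It looks like the PDF does not contain the information required for text extraction according to the PDF specification. – mkl

Have you tried this: http://stackoverflow.com/questions/35436158/itextsharp-cant-extract-pdf-unicode-content-in-c-sharp ? – KMoussa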

No, there is not much there, but iTextSharp code can write in Arabic –

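I had to change the extraction strategy like this:

var t = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
var te = Convert(t);

The following function reverses the Arabic words and leaves the English text as it is: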

// Requires: using System.Linq; using System.Text;

// Reverses each run of Arabic characters and copies everything else unchanged.
private string Convert(string source)
{
    string arabicWord = string.Empty;
    StringBuilder sbDestination = new StringBuilder();

    foreach (var ch in source)
    {
        if (IsArabic(ch))
            arabicWord += ch;
        else
        {
            // A non-Arabic character ends the current Arabic run, if any
            if (arabicWord != string.Empty)
                sbDestination.Append(Reverse(arabicWord));

            sbDestination.Append(ch);
            arabicWord = string.Empty;
        }
    }

    // If the last word was Arabic
    if (arabicWord != string.Empty)
        sbDestination.Append(Reverse(arabicWord));

    return sbDestination.ToString();
}

private bool IsArabic(char character)
{
    // Arabic block
    if (character >= 0x600 && character <= 0x6ff)
        return true;

    // Arabic Supplement block
    if (character >= 0x750 && character <= 0x77f)
        return true;

    // Part of Arabic Presentation Forms-A
    if (character >= 0xfb50 && character <= 0xfc3f)
        return true;

    // Arabic Presentation Forms-B
    if (character >= 0xfe70 && character <= 0xfefc)
        return true;

    return false;
}

// Reverse the characters of a string
string Reverse(string source)
{
    return new string(source.ToCharArray().Reverse().ToArray());
}
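
To make the effect of the run reversal concrete, here is a small self-contained sketch that can be run without iTextSharp. It is not the code from the answer: the class name `ArabicRunDemo`, the method name `ReverseArabicRuns`, and the input string are made up for illustration, and it assumes the extracted text contains Arabic letters in reversed (visual) order. It uses the same four character ranges as the `IsArabic` check above.

```csharp
using System;
using System.Text;

public static class ArabicRunDemo
{
    // Same ranges as the IsArabic check in the answer.
    static bool IsArabic(char c) =>
        (c >= 0x600 && c <= 0x6ff) || (c >= 0x750 && c <= 0x77f) ||
        (c >= 0xfb50 && c <= 0xfc3f) || (c >= 0xfe70 && c <= 0xfefc);

    // Reverses each maximal run of Arabic characters; all other
    // characters are copied through in their original position.
    public static string ReverseArabicRuns(string source)
    {
        var result = new StringBuilder(source.Length);
        var run = new StringBuilder();
        foreach (char ch in source)
        {
            if (IsArabic(ch))
            {
                run.Insert(0, ch); // prepending builds the run reversed
                continue;
            }
            result.Append(run).Append(ch);
            run.Clear();
        }
        return result.Append(run).ToString();
    }

    public static void Main()
    {
        // Hypothetical input: the letters of one Arabic word in reversed order,
        // surrounded by Latin text and digits.
        string extracted = "PDF \u0628\u0627\u062A\u0643 2016";
        string expected = "PDF \u0643\u062A\u0627\u0628 2016";
        Console.WriteLine(ReverseArabicRuns(extracted) == expected); // prints True
    }
}
```

Note that only the characters inside each Arabic run are reversed. A space ends a run, so the order of the words themselves is not changed by this approach.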


2016-11-15 20:47:48

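None of the characters in the output shown in your question fall within the ranges tested by `IsArabic`. So if the code in your answer really helped, then the data you showed in your question is not what you actually extracted... – mkl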

Actually, it does happen sometimes, especially with an old version :) –

OK, thank you for sharing this code. I am sure it will help others too. –
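
mkl's observation about the sample output can be checked with a quick diagnostic. The sketch below is only an illustration, with made-up names (`ArabicRangeCheck`, `CountArabic`): it counts how many characters of a string fall inside the ranges tested by `IsArabic`, using the sample output exactly as it is quoted in the question on this page.

```csharp
using System;
using System.Linq;

public static class ArabicRangeCheck
{
    // Same ranges as the IsArabic check in the answer.
    static bool IsArabic(char c) =>
        (c >= 0x600 && c <= 0x6ff) || (c >= 0x750 && c <= 0x77f) ||
        (c >= 0xfb50 && c <= 0xfc3f) || (c >= 0xfe70 && c <= 0xfefc);

    // Number of characters that the answer's Convert function would treat as Arabic.
    public static int CountArabic(string text) => text.Count(IsArabic);

    public static void Main()
    {
        // Sample output as quoted in the question
        string sample = "）+ N 9 N < +，+）+ $＃$ + $ F％9 & < $：;";
        Console.WriteLine(CountArabic(sample));                     // prints 0
        Console.WriteLine(CountArabic("\u0643\u062A\u0627\u0628")); // prints 4
    }
}
```

A count of zero means the run reversal would leave that text unchanged, which is consistent with the comment that a PDF producing such output probably lacks the information needed to map its glyphs back to Unicode.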
