Converting PDF to Text in C#

maninwest
2015-01-13 05:18:01
Category: Document, C#

Description

Translated by maninwest @CodeForge. Author: Dan Letecky @CodeProject

There are three main approaches to extracting text from PDF files in .NET:

Microsoft IFilter interface and Adobe IFilter implementation.
iTextSharp
PDFBox
None of these PDF parsing solutions is perfect. We will discuss all these methods below.
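Each sample in this article wraps its library in a method that takes a file path and returns the extracted text. As a minimal sketch of the final conversion step, any such method can be passed to a small helper that writes the result next to the source file. The class name PdfToTextFile and the method name Save are hypothetical and are not part of any of the libraries discussed here.

```csharp
using System;
using System.IO;
using System.Text;

public static class PdfToTextFile
{
    // Runs the supplied extractor on pdfPath and writes the result to a
    // file with the same name and a .txt extension. Returns the new path.
    public static string Save(string pdfPath, Func<string, string> extract)
    {
        string txtPath = Path.ChangeExtension(pdfPath, ".txt");
        File.WriteAllText(txtPath, extract(pdfPath), Encoding.UTF8);
        return txtPath;
    }
}
```

Called from the class that defines one of the ExtractTextFromPdf methods, this could look like PdfToTextFile.Save(path, ExtractTextFromPdf), where path points to an existing PDF file.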
1. Parsing PDF using Adobe PDF IFilter

In order to parse PDF files using IFilter interface you need the following:

Windows 2000 or later
Adobe Acrobat or Reader 7.0.5+ (or the standalone Adobe PDF IFilter [adobe.com])
IFilter COM wrapper class [dotlucene.net]
Sample code:
using IFilter;


// ...


public static string ExtractTextFromPdf(string path) {
  return DefaultParser.Extract(path); 
} 

Download a sample project:

  Parsing PDF Files using IFilter [squarepdf.net]
If you are using the PDF IFilter that comes with Adobe Acrobat Reader, you will need to rename your application's executable to "filtdump.exe"; otherwise the IFilter interface will return the E_NOTIMPL error code. See more at Parsing PDF Files using IFilter [squarepdf.net].
Disadvantages:
It relies on unreliable COM interop to handle the IFilter interface (the combination of the IFilter COM wrapper and the Adobe PDF IFilter is especially troublesome).
It requires a separate installation of the Adobe IFilter on the target system. This can be painful if you need to distribute your indexing solution to someone else.
Your application's executable must be named "filtdump.exe" to work with the latest PDF IFilter implementation that comes with Acrobat Reader.
2. Parsing PDF using iTextSharp
iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It focuses primarily on creating PDFs rather than reading them, but it also supports extracting text from PDF files.
Sample code:

using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// ...

public static string ExtractTextFromPdf(string path)
{
  using (PdfReader reader = new PdfReader(path))
  {
    StringBuilder text = new StringBuilder();

    // iTextSharp page numbers are 1-based.
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
      text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
    }

    return text.ToString();
  }
}


Download a sample project:

Parsing PDF Files using iTextSharp [squarepdf.net]
You may consider using LocationTextExtractionStrategy, which orders text chunks by their position on the page, to get better precision.

public static string ExtractTextFromPdf(string path)
{
  using (PdfReader reader = new PdfReader(path))
  {
      StringBuilder text = new StringBuilder();

      for (int i = 1; i <= reader.NumberOfPages; i++)
      {
          // Create a new strategy for every page: LocationTextExtractionStrategy
          // accumulates text chunks, so a shared instance would repeat earlier pages.
          ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
          string thePage = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
          string[] theLines = thePage.Split('\n');
          foreach (var theLine in theLines)
          {
              text.AppendLine(theLine);
          }
      }
      return text.ToString();
  }
}

 
Disadvantages of iTextSharp:

Licensing: iTextSharp is released under the AGPL license, which may not suit your project.
3. Parsing PDF using PDFBox
PDFBox is another Java PDF library. It is also ready to be used with the original Java Lucene (see LucenePDFDocument).
Fortunately, there is a .NET version of PDFBox that is created using IKVM.NET (just download the PDFBox package).
Using PDFBox in .NET requires adding references to:


IKVM.OpenJDK.Core.dll
IKVM.OpenJDK.SwingAWT.dll
pdfbox-1.8.7.dll
and copying the following files to the bin directory:
commons-logging.dll
fontbox-1.8.7.dll
IKVM.OpenJDK.Text.dll
IKVM.OpenJDK.Util.dll
IKVM.Runtime.dll
Using PDFBox to parse PDFs is fairly easy:

using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;


// ...


private static string ExtractTextFromPdf(string path)
{
  PDDocument doc = null;
  try {
    doc = PDDocument.load(path);
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(doc);
  }
  finally {
    // Always release the document, even if parsing fails.
    if (doc != null) {
      doc.close();
    }
  }
}

Download a sample project:
How to convert PDF files to text in C# (.NET) [squarepdf.net]
How to convert PDF file to text in VB (.NET) [squarepdf.net]
The size of the required assemblies adds up to about 18 MB:
IKVM.OpenJDK.Core.dll (4 MB)
IKVM.OpenJDK.SwingAWT.dll (6 MB)
pdfbox-1.8.7.dll (4 MB)
commons-logging.dll (82 kB)
fontbox-1.8.7.dll (180 kB)
IKVM.OpenJDK.Text.dll (800 kB)
IKVM.OpenJDK.Util.dll (2 MB)
IKVM.Runtime.dll (1 MB)
The speed is acceptable: parsing the U.S. Copyright Act PDF (5.1 MB) took about 13 seconds.
Disadvantages:
IKVM.NET Dependencies (18 MB)
Speed (especially the IKVM.NET warm-up time)
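To see how much of the parsing time is warm-up, one option is to time two consecutive calls on the same file and compare them. The following is only a sketch under that assumption: the class name ExtractionTimer and the method name Measure are hypothetical, and the extractor is passed in as a delegate so that any of the ExtractTextFromPdf methods above can be measured.

```csharp
using System;
using System.Diagnostics;

public static class ExtractionTimer
{
    // Calls the extractor twice on the same file and returns the elapsed
    // milliseconds of each call: [0] = first (cold) call, [1] = second call.
    public static long[] Measure(Func<string, string> extract, string path)
    {
        long[] elapsed = new long[2];
        for (int run = 0; run < 2; run++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            extract(path);
            watch.Stop();
            elapsed[run] = watch.ElapsedMilliseconds;
        }
        return elapsed;
    }
}
```

A large gap between the two values suggests that start-up cost, rather than parsing itself, dominates the first call. The second call may also benefit from operating system file caching, so the comparison is only a rough indication.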