
High performance extraction of unstructured text from a #PDF in C#

There are a myriad of tools for extracting text from a PDF, and this code is not meant as a replacement for them. It was written for a specific case where I wanted to extract text from a PDF as fast as possible, without worrying about the structure of the document, i.e. to very quickly answer the question “on what pages does the text “X” appear?”

In my specific case, performance was of paramount importance; knowing the layout of the page was not.

The Github repo is here: https://github.com/infiniteloopltd/PdfToTextCSharp

The result was about 10x faster than iText, parsing a 270-page PDF in 0.735 seconds.

It’s also a very interesting look at how one could go about creating a PDF reader from first principles, so without further ado, let’s take a look at a PDF when opened in a text editor:

7 0 obj
<<
/Contents [ 8 0 R ]
/Parent 5 0 R
/Resources 6 0 R
/Type /Page
>>
endobj
6 0 obj
<<
/Font <<
/ttf0 11 0 R
/ttf1 17 0 R
>>
/ProcSet 21 0 R
>>
endobj
8 0 obj
<<
/Filter [ /FlateDecode ]
/Length 1492
>>
stream
..... BINARY DATA ...
endstream
endobj

What is interesting here is that the page data is encoded in the “BINARY DATA”, which is enclosed between the stream and endstream markers.

This binary data can be decompressed using the Deflate method. There are other compression schemes used in PDF, and they can even be chained, but that goes beyond the scope of this tool.

Here is the code to uncompress deflated binary data:

private static string Decompress(byte[] input)
{
    // Skip the two-byte zlib header, which DeflateStream does not understand
    var cutInput = new byte[input.Length - 2];
    Array.Copy(input, 2, cutInput, 0, cutInput.Length);
    var stream = new MemoryStream();
    using (var compressStream = new MemoryStream(cutInput))
    using (var deflateStream = new DeflateStream(compressStream, CompressionMode.Decompress))
    {
        deflateStream.CopyTo(stream);
    }
    return Encoding.Default.GetString(stream.ToArray());
}
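As a quick sanity check (this is not part of the repo, just an illustrative sketch), the same approach can be exercised with a round-trip: raw-deflate a sample string, prepend a typical two-byte zlib header such as 0x78 0x9C, and confirm the original text comes back.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public class DecompressCheck
{
    // Same approach as the Decompress method above: skip the two-byte
    // zlib header, then inflate the remainder with DeflateStream.
    public static string Decompress(byte[] input)
    {
        var cutInput = new byte[input.Length - 2];
        Array.Copy(input, 2, cutInput, 0, cutInput.Length);
        var stream = new MemoryStream();
        using (var compressStream = new MemoryStream(cutInput))
        using (var deflateStream = new DeflateStream(compressStream, CompressionMode.Decompress))
        {
            deflateStream.CopyTo(stream);
        }
        return Encoding.Default.GetString(stream.ToArray());
    }

    public static string RoundTrip()
    {
        // Raw-deflate a sample content-stream fragment
        byte[] body;
        using (var ms = new MemoryStream())
        {
            using (var ds = new DeflateStream(ms, CompressionMode.Compress))
            {
                var bytes = Encoding.Default.GetBytes("BT (Hello)Tj ET");
                ds.Write(bytes, 0, bytes.Length);
            }
            body = ms.ToArray();
        }
        // Prepend a typical zlib header (0x78 0x9C) so the input
        // resembles a FlateDecode stream from a PDF
        var withHeader = new byte[body.Length + 2];
        withHeader[0] = 0x78;
        withHeader[1] = 0x9C;
        Array.Copy(body, 0, withHeader, 2, body.Length);
        return Decompress(withHeader);
    }

    public static void Main()
    {
        Console.WriteLine(RoundTrip()); // prints "BT (Hello)Tj ET"
    }
}
```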

So I read through the PDF document looking for the “stream” and “endstream” markers and, when found, snipped out the binary data between them and decompressed it to reveal this text:

/DeviceRGB cs
/DeviceRGB CS
1 0 0 1 0 792 cm
18 -18 901.05 -756  re W n
1.5 w
0 0 0 SC
32.05 -271.6 m
685.25 -271.6 l
32.05 -235.6 m
685.25 -235.6 l
1 w
32.05 -723.9 m
685.25 -723.9 l
1 0 0 1 636 -743.7 Tm
0 0 0 sc
0 Tr
/ttf0 9 Tf
-510 0.1 Td
0 Tr

Most of this text relates to the layout and appearance of the page, and is, once again, beyond the scope of the tool. I wanted the text itself, which appears in the form (…)Tj, and which I extracted using a regex as follows:

const string strContentRegex = @"\((?<Content>.*?)\)Tj";
UnstructuredContent = Regex.Matches(RawContent, strContentRegex)
    .Cast<Match>()
    .Select(m => m.Groups["Content"].Value)
    .ToList();
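Applied to a made-up fragment of decompressed page content, that regex picks out just the literal strings and discards the surrounding layout operators:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class TjDemo
{
    // Pull the contents of each (...)Tj operator out of raw page content
    public static List<string> Extract(string raw)
    {
        return Regex.Matches(raw, @"\((?<Content>.*?)\)Tj")
            .Cast<Match>()
            .Select(m => m.Groups["Content"].Value)
            .ToList();
    }

    public static void Main()
    {
        // A made-up fragment of decompressed page content
        var raw = "/ttf0 9 Tf (Hello)Tj 1 0 0 1 50 700 Tm (World)Tj";
        Console.WriteLine(string.Join(", ", Extract(raw))); // prints "Hello, World"
    }
}
```

Note this deliberately ignores escaped parentheses inside strings and the array form of the TJ operator; for this tool's "which page is X on?" use case, the simple pattern is good enough.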

Once this was done, I could then write a Find function that returns the pages on which a given string of text appears:

public List<FastPDFPage> Find(string text)
{
    return Pages
        .Where(p => p.UnstructuredContent
            .Any(c => string.Equals(c, text, StringComparison.OrdinalIgnoreCase)))
        .ToList();
}
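To see the matching logic in isolation, here is a self-contained sketch with a stand-in FastPDFPage (hypothetical; the real class in the repo carries more state, and Find is an instance method there):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the repo's FastPDFPage class
public class FastPDFPage
{
    public int PageNumber;
    public List<string> UnstructuredContent = new List<string>();
}

public class FindDemo
{
    // Same matching logic as the Find method above, made standalone:
    // case-insensitive equality against each extracted (...)Tj string
    public static List<FastPDFPage> Find(List<FastPDFPage> pages, string text)
    {
        return pages
            .Where(p => p.UnstructuredContent
                .Any(c => string.Equals(c, text, StringComparison.OrdinalIgnoreCase)))
            .ToList();
    }

    public static List<FastPDFPage> SamplePages()
    {
        return new List<FastPDFPage>
        {
            new FastPDFPage { PageNumber = 1, UnstructuredContent = { "Hello", "World" } },
            new FastPDFPage { PageNumber = 2, UnstructuredContent = { "hello" } },
            new FastPDFPage { PageNumber = 3, UnstructuredContent = { "Other" } }
        };
    }

    public static void Main()
    {
        var hits = Find(SamplePages(), "HELLO");
        Console.WriteLine(string.Join(",", hits.Select(p => p.PageNumber))); // prints "1,2"
    }
}
```

Note that the equality comparison matches a whole extracted (…)Tj string, not a substring within it, so a phrase split across several Tj operators will not be found as one unit.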

In performance tests, this consistently scanned a 270-page PDF in 0.735 seconds, much faster than iText, and an order of magnitude faster than PDFMiner for Python.
