High performance extraction of unstructured text from a #PDF in C#


May 31, 2021 Infinite Loop Development Ltd

There are myriad tools for extracting text from a PDF, and this code is not meant as a replacement for them. It was written for a specific case where I wanted to extract text from a PDF as fast as possible, without worrying about the structure of the document, i.e. to very quickly answer the question “on what pages does the text “X” appear?”

In my specific case, performance was of paramount importance; knowing the layout of the page was unimportant.

The GitHub repo is here: https://github.com/infiniteloopltd/PdfToTextCSharp

In my tests, it was about 10x faster than iText, parsing a 270-page PDF in 0.735 seconds.

It’s also a very interesting look at how one could go about creating a PDF reader from first principles, so without further ado, let’s take a look at a PDF, when opened in a text editor:

%PDF-1.7 
%âãÏÓ 
7 0 obj
<<
/Contents [ 8 0 R  ] 
/Parent 5 0 R 
/Resources 6 0 R 
/Type /Page
>>
endobj
6 0 obj
<<
/Font <<
/ttf0 11 0 R 
/ttf1 17 0 R 
>>
/ProcSet 21 0 R 
>>
endobj
8 0 obj
<<
/Filter [ /FlateDecode ]
/Length 1492
>>
stream
..... BINARY DATA ...
endstream

What is interesting here is that the page data is encoded in the “BINARY DATA”, which is enclosed between the stream and endstream markers.
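As a rough illustration of how those markers can be located, here is a minimal sketch that scans the raw bytes of a file for "stream" / "endstream" pairs. The class and method names (StreamScanner, ExtractStreams) are hypothetical and not taken from the repo. It is deliberately naive: a fuller parser would use the /Length entry of the stream dictionary instead of searching for the end marker.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not the repo's code: a naive marker scan over raw bytes.
public static class StreamScanner
{
    private static readonly byte[] StartMarker = Encoding.ASCII.GetBytes("stream");
    private static readonly byte[] EndMarker = Encoding.ASCII.GetBytes("endstream");

    // Returns the bytes found between each "stream" / "endstream" pair.
    // Any end-of-line bytes before "endstream" are left in place.
    public static List<byte[]> ExtractStreams(byte[] pdf)
    {
        var result = new List<byte[]>();
        var pos = 0;
        while (true)
        {
            var start = IndexOf(pdf, StartMarker, pos);
            if (start < 0) break;

            // The "stream" keyword is followed by CRLF or LF before the data.
            var dataStart = start + StartMarker.Length;
            if (dataStart < pdf.Length && pdf[dataStart] == (byte)'\r') dataStart++;
            if (dataStart < pdf.Length && pdf[dataStart] == (byte)'\n') dataStart++;

            var end = IndexOf(pdf, EndMarker, dataStart);
            if (end < 0) break;

            var body = new byte[end - dataStart];
            Array.Copy(pdf, dataStart, body, 0, body.Length);
            result.Add(body);

            // Continue after "endstream" so its "stream" suffix is not re-matched.
            pos = end + EndMarker.Length;
        }
        return result;
    }

    // Simple byte-pattern search; returns -1 when the pattern is not found.
    private static int IndexOf(byte[] haystack, byte[] needle, int from)
    {
        for (var i = from; i <= haystack.Length - needle.Length; i++)
        {
            var match = true;
            for (var j = 0; j < needle.Length; j++)
            {
                if (haystack[i + j] != needle[j]) { match = false; break; }
            }
            if (match) return i;
        }
        return -1;
    }
}
```

Working on bytes rather than on a decoded string avoids corrupting the compressed data through a text encoding round trip.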

This binary data is compressed with Deflate, as indicated by the /FlateDecode filter, and can be decompressed accordingly. There are other compression schemes used in PDF, and they can even be chained, but that goes beyond the scope of this tool.

Here is the code to decompress the deflated binary data:

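// Requires: using System; using System.IO; using System.IO.Compression; using System.Text;
private static string Decompress(byte[] input)
{
    // FlateDecode data is a zlib stream. Skip its 2-byte header so that
    // DeflateStream receives raw deflate data.
    var cutInput = new byte[input.Length - 2];
    Array.Copy(input, 2, cutInput, 0, cutInput.Length);
    using (var stream = new MemoryStream())
    using (var compressStream = new MemoryStream(cutInput))
    using (var deflateStream = new DeflateStream(compressStream, CompressionMode.Decompress))
    {
        deflateStream.CopyTo(stream);
        // Note: Encoding.Default differs between .NET Framework and .NET Core.
        return Encoding.Default.GetString(stream.ToArray());
    }
}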
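To show why the first two bytes are skipped, here is a small round-trip sketch. It builds a zlib-style buffer by hand (2-byte header, raw deflate data, Adler-32 trailer) and then inflates it again after checking for the header. The names ZlibDemo, ZlibCompress, HasZlibHeader and Inflate are made up for this illustration and are not part of the repo.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical demo class: shows the zlib framing around raw deflate data.
public static class ZlibDemo
{
    // Wraps raw deflate output in a zlib container, as FlateDecode data is stored.
    public static byte[] ZlibCompress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            output.WriteByte(0x78); // CMF: deflate method, 32K window
            output.WriteByte(0x9C); // FLG: makes the 16-bit header a multiple of 31
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, true))
                deflate.Write(data, 0, data.Length);

            // Adler-32 checksum of the uncompressed data, stored big-endian.
            uint a = 1, b = 0;
            foreach (var value in data)
            {
                a = (a + value) % 65521;
                b = (b + a) % 65521;
            }
            var adler = (b << 16) | a;
            output.WriteByte((byte)(adler >> 24));
            output.WriteByte((byte)(adler >> 16));
            output.WriteByte((byte)(adler >> 8));
            output.WriteByte((byte)adler);
            return output.ToArray();
        }
    }

    // A zlib header has compression method 8 and is divisible by 31 as a 16-bit value.
    public static bool HasZlibHeader(byte[] input)
    {
        return input.Length >= 2
            && (input[0] & 0x0F) == 8
            && ((input[0] << 8) | input[1]) % 31 == 0;
    }

    // Skips the zlib header when present, then inflates the raw deflate data.
    public static string Inflate(byte[] input)
    {
        var offset = HasZlibHeader(input) ? 2 : 0;
        using (var source = new MemoryStream(input, offset, input.Length - offset))
        using (var deflate = new DeflateStream(source, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return Encoding.ASCII.GetString(output.ToArray());
        }
    }
}
```

The sketch decodes the result as ASCII purely to keep the demo deterministic; real page content may need a different encoding.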

So, I read through the PDF document, looking for the “stream” and “endstream” markers, and when found, I would snip out the binary data and inflate it to reveal this text:

/DeviceRGB cs
/DeviceRGB CS
q
1 0 0 1 0 792 cm
18 -18 901.05 -756  re W n
1.5 w
0 0 0 SC
32.05 -271.6 m
685.25 -271.6 l
S
32.05 -235.6 m
685.25 -235.6 l
S
1 w
32.05 -723.9 m
685.25 -723.9 l
S
BT
1 0 0 1 636 -743.7 Tm
0 0 0 sc
0 Tr
/ttf0 9 Tf
(270)Tj
-510 0.1 Td
0 Tr
(08-Apr-2021)Tj

Most of this text relates to the layout and appearance of the page and is, once again, beyond the scope of the tool. I wanted to extract the text, which is represented like (…)Tj, and which I extracted using a regex as follows:

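// Requires: using System.Linq; using System.Text.RegularExpressions;
const string strContentRegex = @"\((?<Content>.*?)\)Tj";
UnstructuredContent = Regex.Matches(RawContent, strContentRegex)
    .Cast<Match>() // needed on .NET Framework, harmless on .NET Core
    .Select(m => m.Groups["Content"].Value)
    .ToList();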
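One possible refinement, offered only as a sketch: PDF literal strings may contain escaped parentheses such as \( and \), and there may be whitespace before the Tj operator. The variant below tolerates both and unescapes the captured text. The names TextOperators, Extract and Unescape are hypothetical, and the pattern still ignores other text operators (for example TJ arrays) and unescaped nested parentheses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical variant of the Tj extraction, not the repo's implementation.
public static class TextOperators
{
    // Content is a run of escaped characters (\x) or characters other than
    // backslash and parentheses, followed by ")" and the Tj operator.
    private static readonly Regex TjRegex = new Regex(
        @"\((?<Content>(?:\\.|[^\\()])*)\)\s*Tj",
        RegexOptions.Compiled | RegexOptions.Singleline);

    public static List<string> Extract(string rawContent)
    {
        return TjRegex.Matches(rawContent)
            .Cast<Match>()
            .Select(m => Unescape(m.Groups["Content"].Value))
            .ToList();
    }

    // Turns \( \) and \\ back into the literal characters.
    private static string Unescape(string value)
    {
        return Regex.Replace(value, @"\\([()\\])", "$1");
    }
}
```

For the content stream shown earlier, this returns the same strings as the simpler pattern; the difference only shows up when a string contains escaped parentheses or is separated from Tj by whitespace.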

Once this was done, I could then write a Find function that returns the pages on which a given string of text appears (a page matches when one of its extracted strings equals the search text, ignoring case):

public List<FastPDFPage> Find(string text)
{
    return Pages
        .Where(p => p.UnstructuredContent
            .Any(c => string.Equals(c, text, StringComparison.OrdinalIgnoreCase)))
        .ToList();
}
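If a partial match is wanted instead of an exact one, a substring search is a small change. The sketch below is self-contained, so it uses a hypothetical PageText class as a stand-in for FastPDFPage; the PageNumber property and the FindPages method are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the post's FastPDFPage type.
public class PageText
{
    public int PageNumber { get; set; }
    public List<string> UnstructuredContent { get; set; } = new List<string>();
}

public static class PageSearch
{
    // Returns the numbers of the pages where any extracted string
    // contains the search text, ignoring case.
    public static List<int> FindPages(IEnumerable<PageText> pages, string text)
    {
        return pages
            .Where(p => p.UnstructuredContent
                .Any(c => c.IndexOf(text, StringComparison.OrdinalIgnoreCase) >= 0))
            .Select(p => p.PageNumber)
            .ToList();
    }
}
```

Note that text split across several Tj operators on the page will not be found by either approach, since each extracted string is compared on its own.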

And, in performance tests, this consistently took 0.735 seconds to scan 270 pages, much faster than iText, and an order of magnitude faster than PDFMiner for Python.
