Fixing PDFSharp hangs

To analyse a couple of PDF files whether they contain only images, I used the latest release build of PDFsharp, version 1.32.

However, when processing a certain file (of unknown origin) using code found in an SO answer

public static IEnumerable ExtractText(this PdfPage page)
{       
    var content = ContentReader.ReadContent(page);      
    var text = content.ExtractText();
    return text;
}   

the ExtractText() function simply would not return.

I upgraded to the most current build 1.50 beta 3, included the source in my project, and ran it in Debug mode, where execution halted in the file PDFsharp\src\PdfSharp\Pdf.Content\CParser.cs line 163 failing an assertion:

#if DEBUG
    default:
        Debug.Assert(false);
        break;
#endif

Without digging too deep into the analysis of PDF files, it was clear that the PDF contained a CSymbol that is not being handled by the library, and thus (most likely) ended up in an infinite loop inside CParser.ParseObject().

I fixed this by replacing the Debug.Assert statement with

        throw new Exception("unhandled PDF symbol " + symbol);

which fixed the situation for me.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.