Fixing PDFSharp hangs

January 12, 2017

To analyse a couple of PDF files whether they contain only images, I used the latest release build of PDFsharp, version 1.32.

However, when processing a certain file (of unknown origin) using code found in an SO answer

public static IEnumerable ExtractText(this PdfPage page)
{       
    var content = ContentReader.ReadContent(page);      
    var text = content.ExtractText();
    return text;
}   

the ExtractText() function simply would not return.

I upgraded to the most current build 1.50 beta 3, included the source in my project, and ran it in Debug mode, where execution halted in the file PDFsharp\src\PdfSharp\Pdf.Content\CParser.cs line 163 failing an assertion:

#if DEBUG
    default:
        Debug.Assert(false);
        break;
#endif

Without digging too deep into the analysis of PDF files, it was clear that the PDF contained a CSymbol that is not being handled by the library, and thus (most likely) ended up in an infinite loop inside CParser.ParseObject().

I fixed this by replacing the Debug.Assert statement with

        throw new Exception("unhandled PDF symbol " + symbol);

which fixed the situation for me.