Caradoc v0.3 released | Blog | Guillaume Endignoux

Version 0.3 of Caradoc was released this week. This update comes with lots of new features to conclude the year 2016, so this is time to review them.

On the parsing side, encrypted files are now supported, and new normalization options have come to cope with common errors found in the wild. Regarding validation, the syntax of graphical commands – that constitute the core of PDF – is now checked as well. New friendly features are also available to perform search and obtain more statistics about PDFs.

Encryption
Graphical commands
File statistics
Search
Normalization options
What’s next?

Encryption

As I explained in a previous post, PDF supports a custom encryption scheme. Although it relies on quite outdated cryptographic primitives such as RC4 and MD5, PDF encryption is not uncommon in practice, notably to restrict permissions.

Caradoc now supports decryption of PDF documents, provided that you know the correct password. For this, you need to pass an option file to Caradoc. For example, you can type one of the following commands:

# Obtain statistics about an encrypted file
$ caradoc stats --options options.txt <encrypted.pdf>
# Decrypt and normalize an encrypted file
$ caradoc cleanup --options options.txt --out <decrypted.pdf> <encrypted.pdf>

and indicate the user and/or owner password in the file options.txt.

user-password=you_wont_find_this_password
owner-password=top_secret

By default, if Caradoc detects that the file is encrypted but you did not provide a password, it tries the empty password. This behavior mimics what common PDF readers do. To manually disable opportunistic decryption, use the following option in the options.txt file.

no-decrypt

You can also ignore encryption altogether if there is a persistent problem, again with the options.txt file.

ignore-crypt

For a more thorough description and (in)security analysis of PDF encryption, you can read my previous post.

Graphical commands

The core of PDF is a vector graphics language largely inspired from PostScript. This language contains relatively low-level instructions to choose colors, fonts, draw lines, text, fill polygons, etc. I recently made a pdf-graphics cheat sheet for these instructions.

Cheat sheet of PDF graphics instructions. More on GitHub.

Caradoc now checks that these instructions are well-formed, i.e. that no unknown instruction is used, that arguments are of the correct type and that instructions are used in the correct state. For example, a command fill path must be preceded by command(s) that define a path.

Also, this language defines a stack that allows to save and restore the graphic state (that contains colors, fonts, geometric transformations, etc.). Caradoc checks that push and pop operators on the graphic stack are balanced.

File statistics

The stats command, along with giving information about the validation process, the number of objects or compression filters, now also outputs metadata found in the /Info dictionary of the document. This includes the following (self-explanatory) fields /Producer, /Creator, /CreationDate, /ModDate. If the file is encrypted, encryption parameters are also output: /V and /R (version of the encryption algorithm), /O and /U (user and owner passwords «hashes») and /ID (salt).

For example, if you try the stats command on the PDF file containing the PDF reference, you obtain the following report.

$ caradoc stats PDF32000_2008.pdf
...
/Producer : (Acrobat Distiller 8.1.0 \(Windows\))
/Creator : (FrameMaker 8.0)
/CreationDate : (D:20080918111951Z)
/ModDate : (D:20080929101841-07'00')

Search

There are two new commands to perform search on PDF files.

The findref command looks for all occurrences of a reference to a given object. The findname command looks for all occurrences of a given name, as values and as dictionary keys.

You can for example look for objects containing the /Page name (in principle page objects).

$ ./caradoc findname --name Page <file.pdf>
Found in object 68 at entry /Type
Found in object 135 at entry /Type
Found in object 167 at entry /Type
...
Found in object 984 at entry /Type
Found in object 1011 at entry /Type
Found 35 occurrence(s) in 35 object(s).

The --show option displays objects in which a match is found. On compatible terminals, you can also --highlight the matches.

$ ./caradoc findname --name Page --show <file.pdf>
Found in object 68 at entry /Type
<<
    /Contents 98 0 R
    /Type /Page
    /Resources 97 0 R
    /Annots [77 0 R 78 0 R 79 0 R 80 0 R 81 0 R 82 0 R 83 0 R 84 0 R 85 0 R 86 0 R 87 0 R 88 0 R 89 0 R 90 0 R 91 0 R 92 0 R 93 0 R 94 0 R 95 0 R 96 0 R]
    /Trans <<
        /S /R
    >>
    /MediaBox [0 0 362.835 272.126]
    /Parent 106 0 R
>>

...

The findref command can be useful to retrieve in which contexts an object appears. For example, let’s say that you searched for Type1 fonts.

$ ./caradoc findname --name Type1 --show <file.pdf>
Found in object 102 at entry /Subtype
<<
    /LastChar 122
    /Widths 1027 0 R
    /BaseFont /EGTMLR+SFSS1440
    /Type /Font
    /FontDescriptor 1043 0 R
    /Encoding 1015 0 R
    /FirstChar 28
    /Subtype /Type1
>>

...

You can then find where font object 102 is used.

$ ./caradoc findref --num 102 --show <file.pdf>
Found in object 97 at entry /Font/F19
<<
    /Pattern 2 0 R
    /ColorSpace 3 0 R
    /Font <<
        /F17 103 0 R
        /F37 104 0 R
        /F19 102 0 R
        /F18 105 0 R
    >>
    /ProcSet [/PDF /Text]
    /ExtGState 1 0 R
    /XObject <<
        /Fm3 73 0 R
        /Fm4 75 0 R
        /Fm2 71 0 R
    >>
>>

...

Normalization options

Several new options allow to cope with errors commonly found in the wild.

Cleanup options

The cleanup command of Caradoc, that allows to normalize PDFs, now implements more options to increase the validation rate after normalization.

--simplify-info removes all non-standard metadata fields in the /Info dictionary. Many PDF producers add their own metadata fields, but these cannot be type-checked, because no specification defines their type. Hence, removing non-standard fields avoid type-checking errors.
Similarly, --remove-ptex removes all fields starting with /PTEX., everywhere in the document. These fields are created by LaTeX in some cases, but are not specified by the PDF standard.
--merge-content-stream merges content streams for each page. A content stream is a stream that contains graphical instructions, so each page refers to a content stream. But a page can also refer to an array of such content streams, in which case they are interpreted one after another. The problem is that they can be arbitrarily split, which means that they cannot be validated separately. However, Caradoc’s new validator for graphical instructions adopts a simple and conservative approach by fetching all objects of type content stream and validating each one separately. So this new normalization option helps simplifying the file to avoid validation errors.

Flawed xref streams

--xref-stream-default-zero can be used to fix flawed xref streams, if you get one of the following errors.

Free object entry in xref stream has no next generation number
Compressed object entry in xref stream has no index

An xref stream is a stream that contains an xref table in a custom binary format, with three columns per object: the first is the type of object and the other two depend on the type, according to the following table.

Type	Field 1	Field 2
free object	next free object number	next free generation number
in-use object	offset in file	generation number
compressed object	object stream containing it	index in the object stream

Each file defines the number of bits per column for these three columns, but sometimes the last field has 0 bits per column. Although this is not a problem for in-use objects, that have a default generation number of 0, the specification does not define defaults for free and compressed objects. In that case, the option --xref-stream-default-zero manually defines a default value of 0 in all cases, which fixes the problem in practice.

Other options

--allow-nonascii-in-names allows names to contain non-ASCII characters in non-escaped form. The specification does not strictly forbid this, but states that non-ASCII characters should be escaped.
--allow-overlaps allows several objects to overlap in a file. This is only for the relaxed parser; the strict parser always processes the file linearly, and expects objects to follow one another – without overlap or blank spaces.

What’s next?

Caradoc now covers a large part of the specification, and simple files can already be fully validated. Yet, this is still a beta version because some important features are not implemented. This gives an idea of what could come next year.

Support and validation of more compression filters: LZW, JPEG, JBIG…
More type definitions to cover more features: annotations, marked content…
Verification of resources associated with graphical instructions.

Apart from validation, new features could also come to Caradoc, to play with PDFs: file optimization, comparison between files, interactive UI, etc.

Comments

To react to this blog post please check the Twitter thread.

RSS | Mastodon | GitHub