Case Study
PDF or XPS: Choose the Right Document Format for Your Applications
Providing a single source for creating, converting, and processing both PDF and XPS documents
Mar. 9, 2008 11:00 AM
Originally, we also looked at the Open Document Format (ODF) supported by the OASIS group, but that format struck us as archaic and built on a concept very different from that of PDF or XPS. This article still includes a feature comparison table covering all three formats, but it mostly discusses PDF and XPS.
Structure of a PDF File
A PDF file is made of a sequence of objects identified by numbers, followed by an index table containing the location of each object. Each object has a set of attributes associated with it. Some of the types of objects that can be found within a PDF file are document information, page descriptions, font descriptions, colorspaces, images, form field objects, and annotations. Each object can have contents associated with it. Contents are usually compressed with Flate compression (the algorithm also used by ZIP), whereas object descriptions and attributes cannot be compressed.
The contents of a page are a series of drawing instructions that define exactly how the page should be displayed and printed. Here's an example of a page description and its contents:
6 0 obj
<<
/Type /Page /Parent 3 0 R /MediaBox [0 0 612 792 ] /Contents [7 0 R ]
/Resources <</ProcSet [/PDF /Text]/Font <</F9 9 0 R >>>>
>>
endobj
7 0 obj
<< /Length 98 >>
stream
1 0 0 1 16.8 0 cm
n
BT /F9 12 Tf 1 0 0 1 73.2 708.96 Tm
-0.075 Tc 0.435 Tw (Hello World) Tj
ET
endstream
endobj
In the example above, we're drawing text on a page using a specified font at a specific size and location. In PDF, a text fragment is not an object; it is a series of drawing instructions. The same applies to vector graphics and to some images.
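The layout described in this section (numbered objects first, then an index table recording where each object starts) can be sketched in a few lines of Python. This is a minimal illustration rather than a production writer: the function name build_minimal_pdf is hypothetical, the font /F9 is assumed to map to the standard Helvetica font, and the text is assumed to contain no parentheses or backslashes, which would need escaping.

```python
import zlib


def build_minimal_pdf(text: str = "Hello World") -> tuple[bytes, list[int]]:
    """Assemble a tiny PDF: numbered objects first, then the index (xref) table."""
    # Page contents are drawing instructions, compressed here with Flate (zlib).
    instructions = (
        f"BT /F9 12 Tf 1 0 0 1 73.2 708.96 Tm ({text}) Tj ET".encode("latin-1")
    )
    compressed = zlib.compress(instructions)

    # Object descriptions (dictionaries) stay as plain, uncompressed text.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F9 5 0 R >> >> >>",
        b"<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(compressed)
        + compressed
        + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte position where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # The index table: one fixed-width, 20-byte entry per object.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out), offsets


pdf_bytes, object_offsets = build_minimal_pdf()
print(len(pdf_bytes), "bytes,", len(object_offsets), "objects")
```

Because every object's byte offset is recorded in the table, a viewer can jump straight to the page it needs without reading the rest of the file, which is one reason the format displays quickly.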
This file format was designed for very fast viewing and printing. Its main inconveniences are:
- Extracting document contents in a meaningful way for analysis purposes can be very painful as the content is not structured.
- A larger file size because only fragments of the file are compressed.
- PDF files can be corrupted when sent by email or FTP because their content might be treated as text rather than binary (www.amyuni.com/forum/viewtopic.php?p=1063). The binary part is usually the compressed content.
Note that Adobe attempted to compress the whole file contents in the latest revisions of the format, but this resulted in backward-incompatible files, so we now have two different file formats for PDF.
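To illustrate the corruption risk listed above, here is a small sketch of what a text-mode transfer can do to a PDF. It assumes the transfer rewrites every line feed as a carriage return plus line feed, as FTP ASCII mode does when the receiving system uses CRLF line endings; the helper name text_mode_transfer and the sample bytes are made up for the illustration.

```python
def text_mode_transfer(data: bytes) -> bytes:
    """Mimic a text-mode transfer that rewrites every LF as CRLF."""
    return data.replace(b"\n", b"\r\n")


original = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages >>\nendobj\n"
)

# The position the index table would record for object 2.
offset_of_object_2 = original.index(b"2 0 obj")

damaged = text_mode_transfer(original)

# In the original, the recorded offset points at the object header.
print(original[offset_of_object_2:offset_of_object_2 + 7])
# After the transfer, the same offset lands somewhere else.
print(damaged[offset_of_object_2:offset_of_object_2 + 7])
# One extra byte was added for every line feed in the file.
print(len(damaged) - len(original))
```

The shifted offsets are the milder half of the problem: the same byte substitution inside a compressed stream changes the compressed data itself, which generally cannot be decompressed afterwards.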
Structure of an XPS File
An XPS file is made of a sequence of objects identified by URIs (Uniform Resource Identifiers). A URI can look like /images/image1.jpg. The objects are stored in a regular ZIP file where all the object descriptions and object contents are compressed. The objects in a ZIP file are not indexed, which makes access to an object much slower than in the case of PDF.
The content of a page is made of objects described in XML format rather than drawing instructions. Here's an example of a page description and contents in XPS:
<FixedPage Width="816" Height="1056"
xmlns="http://schemas.microsoft.com/xps/2005/06"
xml:lang="und">
<Glyphs Fill="#ff000000"
FontUri="/Documents/1/Resources/Fonts/71271D1B84C9.odttf"
FontRenderingEmSize="15.9697"
StyleSimulations="None" OriginX="120"
OriginY="110.4"
Indices="43;72;79,27;79,29;82;3;58;82;85;79;71;3"
UnicodeString="Hello World " />
</FixedPage>
Text here is an object that has various attributes and can be easily processed. The same thing applies to vector graphics and images.
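Because the text is an XML element with attributes, it can be read back with a general-purpose ZIP reader and XML parser. The sketch below uses only the Python standard library. It builds a stripped-down package in memory that holds just one page part, so it is not a valid XPS document (a real package also carries content-type and relationship parts), and the part name Documents/1/Pages/1.fpage and both function names are assumptions made for the illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

XPS_NS = "http://schemas.microsoft.com/xps/2005/06"
PAGE_PART = "Documents/1/Pages/1.fpage"  # assumed part name, for illustration

PAGE_XML = f"""<FixedPage Width="816" Height="1056" xmlns="{XPS_NS}" xml:lang="und">
  <Glyphs Fill="#ff000000"
          FontUri="/Documents/1/Resources/Fonts/71271D1B84C9.odttf"
          FontRenderingEmSize="15.9697" OriginX="120" OriginY="110.4"
          UnicodeString="Hello World " />
</FixedPage>"""


def make_package() -> bytes:
    """Build an in-memory ZIP holding a single page part (not a full XPS file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr(PAGE_PART, PAGE_XML)
    return buffer.getvalue()


def extract_text(package_bytes: bytes, part_name: str) -> list[tuple[str, float, float]]:
    """Return (text, x, y) for every Glyphs element on the given page part."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read(part_name))
    return [
        (
            glyphs.get("UnicodeString"),
            float(glyphs.get("OriginX")),
            float(glyphs.get("OriginY")),
        )
        for glyphs in root.iter(f"{{{XPS_NS}}}Glyphs")
    ]


for text, x, y in extract_text(make_package(), PAGE_PART):
    print(repr(text), "at", x, y)
```

Getting the same information out of a PDF page means interpreting the drawing instructions and tracking the current font and text matrix, which is considerably more work.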
This XML-based format was designed to be easily extensible by simply adding attributes or objects to the schema. The main inconveniences of this format are:
- Once decompressed, the page content tends to be quite large.
- Processing page content can be much slower than with PDF, especially in printers, which have more limited resources than PCs.
- Loading a complete document or searching for text within a document tends to be much slower than with PDF.
Packaging XPS documents as a ZIP file makes the manipulation of XPS files much easier:
- Images and fonts are stored in their original format without the need for the extra processing that is required by PDF.
- The XML parts can be easily edited even with a simple text editor.
- One can easily add items to the ZIP package or replace some items. For example, one can substitute one image for another by simply opening the ZIP package with standard tools. Note that this ease of modification can be seen as a disadvantage or a security threat for some applications.
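The substitution described in the last point can be sketched with Python's standard zipfile module. The module has no in-place replace operation, so this sketch copies every item into a new archive and swaps the one being replaced. The function name replace_part and the sample part names are hypothetical, and a real XPS package may need other parts updated to stay consistent.

```python
import io
import zipfile


def replace_part(package_bytes: bytes, part_name: str, new_data: bytes) -> bytes:
    """Copy a ZIP package, substituting new_data for the item called part_name."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for item in source.infolist():
            if item.filename == part_name:
                data = new_data  # the substituted item
            else:
                data = source.read(item.filename)  # copied unchanged
            target.writestr(item.filename, data)
    return output.getvalue()


# Build a small stand-in package with two items (placeholder bytes, not real files).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
    package.writestr("images/image1.jpg", b"old image bytes")
    package.writestr("Documents/1/Pages/1.fpage", b"<FixedPage />")

updated = replace_part(buffer.getvalue(), "images/image1.jpg", b"new image bytes")

with zipfile.ZipFile(io.BytesIO(updated)) as package:
    print(package.namelist())
    print(package.read("images/image1.jpg"))
```

Nothing in the ZIP container itself prevents this kind of substitution, which is why the same convenience can be a concern for applications that need tamper resistance.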