Converting a PDF document into a tagged, text-based XML file is rarely trouble-free. PDF-to-XML is an easy-to-use utility whose parsing algorithm detects lines, page breaks, and paragraphs and tags them accordingly. It can also identify tables and tag their cells, columns, and rows to produce an XML file compatible with version 1.0 of the XML specification and above.
For the conversion to succeed, the original PDF document must be text-based or must already have gone through an OCR process, as PDF-to-XML has no OCR capabilities of its own. The conversion itself is as simple as it gets, thanks to the program’s wizard-driven interface. In the first step, you select a PDF file and define the page range to be converted. In the second, you can customize the supported XML tags, namely cell, column, line, page, par (for paragraphs), and row. This short list of supported elements gives you an idea of the types of PDF documents the tool can handle successfully. The third step of the wizard is the conversion process itself. You will need to repeat these three steps for every PDF document you wish to convert, as the tool has no batch conversion capabilities.
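To give a sense of how a file built from these six tags might be consumed afterwards, here is a minimal Python sketch that reads such output with the standard library. The nesting shown (a page holding par and row elements, par holding line, row holding cell) and the `document` root element are assumptions made for illustration only. The program's real layout may differ, and the tag names themselves can be changed in the wizard.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the nesting and the <document> root are assumptions;
# only the tag names (page, par, line, row, cell) come from the supported-tag list.
SAMPLE_XML = """<?xml version="1.0"?>
<document>
  <page>
    <par>
      <line>Quarterly report</line>
      <line>Prepared by the finance team</line>
    </par>
    <row><cell>Region</cell><cell>Sales</cell></row>
    <row><cell>North</cell><cell>1200</cell></row>
  </page>
</document>
"""

def summarize(xml_text):
    """Return, per page, the paragraphs (lines joined) and table rows (cell texts)."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        # Join the lines of each paragraph into one string.
        paragraphs = [
            " ".join((line.text or "").strip() for line in par.iter("line"))
            for par in page.iter("par")
        ]
        # Collect each table row as a list of cell strings.
        rows = [
            [(cell.text or "").strip() for cell in row.iter("cell")]
            for row in page.iter("row")
        ]
        pages.append({"paragraphs": paragraphs, "rows": rows})
    return pages

pages = summarize(SAMPLE_XML)
print(pages[0]["paragraphs"])
print(pages[0]["rows"])
```

Because everything in the output reduces to these few elements, a reader script like this stays short, but it also shows how little structure (headings, lists, images) survives the conversion.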
The quality of the resulting file depends on the complexity of the source PDF: the more complex the original, the less faithful the output. If the original document includes elements or objects that have no matching tag in the list above, the app reduces them to lines and paragraphs, which may not reflect the true nature of the object. If you use the Demo version to test the program’s functionality, you will find that a significant number of the characters in the original text have been replaced with asterisks in the resulting XML file.
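If you want a rough idea of how much of a Demo-version file has been masked, a small script can count the asterisks in the text content. This is only a sketch: the function name `masked_ratio` and the sample string are invented for illustration, and any asterisks that were genuinely present in the source PDF would be counted as masked too.

```python
import xml.etree.ElementTree as ET

def masked_ratio(xml_text):
    """Fraction of non-whitespace text characters that are asterisks."""
    root = ET.fromstring(xml_text)
    # Gather the text of every element, ignoring the tags themselves.
    text = "".join(root.itertext())
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return chars.count("*") / len(chars)

# Hypothetical fragment of Demo output; the real masking pattern may differ.
demo_output = "<page><par><line>Qu*rt*rly r*p*rt</line></par></page>"
print(f"{masked_ratio(demo_output):.0%} of characters are masked")
```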
What PDF-to-XML gains in simplicity of use, it loses in efficiency. The short list of supported XML tags and the limited number of conversion settings make it suitable almost exclusively for very simple PDF source files. It makes up for this limitation with a wizard-based interface that, even though it can handle only one file at a time, is suitable for all types of users.
- Wizard-driven interface
- Customizable XML tags
- Fast conversion speed
- No OCR capabilities
- Limited customization settings and tags
- No batch conversion supported
- Not suitable for highly complex PDFs