org.faceless.pdf2
Class PDFParser

java.lang.Object
  extended by org.faceless.pdf2.PDFParser
All Implemented Interfaces:
Pageable

public class PDFParser
extends Object
implements Pageable

The PDFParser class can be used to parse the contents of a PDF document, for example converting it to an Image, writing to TIFF, printing it and so on. Typically you will either use PDFParser directly when working on the whole document (for instance, to save the PDF as a multi-page TIFF), or will use it to get a PagePainter object for parsing individual pages or a PageExtractor object, to extract text and images from a specific page.

Note that this class is part of the "Viewer Extension" of the library. Although it is supplied with the package, a "Viewer Extension" license must be purchased to activate it. While the library is unlicensed this class may be used freely, although a "DEMO" stamp will be applied to each document.

This class implements Pageable, which means it can be printed directly using the PrinterJob.setPageable() method.
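As a rough sketch of what implementing Pageable means in practice, the helper below accepts any java.awt.print.Pageable and hands it to a PrinterJob. The class and method names (PageablePrinting, print, countPortraitPages) are illustrative only and are not part of this library. A PDFParser would be passed where the Pageable argument appears.

```java
import java.awt.print.PageFormat;
import java.awt.print.Pageable;
import java.awt.print.PrinterException;
import java.awt.print.PrinterJob;

class PageablePrinting {
    /** Sends every page of the Pageable to the default printer. */
    static void print(Pageable document) throws PrinterException {
        PrinterJob job = PrinterJob.getPrinterJob();
        // With this library the argument would be a PDFParser,
        // e.g. job.setPageable(new PDFParser(pdf))
        job.setPageable(document);
        job.print();
    }

    /** Counts portrait pages; pages are indexed from 0 to getNumberOfPages()-1. */
    static int countPortraitPages(Pageable document) {
        int portrait = 0;
        for (int i = 0; i < document.getNumberOfPages(); i++) {
            if (document.getPageFormat(i).getOrientation() == PageFormat.PORTRAIT) {
                portrait++;
            }
        }
        return portrait;
    }
}
```

Checking isPrintable() before calling print is probably worthwhile, since an encrypted document may not grant the right to print.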

Since:
2.5

Field Summary
static ColorModel BLACKANDWHITE
          A ColorModel that can be passed in to writeAsTIFF() or the various PagePainter methods, which represents a 1-bit black and white color model.
static ColorModel CMYK
          A ColorModel that can be passed in to writeAsTIFF() or the various PagePainter methods, which represents an opaque CMYK color model.
static ColorModel GRAYSCALE
          A ColorModel that can be passed in to writeAsTIFF() or the various PagePainter methods, which represents an opaque grayscale color model.
static ColorModel RGB
          A ColorModel that can be passed in to writeAsTIFF() or the various PagePainter methods, which represents an opaque RGB color model.
static ColorModel RGBA
          A ColorModel that can be passed in to writeAsTIFF() or the various PagePainter methods, which represents a translucent RGB color model with an alpha component.
 
Fields inherited from interface java.awt.print.Pageable
UNKNOWN_NUMBER_OF_PAGES
 
Constructor Summary
PDFParser(PDF pdf)
          Creates a PDFParser from the specified PDF document.
 
Method Summary
static ColorModel getBlackAndWhiteColorModel(int threshold)
           Return a Black and White ColorModel that ensures that any colours below the specified threshold are converted to black.
 org.apache.lucene.document.Document getLuceneDocument(boolean createall, boolean createbody, boolean createpages)
           Create a Document object for indexing the PDF with the Apache Lucene full-text indexing library.
 int getNumberOfPages()
          Return the number of pages in the document being parsed.
 PageExtractor getPageExtractor(int pagenumber)
          Returns a PageExtractor for the specified page number.
 PageExtractor getPageExtractor(PDFPage page)
          Returns a PageExtractor for the specified page.
 List getPageExtractors()
          Get a list containing all the PageExtractors for this PDF, in order.
 PageFormat getPageFormat(int pagenumber)
          Returns the PageFormat for the specified page.
 PagePainter getPagePainter(int pagenumber)
          Returns a PagePainter for the specified page number.
 PagePainter getPagePainter(PDFPage page)
          Returns a PagePainter for the specified page.
 PDF getPDF()
          Return the PDF this PDFParser is built from.
 Printable getPrintable(int pagenumber)
          Returns the Printable interface for a page.
 float getWriteAsTIFFProgress()
          Get the progress of the writeAsTIFF() method running in a different thread.
 boolean isExtractable()
          Return true if this PDF allows its text and/or images to be extracted by calling the getPageExtractor(int) method.
 boolean isPrintable()
          Return true if this PDF is allowed to be printed.
 void resetPageExtractor(PDFPage page)
          Reset the previously created PageExtractor.
 void setFont(String fontname, Object font)
          Specify a font substitution to use.
 void setOutputProfile(OutputProfile profile)
          Set the OutputProfile which should be updated for any extraction or rendering performed with this PDFParser.
 void writeAsTIFF(OutputStream out, int dpi, ColorModel model)
           Convert the PDF to a TIFF image using the specified ColorModel and dots per inch.
 void writeAsTIFF(OutputStream out, int dpi, ColorModel model, RenderingHints hints)
          As for writeAsTIFF(OutputStream,int,ColorModel) but allows the user to set RenderingHints to control the rendering process.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

BLACKANDWHITE

public static final ColorModel BLACKANDWHITE
A ColorModel that can be passed in to writeAsTIFF() or the various PagePainter methods, which represents a 1-bit black and white color model. When writing TIFF images, however, we recommend using a model returned by getBlackAndWhiteColorModel(int) instead of this model, as those models are much faster.

See Also:
getBlackAndWhiteColorModel(int)

GRAYSCALE

public static final ColorModel GRAYSCALE
A ColorModel that can be passed in to writeAsTIFF() or the various PagePainter methods, which represents an opaque grayscale color model.


RGB

public static final ColorModel RGB
A ColorModel that can be passed in to writeAsTIFF() or the various PagePainter methods, which represents an opaque RGB color model.


RGBA

public static final ColorModel RGBA
A ColorModel that can be passed in to writeAsTIFF() or the various PagePainter methods, which represents a translucent RGB color model with an alpha component. TIFFs created this way will have a transparent background.

Since:
2.5.2

CMYK

public static final ColorModel CMYK
A ColorModel that can be passed in to writeAsTIFF() or the various PagePainter methods, which represents an opaque CMYK color model.

Since:
2.5.2
Constructor Detail

PDFParser

public PDFParser(PDF pdf)
Creates a PDFParser from the specified PDF document.

Parameters:
pdf - the PDF to parse
Method Detail

getPDF

public final PDF getPDF()
Return the PDF this PDFParser is built from.

Since:
2.11.3

getPagePainter

public PagePainter getPagePainter(int pagenumber)
Returns a PagePainter for the specified page number. Just calls getPagePainter(pdf.getPage(pagenumber))

Parameters:
pagenumber - the page to select, from 0 to PDF.getNumberOfPages()-1
Returns:
a PagePainter for the specified page

getPagePainter

public PagePainter getPagePainter(PDFPage page)
Returns a PagePainter for the specified page.

Parameters:
page - the PDFPage to select
Returns:
a PagePainter for the specified page
Since:
2.7.1

getPageExtractor

public PageExtractor getPageExtractor(int pagenumber)
Returns a PageExtractor for the specified page number. Just calls getPageExtractor(pdf.getPage(pagenumber))

Parameters:
pagenumber - the page to select, from 0 to PDF.getNumberOfPages()-1
Returns:
a PageExtractor for the specified page
Since:
2.6.1
See Also:
isExtractable()

getPageExtractor

public PageExtractor getPageExtractor(PDFPage page)
Returns a PageExtractor for the specified page. If the PDF does not allow extraction, a SecurityException is thrown.

Parameters:
page - the page to select.
Returns:
a PageExtractor for the specified page
Since:
2.7.1
See Also:
isExtractable()

resetPageExtractor

public void resetPageExtractor(PDFPage page)
Reset the previously created PageExtractor. This only needs to be done if the page has had its content altered, i.e. by appending to it or by changing its orientation.

Since:
2.11.7

getPageExtractors

public List getPageExtractors()
Get a list containing all the PageExtractors for this PDF, in order. This is not a particularly expensive operation as the extraction is not run when the extractor is created.

Since:
2.11.7

setFont

public void setFont(String fontname,
                    Object font)
Specify a font substitution to use. For unembedded fonts, the library must choose a substitute font to render the glyphs. Typically the heuristics used are quite effective, but occasionally (particularly with East Asian fonts) they may need to be overridden. This method allows you to specify the mapping from a PDF font name to an AWT font, overriding the heuristics.
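One way to keep such overrides in one place is a small table of substitutions that is applied in a single pass. The sketch below is only an illustration: the class FontSubstitutions, the PDF font names "MS-Mincho" and "MS-Gothic", and the chosen logical AWT fonts are all assumptions. The setFont call is passed in as a BiConsumer so that a method reference such as parser::setFont could be supplied.

```java
import java.awt.Font;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

class FontSubstitutions {
    /** Example table: PDF font name -> AWT font. The names are hypothetical. */
    static Map<String, Font> defaults() {
        Map<String, Font> map = new LinkedHashMap<>();
        map.put("MS-Mincho", new Font(Font.SERIF, Font.PLAIN, 12));
        map.put("MS-Gothic", new Font(Font.SANS_SERIF, Font.PLAIN, 12));
        return map;
    }

    /**
     * Passes every entry to the supplied setter and returns the number applied.
     * With this library the setter would be parser::setFont.
     */
    static int apply(Map<String, Font> substitutions, BiConsumer<String, Object> setFont) {
        substitutions.forEach(setFont);
        return substitutions.size();
    }
}
```

A call might look like FontSubstitutions.apply(FontSubstitutions.defaults(), parser::setFont), made before any page is rendered.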

Parameters:
fontname - the name of the font used in the PDF
font - the Font to use - either a Font or an OpenTypeFont
Since:
2.7.7 (since 2.11.17 the second parameter can also be an OpenTypeFont)

isPrintable

public boolean isPrintable()
Return true if this PDF is allowed to be printed. Since 2.8.2 this method simply returns the value of EncryptionHandler.hasRight("Print")

Returns:
true if the document is allowed to be printed

isExtractable

public boolean isExtractable()
Return true if this PDF allows its text and/or images to be extracted by calling the getPageExtractor(int) method. PDFs may optionally be encrypted to prevent this - see the StandardEncryptionHandler class for more information. Since 2.8.2 this method simply returns the value of EncryptionHandler.hasRight("Extract").

Returns:
true if the document can have its text and/or images extracted.

writeAsTIFF

public void writeAsTIFF(OutputStream out,
                        int dpi,
                        ColorModel model)
                 throws IOException

Convert the PDF to a TIFF image using the specified ColorModel and dots per inch. For example, to convert the PDF to a black and white TIFF, try:

   PDFParser parser = new PDFParser(pdf);
   FileOutputStream out = new FileOutputStream("out.tif");
   parser.writeAsTIFF(out, 72, PDFParser.BLACKANDWHITE);
   out.close();
 

The ColorModel determines what type of TIFF is created and what sort of compression is used. For instance, passing in a 1-bit black & white model will result in a black & white TIFF compressed with CCITT Group 4 compression. If the specified model returns Transparency.TRANSLUCENT from ColorModel.getTransparency() then the TIFF will be written with alpha values and created with a transparent background; otherwise the TIFF will have a white background and will be written without alpha values. Note that specifying a model that doesn't match the model of the PDF causes color conversions to be applied, which can be quite slow.
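The transparency rule described above can be checked in advance with the standard java.awt.image API. The following sketch uses only JDK color models as stand-ins (the class name TiffBackground and its methods are illustrative). The same checks could be applied to any model before passing it to writeAsTIFF().

```java
import java.awt.Transparency;
import java.awt.image.ColorModel;

class TiffBackground {
    /** True if the model reports TRANSLUCENT, the condition described for an alpha TIFF. */
    static boolean writesAlpha(ColorModel model) {
        return model.getTransparency() == Transparency.TRANSLUCENT;
    }

    /** True if each pixel occupies a single bit, as in a black and white model. */
    static boolean isOneBit(ColorModel model) {
        return model.getPixelSize() == 1;
    }
}
```

For example, ColorModel.getRGBdefault() carries an alpha channel and reports TRANSLUCENT, while the model of a BufferedImage.TYPE_INT_RGB image is opaque.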

You can create TIFF images that contain fewer than all the pages of the PDF by manipulating the PDF's page list before saving. Say, for example, you want to create 10 single-page TIFF images from your 10-page PDF document. Here's how:

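 // Assumes "out" is an array of open OutputStreams, one per page,
 // and that "dpi" and "model" are already defined.
 List copy = new ArrayList(pdf.getPages());
 for (int i=0;i<copy.size();i++) {
     pdf.getPages().clear();
     pdf.getPages().add(copy.get(i));
     // writeAsTIFF() is a PDFParser method, so parse the one-page document
     new PDFParser(pdf).writeAsTIFF(out[i], dpi, model);
 }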
 

Parallel Operation Note: Since 2.10, this method can optionally run multiple threads in parallel to speed up writing. To enable this, set the Threads.TIFF property (typically by setting the org.faceless.pdf2.Threads.TIFF System property) to the number of threads you want to use. Note that each thread may require a significant amount of memory. How much depends on the content of each page, so it's very difficult to determine in advance. Carefully tune this value yourself, based on the amount of memory in your system and the type of documents you're working with, in order to avoid an OutOfMemoryError.
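As a starting point for that tuning, the sketch below caps the thread count by both the number of processors and a per-thread memory budget, then sets the System property named above. The per-thread budget is a guess that you supply, the class TiffThreads is illustrative, and the property is assumed to be read by the library after it has been set.

```java
class TiffThreads {
    static final String PROPERTY = "org.faceless.pdf2.Threads.TIFF";

    /** Picks at least one thread, limited by CPU count and by heap / per-thread budget. */
    static int chooseThreads(int cpus, long heapBytes, long perThreadBytes) {
        if (perThreadBytes <= 0) {
            throw new IllegalArgumentException("perThreadBytes must be positive");
        }
        long byMemory = heapBytes / perThreadBytes;
        return (int) Math.max(1, Math.min(cpus, byMemory));
    }

    /** Sets the System property from the current JVM's limits and returns the value used. */
    static int configure(long perThreadBytes) {
        Runtime rt = Runtime.getRuntime();
        int threads = chooseThreads(rt.availableProcessors(), rt.maxMemory(), perThreadBytes);
        System.setProperty(PROPERTY, Integer.toString(threads));
        return threads;
    }
}
```

Calling TiffThreads.configure(256L << 20) would budget 256 MB per thread. The right figure depends entirely on your documents.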

Parameters:
out - The OutputStream to write the TIFF to. The stream will be left open on completion
dpi - how many dots per inch to render the page at. A value of 72 gives 1 point per pixel. As a special hack for those creating Class F TIFF images, a DPI of -1 gives a 204x196 DPI image and -2 gives 204x96 DPI (these added in 2.6.9).
model - the ColorModel to use to render the images.
Throws:
IOException - if an exception is encountered when writing the TIFF
See Also:
BLACKANDWHITE, RGB, CMYK, getBlackAndWhiteColorModel(int)
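
Because 72 DPI gives 1 point per pixel, the pixel size of a rendered page can be estimated from its size in points as points * dpi / 72. The helper below is a back-of-envelope sketch only: the class name TiffPixels is illustrative, rounding to the nearest pixel is an assumption, and the special values -1 and -2 are not handled.

```java
class TiffPixels {
    /** Estimated pixel length for a page dimension given in points (1/72 inch). */
    static int pixels(double points, int dpi) {
        if (dpi <= 0) {
            throw new IllegalArgumentException("only positive DPI values are supported here");
        }
        return (int) Math.round(points * dpi / 72.0);
    }
}
```

A US Letter page (612 x 792 points) at 300 DPI comes to roughly 2550 x 3300 pixels, which is useful when estimating memory per page.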

writeAsTIFF

public void writeAsTIFF(OutputStream out,
                        int dpi,
                        ColorModel model,
                        RenderingHints hints)
                 throws IOException
As for writeAsTIFF(OutputStream,int,ColorModel) but allows the user to set RenderingHints to control the rendering process.

Parameters:
out - The OutputStream to write the TIFF to. The stream will be left open on completion
dpi - how many dots per inch to render the page at. A value of 72 gives 1 point per pixel. As a special hack for those creating Class F TIFF images, a DPI of -1 gives a 204x196 DPI image and -2 gives 204x96 DPI (these added in 2.6.9).
model - the ColorModel to use to render the images.
hints - the RenderingHints to be used when rendering the image, or null to use the defaults.
Throws:
IOException - if an exception is encountered when writing the TIFF
Since:
2.6.3
See Also:
BLACKANDWHITE, RGB, CMYK, getBlackAndWhiteColorModel(int)

getWriteAsTIFFProgress

public float getWriteAsTIFFProgress()
Get the progress of the writeAsTIFF() method running in a different thread. The returned value will start at 0 and move towards 1 as the write progresses.
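A polling loop along these lines could be used to report that progress. The sketch is deliberately generic so that it runs without the library: the class TiffProgressMonitor is illustrative, writeTask stands for a Runnable that calls writeAsTIFF(), and progress stands for a method reference such as parser::getWriteAsTIFFProgress.

```java
import java.util.function.Supplier;

class TiffProgressMonitor {
    /**
     * Runs the write on a worker thread and prints the progress value
     * every pollMillis until the worker finishes. Returns the last value read.
     */
    static float runWithProgress(Runnable writeTask, Supplier<Float> progress, long pollMillis) {
        Thread worker = new Thread(writeTask, "tiff-writer");
        worker.start();
        try {
            while (worker.isAlive()) {
                System.out.printf("written: %.0f%%%n", progress.get() * 100);
                worker.join(pollMillis);   // wait, but wake up to report again
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt for the caller
        }
        return progress.get();
    }
}
```

Since writeAsTIFF() throws IOException, the Runnable would need to catch it and record or rethrow it as an unchecked exception.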

Since:
2.8

setOutputProfile

public void setOutputProfile(OutputProfile profile)
Set the OutputProfile which should be updated for any extraction or rendering performed with this PDFParser. This will not give the full PDF OutputProfile (for that you should call PDF.getFullOutputProfile()), but it can be used to determine some of the features that apply to particular pages.

Since:
2.11.25

getNumberOfPages

public int getNumberOfPages()
Return the number of pages in the document being parsed. Needed for the Pageable interface, this method just calls PDF.getNumberOfPages()

Specified by:
getNumberOfPages in interface Pageable
Returns:
the number of pages in the document being parsed

getPageFormat

public PageFormat getPageFormat(int pagenumber)
Returns the PageFormat for the specified page.

Specified by:
getPageFormat in interface Pageable
Parameters:
pagenumber - the page to select, from 0 to PDF.getNumberOfPages()-1
Returns:
the PageFormat for page at index pagenumber

getPrintable

public Printable getPrintable(int pagenumber)
Returns the Printable interface for a page. Needed for the Pageable interface, this method just calls getPagePainter(int)

Specified by:
getPrintable in interface Pageable
Parameters:
pagenumber - the page to select, from 0 to PDF.getNumberOfPages()-1
Returns:
the Printable object for the specified page

getLuceneDocument

public org.apache.lucene.document.Document getLuceneDocument(boolean createall,
                                                             boolean createbody,
                                                             boolean createpages)

Create a Document object for indexing the PDF with the Apache Lucene full-text indexing library. The Document is created with Field objects representing the content of the PDF, the info dictionary, the form and any annotations that may be there. The fields are called:

body - The contents of all the pages in the PDF
page.n - The contents of page n of the PDF
info.field - The contents of the field "field" of the Info dictionary, e.g. info.Title
info - The contents of the whole Info dictionary as one item
form.field - The contents of the field "field" of the Form
form - The contents of the whole Form as one item
annotations - The contents of all the annotations in the document as one item
all - All the fields above concatenated into one big field, useful for searching the entire textual content of the PDF in one go

Because creating indices for all, body and page.n is usually redundant (typically you will want only one of them), they can be turned on or off individually by setting the appropriate parameter to true or false.

Parameters:
createall - whether to create an all entry in the index
createbody - whether to create a body entry in the index
createpages - whether to create the page.n entries in the index
Returns:
a Document suitable for indexing with Lucene.
Since:
2.6.2

getBlackAndWhiteColorModel

public static ColorModel getBlackAndWhiteColorModel(int threshold)

Return a Black and White ColorModel that ensures that any colours below the specified threshold are converted to black. This method can be used to convert images that have shades of gray to black and white TIFF images - because it renders the PDF to RGB before manually converting it to Black and White it avoids some of the platform dependent behaviour that arises from using BLACKANDWHITE, and will probably run faster on many operating systems.

Parameters:
threshold - a number between 0 and 255 - typically around 128 or so. Higher values result in more black.

Since 2.11.17 the value "0" can be used to automatically determine the threshold value using Otsu's algorithm. This may be appropriate for poor quality images.

Note this ColorModel should only be used in the writeAsTIFF(java.io.OutputStream, int, java.awt.image.ColorModel) method - passing it into one of the PagePainter.getImage methods will not work.
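
To illustrate what the threshold means, the sketch below classifies gray levels against a threshold and computes a threshold with a textbook version of Otsu's method from a 256-bin histogram. It is not the library's implementation. The class OtsuSketch and its methods are illustrative, and the convention that gray levels strictly below the threshold become black is taken from the description above.

```java
class OtsuSketch {
    /** Gray levels strictly below the threshold are treated as black. */
    static boolean isBlack(int gray, int threshold) {
        return gray < threshold;
    }

    /** Textbook Otsu: choose the split that maximises between-class variance. */
    static int otsuThreshold(int[] histogram) {
        if (histogram.length != 256) {
            throw new IllegalArgumentException("expected a 256-bin histogram");
        }
        long total = 0;
        double sumAll = 0;
        for (int i = 0; i < 256; i++) {
            total += histogram[i];
            sumAll += (double) i * histogram[i];
        }
        long weightDark = 0;
        double sumDark = 0;
        double bestVariance = -1;
        int best = 0;
        for (int t = 0; t < 256; t++) {
            weightDark += histogram[t];
            if (weightDark == 0) continue;          // no dark pixels yet
            long weightLight = total - weightDark;
            if (weightLight == 0) break;            // nothing left in the light class
            sumDark += (double) t * histogram[t];
            double meanDark = sumDark / weightDark;
            double meanLight = (sumAll - sumDark) / weightLight;
            double diff = meanDark - meanLight;
            double variance = (double) weightDark * weightLight * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                best = t;
            }
        }
        return best + 1;   // levels <= best are dark, so "below best + 1" is black
    }
}
```

With a fixed threshold of 128, a gray level of 100 is black and 200 is white. Raising the threshold moves more levels into the black class, which matches the note that higher values result in more black.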

Since:
2.6.8
See Also:
BLACKANDWHITE, writeAsTIFF(java.io.OutputStream, int, java.awt.image.ColorModel)


Copyright © 2001-2010 Big Faceless Organization