java.lang.Object
  org.faceless.pdf2.PDFParser
public class PDFParser
The PDFParser class can be used to parse the contents of a PDF document, for example converting it to an Image, writing it to TIFF, printing it and so on. Typically you will either use PDFParser directly when working on the whole document (for instance, to save the PDF as a multi-page TIFF), or use it to get a PagePainter object for parsing individual pages, or a PageExtractor object to extract text and images from a specific page.
Note that this class is part of the "Viewer Extension" of the library: although it's supplied with the package, a "Viewer Extension" license must be purchased to activate this class. While the library is unlicensed this class may be used freely, although a "DEMO" stamp will be applied to each document.
This class implements Pageable, which means it can be printed directly using the PrinterJob.setPageable() method.
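As a rough illustration of what implementing Pageable makes possible, the sketch below hands a Pageable to PrinterJob.setPageable(). It uses java.awt.print.Book as a stand-in so that it runs with the JDK alone. The names PageablePrintSketch, demoDocument and print are hypothetical, and it is an assumption that a PDFParser would simply be passed where the Book is.

```java
import java.awt.Graphics;
import java.awt.print.Book;
import java.awt.print.PageFormat;
import java.awt.print.Pageable;
import java.awt.print.Printable;
import java.awt.print.PrinterException;
import java.awt.print.PrinterJob;

final class PageablePrintSketch {

    /** Builds a stand-in Pageable with the given number of pages (assumption: a PDFParser would take its place). */
    static Pageable demoDocument(int pages) {
        Printable painter = (Graphics g, PageFormat pf, int index) -> {
            g.drawString("Page " + (index + 1),
                    (int) pf.getImageableX() + 20, (int) pf.getImageableY() + 20);
            return Printable.PAGE_EXISTS;
        };
        Book book = new Book();
        book.append(painter, new PageFormat(), pages);
        return book;
    }

    /** Sends every page of the Pageable to the default printer. */
    static void print(Pageable document) throws PrinterException {
        PrinterJob job = PrinterJob.getPrinterJob();
        job.setPageable(document);
        job.print();
    }

    public static void main(String[] args) throws PrinterException {
        Pageable document = demoDocument(3);
        System.out.println("Pages to print: " + document.getNumberOfPages());
        // Uncomment to send the pages to the default printer:
        // print(document);
    }
}
```

Before printing a real document it is probably worth checking isPrintable() first, since an encrypted PDF may not allow printing.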
Field Summary | |
---|---|
static ColorModel | BLACKANDWHITE - A ColorModel, representing a 1-bit black and white color model, that can be passed in to writeAsTIFF() or the various PagePainter methods. |
static ColorModel | CMYK - A ColorModel, representing an opaque CMYK color model, that can be passed in to writeAsTIFF() or the various PagePainter methods. |
static ColorModel | GRAYSCALE - A ColorModel, representing an opaque grayscale color model, that can be passed in to writeAsTIFF() or the various PagePainter methods. |
static ColorModel | RGB - A ColorModel, representing an opaque RGB color model, that can be passed in to writeAsTIFF() or the various PagePainter methods. |
static ColorModel | RGBA - A ColorModel, representing a translucent RGB color model with an alpha component, that can be passed in to writeAsTIFF() or the various PagePainter methods. |
Fields inherited from interface java.awt.print.Pageable |
---|
UNKNOWN_NUMBER_OF_PAGES |
Constructor Summary |
---|
PDFParser(PDF pdf) - Creates a PDFParser from the specified PDF document. |
Method Summary | |
---|---|
static ColorModel | getBlackAndWhiteColorModel(int threshold) - Return a Black and White ColorModel that ensures that any colours below the specified threshold are converted to black. |
org.apache.lucene.document.Document | getLuceneDocument(boolean createall, boolean createbody, boolean createpages) - Create a Document object for indexing the PDF with the Apache Lucene full-text indexing library. |
int | getNumberOfPages() - Return the number of pages in the document being parsed. |
PageExtractor | getPageExtractor(int pagenumber) - Returns a PageExtractor for the specified page number. |
PageExtractor | getPageExtractor(PDFPage page) - Returns a PageExtractor for the specified page. |
List | getPageExtractors() - Get a list containing all the PageExtractors for this PDF, in order. |
PageFormat | getPageFormat(int pagenumber) - Returns the PageFormat for the specified page. |
PagePainter | getPagePainter(int pagenumber) - Returns a PagePainter for the specified page number. |
PagePainter | getPagePainter(PDFPage page) - Returns a PagePainter for the specified page. |
PDF | getPDF() - Return the PDF this PDFParser is built from. |
Printable | getPrintable(int pagenumber) - Returns the Printable interface for a page. |
float | getWriteAsTIFFProgress() - Get the progress of the writeAsTIFF() method running in a different thread. |
boolean | isExtractable() - Return true if this PDF allows its text and/or images to be extracted by calling the getPageExtractor(int) method. |
boolean | isPrintable() - Return true if this PDF is allowed to be printed. |
void | resetPageExtractor(PDFPage page) - Reset the previously created PageExtractor. |
void | setFont(String fontname, Object font) - Specify a font substitution to use. |
void | setOutputProfile(OutputProfile profile) - Set the OutputProfile which should be updated for any extraction or rendering performed with this PDFParser. |
void | writeAsTIFF(OutputStream out, int dpi, ColorModel model) - Convert the PDF to a TIFF image using the specified ColorModel and dots per inch. |
void | writeAsTIFF(OutputStream out, int dpi, ColorModel model, RenderingHints hints) - As for writeAsTIFF(OutputStream,int,ColorModel) but allows the user to set RenderingHints to control the rendering process. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final ColorModel BLACKANDWHITE
A ColorModel, representing a 1-bit black and white color model, that can be passed in to writeAsTIFF() or the various PagePainter methods. When writing TIFF images, however, we recommend using a model returned by getBlackAndWhiteColorModel(int) instead of this model, as those models are much faster.
See Also: getBlackAndWhiteColorModel(int)
public static final ColorModel GRAYSCALE
A ColorModel, representing an opaque grayscale color model, that can be passed in to writeAsTIFF() or the various PagePainter methods.
public static final ColorModel RGB
A ColorModel, representing an opaque RGB color model, that can be passed in to writeAsTIFF() or the various PagePainter methods.
public static final ColorModel RGBA
A ColorModel, representing a translucent RGB color model with an alpha component, that can be passed in to writeAsTIFF() or the various PagePainter methods. TIFFs created this way will have a transparent background.
public static final ColorModel CMYK
A ColorModel, representing an opaque CMYK color model, that can be passed in to writeAsTIFF() or the various PagePainter methods.
Constructor Detail |
---|
public PDFParser(PDF pdf)
Creates a PDFParser from the specified PDF document.
Parameters: pdf - the PDF to parse
Method Detail |
---|
public final PDF getPDF()
Return the PDF this PDFParser is built from.
public PagePainter getPagePainter(int pagenumber)
Returns a PagePainter for the specified page number. Just calls getPagePainter(pdf.getPage(pagenumber)).
Parameters: pagenumber - the page to select, from 0 to PDF.getNumberOfPages()-1
public PagePainter getPagePainter(PDFPage page)
Returns a PagePainter for the specified page.
Parameters: page - the PDFPage to select
public PageExtractor getPageExtractor(int pagenumber)
Returns a PageExtractor for the specified page number. Just calls getPageExtractor(pdf.getPage(pagenumber)).
Parameters: pagenumber - the page to select, from 0 to PDF.getNumberOfPages()-1
See Also: isExtractable()
public PageExtractor getPageExtractor(PDFPage page)
Returns a PageExtractor for the specified page. If the PDF does not allow extraction, throws a SecurityException.
Parameters: page - the page to select
See Also: isExtractable()
public void resetPageExtractor(PDFPage page)
Reset the previously created PageExtractor.
public List getPageExtractors()
Get a list containing all the PageExtractors for this PDF, in order.
public void setFont(String fontname, Object font)
Specify a font substitution to use.
Parameters: fontname - the name of the font used in the PDF
font - the Font to use, either a Font or an OpenTypeFont
public boolean isPrintable()
Return true if this PDF is allowed to be printed. See EncryptionHandler.hasRight("Print").
public boolean isExtractable()
Return true if this PDF allows its text and/or images to be extracted by calling the getPageExtractor(int) method. PDFs may optionally be encrypted to prevent this; see the StandardEncryptionHandler class for more information.
Since 2.8.2 this method simply returns the value of EncryptionHandler.hasRight("Extract").
public void writeAsTIFF(OutputStream out, int dpi, ColorModel model) throws IOException
Convert the PDF to a TIFF image using the specified ColorModel and dots per inch. For example, to convert the PDF to a black and white TIFF, try:
    PDFParser parser = new PDFParser(pdf);
    FileOutputStream out = new FileOutputStream("out.tif");
    parser.writeAsTIFF(out, 72, PDFParser.BLACKANDWHITE);
    out.close();
The ColorModel determines what type of TIFF is created and what sort of compression is used. For instance, passing in a 1-bit black & white model will result in a black & white TIFF compressed with CCITT Group 4 compression. If the specified model returns Transparency.TRANSLUCENT from ColorModel.getTransparency() then the TIFF will be written with alpha values and created with a transparent background; otherwise the TIFF will have a white background set and will be written without alpha values. Note that specifying a model that doesn't match the model of the PDF causes color conversions to be applied, which can be quite a slow process.
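The transparency test described above can be tried out with plain JDK color models. This is only a sketch: the models built here are stand-ins and not the PDFParser constants, and the class and method names are hypothetical.

```java
import java.awt.Transparency;
import java.awt.image.ColorModel;
import java.awt.image.DirectColorModel;
import java.awt.image.IndexColorModel;

final class TransparencySketch {

    /** True if the model reports TRANSLUCENT, the case that yields alpha values and a transparent background. */
    static boolean isTranslucent(ColorModel model) {
        return model.getTransparency() == Transparency.TRANSLUCENT;
    }

    public static void main(String[] args) {
        // Stand-in models built from the JDK; they are not the PDFParser constants.
        ColorModel argb = ColorModel.getRGBdefault();                             // 8-bit ARGB with alpha
        ColorModel rgb = new DirectColorModel(24, 0xFF0000, 0x00FF00, 0x0000FF);  // no alpha mask
        byte[] levels = {0, (byte) 0xFF};
        ColorModel oneBit = new IndexColorModel(1, 2, levels, levels, levels);    // black and white

        System.out.println("ARGB translucent:  " + isTranslucent(argb));
        System.out.println("RGB translucent:   " + isTranslucent(rgb));
        System.out.println("1-bit translucent: " + isTranslucent(oneBit));
    }
}
```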
You can create TIFF images that contain fewer than all the pages of the PDF by manipulating the PDF's page list before saving. Say, for example, you want to create 10 single-page TIFF images from your 10-page PDF document. Here's how, assuming parser is the PDFParser for pdf and out is an array of OutputStreams:
    List copy = new ArrayList(pdf.getPages());
    for (int i = 0; i < copy.size(); i++) {
        pdf.getPages().clear();
        pdf.getPages().add(copy.get(i));
        parser.writeAsTIFF(out[i], dpi, model);   // out[i] receives page i only
    }
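Note that the loop above leaves only the last page in the list when it finishes. The following sketch shows the same copy, clear and add pattern with the original list restored afterwards. It uses a plain java.util.List of strings as a stand-in for pdf.getPages(), and the helper name forEachSinglePage is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

final class SinglePageSketch {

    /**
     * Calls write once per page with the live list narrowed to that page,
     * then restores the original contents, even if a write fails.
     */
    static <P> void forEachSinglePage(List<P> livePages, Consumer<List<P>> write) {
        List<P> copy = new ArrayList<>(livePages);
        try {
            for (P page : copy) {
                livePages.clear();
                livePages.add(page);
                write.accept(livePages);
            }
        } finally {
            livePages.clear();
            livePages.addAll(copy);
        }
    }

    public static void main(String[] args) {
        // Strings stand in for PDFPage objects; the list stands in for pdf.getPages().
        List<String> pages = new ArrayList<>(Arrays.asList("page 1", "page 2", "page 3"));
        forEachSinglePage(pages, current -> System.out.println("writing " + current));
        System.out.println("restored: " + pages);
    }
}
```

Whether removing and re-adding pages has other side effects on a real PDF object is not covered here and should be checked against the PDF class documentation.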
Parallel Operation Note: Since 2.10, this method can optionally run multiple threads in parallel to speed up writing. To enable this, set the Threads.TIFF property (typically by setting the org.faceless.pdf2.Threads.TIFF System property) to the number of threads you want to use. Note that each thread may require a significant amount of memory. How much depends on the content of each page, so it's very difficult to determine in advance. Carefully tune this value yourself, based on the amount of memory in your system and the type of documents you're working with, in order to avoid an OutOfMemoryError.
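One possible way to pick a starting value for the property is sketched below: cap the thread count by both the CPU count and a per-thread memory budget. The 256 MB budget is purely a hypothetical figure, the helper chooseThreads is not part of the library, and it is an assumption that the property needs to be set before writeAsTIFF() is called.

```java
final class TiffThreadsSketch {

    /** System property named in the text above. */
    static final String PROPERTY = "org.faceless.pdf2.Threads.TIFF";

    /** Picks a thread count limited by both CPU count and a positive per-thread memory budget (never below 1). */
    static int chooseThreads(int cpus, long maxHeapBytes, long perThreadBytes) {
        long byMemory = maxHeapBytes / perThreadBytes;
        return (int) Math.max(1, Math.min(cpus, byMemory));
    }

    public static void main(String[] args) {
        // Hypothetical budget: assume each rendering thread may need about 256 MB.
        long perThread = 256L * 1024 * 1024;
        Runtime rt = Runtime.getRuntime();
        int threads = chooseThreads(rt.availableProcessors(), rt.maxMemory(), perThread);
        System.setProperty(PROPERTY, Integer.toString(threads));
        System.out.println(PROPERTY + "=" + System.getProperty(PROPERTY));
    }
}
```

The same value could presumably be supplied on the command line with -Dorg.faceless.pdf2.Threads.TIFF=N instead of calling System.setProperty.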
Parameters: out - the OutputStream to write the TIFF to. The stream will be left open on completion
dpi - how many dots per inch to render the page at. A value of 72 gives 1 point per pixel. As a special case for those creating Class F TIFF images, a DPI of -1 gives a 204x196 DPI image and -2 gives 204x96 DPI (both added in 2.6.9).
model - the ColorModel to use to render the images.
Throws: IOException - if an exception is encountered when writing the TIFF
See Also: BLACKANDWHITE, RGB, CMYK, getBlackAndWhiteColorModel(int)
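To get a feel for the dpi parameter, the arithmetic behind "72 gives 1 point per pixel" can be sketched as follows. The rounding rule is an assumption (the library may round differently), the helper name is hypothetical, and the negative Class F values are deliberately not handled.

```java
final class DpiSketch {

    /** Converts a length in PDF points (1/72 inch) to pixels at the given positive DPI, rounding to nearest. */
    static int pointsToPixels(double points, int dpi) {
        if (dpi <= 0) {
            throw new IllegalArgumentException("this sketch handles positive DPI values only");
        }
        return (int) Math.round(points * dpi / 72.0);
    }

    public static void main(String[] args) {
        // An A4 page is approximately 595 x 842 points.
        for (int dpi : new int[] {72, 200, 300}) {
            System.out.println(dpi + " dpi: "
                    + pointsToPixels(595, dpi) + " x " + pointsToPixels(842, dpi) + " px");
        }
    }
}
```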
public void writeAsTIFF(OutputStream out, int dpi, ColorModel model, RenderingHints hints) throws IOException
As for writeAsTIFF(OutputStream,int,ColorModel) but allows the user to set RenderingHints to control the rendering process.
Parameters: out - the OutputStream to write the TIFF to. The stream will be left open on completion
dpi - how many dots per inch to render the page at. A value of 72 gives 1 point per pixel. As a special case for those creating Class F TIFF images, a DPI of -1 gives a 204x196 DPI image and -2 gives 204x96 DPI (both added in 2.6.9).
model - the ColorModel to use to render the images.
hints - the RenderingHints to be used when rendering the image, or null to use the defaults.
Throws: IOException - if an exception is encountered when writing the TIFF
See Also: BLACKANDWHITE, RGB, CMYK, getBlackAndWhiteColorModel(int)
public float getWriteAsTIFFProgress()
Get the progress of the writeAsTIFF() method running in a different thread. The returned value will start at 0 and move towards 1 as the write progresses.
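A progress value like this is typically polled from one thread while the write runs on another. The sketch below shows that pattern with a fake write and a fake progress supplier so it runs with the JDK alone. The class and method names are hypothetical, and it is an assumption that a real caller would wrap writeAsTIFF() in the Runnable and supply parser::getWriteAsTIFFProgress as the progress source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class ProgressPollSketch {

    /** Runs write on a worker thread and samples progress from the calling thread until it finishes. */
    static List<Float> runAndPoll(Runnable write, Supplier<Float> progress, long pollMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        List<Float> samples = new ArrayList<>();
        try {
            Future<?> job = worker.submit(write);
            while (!job.isDone()) {
                samples.add(progress.get());
                Thread.sleep(pollMillis);
            }
            job.get();                    // rethrows any failure from the write
            samples.add(progress.get());  // final reading after completion
            return samples;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the write", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("write failed", e.getCause());
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a slow write: ten steps of 20 ms each.
        AtomicInteger done = new AtomicInteger();
        Runnable fakeWrite = () -> {
            for (int i = 0; i < 10; i++) {
                try {
                    Thread.sleep(20);
                } catch (InterruptedException e) {
                    return;
                }
                done.incrementAndGet();
            }
        };
        List<Float> samples = runAndPoll(fakeWrite, () -> done.get() / 10f, 50);
        System.out.println("progress samples: " + samples);
    }
}
```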
public void setOutputProfile(OutputProfile profile)
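Set the OutputProfile which should be updated for any extraction or rendering performed with this PDFParser. Compare PDF.getFullOutputProfile(); the profile set here can be used to determine which features apply to particular pages.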
public int getNumberOfPages()
Return the number of pages in the document being parsed. Needed for the Pageable interface, this method just calls PDF.getNumberOfPages().
Specified by: getNumberOfPages in interface Pageable
public PageFormat getPageFormat(int pagenumber)
Returns the PageFormat for the specified page.
Specified by: getPageFormat in interface Pageable
Parameters: pagenumber - the page to select, from 0 to PDF.getNumberOfPages()-1
Returns: the PageFormat for the page at index pagenumber
public Printable getPrintable(int pagenumber)
Returns the Printable interface for a page. Needed for the Pageable interface, this method just calls getPagePainter(int).
Specified by: getPrintable in interface Pageable
Parameters: pagenumber - the page to select, from 0 to PDF.getNumberOfPages()-1
Returns: the Printable object for the specified page
public org.apache.lucene.document.Document getLuceneDocument(boolean createall, boolean createbody, boolean createpages)
Create a Document object for indexing the PDF with the Apache Lucene full-text indexing library. The Document is created with Field objects representing the content of the PDF, the info dictionary, the form and any annotations that may be there. The fields are called:
Field | Contents |
---|---|
body | The contents of all the pages in the PDF |
page.n | The contents of page n of the PDF |
info.field | The contents of the field field of the Info dictionary - eg. info.Title |
info | The contents of the whole Info dictionary as one item |
form.field | The contents of the field field of the Form |
form | The contents of the whole Form as one item |
annotations | The contents of all the annotations in the document as one item |
all | All the fields above concatenated into one big field - useful for searching the entire textual content of the PDF in one go |
Because creating indices for all, body and page.n is usually redundant (typically you will want only one of them), they can be turned on or off individually by setting the appropriate parameter to true or false.
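The effect of the three flags on the field names can be pictured with the sketch below. A Map stands in for the Lucene Document, only the body, page.n and all entries are modelled, and several details are assumptions: that n counts from 0, that pages are joined with newlines, and the helper name buildFields itself.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class IndexFieldsSketch {

    /** Builds the body, page.n and all entries; a Map stands in for the Lucene Document. */
    static Map<String, String> buildFields(List<String> pageTexts, boolean createAll,
                                           boolean createBody, boolean createPages) {
        Map<String, String> fields = new LinkedHashMap<>();
        String body = String.join("\n", pageTexts);
        if (createBody) {
            fields.put("body", body);
        }
        if (createPages) {
            for (int n = 0; n < pageTexts.size(); n++) {
                fields.put("page." + n, pageTexts.get(n));   // assumption: n counts from 0
            }
        }
        if (createAll) {
            // The real "all" field would also include info, form and annotations.
            fields.put("all", body);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("first page text", "second page text");
        // Index only the whole-document body, skipping the redundant entries.
        System.out.println(buildFields(pages, false, true, false).keySet());
        // Index per-page entries instead.
        System.out.println(buildFields(pages, false, false, true).keySet());
    }
}
```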
Parameters: createall - whether to create an all entry in the index
createbody - whether to create a body entry in the index
createpages - whether to create the page.n entries in the index
Returns: a Document suitable for indexing with Lucene.
public static ColorModel getBlackAndWhiteColorModel(int threshold)
Return a Black and White ColorModel that ensures that any colours below the specified threshold are converted to black. This method can be used to convert images that have shades of gray to black and white TIFF images. Because it renders the PDF to RGB before manually converting it to black and white, it avoids some of the platform-dependent behaviour that arises from using BLACKANDWHITE, and will probably run faster on many operating systems.
Parameters: threshold - a number between 0 and 255, typically around 128 or so. Higher values result in more black.
Since 2.11.17 the value "0" can be used to automatically determine the threshold value using Otsu's algorithm. This may be appropriate for poor quality images.
Note this ColorModel should only be used in the writeAsTIFF(java.io.OutputStream, int, java.awt.image.ColorModel) method; passing it into one of the PagePainter.getImage methods will not work.
See Also: BLACKANDWHITE, writeAsTIFF(java.io.OutputStream, int, java.awt.image.ColorModel)
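For readers unfamiliar with Otsu's algorithm, a textbook version over a 256-bin grayscale histogram is sketched below, together with the "below the threshold becomes black" rule. This illustrates the named technique only and is not the library's implementation; the class and method names are hypothetical.

```java
final class OtsuSketch {

    /**
     * Returns the gray level t (0-255) that maximises the between-class variance
     * of a 256-bin histogram; levels up to and including t form the darker class.
     */
    static int otsuThreshold(int[] histogram) {
        long total = 0;
        long sumAll = 0;
        for (int i = 0; i < 256; i++) {
            total += histogram[i];
            sumAll += (long) i * histogram[i];
        }
        long weightDark = 0;
        long sumDark = 0;
        double best = -1;
        int bestLevel = 0;
        for (int t = 0; t < 256; t++) {
            weightDark += histogram[t];
            if (weightDark == 0) {
                continue;                 // no pixels in the darker class yet
            }
            long weightLight = total - weightDark;
            if (weightLight == 0) {
                break;                    // every pixel is already in the darker class
            }
            sumDark += (long) t * histogram[t];
            double meanDark = (double) sumDark / weightDark;
            double meanLight = (double) (sumAll - sumDark) / weightLight;
            double between = (double) weightDark * weightLight
                    * (meanDark - meanLight) * (meanDark - meanLight);
            if (between > best) {
                best = between;
                bestLevel = t;
            }
        }
        return bestLevel;
    }

    /** Applies the rule from the text: values below the threshold become black. */
    static boolean isBlack(int gray, int threshold) {
        return gray < threshold;
    }

    public static void main(String[] args) {
        // Toy histogram: dark text around level 10, light paper around level 200.
        int[] histogram = new int[256];
        histogram[10] = 100;
        histogram[200] = 100;
        int threshold = otsuThreshold(histogram) + 1;   // +1 so the darker class falls below it
        System.out.println("threshold = " + threshold);
        System.out.println("level 10 black?  " + isBlack(10, threshold));
        System.out.println("level 200 black? " + isBlack(200, threshold));
    }
}
```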