PDF Searching
Recent activity:
Updated with this link to the Adobe XMP Developer Center. A free XMP Toolkit might come in handy.
I’ve restored the default HTML template for the search page until I figure out the more advanced template option.
In some regards, I think this tool is done until I figure out what the next step is. In other words, what will I want to search for?
Check out the current PDF search page here. (Hint: search for “pdf” since content is limited)
I’m going to try installing Swish-e by following the instructions in “How to Index Anything”, an article written by Josh Rabinowitz for the Linux Journal (July 2003).
- Install swish-e. I’ll have to do the install on the Pro/E FAQ server without root access. (./configure --prefix=$HOME)
Note: I needed to install xpdf to support the pdf2xml module. I didn’t configure FreeType, but that’s not a problem for compiling pdftotext.
- Create the Swish-e config file to index the PDF directory. From the “How to Index Anything” article:
- Swish-e can use an external Perl module and the xpdf package to convert PDFs to XML. This is listed in the article.
PROBLEM: I managed to create an index and run a command-line search, but it seems that xpdf supports PDF only up to v1.5 (the Acrobat 6.0 format, I assume)! So I need to find out what version (from the header info) the WF3 PDFs will be, or find a solution that keeps current with Adobe. What is Adobe PDF IFilter?
GOOD NEWS: WF3 PDFs are v1.5, so indexing with Swish-e is still going to work. (BTW, Ghostscript 8.00 creates v1.2 PDFs.)
- Next, I need a CGI script to search my indexed files. Conveniently, Swish-e installs a script called swish.cgi, which you can customize.
Note: The first thing I needed to do was change the install location (./configure --prefix=$HOME, remember?).
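Since the version question above comes down to the header info, here is a minimal sketch of how the PDF version could be checked from the shell. It assumes the file starts with the usual "%PDF-M.m" header line and that head supports the -c option. The function name pdf_version is my own invention for illustration, not part of Swish-e or xpdf.

```shell
# pdf_version FILE
# Print the PDF version declared in FILE's header (for example "1.5").
# Returns non-zero if the file does not start with "%PDF-M.m".
# Hypothetical helper, not part of Swish-e or xpdf.
pdf_version() {
    # The first 8 bytes cover "%PDF-" plus a one-digit major and minor version.
    header=$(head -c 8 "$1") || return 1
    case "$header" in
        %PDF-[0-9].[0-9])
            # Strip the "%PDF-" prefix, leaving just the version number.
            printf '%s\n' "${header#"%PDF-"}"
            ;;
        *)
            return 1
            ;;
    esac
}
```

Running something like pdf_version drawing.pdf (a made-up file name) prints just the version number, so it is easy to loop over a whole PDF directory and spot anything newer than what xpdf can read before building the index.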
I need to start slowly, so here’s my first PDF search page, using a very simple Swish-e configuration.