PDF Searching

Recent activity:

Updated with this link to Adobe XMP Developer Center. A free XMP Toolkit might come in handy.

I’ve restored the default HTML template for the search page until I figure out the more advanced template option.

In some regards, I think this tool is done until I figure out what the next step is. In other words, what will I want to search for?

Check out the current PDF search page here. (Hint: search for “pdf” since content is limited)

I’m going to try to install Swish-e by following the instructions in “How to Index Anything”, an article written by Josh Rabinowitz for Linux Journal (July 2003).

  1. Install swish-e. I’ll have to do the install on the Pro/E FAQ server without root access. (./configure --prefix=$HOME)
    Note: I needed to install xpdf to support the pdf2xml module. I didn’t configure FreeType, but that’s not a problem for compiling pdftotext.
  2. Create the Swish-e config file to index the PDF directory. From the “How to Index Anything” article:
  3. Swish-e can use an external Perl module and the xpdf package to convert PDFs to XML. This is listed in the article.
    PROBLEM: I managed to create an index and run a command-line search, but it seems that xpdf only supports PDF v1.5 (I assume Acrobat 5.0), so I need to find out what version (header info) the WF3 PDFs will be, or find a solution that keeps current with Adobe. What is Adobe PDF IFilter?
    GOOD NEWS: WF3 PDFs are v1.5, so indexing with Swish-e is still going to work. (BTW, Ghostscript 8.00 creates v1.2 PDFs.)
  4. Next, I need a CGI script to search my indexed files. Conveniently, Swish-e installs a script called swish.cgi, which you can customize.
    Note: The first thing I needed to do was change the install location (./configure --prefix=$HOME, remember?)
    I need to start slowly, so here’s my first PDF search page, using a very simple Swish-e configuration.
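
For the version question in step 3, a quick way to see which PDF version a file claims is to read its header: a PDF starts with a line like %PDF-1.5. Here is a minimal sketch of that check. The function name pdf_version is my own invention for illustration, and it assumes a head that supports -c (GNU and BSD both do). It only reports what the header says, which may not be the whole story if the file was later updated incrementally.

```shell
#!/bin/sh
# pdf_version (hypothetical helper): print the version from a PDF header.
# A PDF file begins with "%PDF-" followed by the version, e.g. "%PDF-1.5".
pdf_version() {
    # Read only the first 8 bytes, which hold the "%PDF-x.y" marker.
    header=$(head -c 8 "$1")
    case "$header" in
        %PDF-*)
            # Drop the 5-character "%PDF-" prefix, leaving e.g. "1.5".
            printf '%s\n' "$header" | cut -c6-
            ;;
        *)
            echo "not a PDF (no %PDF- header): $1" >&2
            return 1
            ;;
    esac
}

# Example use (file names are placeholders):
#   for f in *.pdf; do printf '%s: ' "$f"; pdf_version "$f"; done
```

Running the loop over the PDF directory before indexing should show whether any files are newer than what xpdf can read.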
