Do you need to extract text from Word, Excel, PowerPoint, and Adobe PDF files? We recently added full-text searching of uploaded files to ProjectSpaces. Thanks to the work of other hackers, we were able to use command line utilities to extract meaningful text from proprietary and/or binary files. Once these were in place, we could index the majority of files uploaded by our users and present matching files for search queries.
catdoc for Word and Excel files
The catdoc package is available for Debian, Ubuntu, and Red Hat distributions. On a Debian or Ubuntu server you can install it with:
sudo apt-get install catdoc
To install it on our RHEL4 server, I had to find and download an RPM file for catdoc and install tk first:
sudo up2date tk
sudo rpm -ivh catdoc-0.94.2-3.el4.i386.rpm
Once installed, you will have two utilities: catdoc for extracting text from Microsoft Word files and xls2csv for Microsoft Excel files. Using catdoc is as simple as:
catdoc -w ~/MeetingNotes.doc
If successful, the contents of the Word file will be written to STDOUT. The -w flag suppresses word wrapping; without it, catdoc wraps lines at 72 characters.
Similarly, you can get the content of an Excel spreadsheet with:
xls2csv ~/Sales.xls
Executing the command above pipes the contents of your spreadsheet to STDOUT with comma-separated values.
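Once that output has been captured in PHP, each line can be split into cells with str_getcsv. This is only a minimal sketch: parseXlsCsv is a hypothetical helper name, and it assumes a single worksheet with no line breaks inside cells.

```php
<?php
/**
 * Splits CSV text (such as xls2csv output) into an array of rows.
 * parseXlsCsv is a hypothetical helper name used for illustration.
 * Assumes a single worksheet and no line breaks inside cells.
 */
function parseXlsCsv($csv_text)
{
    $rows = array();
    foreach (preg_split('/\r\n|\r|\n/', $csv_text) as $line) {
        if (trim($line) === '') {
            continue; // skip blank lines
        }
        // Arguments passed explicitly: separator, enclosure, escape
        $rows[] = str_getcsv($line, ',', '"', '\\');
    }
    return $rows;
}
```

Each row comes back as a plain array of strings, so numeric cells still need casting before you do arithmetic on them.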
xpdf for PDF files
The xpdf package provides a command line utility named pdftotext, which can parse PDF files up to version 1.5. The package is straightforward to install on Debian or Red Hat servers, with either:
sudo apt-get install xpdf
or:
sudo up2date xpdf
The utility pdftotext is a little trickier to use, because by default it wants to save the extracted text to another file. By specifying - as the target file, output is sent to STDOUT instead.
pdftotext ~/DeveloperResume.pdf -
xlhtml for PowerPoint
Although catdoc also bundles a utility named catppt, I did not have any success getting meaningful output from PowerPoint files with it. Instead, I settled on ppthtml. It may be a little harder to find a package of this utility for your distribution. On Ubuntu, it can be installed with:
sudo apt-get install ppthtml
For Red Hat systems, look for the xlhtml package. I was able to find xlhtml-0.5-2.el4.sme.i386.rpm for our server and installed it with:
sudo rpm -ivh xlhtml-0.5-2.el4.sme.i386.rpm
Usage of ppthtml is straightforward:
ppthtml ~/ConferenceSlides.ppt
The text of the PowerPoint file will be sent to STDOUT, formatted as HTML. Since only text is extracted, images and any text contained inside images will be ignored.
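Because the output is HTML, you will probably want to reduce it to plain text before indexing it. The following is a rough sketch using PHP's built-in strip_tags and html_entity_decode. The name htmlToPlainText is hypothetical, and the UTF-8 charset is an assumption, so check what your copy of ppthtml actually emits.

```php
<?php
/**
 * Reduces HTML (such as ppthtml output) to plain text.
 * htmlToPlainText is a hypothetical helper name used for illustration.
 * The UTF-8 charset is an assumption about the input.
 */
function htmlToPlainText($html)
{
    // Put a space before each tag so words in adjacent elements stay separate
    $spaced = str_replace('<', ' <', $html);
    $text = strip_tags($spaced);
    // Convert entities such as &amp; back into characters
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace into single spaces
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

Note that strip_tags keeps the text inside elements such as title or style, so a few stray words from the page header may end up in the result.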
Capturing content in PHP
Above, I emphasized how to send output to STDOUT. Because of that, in PHP we can use the exec function to capture the extracted text, then process it further or save it to a database table. A function for working with a Word file might look like:
/**
 * Attempts to extract text from an MS Word file
 * @param string $word_file full path to file
 * @return string extracted text, or an empty string if the file does not exist
 */
function extractWord($word_file)
{
    if (!file_exists($word_file))
    {
        return '';
    }
    // escapeshellarg() prevents malicious command execution
    exec('/usr/bin/catdoc -w ' . escapeshellarg($word_file), $output);
    // $output is an array corresponding to lines of output
    return join("\n", $output);
}
Functions for extracting from Excel, PowerPoint, and PDF files only require switching the command line tool (and, for pdftotext, appending - as the target file).
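One way to avoid writing four nearly identical functions is to pick the command from the file extension. This is only a sketch, not what runs in ProjectSpaces: the function names are hypothetical, the /usr/bin paths are assumptions that mirror the catdoc example, and the quoting shown is what escapeshellarg produces on a Unix-like server.

```php
<?php
/**
 * Builds the shell command for a file based on its extension.
 * The function name and the /usr/bin paths are assumptions; adjust
 * the paths to wherever the tools are installed on your server.
 * Returns null for unsupported file types.
 */
function buildExtractCommand($file)
{
    // printf-style templates; %s is replaced by the escaped file path
    $commands = array(
        'doc' => '/usr/bin/catdoc -w %s',
        'xls' => '/usr/bin/xls2csv %s',
        'pdf' => '/usr/bin/pdftotext %s -', // "-" sends text to STDOUT
        'ppt' => '/usr/bin/ppthtml %s',
    );
    $ext = strtolower(pathinfo($file, PATHINFO_EXTENSION));
    if (!isset($commands[$ext])) {
        return null;
    }
    // escapeshellarg() prevents malicious command execution
    return sprintf($commands[$ext], escapeshellarg($file));
}

/**
 * Runs the matching tool and returns its output as a string, or an
 * empty string if the file is missing, unsupported, or the tool fails.
 */
function extractText($file)
{
    $command = file_exists($file) ? buildExtractCommand($file) : null;
    if ($command === null) {
        return '';
    }
    exec($command, $output, $status);
    // By convention, a non-zero exit status signals failure
    return $status === 0 ? join("\n", $output) : '';
}
```

Keeping the command building separate from the exec call makes it easy to check the generated command line without having the tools installed.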
What can we do with the extracted text?
- Add it to a MySQL table for simple FULLTEXT searching.
- Use Yahoo's Term Extraction service to extract meaningful keywords.
- Index it with a more advanced full text search engine.
- Provide a text preview of file contents so that users don't fire up a client application or browser plugin.
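As an illustration of the last idea, a preview can be as simple as the first couple of hundred characters of the extracted text, cut at a word boundary. The helper below is a hypothetical sketch. It counts bytes, so for multibyte text you would want the mb_ string functions instead.

```php
<?php
/**
 * Returns roughly the first $max_length characters of $text,
 * cut at a word boundary. makePreview is a hypothetical helper name.
 * Lengths are counted in bytes, not multibyte characters.
 */
function makePreview($text, $max_length = 200)
{
    // Collapse whitespace so the preview reads as one paragraph
    $text = trim(preg_replace('/\s+/', ' ', $text));
    if (strlen($text) <= $max_length) {
        return $text;
    }
    $cut = substr($text, 0, $max_length);
    // Back up to the last space so a word is not split in half
    $last_space = strrpos($cut, ' ');
    if ($last_space !== false) {
        $cut = substr($cut, 0, $last_space);
    }
    return $cut . '...';
}
```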
I'd love to hear other ideas too!