Do you need to extract text from Word, Excel, PowerPoint, and Adobe PDF files? We recently added full text searching of uploaded files to ProjectSpaces. Thanks to the work of other hackers, we were able to use command line utilities to extract meaningful text from proprietary and/or binary files. With these in place, we can index the majority of files uploaded by our users and return matching files for search queries.
catdoc for Word and Excel files
The catdoc package is available for Debian, Ubuntu, and RedHat distros. On a Debian or Ubuntu server you can install the package with:
sudo apt-get install catdoc
To install it on our RHEL4 server, I had to find and download an RPM file for catdoc and install tk first:
sudo up2date tk
sudo rpm -ivh catdoc-0.94.2-3.el4.i386.rpm
Once installed, you will have two utilities: catdoc for extracting text from Microsoft Word files and xls2csv for Microsoft Excel files. Using catdoc is as simple as:
catdoc -w ~/MeetingNotes.doc
If successful, the contents of the Word file will be written to STDOUT. The -w flag suppresses word wrapping; without it, catdoc wraps lines at 72 characters.
Similarly, you can get the content from an Excel spreadsheet with:
xls2csv ~/Sales.xls
Executing the command above pipes the contents of your spreadsheet to STDOUT with comma-separated values.
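If you capture that output in PHP, each line can be split back into cells. The sketch below is only an illustration: csvLinesToRows is a hypothetical helper name, and it assumes every spreadsheet row arrives on a single line, which will not hold for cells that contain line breaks.

```php
<?php
/**
 * Hypothetical helper: turns lines of xls2csv output into rows of cells.
 * @param array $lines lines of CSV text, e.g. the $output array filled by exec()
 * @return array list of rows, each row a list of cell strings
 */
function csvLinesToRows(array $lines)
{
    $rows = array();
    foreach ($lines as $line) {
        if (trim($line) === '') {
            continue; // skip blank lines
        }
        // str_getcsv() handles quoted cells that contain commas
        $rows[] = str_getcsv($line, ',', '"', '\\');
    }
    return $rows;
}

// made-up sample lines standing in for real xls2csv output
$rows = csvLinesToRows(array('"Region","Total"', '"North, East","1200"'));
```

For indexing, the cells of each row can simply be joined with spaces; keeping them separate is only useful if you want to treat columns differently.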
xpdf for PDF files
The xpdf package provides a command line utility named pdftotext, which can parse PDF files up to version 1.5. This is a straightforward package to install on Debian or RedHat servers, with either:
sudo apt-get install xpdf
or:
sudo up2date xpdf
The utility pdftotext is a little trickier to use, because by default it wants to save the extracted text to another file. By specifying - as the target file, output is sent to STDOUT instead.
pdftotext ~/DeveloperResume.pdf -
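From PHP, the same call might be wrapped as in the following sketch. The function names and the /usr/bin path are my assumptions, not something the xpdf package dictates. It also passes a third argument to exec so that a failed conversion (non-zero exit status) is not mistaken for an empty document.

```php
<?php
/**
 * Builds the pdftotext command line for a PDF path.
 * The trailing "-" asks pdftotext to write to STDOUT instead of a file.
 * The /usr/bin location is an assumption; adjust it for your server.
 */
function pdfToTextCommand($pdf_file)
{
    return '/usr/bin/pdftotext ' . escapeshellarg($pdf_file) . ' -';
}

/**
 * Hypothetical helper: returns the extracted text, or null on failure.
 */
function extractPdf($pdf_file)
{
    if (!is_file($pdf_file)) {
        return null;
    }
    exec(pdfToTextCommand($pdf_file), $output, $status);
    // a non-zero exit status means the command reported an error
    return $status === 0 ? join("\n", $output) : null;
}
```

Returning null rather than an empty string lets the caller tell "conversion failed" apart from "the PDF contains no text", such as a scanned document.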
xlhtml for PowerPoint
Although catdoc also bundles a utility named catppt, I did not have any success getting meaningful output from PowerPoint files with it. Instead, I settled on ppthtml. It may be a little harder to find a package containing this utility for your distribution. On Ubuntu, it can be installed with:
sudo apt-get install ppthtml
For RedHat systems, look for the xlhtml package. I was able to find xlhtml-0.5-2.el4.sme.i386.rpm for our server and installed it with:
sudo rpm -ivh xlhtml-0.5-2.el4.sme.i386.rpm
Usage of ppthtml is straightforward:
ppthtml ~/ConferenceSlides.ppt
The text of the PowerPoint file will be sent to STDOUT, formatted as HTML. Since only text is extracted, images and any text contained inside images will be ignored.
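Because the output is HTML, you will probably want to strip the markup before indexing it. Here is a minimal sketch using PHP's built-in strip_tags and html_entity_decode. The helper name htmlToPlainText is hypothetical, and I am assuming the output can be treated as UTF-8, which you should verify for your own files.

```php
<?php
/**
 * Hypothetical helper: reduces HTML (such as ppthtml output) to plain text.
 * @param string $html
 * @return string text with tags removed and whitespace collapsed
 */
function htmlToPlainText($html)
{
    // put a space in front of every tag so neighbouring blocks do not run together
    $spaced = str_replace('<', ' <', $html);
    // drop the tags, then turn entities such as &amp; back into characters
    $text = html_entity_decode(strip_tags($spaced), ENT_QUOTES, 'UTF-8');
    // collapse runs of whitespace into single spaces
    return trim(preg_replace('/\s+/', ' ', $text));
}

// made-up markup standing in for real ppthtml output
$plain = htmlToPlainText('<html><body><h1>Q3 Results</h1><p>Sales &amp; Marketing</p></body></html>');
```

One side effect of the added spaces is that inline tags in the middle of a word would split it in two. For search indexing that trade-off seemed acceptable to me.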
Capturing content in PHP
Above, I emphasized how to send output to STDOUT. Because the output goes there, we can use PHP's exec function to capture the extracted text, then process it further or save it to a database table. A function for working with a Word file might look like:
/**
 * Attempts to extract text from a MS Word file
 * @param string $word_file full path to file
 * @return string extracted text, or an empty string if the file does not exist
 */
function extractWord($word_file)
{
    $output = array();
    if (file_exists($word_file))
    {
        // escapeshellarg() prevents malicious command execution
        exec('/usr/bin/catdoc -w ' . escapeshellarg($word_file), $output);
    }
    // $output is an array corresponding to lines of output
    return join("\n", $output);
}
Functions for extracting from Excel, PowerPoint, and PDF files would only require switching the command line tool.
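One way to avoid writing four nearly identical functions is to pick the tool from the file extension. This is a sketch only: extractionCommand and extractText are names I made up, the /usr/bin paths are assumptions about where your packages install the tools, and trusting the extension of an uploaded file is a simplification.

```php
<?php
/**
 * Hypothetical helper: chooses a command line based on the file extension.
 * @param string $file full path to file
 * @return string|null the command, or null for unsupported file types
 */
function extractionCommand($file)
{
    $arg = escapeshellarg($file); // prevent malicious command execution
    switch (strtolower(pathinfo($file, PATHINFO_EXTENSION))) {
        case 'doc': return '/usr/bin/catdoc -w ' . $arg;
        case 'xls': return '/usr/bin/xls2csv ' . $arg;
        case 'ppt': return '/usr/bin/ppthtml ' . $arg;
        case 'pdf': return '/usr/bin/pdftotext ' . $arg . ' -'; // "-" means STDOUT
        default:    return null;
    }
}

/**
 * Hypothetical helper: runs the matching tool and returns its output.
 * @return string extracted text, or an empty string if nothing could be run
 */
function extractText($file)
{
    $command = extractionCommand($file);
    if ($command === null || !is_file($file)) {
        return '';
    }
    exec($command, $output);
    return join("\n", $output);
}
```

Keeping the command building in its own function makes it easy to check what would be run without actually executing anything.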
What can we do with such tools?
- Add the extracted text to a MySQL table for simple FULLTEXT searching.
- Use Yahoo's Term Extraction service to extract meaningful keywords.
- Index the text with a more advanced full text search engine.
- Provide a text preview of file contents so that users don't fire up a client application or browser plugin.
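As an illustration of the last idea, a preview can be cut from the extracted text at a word boundary. The helper below is hypothetical, and it counts bytes rather than characters, so for multi-byte text you would want the mb_ string functions instead.

```php
<?php
/**
 * Hypothetical helper: shortens extracted text to a one-line preview.
 * @param string $text extracted text
 * @param int $max_length maximum preview length in bytes, before the ellipsis
 * @return string
 */
function textPreview($text, $max_length = 200)
{
    // flatten line breaks and repeated spaces
    $text = trim(preg_replace('/\s+/', ' ', $text));
    if (strlen($text) <= $max_length) {
        return $text;
    }
    $cut = substr($text, 0, $max_length);
    // back up to the last space so the preview does not end mid-word
    $last_space = strrpos($cut, ' ');
    if ($last_space !== false) {
        $cut = substr($cut, 0, $last_space);
    }
    return $cut . '...';
}

$preview = textPreview("Meeting notes for the quarterly review", 20);
```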
I'd love to hear other ideas too!