Encouraging Wiki Adoption

Tuesday, May 13. 2008

GrowYourWiki has a pair of posts on the Pitfalls and Keys to Success for Wiki Adoption within your organization. Together they're a concise summary of best practices for encouraging participation when you deploy a wiki.

Internally, we've faced similar issues. It's eerie how well his advice that "Meetings are an especially good place to start" describes how the Tech Team, at least, makes the most use of our Intranet Wiki. He also makes the point that you should plan for success, not failure. Too often, we worry that a few bad actors may post inappropriate content or misuse the tool in some way. The usual reaction to such a risk is either to not deploy the tool at all, which can be a huge missed opportunity, or to overburden it with controls, reviews, and approval processes so that no one is encouraged to use it.

Managing to the possibility of failure, not success – If you are more focused on how the wiki will fail, instead of how it will succeed, you have already written your destiny.

HT: Sage advice on wiki adoption: keys to success

Posted by Oscar Merida in Social Software at 11:24 | Comments (0) | Trackbacks (0)

Should you rely on CAPTCHAs to stop malicious behavior?

Monday, May 12. 2008

If you've signed up for a website in the last year or two, you're likely familiar with CAPTCHAs, those distorted images asking you to figure out some gibberish string of numbers and letters. A CAPTCHA is intended to stop abuse of a system by automated software by offering a task that only people can solve. We've often been asked by clients to "put a CAPTCHA" wherever we can anticipate abuse, but I've always pushed back because the effectiveness of CAPTCHAs has degraded over time. Here, we'll take a look at the problems with CAPTCHAs and suggest some alternatives to their use.

CAPTCHAs hurt usability and accessibility.

A visual CAPTCHA will not be usable by visitors who use screen readers or who have a vision impairment such as color blindness. An accompanying audio CAPTCHA is recommended, but now you've doubled the opportunities for nefarious users to attack your web site. Even if you have good vision, you've probably encountered visual CAPTCHAs that are difficult to use, since making them hard to read is the only way to make them effective. By making them hard to read, you've made your web page much harder to use. I've run into CAPTCHAs that take me a number of tries to get right because it's hard to tell the ones from the Ls or the zeros from the letter O.

CAPTCHAs have already been broken

CAPTCHAs have already been cracked through various methods. Automated programs exist to break common CAPTCHAs, and you can actually buy such software. Jeff Atwood asked last November Has CAPTCHA Been "Broken"?, and argued that CAPTCHAs were still effective since those used by Google, Hotmail, and Yahoo were considered unbreakable. For now let's ignore the fact that you need the resources of Google, Hotmail, or Yahoo to make "unbreakable" CAPTCHAs. Recent reports suggest that even their systems have been broken - Software Attacks Software in Security Wars.

Image CAPTCHAs for Google, Windows Live, and Yahoo! have been broken in recent months, and is believed to account for the increasing levels of spam that are coming from webmail services that those companies provide.

Security Labs even managed to dissect exactly how spammers have automated setting up Microsoft Hotmail accounts: Microsoft Live Hotmail Under Attack by Streamlined Anti-CAPTCHA and Mass-mailing Operations.

It is observed that unlike Live Mail Anti-CAPTCHA and Gmail Anti-CAPTCHA operations in the past, the current attack is aggressive and instantaneous in terms of CAPTCHA breaking host turn-around time.

Automated solutions are not required though, as CAPTCHAs can be solved by relaying the image to unsuspecting users through a relay attack. Just last year, a striptease program was used to bypass Yahoo's CAPTCHAs.

Trend Micro has identified the program as TROJ_CAPTCHAR.A, a striptease game wherein the player enters the letters hiding within a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) image. For each correct entry, more clothes come off in photos of a scantily clad woman identified as "Melissa."

You can't rely on CAPTCHAs

If this technique is relatively useless, and at best just slows down malicious users instead of stopping them altogether, what alternatives do we have? I'll consider two scenarios where CAPTCHAs are commonly used - deterring spam on blogs and message boards, and preventing automated registration for user accounts.

Alternatives for limiting spam messages

If you have a blog or run a message board, spammers are a nuisance that drown out legit conversations with noise. One of the best solutions I've used for limiting spam is Akismet, a distributed and collaborative effort to identify spam messages. It's a web service that you must sign up for - although free for personal use, you'll need a subscription for non-personal uses. Basically, when a visitor leaves a message, your CMS or blog first sends the message, along with some information about who posted it, to Akismet, which returns a simple yes or no result about the message's spammy-ness. At that point, you can either reject the message altogether or hold it for further review and approval. Akismet integrates easily with WordPress, and there are libraries and plug-ins for many other platforms. If one doesn't exist, the Akismet API is open and documented so you can write your own.
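
To make that flow concrete, here is a minimal PHP sketch of a comment-check call. It reflects my reading of Akismet's documented REST API (a POST to a key-specific host that answers "true" for spam and "false" otherwise), so verify the endpoint and field names against the current documentation. The function names and the "unknown" fallback are my own, for illustration only.

```php
<?php
// Sketch only: helper names are hypothetical, and the endpoint and field
// names should be checked against Akismet's current API documentation.

/**
 * Builds the form fields sent along with the comment.
 */
function akismetFields($site_url, $ip, $user_agent, $author, $email, $content)
{
    return array(
        'blog'                 => $site_url,
        'user_ip'              => $ip,
        'user_agent'           => $user_agent,
        'comment_type'         => 'comment',
        'comment_author'       => $author,
        'comment_author_email' => $email,
        'comment_content'      => $content,
    );
}

/**
 * Interprets the response body: "true" means spam, "false" means not spam.
 * Anything else (errors, timeouts) is reported as "unknown".
 */
function akismetVerdict($body)
{
    $body = trim((string) $body);
    if ($body === 'true') {
        return 'spam';
    }
    if ($body === 'false') {
        return 'ham';
    }
    return 'unknown';
}

/**
 * Sends the comment to Akismet and returns 'spam', 'ham', or 'unknown'.
 */
function akismetCheck($api_key, array $fields)
{
    $url = 'https://' . $api_key . '.rest.akismet.com/1.1/comment-check';
    $context = stream_context_create(array('http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => http_build_query($fields),
        'timeout' => 5,
    )));
    $body = @file_get_contents($url, false, $context);
    return $body === false ? 'unknown' : akismetVerdict($body);
}
```

A reasonable policy is to reject or hold 'spam', publish 'ham', and hold 'unknown' for review so that an outage at the service never silently lets everything through.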

If you are using PHP, and don't want to integrate with the Akismet web service, or simply want another line of defense, there is Bad Behavior. It uses a number of tests to try to screen out spam bots from your site before they can do any damage.

Bad Behavior runs before your software on each request to your Web site, so if a spam bot does visit, it will receive nothing, and your software never runs. This reduces the amount of server CPU time, database activity and bandwidth spent on processing robots which are just harvesting your site and delivering junk.

A third method for fighting comment spam is to require unknown users to confirm their message via email. That is, ask for an email address along with a comment - this is fairly standard already - and send unregistered users an email with a link to confirm their message. For regular visitors, you can ask them to create an account or, better still, use OpenID to confirm their identity, and allow them to skip the email confirmation step altogether. As an added precaution, you may want to review postings from new users until they reach a milestone like "5 non-spam messages".
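
One way to build the confirmation link, sketched in PHP, is to sign the comment ID and email address with a secret only your server knows, so there is no token table to maintain. The secret, URL, parameter names, and function names here are all hypothetical.

```php
<?php
// Sketch only: signed confirmation links for comments left by
// unregistered users. All names and the URL layout are assumptions.

/**
 * Derives a token from the comment ID and email using a server-side secret.
 */
function confirmationToken($comment_id, $email, $secret)
{
    return hash_hmac('sha256', $comment_id . '|' . strtolower($email), $secret);
}

/**
 * Builds the link to include in the confirmation email.
 */
function confirmationLink($base_url, $comment_id, $email, $secret)
{
    return $base_url . '?' . http_build_query(array(
        'comment' => $comment_id,
        'token'   => confirmationToken($comment_id, $email, $secret),
    ));
}

/**
 * Checks the token from a clicked link; hash_equals() compares the
 * strings in constant time.
 */
function isValidConfirmation($comment_id, $email, $token, $secret)
{
    $expected = confirmationToken($comment_id, $email, $secret);
    return hash_equals($expected, (string) $token);
}
```

When the visitor clicks the link, look up the held comment by its ID, call isValidConfirmation() with the stored email address, and publish the comment only if it returns true.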

Alternatives for protecting user registrations from bots.

Using CAPTCHAs as part of the registration process is meant to separate people from bots. Digg even asks the question Are you human? Technological alternatives here are a little less obvious. You could require users to activate their account via email, which at least makes it more time consuming for potentially malicious users to register. Depending on the sensitivity of the application, you can require even more difficult activation procedures. I know one credit card company system requires providing a phone number to call you with an activation code. Another alternative is to require invitations to join a system coupled with a way to audit invitations in case someone invites a bad apple. An overall approach that should work is to give users gradually escalating privileges as they demonstrate good behavior.

I'm not sure that a single technical cure exists for preventing unwanted user registrations. For now, I think sites will need to rely on an approval process of some kind for new registrations and a method for other site users to report people who abuse the system.

Posted by Oscar Merida in Tools at 12:00 | Comments (0) | Trackbacks (0)

Terms of Service - not an April Fool's Joke

Tuesday, April 1. 2008

At least, I hope it's not an April Fool's joke. It looks more like someone's lawyers go up to 11. From the SlideShare API Terms of Service:

(iii) use the SlideShare APIs to operate nuclear facilities, life support, or other mission critical application where human life or property may be at stake. You understand that the SlideShare APIs are not designed for such purposes and that their failure in such cases could lead to death, personal injury, or severe property or environmental damage for which SlideShare is not responsible;
Posted by Oscar Merida in APIs at 17:05 | Comment (1) | Trackbacks (0)

Another set of tips for writing good code.

Friday, March 21. 2008

Jeff Vogel provides tips for writing more comprehensible code in a very humorous article. They don't just apply to solo coders; people who inherit or update your code down the line will also thank you. Even though they're written for C programmers, his advice applies to any programming language. Web programmers should heed #5, as we've learned that the code we run on the server typically accounts for only a fraction of how long users must wait to see a web page. You get more bang for the buck by reducing how much data browsers have to download (file sizes) and how many connections they must make (hits to download CSS, images, JavaScript files, etc.).

This means that, if I make a stinky mess, I'm doing it in my own nest. When I'm chasing down a bug at 3 a.m., staring at a nightmare cloud of spaghetti code, and I say, "Dear God, what idiot child of married cousins wrote this garbage?", the answer to that question is "Me."

I've previously written my own set of tips for writing readable PHP code.

Posted by Oscar Merida in Programming at 13:29 | Comments (0) | Trackbacks (0)

Users will love you if you tame Scope/Feature creep

Friday, March 21. 2008

There is an excellent case study in the NY Times on how Pure Digital's ultra-simple camcorder, the Flip, has grabbed 13% of the camcorder market and become the best-selling camcorder on Amazon.com. It did this not by offering every feature and tech acronym buzzword under the sun, but by being trivially easy to use. David Pogue says,

Instead, the Flip has been reduced to the purest essence of video capture. You turn it on, and it's ready to start filming in two seconds. You press the red button once to record (press hard -- it's a little balky) and once to stop. You press Play to review the video, and the Trash button to delete a clip.

There it is: the entire user's manual.

"But Oscar, we don't make camcorders or gadgets. What's your point?" The success of the Flip lies in getting out of the way of users who want to record a video clip. Users don't need to adjust a million settings, and don't have the option of adding special effects, zooming in and out, and so on. They just point and record. Next time you're building out a new web site, a section of your web site, or any point where you're asking users to interact with you - fill out a form, post a comment, make a donation - try to eliminate as much cruft as possible and reduce the interaction to its barest essence. Ideally, and I'm probably a bit of an idealist for writing this, this would mean an email newsletter subscription form that, shocker, has only one field for the user's email address, and no more. Or a donation form that has simply the amount to donate and the fields required to process payment. The fewer hoops you make users jump through, the more users may jump through your hoops.

Posted by Oscar Merida in Social Software at 11:59 | Comments (0) | Trackbacks (0)

Compare Site Traffic with Google Analytics

Friday, March 21. 2008

Google Analytics released a new feature two weeks ago that allows sites to anonymously share data with comparable sites. Benchmarking site traffic is something we've suggested to clients in the past, and in a few cases we have facilitated it on a small scale. But now, you can see how your blog, online community, association, or non-profit's website compares to peers who also use GA. I'd stress that you're comparing yourself only to peers who use the service and choose to share their data, so there is some selection bias, and it's up to Google to decide who you are compared with. Still, it makes available data that would otherwise be expensive or impossible to get, and as more sites opt to share their traffic, it can get better. You can learn how to enable this and what you can compare over at googlesystem.

The system is very beta at the moment, and the "Verticals" used to categorize sites are still few. It can still help answer questions like:

  • What are overall traffic patterns in my sector?
  • Is my traffic growth/decline part of an overall traffic growth/decline or unique to my site?
  • What does an "average" site like mine get in Page Visits/Visitors/Bounce Rate/New Visitors, and how do I compare?

Technorati Tags: Google Analytics, nptech

Posted by Oscar Merida in Tools at 11:29 | Comments (0) | Trackbacks (0)

Using PHP to check SVN commit permissions

Monday, February 18. 2008

Once you and your colleagues have been using Subversion on a project for a while, you're bound to find useful ways to exploit Subversion's hook system. We use the sample commit-email.pl script to send all commits to an email list for ad-hoc peer review. I found and enabled Ian Christian's pre-commit script to check PHP syntax when checking in code.

The latest piece of the puzzle was to restrict commits to a project's branch to a few users, which has been harder to figure out than I expected. The most common script for access control is svnperms, which has a rich syntax for configuring access. Unfortunately, svnperms seems to work best with a repository laid out as follows:

  • trunk
    • Project1
    • Project2
  • branches
    • Project1
    • Project2
  • tags
    • Project1
    • Project2

Our repository is laid out as:

  • Project1
    • trunk
    • branches
    • tags
  • Project2
    • trunk
    • branches
    • tags

I was trying to restrict access to Project1's stable branch (Project1/branches/stable), and this didn't seem to be possible under svnperms, no matter how many regular expressions I tried. Subversion provides another access control script, commit-access-control.pl, but having been burned by svnperms, I was reluctant to spend too much time trying to configure it and get it to work.

Since hooks are just shell scripts, it's easy to write your own, which is what I did in this case. The place to check commit access is before the transaction is created, in the start-commit hook. Being more comfortable in PHP, I whipped up the following command line script and saved it as check_commit_privs.php:

#!/usr/bin/php
<?php
/*
 CHECKS IF A USER CAN COMMIT TO THE REPOSITORY
  Oscar Merida  <omerida@forumone.com>
 */

// SVN passes two arguments, the repository path and user for the commit
$repo_path = $_SERVER['argv'][1];
$commit_user = $_SERVER['argv'][2];

// You can use arrays to define user groups
$qa_group = array('bob', 'roger', 'amanda');
$contractors = array('marco', 'dawn', 'bill');

// CONFIGURATION
//
// array key is a regular expression that is matched against the path.
// value is an array of usernames that can commit to that path
// first path match that limits access will prevent commits.
// This script assumes you only need to lock down certain
// parts of your repository.
$allowed = array(
    // only contractors can commit to widgets project
    '/widgets/' => $contractors,
    // only qa_group can commit to any project's testing branch
    '/.*\/branch\/testing/' => $qa_group,
    // only bill can commit to his project
    '/bills_project/' => array('bill')
);

foreach ($allowed as $regexp => $group)
{
    if (preg_match($regexp, $repo_path)
        && !in_array($commit_user, $group))
    {
        // SVN relays anything written to STDERR back to the client
        fwrite(STDERR, "$commit_user is not allowed to commit to $repo_path\n");
        exit(1);
    }
}

// no rule denied the commit
exit(0);

To enable this script, create or add a file named 'start-commit' to your repository's hooks/ folder with the following. If there is a file named start-commit.tmpl, copy that as a starting point. You'll also have to make sure that both start-commit and check_commit_privs.php are executable by your SVN users.

REPOS="$1"
USER="$2"
# basic permissions check
/path/to/check_commit_privs.php "$REPOS" "$USER"  || exit 1
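
One caveat worth noting: as far as I know, start-commit is only handed the repository's location on disk and the user name, not the paths being committed, so rules that target a directory inside the repository need the list of changed paths. That list is available in the pre-commit hook, which receives the repository path and a transaction name that can be passed to svnlook. The following is a rough sketch of that variant, not a drop-in replacement; the function names and the sample rule are hypothetical.

```php
<?php
// Sketch only: path-aware check meant to be called from a pre-commit hook.
// Function names and the sample rule are assumptions for illustration.

/**
 * Pulls the paths out of `svnlook changed` output, where each line is a
 * status code followed by whitespace and the path.
 */
function changedPaths(array $svnlook_lines)
{
    $paths = array();
    foreach ($svnlook_lines as $line) {
        if (preg_match('/^\S+\s+(.+)$/', $line, $m)) {
            $paths[] = $m[1];
        }
    }
    return $paths;
}

/**
 * Returns the first changed path the user may not commit to, or null.
 */
function firstDeniedPath($user, array $paths, array $allowed)
{
    foreach ($paths as $path) {
        foreach ($allowed as $regexp => $group) {
            if (preg_match($regexp, $path) && !in_array($user, $group)) {
                return $path;
            }
        }
    }
    return null;
}

// Runs only when called as a hook: pre-commit passes REPOS and TXN.
if (isset($_SERVER['argv'][2])) {
    $repo = escapeshellarg($_SERVER['argv'][1]);
    $txn  = escapeshellarg($_SERVER['argv'][2]);
    $user = trim((string) shell_exec("svnlook author -t $txn $repo"));
    exec("svnlook changed -t $txn $repo", $lines);

    // hypothetical rule: only bob and roger may touch Project1's stable branch
    $allowed = array('#^Project1/branches/stable/#' => array('bob', 'roger'));

    $denied = firstDeniedPath($user, changedPaths($lines), $allowed);
    if ($denied !== null) {
        fwrite(STDERR, "$user may not commit to $denied\n");
        exit(1);
    }
}
```

To try it, call the script from hooks/pre-commit with "$1" and "$2" in the same way start-commit calls check_commit_privs.php above.
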
Posted by Oscar Merida at 15:03 | Comments (0) | Trackbacks (0)

Subversion named product of the year

Tuesday, January 22. 2008

Congrats to CollabNet! Subversion, the open source versioning tool written as a "compelling replacement to CVS", was named by developer.com as the Product of the Year 2008. We've been using Subversion for our work for over a year now and it's proved to be a very useful tool in helping us get work done and be more productive.

Before SVN, we had a manual cutover process to move files from our development server to production servers. It was slow, since you had to navigate around the file system, and error-prone, since you could never be sure that you had moved all the correct files. This became more of a problem as multiple developers and designers worked on a single site. Now, SVN keeps track of which files are modified in each environment and synchronizes the code on each. It's also allowed us to move from a single, shared development server to working locally on our own workstations. We've also started exploiting the hooks in SVN for checking PHP syntax before commits, and sending emails for code reviews upon commit.

If you're working with or hiring programmers, one of the first questions you should ask them is how they manage and track changes to their source code. If they aren't using Subversion or another version control system, that's an immediate red flag.

We've written about subversion before. Dan's previously written about Subversion: Simple Practices and Sandy wrote HOWTO: Use Eclipse PDT with Subversion in 13 Easy Steps.

Posted by Oscar Merida at 14:07 | Comments (0) | Trackbacks (0)

Non-Profit Tech Blog: The basics of getting your .org online

Tuesday, January 22. 2008

Non-Profit Tech Blog has posted an excellent start to their 3 part series on How to get your small nonprofit up on the Web, showing how to register your domain name and why you want one. I'm also heartened to see that part 2 will detail how to use Google Apps for Domains as your email hosting provider, for two reasons. First, we're in the process of switching our email infrastructure from hosting it ourselves to GAD - it made too much sense from a cost, reliability, and support angle. Second, this past weekend I spent time moving my personal email to GAD for the same reasons, and found the whole process to be well documented and structured by Google. At every step of the way the instructions are pretty clear and not overly technical. Where it does veer into technical details, there are specific, step by step guides to follow.

One caution about registering domain names - make sure that the email contact address you have for both the Administrative and Technical contacts is a reliable email address that you check regularly. You'll want to keep an eye out for bogus DNS transfers and, more importantly, that address is where you'll be notified when your domain name comes up for renewal. You absolutely do not want your domain to expire, since another person could quickly come along and register it.

Once you have your domain name, you'll likely need DNS hosting. DNS hosting lets you route host names, mail, and other services to servers by associating a numerical IP address with each name; you can read up on DNS hosting services on Wikipedia. If you register with GoDaddy, I believe they offer that as part of your registration, as most registrars do. You may prefer not to use a registrar's DNS hosting, so that you don't also have to move your DNS hosting if you switch registrars in the future. Using a separate DNS hosting provider can prevent downtime for your web site and services in that case. There are many free DNS hosting services available; one to look at is everydns.net, which is free but accepts donations.

Posted by Oscar Merida at 12:17 | Comment (1) | Trackbacks (0)

Extracting text from Office and PDF files

Thursday, December 20. 2007

Do you need to extract text from Word, Excel, PowerPoint, and Adobe PDF files? We recently added full text searching of uploaded files to ProjectSpaces. Thanks to the work of other hackers, we were able to utilize command line utilities to extract meaningful text from proprietary, and/or binary files. Once implemented, we could index the majority of files uploaded by our users and present matching files for search queries.

catdoc for Word and Excel files

The catdoc package is available for Debian, Ubuntu, and RedHat distros. On a Debian or Ubuntu server you can install the package with:

sudo apt-get install catdoc
To install on our RHEL4 server, I had to find and download an RPM file for catdoc, and install tk
sudo up2date tk
sudo rpm -ivh catdoc-0.94.2-3.el4.i386.rpm
Once installed, you will have two utilities for extracting from Microsoft Word (catdoc) and Microsoft Excel (xls2csv) files. Using catdoc is as simple as:
catdoc -w ~/MeetingNotes.doc
If successful, the contents of the Word file will be piped to STDOUT. The -w flag is used to suppress word wrapping, otherwise catdoc will wrap lines at 72 characters.

Similarly, you can get the content from an Excel spreadsheet with:

xls2csv ~/Sales.xls

Executing the command above pipes the contents of your spreadsheet to STDOUT with comma-separated values.

xpdf for PDF files

The xpdf package provides a command line utility named pdftotext, which can parse PDF files up to version 1.5. This is a straightforward package to install on Debian or RedHat servers either:

sudo apt-get install xpdf
OR :
sudo up2date xpdf

The utility pdftotext is a little trickier to use, because by default it wants to save the extracted text to another file. By specifying - as the target file, output is sent to STDOUT instead.

pdftotext ~/DeveloperResume.pdf -

xlhtml for Powerpoint

Although catdoc also bundles a utility named catppt, I didn't have any success in getting meaningful output from Powerpoint files with it. Instead, I settled on using ppthtml. It may be a little harder to find a package of this utility for your distribution. On Ubuntu, it can be installed with:

sudo apt-get install ppthtml
For Redhat systems, look for the xlhtml package, I was able to find xlhtml-0.5-2.el4.sme.i386.rpm for our server and installed it with
sudo rpm -ivh xlhtml-0.5-2.el4.sme.i386.rpm 

Usage of ppthtml is straightforward:

ppthtml ~/ConferenceSlides.ppt

The text of the Powerpoint file will be sent to STDOUT and formatted as HTML. Since only text is extracted, images and text contained inside images will be ignored.

Capturing content in PHP

Above, I emphasized how to send output to STDOUT. With the output on STDOUT, we can use PHP's exec function to capture the extracted text and then process it further or save it to a database table. A function for working with a Word file might look like:

/**
 * Attempts to extract text from a MS Word file
 * @param string $word_file full path to file
 * @return string extracted text, or an empty string if the file is missing
 */
function extractWord($word_file)
{
    if (file_exists($word_file))
    {
        // escapeshellarg() prevents malicious command execution
        exec('/usr/bin/catdoc -w ' . escapeshellarg($word_file), $output);

        // $output is an array corresponding to lines of output
        return join("\n", $output);
    }

    return '';
}

Functions for extracting from Excel, Powerpoint, and PDF files would only require switching the command line tool.
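
As a rough sketch of that idea, a single function can pick the tool from the file extension. The /usr/bin locations and the function names are assumptions for illustration; adjust the paths to wherever the utilities live on your server.

```php
<?php
// Sketch only: binary locations and function names are assumptions.

/**
 * Returns the shell command that extracts text from $file, or null when
 * the extension is not one we know how to handle.
 */
function extractionCommand($file)
{
    // %s is replaced with the shell-escaped file path
    $tools = array(
        'doc' => '/usr/bin/catdoc -w %s',
        'xls' => '/usr/bin/xls2csv %s',
        'ppt' => '/usr/bin/ppthtml %s',
        'pdf' => '/usr/bin/pdftotext %s -',
    );
    $ext = strtolower(pathinfo($file, PATHINFO_EXTENSION));
    if (!isset($tools[$ext])) {
        return null;
    }
    return sprintf($tools[$ext], escapeshellarg($file));
}

/**
 * Runs the matching tool and returns its output, or an empty string if
 * the file is missing, unsupported, or the tool reports an error.
 */
function extractText($file)
{
    $command = extractionCommand($file);
    if ($command === null || !is_file($file)) {
        return '';
    }
    exec($command, $output, $status);
    return $status === 0 ? join("\n", $output) : '';
}
```

Since ppthtml emits HTML, you may want to run its output through strip_tags() before indexing it.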

What can we do with such tools?

  1. Add the extracted text to a MySQL table for simple FULLTEXT searching.
  2. Use Yahoo's Term Extraction service to extract meaningful keywords.
  3. Index the text with a more advanced full text search engine.
  4. Provide a text preview of file contents so that users don't have to fire up a client application or browser plugin.

I'd love to hear other ideas too!

Posted by Oscar Merida in Tools at 11:04 | Comments (3) | Trackbacks (0)