Tuesday, May 13. 2008
GrowYourWiki has a pair of posts on the Pitfalls and Keys to Success for Wiki Adoption within your organization. It's a concise summary of key best practices to encourage participation when you deploy a wiki.
Internally, we've faced similar issues. It's eerie that his advice that "Meetings are an especially good place to start" describes how the Tech Team, at least, makes the most use of our intranet wiki. He also makes the point that you should plan for success, not failure. Too often, we worry about how a few bad actors may post inappropriate content or misuse the tool in some way. The usual reaction to such a risk is either to not deploy the tool at all, which can be a huge missed opportunity, or to overburden it with controls, reviews, and approval processes so that no one is encouraged to use it.
Managing to the possibility of failure, not success – If you are more focused on how the wiki will fail, instead of how it will succeed, you have already written your destiny.
HT: Sage advice on wiki adoption: keys to success
Monday, May 12. 2008
If you've signed up for a website in the last year or two, you're likely familiar with CAPTCHAs, those distorted images asking you to figure out some gibberish string of numbers and letters. A CAPTCHA is intended to stop abuse of a system by automated software by offering a task that only people can solve. We've often been asked by clients to "put a CAPTCHA" wherever we can anticipate abuse, but I've always pushed back because the effectiveness of CAPTCHAs has degraded over time. Here, we'll take a look at the problems with CAPTCHAs and suggest some alternatives to their use.
CAPTCHAs hurt usability and accessibility.
A visual CAPTCHA will not be usable by visitors who use screen readers or who have a vision impairment such as color blindness. An accompanying audio CAPTCHA is recommended, but now you've doubled the opportunities for nefarious users to attack your web site. Even if you have good vision, you've probably encountered visual CAPTCHAs that are difficult to use, since making them hard to read is the only way to make them effective. By making them hard to read, you've made your web page much harder to use. I've run into CAPTCHAs that take me a number of tries to get right because it's hard to tell the ones from the Ls or the zeros from the letter O.
CAPTCHAs have already been broken
CAPTCHAs have already been cracked through various methods. Automated programs exist to break common CAPTCHAs, and you can actually buy such software. Jeff Atwood asked last November, Has CAPTCHA Been "Broken"?, and argued that CAPTCHAs were still effective since those used by Google, Hotmail, and Yahoo were considered unbreakable. For now let's ignore the fact that you need the resources of Google, Hotmail, or Yahoo to make "unbreakable" CAPTCHAs. Recent reports suggest that even their systems have been broken - Software Attacks Software in Security Wars.
Image CAPTCHAs for Google, Windows Live, and Yahoo! have been broken in recent months, and is believed to account for the increasing levels of spam that are coming from webmail services that those companies provide.
Security Labs even managed to dissect exactly how spammers have automated setting up Microsoft Hotmail account: Microsoft Live Hotmail Under Attack by Streamlined Anti-CAPTCHA and Mass-mailing Operations.
It is observed that unlike Live Mail Anti-CAPTCHA and Gmail Anti-CAPTCHA operations in the past, the current attack is aggressive and instantaneous in terms of CAPTCHA breaking host turn-around time.
Automated solutions are not required though, as CAPTCHAs can be solved by relaying the image to unsuspecting users through a relay attack. Just last year, a striptease program was used to bypass Yahoo's CAPTCHAs.
Trend Micro has identified the program as TROJ_CAPTCHAR.A, a striptease game wherein the player enters the letters hiding within a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) image. For each correct entry, more clothes come off in photos of a scantily clad woman identified as "Melissa."
You can't rely on CAPTCHAs
If this technique is relatively useless, and at best just slows down malicious users instead of stopping them altogether, what alternatives do we have? I'll consider two scenarios where CAPTCHAs are commonly used: deterring spam on blogs and message boards, and preventing automated registration for user accounts.
Alternatives for limiting spam messages
If you have a blog or run a message board, spammers are a nuisance that drown out legitimate conversations with noise. One of the best solutions I've used for limiting spam is Akismet, a distributed and collaborative effort to identify spam messages. It's a web service that you must sign up for - although free for personal use, you'll need a subscription for non-personal uses. Basically, when a visitor leaves a message, your CMS or blog first sends the message, along with some information about who posted it, to Akismet, which returns a simple yes or no result about the message's spamminess. At that point, you can either reject the message altogether or hold it for further review and approval. Akismet integrates easily with WordPress, and there are libraries and plug-ins for many other platforms. If one doesn't exist, the Akismet API is open and documented so you can write your own.
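The round trip described above can be sketched in PHP. This is only an illustration, not an official client: the function names are made up, the endpoint layout and field names reflect my understanding of Akismet's comment-check call, and you should verify them against the current API documentation before relying on them.

```php
<?php
// Sketch only: function names are hypothetical, and the endpoint and
// field names are assumptions to check against Akismet's documentation.

/** Build the form-encoded POST body for a comment-check request. */
function akismetBody(string $siteUrl, array $comment): string
{
    return http_build_query([
        'blog'                 => $siteUrl,
        'user_ip'              => $comment['ip'],
        'user_agent'           => $comment['user_agent'],
        'comment_type'         => 'comment',
        'comment_author'       => $comment['author'],
        'comment_author_email' => $comment['email'],
        'comment_content'      => $comment['content'],
    ]);
}

/** The service answers with the plain text "true" (spam) or "false". */
function akismetSaysSpam(string $responseBody): bool
{
    return trim($responseBody) === 'true';
}

/** Send the check; returns null when the service cannot be reached. */
function checkComment(string $apiKey, string $siteUrl, array $comment): ?bool
{
    $context = stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => akismetBody($siteUrl, $comment),
        'timeout' => 5,
    ]]);
    $url = "https://{$apiKey}.rest.akismet.com/1.1/comment-check";
    $response = @file_get_contents($url, false, $context);
    return $response === false ? null : akismetSaysSpam($response);
}
```

Returning null when the service is unreachable lets the caller decide what to do; holding the message for manual review is probably the safer choice.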
If you are using PHP, and don't want to integrate with the Akismet web service, or simply want another line of defense, there is Bad Behavior. It uses a number of tests to try to screen out spam bots from your site before they can do any damage.
Bad Behavior runs before your software on each request to your Web site, so if a spam bot does visit, it will receive nothing, and your software never runs. This reduces the amount of server CPU time, database activity and bandwidth spent on processing robots which are just harvesting your site and delivering junk.
A third method for fighting comment spam is to require unknown users to confirm their message via email. That is, ask for an email address along with a comment - this is fairly standard already - and send unregistered users an email with a link to confirm their message. For regular visitors, you can ask them to create an account or, better still, use OpenID to confirm their identity, and allow them to skip the email confirmation step altogether. As an added precaution, you may want to review postings from new users until they reach a milestone like "5 non-spam messages".
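One way to build the confirmation link is to sign the comment ID and email address with a secret that only your server knows, so there is nothing extra to store until the visitor clicks. The sketch below assumes a hypothetical confirmation URL and invented function names; it is one possible approach, not the only one.

```php
<?php
// Hypothetical helpers: $secret is any long random string kept on the server.

/** Derive a token that ties a comment to the email address given with it. */
function confirmationToken(int $commentId, string $email, string $secret): string
{
    return hash_hmac('sha256', $commentId . '|' . strtolower($email), $secret);
}

/** Check a token taken from the clicked link, using a timing-safe comparison. */
function confirmationIsValid(int $commentId, string $email, string $token, string $secret): bool
{
    return hash_equals(confirmationToken($commentId, $email, $secret), $token);
}

/** Build the link to put in the confirmation email. */
function confirmationLink(string $baseUrl, int $commentId, string $email, string $secret): string
{
    return $baseUrl . '?' . http_build_query([
        'comment' => $commentId,
        'token'   => confirmationToken($commentId, $email, $secret),
    ]);
}
```

When the link is followed, look up the comment by ID, recompute the token from the stored email address, and publish the comment only if the two match.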
Alternatives for protecting user registrations from bots.
Using CAPTCHAs as part of the registration process is meant to separate people from bots. Digg even asks the question Are you human? Technological alternatives here are a little less obvious. You could require users to activate their account via email, which at least makes it more time-consuming for potentially malicious users to register. Depending on the sensitivity of the application, you can require even more difficult activation procedures. I know of one credit card company whose system requires you to provide a phone number, which it then calls with an activation code. Another alternative is to require invitations to join a system, coupled with a way to audit invitations in case someone invites a bad apple. An overall approach that should work is to give users gradually escalating privileges as they demonstrate good behavior.
I'm not sure that a single technical cure exists for preventing unwanted user registrations. For now, I think sites will need to rely on an approval process of some kind for new registrations and a method for other site users to report people who abuse the system.
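The idea of gradually escalating privileges can be sketched as a simple lookup on how many of a user's posts have been approved. The privilege names and thresholds below are invented for illustration; pick whatever fits your site.

```php
<?php
// Illustrative tiers only: the names and the thresholds (5 and 25) are assumptions.

/** Returns the privileges earned after $approvedPosts non-spam posts. */
function privilegesFor(int $approvedPosts): array
{
    // Brand new accounts can post, but everything waits for a moderator.
    $privileges = ['post_held_for_review'];

    if ($approvedPosts >= 5) {
        // A short track record earns immediate publishing.
        $privileges = ['post_immediately'];
    }
    if ($approvedPosts >= 25) {
        // Long-standing members may vouch for new users.
        $privileges[] = 'invite_others';
    }
    return $privileges;
}
```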
Tuesday, April 1. 2008
At least, I hope it's not an April Fool's joke. It looks more like someone's lawyers go up to 11. From the SlideShare API Terms of Service:
(iii) use the SlideShare APIs to operate nuclear facilities, life support, or other mission critical application where human life or property may be at stake. You understand that the SlideShare APIs are not designed for such purposes and that their failure in such cases could lead to death, personal injury, or severe property or environmental damage for which SlideShare is not responsible;
Friday, March 21. 2008
Jeff Vogel provides tips for writing more comprehensible code in a very humorous article. They don't just apply to solo coders; people who inherit or update your code down the line will also thank you. Even though they're written for C programmers, his advice applies to any programming language. For web programmers, #5 should be heeded, as we've learned that the code we run on the server typically contributes only a fraction of how long users must wait to see your web page. You get more bang for the buck by reducing how much data browsers have to download (file sizes) and how many connections they must make (hits to download CSS, images, JavaScript files, etc.).
This means that, if I make a stinky mess, I'm doing it in my own nest. When I'm chasing down a bug at 3 a.m., staring at a nightmare cloud of spaghetti code, and I say, "Dear God, what idiot child of married cousins wrote this garbage?", the answer to that question is "Me."
I've previously written my own set of tips to write readable PHP code.
Friday, March 21. 2008
There is an excellent case study in the NY Times on how Pure Digital's ultra-simple camcorder, the Flip, has grabbed 13% of the camcorder market and become the best-selling camcorder on Amazon.com. It did this not by offering every feature and tech acronym buzzword under the sun, but by being trivially easy to use. David Pogue says,
Instead, the Flip has been reduced to the purest essence of video capture. You turn it on, and it's ready to start filming in two seconds. You press the red button once to record (press hard -- it's a little balky) and once to stop. You press Play to review the video, and the Trash button to delete a clip. There it is: the entire user's manual.
"But Oscar, we don't make camcorders or gadgets. What's your point?" The success of the Flip lies in getting out of the way of users who want to record a video clip. Users don't need to adjust a million settings, and don't have the option of adding special effects, zooming in and out, and so on. They just point and record. Next time you're building out a new web site, a section of your web site, or any point where you're asking users to interact with you - fill out a form, post a comment, make a donation - try to eliminate as much cruft as possible and reduce the interaction to its barest essence. Ideally, and I'm probably a bit of an idealist for writing this, this would mean an email newsletter subscription form that, shocker, has only one field for the user's email address, and no more. Or a donation form that is simply the amount to donate and the fields required to process payment. The fewer hoops you make users jump through, the more users may jump through your hoops.
Friday, March 21. 2008
Google Analytics released a new feature two weeks ago that allows sites to anonymously share data with comparable sites. Benchmarking site traffic is something we've suggested to clients in the past, and in a few cases we have facilitated it on a small scale. But now, you can see how your blog, online community, association, or non-profit's website compares to peers who also use GA. I'd stress that you're comparing yourself to peers who use the service and choose to share their data, so there is some selection bias, and it's up to Google to decide who you are compared with. Still, it makes available data that would otherwise be expensive or impossible to get, and as more sites opt to share their traffic, it can get better. You can learn how to enable this and what you can compare over at googlesystem.
The system is very beta at the moment, and the "Verticals" used to categorize sites are still few. It can still help answer questions like:
- What are overall traffic patterns in my sector?
- Is my traffic growth/decline part of an overall traffic growth/decline or unique to my site?
- What does an "average" site like mine get in Page Visits/Visitors/Bounce Rate/New Visitors, and how do I compare?
Technorati Tags: Google Analytics, nptech
Monday, February 18. 2008
Once you and your colleagues have been using Subversion on a project for a while, you're bound to find useful ways to exploit Subversion's hook system. We use the sample commit-email.pl script to send all commits to an email list for ad-hoc peer review. I found and enabled Ian Christian's pre-commit script to check PHP syntax when checking in code.
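To give a flavor of what a syntax-checking hook does, here is a minimal sketch of the core test in PHP. It is not Ian Christian's script; the function name is made up, and it assumes the file to check has already been written to disk (a real pre-commit hook would first have to pull each changed file out of the pending transaction, for example with svnlook cat).

```php
<?php
// Hypothetical helper, not the script mentioned above.

/** Returns true when `php -l` (lint) reports no syntax errors in $file. */
function phpSyntaxOk(string $file): bool
{
    // PHP_BINARY is the interpreter running this script; this assumes the CLI.
    $cmd = escapeshellarg(PHP_BINARY) . ' -l ' . escapeshellarg($file) . ' 2>&1';
    exec($cmd, $output, $status);

    // The lint run exits with status 0 only when the file parses cleanly.
    return $status === 0;
}
```

A hook built on this would exit with a non-zero status as soon as one file fails, which makes Subversion reject the commit.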
The latest piece of the puzzle was to restrict commits to a project's branch to a few users, which has been harder to figure out than I expected. The most common script for access control is svnperms, which has a rich syntax for configuring access. Unfortunately, svnperms seems to work best with a repository with the following repository layout:
Our repository is laid out as:
I was trying to restrict access to Project1's stable branch (project1/branches/stable), and this didn't seem to be possible under svnperms, no matter how many regular expressions I tried. Subversion provides another access control script, commit-access-control.pl, but having been burned by svnperms, I was reluctant to spend too much time trying to configure it and get it to work.
Since hooks are just shell scripts, it's easy to write your own, which is what I did in this case. The place to check commit access is before the transaction is created, in the start-commit hook. Being more comfortable in PHP, I whipped up the following command line script and saved it as check_commit_privs.php:
#!/usr/bin/php
<?php
/*
CHECKS IF A USER CAN COMMIT TO THE REPOSITORY
Oscar Merida <omerida@forumone.com>
*/
// SVN passes two arguments, the repository path and user for the commit
$repo_path = $_SERVER['argv'][1];
$commit_user = $_SERVER['argv'][2];
// You can use arrays to define user groups
$qa_group = array('bob', 'roger', 'amanda');
$contractors = array('marco', 'dawn', 'bill');
// CONFIGURATION
//
// array key is a path in SVN repository or a regular expression that will match a path.
// Keys are passed to preg_match(), so they need delimiters (the slashes).
// value is an array of usernames that can commit to that path
// first path match that limits access will prevent commits.
// This script assumes you only need to lock down certain
// parts of your repository.
$allowed = array(
    // only contractors can commit to widgets project
    '/widgets/' => $contractors,
    // only qa_group can commit to any project's testing branch
    '/.*\/branches\/testing/' => $qa_group,
    // only bill can commit to his project
    '/bills_project/' => array('bill')
);
foreach ($allowed as $regexp => $group)
{
    if (preg_match($regexp, $repo_path)
        && !in_array($commit_user, $group))
    {
        // Subversion relays a failing hook's STDERR to the committing user
        fwrite(STDERR, "$commit_user is not allowed to commit to $repo_path\n");
        exit(1);
    }
}
To enable this script, create or add a file named 'start-commit' to your repository's hooks/ folder with the following. If there is a file named start-commit.tmpl, copy that as a starting point. You'll also have to make sure that both start-commit and check_commit_privs.php are executable by your SVN users.
REPOS="$1"
USER="$2"
# basic permissions check
/path/to/check_commit_privs.php "$REPOS" "$USER" || exit 1
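Before wiring the hook into a live repository, it can help to try the rules out by hand. The sketch below pulls the same first-match logic used in check_commit_privs.php into a function (the name canCommit is made up for this example) so that you can feed it sample paths and users.

```php
<?php
// Hypothetical refactoring of the loop in check_commit_privs.php for testing.

/**
 * Returns true when $user may commit.
 * $rules maps a regular expression to the list of usernames allowed to
 * commit wherever that expression matches $repoPath.
 */
function canCommit(string $repoPath, string $user, array $rules): bool
{
    foreach ($rules as $regexp => $group) {
        if (preg_match($regexp, $repoPath) && !in_array($user, $group, true)) {
            return false; // the first rule that excludes the user wins
        }
    }
    return true; // paths that match no rule stay open to everyone
}

// Sample rules using the example usernames from the script above.
$rules = [
    '/widgets/'       => ['marco', 'dawn', 'bill'],
    '/bills_project/' => ['bill'],
];
```

One thing to keep in mind while testing: start-commit is handed the repository's location on disk and the username. If you need to check the individual files touched by a commit, the usual place for that is a pre-commit hook that inspects the transaction with svnlook changed.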
Tuesday, January 22. 2008
Congrats to CollabNet! Subversion, the open source versioning tool written as a "compelling replacement to CVS", was named by developer.com as the Product of the Year 2008. We've been using Subversion for our work for over a year now, and it's proved to be a very useful tool in helping us get work done and be more productive.
Before SVN, we had a manual cut over process to move files from our development server to production servers. It was slow, since you had to navigate around the file system, and error-prone, since one could never be sure that you moved all the correct files. This became more of a problem as multiple developers and designers would work on a single site. Now, SVN is constantly keeping track of which files are modified on each environment and automatically synchronizes the code on each. It's also allowed us to move from a single, shared development server, to working locally on our own workstations. We've also started exploiting the hooks in svn for checking PHP syntax before commits, and sending emails for code-reviews upon commit.
If you're working with or hiring programmers, one of the first questions you should ask them is how they manage and track changes to their source code. If they aren't using Subversion or another version control system, that's an immediate red flag.
We've written about subversion before. Dan's previously written about Subversion: Simple Practices and Sandy wrote HOWTO: Use Eclipse PDT with Subversion in 13 Easy Steps.
Tuesday, January 22. 2008
Non Profit Tech blog has posted an excellent start to their 3 part series on How to get your small nonprofit up on the Web, showing how to register for your domain name and why you want one. I'm also heartened to see that part 2 will detail how to use Google Apps for Domains as your email hosting provider, for two reasons. First, we're in the process of switching our email infrastructure from hosting it ourselves to GAD - it made too much sense from a cost, reliability, and support angle. Second, this past weekend I spent time moving my personal email to GAD for the same reason, and found the whole process to be well documented and structured by Google. At every step of the way the instructions are pretty clear and not overly technical. Where it does veer into technical details, there are specific, step by step guides to follow.
One caution about registering domain names - make sure that the email contact address you have for both the Administrative and Technical contacts is a reliable email address that you check regularly. You'll want to keep an eye out for bogus DNS transfers, and more importantly, when your domain name comes up for renewal you'll be notified there. You absolutely do not want your domain to expire, since another person could quickly come along and register it.
Once you have your domain name, you'll likely need DNS hosting. DNS hosting lets you route host names, mail, and other services to servers by associating a numerical IP address with each name; you can read up on DNS hosting services on Wikipedia. If you register with GoDaddy, I believe they offer that as part of your registration, as most registrars do. You may prefer not to use a registrar's DNS hosting, to insulate yourself from having to also move DNS hosting if you switch registrars in the future. Using a separate DNS hosting provider can prevent downtime for your web site and services in that case. There are many free DNS hosting services available; one to look at is everydns.net, which is free but accepts donations.
Thursday, December 20. 2007
Do you need to extract text from Word, Excel, PowerPoint, and Adobe PDF files? We recently added full text searching of uploaded files to ProjectSpaces. Thanks to the work of other hackers, we were able to utilize command line utilities to extract meaningful text from proprietary, and/or binary files. Once implemented, we could index the majority of files uploaded by our users and present matching files for search queries.
catdoc for Word and Excel files
The catdoc package is available for Debian, Ubuntu, and RedHat distros. On a Debian or Ubuntu server you can install the package with
sudo apt-get install catdoc
To install on our RHEL4 server, I had to find and download an RPM file for catdoc, and install tk
sudo up2date tk
sudo rpm -ivh catdoc-0.94.2-3.el4.i386.rpm
Once installed, you will have two utilities for extracting from Microsoft Word (catdoc) and Microsoft Excel (xls2csv) files. Using catdoc is as simple as:
catdoc -w ~/MeetingNotes.doc
If successful, the contents of the Word file will be piped to STDOUT. The -w flag is used to suppress word wrapping, otherwise catdoc will wrap lines at 72 characters.
Similarly, you can get the content from an Excel spreadsheet with:
xls2csv ~/Sales.xls
Executing the command above pipes the contents of your spreadsheet to STDOUT with comma-separated values.
xpdf for PDF files
The xpdf package provides a command line utility named pdftotext, which can parse PDF files up to version 1.5. This is a straightforward package to install on Debian or RedHat servers with either:
sudo apt-get install xpdf
OR :
sudo up2date xpdf
The utility pdftotext is a little trickier to use, because by default it wants to save the extracted text to another file. By specifying - as the target file, output is sent to STDOUT instead.
pdftotext ~/DeveloperResume.pdf -
xlhtml for Powerpoint
Although catdoc also bundles a utility named catppt, I did not have any success in getting meaningful output from PowerPoint files with it. Instead, I settled on using ppthtml. It may be a little harder to find a package of this utility for your distribution. On Ubuntu, it can be installed with:
sudo apt-get install ppthtml
For RedHat systems, look for the xlhtml package. I was able to find xlhtml-0.5-2.el4.sme.i386.rpm for our server and installed it with
sudo rpm -ivh xlhtml-0.5-2.el4.sme.i386.rpm
Usage of ppthtml is straightforward:
ppthtml ~/ConferenceSlides.ppt
The text of the Powerpoint file will be sent to STDOUT and formatted as HTML. Since only text is extracted, images and text contained inside images will be ignored.
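Because ppthtml emits HTML, you will probably want to reduce it to plain text before indexing it. Here is a minimal sketch in PHP; the function name is made up, and it simply drops every tag, so it makes no attempt to handle embedded scripts or styles.

```php
<?php
// Hypothetical helper for cleaning up HTML output before indexing.

/** Reduce an HTML fragment to plain text with single spaces between words. */
function htmlToText(string $html): string
{
    // Replace each tag with a space so words in adjacent elements don't fuse.
    $text = preg_replace('/<[^>]*>/', ' ', $html);

    // Turn entities such as &amp; back into the characters they stand for.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace left behind by the markup.
    return trim(preg_replace('/\s+/', ' ', $text));
}
```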
Capturing content in PHP
Above, I emphasized how to send output to STDOUT. By doing so, we can use PHP's exec function to capture the extracted text and process it further or save it to a database table. A function for working with a Word file might look like:
/**
 * Attempts to extract text from a MS Word file
 * @param string $word_file full path to file
 * @return string extracted text, or an empty string if the file is missing
 */
function extractWord($word_file)
{
    if (!file_exists($word_file))
    {
        return '';
    }
    // escapeshellarg() prevents malicious command execution
    exec('/usr/bin/catdoc -w ' . escapeshellarg($word_file), $output);
    // $output is an array corresponding to lines of output
    return join("\n", $output);
}
Functions for extracting from Excel, Powerpoint, and PDF files would only require switching the command line tool.
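As a sketch of that idea, a single function could pick the tool from the file extension. The function names here are made up, the commands are the ones shown earlier in this post, and the tools are assumed to be on the web server's PATH.

```php
<?php
// Hypothetical dispatcher; assumes catdoc, xls2csv, pdftotext and ppthtml are installed.

/** Maps a file to the command that prints its text to STDOUT, or null. */
function extractorCommand(string $file): ?string
{
    $arg = escapeshellarg($file); // guard against shell injection via file names
    switch (strtolower(pathinfo($file, PATHINFO_EXTENSION))) {
        case 'doc': return "catdoc -w $arg";
        case 'xls': return "xls2csv $arg";
        case 'pdf': return "pdftotext $arg -"; // trailing "-" sends text to STDOUT
        case 'ppt': return "ppthtml $arg";     // note: this one outputs HTML
        default:    return null;
    }
}

/** Returns the extracted text, or an empty string for missing or unsupported files. */
function extractText(string $file): string
{
    $command = extractorCommand($file);
    if ($command === null || !file_exists($file)) {
        return '';
    }
    exec($command, $output); // $output receives one array entry per line
    return implode("\n", $output);
}
```

Keeping the command lookup separate from the exec call makes it easy to add another format later, or to check the mapping without having the tools installed.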
What can we do with such tools?
- Add the extracted text to a MySQL table for simple FULLTEXT searching.
- Use Yahoo's Term Extraction service to extract meaningful keywords
- Index them with a more advanced full text search engine.
- Provide a text preview of file contents so that users don't fire up a client application or browser plugin.
I'd love to hear other ideas too!