 |
|
Friday, March 21. 2008
Google Analytics releases a new feature two weeks ago, allowing sites to anonymously share data with comparable sites. Benchmarking site traffic has been something we've suggested to clients in the past, and in a few cases have facilitated it on a small scale. But now, you can compare how your blog, online community, association, or non-profit's website compares to peers who also use GA. I'd stress that you're comparing yourself to other peers who use the service and choose to share the data so there is some selection bias, and its up to Google to decide who you are compared with. Still, it makes available data that would otherwise be expensive or impossible to get, and as more sites otp to share their traffic, it can get better. You can learn how to enable this and what you can compare over at googlesystem.
The system is very beta at the moment, and the "Verticals" used to categorize sites are few at the moment. It can still help answer questions like:
- What are overall traffic patterns in my sector?
- Is my traffic growth/decline part of an overall traffic growth/decline or unique to my site?
- What's an "average" site like mine get in Page Visits/Visitors/Bounce Rate/New Vistors, and how do I compare?
Technorati Tags: Google Analytics, nptech
Tuesday, January 29. 2008
Dan and I are back from the MySQL for Developers training, and we decided to share the enlightenment. Instead of boring you with a summary of what we've learned, we've decided to divvy up our notes into various "tips & tricks" posts.
Let's say that a client sends you an Access, Excel, or CSV file and tells you to stick all that information into your database. The client is oblivious to the structure of your table, and the file contains several columns that aren't needed. Some of the other columns are improperly formatted, making your life that much harder.
Your first instinct might be to write a PHP (or Perl, or Python, etc) script that squeezes and trims data to the right dimensions, then sends to MySQL. That's fine, but many don't know that MySQL can usually do this tedius work for you. It might also be worth mentioning that MySQL parses this data exceedingly fast.
LOAD DATA INFILE (or LOAD LOCAL DATA INFILE from your local machine) is an importer's dream come true. It allows for MySQL itself to load and parse a file. Once loaded, you can set the FIELDS TERMINATED BY, ENCLOSED BY, LINES TERMINATED BY, and IGNORE ... LINES parameters to split the data into meaningful chunks.
Setup:
LOAD DATE INFILE '[file.ext]' INTO TABLE [table_name]
FIELDS TERMINATED BY '[column_separator]'
ENCLOSED BY '[usually doublequote]'
LINES TERMINATED BY '[usually \r or \n]'
IGNORE [lines_to_skip] LINES
(col1, @variable2, col3, col4)
SET col2 = SOME_FUNCTION(@variable2, some_parameters);
You can use @variables instead of columns for processing. If unneeded file columns exist, then we store that column into a dummy variable (look at @skip below).
The import file uses "Apr 20 1982" for the birthday, but we want it looking like "1982-04-20". Below, we set the @birthday variable equal to the file's birthday column, then SET birthday (the DB column) equal to STR_TO_DATE(@birthday, 'input_format').
Real-world usage:
LOAD DATA INFILE 'import.csv' INTO TABLE people
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 2 LINES
(id, @skip, @fname, @lname, @skip, @birthday)
SET name = CONCAT(@fname, ' ', $lname),
birthday = STR_TO_DATE(@birthday, '%b %e %Y');
NOTE: STR_TO_DATE expects the input date format to be specified. The 2nd parameter contains the curious string '%b %e %Y'. In this example, it's looking for a date formatted like 'Jan 9 2001'. The meaning of those special characters can be found here.
I would highly recommend attending the MySQL training seminars. David Swain, our trainer, was highly competent and had helpful answers and further insight to every question we could throw at him.
Thursday, December 20. 2007
Do you need to extract text from Word, Excel, PowerPoint, and Adobe PDF files? We recently added full text searching of uploaded files to ProjectSpaces. Thanks to the work of other hackers, we were able to utilize command line utilities to extract meaningful text from proprietary, and/or binary files. Once implemented, we could index the majority of files uploaded by our users and present matching files for search queries.
catdoc for Word and Excel files
The catdoc package (Debian, Ubuntu, RedHat distros). On a Debian or Ubuntu server you can install the package with
sudo apt-get install catdoc
To install on our RHEL4 server, I had to find and download an RPM file for catdoc, and install tk
sudo up2date tk
sudo rpm -ivh catdoc-0.94.2-3.el4.i386.rpm
Once installed, you will have two utilities for extracting from Microsoft Word (catdoc) and Microsoft Excel (xls2csv) files. Using catdoc is as simple as:
catdoc -w ~/MeetingNotes.doc
If successful, the contents of the Word file will be piped to STDOUT. The -w flag is used to suppress word wrapping, otherwise catdoc will wrap lines at 72 characters.
Simlary you can get the content from an Excel spreadsheet with:
xls2cvs ~/Sales.xls
Executing the command above pipes the contents of your spreadsheet to STDOUT with comma-separated values.
xpdf for PDF files
The xpdf package provides a command line utility named pdftotext, which can parse PDF files up to version 1.5. This is a straightforward package to install on Debian or RedHat servers either:
sudo apt-get install xpdf
OR :
sudo up2date xpdf
The utility pdftotext is a little trickier to use, because by default it wants to save the extracted text to another file. By specifying - as the target file, output is sent to STDOUT instead.
pdftotext ~/DeveloperResume.pdf -
xlhtml for Powerpoint
Although catdoc also bundles a utility named catppt. I didn't not have any success in getting meaningful output from Powerpoint files with catppt. Instead, I settled on using ppthtml. This utility maybe a little harder to find a package for your distribution. On Ubuntu, it can be installed with:
sudo apt-get install ppthtml
For Redhat systems, look for the xlhtml package, I was able to find xlhtml-0.5-2.el4.sme.i386.rpm for our server and installed it with
sudo rpm -ivh xlhtml-0.5-2.el4.sme.i386.rpm
Usage of ppthtml is straightforward:
ppthtml ~/ConferenceSlides.ppt
The text of the Powerpoint file will be sent to STDOUT and formatted as HTML. Since only text is extracted, images and text contained inside images will be ignored.
Capturing content in PHP
Above, I emphasized how to send output to STDOUT. By doing so, with php we can then use the exec command to extract text and process it further or save it to a database table. A function for working with a word file might look like:
/**
* Attempts to extracts text from a MS Word file
* @param string full path to file
* @return string
*/
function extractWord($word_file)
{
if (file_exists($word_file)
{
// prevent malicious command execution
exec("/usr/bin/catdoc -w ' . escapeshellarg($word_file), $output);
// $output is an array corresponding to lines of output
return join("\n", $output);
}
}
Functions for extracting from Excel, Powerpoint, and PDF files would only require switching the command line tool.
What can we do with such tools?
- Add them to a mysql table for simple FULLTEXT searching.
- Use Yahoo's Term Extraction service to extract meaningful keywords
- Index them with a more advanced full text search engine.
- Provide a text preview of file contents so that users don't fire up a client application or browser plugin.
I'd love to hear other ideas too!
Thursday, November 15. 2007
With the help of Forum One, the Center for Global Development (CGD) launched a web-based initiative aiming to reduce global carbon emissions. CGD believes that public disclosure is one of the best steps towards solving a problem, and they came into the CO2 front with some pretty heavy artillery. Kevin Ummel and David Wheeler at CGD worked tremendously hard not only to find and sanity-check the emissions data, but to sculpt it in a way that many will undoubtedly find meaninful. The database itself consists of information for 50,000 power plants and over 4,000 power companies. Google maps have been used to add some flesh to the raw data, and a colored icon is attached to every item to further reinforce the numbers.
 What's coolest about CARMA is the ability to "dig" through all the data to find stuff that you'd consider meaningful. The Dig Deeper section allows you to drill down to find plants and companies in your area, as well as the most troublesome (or greenest) ones in your region/country/continent/etc.
Users can download the raw data in CSV or XML formats on just about every page, and if that's not thorough enough, there's a full-featured API. With the API, anyone could pull out data and use it for their own website -- perfect for custom widgets. Examples and helpful tips will shortly be added to the API documentation as well.
Even though we knew from the start that this is a really cool initiative, we had no idea that it would make it to the front page of sites like CNN and NPR. On the day after launch (11/15), the first server buckled under the enormous load, causing a bit of chaos. The site was down for a good 20 minutes while Rackspace upgraded to a new server. However, all is well again, and the steps made towards correcting the problem will be on a follow-up post.
Friday, October 19. 2007
After the release of the Eclipse PDT (PHP Development Toolkit)'s All-In-One 1.0 build, I gave it a whirl, but found to my dismay that a Subversion client isn't included in the default install, though CVS is. It took me a few tries to figure out how to get everything working well with it, so I thought I'd share my experiences and save someone else the ramp-up time.
I'll outline a couple of wrong turns here, and then describe a way I found to make it work after the jump. Note I'm not saying it's the way, not with something as configurable as Eclipse. Also, I'm coming to it as an absolute Eclipse newbie, so some of my mistakes might be obvious to a long-time user, but I assure you aren't obvious if you are new to it. First I tried making the workspace the place I'd been working previously, assuming Eclipse would recognize and import the files with a new project. Well, not so much. It seems a new project can import files, but not recognize pre-existing ones. Then I tried setting up a new project in a new workspace, and importing the project I'd checked out with Subversion on the command line. That worked better than I feared. It pulled in .svn directories and the svn client works to an extent in the new file location. However, I got strange file locks and it turns out it didn't import all the files. That's when I stopped screwing around and turned to Google.
Continue reading "HOWTO: Use Eclipse PDT with Subversion in 13 Easy Steps"
Friday, October 12. 2007
Worth closer inspection, today Kintera announced that they've published an open API to allow integration directly with Kintera Technology. You can view the API documentation here. Some initial thoughts:
- You need a Kintera Sphere account with the API enabled (duh).
- If you are on PHP, they recommend version 5, since PHP4 will be dead soon.
- Sample code is available for C#, Java, and PHP. Must mean that that not a lot of Kintera clients have Ruby, Python, or Perl sites.
- The API is SOAP based, which is a little heavier to program, and can be difficult to debug.
- The API has methods for describing entities (business objects), creating new records, retrieving one or more records, updating a record, and deleting a record.
- A sign that they've thought ahead, for making batch operations easy, there is a GetUpdate method to return updated records within a specified time frame.
- Seems like you'll be able to run arbitrary queries in the CRM database, and also Execute transactions.
- One potentially worrying aspect, is that the API does not apply entity access restrictions. I'm not versed in exactly what that means, but it sounds like you are connecting as the equivalent of "root" and can do almost whatever you want via the API.
- I couldn't find any mention of the API being rate limited to so-many queries or calls a day.
HT: Non-profit Tech Blog
Monday, October 8. 2007
The boundaries of the physical office don't mean as much as they used to. Wifi is available at nearly every coffee shop, airport, and even some parks. Along with mobility come some challenges, however, specifically when dealing with user support from afar. VNC is great in the office, but once your user steps outside the batcave, there's no telling what kind of environment they will be in. Does the firewall or access point they're behind allow VPN pass through? Can they tell you what IP they're at?
In most circumstances, the answer is hard to come by while your user is frazzled and short on time. What do you do? On the Windows side of the fence, things are greatly simplified by the folks at TeamViewer. This tidy and small app provides a way to circumvent the common problems of connectivity by assigning each side a unique ID, like a phone number. What's more, it doesn't even require an installation on either the technician or user's side. Point them to the download site, and 500k later, you have a connection that traverses NAT and firewalls without a lot of fuss. TeamViewer provides file transfer and a remote desktop control feature.
If you're supporting non-Windows clients, you can pair up Hamachi with VNC, though it should installed ahead of time. Hamachi gives the user a unique ID, just as Teamviewer does, which you can use to create a pipeline for a VNC session to use much in the same way as TeamViewer.
|
|
Comments
Mon, 21.04.2008 13:43
Hi Vikram-- Have you set up your repository in Subversive and successfully connected?
Mon, 21.04.2008 12:56
On checkout as.. dialog you asked to choose "Check out as a project configured using the New Project Wizard." That [...]
Fri, 11.04.2008 06:06
Excellent post, very detailed and I especially like the images. It helped me a lot with my set up.
Thu, 03.04.2008 01:55
How can they say that they're "not responsible" for "death, personal injury, or severe property or environmental [...]
Tue, 01.04.2008 23:29
Hi Marcel-- Make sure you follow the pre-step I have above and ensure you've installed one of the two Subversion [...]
Tue, 01.04.2008 20:21
Please help when I try to import, I don't have Other/Checkout Projects from SVN. Where is my problem?
Thu, 21.02.2008 05:58
Great, this was useful.
Wed, 13.02.2008 09:26
Thanks , I was trying hard & hard with http://www.eclipse.org/pdt/ins tall.php but success came here... thanks for the [...]
Wed, 06.02.2008 15:39
I had to find packages for e Redhat EL3 server, the following worked for me: Catdoc [...]
Wed, 30.01.2008 14:38
Yeah, it does. Just remember that a function is specific to the database in which it's created. If you're trying to call [...]
Wed, 30.01.2008 14:17
Awesome - I didn't know you could calculate and transform fields using functions that way. Does it work with custom [...]
Tue, 22.01.2008 20:49
Hehe... DNS hosting services I toyed with including in the tutorial but it's really a bit too much for the one-man (or [...]
Sat, 19.01.2008 18:51
Awesome, thanks very much for the guide!
Wed, 16.01.2008 14:21
I'd have to say that I like #2, its much more of a natural web experience to me. We don't always have to emulate [...]
Wed, 02.01.2008 16:10
It's fashionable this time of year to create some sort of year-end "best of" or "top ten" list [...]