Yahoo!'s Term Extraction Service extracts significant words or phrases from a larger body of text. It has many uses, not the least of which is providing keywords, or tags in Web 2.0 jargon, to help classify and organize a library of content. The following PHP script uses the Term Extraction service to analyze a PDF file. With a little more work, it could be expanded to work with Microsoft Word, Excel, and PowerPoint files. Extracting keywords automatically would be a helpful feature to build into your blog or CMS. There are modules to extract keywords for Drupal and WordPress.
<?php
// discover where the pdftotext tool is
$catpdf = trim((string) `which pdftotext`);
if ($catpdf === '')
{
    exit("pdftotext was not found on this system\n");
}

// the PDF file to analyze
$source = 'http://example.com/my_file.pdf';

// copy the PDF to a local temporary file
$temp_pdf_file = tempnam(sys_get_temp_dir(), "ek");

// see below
download_file($source, $temp_pdf_file);

// save the text contents of the PDF to another temp file
$extract_file = tempnam(sys_get_temp_dir(), "ek");
exec($catpdf . ' ' . escapeshellarg($temp_pdf_file) . ' ' . escapeshellarg($extract_file));

// fetch and output terms
$contents = file_get_contents($extract_file);
if ($terms = get_yahoo_terms($contents))
{
    echo "\nYahoo terms for the file $source";
    foreach ($terms as $term)
    {
        echo "\n$term";
    }
    echo "\n";
}

// hide our footsteps
unlink($temp_pdf_file);
unlink($extract_file);

/**
 * Uses curl to copy $source to a local file $dest
 * @param string $source URL of the file to download
 * @param string $dest path of the local file to write
 */
function download_file($source, $dest)
{
    $out = fopen($dest, 'wb');

    $ch = curl_init();

    curl_setopt($ch, CURLOPT_FILE, $out);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_URL, $source);

    curl_exec($ch);

    curl_close($ch);

    // close the handle so everything is flushed before pdftotext reads the file
    fclose($out);
}

/**
 * Uses curl to query the Yahoo term extraction service for meaningful terms
 * @param string $content text to analyze
 * @return mixed array on success or null on failure
 */
function get_yahoo_terms($content)
{
    $SERVICE_URL = 'http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction';
    $app_id = 'F1_Testing';

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $SERVICE_URL);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    curl_setopt($ch, CURLOPT_POSTFIELDS, 'appid=' . urlencode($app_id) . '&context=' . urlencode($content) . '&output=php');
    $raw = curl_exec($ch);
    curl_close($ch);

    // curl_exec returns false when the request fails
    if ($raw === false)
    {
        return null;
    }

    // output=php asks the service for a serialized PHP array
    $data = unserialize($raw);
    if (is_array($data) && isset($data['ResultSet']['Result']))
    {
        return $data['ResultSet']['Result'];
    }

    return null;
}
?>
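The script calls unserialize() on data fetched over the network, which is worth treating with some care. As a minimal sketch, assuming PHP 7 or later, the parsing step could be pulled into its own function that refuses to create objects and always returns an array. The name parse_term_response() is hypothetical, and wrapping a lone string result in an array is a defensive assumption, not documented behavior of the service.

```php
<?php
/**
 * Hypothetical helper (not part of the original script): decode the
 * serialized response body and return the list of terms.
 *
 * @param string $raw response body requested with output=php
 * @return array list of terms, empty when the body cannot be used
 */
function parse_term_response($raw)
{
    // allowed_classes => false (PHP 7+) stops unserialize() from creating objects;
    // @ silences the notice emitted when $raw is not serialized data
    $data = @unserialize($raw, ['allowed_classes' => false]);

    if (!is_array($data) || !isset($data['ResultSet']['Result']))
    {
        return [];
    }

    // assumption: tolerate a single term arriving as a plain string,
    // and keep only string entries
    $terms = array_filter((array) $data['ResultSet']['Result'], 'is_string');

    return array_values($terms);
}

// usage with a hand-built sample in the shape the script expects
$sample = serialize(['ResultSet' => ['Result' => ['power plants', 'co2 emissions']]]);
print_r(parse_term_response($sample));
print_r(parse_term_response('not serialized data'));
```

Returning an empty array instead of null means the caller can loop over the result without checking it first.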
As a sample of what to expect, I ran the script against "Calculating CARMA: Global Estimation of CO2 Emissions from the Power Sector - Working Paper 145"; the list of terms returned is below. The list is fairly accurate and even includes the name of one of the authors.
global estimation
geographical scales
carbon emissions
co2 emissions
global citizens
global poverty
david wheeler
rigorous research
power plants
power sector
fossil energy
poverty and inequality
solar wind
energy sources
monitoring system
keystrokes
groundwork
strengths and weaknesses
carbon dioxide
aggregation
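To use terms like these as tags in a blog or CMS, they usually need some tidying first. The following is a minimal sketch and not part of the original script: terms_to_tags() is a hypothetical helper, and the slug format and the default limit of 10 are illustrative assumptions.

```php
<?php
/**
 * Hypothetical helper: turn extracted terms into URL-friendly tag slugs.
 *
 * @param array $terms terms as returned by the extraction step
 * @param int   $limit maximum number of tags to keep (assumed default)
 * @return array unique slugs, in their original order
 */
function terms_to_tags(array $terms, $limit = 10)
{
    $tags = [];
    foreach ($terms as $term)
    {
        // lowercase, then collapse every run of characters other than
        // a-z and 0-9 into a single hyphen
        $slug = trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($term)), '-');

        // skip empty results and duplicates
        if ($slug !== '' && !in_array($slug, $tags, true))
        {
            $tags[] = $slug;
        }
    }

    return array_slice($tags, 0, $limit);
}

// usage with a few of the terms listed above
print_r(terms_to_tags(['co2 emissions', 'David Wheeler', 'power plants', 'CO2 Emissions']));
```

For the sample call, the result is co2-emissions, david-wheeler, and power-plants; the second spelling of "CO2 Emissions" is dropped as a duplicate.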