Yahoo!'s Term Extraction Service extracts significant words or phrases from a larger body of text. It has many uses, not the least of which is providing keywords (tags, in Web 2.0 jargon) to help classify and organize a library of content. The following PHP script uses the Term Extraction service to analyze a PDF file. It relies on the pdftotext command-line tool and PHP's curl extension. With a little more work, it could be expanded to handle Microsoft Word, Excel, and PowerPoint files. Extracting keywords automatically would be a helpful feature to build into your blog or CMS. There are modules to extract keywords for Drupal and WordPress.
<?php
// discover where the pdftotext tool is
$catpdf = trim((string) shell_exec('which pdftotext'));
if ($catpdf === '')
{
    exit("pdftotext was not found in the PATH\n");
}

// the PDF file to analyze
$source = 'http://example.com/my_file.pdf';

// the PDF is copied to a local temporary file
$temp_pdf_file = tempnam(sys_get_temp_dir(), "ek");

// see below
download_file($source, $temp_pdf_file);

// save text contents of pdf source to another temp file
$extract_file = tempnam(sys_get_temp_dir(), "ek");
exec(escapeshellarg($catpdf) . ' ' . escapeshellarg($temp_pdf_file) . ' ' . escapeshellarg($extract_file));

// fetch and output terms
$contents = file_get_contents($extract_file);
if ($terms = get_yahoo_terms($contents))
{
    echo "\nYahoo terms for the file $source";
    foreach ($terms as $term)
    {
        echo "\n$term";
    }
    echo "\n";
}

// clean up the temporary files
unlink($temp_pdf_file);
unlink($extract_file);

/**
 * Uses curl to copy $source to a local file $dest
 * @param string $source URL of the file to download
 * @param string $dest path of the local file to write
 * @return bool true on success, false on failure
 */
function download_file($source, $dest)
{
    $out = fopen($dest, 'wb');
    if ($out === false)
    {
        return false;
    }

    $ch = curl_init();

    curl_setopt($ch, CURLOPT_FILE, $out);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_URL, $source);

    $ok = curl_exec($ch);

    curl_close($ch);

    // close the handle so the data is flushed to disk before pdftotext reads it
    fclose($out);

    return $ok !== false;
}

/**
 * Uses curl to query the Yahoo Term Extraction service for meaningful terms
 * @param string $content text to analyze
 * @return mixed array on success or null on failure
 */
function get_yahoo_terms($content)
{
    $SERVICE_URL = 'http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction';
    $app_id = 'F1_Testing';

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $SERVICE_URL);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

    curl_setopt($ch, CURLOPT_POSTFIELDS, 'appid=' . urlencode($app_id) . '&context=' . urlencode($content) . '&output=php');
    $raw = curl_exec($ch);
    curl_close($ch);

    // curl_exec() returns false on failure; there is nothing to unserialize then
    if ($raw === false)
    {
        return null;
    }

    // the response is serialized PHP; refuse to instantiate objects from remote data
    $data = unserialize($raw, array('allowed_classes' => false));
    if (is_array($data) && isset($data['ResultSet']['Result']))
    {
        return $data['ResultSet']['Result'];
    }

    return null;
}
?>
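The introduction mentions extending the script to Word, Excel, and PowerPoint files. One way to approach that is to pick a text-extraction command based on the file extension. The sketch below is only an illustration: the function name build_extract_command() is hypothetical, and the choice of catdoc, xls2csv, and catppt (tools from the catdoc package, which handle the older binary .doc, .xls, and .ppt formats) is an assumption, not something the original script uses. Each tool has to be installed separately, and each command here prints the extracted text to standard output.

```php
<?php
/**
 * Hypothetical helper: builds a shell command that prints the text of
 * $file to stdout, or returns null when the extension is not supported.
 * @param string $file path of the local file
 * @param string|null $extension overrides the extension taken from $file
 * @return string|null
 */
function build_extract_command($file, $extension = null)
{
    // assumed tool choices; adjust to whatever is installed on your system
    $tools = array(
        'pdf' => 'pdftotext %s -', // "-" sends the text to stdout
        'doc' => 'catdoc %s',
        'xls' => 'xls2csv %s',
        'ppt' => 'catppt %s',
    );

    if ($extension === null)
    {
        $extension = pathinfo($file, PATHINFO_EXTENSION);
    }
    $extension = strtolower($extension);

    if (!isset($tools[$extension]))
    {
        return null;
    }

    return sprintf($tools[$extension], escapeshellarg($file));
}

// show the command that would be run for a Word document
echo build_extract_command('/tmp/report.doc'), "\n";
```

Because tempnam() creates files without an extension, the second parameter lets you pass the extension taken from the original URL, for example build_extract_command($temp_file, 'xls'). The resulting command could then be run with shell_exec() and its output handed to get_yahoo_terms().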
As a sample of what to expect, I ran the script against "Calculating CARMA: Global Estimation of CO2 Emissions from the Power Sector - Working Paper 145"; the terms it returned are listed below. The list is fairly accurate, and even includes the name of one of the authors.
global estimation
geographical scales
carbon emissions
co2 emissions
global citizens
global poverty
david wheeler
rigorous research
power plants
power sector
fossil energy
poverty and inequality
solar wind
energy sources
monitoring system
keystrokes
groundwork
strengths and weaknesses
carbon dioxide
aggregation
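To use a list like this as tags in a blog or CMS, the terms usually need a little normalization first. The following is a minimal sketch, assuming you want lowercase, hyphenated, de-duplicated tags; the function name terms_to_tags() and the default limit of 10 are made up for illustration and are not part of the Yahoo service or of any Drupal or WordPress module.

```php
<?php
/**
 * Hypothetical helper: turns extracted terms into a short list of unique tag slugs.
 * @param array $terms terms as returned by get_yahoo_terms()
 * @param int $limit maximum number of tags to keep
 * @return array
 */
function terms_to_tags(array $terms, $limit = 10)
{
    $tags = array();
    foreach ($terms as $term)
    {
        // lowercase, then collapse every run of other characters into one hyphen
        $slug = trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($term)), '-');

        // skip empty results and terms that differ only in case or punctuation
        if ($slug !== '' && !in_array($slug, $tags, true))
        {
            $tags[] = $slug;
        }
    }

    return array_slice($tags, 0, $limit);
}

$terms = array('global estimation', 'co2 emissions', 'CO2 Emissions', 'david wheeler');
print_r(terms_to_tags($terms));
// keeps global-estimation, co2-emissions, david-wheeler
```

The terms come back in the order the service returns them, so array_slice() simply keeps the first few; whether that order reflects importance is not something this sketch assumes.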