Building a search engine with Appengine and Yahoo

About

Appengine is a new Google service that lets you write an application in Python and then rely on Google’s infrastructure when your application needs to scale. We will use Google’s Appengine to create a search website using Yahoo’s Developer API. Ah, the irony. You can see the complete application on appspot.

What do you already need to know

We will build the app in Python, so you need to know Python. No other knowledge is assumed.

Downloading Appengine SDK

Appengine has two parts: the Appengine servers in Google’s infrastructure, where you will deploy your code, and an SDK, which you will use to develop code locally. Download the SDK, and make sure that you add dev_appserver.py and appcfg.py to the system PATH.

An overview

You can download the completed application from here. The complete application consists of five files, which we will explore in detail below.

  1. app.yaml, the configuration file.
  2. search.py, the Python file with our code.
  3. index.html, the template shown when a search is done.
  4. form.html, the template shown with a search box.
  5. index.yaml, an autogenerated file.

Getting Started

You need to provide a configuration file to Appengine, with information about your application. The configuration is written in YAML, a very simple markup language. Create a directory to hold all your application files, and create a file named app.yaml in it. Edit this file and add these lines:

application: asdf
version: 1
runtime: python
api_version: 1

handlers:
- url: /.*
  script: search.py

Let us dissect each of these lines to see what they do.

  1. application: asdf: This gives the name of the application. On your local webserver you can use any name, but when you deploy to Appspot, you must own an application with that name there for uploads to work.

  2. version: 1: This determines the major version of your application and is mostly used for versioning at Google’s end.

  3. runtime: python: This tells Appengine which runtime to use. As of now Python is the only supported runtime.

  4. api_version: 1: The version of the API to use. Currently 1 is the only supported value.

  5. handlers:
    - url: /.*
      script: search.py

The handlers section maps URL patterns to the scripts that should handle them. The patterns are specified using regular expressions. The entry url: /.* tells Appengine to send every URL to the Python script search.py.
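To make the URL matching concrete, here is a minimal sketch of how a handler table could be consulted. It is written in plain Python 3 and is not Appengine's actual routing code. It assumes that handlers are tried in order and that a pattern has to match the whole path; the /static/.* entry and the pick_script helper are hypothetical, added only to show why order matters.

```python
import re

# Hypothetical handler table mirroring the `handlers:` section of app.yaml.
# The tutorial's app.yaml only has the "/.*" entry; "/static/.*" is made up.
HANDLERS = [
    (r"/static/.*", "static.py"),
    (r"/.*", "search.py"),
]

def pick_script(path):
    """Return the script of the first handler whose pattern matches the whole path."""
    for pattern, script in HANDLERS:
        if re.fullmatch(pattern, path):
            return script
    return None

print(pick_script("/"))                 # search.py
print(pick_script("/about/team"))       # search.py
print(pick_script("/static/logo.png"))  # static.py
```

Because /.* matches any path that starts with a slash, it works as a catch-all, which is why our single handler is enough to send every request to search.py.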

search.py: The Python code

Let us take a look at the Python code, which we will walk through in detail below:

import wsgiref.handlers

from google.appengine.ext import webapp
from google.appengine.ext.webapp import template
from google.appengine.api import urlfetch
from django.utils import simplejson
import urllib
import logging
from StringIO import StringIO

class MainPage(webapp.RequestHandler):
  def get(self):
    self.response.headers['Content-Type'] = 'text/html'
    query = self.request.get('q', '')
    if query:
      logging.debug('query: %s'% query)
      results = get_search_results('YLPjx2rV34F4hXcTnJYqYJUj9tANeqax76Ip2vADl9kKuByRNHgC4qafbATFoQ', query)
      results = results['Result']
      payload = dict(results=results, query=query)
      resp = template.render('index.html', payload)
    else:
      resp = template.render('form.html', {})
    self.response.out.write(resp)

def get_search_results(appid, query, region ='us', type = 'all', results = 10, start = 0, format ='any', adult_ok = "", similar_ok = "", language = "", country = "", site = "", subscription = "", license = ''):
    base_url = u'http://search.yahooapis.com/WebSearchService/V1/webSearch?'
    params = locals()
    result = _query_yahoo(base_url, params)
    return result['ResultSet']

def _query_yahoo(base_url, params):
    params['output'] = 'json'
    payload = urllib.urlencode(params)
    url = base_url + payload
    response = StringIO(urlfetch.fetch(url).content)
    result = simplejson.load(response)
    return result

def main():
  application = webapp.WSGIApplication(
                                       [('/', MainPage)],
                                       debug=True)
  wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
  main()

main is the function that runs when our script is invoked. It creates a WSGI application, whose job is to map URLs to handler classes. The next line runs the WSGI application through a CGI handler.

The class MainPage handles requests for the / URL. This class defines a method get, which is invoked in response to HTTP GET requests. Similarly, you can define put or post to handle the corresponding requests. Our form only makes GET requests, so we define only get. The line self.response.headers['Content-Type'] = 'text/html' sets a header on the response telling the browser that we are sending HTML back.
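The idea of mapping a URL to a class, and an HTTP verb to a method on that class, can be sketched without webapp at all. The following is a rough Python 3 illustration using a bare WSGI callable, not how webapp is implemented. The ROUTES table, the application function, and the call helper are all names invented for this sketch.

```python
from urllib.parse import parse_qs

class MainPage:
    """Stand-in for the tutorial's handler; only GET is implemented."""
    def get(self, params):
        query = params.get("q", [""])[0]
        return "results for " + query if query else "search form"

# Same idea as [('/', MainPage)] in the tutorial's main().
ROUTES = {"/": MainPage}

def application(environ, start_response):
    handler_class = ROUTES.get(environ.get("PATH_INFO", "/"))
    if handler_class is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    # Pick the method named after the HTTP verb: GET -> get, POST -> post.
    method = getattr(handler_class(), environ["REQUEST_METHOD"].lower(), None)
    if method is None:
        start_response("405 Method Not Allowed", [("Content-Type", "text/plain")])
        return [b"method not allowed"]
    body = method(parse_qs(environ.get("QUERY_STRING", "")))
    start_response("200 OK", [("Content-Type", "text/html")])
    return [body.encode("utf-8")]

def call(method, path, query_string=""):
    """Invoke the app once without a server and return (status, body)."""
    seen = {}
    def start_response(status, headers):
        seen["status"] = status
    environ = {"REQUEST_METHOD": method, "PATH_INFO": path,
               "QUERY_STRING": query_string}
    body = b"".join(application(environ, start_response))
    return seen["status"], body.decode("utf-8")

print(call("GET", "/"))               # ('200 OK', 'search form')
print(call("GET", "/", "q=python"))   # ('200 OK', 'results for python')
print(call("POST", "/"))              # ('405 Method Not Allowed', 'method not allowed')
```

Since MainPage defines no post method, a POST request has nowhere to go in this sketch, which mirrors why the tutorial only needs get.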

The GET or POST data is available on the request object, so we read the user’s query with self.request.get. get_search_results then queries Yahoo for web pages matching the query. Once we have the results, we show them by rendering the data with our templates. Let’s take a small diversion to learn about templates.

Templates in appengine

To create a webpage with dynamic data, webapp uses templates. You write the structure of the HTML and leave placeholders for the variables you need to insert. Appengine uses Django templates, which provide programming constructs such as loops and conditionals through tags. Let’s look at the template for the search results page:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
<title>searching {{query}}</title>
</head>
<body>
<h2>You searched for {{query}}</h2>

{%  for res in results  %}
<div class="result">
<h3><a href="{{res.ClickUrl}}">{{res.Title}}</a></h3>
<div class="summary">
{{res.Summary}}
</div>
<div class="extra">
<small >{{res.Url}}</small>
</div>
</div>
{%  endfor  %}

</body>
</html>

Most of this is simple HTML, but you can see a few new constructs, such as {{query}} and {%  for res in results  %}. {{...}} lets you insert variables that you passed from your Python script to this page. {% ... %} gives you access to looping, conditionals and other constructs. Here we used {%  for res in results  %} to loop over a list which we passed to this template. The end of the loop is marked by {% endfor %}. Inside the for loop you have access to the variable defined in the {% for ... %} tag, so inside the {% for %} we can use {{res}}. As results is a list of dictionaries, {{res}} is a dictionary. We can access any element of {{res}} using dotted notation, as we did with {{res.Summary}} and {{res.Url}}.
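If the template syntax feels unfamiliar, it may help to see roughly what the loop in index.html amounts to when written as ordinary code. This is only an illustration in plain Python 3, not how Django renders templates. The render_results function is a hypothetical name, and the escape calls are an addition of this sketch rather than something the tutorial's template is known to do.

```python
from html import escape

def render_results(query, results):
    """Plain-Python equivalent of the loop in index.html (illustrative only)."""
    parts = ["<h2>You searched for %s</h2>" % escape(query)]   # {{query}}
    for res in results:                                        # {% for res in results %}
        # {{res.ClickUrl}} and {{res.Title}} become dictionary lookups.
        parts.append('<h3><a href="%s">%s</a></h3>'
                     % (escape(res["ClickUrl"]), escape(res["Title"])))
        parts.append("<div>%s</div>" % escape(res["Summary"]))  # {{res.Summary}}
    return "\n".join(parts)                                    # after {% endfor %}

# Made-up sample data with the same keys the template uses.
sample = [{"Title": "Example", "ClickUrl": "http://example.com/click",
           "Summary": "An example page", "Url": "http://example.com/"}]
print(render_results("example", sample))
```

The dotted notation {{res.Title}} in the template plays the role of res["Title"] here, which is why a list of dictionaries is a convenient shape to pass in.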

Let’s see the other template:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
<title>Search</title>
</head>
<body>

<form action="." method="get">
<input type="text" name="q" value="" size="80" />
<br />
<input type="submit" name="submit" value="search" />

</form>

</body>
</html>

You will see that this is a simple HTML file with no template tags. Here we just needed a form, so we wrote a plain HTML page but still used Python to render it.

Back to search.py

If the user has done a search, the code to render the template is:

payload = dict(results=results, query=query)
resp = template.render('index.html', payload)

payload is a dictionary of variables, which we want to use in the template. We pass the results, and the query string to the template.

If the user has not done a search, the code which runs is resp = template.render('form.html', {}), which renders the form.html template with an empty dictionary.

We have two helper functions defined to talk to the Yahoo search API:

def get_search_results(appid, query, region ='us', type = 'all', results = 10, start = 0, format ='any', adult_ok = "", similar_ok = "", language = "", country = "", site = "", subscription = "", license = ''):
    base_url = u'http://search.yahooapis.com/WebSearchService/V1/webSearch?'
    params = locals()
    result = _query_yahoo(base_url, params)
    return result['ResultSet']

def _query_yahoo(base_url, params):
    params['output'] = 'json'
    payload = urllib.urlencode(params)
    url = base_url + payload
    response = StringIO(urlfetch.fetch(url).content)
    result = simplejson.load(response)
    return result

Appengine is a sandboxed Python runtime, so there are some limits on which Python functions you can call. urllib.urlopen is one such disallowed function. When we want to access external resources, we need to use Appengine’s urlfetch API instead. Using urlfetch.fetch, we get the results for the query in JSON format. We want to use this in Python, so we call simplejson.load to get the Python representation. simplejson is bundled with Django, which is bundled with Appengine.
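The two halves of this exchange, building the request URL and decoding the JSON reply, can be tried without any network access. The sketch below uses the Python 3 standard library (urllib.parse and json) rather than the Python 2 urllib and simplejson used in search.py. The function names build_url and parse_results, the YOUR_APP_ID placeholder, and the sample response are made up for illustration; only the ResultSet and Result keys come from the tutorial code.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://search.yahooapis.com/WebSearchService/V1/webSearch?"

def build_url(appid, query, results=10, start=0):
    """Append URL-encoded parameters to the base URL, as _query_yahoo does."""
    params = {"appid": appid, "query": query, "results": results,
              "start": start, "output": "json"}
    return BASE_URL + urlencode(params)

def parse_results(raw_json):
    """Decode a JSON reply and return the list stored under ResultSet/Result."""
    return json.loads(raw_json)["ResultSet"]["Result"]

# Made-up reply with the nesting that search.py expects.
SAMPLE = '{"ResultSet": {"Result": [{"Title": "Example", "Url": "http://example.com/"}]}}'

print(build_url("YOUR_APP_ID", "app engine"))
print(parse_results(SAMPLE)[0]["Title"])   # Example
```

Note that urlencode takes care of escaping, so the space in "app engine" becomes a plus sign in the final URL.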

How this all ties together

  1. You make a call to the server.
  2. The server uses app.yaml to find the script to call in response.
  3. Since we tied all responses to search.py, this is the script called.
  4. A WSGI application is created, which forwards requests for the URL /, with or without a querystring, to MainPage.
  5. If there is no querystring, the user has not searched for anything. form.html is rendered and returned.
  6. If there is a querystring, the user has searched, so get_search_results is called, which calls _query_yahoo.
  7. google.appengine.api.urlfetch.fetch is used to fetch results from the Yahoo API, instead of urllib.urlopen.
  8. The JSON response is converted to Python using django.utils.simplejson.
  9. The results and query string are passed to index.html and the template is rendered. This shows the results page.
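The branching in steps 5 to 9 can be condensed into one small function if the network call is passed in as an argument. This is a sketch in plain Python 3 under that assumption, not a rewrite of search.py. The handle function and the fake_fetch stub are hypothetical names, and fake_fetch merely stands in for what urlfetch.fetch(url).content would return.

```python
import json

def handle(query, fetch):
    """Return (template name, template variables) for one request."""
    if not query:                                       # step 5: no search yet
        return "form.html", {}
    raw = fetch(query)                                  # steps 6-7: ask the search API
    results = json.loads(raw)["ResultSet"]["Result"]    # step 8: JSON to Python
    return "index.html", {"results": results, "query": query}   # step 9

def fake_fetch(query):
    """Hypothetical stub that returns a canned JSON reply instead of calling Yahoo."""
    return json.dumps({"ResultSet": {"Result": [{"Title": "About " + query}]}})

print(handle("", fake_fetch))        # ('form.html', {})
print(handle("python", fake_fetch)[0])   # index.html
```

Passing fetch in as a parameter is a design choice made for this sketch: it lets you exercise the flow locally without an application ID or a network connection.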

Deploying the application

You can test the code locally using the SDK you installed. Navigate to the directory where you saved the files and run dev_appserver.py . (the trailing dot means the current directory). This starts the server on localhost:8080, where you can try your application. If you have an Appengine account, go to http://appengine.google.com and register your application. Edit app.yaml to change the application name from application: asdf to the name you registered, then deploy with the command appcfg.py update . from the same directory.

What next

I hope this tutorial has got you excited about the potential of Appengine. An infinitely scalable solution seems a tall order, but if Google delivers on it, it removes a lot of the headaches of web development. You can learn more about Appengine at Google code.