======================================================
Building a search engine with Appengine and Yahoo
======================================================

About
---------

Appengine_ is a new Google service that lets you write an application in Python and then rely on Google's infrastructure when your application needs to scale. We will use Google's Appengine to create a search website using `Yahoo's Developer API`_. Ah, the irony.

You can see the complete application at appspot_.

.. _Appengine: http://code.google.com/appengine
.. _Yahoo's Developer API: http://developer.yahoo.com/
.. _appspot: http://asdf.appspot.com/

What do you already need to know
---------------------------------------

We will build the app in Python, so you need to know Python. No other knowledge is assumed.

Downloading Appengine SDK
--------------------------

Appengine has two parts: the Appengine servers in Google's infrastructure, where you will deploy your code, and an SDK, which you will use to develop code locally.

`Download the SDK`_, and make sure that you add ``dev_appserver.py`` and ``appcfg.py`` to the system PATH.

.. _Download the SDK: http://code.google.com/appengine/downloads.html

An overview
------------

You can download the completed application from `here `_. The complete application consists of five files, which we will explore in detail below.

1. ``app.yaml``, the configuration file.
2. ``search.py``, the Python file with our code.
3. ``index.html``, the template shown when a search is done.
4. ``form.html``, the template shown with a search box.
5. ``index.yaml``, an autogenerated file.

Getting Started
------------------

You need to provide a configuration file to Appengine, with information about your application. The configuration is done using a YAML file, which is a very simple markup language. Create a directory where you will store all your application files, and create a file ``app.yaml`` in it.
Edit this file to contain these lines::

    application: asdf
    version: 1
    runtime: python
    api_version: 1

    handlers:
    - url: /.*
      script: search.py

Let us dissect each of these lines to see what they do.

1. ``application: asdf``: This gives the name of the application. On your local webserver you can use any name, but when you deploy to Appspot, you must own the application there for uploads to work.
2. ``version: 1``: This determines the major version of your application and is mostly used for versioning at Google's end.
3. ``runtime: python``: This tells Appengine which runtime to use. As of now Python is the only supported runtime.
4. ``api_version: 1``: The version of the API to use. Currently 1 is the only supported value.
5. ::

       handlers:
       - url: /.*
         script: search.py

   ``handlers`` maps URL patterns, specified as regular expressions, to the script to call when a matching URL is encountered. The regex ``url: /.*`` maps all URLs to the Python script ``search.py``.

Search.py: The python code.
-------------------------------

Let us take a look at the Python code, which we will go through in detail below::

    import wsgiref.handlers
    from google.appengine.ext import webapp
    from google.appengine.ext.webapp import template
    from google.appengine.api import urlfetch
    from django.utils import simplejson
    import urllib
    import logging
    from StringIO import StringIO

    class MainPage(webapp.RequestHandler):
        def get(self):
            self.response.headers['Content-Type'] = 'text/html'
            query = self.request.get('q', '')
            if query:
                logging.debug('query: %s' % query)
                results = get_search_results('YLPjx2rV34F4hXcTnJYqYJUj9tANeqax76Ip2vADl9kKuByRNHgC4qafbATFoQ', query)
                results = results['Result']
                payload = dict(results=results, query=query)
                resp = template.render('index.html', payload)
            else:
                resp = template.render('form.html', {})
            self.response.out.write(resp)

    def get_search_results(appid, query, region='us', type='all', results=10,
                           start=0, format='any', adult_ok="", similar_ok="",
                           language="", country="", site="", subscription="",
                           license=''):
        # Capture the arguments before defining any other local, so that
        # only the arguments are sent to Yahoo as query parameters.
        params = locals()
        base_url = u'http://search.yahooapis.com/WebSearchService/V1/webSearch?'
        result = _query_yahoo(base_url, params)
        return result['ResultSet']

    def _query_yahoo(base_url, params):
        params['output'] = 'json'
        payload = urllib.urlencode(params)
        url = base_url + payload
        # urlfetch replaces urllib.urlopen inside the Appengine sandbox.
        response = StringIO(urlfetch.fetch(url).content)
        result = simplejson.load(response)
        return result

    def main():
        application = webapp.WSGIApplication([('/', MainPage)], debug=True)
        wsgiref.handlers.CGIHandler().run(application)

    if __name__ == "__main__":
        main()

``main`` is the first function called when our script is run. It creates a WSGI application, which has the job of mapping URLs to classes. The next line runs the WSGI application.

The class ``MainPage`` is used in response to ``/`` URLs. This class defines a method ``get``, which is invoked in response to HTTP ``GET`` requests. Similarly, you can define ``put`` or ``post`` to handle the corresponding requests. Here our form only makes ``GET`` requests, so we define ``get``.

The line ``self.response.headers['Content-Type'] = 'text/html'`` sets a header on the response, telling the browser that we will be sending HTML back.

The ``GET`` or ``POST`` data is in the ``request`` object, so we get the user's query from ``request.get``. ``get_search_results`` queries Yahoo to find web pages matching the query. Once we have the results, we can show them by rendering the data with our templates. Let's take a small diversion to learn about templates.

Templates in appengine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To create a webpage with dynamic data, webapp uses templates. You create the structure of the HTML and provide placeholders for the variables you need to insert. Appengine uses Django templates, which provide programming constructs like looping and ``if`` using tags. Let's look at the template for the search results page::

    searching {{query}}

    You searched for {{query}}
    {% for res in results %}
    {{res.Title}}
    {{res.Summary}}
    {{res.Url}}
    {% endfor %}

Most of this is simple HTML, but you can see a few new constructs, such as ``{{query}}`` and ``{% for res in results %}``.

``{{...}}`` lets you insert variables that you passed from your Python script into the page. ``{% ... %}`` gives you access to looping, conditionals and other constructs. Here we used ``{% for res in results %}`` to loop over an array that we passed to the template. The end of the loop is marked by ``{% endfor %}``.

Inside the for loop you have access to the variable defined in the ``{% for ... %}`` tag, so inside ``{% for %}`` we could use ``{{res}}``. As ``results`` is an array of dictionaries, ``res`` is a dictionary. We can access any element of ``res`` using dotted notation, as we did with ``{{res.Summary}}`` and ``{{res.Url}}``.

Let's see the other template::

    Search

You will see that this is a simple HTML file with no Appengine-specific tags. Here we just needed a form, so we wrote a plain HTML page but still used Python to render it.

Back to search.py
~~~~~~~~~~~~~~~~~~~~~~~~

If the user has done a search, the code to render the template is::

    payload = dict(results=results, query=query)
    resp = template.render('index.html', payload)

``payload`` is a dictionary of the variables we want to use in the template. We pass the results and the query string to the template.

If the user has not done any search, the code which runs is ``resp = template.render('form.html', {})``, which renders the ``form.html`` template with an empty dictionary.

We have two helper functions defined to talk to the Yahoo search API::

    def get_search_results(appid, query, region='us', type='all', results=10,
                           start=0, format='any', adult_ok="", similar_ok="",
                           language="", country="", site="", subscription="",
                           license=''):
        # Capture the arguments before defining any other local, so that
        # only the arguments are sent to Yahoo as query parameters.
        params = locals()
        base_url = u'http://search.yahooapis.com/WebSearchService/V1/webSearch?'
        result = _query_yahoo(base_url, params)
        return result['ResultSet']

    def _query_yahoo(base_url, params):
        params['output'] = 'json'
        payload = urllib.urlencode(params)
        url = base_url + payload
        # urlfetch replaces urllib.urlopen inside the Appengine sandbox.
        response = StringIO(urlfetch.fetch(url).content)
        result = simplejson.load(response)
        return result

Appengine is a sandboxed Python runtime, and hence there are some limitations on which Python functions you can call. ``urllib.urlopen`` is one such disallowed function. When we want to access external resources, we need to use Appengine's ``urlfetch`` module instead. Using this call, we get the results for the query in JSON format. We want to use this in Python, so we use ``simplejson.load`` to get the Python representation. ``simplejson`` is bundled with Django, which is bundled with Appengine.

How this all ties together
----------------------------------

1. You make a call to the server.
2. The server uses ``app.yaml`` to find the script to call in response.
3. Since we tied all responses to ``search.py``, this is the script called.
4. A WSGI application is created, which forwards all requests for the URL ``/``, with or without a querystring, to ``MainPage``.
5. If there is no querystring, the user did not search for anything. ``form.html`` is rendered and returned.
6. If there is a querystring, the user searched, so ``get_search_results`` is called, which calls ``_query_yahoo``.
7. ``google.appengine.api.urlfetch.fetch`` is used to fetch results from the Yahoo API, instead of ``urllib.urlopen``.
8. The JSON response is converted to Python using ``django.utils.simplejson``.
9. The results and query string are passed to ``index.html`` and the template is rendered. This shows the results page.

Deploying the application.
----------------------------------

You can test the code locally using the SDK which you installed. Navigate to the directory where you saved the files and run ``dev_appserver.py .``. This will start the server on ``localhost:8080``, where you can use your application.

If you have an Appengine account, go to ``_, and register your application. Edit the ``app.yaml`` file to change the name of the application, ``application: asdf``, to your application's name, and deploy the application using the command ``appcfg.py update .``.

What next
-----------------

I hope this tutorial has got you excited about the potential of Appengine. An infinitely scalable solution seems a tall order, but if Google delivers on it, it takes away a lot of the headaches of web development. You can learn more about Appengine at Google code.
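
If you want to experiment before deploying, the URL-building and JSON-decoding steps of ``_query_yahoo`` can be tried outside Appengine. The following is only a minimal sketch, not the tutorial's code. It assumes Python 3, where ``urlencode`` lives in ``urllib.parse`` and ``json`` is in the standard library. The helper names ``build_search_url`` and ``extract_results``, the app id ``my-app-id``, and the sample response body are all hypothetical; the body is made up to mirror the ``ResultSet``/``Result`` shape used above. No network call is made.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'http://search.yahooapis.com/WebSearchService/V1/webSearch?'


def build_search_url(appid, query, results=10, start=0):
    """Hypothetical helper: assemble the request URL the way _query_yahoo does."""
    params = {
        'appid': appid,
        'query': query,
        'results': results,
        'start': start,
        'output': 'json',  # ask for JSON, as the tutorial does
    }
    # urlencode escapes spaces and special characters in the values.
    return BASE_URL + urlencode(params)


def extract_results(raw_json):
    """Hypothetical helper: decode a response body and return the result list."""
    return json.loads(raw_json)['ResultSet']['Result']


# 'my-app-id' is a placeholder, not a real Yahoo application id.
url = build_search_url('my-app-id', 'app engine')

# Round-trip the querystring to see what would be sent.
decoded = parse_qs(urlsplit(url).query)
print(decoded['query'], decoded['output'])

# A made-up response body with the keys the results template reads.
sample_body = json.dumps({
    'ResultSet': {
        'Result': [
            {'Title': 'Example', 'Summary': 'An example page.',
             'Url': 'http://example.com/'},
        ]
    }
})
for res in extract_results(sample_body):
    print(res['Title'], res['Url'])
```

Keeping URL construction separate from the fetch makes it easy to check what would be sent to Yahoo without spending a request; inside Appengine the fetch itself would still go through ``urlfetch.fetch``.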