
The Google appliance, which we use on campus to index web content, can index content in several different ways.

1. Crawling. This is the traditional method of Google indexing whereby you give it a URL and it hunts down any content under that location. This is the way we currently index the IS&T web site and other sites at MIT.

2. Database access. Content hidden away in a database (Oracle, for example) can also be indexed: you tell the appliance how to connect to the database and supply queries for it to execute.
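As a rough illustration of the database option, the appliance is typically given a "crawl query" whose result rows each become one indexed document. The table and column names below are hypothetical, not from any actual MIT configuration:

```sql
-- Hypothetical crawl query for a database data source:
-- each returned row is indexed as a separate document.
SELECT doc_id, title, body, last_modified
FROM campus_documents
WHERE published = 1
```

A corresponding serve-URL pattern (configured alongside the query) tells the appliance how to turn a row's key column into a clickable result link.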

3. Feed API. This method allows you to push content to Google for indexing. You can either push URLs or full content. In this way, content that is neither in a URL-accessible file system nor a network-enabled database can be indexed. Content in a run-time Alfresco CMS could be indexed in this way by the Google appliance.
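To give a flavor of the Feed API option, the sketch below builds a minimal XML content feed of the kind the appliance accepts. This is a hedged example, not our configuration: the data-source name (`alfresco_docs`) and record URL are hypothetical, and the exact feed schema should be checked against the appliance's feed protocol documentation before use.

```python
# Sketch: building a minimal GSA-style XML content feed.
# The datasource name and record URL are hypothetical placeholders.
import xml.etree.ElementTree as ET

def build_feed(datasource, records):
    """Build an XML feed document for a full content push.

    records: iterable of (url, mimetype, body) tuples.
    """
    feed = ET.Element("gsafeed")
    header = ET.SubElement(feed, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "full"
    group = ET.SubElement(feed, "group")
    for url, mimetype, body in records:
        rec = ET.SubElement(group, "record", url=url, mimetype=mimetype)
        ET.SubElement(rec, "content").text = body
    return ET.tostring(feed, encoding="unicode")

feed_xml = build_feed(
    "alfresco_docs",
    [("http://example.mit.edu/page1",
      "text/html",
      "<html><body>Page one</body></html>")])
print(feed_xml)
```

The assembled document would then be pushed to the appliance over HTTP (the feed protocol accepts a form post carrying the feed type, data-source name, and the XML itself); the push step is omitted here since it requires a live appliance.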

To summarize, these options give us considerable flexibility in how we design our web app while satisfying the requirement that Google provide the search capability.

MIT's license for the Google appliance limits us to 500,000 pages. According to Dave Conlon of IS&T, we are well within that limit. Since the IS&T web site work generally moves existing pages from one location to another rather than adding many new ones, I don't expect it to make an impact on the limits of the license.
