/search?q=<REQ>&num=<NUM>&start=<START>&output=<FORMAT>&client=<PARTNER>
where:
chicken teriyaki
in his/her browser window to do a search, the partner's software might get its first page of chicken teriyaki results from Google by sending the following query to bar.google.com on port 80:
GET /search?q=chicken+teriyaki&start=0&output=xml&client=foo
This is only an example, of course; Google will inform each of its partners individually what machine and port number to use for queries (as well as the precise format for those queries).
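As a rough illustration of assembling such a request path, the sketch below uses Python's standard urllib.parse.urlencode. The helper name build_search_path is hypothetical, and the real host, port, and parameter set are whatever Google specifies for a given partner.

```python
from urllib.parse import urlencode


def build_search_path(query, client, start=0, output="xml", num=None):
    """Return a /search request path for an end user's query.

    Hypothetical helper: parameter names follow the template above, and
    the optional num parameter is sent only when the caller supplies it.
    """
    params = {"q": query}
    if num is not None:
        params["num"] = num
    params["start"] = start
    params["output"] = output
    params["client"] = client
    # urlencode URL-escapes each value and encodes spaces as '+'.
    return "/search?" + urlencode(params)


print(build_search_path("chicken teriyaki", client="foo"))
```

Keeping the escaping inside one helper means the partner's code never has to concatenate raw end-user input into a URL by hand.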
The MIME type (if any) returned by Google's webserver for results outputs may not be accurate, and should be ignored.
A minor upgrade of this sort will typically change only the "minor" component of a results protocol's version. E.g., a protocol's version might change from "3.0" to "3.1" or from "4A.5" to "4A.6".
Despite the fact that results output protocol parsers should be written to handle minor upgrades to results output protocols completely transparently, Google will notify partners at least 1 business day in advance of making any such upgrades. This way, partners have a chance to review their output protocol parsers to ensure that they correctly ignore any "unexpected information".
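One way a parser might tolerate minor upgrades is to compare only the major component of the version string. The sketch below assumes that the version always has the form "major.minor" with the split at the first period, which matches the examples above but is not stated as a rule; the function names are hypothetical.

```python
def split_version(version):
    """Split a version such as "3.1" or "4A.6" into (major, minor)."""
    major, _, minor = version.partition(".")
    return major, minor


def is_compatible(version, expected_major):
    """True when the version differs from the expected one only in its minor part."""
    major, _ = split_version(version)
    return major == expected_major


print(is_compatible("4A.6", "4A"))
print(is_compatible("4.0", "3"))
```

A parser written this way keeps working across "3.0" to "3.1" and only refuses input when the major component changes.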
Major upgrades of a protocol may be followed by subsequent discontinuation (see below) of the original protocol.
The way most partners will set up their Google search services, one way for end users to perform "related:" searches is by simply entering a query into the partner's search box. If an end user enters the query:
related:<URL>
on his or her browser, the browser will URL-escape the string "related:<URL>" and send it to the partner. The partner will then send that URL-escaped string to Google as the user's query, as usual. Google will send back its results page (although if Google doesn't have related-page information for the page in question, the Google response will not have any actual results on it).
More typically, however, partners will cause "related:" searches to be done in a different way. A partner might display a "related" link next to each search result displayed on a results page; clicking on it might cause the partner to send Google a "related:" query. Note that such a query should be indistinguishable by Google from the Google query resulting from an end user performing a "related:" query-- the query string sent to Google should be properly URL-escaped. E.g., to have Google perform a "related:" search on the URL:
http://www.foo.com/frob.asp?A=bb&C=dd
the partner might send to Google the query
GET /search?q=related%3Ahttp%3A%2F%2Fwww.foo.com%2Ffrob.asp%3FA%3Dbb%26C%3Ddd&output=xml&client=foo
A "related:" query may fail to return any results. Each of Google's output protocols has a way of indicating whether or not a particular search result has "related:" information available for it, and Google partners shouldn't display "related" links next to search results for which no "related:" information is available.
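The URL-escaping of a "related:" query can be sketched with Python's urllib.parse.quote. The helper name related_query_path is hypothetical, and the output and client values are simply the ones used in the example above.

```python
from urllib.parse import quote


def related_query_path(url, client, output="xml"):
    """Return a request path for a "related:" search on the given URL."""
    # safe="" makes quote() escape every reserved character,
    # including ':', '/', '?', '=' and '&'.
    q = quote("related:" + url, safe="")
    return f"/search?q={q}&output={output}&client={client}"


print(related_query_path("http://www.foo.com/frob.asp?A=bb&C=dd", client="foo"))
```

Escaping the whole "related:<URL>" string as a single value keeps the '&' inside the target URL from being read as a separator between query parameters.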
If desired, a partner can allow its end users to make use of Google's cache. To do this, the page that a partner returns can have links which point directly to Google's cache. To enable end users to use Google's cached text for, e.g., the document:
http://www.foo.com/frob.asp?A=bb&C=dd
the partner should create a hyperlink on its page to something like:
As with "related:" searches and "link:" searches, each of Google's output protocols has a way of indicating whether or not a particular search result has "cache:" information available for it.
If a partner creates results pages containing links to other results pages past the next page, then it will sometimes be the case that the destinations of some of these links actually will return no results. For example, a partner's end user might perform a query which Google services, returning an estimate of 40 total results for the query to the partner. The partner might reasonably then create a results page displaying results 1-10 and containing links to results page 2 (with results 11-20); results page 3 (with results 21-30); and results page 4 (with results 31-40). However, it is possible that, despite Google's original estimate of 40 total results, only 20 results are actually available for the query. This can happen because of Google's advanced filtering capabilities, which can weed out duplicate and other undesirable results. In the example at hand, this means that results page 3 and results page 4 will actually not return any results.
Although this kind of unfortunate event need not be considered serious, some partners may wish to avoid it. For such partners, Google suggests not displaying a full-fledged "navigation bar" on results pages. Instead, Google suggests that these partners consider simply displaying "Previous" and "Next" links which go to the previous page and next page of results, respectively. The results information that the partner receives in response to a Google query indicates whether it is appropriate for the partner to place a "Previous" link or a "Next" link on its own results page. Because Google results are presented in decreasing order of quality, a typical user will be more likely to find the information he or she seeks on an earlier results page, instead of on a later results page (and the most likely page to find the information sought is the first results page). Therefore, typical users won't be skipping ahead (e.g., skipping straight from the first results page to the fourth results page for a query), anyway.
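For partners using the XML output, the "Previous"/"Next" approach might look like the sketch below, which reads the optional PU and NU elements with Python's xml.etree.ElementTree. The sample response, including its VER value and URLs, is made up for illustration, and navigation_links is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hypothetical response fragment, shaped to fit the DTD in this document.
SAMPLE = """<GSP VER="1.0"><TM>0.12</TM><Q>chicken teriyaki</Q>
<RES SN="11" EN="20"><M>40</M>
<NB><PU>/search?q=chicken+teriyaki&amp;start=0</PU>
<NU>/search?q=chicken+teriyaki&amp;start=20</NU></NB>
</RES></GSP>"""


def navigation_links(xml_text):
    """Return (previous_url, next_url); each is None when Google omits it."""
    root = ET.fromstring(xml_text)
    # findtext returns None when the element is absent.
    return root.findtext("RES/NB/PU"), root.findtext("RES/NB/NU")


prev_url, next_url = navigation_links(SAMPLE)
print(prev_url, next_url)
```

A results page built this way shows a "Next" link only when NU is present, so it never links to a page that Google has already indicated does not exist.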
At the present time, the DTD describing Google's XML results has the following contents:
<!ELEMENT GSP (TM, Q, CT?, TT?, SC*, RES?)>
<!ATTLIST GSP VER CDATA #REQUIRED>
<!ELEMENT TM (#PCDATA)>
<!ELEMENT Q (#PCDATA)>
<!ELEMENT CT (#PCDATA)>
<!ELEMENT TT (#PCDATA)>
<!ELEMENT SC (#PCDATA)>
<!ELEMENT RES (M, NB?, R*)>
<!ATTLIST RES SN CDATA #REQUIRED
EN CDATA #REQUIRED>
<!ELEMENT M (#PCDATA)>
<!ELEMENT NB (PU?, NU?)>
<!ELEMENT PU (#PCDATA)>
<!ELEMENT NU (#PCDATA)>
<!ELEMENT R (U, T?, RK, F*, S?, HAS)>
<!ATTLIST R N CDATA #REQUIRED
L CDATA "1">
<!ELEMENT U (#PCDATA)>
<!ELEMENT T (#PCDATA)>
<!ELEMENT RK (#PCDATA)>
<!ELEMENT F (#PCDATA)>
<!ELEMENT S (#PCDATA)>
<!ELEMENT HAS (CI?, L?, C?, RT?)>
<!ELEMENT CI (RC, DT?, DS?)>
<!ELEMENT RC (#PCDATA)>
<!ELEMENT DT (#PCDATA)>
<!ELEMENT DS (#PCDATA)>
<!ELEMENT L EMPTY>
<!ATTLIST L TAG CDATA "link:">
<!ELEMENT C EMPTY>
<!ATTLIST C TAG CDATA "cache:"
SZ CDATA #REQUIRED>
<!ELEMENT RT EMPTY>
<!ATTLIST RT TAG CDATA "related:">
Character | Escaped as |
< | either &lt; or &#60; |
& | either &amp; or &#38; |
> | either &gt; or &#62; |
' | either &apos; or &#39; |
" | either &quot; or &#34; |
All other characters will be presented without modification.
In other words, partners should take the values for elements that Google sends and unescape only these five characters. This unescaping should be performed only for elements which do not have element content; by performing this unescaping, the partner will recover the correct value of the element. If the element is described as containing HTML, the newly unescaped string will be valid HTML suitable for displaying by inserting it into an HTML document. If the element is described as containing a URL, the newly unescaped string will be valid HTML suitable for using as a link destination in a document (i.e., suitable for assigning to an href attribute); to have a browser actually display an element which is described as containing a URL, the newly unescaped string should be HTML-escaped.
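For a hand-written parser, the unescaping step might be sketched as below. It assumes that the numeric forms are decimal character references (such as "&#60;"), which is a guess about the exact encoding; the names are hypothetical. A conforming XML parser performs this unescaping on its own, so a helper like this matters only when the output is parsed manually.

```python
import re

# The five characters, in named form and in assumed decimal numeric form.
_NAMED = {"lt": "<", "amp": "&", "gt": ">", "apos": "'", "quot": '"'}
_NUMERIC = {"60": "<", "38": "&", "62": ">", "39": "'", "34": '"'}
_ESCAPE = re.compile(r"&(?:(lt|amp|gt|apos|quot)|#(60|38|62|39|34));")


def unescape_value(raw):
    """Unescape only the five XML-escaped characters, in a single pass."""
    def replace(match):
        if match.group(1):
            return _NAMED[match.group(1)]
        return _NUMERIC[match.group(2)]
    # A single re.sub pass never rescans its own output, so "&amp;lt;"
    # becomes "&lt;" rather than "<".
    return _ESCAPE.sub(replace, raw)


print(unescape_value("1 &amp;lt; pi"))
```

Doing the replacement in one pass matters: unescaping "&amp;" first and then scanning again for "&lt;" would undo one level of escaping too many.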
Further information and comments about the tags are listed below the table.
Tag Name | Meaning of Contents | Format | Attributes |
GSP | The entire output from Google (GSP: "Google Search Protocol") | Contains a TM; a Q; an optional CT; an optional TT; any number of SC's; and an optional RES | VER |
TM | Total search time in seconds | A floating-point number | |
Q | The search query submitted, suitable for viewing | HTML | |
CT | Search comments | HTML | |
TT | Search tips | HTML | |
SC | A directory category relevant to the search as a whole | A string (needs HTML-escaping to view; needs URL-escaping to put in a URL) | |
RES | The search results themselves | Contains an M; an optional NB; and any number of R's | SN; EN |
M | The estimated total number of results for the search | An integer | |
NB | The search navigation bar | Contains an optional PU and an optional NU | |
PU | Relative URL for the previous results page | [Relative] URL (needs HTML-escaping to view) | |
NU | Relative URL for the next results page | [Relative] URL (needs HTML-escaping to view) | |
R | A single search result | Contains a U; an optional T; an RK; any number of F's; an optional S; and a HAS | N; L |
U | The URL of a single search result | [Absolute] URL (needs HTML-escaping to view) | |
T | The title of a single search result | HTML | |
RK | Google's rating of how good a single search result is | An integer in the range 0-10, inclusive | |
F | Special-purpose field | Potentially anything | |
S | A document snippet of a single search result | HTML | |
HAS | Indicates what "special" features are available for this document | Contains an optional CI; an optional L; an optional C; and an optional RT | |
CI | Directory category information for a single search result | Contains an RC; an optional DT; and an optional DS | |
RC | A directory category for a single search result | A string (needs HTML-escaping to view; needs URL-escaping to put in a URL) | |
DT | The title listed in the directory for a single search result | HTML | |
DS | The summary listed in the directory of a single search result | HTML | |
L | If present, indicates that Google has backlinks information for this document | (empty) | |
C | If present, indicates that Google has this document in its cache | (empty) | SZ |
RT | If present, indicates that Google has GoogleScout ("related:") information for this document | (empty) | |
If there are no appropriate search comments to put in the optional CT element, CT will not be present. The same holds for the search tips in the optional TT element and the actual search results in the RES element.
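A parser for the XML output might be sketched as follows, using Python's xml.etree.ElementTree. The sample document is invented to fit the DTD above (its VER value, URLs, and the extra NEW element are hypothetical), and parse_results is a hypothetical name. Looking elements up by name, rather than by position, means unexpected elements are skipped without any special handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical response; <NEW> stands in for an element added by a minor upgrade.
SAMPLE = """<GSP VER="1.0">
<TM>0.08</TM><Q>chicken teriyaki</Q>
<RES SN="1" EN="2"><M>40</M>
<R N="1"><U>http://www.foo.com/</U><T>Foo &lt;b&gt;Teriyaki&lt;/b&gt;</T><RK>7</RK>
<HAS><C SZ="8k"/><RT/></HAS></R>
<R N="2" L="2"><U>http://www.foo.com/menu</U><RK>5</RK><NEW>ignored</NEW><HAS/></R>
</RES></GSP>"""


def parse_results(xml_text):
    """Return one dict per R element; optional parts come back as None or False."""
    root = ET.fromstring(xml_text)
    results = []
    for r in root.findall("RES/R"):  # no RES element means no results
        has = r.find("HAS")
        cache = has.find("C") if has is not None else None
        results.append({
            "n": int(r.get("N")),
            # ElementTree does not apply DTD defaults, so supply "1" here.
            "level": int(r.get("L", "1")),
            "url": r.findtext("U"),
            "title": r.findtext("T"),
            "rank": int(r.findtext("RK")),
            "has_link": has is not None and has.find("L") is not None,
            "cache_size": cache.get("SZ") if cache is not None else None,
            "has_related": has is not None and has.find("RT") is not None,
        })
    return results


for result in parse_results(SAMPLE):
    print(result)
```

The has_link, cache_size, and has_related fields are what a partner would consult before drawing "link:", "cache:", or "related:" links next to a result.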
1 < pi
into a browser search box, the partner's server sees this value as:
1+%3C+pi
and therefore sends a query to Google like:
GET /search?q=1+%3C+pi
Google unescapes this query to see the end user's original query:
1 < pi
To make this string into something that can be put into an HTML document, relevant HTML characters are escaped. In this case, only the '<' needs to be escaped, yielding:
1 &lt; pi
Finally, to create the value of the Q element, Google escapes all characters which need to be escaped for XML. This produces the string:
1 &amp;lt; pi
This is the exact sequence of bytes that Google sends to the partner in between <Q> and </Q>. The partner takes what it receives from Google and unescapes the characters that Google escaped, yielding
1 &lt; pi
This text is a valid HTML snippet, and is ready to put into a document.
The above process is admittedly somewhat convoluted, but is really only presented in this amount of detail for completeness. The only thing a partner needs to do to display Q is the same as for any other element containing HTML: unescape the characters that are escaped in the XML format, and then output the result as HTML.
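The round trip described above can be reproduced with Python's standard library. This is only an illustration of the escaping layers; it says nothing about how Google implements them, and the variable names are arbitrary.

```python
import html
from urllib.parse import quote_plus, unquote_plus
from xml.sax.saxutils import escape as xml_escape, unescape as xml_unescape

typed = "1 < pi"                              # what the end user enters
sent = quote_plus(typed)                      # URL-escaped q parameter
received = unquote_plus(sent)                 # the query as Google sees it
as_html = html.escape(received, quote=False)  # made safe for an HTML document
on_wire = xml_escape(as_html)                 # escaped again for XML
displayed = xml_unescape(on_wire)             # what the partner recovers

for step in (sent, received, as_html, on_wire, displayed):
    print(step)
```

The value the partner recovers equals the HTML-escaped form, not the raw query, which is why it can be written into a results page as it stands.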
"in" is a very common word and was ignored. [details]
Tip: in most browsers you can just hit the return key instead of clicking on the search button.
However, since Google doesn't know how its partners' customers have navigated to get their search results, this particular search tip is not relevant for its partners, and will not be returned in search results. This example is only presented to indicate the general flavor intended for search tips. At the present time, Google may not return any search tips.
To make an HTML-printable string from a category, the category should be HTML-escaped by substituting escaped values for each of the five characters
< & > ' "
This is the same process that should be applied to make any URL that Google returns into something viewable.
In addition to this HTML-escaping, partners may well want to make other modifications to category strings. As a trivial example, a partner might substitute the string " > " for each instance of the character '/' within the category string.
Category strings should be URL-escaped in some fashion before putting them into URLs.
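The two treatments of a category string might be sketched as follows. The category value "Top/Food & Drink" is invented for illustration, the function names are hypothetical, and percent-encoding every reserved character is just one reasonable reading of "URL-escaped in some fashion".

```python
import html
from urllib.parse import quote


def category_for_display(category, separator=" > "):
    """Return an HTML-printable form of a category string."""
    # Substitute the separator first so that its '>' is HTML-escaped
    # together with the rest of the string.
    pretty = category.replace("/", separator)
    # quote=True also escapes the single and double quote characters.
    return html.escape(pretty, quote=True)


def category_for_url(category):
    """Return the category percent-encoded for use inside a URL."""
    return quote(category, safe="")


print(category_for_display("Top/Food & Drink"))
print(category_for_url("Top/Food & Drink"))
```

The same raw string feeds both functions; escaping it once for display and then reusing that result in a URL would produce a link to the wrong category.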
An R has another attribute, L, whose value indicates to what level it might be appropriate to indent that result if the partner wants to present results in a "clustered" format. Google clusters its results by host so that multiple hits from the same host tend to appear together; the first hit from a given host has L="1", and later hits from the same host have L="2". It is possible that Google will implement more sophisticated clustering in the future, so a parser should not assume that the only permissible values for L are "1" and "2". However, it may be assumed that the value of L is a positive integer; its default value is "1", as indicated in the DTD.
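A partner that wants a clustered layout might turn the L attribute into an indentation depth along these lines. The function names and the four-space indent are hypothetical; the only facts taken from the text are that L is a positive integer and defaults to "1".

```python
def indent_depth(level_attr=None):
    """Map the L attribute to a zero-based indentation depth."""
    # An absent attribute takes the DTD default of "1".
    level = int(level_attr) if level_attr is not None else 1
    if level < 1:
        raise ValueError("L is expected to be a positive integer")
    return level - 1


def render_title(title, level_attr=None, indent="    "):
    """Prefix a result title with one indent unit per clustering level."""
    return indent * indent_depth(level_attr) + title


print(render_title("First hit from www.foo.com"))
print(render_title("Second hit from www.foo.com", "2"))
```

Because the depth is computed rather than looked up in a fixed table, values such as "3" work without changes if more elaborate clustering is introduced.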
The C element also has a mandatory attribute, SZ, which holds the size of Google's cached content for the document. A typical value for the SZ attribute might be the string "8k".
Between triplets, arbitrary quantities of whitespace may be present in Google's protocol4 output (primarily for the sake of legibility). For example, a terminating newline character is likely to be appended after each triplet (although partners' protocol4 parsers should not require this). In addition, arbitrary amounts of whitespace may precede or follow the entire collection of triplets (although, once more, partners' protocol4 parsers should not require any particular amount of whitespace in these places).
As indicated earlier, protocol4 may be modified from time to time. Such modifications will always be augmentations: either new triplets will be added to the output format, or previously optional triplets will become mandatory. Therefore, a protocol4 parser should be written so that it ignores any unexpected triplets. In this way, it will not be affected by any protocol4 modifications.
Name | Meaning | Format of value | Comments |
GSPVersion | The version number of the output format | A string beginning with '4' | Like the VER attribute of GSP in XML. The current protocol version is 4.0 |
Time | The number of seconds the query took | A floating-point number | Like the TM element in XML |
Search | The query that Google searched on | HTML | Like the Q element in XML |
Comments | Search comments | HTML | Like the CT element in XML. Optional |
Tips | Search tips | HTML | Like the TT element in XML. Optional |
SearchCat_<i> | A directory category relevant to the search as a whole | A string (needs HTML-escaping to view; needs URL-escaping to put in a URL) | Like the SC element in XML. Optional; any number of these may be present |
Results | The (1-based) range of results displayed on this page. | Two integers, separated by a hyphen | Holds the same information as in the SN and EN attributes of RES in XML. If there are no results, this triplet will not be present. Optional |
Matches | Google's estimate of the total number of hits it has for the query. | A single integer | Like the M element in XML. If there are no results, this triplet will not be present. Optional |
BackURL | Relative URL for the previous results page | [Relative] URL (needs HTML-escaping to view) | Like the PU element in XML. If there is no previous results page, this triplet will not be present. Optional |
NextURL | Relative URL for the next results page | [Relative] URL (needs HTML-escaping to view) | Like the NU element in XML. If there is no next results page, this triplet will not be present. Optional |
Note that any number (including zero) of SearchCat_<i> triplets may be present. The first one is SearchCat_1, the second one is SearchCat_2, etc.
For result #<i>, the following triplets occur in the output results:
Name | Meaning | Format of value | Comments |
Level_<i> | The "level" at which this result should be displayed | A positive integer | Like the L attribute of R in XML |
URL_<i> | The URL of a single search result. | [Absolute] URL (needs HTML-escaping to view) | Like the U element in XML |
Title_<i> | The title of a single search result. | HTML | Like the T element in XML. Optional |
Rank_<i> | Google's rating of how good a single search result is | An integer in the range 0-10, inclusive | Like the RK element in XML |
Summary_<i> | A document snippet of a single search result | HTML | Like the S element in XML. Optional |
Cat_<i> | A directory category for a single search result | A string (needs HTML-escaping to view; needs URL-escaping to put in a URL) | Like the RC element in XML. Optional |
DirTitle_<i> | The title listed in the directory for a single search result | HTML | Like the DT element in XML. Optional |
DirSummary_<i> | The summary listed in the directory for a single search result | HTML | Like the DS element in XML. Optional |
Link_<i> | Indicates that Google has backlinks information for this document | (empty string) | Conveys the same information as in the L element in XML. If Google has no backlinks information for this document, this triplet will not be present. Optional |
CacheSize_<i> | The approximate size of Google's cached copy of this document | An integral number of Kilobytes, such as "8k" | Holds the same information as in the SZ attribute of C in XML. If Google has no cached copy of this document, this triplet will not be present. Optional |
Related_<i> | Indicates that Google has GoogleScout ("related:") information for this document | (empty string) | Conveys the same information as in the RT element in XML. If Google has no GoogleScout ("related:") information for this document, this triplet will not be present. Optional |
A given result can only have a DirTitle_<i> triplet or a DirSummary_<i> triplet if it has a Cat_<i> triplet, as well. However, the converse does not hold. Also, a given result can have a DirTitle_<i> without having a DirSummary_<i>, and vice versa.
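Once triplets have been decoded, grouping them by result might look like the sketch below. It deliberately starts from (name, value) pairs, because the wire format of a triplet is not reproduced here; the decoding step, the function name, and the sample values are all assumptions. Names outside the two tables are dropped, which is how a parser stays unaffected by newly added triplets.

```python
import re

# Triplet names taken from the two tables above.
PAGE_NAMES = {"GSPVersion", "Time", "Search", "Comments", "Tips",
              "Results", "Matches", "BackURL", "NextURL"}
RESULT_NAMES = {"Level", "URL", "Title", "Rank", "Summary", "Cat",
                "DirTitle", "DirSummary", "Link", "CacheSize", "Related"}
_INDEXED = re.compile(r"^([A-Za-z]+)_(\d+)$")


def group_triplets(pairs):
    """Group decoded (name, value) pairs into page-level data and per-result dicts."""
    page, categories, results = {}, {}, {}
    for name, value in pairs:
        match = _INDEXED.match(name)
        if match is None:
            if name in PAGE_NAMES:
                page[name] = value
            continue  # unknown page-level names are ignored
        base, index = match.group(1), int(match.group(2))
        if base == "SearchCat":
            categories[index] = value
        elif base in RESULT_NAMES:
            results.setdefault(index, {})[base] = value
        # Any other indexed name is an unexpected triplet and is ignored.
    page["SearchCats"] = [categories[i] for i in sorted(categories)]
    return page, [results[i] for i in sorted(results)]


page, results = group_triplets([
    ("GSPVersion", "4.0"), ("Mystery", "x"), ("SearchCat_1", "Top/Food"),
    ("URL_1", "http://a/"), ("Rank_1", "7"), ("Link_1", ""),
    ("URL_2", "http://b/"), ("Future_2", "zz"),
])
print(page)
print(results)
```

An optional triplet such as Link_<i> shows up as a key with an empty string value, so its presence is tested with "Link" in result rather than by checking whether the value is non-empty.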
Some agreements with Google may permit only the use of part of Google's results protocols. For example, a partner might have an agreement with Google to perform searches using Google's results protocols, but might nevertheless not be entitled to make use of Google's GoogleScout feature, despite the fact that an interface to it exists in Google's results protocols. Or a partner might have an agreement with Google to perform searches-- including GoogleScout searches-- using Google's results protocols, but might nevertheless not be entitled to make use of Google's directory and category information.
The contents of this document are confidential and proprietary to Google.
©2000 Google Inc.