Scored Searching on RDBMS-backed Web sites (ArsDigita Systems Journal)

Scored Searching on RDBMS-backed Web sites

by Bill Schneider (bschneid@arsdigita.com)

Submitted on: 2000-07-01
Last updated: 2000-08-05

ArsDigita : ArsDigita Systems Journal : One article

Many web sites have a page where users can search a database for records that match a set of criteria. Often, the search criteria translate directly into WHERE clauses in a SQL query, joined with the AND operator. But this search method has the undesirable effect of punishing users for telling us more about what they want. Instead of getting better results for offering more search criteria, the user gets fewer results.
Real-estate and apartment rental Web sites often have search functions with this behavior, whether or not they use RDBMSs. One example is Century 21's web site (http://www.century21.com). The search function allows the user to enter several search criteria to find houses for sale in a certain area and price range; the user may enter additional critieria for amenities, such as "near a golf course." The Century 21 site will return only listings that match all of the specified criteria exactly.
An oversimplified example of such a SQL query might look like this:
SELECT house_id
 FROM houses
 WHERE houses.city = 'Boston'
  AND houses.price < 120000
  AND houses.golf_p = 't' 
In this example, a user with vague search terms could be overwhelmed with too many matches, while a user who provides a very specific set of criteria will likely get nothing back. For example, one search for houses within 5 miles of a certain ZIP code on http://www.century21.com returned 19 results, but adding a single constraint for amenities (air conditioning, pool, etc.) caused the search to return no results.
How can we improve the search results returned to the user? By returning near misses to a user's supplied criteria, in addition to exact matches. In the above example, a person in the market for a new house might want to see listings for homes in an area that have all the desired amenities, but are slightly out of the specified price range, or vice versa. Just as a salesman might help a customer by saying, "Sorry, never heard of it, but we do have this, which is pretty close to what you asked for," providing near misses helps users better refine or reformulate their search criteria.
We can generate partial matches by assigning a numeric score to each record in the search, proportional to how well each element in the record matches the desired value. The SQL query that performs the search can then restrict queries based on the numeric score rather than simply by combining WHERE clause expressions with AND; the search may return either all results with a score above a certain threshold value, or it may return a fixed number of results with the highest scores.
A skeletal example of a score-based SQL query, improving upon the above example, might look like this:
SELECT house_id,
	-- a house gets one point if the city matches
        decode(city, 'Boston', 1, 0)
	-- another point if the price is below the user's max
        + decode(sign(120000-price), 1, 1, 0)
	-- and another point for amenities that match
        + decode(has_pool, 'yes', 1, 0)
         AS score
 FROM houses
 WHERE score >= 2
 ORDER by score desc
This query will return those near-misses that match two out of three supplied criteria, with the ones that match all three criteria returned first.

Note on the decode function in Oracle

The decode function in Oracle is used for conditional evaluation in SQL statements, to encode logic into queries like "if x = 1 then return 10, else return 20." Its syntax is decode(expr, a1, b1, a2, b2, ..., default), and it returns b(i) where expr = a(i); if expr doesn't match any "a" values, then default is returned. In the above query, for example, decode(city, 'Boston', 1, 0) evaluates to 1 if the city column contains the string 'Boston', and 0 otherwise.
Inequalities can be constructed using decode and the sign function. In the above example, sign(120000-price) will evaluate to 1 if and only if price < 120000, so the expression decode(sign(120000-price), 1, 1, 0) will be 1 if price < 120000 and 0 otherwise.

Case study: scored search for law firms in Infirmation.com

The Infirmation.com web site (http://www.infirmation.com) is an information and recruiting resource for lawyers and law students seeking jobs, and for law firm recruiting. One feature of the service is a scored search for law firms meeting some set of user-supplied criteria. Users may search for law firms based on the following criteria:

Location (city/state)
Salary
Size of firm (number of attorneys)
Legal practice areas (e.g., intellectual property, real estate, tax, etc.)
The user may designate each search criteria as "hard" or absolute, so that a firm will appear in the search results only if it matches the criteria. Or, by default, the search criteria will only affect the firms' score. Infirmation's interface for entering search criteria and designating them as absolute is illustrated in figure 1 below.

Figure 1: Search form input for Infirmation.com

Motivation

Just as in the real-estate example, we want to expand the law firm search results around the user's supplied criteria, so that they will see near misses in addition to exact matches. This way, if a search has no exact matches, the user may still see some useful results. For example, suppose a user wants to see a list of all law firms with offices in Boston that pay their new associates over $85,000 a year. If no such firm exists, or if there are only a few, the user will also see other law firms in Boston with lower starting salaries (e.g., Nixon Peabody in figure 2 below), and high-paying firms in other cities.

Figure 2: Search Results from Infirmation with Partial Matches

The result is a search function that is more useful to the end-user than one that simply excludes non-matching rows from the result set, while still allowing for this behavior when it is explicitly requested. For example, the user may specifically restrict his search to firms in Boston, so that the partial matches will consist only of law firms in Boston that pay less than the desired salary (see figure 3 below.)

Figure 3: Search Results from Infirmation with Exact Matches for the City, and Partial Matches on Salary

Showing near misses may help guide a user to better refine their search criteria. The graduating law student in the above example might consider his options in other cities for the bigger paycheck, or might get a better idea of what kinds of salary to expect at a law firm in Boston.
Implementation Methodology
We implement the scored-search logic with a single SQL query, which returns five columns:

the primary key for each result (the law firm ID),
a short description of each result (the name of the law firm),
a numeric score indicating the quality of the match,
the location of the firm, and
the salary offered by the firm.
Only the first three of these columns are essential to the search; if one were to implement a similar search in a different service, there would be other columns in place of the firm location and salary. (These are only included as a convenience, so that salary and location may be displayed along with the search results.) The query ends with an order by score desc clause, so that the better matches are returned first.
Hard vs. Soft criteria
Depending on user input, each search criteria is treated as either absolute ("hard") or relative ("soft"). The hard criteria are directly reflected in the query's WHERE clause; a firm will not appear in the search results under any circumstance unless that criteria is met.
The soft criteria only appear in the WHERE clause to influence scoring. For each search criteria, we designate a maximum score for a complete match, and assign a lower score for other firms depending on how close that firm is to matching the search criteria. Then, the search returns all firms that have a combined total score above an empirically determined cutoff.
Boolean vs. Numeric Criteria
In the set of search criteria for law firms, some criteria are considered to be Boolean (yes-or-no), like whether or not a firm has an office in a particular state, or whether a firm has attorneys who are involved in a particular area of legal practice. Other criteria, however, are numeric, such as salary or the size of the firm.
Numeric criteria easily lead to algorithms for assigning partial-match scores. For example, we can assume that people generally want more money, and that, if a user wants to see law firms paying new associates $100,000 a year, firms that pay $95k a year should rank higher than firms that pay $55k. Assigning a score based on firm size follows similar logic, although score as a function of firm size does not necessarily increase monotonically.
The tricky thing, then, is combining Boolean and numeric criteria in the total score for a law firm. Boolean criteria are, by their nature, on-or-off, yes-or-no questions; they don't have partial matches. So, we often restrict the maximum possible score for the numeric criteria above the maximum score for a Boolean criteria. For example, we give a single point (score of 1) for each matching legal practice area, but a score of 1.5 is possible for the salary and firm size criteria. The location is an exception, and gets a maximum score of 2. This reflects that a location is composed of two separate criteria, city and state; if the city matches, one can infer that the state also matches.
Shape of Linear Scoring Functions

To better combine numeric and boolean criteria in a total firm score, we also incorporated a discontinuity in the scoring functions for the numeric criteria. This discontinuity incorporates some of the function of a boolean criteria: a firm gets a point for matching the criteria (e.g., salary >= desired salary), and gets an additional fraction of a point (or loses an additional fraction of a point) depending on how far the actual value is above or below the desired value.
We express a partial firm score as a function of firm salary and desired salary as follows:
score =
{ 1 if salary = desired
{ 1 + 0.5 * (salary-desired/(max(salary)-desired)) if salary > desired
{ (salary-desired/5000) if desired < salary

The astute reader will notice that a firm gets one point if its salary exactly matches the desired salary, and increases to a maximum of 1.5 points; and, if the firm salary is below the desired salary, the score is negative and approaches zero as the firm salary approaches the desired salary. In other words, the score jumps from 0 to 1 about the point where the firm salary equals the desired salary.
The scoring function for firm size is similar, although it requires us to arbitrarily assign a firm size (number of attorneys) as "medium" so that the user may search for firms that are either "small, "medium," or "large." We chose to call a firm with 75 attorneys "medium-sized."
When looking for small firms, we assign the maximum score (1.5) to a firm with no attorneys, decreasing to 1 at the declared medium firm size; then the score jumps to 0 and goes slightly negative proportional to the number of attorneys. A search for large firms is scored the same way, except as a mirror image.
Searches for medium firms are scored differently. We declare an arbitrary range for which a firm is considered to be medium-sized; in Infirmation.com, we say a firm is medium-sized if it has between 45 and 105 employees. Any firm within this range gets a score of 1, and any firm outside this range gets a score of 0. This scoring function behaves differently than those for small and large firms, but has the same one-point discontinuity as the others.
Implementation notes

The firm-scoring logic in Infirmation.com depends heavily on the Oracle decode function; for example, the expression decode(city, :city, 2, 0) within a SELECT statement will evaluate to 2 points if the city column equals the supplied parameter :city, and 0 otherwise. More complex conditionals, such as those used to build the piecewise-linear functions used to score numeric criteria, can be built by using the output of the sign function as the first argument to decode, as in the previous real-estate example.
Performance notes
This methodology is effective in practice only if the final database query runs efficiently. Since we are using an ORDER BY expression, the query will do several sequential scans on the result set. Therefore, for the query to run efficiently, the entire result set must fit into RAM. When Oracle is used, partitioning large tables on the search-parameter columns (e.g., firm location and/or salary in Infirmation.com) could speed up the query, because the database would scan fewer blocks to find search results.
Limitations and Possible Improvements
As of June 1, 2000, the Infirmation.com web site has 28,580 registered users, and serves over a thousand distinct users a day. The scored law-firm search is a popular feature, used several hundred times a day on average.
Our approach has worked well in practice on a popular public Web site, but there are several shortcomings in our specifics that we would like to address in a future launch or if we were building a similar service.
First, it is not clear how we should score a list of Boolean criteria, as we do in Infirmation with a list of legal practice areas. Right now, we assign a score of 1 for each match in the list; and while this works fairly well in common practice, it can result in this particular criteria dominating other criteria, since its importance increases with each selected item. We might want to consider normalizing the total possible score from a list of Boolean criteria, or expressing it as a ratio of items matched to items requested.
Second, the scored firm search does not scale the numeric scores returned (that is, express them as a percentage). In the current system, the score for any given firm is dependent on not only how well the firm matches the requested criteria, but on the supplied criteria themselves; so, there is no way to compare the quality of results from different searches. Some potential future features (e.g., customizable e-mail alerts for new firms entered into the database) may depend on the ability to make such comparisons between searches.
Third, the search as it is implemented in Infirmation attempts to separate complete matches from partial matches, when some of the criteria are soft. It accomplishes this by using a (somewhat arbitrary) score as a cut-off value. Numeric criteria, however, have a maximum score higher than the designated full-match value; firms that offer a very high salary but don't match a requested practice area might appear as full matches, when they should appear as partial matches.
Fourth, and finally, we should note that the law firm search in Infirmation.com does not include any context searches, which we generally implement with the Oracle ConText/interMedia cartridge. But, if we decided to add another search criteria that would be treated contextually (e.g., "tell me all the law firms in New York with 'Kirk' in their name"), it would not change the fundamental logic behind the scoring algorithm. We would treat such a field as another numeric criteria, and incorporate the results of the ConText/interMedia scoring functions into our combined score for a firm. It is not clear how we would create the same discontinuity that we did with the salary and firm size criteria, nor is it clear if this would be necessary to produce an accurate relevancy search.
Future Directions: A generic API for scored searches
It may be possible to implement a generic API to facilitate including similar searches in other RDBMS-backed sites. As a first step, we could design a procedural abstraction that will return a SQL query given the following inputs:

The names of database tables that contain the records to be returned
A list of JOIN expressions, if the data is in multiple tables
The names of the desired columns
A list of absolute criteria (in SQL)
A list of scoring expressions (in SQL)
A list of the maximum possible score for each scoring expression, to generate cutoffs

To further refine this idea, we could also build procedural abstractions to generate the scoring expressions automatically, if the score can easily be expressed as a function of the user-supplied target value and the value in each database record. This is not always easily generalizable, though; for example, Infirmation uses a different scoring function for firm size depending on the user's input. And, even with straightforward scoring functions, such as that for salary, the cut-off for the maximum score is dependent on the maximum salary of all firms in the database. Still, such helper procedures could be useful for generalizing the SQL code to count how many checklist items are matched by records in another table, and so on.
Service developers using a generic scored-search API could move some of this detail into the RDBMS by using stored functions (in PL/SQL, if Oracle is the RDBMS) to generate scores given a user input value and a database input value. Then, the service developers would pass a list of stored function names to the scored-search API, rather than the actual scoring expression.
A call to such an API to generate the real-estate example might look like this:
function scored_search_query (tables, join_exprs, columns, hard_where,
	score_exprs, max_scores)

	Returns a SQL scored-search query given a list of tables, 
	a list of join expressions on those tables, a list of columns to 
	return in the result set, a list of "hard" WHERE clauses, a list
	of scoring expressions, and a list of maximum scores for each
	scoring expression

function scored_search_match (name, val, score)
	return "decode(name, val, score, 0)"

function scored_search_greater_than (v1, v2, score)
	return "decode(sign(v1-v2),1,score,0)"

function scored_search_checklist(match_list, max_score)

	Generates SQL to compute the score from a series of Boolean
	criteria (checklist items), scaled to a maximum of max_score.  

	if match_list is empty then
		return max_score // trivially true
	else
	  score_list := ();
	  for each item in match_list do
	    score_list := score_list U (scored_search_match(item,"yes",1))
	  done
	  return "sum(score_list) * " || max_score / length(score_list)

Example call to scored_search_query:

city := "Boston"
max_price := 120000
amenities_list := ("has_pool", "has_golf", "has_central_air")
query := scored-search-query("houses", (), 
	("house_id", "description", "price", "city", "state"),
	("city = $city", "state = $state"), 
	(scored_search_match("city", city, 1), 
	 (scored_search_match("state", state, 1),
	 (scored_search_greater_than(max_price, "price", 1),
	 (scored_search_checklist(amenities_list,1)),
	(1,1,1,1))

asj-editors@arsdigita.com

Scored Searching on RDBMS-backed Web sites

Note on the decode function in Oracle

Case study: scored search for law firms in Infirmation.com

Motivation

Implementation Methodology

Hard vs. Soft criteria

Boolean vs. Numeric Criteria

Shape of Linear Scoring Functions

Implementation notes

Performance notes

Limitations and Possible Improvements

Future Directions: A generic API for scored searches

Note on the `decode` function in Oracle