Scored Searching on RDBMS-backed Web sites
by Bill Schneider (bschneid@arsdigita.com)
Submitted on: 2000-07-01
Last updated: 2000-08-05
ArsDigita : ArsDigita Systems Journal : One article
Many web sites have a page where users can search a database for
records that match a set of criteria. Often, the search criteria
translate directly into WHERE
clauses in a SQL query,
joined with the AND
operator. But this search method has
the undesirable effect of punishing users for telling us more about
what they want. Instead of getting better results for offering more
search criteria, the user gets fewer results.
Real-estate and apartment rental Web sites often have search functions
with this behavior, whether or not they use RDBMSs. One example is Century 21's web site (http://www.century21.com). The search
function allows the user to enter several search criteria to find
houses for sale in a certain area and price range; the user may enter
additional critieria for amenities, such as "near a golf course." The
Century 21 site will return only listings that match all of the
specified criteria exactly.
An oversimplified example of such a SQL query might look like this:
SELECT house_id
FROM houses
WHERE houses.city = 'Boston'
AND houses.price < 120000
AND houses.golf_p = 't'
In this example, a user with vague search terms could be overwhelmed
with too many matches, while a user who provides a very specific set
of criteria will likely get nothing back. For example, one search for
houses within 5 miles of a certain ZIP code on http://www.century21.com returned 19 results,
but adding a single constraint for amenities (air conditioning, pool,
etc.) caused the search to return no results.
How can we improve the search results returned to the user? By
returning near misses to a user's supplied criteria, in addition to
exact matches. In the above example, a person in the market for a new
house might want to see listings for homes in an area that have all
the desired amenities, but are slightly out of the specified price
range, or vice versa. Just as a salesman might help a customer by
saying, "Sorry, never heard of it, but we do have this, which
is pretty close to what you asked for," providing near
misses helps users better refine or reformulate their search criteria.
We can generate partial matches by assigning a numeric score to each
record in the search, proportional to how well each element in the
record matches the desired value. The SQL query that performs the
search can then restrict queries based on the numeric score rather
than simply by combining WHERE clause expressions with AND; the search
may return either all results with a score above a certain threshold
value, or it may return a fixed number of results with the highest
scores.
A skeletal example of a score-based SQL query, improving upon the above
example, might look like this:
SELECT house_id,
-- a house gets one point if the city matches
decode(city, 'Boston', 1, 0)
-- another point if the price is below the user's max
+ decode(sign(120000-price), 1, 1, 0)
-- and another point for amenities that match
+ decode(has_pool, 'yes', 1, 0)
AS score
FROM houses
WHERE score >= 2
ORDER by score desc
This query will return those near-misses that match two out of three
supplied criteria, with the ones that match all three criteria
returned first.
Note on the decode function in Oracle
The decode function in Oracle is used
for conditional evaluation in SQL statements, to encode logic into
queries like "if x = 1 then return 10, else return 20."
Its syntax is decode(expr, a1, b1, a2, b2, ..., default) ,
and it returns b(i) where expr = a(i); if expr doesn't
match any "a" values,
then default is returned. In the above query, for example,
decode(city, 'Boston', 1, 0) evaluates to 1 if the city
column contains the string 'Boston', and 0 otherwise.
Inequalities can be constructed
using decode and the sign function. In the
above example, sign(120000-price) will evaluate to 1 if
and only if price < 120000, so the expression
decode(sign(120000-price), 1, 1, 0) will be 1 if price < 120000
and 0 otherwise.
|
Case study: scored search for law firms in Infirmation.com
The Infirmation.com web site (http://www.infirmation.com)
is an information and recruiting resource for lawyers and law students
seeking jobs, and for law firm recruiting. One feature of the service
is a scored search for law firms meeting some set of user-supplied
criteria. Users may search for law firms based on the following
criteria:
- Location (city/state)
- Salary
- Size of firm (number of attorneys)
- Legal practice areas (e.g., intellectual property, real estate, tax, etc.)
The user may designate each search criteria as "hard" or absolute,
so that a firm will appear in the search results only if it
matches the criteria. Or, by default, the search criteria will only
affect the firms' score. Infirmation's interface for entering search criteria
and designating them as absolute is illustrated in figure 1 below.
Figure 1: Search form input for Infirmation.com
Motivation
Just as in the real-estate example, we want to expand the law
firm search results around the user's supplied criteria, so that they
will see near misses in addition to exact matches. This way, if a
search has no exact matches, the user may still see some useful
results. For example, suppose a user wants to see a list of all law
firms with offices in Boston that pay their new associates over
$85,000 a year. If no such firm exists, or if there are only a few,
the user will also see other law firms in Boston with lower starting
salaries (e.g., Nixon Peabody in figure 2 below), and high-paying firms in
other cities.
Figure 2: Search Results from Infirmation with Partial Matches
The result is a search function that is more useful to the end-user
than one that simply excludes non-matching rows from the result set,
while still allowing for this behavior when it is explicitly
requested. For example, the user may specifically restrict his search
to firms in Boston, so that the partial matches will consist only of
law firms in Boston that pay less than the desired salary (see figure 3 below.)
Figure 3: Search Results from Infirmation with Exact Matches for
the City, and Partial Matches on Salary
Showing near misses may help guide a user to better refine their
search criteria. The graduating law student in the above example
might consider his options in other cities for the bigger paycheck, or
might get a better idea of what kinds of salary to expect at a law
firm in Boston.
Implementation Methodology
We implement the scored-search logic with a single SQL query, which
returns five columns:
- the primary key for each result (the law firm ID),
- a short description of each result (the name of the law firm),
- a numeric score indicating the quality of the match,
- the location of the firm, and
- the salary offered by the firm.
Only the first three of these columns are essential to the search; if
one were to implement a similar search in a different service, there
would be other columns in place of the firm location and salary.
(These are only included as a convenience, so that salary and location
may be displayed along with the search results.) The query ends with
an order by score desc
clause, so that the better matches are
returned first.
Hard vs. Soft criteria
Depending on user input, each search criteria is treated as either
absolute ("hard") or relative ("soft"). The hard criteria are directly
reflected in the query's WHERE
clause; a firm will not
appear in the search results under any circumstance unless that
criteria is met.
The soft criteria only appear in the WHERE
clause to
influence scoring. For each search criteria, we designate a maximum
score for a complete match, and assign a lower score for other firms
depending on how close that firm is to matching the search criteria.
Then, the search returns all firms that have a combined total score above an
empirically determined cutoff.
Boolean vs. Numeric Criteria
In the set of search criteria for law firms, some criteria are
considered to be Boolean (yes-or-no), like whether or not a
firm has an office in a particular state, or whether a firm has
attorneys who are involved in a particular area of legal practice.
Other criteria, however, are numeric, such as salary or the
size of the firm.
Numeric criteria easily lead to algorithms for assigning
partial-match scores. For example, we can assume that people
generally want more money, and that, if a user wants to see law firms
paying new associates $100,000 a year, firms that pay $95k a
year should rank higher than firms that pay $55k. Assigning a score
based on firm size follows similar logic, although score as a function
of firm size does not necessarily increase monotonically.
The tricky thing, then, is combining Boolean and numeric criteria
in the total score for a law firm. Boolean criteria are, by their
nature, on-or-off, yes-or-no questions; they don't have partial
matches. So, we often restrict the maximum possible score for the
numeric criteria above the maximum score for a Boolean criteria. For
example, we give a single point (score of 1) for each matching legal
practice area, but a score of 1.5 is possible for the salary and firm
size criteria. The location is an exception, and gets a maximum score
of 2. This reflects that a location is composed of two separate
criteria, city and state; if the city matches, one can infer that the
state also matches.
Shape of Linear Scoring Functions
To better combine numeric and boolean criteria in a total firm
score, we also incorporated a discontinuity in the scoring
functions for the numeric criteria. This discontinuity incorporates
some of the function of a boolean criteria: a firm gets a point for
matching the criteria (e.g., salary >= desired salary), and gets an
additional fraction of a point (or loses an additional fraction of a
point) depending on how far the actual value is above or below the desired
value.
We express a partial firm
score as a function of firm salary and desired salary as follows:
score =
{ 1 if salary = desired
{ 1 + 0.5 * (salary-desired/(max(salary)-desired)) if salary > desired
{ (salary-desired/5000) if desired < salary
The astute reader will notice that a firm gets one point if its salary
exactly matches the desired salary, and increases to a maximum of 1.5
points; and, if the firm salary is below the desired salary, the score
is negative and approaches zero as the firm salary approaches the
desired salary. In other words, the score jumps from 0 to 1 about
the point where the firm salary equals the desired salary.
The scoring function for firm size is similar, although it
requires us to arbitrarily assign a firm size (number of attorneys) as
"medium" so that the user may search for firms that are either "small,
"medium," or "large." We chose to call a firm with 75 attorneys
"medium-sized."
When looking for small firms, we assign the maximum score (1.5) to a
firm with no attorneys, decreasing to 1 at the declared medium firm
size; then the score jumps to 0 and goes slightly negative
proportional to the number of attorneys. A search for large firms
is scored the same way, except as a mirror image.
Searches for medium firms are scored differently. We declare an
arbitrary range for which a firm is considered to be medium-sized; in
Infirmation.com, we say a firm is medium-sized if it has between 45
and 105 employees. Any firm within this range gets a score of 1, and
any firm outside this range gets a score of 0. This scoring function
behaves differently than those for small and large firms, but has the
same one-point discontinuity as the others.
Implementation notes
The firm-scoring logic in Infirmation.com depends heavily on the
Oracle decode
function; for example, the expression
decode(city, :city, 2, 0)
within a SELECT
statement will evaluate to 2 points if the city
column
equals the supplied parameter :city
, and 0 otherwise.
More complex conditionals, such as those used to build the
piecewise-linear functions used to score numeric criteria, can be
built by using the output of the sign
function as the
first argument to decode
, as in the previous real-estate
example.
Performance notes
This methodology is effective in practice only if the final database
query runs efficiently. Since we are using an ORDER BY
expression, the query will do several sequential scans on the result
set. Therefore, for the query to run efficiently, the entire result
set must fit into RAM. When Oracle is used, partitioning large tables
on the search-parameter columns (e.g., firm location and/or salary in
Infirmation.com) could speed up the query, because the database would
scan fewer blocks to find search results.
Limitations and Possible Improvements
As of June 1, 2000, the Infirmation.com web site has 28,580 registered
users, and serves over a thousand distinct users a day. The scored
law-firm search is a popular feature, used several hundred times a day
on average.
Our approach has worked well in practice on a popular public Web site,
but there are several shortcomings in our specifics that we would like
to address in a future launch or if we were building a similar
service.
First, it is not clear how we should score a list of
Boolean criteria, as we do in Infirmation with a list of legal
practice areas. Right now, we assign a score of 1 for each match in
the list; and while this works fairly well in common practice, it can
result in this particular criteria dominating other criteria, since
its importance increases with each selected item. We might want to
consider normalizing the total possible score from a list of Boolean
criteria, or expressing it as a ratio of items matched to items
requested.
Second, the scored firm search does not scale the numeric scores
returned (that is, express them as a percentage). In the current
system, the score for any given firm is dependent on not only how well
the firm matches the requested criteria, but on the supplied
criteria themselves; so, there is no way to compare the quality
of results from different searches. Some potential future features
(e.g., customizable e-mail alerts for new firms entered into the
database) may depend on the ability to make such comparisons between
searches.
Third, the search as it is implemented in Infirmation
attempts to separate complete matches from partial matches, when
some of the criteria are soft. It accomplishes this by using a
(somewhat arbitrary) score as a cut-off value. Numeric criteria,
however, have a maximum score higher than the designated full-match
value; firms that offer a very high salary but don't match a requested
practice area might appear as full matches, when they should appear as
partial matches.
Fourth, and finally, we should note that the law firm search in
Infirmation.com does not include any context searches, which we
generally implement with the Oracle ConText/interMedia cartridge.
But, if we decided to add another search criteria that would be
treated contextually (e.g., "tell me all the law firms in New York
with 'Kirk' in their name"), it would not change the fundamental logic
behind the scoring algorithm. We would treat such a field as another
numeric criteria, and incorporate the results of the
ConText/interMedia scoring functions into our combined score for a
firm. It is not clear how we would create the same discontinuity
that we did with the salary and firm size criteria, nor is it clear if
this would be necessary to produce an accurate relevancy search.
Future Directions: A generic API for scored searches
It may be possible to implement a generic API to facilitate
including similar searches in other RDBMS-backed sites. As a first step,
we could design a procedural abstraction that will return a SQL query
given the following inputs:
- The names of database tables that contain the records to be returned
- A list of
JOIN
expressions, if the data is in multiple
tables
- The names of the desired columns
- A list of absolute criteria (in SQL)
- A list of scoring expressions (in SQL)
- A list of the maximum possible score for each scoring expression, to
generate cutoffs
To further refine this idea, we could also build procedural
abstractions to generate the scoring expressions automatically, if the
score can easily be expressed as a function of the user-supplied
target value and the value in each database record. This is not
always easily generalizable, though; for example, Infirmation uses a
different scoring function for firm size depending on the user's
input. And, even with straightforward scoring functions, such as that
for salary, the cut-off for the maximum score is dependent on the
maximum salary of all firms in the database. Still, such helper
procedures could be useful for generalizing the SQL code to count how
many checklist items are matched by records in another table, and so
on.
Service developers using a generic scored-search API could move
some of this detail into the RDBMS by using stored functions (in
PL/SQL, if Oracle is the RDBMS) to generate scores given a user input
value and a database input value. Then, the service developers would
pass a list of stored function names to the scored-search API, rather
than the actual scoring expression.
A call to such an API to generate the real-estate example might
look like this:
function scored_search_query (tables, join_exprs, columns, hard_where,
score_exprs, max_scores)
Returns a SQL scored-search query given a list of tables,
a list of join expressions on those tables, a list of columns to
return in the result set, a list of "hard" WHERE clauses, a list
of scoring expressions, and a list of maximum scores for each
scoring expression
function scored_search_match (name, val, score)
return "decode(name, val, score, 0)"
function scored_search_greater_than (v1, v2, score)
return "decode(sign(v1-v2),1,score,0)"
function scored_search_checklist(match_list, max_score)
Generates SQL to compute the score from a series of Boolean
criteria (checklist items), scaled to a maximum of max_score.
if match_list is empty then
return max_score // trivially true
else
score_list := ();
for each item in match_list do
score_list := score_list U (scored_search_match(item,"yes",1))
done
return "sum(score_list) * " || max_score / length(score_list)
Example call to scored_search_query:
city := "Boston"
max_price := 120000
amenities_list := ("has_pool", "has_golf", "has_central_air")
query := scored-search-query("houses", (),
("house_id", "description", "price", "city", "state"),
("city = $city", "state = $state"),
(scored_search_match("city", city, 1),
(scored_search_match("state", state, 1),
(scored_search_greater_than(max_price, "price", 1),
(scored_search_checklist(amenities_list,1)),
(1,1,1,1))
asj-editors@arsdigita.com