Difference between revisions of "POD/Bugzilla::Search"

From Wiki4Intranet
Jump to: navigation, search
 
Line 3: Line 3:
 
== NAME ==
 
== NAME ==
  
This is totally rewritten Bugzilla4Intranet search engine.
+
Original Bugzilla::Search was ugly and stupid. It contained a lot of legacy code and often generated queries very complex for the DBMS, and search performance was awful on big databases.
  
Original Bugzilla::Search was ugly and stupid. It contained a lot of legacy code and often generated queries very complex for the DBMS, which leads to awful performance on big databases.
+
Although we could expect the authors to refactor it in Bugzilla 4.0, they've just decomposed overlong subroutines into small ones.
 
+
Yet in Bugzilla 4.0, although we could expect the authors to refactor it, they've just decomposed overlong subroutines into small ones.
+
  
 
So in Bugzilla4Intranet, I (Vitaliy Filippov) have rewritten it totally.
 
So in Bugzilla4Intranet, I (Vitaliy Filippov) have rewritten it totally.
Line 13: Line 11:
 
License is dual-license GPL 3.0+ or MPL 1.1+ ("+" means "or any later version"), to be compatible with both GPL and Bugzilla code.
 
License is dual-license GPL 3.0+ or MPL 1.1+ ("+" means "or any later version"), to be compatible with both GPL and Bugzilla code.
  
Most of the functionality remains unchanged, but the internals are totally different, as well as the query performance (tested on MySQL) :)
+
Most of the functionality remains unchanged, but the internals are totally different, as well as query performance is (tested on MySQL) :)
  
 
== BOOLEAN CHARTS ==
 
== BOOLEAN CHARTS ==

Latest revision as of 13:38, 24 August 2011

This is totally rewritten Bugzilla4Intranet search engine.

NAME

Original Bugzilla::Search was ugly and stupid. It contained a lot of legacy code and often generated queries very complex for the DBMS, and search performance was awful on big databases.

Although we could expect the authors to refactor it in Bugzilla 4.0, they've just decomposed overlong subroutines into small ones.

So in Bugzilla4Intranet, I (Vitaliy Filippov) have rewritten it totally.

License is dual-license GPL 3.0+ or MPL 1.1+ ("+" means "or any later version"), to be compatible with both GPL and Bugzilla code.

Most of the functionality remains unchanged, but the internals are totally different, as well as query performance is (tested on MySQL) :)

BOOLEAN CHARTS

A boolean chart is a way of representing the terms in a logical expression. Bugzilla builds SQL queries depending on how you enter terms into the boolean chart. Boolean charts are represented in urls as three-tuples of (chart, row, column) = (field, operator, value). The query form (query.cgi) may contain an arbitrary number of boolean charts where each chart represents a condition in SQL query.

The expressions represented by columns are ORed together. The expressions represented by rows are ANDed together and can be negated. The expressions represented by each chart are ORed together. So, boolean charts specify a logical expression in the form of:

    OR(AND(OR()), NOT(AND(OR())), ...)

Original boolean charts in Bugzilla 2.x really consisted of tables with rows and columns. Now, there is no tables, but three levels of charts remain.

Original boolean charts also were ANDed together, not ORed. I.e. the expression was AND(AND(OR())), not OR(AND(OR())).

Original boolean charts also were "constraints on a single piece of data". I.e. '(cc = foo) and (cc = bar)' within one boolean chart would match nothing in original Bugzilla, as this would run the check on a single user, which could not be 'foo' and 'bar' simultaneously. But, '(cc = foo)' and '(cc = bar)' as two separate charts would match bugs which have 'foo' as one CC AND 'bar' as another.

Now, this magic is only enabled for bug comments, bug changes, flags and attachments. See "CORRELATED SEARCH TERMS" below.

FIXME: Add support for manual selection of match target for such terms. I.e. the user should be able to select "comment 1 author = ..." AND "comment 2 author = ..." search terms by hand, without thinking about any magic. Now, it's also the single piece of functionality which is lost compared to the old search engine.

QUERY OPTIMISATION

The two main problems of Bugzilla query complexity on big databases were:

OR operators inside query which stop DBMS from using indexes for query speed-up. 
I.e. for example "cc=X or commenter=X" was translated to 2 LEFT JOINs - to cc and longdescs tables - and a term like "cc.who=X or longdescs.who=X". DBMS can't do an index merge here, so this leads to a scan + two index checks.
SOLUTION:
Use UNIONs instead of ORs. I.e. this same query is now translated to
  (SELECT bug_id FROM cc WHERE who=X)
  UNION
  (SELECT bug_id FROM longdescs WHERE who=X)
It's important to note that the UNION of "non-seeding" queries is worse than OR, because UNION must be always executed fully and can't be just checked while executing other part of query.
SOLUTION: (Partially experimental by now, but works)
Expand brackets in expressions. I.e. transform AND(a,OR(b,c)) to OR(AND(a,b),AND(a,c)). This also allows to handle OR query parts with correlated search terms.
Usage of multiple JOINS with many rows in one query, leading to insane amounts of rows when multiplied. 
SOLUTION:
Wrap them into (SELECT DISTINCT) subqueries if there is more than one such term in the query. This removes duplicate rows from the result and makes DBMS's life easier. To do so in Bugzilla, specify { many_rows => 1 } for such terms.

SEARCH FUNCTIONS

Search functions get and return arguments through $self. Input arguments:

$self->{sequence} 
Sequence number of current condition, starting with 0. Used mostly to prevent table name collisions.
$self->{field} 
Field name.
$self->{fieldsql} 
Field SQL representation.
$self->{type} 
Search operator type (equals, greaterthan, etc)
$self->{negated} 
Most "negative" search operators (notequals, etc) are automatically derived from negated terms for corresponding "posivite" operators. So $self->{type} cannot be "negative", and $self->{negated} contains true if it's really negative.
$self->{value} 
Search value.
$self->{quoted} 
SQL representation of value. Search functions must set this field before calling default search operator implementation ($self->call_op) if they want to use different SQL code for the value.

OUTPUT ARGUMENT, $self->{term}

Resulting SQL condition or an expression consisting of several conditions.

All conditions are logically divided into two classes: scalar conditions and list conditions. Scalar condition is a check of a single value linked to single bug (for example, "bug assignee = somebody"). List condition is a check in the form of "ANY of linked values match condition" or "NONE of linked values match condition". A lot search conditions in Bugzilla are List conditions. A simple example is "CC" field: "CC = somebody" is really "ANY CC is somebody", and "CC != somebody" is really "NONE of CC is somebody".

Scalar conditions can be returned either as a simple string with SQL condition, or as a hashref with following keys:

supp => [ " LEFT JOIN t1 ON ...", ... ] 
Arrayref with table JOINs (plaintext with ON) required to match.
term => "string" 
SQL condition string.
description => [ 'field', 'operator', 'value' ] 
Search term description for the UI. It is not required for search functions to provide this description unless they return an expression consisting of several terms. The default 'field', 'operator' and 'value' are taken from the request.

List conditions must be always returned as a hashref with the following keys:

table => "string" 
Table specification with alias(es), just like <table> inside SQL query:
 ... JOIN <table> ON <condition> ...
For example, it may itself contain JOINs inside brace expressions:
 ... JOIN (table1 JOIN table2 ON ...) ON <condition> ...
$self->{sequence} should be appended to alias(es) to easily combat table name collisions.
where => "string" 
Join <condition>. fields of 'bugs' table could be used for joining.
bugid_field => "string" 
When the <table> is joined to 'bugs' simply on 'bug_id', and none of other 'bugs' fields are used, function should omit "bugs.bug_id=table_alias.bug_id" from 'where', and specify "tablealias.bug_id" as 'bugid_field'.
notnull_field => "string" 
When the <table> is joined to 'bugs' on some field other than 'bug_id', 'bugid_field' key MUST be omitted and any field of table that cannot be NULL must be specified in 'notnull_field'. This is used in negative lookups ("NONE of linked values match") for NULL checks.
neg => boolean value (1 or 0) 
If a positive term corresponds to negative lookup, search function can specify a true value for this key. Bugzilla search engine will then negate the term itself.
many_rows => boolean value (1 or 0) 
If many_rows is true, this term will be wrapped into a (SELECT DISTINCT bug_id ...) subquery when INNER JOINed to different terms with many_rows. Use it when 'table' could have relatively many rows for a single bug.
description => [ 'field', 'operator', 'value' ]

CORRELATED SEARCH TERMS

There is also a subclass of "list" conditions - conditions that can be correlated. I.e., conditions that can possibly match different fields of one entity linked to a bug (for example, comment date and comment author). If the user specifies "comment date is ..." AND "comment author is ...", then he probably means to find ONE comment with date and author matching these terms.

To specify such conditions, search term must have different keys:

base_table => "string" 
The name of base table in which the entities linked to a bug are stored without any aliases, subqueries or JOINs.
base_joins => [ [ 'LEFT'|'INNER', <table name>, <alias>, <on> ], ... ] 
The names of tables which must be additionally joined to base table to run the term. LEFT|INNER is join type. <alias> MUST be non-unique (MUST not contain $self->{sequence}). When two such terms are united, base_joins duplicates are filtered based on <alias>. Unique aliases are then generated automatically by SQL core.
where => string or [ string1, string2, ... ]
bugid_field => "string"
notnull_field => "string"
neg => 1|0
many_rows => 1|0
description => [ 'field', 'operator', 'value' ] 
All these fields have meaning identical to their meaning in simple "list" conditions.

EXPLANATION

As it was mentioned above, original boolean charts in Bugzilla were "constraints on a single piece of data".

This magic enables certain amount of power in queries, but is useful only when the binary relation between the field and value is not enough to spell the real condition.

Such conditions are possible only for ANDed terms, and only for "list" bug fields, i.e. when there can be more than one entity linked to bug: ((e1 in A) & (e2 in B)) is not necessarily equal to (e in (A intersect B)), which the user probably meant.

When there is no more than 1 entity linked to a bug, the situation ((e1 in A) & (e2 in B)) is impossible, as e1 and e2 may not be different. So the condition is always (e in (A intersect B)). Also, they are impossible for OR: swap e1 and e2 in ((e1 in A) | (e2 in B)), and you'll get ((e2 in A) | (e1 in B)), OR it with unswapped term, and you'll get (e in (A union B)).

In Bugzilla, this is possible for bug comments, bug changes, flags and attachments. I.e. if somebody writes:

   commented after 2010-01-01
   AND commented before 2010-02-01
   AND commented by foo@bar.org

Then he probably means to search for bugs with a comment from foo@bar.org left between 2010-01-01 and 2010-02-01, not for bugs with 3 different comments corresponding to single terms.

EXPRESSIONS

Although this should be needed very rarely, search functions can also return expressions consisting of several single conditions.

ATTENTION! Functions that return expressions MUST specify $term->{description} for each single term inside this expression to correctly generate search descriptions in the UI.

An expression is an arrayref with the operation name as the first element and operation arguments as the following ones. Possible operation names are 'OR', 'AND', 'OR_MANY', 'AND_MANY', 'OR_NE', 'AND_NE'. Each argument may either be an expression itself, or a search term in a format returned by search functions.

OR is replaced with SQL UNION.

There are (partially experimental) variations of OR:

OR_NE is not expanded when its outer expression is expanded, i.e. AND(OR(a,b),OR_NE(c,d)) becomes OR(AND(a,OR_NE(c,d)),AND(b,OR_NE(c,d))).

OR_MANY is attached as LEFT JOINs instead of an INNER JOIN to UNION, if there are any terms ANDed with it.

Example: ((assignee=X) and (changed since 2008 or commented since 2008))

The second, predictably, produces an insane amount of rows itself, so it's faster to scan rows selected by first and check the second on them. But, it's faster to select simply (changed since 2008 or commented since 2008) using indexes and UNION.