This document provides an overview of how the query planner and optimizer
for SQLite work.
Given a single SQL statement, there might be dozens, hundreds, or even
thousands of ways to implement that statement, depending on the complexity
of the statement itself and of the underlying database schema. The
task of the query planner is to select an algorithm from among the many
choices that provides the answer with a minimum of disk I/O and CPU
overhead.
With release 3.8.0, the SQLite query planner was reimplemented as the
Next Generation Query Planner or "NGQP". All of the features, techniques,
and algorithms described in this document are applicable to both the
pre-3.8.0 legacy query planner and to the NGQP. For further information on
how the NGQP differs from the legacy query planner, see the
detailed description of the NGQP.
The WHERE clause on a query is broken up into "terms" where each term
is separated from the others by an AND operator.
If the WHERE clause is composed of constraints separated by the OR
operator then the entire clause is considered to be a single "term"
to which the OR-clause optimization is applied.
All terms of the WHERE clause are analyzed to see if they can be
satisfied using indices.
To be usable by an index a term must be of one of the following
forms:
column = expression
column > expression
column >= expression
column < expression
column <= expression
expression = column
expression > column
expression >= column
expression < column
expression <= column
column IN (expression-list)
column IN (subquery)
column IS NULL
If an index is created using a statement like this:
CREATE INDEX idx_ex1 ON ex1(a,b,c,d,e,...,y,z);
Then the index might be used if the initial columns of the index
(columns a, b, and so forth) appear in WHERE clause terms.
The initial columns of the index must be used with
the = or IN or IS NULL operators.
The right-most column that is used can employ inequalities.
For the right-most
column of an index that is used, there can be up to two inequalities
that must sandwich the allowed values of the column between two extremes.
It is not necessary for every column of an index to appear in a
WHERE clause term in order for that index to be used.
But there can not be gaps in the columns of the index that are used.
Thus for the example index above, if there is no WHERE clause term
that constrains column c, then terms that constrain columns a and b can
be used with the index but not terms that constrain columns d through z.
Similarly, index columns will not normally be used (for indexing purposes)
if they are to the right of a
column that is constrained only by inequalities.
(See the skip-scan optimization below for the exception.)
For the index above and WHERE clause like this:
... WHERE a=5 AND b IN (1,2,3) AND c IS NULL AND d='hello'
The first four columns a, b, c, and d of the index would be usable since
those four columns form a prefix of the index and are all bound by
equality constraints.
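This prefix behavior can be observed directly with EXPLAIN QUERY PLAN. The following is a minimal sketch using Python's standard sqlite3 module; the ex1 table is recreated from the example above with only columns a through e for brevity:

```python
import sqlite3

# Hypothetical table mirroring the idx_ex1 example from the text
# (truncated to columns a..e for brevity).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ex1(a, b, c, d, e);
    CREATE INDEX idx_ex1 ON ex1(a, b, c, d, e);
""")

# Ask the planner how it would execute the four-column query.
plan_rows = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM ex1 WHERE a=5 AND b IN (1,2,3) AND c IS NULL AND d='hello'"
).fetchall()
plan = " ".join(row[-1] for row in plan_rows)
print(plan)  # e.g. a SEARCH of ex1 using index idx_ex1
```

The plan text should mention idx_ex1, confirming that all four equality-bound prefix columns are usable.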
For the index above and WHERE clause like this:
... WHERE a=5 AND b IN (1,2,3) AND c>12 AND d='hello'
Only columns a, b, and c of the index would be usable. The d column
would not be usable because it occurs to the right of c and c is
constrained only by inequalities.
For the index above and WHERE clause like this:
... WHERE a=5 AND b IN (1,2,3) AND d='hello'
Only columns a and b of the index would be usable. The d column
would not be usable because column c is not constrained and there can
be no gaps in the set of columns that are usable by the index.
For the index above and WHERE clause like this:
... WHERE b IN (1,2,3) AND c NOT NULL AND d='hello'
The index is not usable at all because the left-most column of the
index (column "a") is not constrained. Assuming there are no other
indices, the query above would result in a full table scan.
For the index above and WHERE clause like this:
... WHERE a=5 OR b IN (1,2,3) OR c NOT NULL OR d='hello'
The index is not usable because the WHERE clause terms are connected
by OR instead of AND. This query would result in a full table scan.
However, if three additional indices were added that contained columns
b, c, and d as their left-most columns, then the
OR-clause optimization might apply.
If a term of the WHERE clause is of the following form:
expr1 BETWEEN expr2 AND expr3
Then two "virtual" terms are added as follows:
expr1 >= expr2 AND expr1 <= expr3
Virtual terms are used for analysis only and do not cause any VDBE
code to be generated.
If both virtual terms end up being used as constraints on an index,
then the original BETWEEN term is omitted and the corresponding test
is not performed on input rows.
Thus if the BETWEEN term ends up being used as an index constraint
no tests are ever performed on that term.
On the other hand, the
virtual terms themselves never cause tests to be performed on
input rows.
Thus if the BETWEEN term is not used as an index constraint and
instead must be used to test input rows, the expr1 expression is
only evaluated once.
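The effect of the two virtual terms can be checked with EXPLAIN QUERY PLAN. A sketch using Python's sqlite3 module (the table and column names are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t(x);
    CREATE INDEX t_x ON t(x);
""")

# The BETWEEN term should show up as an index range search, i.e. the
# virtual terms x>=10 and x<=20 constrain the index t_x.
rows = con.execute(
    "EXPLAIN QUERY PLAN SELECT x FROM t WHERE x BETWEEN 10 AND 20"
).fetchall()
between_plan = " ".join(r[-1] for r in rows)
print(between_plan)
```

A SEARCH step mentioning t_x indicates the BETWEEN was decomposed into an index range constraint rather than a full scan.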
WHERE clause constraints that are connected by OR instead of AND can
be handled in two different ways.
If a term consists of multiple subterms containing a common column
name and separated by OR, like this:
column = expr1 OR column = expr2 OR column = expr3 OR ...
Then that term is rewritten as follows:
column IN (expr1,expr2,expr3,...)
The rewritten term then might go on to constrain an index using the
normal rules for IN operators. Note that column must be
the same column in every OR-connected subterm,
although the column can occur on either the left or the right side of
the = operator.
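A quick sketch of the OR-to-IN rewrite, using Python's sqlite3 module (the table t and index t_c are invented for illustration): both the OR form and the equivalent IN form should produce an index search on the same index.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t(c, v);
    CREATE INDEX t_c ON t(c);
""")

def plan(sql):
    # Helper: concatenate the detail column of the query plan rows.
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# The OR form is rewritten as c IN (1,2,3), so both plans use t_c.
or_plan = plan("SELECT v FROM t WHERE c=1 OR c=2 OR c=3")
in_plan = plan("SELECT v FROM t WHERE c IN (1,2,3)")
print(or_plan)
print(in_plan)
```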
If and only if the previously described conversion of OR to an IN operator
does not work, the second OR-clause optimization is attempted.
Suppose the OR clause consists of multiple subterms as follows:
expr1 OR expr2 OR expr3
Individual subterms might be a single comparison expression like
a=5 or x>y or they can be LIKE or BETWEEN expressions, or a subterm
can be a parenthesized list of AND-connected sub-subterms.
Each subterm is analyzed as if it were itself the entire WHERE clause
in order to see if the subterm is indexable by itself.
If every subterm of an OR clause is separately indexable
then the OR clause might be coded such that a separate index is used
to evaluate each term of the OR clause. One way to think about how
SQLite uses separate indices for each OR clause term is to imagine
that the WHERE clause were rewritten as follows:
rowid IN (SELECT rowid FROM table WHERE expr1
          UNION SELECT rowid FROM table WHERE expr2
          UNION SELECT rowid FROM table WHERE expr3)
The rewritten expression above is conceptual; WHERE clauses containing
OR are not really rewritten this way.
The actual implementation of the OR clause uses a mechanism that is
more efficient than subqueries and which works even
for tables where the "rowid" column name has been
overloaded for other uses and no longer refers to the real rowid.
But the essence of the implementation is captured by the statement
above: Separate indices are used to find candidate result rows
from each OR clause term and the final result is the union of
those rows.
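The union-of-indices behavior can be sketched as follows with Python's sqlite3 module (the table t2, its indices, and the sample rows are invented; whether the multi-index plan is actually chosen remains the planner's decision):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t2(a, b);
    CREATE INDEX t2_a ON t2(a);
    CREATE INDEX t2_b ON t2(b);
    INSERT INTO t2 VALUES(1,'x'),(5,2),(9,9);
""")

# Each OR subterm is separately indexable, so the planner may probe
# t2_a for a=1 and t2_b for b=2 and union the candidate rows.
union_plan = " ".join(r[-1] for r in con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t2 WHERE a=1 OR b=2"))
print(union_plan)

# The result is the union of the rows matched by each subterm.
result = con.execute(
    "SELECT * FROM t2 WHERE a=1 OR b=2 ORDER BY a").fetchall()
print(result)  # [(1, 'x'), (5, 2)]
```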
Note that in most cases, SQLite will only use a single index for each
table in the FROM clause of a query. The second OR-clause optimization
described here is the exception to that rule. With an OR-clause,
a different index might be used for each subterm in the OR-clause.
For any given query, the fact that the OR-clause optimization described
here can be used does not guarantee that it will be used.
SQLite uses a cost-based query planner that estimates the CPU and
disk I/O costs of various competing query plans and chooses the plan
that it thinks will be the fastest. If there are many OR terms in
the WHERE clause or if some of the indices on individual OR-clause
subterms are not very selective, then SQLite might decide that it is
faster to use a different query algorithm, or even a full-table scan.
Application developers can use the
EXPLAIN QUERY PLAN prefix on a statement to get a
high-level overview of the chosen query strategy.
Terms that are composed of the LIKE or GLOB operator
can sometimes be used to constrain indices.
There are many conditions on this use:
The LIKE operator has two modes that can be set by a
pragma. The
default mode is for LIKE comparisons to be insensitive to differences
of case for latin1 characters. Thus, by default, the following
expression is true:
'a' LIKE 'A'
But if the case_sensitive_like pragma is enabled as follows:
PRAGMA case_sensitive_like=ON;
Then the LIKE operator pays attention to case and the example above would
evaluate to false. Note that case insensitivity only applies to
latin1 characters - basically the upper and lower case letters of English
in the lower 127 byte codes of ASCII. International character sets
are case sensitive in SQLite unless an application-defined
collating sequence and like() SQL function are provided that
take non-ASCII characters into account.
But if an application-defined collating sequence and/or like() SQL
function are provided, the LIKE optimization described here will never
be taken.
The LIKE operator is case insensitive by default because this is what
the SQL standard requires. You can change the default behavior at
compile time by using the SQLITE_CASE_SENSITIVE_LIKE command-line option
to the compiler.
The LIKE optimization might occur if the column named on the left of the
operator is indexed using the built-in BINARY collating sequence and
case_sensitive_like is turned on. Or the optimization might occur if
the column is indexed using the built-in NOCASE collating sequence and the
case_sensitive_like mode is off. These are the only two combinations
under which LIKE operators will be optimized.
The GLOB operator is always case sensitive. The column on the left side
of the GLOB operator must always use the built-in BINARY collating sequence
or no attempt will be made to optimize that operator with indices.
The LIKE optimization will only be attempted if
the right-hand side of the GLOB or LIKE operator is either
a literal string or a parameter that has been bound
to a string literal. The string literal must not
begin with a wildcard; if the right-hand side begins with a wildcard
character then this optimization is not attempted. If the right-hand side
is a parameter that is bound to a string, then this optimization is
only attempted if the prepared statement containing the expression
was compiled with sqlite3_prepare_v2() or sqlite3_prepare16_v2().
The LIKE optimization is not attempted if the
right-hand side is a parameter and the statement was prepared using
sqlite3_prepare() or sqlite3_prepare16().
The LIKE optimization is not attempted if there is an ESCAPE clause
on the LIKE operator.
Suppose the initial sequence of non-wildcard characters on the right-hand
side of the LIKE or GLOB operator is x. We are using a single
character to denote this non-wildcard prefix but the reader should
understand that the prefix can consist of more than 1 character.
Let y be the smallest string that is the same length as x but which
compares greater than x. For example, if x is hello then
y would be hellp.
The LIKE and GLOB optimizations consist of adding two virtual terms
like this:
column >= x AND column < y
Under most circumstances, the original LIKE or GLOB operator is still
tested against each input row even if the virtual terms are used to
constrain an index. This is because we do not know what additional
constraints may be imposed by characters to the right
of the x prefix. However, if there is only a single
global wildcard to the right of x, then the original LIKE or
GLOB test is disabled.
In other words, if the pattern is like this:
column LIKE x%
column GLOB x*
then the original LIKE or GLOB tests are disabled when the virtual
terms constrain an index because in that case we know that all of the
rows selected by the index will pass the LIKE or GLOB test.
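The interaction between case_sensitive_like and the LIKE optimization can be sketched as follows with Python's sqlite3 module (the table t and index t_name are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t(name TEXT);
    CREATE INDEX t_name ON t(name);  -- BINARY collation by default
""")

def plan(sql):
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# With the default case-insensitive LIKE and a BINARY index, the
# optimization is not allowed, so the table is scanned.
scan_plan = plan("SELECT name FROM t WHERE name LIKE 'hell%'")

# After enabling case_sensitive_like, the same query can use the index
# as the range  name >= 'hell' AND name < 'helm'.
con.execute("PRAGMA case_sensitive_like=ON")
range_plan = plan("SELECT name FROM t WHERE name LIKE 'hell%'")
print(scan_plan)
print(range_plan)
```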
Note that when the right-hand side of a LIKE or GLOB operator is a parameter and the statement is prepared using sqlite3_prepare_v2() or sqlite3_prepare16_v2() then the statement is automatically reparsed and recompiled on the first sqlite3_step() call of each run if the binding to the right-hand side parameter has changed since the previous run. This reparse and recompile is essentially the same action that occurs following a schema change. The recompile is necessary so that the query planner can examine the new value bound to the right-hand side of the LIKE or GLOB operator and determine whether or not to employ the optimization described above.
The general rule is that indexes are only useful if there are
WHERE-clause constraints on the left-most columns of the index.
However, in some cases,
SQLite is able to use an index even if the first few columns of
the index are omitted from the WHERE clause but later columns
are included.
Consider a table such as the following:
CREATE TABLE people(
  name TEXT PRIMARY KEY,
  role TEXT NOT NULL,
  height INT NOT NULL,   -- in cm
  CHECK( role IN ('student','teacher') )
);
CREATE INDEX people_idx1 ON people(role, height);
The people table has one entry for each person in a large
organization. Each person is either a "student" or a "teacher",
as determined by the "role" field. And we record the height in
centimeters of each person. The role and height are indexed.
Notice that the left-most column of the index is not very
selective - it only contains two possible values.
Now consider a query to find the names of everyone in the
organization that is 180cm tall or taller:
SELECT name FROM people WHERE height>=180;
Because the left-most column of the index does not appear in the
WHERE clause of the query, one is tempted to conclude that the
index is not usable here. But SQLite is able to use the index.
Conceptually, SQLite uses the index as if the query were more
like the following:
SELECT name FROM people
 WHERE role IN (SELECT DISTINCT role FROM people)
   AND height>=180;
Or this:
SELECT name FROM people WHERE role='teacher' AND height>=180
UNION ALL
SELECT name FROM people WHERE role='student' AND height>=180;
The alternative query formulations shown above are conceptual only.
SQLite does not really transform the query.
The actual query plan is like this:
SQLite locates the first possible value for "role", which it
can do by rewinding the "people_idx1" index to the beginning and reading
the first record. SQLite stores this first "role" value in an
internal variable that we will here call "$role". Then SQLite
runs a query like: "SELECT name FROM people WHERE role=$role AND height>=180".
This query has an equality constraint on the left-most column of the
index and so the index can be used to resolve that query. Once
that query is finished, SQLite then uses the "people_idx1" index to
locate the next value of the "role" column, using code that is logically
similar to "SELECT role FROM people WHERE role>$role LIMIT 1".
This new "role" value overwrites the $role variable, and the process
repeats until all possible values for "role" have been examined.
We call this kind of index usage a "skip-scan" because the database
engine is basically doing a full scan of the index but it optimizes the
scan (making it less than "full") by occasionally skipping ahead to the
next candidate value.
SQLite might use a skip-scan on an index if it knows that the first
one or more columns contain many duplicate values.
If there are too few duplicates
in the left-most columns of the index, then it would
be faster to simply step ahead to the next value, and thus do
a full table scan, than to do a binary search on an index to locate
the next left-column value.
The only way that SQLite can know that the left-most columns of an index
have many duplicates is if the ANALYZE command has been run
on the database.
Without the results of ANALYZE, SQLite has to guess at the "shape" of
the data in the table, and the default guess is that there are an average
of 10 duplicates for every value in the left-most column of the index.
But skip-scan only becomes profitable (it only gets to be faster than
a full table scan) when the number of duplicates is about 18 or more.
Hence, a skip-scan is never used on a database that has not been analyzed.
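The relationship between ANALYZE and skip-scan can be sketched as follows (Python's sqlite3 module; the data set is invented, and whether skip-scan is actually chosen still depends on the SQLite version and the gathered statistics):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE people(
      name TEXT PRIMARY KEY,
      role TEXT NOT NULL,
      height INT NOT NULL
    );
    CREATE INDEX people_idx1 ON people(role, height);
""")

# Invented sample data: only two roles, so the left-most index column
# has about 100 duplicates per value -- well over the ~18 threshold.
con.executemany(
    "INSERT INTO people VALUES(?,?,?)",
    [("p%d" % i, "student" if i % 2 else "teacher", 150 + i % 60)
     for i in range(200)])
con.execute("ANALYZE")

# ANALYZE records, per index, the total row count followed by the
# average number of rows per distinct prefix of index columns.
stat = con.execute(
    "SELECT stat FROM sqlite_stat1 WHERE idx='people_idx1'").fetchone()[0]
print(stat)  # first two numbers: 200 rows, ~100 rows per role

skip_plan = " ".join(r[-1] for r in con.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM people WHERE height>=180"))
print(skip_plan)  # may mention people_idx1 if skip-scan is chosen
```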
The ON and USING clauses of an inner join are converted into additional
terms of the WHERE clause prior to WHERE clause analysis described
above in paragraph 1.0. Thus with SQLite, there is no computational
advantage to using the newer SQL92 join syntax
over the older SQL89 comma-join syntax. They both end up accomplishing
exactly the same thing on inner joins.
For a LEFT OUTER JOIN the situation is more complex. The following
two queries are not equivalent:
SELECT * FROM tab1 LEFT JOIN tab2 ON tab1.x=tab2.y;
SELECT * FROM tab1 LEFT JOIN tab2 WHERE tab1.x=tab2.y;
For an inner join, the two queries above would be identical. But
special processing applies to the ON and USING clauses of an OUTER join:
specifically, the constraints in an ON or USING clause do not apply if
the right table of the join is on a null row, but the constraints do apply
in the WHERE clause. The net effect is that putting the ON or USING
clause expressions for a LEFT JOIN in the WHERE clause effectively converts
the query to an
ordinary INNER JOIN - albeit an inner join that runs more slowly.
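The difference is easy to demonstrate with Python's sqlite3 module (tiny invented data; tab1 has a row with no match in tab2):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tab1(x);
    CREATE TABLE tab2(y);
    INSERT INTO tab1 VALUES(1),(2);
    INSERT INTO tab2 VALUES(1);
""")

# Constraint in the ON clause: unmatched tab1 rows survive,
# paired with a NULL row from tab2.
on_rows = con.execute(
    "SELECT * FROM tab1 LEFT JOIN tab2 ON tab1.x=tab2.y ORDER BY tab1.x"
).fetchall()

# Same constraint in the WHERE clause: the NULL row is filtered out,
# so the query behaves like a plain inner join.
where_rows = con.execute(
    "SELECT * FROM tab1 LEFT JOIN tab2 WHERE tab1.x=tab2.y ORDER BY tab1.x"
).fetchall()
print(on_rows)     # [(1, 1), (2, None)]
print(where_rows)  # [(1, 1)]
```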
The current implementation of
SQLite uses only loop joins. That is to say, joins are implemented as
nested loops.
The default order of the nested loops in a join is for the left-most
table in the FROM clause to form the outer loop and the right-most
table to form the inner loop.
However, SQLite will nest the loops in a different order if doing so
will help it to select better indices.
Inner joins can be freely reordered. However a left outer join is
neither commutative nor associative and hence will not be reordered.
Inner joins to the left and right of the outer join might be reordered
if the optimizer thinks that is advantageous but the outer joins are
always evaluated in the order in which they occur.
SQLite treats the CROSS JOIN operator specially.
The CROSS JOIN operator is commutative in theory. But SQLite chooses to
never reorder tables in a CROSS JOIN. This provides a mechanism
by which the programmer can force SQLite to choose a particular loop nesting
order.
When selecting the order of tables in a join, SQLite uses an efficient
polynomial-time algorithm. Because of this,
SQLite is able to plan queries with 50- or 60-way joins in a matter of
microseconds.
Join reordering is automatic and usually works well enough that
programmers do not have to think about it, especially if ANALYZE
has been used to gather statistics about the available indices.
But occasionally some hints from the programmer are needed.
Consider, for example, the following schema:
CREATE TABLE node(
  id INTEGER PRIMARY KEY,
  name TEXT
);
CREATE INDEX node_idx ON node(name);
CREATE TABLE edge(
  orig INTEGER REFERENCES node,
  dest INTEGER REFERENCES node,
  PRIMARY KEY(orig, dest)
);
CREATE INDEX edge_idx ON edge(dest,orig);
The schema above defines a directed graph with the ability to store a
name at each node. Now consider a query against this schema:
SELECT * FROM edge AS e,
              node AS n1,
              node AS n2
 WHERE n1.name = 'alice'
   AND n2.name = 'bob'
   AND e.orig = n1.id
   AND e.dest = n2.id;
This query asks for all information about edges that go from
nodes labeled "alice" to nodes labeled "bob".
The query optimizer in SQLite has basically two choices on how to
implement this query. (There are actually six different choices, but
we will only consider two of them here.)
The pseudocode below demonstrates these two choices.
Option 1:
foreach n1 where n1.name='alice' do:
  foreach n2 where n2.name='bob' do:
    foreach e where e.orig=n1.id and e.dest=n2.id
      return n1.*, n2.*, e.*
    end
  end
end
Option 2:
foreach n1 where n1.name='alice' do:
  foreach e where e.orig=n1.id do:
    foreach n2 where n2.id=e.dest and n2.name='bob' do:
      return n1.*, n2.*, e.*
    end
  end
end
The same indices are used to speed up every loop in both implementation
options.
The only difference in these two query plans is the order in which
the loops are nested.
So which query plan is better? It turns out that the answer depends on
what kind of data is found in the node and edge tables.
Let the number of alice nodes be M and the number of bob nodes be N.
Consider two scenarios. In the first scenario, M and N are both 2 but
there are thousands of edges on each node. In this case, option 1 is
preferred. With option 1, the inner loop checks for the existence of
an edge between a pair of nodes and outputs the result if found.
But because there are only 2 alice and bob nodes each, the inner loop
only has to run 4 times and the query is very quick. Option 2 would
take much longer here. The outer loop of option 2 only executes twice,
but because there are a large number of edges leaving each alice node,
the middle loop has to iterate many thousands of times. It will be
much slower. So in the first scenario, we prefer to use option 1.
Now consider the case where M and N are both 3500. Alice nodes are
abundant. But suppose each of these nodes is connected by only one
or two edges. In this case, option 2 is preferred. With option 2,
the outer loop still has to run 3500 times, but the middle loop only
runs once or twice for each outer loop and the inner loop will only
run once for each middle loop, if at all. So the total number of
iterations of the inner loop is around 7000. Option 1, on the other
hand, has to run both its outer loop and its middle loop 3500 times
each, resulting in 12 million iterations of the middle loop.
Thus in the second scenario, option 2 is nearly 2000 times faster
than option 1.
So you can see that depending on how the data is structured in the table,
either query plan 1 or query plan 2 might be better. Which plan does
SQLite choose by default? As of version 3.6.18, without running ANALYZE,
SQLite will choose option 2.
But if the ANALYZE command is run in order to gather statistics,
a different choice might be made if the statistics indicate that the
alternative is likely to run faster.
SQLite provides the ability for advanced programmers to exercise control
over the query plan chosen by the optimizer. One method for doing this
is to fudge the ANALYZE results in the sqlite_stat1,
sqlite_stat3, and/or sqlite_stat4 tables. That approach is not
recommended except for the one scenario described in the next paragraph.
For a program that uses an SQLite database as its
application file-format,
when a new database instance is first created the ANALYZE
command is ineffective because the database contains no data from which
to gather statistics. In that case, one could construct a large prototype
database containing typical data during development and run the
ANALYZE command on this prototype database to gather statistics,
then save the prototype statistics as part of the application.
After deployment, when the application goes to create a new database file,
it can run the ANALYZE command in order to create the statistics
tables, then copy the precomputed statistics obtained
from the prototype database into these new statistics tables.
In that way, statistics from large working data sets can be preloaded
into newly created application files.
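That workflow can be sketched as follows with Python's sqlite3 module. The file names and the ex/ex_a schema are invented for illustration, and the sketch assumes that running ANALYZE creates the sqlite_stat1 table even when there is nothing yet to analyze:

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
proto_path = os.path.join(tmp, "prototype.db")
app_path = os.path.join(tmp, "app.db")

schema = """
    CREATE TABLE ex(a, b);
    CREATE INDEX ex_a ON ex(a);
"""

# Prototype database filled with representative (invented) data.
proto = sqlite3.connect(proto_path)
proto.executescript(schema)
proto.executemany("INSERT INTO ex VALUES(?,?)",
                  [(i % 10, i) for i in range(1000)])
proto.commit()
proto.execute("ANALYZE")
stats = proto.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
proto.close()

# Fresh application database: run ANALYZE once so the statistics tables
# exist (assumed behavior), then overwrite them with the prototype's stats.
app = sqlite3.connect(app_path)
app.executescript(schema)
app.execute("ANALYZE")
app.execute("DELETE FROM sqlite_stat1")
app.executemany("INSERT INTO sqlite_stat1 VALUES(?,?,?)", stats)
app.commit()
app.close()

# Reopen so a new connection sees the preloaded statistics.
app = sqlite3.connect(app_path)
loaded = app.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(loaded)
```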
Programmers can force SQLite to use a particular loop nesting order
for a join by using the CROSS JOIN operator instead of just JOIN,
INNER JOIN, NATURAL JOIN, or a "," join. Though CROSS JOINs are
commutative in theory, SQLite chooses to never reorder the tables in
a CROSS JOIN. Hence, the left table of a CROSS JOIN will always be
in an outer loop relative to the right table.
In the following query, the optimizer is free to reorder the
tables of the FROM clause in any way it sees fit:
SELECT * FROM node AS n1,
              edge AS e,
              node AS n2
 WHERE n1.name = 'alice'
   AND n2.name = 'bob'
   AND e.orig = n1.id
   AND e.dest = n2.id;
But in the following logically equivalent formulation of the same query,
the substitution of "CROSS JOIN" for the "," means that the order
of tables must be N1, E, N2.
SELECT * FROM node AS n1 CROSS JOIN
              edge AS e CROSS JOIN
              node AS n2
 WHERE n1.name = 'alice'
   AND n2.name = 'bob'
   AND e.orig = n1.id
   AND e.dest = n2.id;
In the latter query, the query plan must be
option 2. Note that
you must use the keyword "CROSS" in order to disable the table reordering
optimization; INNER JOIN, NATURAL JOIN, JOIN, and other similar
combinations work just like a comma join in that the optimizer is
free to reorder tables as it sees fit. (Table reordering is also
disabled on an outer join, but that is because outer joins are not
associative or commutative. Reordering tables in in OUTER JOIN changes
the result.)
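The effect of CROSS JOIN on nesting order can be observed directly with EXPLAIN QUERY PLAN. The following is a minimal sketch using Python's bundled sqlite3 module; the schema mirrors the node/edge example above, and the exact plan wording varies between SQLite versions, but with CROSS JOIN the loop order always comes out as n1, e, n2.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE node(id INTEGER PRIMARY KEY, name TEXT);
    CREATE INDEX node_name ON node(name);
    CREATE TABLE edge(orig INT, dest INT);
    CREATE INDEX edge_orig ON edge(orig);
""")

def plan_steps(sql):
    # Column 3 of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

# With CROSS JOIN, n1 must be the outermost loop, then e, then n2.
steps = plan_steps("""
    SELECT * FROM node AS n1 CROSS JOIN edge AS e CROSS JOIN node AS n2
     WHERE n1.name = 'alice' AND n2.name = 'bob'
       AND e.orig = n1.id AND e.dest = n2.id
""")
for s in steps:
    print(s)
```

With a comma join in place of CROSS JOIN, the same three steps may appear in whatever order the planner estimates is cheapest.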
See "The Fossil NGQP Upgrade Case Study" for another real-world example
of using CROSS JOIN to manually control the nesting order of a join.
The query planner checklist found later in the same document provides
further guidance on manual control of the query planner.
Each table in the FROM clause of a query can use at most one index
(except when the OR-clause optimization comes into
play)
and SQLite strives to use at least one index on each table. Sometimes,
two or more indices might be candidates for use on a single table.
For example:
CREATE TABLE ex2(x,y,z);
CREATE INDEX ex2i1 ON ex2(x);
CREATE INDEX ex2i2 ON ex2(y);
SELECT z FROM ex2 WHERE x=5 AND y=6;
For the SELECT statement above, the optimizer can use the ex2i1 index
to look up rows of ex2 that contain x=5 and then test each row against
the y=6 term. Or it can use the ex2i2 index to look up rows
of ex2 that contain y=6 and then test each of those rows against the
x=5 term.
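That single-index-per-table rule can be seen in the query plan. In this sketch (Python's sqlite3 module, same schema as above), which of ex2i1 or ex2i2 wins depends on the planner's internal estimates, so the check is only that exactly one of them appears:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ex2(x,y,z);
    CREATE INDEX ex2i1 ON ex2(x);
    CREATE INDEX ex2i2 ON ex2(y);
""")
# The planner must commit to one index for ex2; the other term is
# checked as a filter against each fetched row.
detail = " ".join(row[3] for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT z FROM ex2 WHERE x=5 AND y=6"))
print(detail)
```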
When faced with a choice of two or more indices, SQLite tries to estimate
the total amount of work needed to perform the query using each option.
It then selects the option that gives the least estimated work.
To help the optimizer get a more accurate estimate of the work involved
in using various indices, the user may optionally run the ANALYZE command.
The ANALYZE command scans all indices of the database where there might
be a choice between two or more indices and gathers statistics on the
selectiveness of those indices. The statistics gathered by
this scan are stored in special tables of the database whose names all
begin with "sqlite_stat".
The content of these tables is not updated as the database
changes so after making significant changes it might be prudent to
rerun ANALYZE.
The results of an ANALYZE command are only available to database connections
that are opened after the ANALYZE command completes.
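A sketch of what ANALYZE actually records, again via Python's sqlite3 module. After loading 1000 rows in which column x takes only 10 distinct values, the sqlite_stat1 "stat" column holds the total row count followed by the average number of rows matched per indexed-column prefix:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ex2(x,y,z)")
con.execute("CREATE INDEX ex2i1 ON ex2(x)")
# 1000 rows; x cycles through 10 values, so each x value matches ~100 rows.
con.executemany("INSERT INTO ex2 VALUES(?,?,?)",
                [(i % 10, i % 100, i) for i in range(1000)])
con.execute("ANALYZE")
stats = {idx: stat for tbl, idx, stat in
         con.execute("SELECT tbl, idx, stat FROM sqlite_stat1")}
print(stats)
```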
The various sqlite_statN tables contain information on how
selective the various indices are. For example, the sqlite_stat1
table might indicate that an equality constraint on column x reduces the
search space to 10 rows on average, whereas an equality constraint on
column y reduces the search space to 3 rows on average. In that case,
SQLite would prefer to use index ex2i2 since that index is more selective.
Terms of the WHERE clause can be manually disqualified for use with
indices by prepending a unary + operator to the column name. The
unary + is a no-op and will not generate any byte code in the prepared
statement.
But the unary + operator will prevent the term from constraining an index.
So, in the example above, if the query were rewritten as:
SELECT z FROM ex2 WHERE +x=5 AND y=6;
The + operator on the x column will prevent that term from
constraining an index. This would force the use of the ex2i2 index.
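The steering effect of the unary + is visible in the plan. A minimal sketch with the same two-index schema: "+x=5" hides x from the index machinery, so the query must be driven by ex2i2.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ex2(x,y,z);
    CREATE INDEX ex2i1 ON ex2(x);
    CREATE INDEX ex2i2 ON ex2(y);
""")
# The +x=5 term cannot use ex2i1, leaving ex2i2 as the only indexed choice.
detail = " ".join(row[3] for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT z FROM ex2 WHERE +x=5 AND y=6"))
print(detail)
```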
Note that the unary + operator also removes
type affinity from
an expression, and in some cases this can cause subtle changes in
the meaning of an expression.
In the example above,
if column x has TEXT affinity
then the comparison "x=5" will be done as text. But the + operator
removes the affinity. So the comparison "+x=5" will compare the text
in column x with the numeric value 5 and will always be false.
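This affinity side effect is easy to demonstrate. In the sketch below (table t is a hypothetical stand-in), column x has TEXT affinity, so "x=5" coerces the literal 5 to '5' and matches, while "+x=5" strips the affinity and compares text against a number, which is never true:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(x TEXT)")
con.execute("INSERT INTO t VALUES('5')")
# With affinity: 5 is converted to '5' before comparing -> match.
with_affinity = con.execute(
    "SELECT count(*) FROM t WHERE x=5").fetchone()[0]
# Without affinity: text '5' vs integer 5 are different storage classes.
no_affinity = con.execute(
    "SELECT count(*) FROM t WHERE +x=5").fetchone()[0]
print(with_affinity, no_affinity)
```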
Consider a slightly different scenario:
CREATE TABLE ex2(x,y,z);
CREATE INDEX ex2i1 ON ex2(x);
CREATE INDEX ex2i2 ON ex2(y);
SELECT z FROM ex2 WHERE x BETWEEN 1 AND 100 AND y BETWEEN 1 AND 100;
Further suppose that column x contains values spread out
between 0 and 1,000,000 and column y contains values
that span between 0 and 1,000. In that scenario,
the range constraint on column x should reduce the search space by
a factor of 10,000 whereas the range constraint on column y should
reduce the search space by a factor of only 10. So the ex2i1 index
should be preferred.
SQLite will make this determination, but only if it has been compiled
with SQLITE_ENABLE_STAT3 or SQLITE_ENABLE_STAT4.
The SQLITE_ENABLE_STAT3 and SQLITE_ENABLE_STAT4 options cause
the ANALYZE command to collect a histogram of column content in the
sqlite_stat3 or sqlite_stat4 tables and to use this histogram to
make a better guess at the best query to use for range constraints
such as the above. The main difference between STAT3 and STAT4 is
that STAT3 records histogram data for only the left-most column of
an index whereas STAT4 records histogram data for all columns of an
index. For single-column indexes, STAT3 and STAT4 work the same.
The histogram data is only useful if the right-hand side of the constraint
is a simple compile-time constant or parameter and not an expression.
Another limitation of the histogram data is that it only applies to the
left-most column of an index. Consider this scenario:
CREATE TABLE ex3(w,x,y,z);
CREATE INDEX ex3i1 ON ex3(w, x);
CREATE INDEX ex3i2 ON ex3(w, y);
SELECT z FROM ex3 WHERE w=5 AND x BETWEEN 1 AND 100 AND y BETWEEN 1 AND 100;
Here the inequalities are on columns x and y, neither of which is the
left-most column of its index. Hence, the histogram data, which is
collected only for the left-most column of each index, is useless in
helping to choose between the range constraints on columns x and y.
When doing an indexed lookup of a row, the usual procedure is to
do a binary search on the index to find the index entry, then extract
the rowid from the index and use that rowid to do a binary search on
the original table. Thus a typical indexed lookup involves two
binary searches.
If, however, all columns that were to be fetched from the table are
already available in the index itself, SQLite will use the values
contained in the index and will never look up the original table
row. This saves one binary search for each row and can make many
queries run twice as fast.
When an index contains all of the data needed for a query and when the
original table never needs to be consulted, we call that index a
"covering index".
SQLite attempts to use an index to satisfy the ORDER BY clause of a
query when possible.
When faced with the choice of using an index to satisfy WHERE clause
constraints or satisfying an ORDER BY clause, SQLite does the same
work analysis described above
and chooses the index that it believes will result in the fastest answer.
SQLite will also attempt to use indices to help satisfy GROUP BY clauses
and the DISTINCT keyword. If the nested loops of the join can be arranged
such that rows that are equivalent for the GROUP BY or for the DISTINCT
are consecutive, then the GROUP BY or DISTINCT logic can determine if the
current row is part of the same group or if the current row is distinct
simply by comparing the current row to the previous row.
This can be much faster than the alternative of comparing each row to
all prior rows.
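One hedged sketch of this, using hypothetical names t and tx: when DISTINCT can be served by an in-order index scan, the plan avoids the "USE TEMP B-TREE FOR DISTINCT" step that would otherwise be needed for deduplication.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(x)")
con.execute("CREATE INDEX tx ON t(x)")
# Scanning tx yields x values in sorted order, so equal values are
# consecutive and each row need only be compared to its predecessor.
detail = " ".join(row[3] for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT DISTINCT x FROM t"))
print(detail)
```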
When a subquery occurs in the FROM clause of a SELECT, the simplest
behavior is to evaluate the subquery into a transient table, then run
the outer SELECT against the transient table. But such a plan
can be suboptimal since the transient table will not have any indices
and the outer query (which is likely a join) will be forced to do a
full table scan on the transient table.
To overcome this problem, SQLite attempts to flatten subqueries in
the FROM clause of a SELECT.
This involves inserting the FROM clause of the subquery into the
FROM clause of the outer query and rewriting expressions in
the outer query that refer to the result set of the subquery.
For example:
SELECT a FROM (SELECT x+y AS a FROM t1 WHERE z<100) WHERE a>5
Would be rewritten using query flattening as:
SELECT x+y AS a FROM t1 WHERE z<100 AND a>5
There is a long list of conditions that must all be met in order for
query flattening to occur.
The casual reader is not expected to understand or remember any part of
the list above. The point of this list is to demonstrate
that the decision of whether or not to flatten a query is complex.
Query flattening is an important optimization when views are used as
each use of a view is translated into a subquery.
Queries of the following forms will be optimized to run in logarithmic
time assuming appropriate indices exist:
SELECT MIN(x) FROM table;
SELECT MAX(x) FROM table;
In order for these optimizations to occur, the query must appear in exactly
the form shown above - changing only the name of the table and column.
It is not permissible to add a WHERE clause or do any arithmetic on the
result. The result set must contain a single column.
The column in the MIN or MAX function must be an indexed column.
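A quick sketch of this optimization in action (t and idx_x are hypothetical names): with an index on x, MIN(x) and MAX(x) are each answered by a single seek to one end of the index rather than a full scan.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(x)")
con.execute("CREATE INDEX idx_x ON t(x)")
con.executemany("INSERT INTO t VALUES(?)", [(i,) for i in range(100)])
# Each query below is exactly the optimizable form: one aggregate, one
# column, no WHERE clause, no arithmetic on the result.
lo = con.execute("SELECT MIN(x) FROM t").fetchone()[0]
hi = con.execute("SELECT MAX(x) FROM t").fetchone()[0]
print(lo, hi)
```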
When no indices are available to aid the evaluation of a query, SQLite
might create an automatic index that lasts only for the duration
of a single SQL statement.
Since the cost of constructing the automatic index is
O(NlogN) (where N is the number of entries in the table) and the cost of
doing a full table scan is only O(N), an automatic index will
only be created if SQLite expects that the lookup will be run more than
logN times during the course of the SQL statement. Consider an example:
CREATE TABLE t1(a,b);
CREATE TABLE t2(c,d);
-- Insert many rows into both t1 and t2
SELECT * FROM t1, t2 WHERE a=c;
In the query above, if both t1 and t2 have approximately N rows, then
without any indices the query will require O(N*N) time. On the other
hand, creating an index on table t2 requires O(NlogN) time and then using
that index to evaluate the query requires an additional O(NlogN) time.
In the absence of ANALYZE information, SQLite guesses that N is one
million and hence it believes that constructing the automatic index will
be the cheaper approach.
An automatic index might also be used for a subquery:
CREATE TABLE t1(a,b);
CREATE TABLE t2(c,d);
-- Insert many rows into both t1 and t2
SELECT a, (SELECT d FROM t2 WHERE c=b) FROM t1;
In this example, the t2 table is used in a subquery to translate values
of the t1.b column. If each table contains N rows, SQLite expects that
the subquery will run N times, and hence it believes it will be faster
to construct an automatic, transient index on t2 first and then use
that index to satisfy the N instances of the subquery.
The automatic indexing capability can be disabled at run-time using
the automatic_index pragma. Automatic indexing is turned on by
default, but this can be changed so that automatic indexing is off
by default using the SQLITE_DEFAULT_AUTOMATIC_INDEX compile-time option.
The ability to create automatic indices can be completely disabled by
compiling with the SQLITE_OMIT_AUTOMATIC_INDEX compile-time option.
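A sketch of the pragma's effect, using the t1/t2 join from the first example above: with automatic indexing on (the default), the plan builds a transient index on t2; with the pragma off, the same join falls back to a plain scan of the inner table. (Without ANALYZE data, SQLite assumes the tables are large, so the automatic index is chosen even though these test tables are empty.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("CREATE TABLE t1(a,b); CREATE TABLE t2(c,d);")

def join_plan():
    return " ".join(row[3] for row in con.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM t1, t2 WHERE a=c"))

con.execute("PRAGMA automatic_index=ON")
print(join_plan())   # mentions an AUTOMATIC index on t2
con.execute("PRAGMA automatic_index=OFF")
print(join_plan())   # plain scan of both tables
```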
In SQLite version 3.8.0, an SQLITE_WARNING_AUTOINDEX message is sent
to the error log every time a statement is prepared that uses an
automatic index. Application developers can and should use these warnings
to identify the need for new persistent indices in the schema.
Future releases of SQLite may disable automatic indices by default.