The Tukwila Execution Interface

Zack Ives, May 29, 1999

This document is a work in progress.  Please see the latest version at: http://www.cs.washington.edu/education/courses/544/CurrentQtr/tukwila_query.html

Introduction

The Tukwila query processing components are designed to be self-contained modules that can be swapped out as needed.  Each of the main components (reformulator, optimizer, execution engine, and wrappers) is a separate code base, each optionally in a different language and on a different platform.  A sockets-based communication interface with a standardized request model allows us to interchange parts.

Executing a Plan

The query execution engine constantly monitors port 7777 for query execution requests.  A request is simply a physical query plan in the format described later in this document.  The data transfer format is a 32-bit big-endian integer specifying the length of the plan, followed by the ASCII text of the plan.
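For concreteness, the following is a minimal sketch, in Python, of how a client such as the optimizer might frame and submit a plan.  (Python, the host name, and the file name are illustrative choices, not part of Tukwila.)

import socket
import struct

def send_plan(sock, plan_text):
    # Wire format: a 32-bit big-endian length, then the ASCII text of the plan.
    data = plan_text.encode("ascii")
    sock.sendall(struct.pack(">I", len(data)) + data)

# Connect to the execution engine and submit a plan.  The socket is held
# open so that the status result can be read back later.
sock = socket.create_connection(("execution-host.example", 7777))
send_plan(sock, open("plan.xml").read())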

Once a query has been submitted to the execution system, it is expected that the optimizer will hold the socket open for the duration of query execution.  The optimizer should monitor the socket for a status result string sent back by the execution engine.  (Again, this is sent as a 32-bit big-endian length plus a string of bytes.)

The query result string, which is described in a later section, will begin with an execution status.  This status should be used as the basis for the optimizer's decision to terminate successfully, terminate with an error, or re-optimize the query.  If re-optimization is performed, the optimizer simply sends a new query plan through the same socket to the execution system.
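Continuing the sketch above, the optimizer's side of the exchange might look roughly like this.  The REPLAN status token and the reoptimize callback are placeholders, since the actual status strings are only described informally later in this document.

def recv_string(sock):
    # Read one length-prefixed message: 32-bit big-endian length, then bytes.
    def read_exactly(n):
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("socket closed by execution engine")
            buf += chunk
        return buf
    (length,) = struct.unpack(">I", read_exactly(4))
    return read_exactly(length).decode("ascii")

def run_query(sock, plan_text, reoptimize):
    # Submit a plan, then loop: if the returned status calls for
    # re-optimization, send a revised plan through the same socket.
    send_plan(sock, plan_text)
    while True:
        result = recv_string(sock)
        status = result.splitlines()[0]      # the response begins with a status
        if status != "REPLAN":               # illustrative status value
            return result
        send_plan(sock, reoptimize(result))  # hypothetical re-optimizer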

Query Plans in Tukwila

The basic layout for a query plan is as follows:

  1. Header and user information
  2. Query requests (generally only one per plan), each containing:
     a. Fragments (there may be multiple per plan), each containing:
        i.  Query operator graph
        ii. Fragment-local execution rules
     b. Global execution rules (i.e. rules across fragments)

More Detailed Sketch of a Plan

The header is fairly simple, and most of it can be repeated verbatim.  Much of the header information is for the benefit of XML parsers and for future use.  Note that "auser" is the default user ID for many of our wrappers, so use it unless you have a reason to do otherwise.

<?XML VERSION="1.00"?>
<!DOCTYPE "PLAN" SYSTEM "plan.dtd">

<PLAN ID=PlanName VERSION=1.00 ENCODING="PUBLICKEY">

<REQUESTOR>IP Address of source machine</REQUESTOR>
<REQUEST ID=RequestID RESULT=QueryRootNodeID USER="auser">

<FRAGMENT NUM=ExecutionStep>
<GRAPH>
    <OUTPUT NODE=FragmentRootNodeID />
    <N ID=NodeID OP=Operator attributes CARD=EstimatedSize>
    </N>
<!-- More nodes here -->
</GRAPH>
<RULES>
    <R ID=RuleID>
        <WHEN>Event</WHEN>
        <IF>ConditionTest</IF>
        <THEN>Action</THEN>
        <ELSE>Action</ELSE>
    </R>
<!-- More rules here -->
</RULES>
</FRAGMENT>
<!-- More fragments here -->

<!-- Global rules begin here -->
<RULES>
    <R ID=RuleID ACTIVE=state>
        <WHEN>Event</WHEN>
        <IF>ConditionTest</IF>
        <THEN>Action</THEN>
        <ELSE>Action</ELSE>
    </R>
<!-- More rules here -->
</RULES>
</REQUEST>
</PLAN>

Creating Plans for Tukwila

For the most part, constructing a plan is simply a matter of "filling in the template" above.  Each plan and request gets a unique ID.  The <REQUEST> element also has attributes specifying the user ID to run under and the root node of the entire query plan.

Each fragment within a request is given a number representing an execution step.  Fragments are executed in order of increasing step number.  (A future extension to Tukwila will allow multiple fragments to execute in the same step, but this is not currently supported.)

Within the fragment are an operator graph and a set of rules specific to the fragment.  The operator graph includes an <OUTPUT> element that specifies which node is the root of the fragment (and whose output should thus be written to disk); the NODE attribute of this element must reference one of the nodes in the fragment's graph.  Nodes are specified with the <N> element and include the operator type, the expected cardinality of the result, any children, and various other parameters.
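As a purely illustrative sketch, a plan generator might assemble <N> elements by filling in the template.  The helper below is hypothetical and performs no validation.

def node(node_id, op, children=(), card=None, extra_attrs="", **elements):
    # Assemble one <N> element in the plan format sketched above.
    attrs = [f"ID={node_id}", f"OP={op}"]
    attrs += [f"CHILD={c}" for c in children]
    if extra_attrs:
        attrs.append(extra_attrs)      # e.g. "TYPE=HASH SPACE=100000"
    if card is not None:
        attrs.append(f"CARD={card}")
    body = "".join(f"\n    <{tag}>{text}</{tag}>"
                   for tag, text in elements.items())
    return f"<N {' '.join(attrs)}>{body}\n</N>"

# For example, the SELECT node used in the sample plan later in this document:
# node("selectli", "SELECT", children=["lineitem"], card=600000,
#      WHERE="lineitem.L_SUPPKEY > 1000")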

The Operators

    <N ID=NodeID OP=JOIN TYPE=HASH CHILD=LeftChildNodeID
      CHILD=RightChildNodeID SPACE=KBytesToUse CARD=ExpectedCard>
        <WHERE>LeftChildNodeID.LeftChildAttrib =
        RightChildNodeID.RightChildAttrib</WHERE>
    </N>

    <N ID=NodeID OP=SELECT CHILD=ChildID CARD=ExpectedCard>
        <WHERE>ChildID.Attrib {relop} {expression}</WHERE>
    </N>

    <N ID=NodeID OP=PREFETCH TIMEOUT=TimeIn_ms CARD=ExpectedCard>
        <HOST>HostName:Port</HOST>
        <SERVICE>WrapperService</SERVICE>
        <QUERY>SQLQuery</QUERY>
    </N>

    <N ID=NodeID OP=WRAPPER TIMEOUT=TimeIn_ms CARD=ExpectedCard>
        <HOST>HostName:Port</HOST>
        <SERVICE>WrapperService</SERVICE>
        <QUERY>SQLQuery</QUERY>
    </N>

    <N ID=NodeID OP=PROJECT CHILD=ChildNodeID CARD=ExpectedCard>
        <ATTR>ChildNodeID.attribute</ATTR>
        <ATTR>ChildNodeID.attribute</ATTR>
        ...
    </N>

    <N ID=NodeID OP=COLLECTOR CHILD=ChildNodeID1 CHILD=ChildNodeID2
     CHILD=ChildNodeID3 SPACE=KBytesToUse CARD=ExpectedCard>
    </N>

Rules

A rule R is formally a quintuple <event, condition, action, active_flag, owner>, where the owner is a node, fragment, or plan; and the active_flag must be TRUE for the rule to be in the active set.

When an event E occurs, it is sent to the Tukwila event handler.  The event handler checks a hash table to see whether any rules in the active set match E; if so, it places the event into the event queue.  Events are dequeued individually, in the order they were detected, and all matching rules are triggered serially before the next event is processed.  The condition of each matching rule is tested; if it is met, the rule's actions are executed sequentially, and if it is not met and the rule has an optional ELSE clause, the alternate series of actions is executed instead.  Formally, the rules' conditions should all be tested in parallel, and all actions should be executed in parallel; our implementation does both of these serially.

There are no synchronization or locking semantics, so rules are not allowed to have conflicting actions; the rule generator must prevent this from happening.  Rules are not executed for operators that have been terminated, in order to prevent deadlocks and race conditions.  Once a rule has been executed, it becomes inactive, unless one of its actions is to reactivate the same rule.
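To make these semantics concrete, the following is a hypothetical Python sketch of the dispatch logic; the Rule class and the state argument are assumptions for illustration, not Tukwila's actual data structures.

from collections import deque

class Rule:
    def __init__(self, rule_id, event, condition, then_actions, else_actions=()):
        self.rule_id = rule_id
        self.event = event                  # event name this rule matches
        self.condition = condition          # callable: state -> bool
        self.then_actions = then_actions    # actions run when the condition holds
        self.else_actions = else_actions    # actions run otherwise (optional ELSE)
        self.active = True

class EventHandler:
    def __init__(self, rules):
        # Hash table from event name to the rules that can match it.
        self.rules_by_event = {}
        for r in rules:
            self.rules_by_event.setdefault(r.event, []).append(r)
        self.queue = deque()

    def post(self, event, state):
        # Queue an event only if some active rule matches it.
        if any(r.active for r in self.rules_by_event.get(event, [])):
            self.queue.append((event, state))

    def run(self):
        # Events are dequeued in arrival order; all rules matching an event
        # fire serially before the next event is processed.
        while self.queue:
            event, state = self.queue.popleft()
            for rule in self.rules_by_event.get(event, []):
                if not rule.active:
                    continue
                rule.active = False          # one-shot unless an action re-activates it
                actions = (rule.then_actions if rule.condition(state)
                           else rule.else_actions)
                for action in actions:       # actions are executed sequentially
                    action(state)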

Events

Conditions

Actions

Format

<RULES>
<R ID=ruleID [STATE=activity]>
<WHEN>event</WHEN>
<IF>condition</IF>
<THEN>action{;action}</THEN>
[<ELSE>action{;action}</ELSE>]
</R>
...
</RULES>

Execution Results

The query optimizer needs to monitor the socket continuously to see whether the execution system has returned a status message.  The execution response has the following format (a short parsing sketch follows the list):

  1. Status (followed by newline):
  2. One status line (with terminating newline) per node in the query plan, with the following items, each separated by a space:
  3. A series of messages generated during parsing and execution, preceded by the header "EXECUTION_MESSAGES" and a newline.  These should be ignored by the optimizer, but are provided for debugging purposes.
  4. A termination line reading "DONE", followed by a newline.  This marks the end of the results.
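As an illustration only (the exact status tokens and per-node fields are not enumerated here), a client might split a response into these four parts as follows:

def parse_response(text):
    # Split a response into: overall status, per-node status lines,
    # debugging messages, and the DONE terminator.
    lines = text.splitlines()
    status = lines[0]
    msg_start = lines.index("EXECUTION_MESSAGES") if "EXECUTION_MESSAGES" in lines else len(lines)
    done = lines.index("DONE") if "DONE" in lines else len(lines)
    node_lines = [line.split(" ") for line in lines[1:msg_start] if line]
    messages = lines[msg_start + 1:done]
    return status, node_lines, messages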

Sample Plan

<!--Sample Query execution plan for Tukwila System, zives 5/13/99 -->
<?XML VERSION="1.00"?>
<!DOCTYPE "PLAN" SYSTEM "plan.dtd">

<!-- Header: user, encryption info, etc. -->
<PLAN ID=Example VERSION=1.00 ENCODING="PUBLICKEY">

<!-- The requestor/request info isn't currently used, but is
     here for future expansion. The user is actually used by
     some wrappers -->
<REQUESTOR>127.128.129.130</REQUESTOR>
<REQUEST ID=One RESULT=test2 USER="auser">

<!-- Top-level fragment in this plan -->
<!-- Simply reads results of previous fragment -->
<FRAGMENT NUM=1>
<GRAPH>
    <OUTPUT NODE=test2/>
    <N ID=test2 OP=SCAN FILE=test CARD=600000>
    </N>
</GRAPH>
</FRAGMENT>

<!-- Lower-level fragment -->
<FRAGMENT NUM=0>
<GRAPH>
    <OUTPUT NODE=test/>
    <N ID=test OP=JOIN TYPE=HASH CHILD=projection CHILD=subresult
      SPACE=100000 CARD=600000>
        <WHERE>subresult.L_ORDERKEY = projection.O_ORDERKEY</WHERE>
    </N>
    <N ID=subresult OP=JOIN TYPE=HASH CHILD=selectli CHILD=supplier
      SPACE=30000 CARD=150000>
        <WHERE>selectli.L_SUPPKEY = supplier.S_SUPPKEY</WHERE>
    </N>
    <N ID=selectli OP=SELECT CHILD=lineitem CARD=600000>
        <WHERE>lineitem.L_SUPPKEY > 1000</WHERE>
    </N>
    <N ID=lineitem OP=WRAPPER TIMEOUT=10000 CARD=600000>
        <HOST>data.cs.washington.edu:7779</HOST>
        <SERVICE>TPCT.TPC</SERVICE>
        <QUERY>SELECT * FROM TPC.LINEITEM</QUERY>
    </N>
    <N ID=supplier OP=WRAPPER TIMEOUT=10000 CARD=15000>
        <HOST>data.cs.washington.edu:7779</HOST>
        <SERVICE>TPCT.TPC</SERVICE>
        <QUERY>SELECT * FROM TPC.SUPPLIER</QUERY>
    </N>
    <N ID=projection OP=PROJECT CHILD=order CARD=15000>
        <ATTR>order.O_ORDERKEY</ATTR>
        <ATTR>order.O_CUSTKEY</ATTR>
    </N>
    <N ID=order OP=PREFETCH TIMEOUT=10000 CARD=15000>
        <HOST>data.cs.washington.edu:7779</HOST>
        <SERVICE>TPCT.TPC</SERVICE>
        <QUERY>SELECT * FROM TPC.ORDER</QUERY>
    </N>

<!-- A collector looks like (assuming wrapper1-wrapper3 are node IDs):
    <N ID = misc OP=COLLECTOR CHILD=wrapper1 CHILD=wrapper2
     CHILD=wrapper3 SPACE=10000 CARD=15000>
    </N>
-->
</GRAPH>

<RULES>
<!-- Simple rule; not very useful, but demonstrates syntax -->
    <R ID=myRule>
        <WHEN>END_OF_FRAGMENT</WHEN>
        <IF>CARD(test) > 1000</IF>
        <THEN>REPLAN</THEN>
        <ELSE>ADD_MEMORY(test, 1)</ELSE>
    </R>
</RULES>
</FRAGMENT>

<!-- Global Rules; none here -->
<RULES>
</RULES>
</REQUEST>
</PLAN>