Querying DINAMITE traces using SQL
10 Mar 2017In this blog post we describe how to query DINAMITE traces using SQL.
Setting up the database and importing traces
If you have read our earlier
post,
you know how to process the binary trace files generated by DINAMITE
using the process-logs.py
script.
This script will produce a compressed trace.sql.gz
file with the entire trace
in a human-readable format, and a few SQL commands allowing you to import this
file directly into the database. That way, you can query your trace using the
mighty SQL!
If for some reason you don’t want to use a database, you can visually examine
the trace.sql
file and query it using any other tools you like, because it is
just plain text (just remove the SQL commands at the top and at the bottom of
the file). In that case, decompress the file first by running gunzip
trace.sql.gz
.
If you are a brave soul willing to embrace the power of databases, we will tell you how to set up a free database MonetDB and use it to query the trace.
-
Download and install MonetDB as explained here.
- Start the database daemon and create the database called ‘dinamite’ like so:
monetdbd create monetdb monetdbd start monetdb monetdb create dinamite monetdb release dinamite
-
Create a file
.monetdb
in your home directory with the following content. This is needed, so you don’t have to type the username and password every time. Note, we are using the default password for the trace database. If you want to protect your trace with a stronger password, change it accordingly and don’t put it in the text file.user=monetdb password=monetdb
-
Import the
trace.sql.gz
file created byprocess-logs.py
into the database:gunzip -c trace.sql.gz | mclient -d dinamite
You are good to go! Now you can query your traces using SQL. For those not fluent in SQL we provide a few example queries to get you started.
Querying the traces
The database you created in the preceding steps will have three
tables: trace
,avg_stdev
and outliers
. The trace
table contains
raw trace records sorted by time. The avg_stdev
table contains the
average duration and the standard deviation of durations for each
function. The table outliers
contains the functions whose duration
was more than two standard deviations above the average.
The tables have the following schema:
CREATE TABLE "sys"."trace" (
"id" INTEGER,
"dir" INTEGER,
"func" VARCHAR(255),
"tid" INTEGER,
"time" BIGINT,
"duration" BIGINT
);
where id
is the unique identifier for each invocation of a function,
dir
is an indicator whether the given record is a function entry
(dir=0) or exit point (dir=1), func
is the function name, tid
is
the thread id, time
is the timestamp (in nanoseconds, by default),
and duration
is the duration of the corresponding function.
CREATE TABLE "sys"."avg_stdev" (
"func" VARCHAR(255),
"stdev" DOUBLE,
"avg" DOUBLE
);
where func
is the function name, avg
is its average duration
across all threads and stdev
is the standard deviation.
CREATE TABLE "sys"."outliers" (
"id" INTEGER,
"func" VARCHAR(255),
"time" BIGINT,
"duration" BIGINT
);
where id
is the unique ID for each function invocation, func
is
the function name, time
is the timestamp of the function’s entry
record and duration
is the duration.
You can examine the schema yourself by opening a shell into the database, like so:
mclient -d dinamite
and issuing a command ‘\d’. Running this command without arguments will list all database tables. Running this command with a “table” argument will give you the table’s schema, for example:
sql> \d trace
Suppose now you wanted to list all “outlier” functions, sorted by duration. You could accomplish this by typing the following SQL query:
sql> SELECT * FROM outliers ORDER BY duration DESC;
Or if you wanted to display the first 10 entry records for functions that took more than 60 milliseconds to run:
sql> SELECT * FROM trace WHERE duration > 60000000 AND dir=0 LIMIT 10;
Now, suppose you wanted to see what happened during the execution while an particular unusually long function was running? You could list all events between the boundaries of the outlier function, sorted in the chronological order, via the following steps.
-
First, find the ID of the outlier function invocation you want to examine using the
SELECT * FROM outliers ORDER BY duration DESC;
, picking any function you like and noting its ID (the first column). Suppose you picked a function invocation with the id 4096. -
Run the following query to obtain all concurrent events:
sql> WITH lowerboundtime AS (SELECT time FROM trace WHERE id=4096 AND dir=0), upperboundtime AS (SELECT time FROM trace WHERE id=4096 and dir=1) SELECT * FROM trace WHERE time>= (SELECT * FROM lowerboundtime) AND time <= (SELECT * FROM upperboundtime);
That query will probably give you a very large list of events – so large that it’s really inconvenient to look at it in a text form. To make viewing event traces more pleasant, we built a tool called TimeSquared. You can use it to view event traces generated from SQL queries. Read about it in our next blog post!
And before we finish this post, here is another very useful query that could help diagnose the root cause of an “outlier” function invocation. The following query will list all the children function-invocations of the long-lasting functions and sort them by duration. This way, you can quickly spot if there is a particular callee that is responsible for high latency in the parent. Again, assuming that you previously determined that the id of the long-lasting function equals to 4096 and the function was executed by the thread with tid=18, the query would look like:
sql> WITH lowerboundtime AS
(SELECT time FROM trace WHERE id=4096 AND dir=0),
upperboundtime AS
(SELECT time FROM trace WHERE id=4096 and dir=1)
SELECT func, id, time, duration FROM trace WHERE
time>= (SELECT * FROM lowerboundtime)
AND time <= (SELECT * FROM upperboundtime)
AND tid=18 ORDER BY duration DESC;