Querying DINAMITE traces using SQL

In this blog post we describe how to query DINAMITE traces using SQL.

Setting up the database and importing traces

If you have read our earlier post, you know how to process the binary trace files generated by DINAMITE using the process-logs.py script.

This script will produce a compressed trace.sql.gz file with the entire trace in a human-readable format, and a few SQL commands allowing you to import this file directly into the database. That way, you can query your trace using the mighty SQL!

If for some reason you don’t want to use a database, you can visually examine the trace.sql file and query it using any other tools you like, because it is just plain text (just remove the SQL commands at the top and at the bottom of the file). In that case, decompress the file first by running gunzip trace.sql.gz.

If you are a brave soul willing to embrace the power of databases, we will tell you how to set up a free database MonetDB and use it to query the trace.

  1. Download and install MonetDB as explained here.

  2. Start the database daemon and create the database called ‘dinamite’ like so:
    monetdbd create monetdb
    monetdbd start monetdb
    monetdb create dinamite
    monetdb release dinamite
    
  3. Create a file .monetdb in your home directory with the following content. This is needed, so you don’t have to type the username and password every time. Note, we are using the default password for the trace database. If you want to protect your trace with a stronger password, change it accordingly and don’t put it in the text file.

    user=monetdb
    password=monetdb
    
  4. Import the trace.sql.gz file created by process-logs.py into the database:

    gunzip -c trace.sql.gz | mclient -d dinamite
    

You are good to go! Now you can query your traces using SQL. For those not fluent in SQL we provide a few example queries to get you started.

Querying the traces

The database you created in the preceding steps will have three tables: trace,avg_stdev and outliers. The trace table contains raw trace records sorted by time. The avg_stdev table contains the average duration and the standard deviation of durations for each function. The table outliers contains the functions whose duration was more than two standard deviations above the average.

The tables have the following schema:

   CREATE TABLE "sys"."trace" (
    "id"       INTEGER,
    "dir"      INTEGER,
    "func"     VARCHAR(255),
    "tid"      INTEGER,
    "time"     BIGINT,
    "duration" BIGINT
   );

where id is the unique identifier for each invocation of a function, dir is an indicator whether the given record is a function entry (dir=0) or exit point (dir=1), func is the function name, tid is the thread id, time is the timestamp (in nanoseconds, by default), and duration is the duration of the corresponding function.

   CREATE TABLE "sys"."avg_stdev" (
    "func"  VARCHAR(255),
    "stdev" DOUBLE,
    "avg"   DOUBLE
   );

where func is the function name, avg is its average duration across all threads and stdev is the standard deviation.

   CREATE TABLE "sys"."outliers" (
    "id"       INTEGER,
    "func"     VARCHAR(255),
    "time"     BIGINT,
    "duration" BIGINT
   );

where id is the unique ID for each function invocation, func is the function name, time is the timestamp of the function’s entry record and duration is the duration.

You can examine the schema yourself by opening a shell into the database, like so:

   mclient -d dinamite

and issuing a command ‘\d’. Running this command without arguments will list all database tables. Running this command with a “table” argument will give you the table’s schema, for example:

   sql> \d trace

Suppose now you wanted to list all “outlier” functions, sorted by duration. You could accomplish this by typing the following SQL query:

   sql> SELECT * FROM outliers ORDER BY duration DESC;

Or if you wanted to display the first 10 entry records for functions that took more than 60 milliseconds to run:

   sql> SELECT * FROM trace WHERE duration > 60000000 AND dir=0 LIMIT 10;

Now, suppose you wanted to see what happened during the execution while an particular unusually long function was running? You could list all events between the boundaries of the outlier function, sorted in the chronological order, via the following steps.

  1. First, find the ID of the outlier function invocation you want to examine using the SELECT * FROM outliers ORDER BY duration DESC;, picking any function you like and noting its ID (the first column). Suppose you picked a function invocation with the id 4096.

  2. Run the following query to obtain all concurrent events:

       sql> WITH lowerboundtime AS
       (SELECT time FROM trace WHERE id=4096 AND dir=0),
       upperboundtime AS
       (SELECT time FROM trace WHERE id=4096 and dir=1)
       SELECT * FROM trace WHERE
       time>= (SELECT * FROM lowerboundtime)
       AND time <= (SELECT * FROM upperboundtime);
    

That query will probably give you a very large list of events – so large that it’s really inconvenient to look at it in a text form. To make viewing event traces more pleasant, we built a tool called TimeSquared. You can use it to view event traces generated from SQL queries. Read about it in our next blog post!

And before we finish this post, here is another very useful query that could help diagnose the root cause of an “outlier” function invocation. The following query will list all the children function-invocations of the long-lasting functions and sort them by duration. This way, you can quickly spot if there is a particular callee that is responsible for high latency in the parent. Again, assuming that you previously determined that the id of the long-lasting function equals to 4096 and the function was executed by the thread with tid=18, the query would look like:

   sql> WITH lowerboundtime AS
   (SELECT time FROM trace WHERE id=4096 AND dir=0),
   upperboundtime AS
   (SELECT time FROM trace WHERE id=4096 and dir=1)
   SELECT func, id, time, duration FROM trace WHERE
   time>= (SELECT * FROM lowerboundtime)
   AND time <= (SELECT * FROM upperboundtime)
   AND tid=18  ORDER BY	duration DESC;