Viewing DINAMITE traces with TimeSquared

In this blog post we describe how to view DINAMITE traces using TimeSquared.

In the previous post we described how to query DINAMITE traces using a SQL database.

If you would like to get an idea of what the output of TimeSquared will look like, check out the demo. Just click on any button under ‘Sample Data’.

Pre-requisite: You will need to set up the database as described in that post and try the example queries, because TimeSquared interfaces with the database to show portions of the trace.

Set up TimeSquared:

   # Clone the repository
   % git clone https://github.com/dinamite-toolkit/timesquared.git
   % cd timesquared

   # Install all dependent modules
   % npm install
   % npm install monetdb

   # Start the server
   % cd timesquared/timesquared
   % node server.js

If you are using MonetDB settings other than the ones recommended in our previous post, modify timesquared/server/TimeSquaredDB.js to reflect those changes. The default options look like this:

   var MDB_OPTIONS = {
       host : 'localhost',
       port : 50000,
       dbname : 'dinamite',
       user : 'monetdb',
       password : 'monetdb'
   };

As you can see, they reflect the same database name, user name and password we used for MonetDB.

If your server ran successfully, you will see a message like this:

TimeSquared server hosted on http://:::3000

Open your browser and type localhost:3000 for the URL (the port is shown in the server's startup message above).

Click on the "Dashboard" button. You will see a textbox, where you can begin
typing SQL queries to visualize parts of the trace.

Suppose that after reading the [previous
post](/2017/03/10/querying-dinamite-traces) you want to visualize the portion of
the trace corresponding to the time when a function of interest took a
particularly long time to run. Suppose that you found, using the query `SELECT *
FROM outliers ORDER BY duration DESC;`, that the ID of your "outlier" function
invocation is 4096.

Type the following query into the query box and press the "Query" button.

   WITH lbt AS (SELECT time FROM trace WHERE id=4096 AND dir=0),
   ubt AS (SELECT time FROM trace WHERE id=4096 AND dir=1)
   SELECT dir, func, tid, time FROM trace
   WHERE time >= (SELECT time FROM lbt) AND time <= (SELECT time FROM ubt)

The browser will open another tab where it will try to render the results of your query. You can watch the console where you started the server for any log messages indicating the progress, or error messages. If there are too many records in the result of your query (e.g., >500K), the rendering will take a long time and may even crash the browser. In that case, try limiting the number of records returned by adding the ‘LIMIT’ command to the end of the query. For instance, to limit the results to 100,000 records, append ‘LIMIT 100000’ to the query above.
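
For instance, the query above with the limit appended would look like this:

   WITH lbt AS (SELECT time FROM trace WHERE id=4096 AND dir=0),
   ubt AS (SELECT time FROM trace WHERE id=4096 AND dir=1)
   SELECT dir, func, tid, time FROM trace
   WHERE time >= (SELECT time FROM lbt) AND time <= (SELECT time FROM ubt)
   LIMIT 100000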

Once rendering is complete, you will see a succession of callstacks over time for each thread. Time flows along the horizontal axis; threads and callstacks are along the vertical axis, like so:

TimeSquared

If you hover with the mouse over a coloured strip in the callstack, TimeSquared will show you the name of the function and some information about it.

If you want to see more examples of what the TimeSquared output looks like, check out the demo mentioned at the beginning of this post.

TimeSquared is a work in progress. We are working on building an interface to effectively navigate very large traces. At the moment TimeSquared cannot display a very large number of events – and even if it could, you’d hardly want to look at all of them. We will report on these features in our future blog posts.

Querying DINAMITE traces using SQL

In this blog post we describe how to query DINAMITE traces using SQL.

Setting up the database and importing traces

If you have read our earlier post, you know how to process the binary trace files generated by DINAMITE using the process-logs.py script.

This script will produce a compressed trace.sql.gz file with the entire trace in a human-readable format, and a few SQL commands allowing you to import this file directly into the database. That way, you can query your trace using the mighty SQL!

If for some reason you don’t want to use a database, you can visually examine the trace.sql file and query it using any other tools you like, because it is just plain text (just remove the SQL commands at the top and at the bottom of the file). In that case, decompress the file first by running gunzip trace.sql.gz.

If you are a brave soul willing to embrace the power of databases, we will tell you how to set up a free database MonetDB and use it to query the trace.

  1. Download and install MonetDB as explained here.

  2. Start the database daemon and create the database called ‘dinamite’ like so:
    monetdbd create monetdb
    monetdbd start monetdb
    monetdb create dinamite
    monetdb release dinamite
    
  3. Create a file .monetdb in your home directory with the following content. This is needed so you don’t have to type the username and password every time. Note that we are using the default password for the trace database. If you want to protect your trace with a stronger password, change it accordingly and don’t put it in the text file.

    user=monetdb
    password=monetdb
    
  4. Import the trace.sql.gz file created by process-logs.py into the database:

    gunzip -c trace.sql.gz | mclient -d dinamite
    

You are good to go! Now you can query your traces using SQL. For those not fluent in SQL we provide a few example queries to get you started.

Querying the traces

The database you created in the preceding steps will have three tables: trace, avg_stdev and outliers. The trace table contains raw trace records sorted by time. The avg_stdev table contains the average duration and the standard deviation of durations for each function. The outliers table contains the function invocations whose duration was more than two standard deviations above the average for that function.

The tables have the following schema:

   CREATE TABLE "sys"."trace" (
    "id"       INTEGER,
    "dir"      INTEGER,
    "func"     VARCHAR(255),
    "tid"      INTEGER,
    "time"     BIGINT,
    "duration" BIGINT
   );

where id is the unique identifier for each invocation of a function, dir indicates whether the given record is a function entry (dir=0) or exit (dir=1), func is the function name, tid is the thread id, time is the timestamp (in nanoseconds, by default), and duration is the duration of the corresponding function.
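
For example, since dir=0 marks function entry records, you can count how many times each function was invoked with a query like this (an illustrative query, not from the original post):

   sql> SELECT func, COUNT(*) AS invocations FROM trace
        WHERE dir=0 GROUP BY func ORDER BY invocations DESC;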

   CREATE TABLE "sys"."avg_stdev" (
    "func"  VARCHAR(255),
    "stdev" DOUBLE,
    "avg"   DOUBLE
   );

where func is the function name, avg is its average duration across all threads and stdev is the standard deviation.
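
For instance, to list the functions with the longest average durations (again, just an illustrative query; the avg and stdev columns are quoted to avoid clashing with the SQL aggregates of the same names):

   sql> SELECT func, "avg", "stdev" FROM avg_stdev
        ORDER BY "avg" DESC LIMIT 10;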

   CREATE TABLE "sys"."outliers" (
    "id"       INTEGER,
    "func"     VARCHAR(255),
    "time"     BIGINT,
    "duration" BIGINT
   );

where id is the unique ID for each function invocation, func is the function name, time is the timestamp of the function’s entry record and duration is the duration.
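
To see how the three tables relate, here is a sketch (not from the original post) that re-derives the outlier records from the trace and avg_stdev tables by applying the two-standard-deviation rule described above:

   sql> SELECT t.id, t.func, t.time, t.duration
        FROM trace t JOIN avg_stdev a ON t.func = a.func
        WHERE t.dir = 0 AND t.duration > a."avg" + 2 * a."stdev"
        ORDER BY t.duration DESC;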

You can examine the schema yourself by opening a shell into the database, like so:

   mclient -d dinamite

and issuing the command ‘\d’. Running this command without arguments will list all database tables. Running it with a table name as the argument will show that table's schema, for example:

   sql> \d trace

Suppose now you wanted to list all “outlier” functions, sorted by duration. You could accomplish this by typing the following SQL query:

   sql> SELECT * FROM outliers ORDER BY duration DESC;

Or if you wanted to display the first 10 entry records for functions that took more than 60 milliseconds to run:

   sql> SELECT * FROM trace WHERE duration > 60000000 AND dir=0 LIMIT 10;

Now, suppose you wanted to see what happened during the execution while a particular, unusually long function invocation was running. You could list all events between the boundaries of the outlier function, sorted in chronological order, via the following steps.

  1. First, find the ID of the outlier function invocation you want to examine by running SELECT * FROM outliers ORDER BY duration DESC; and picking any function you like, noting its ID (the first column). Suppose you picked a function invocation with the id 4096.

  2. Run the following query to obtain all concurrent events:

       sql> WITH lowerboundtime AS
       (SELECT time FROM trace WHERE id=4096 AND dir=0),
       upperboundtime AS
       (SELECT time FROM trace WHERE id=4096 and dir=1)
       SELECT * FROM trace WHERE
       time>= (SELECT * FROM lowerboundtime)
       AND time <= (SELECT * FROM upperboundtime);
    

That query will probably give you a very large list of events – so large that it’s really inconvenient to look at it in a text form. To make viewing event traces more pleasant, we built a tool called TimeSquared. You can use it to view event traces generated from SQL queries. Read about it in our next blog post!

And before we finish this post, here is another very useful query that could help diagnose the root cause of an “outlier” function invocation. The following query will list all the child function invocations of the long-running function and sort them by duration. This way, you can quickly spot whether a particular callee is responsible for the high latency in the parent. Again, assuming that you previously determined that the id of the long-running function is 4096 and that the function was executed by the thread with tid=18, the query would look like:

   sql> WITH lowerboundtime AS
   (SELECT time FROM trace WHERE id=4096 AND dir=0),
   upperboundtime AS
   (SELECT time FROM trace WHERE id=4096 and dir=1)
   SELECT func, id, time, duration FROM trace WHERE
   time>= (SELECT * FROM lowerboundtime)
   AND time <= (SELECT * FROM upperboundtime)
   AND tid=18 ORDER BY duration DESC;

Compiling Facebook's RocksDB using DINAMITE

In this blog post we describe how we use DINAMITE to compile RocksDB, an open source persistent key-value store developed by Facebook.

Building RocksDB

Here is the RocksDB GitHub repository. If you have gcc correctly set up, a simple make static_lib db_bench command in the root of the repo will build a “normal” (uninstrumented) version of the RocksDB static library and the db_bench benchmark application.

Build and Run With Docker

  1. A Docker container with LLVM and DINAMITE can be found here. Run and set up the container as suggested in the repo.

  2. In the Docker command line, set the following environment variables:

    export WORKING_DIR=/root/dinamite
    export LLVM_SOURCE=$WORKING_DIR/llvm-3.5.0.src
    export LLVM_BUILD=$WORKING_DIR/build
    
  3. Clone the RocksDB repository.

    cd $WORKING_DIR
    git clone https://github.com/facebook/rocksdb.git
    cd rocksdb
    
    
  4. Set up environment variables for building RocksDB with DINAMITE

    # Variables for DINAMITE build:
    export PATH="$LLVM_BUILD/bin:$PATH"
    export DIN_FILTERS="$LLVM_SOURCE/projects/dinamite/function_filter.json" # default filter
    export DIN_MAPS="$WORKING_DIR/din_maps" 
    export ALLOC_IN="$LLVM_SOURCE/projects/dinamite"
    export INST_LIB="$LLVM_SOURCE/projects/dinamite/library"
    
    # variables for loading DINAMITE pass on RocksDB:
    export CXX=clang++
    export EXTRA_CFLAGS="-O3 -g -v -Xclang -load -Xclang $LLVM_SOURCE/Release+Asserts/lib/AccessInstrument.so"
    export EXTRA_CXXFLAGS=$EXTRA_CFLAGS
    export EXTRA_LDFLAGS="-L$LLVM_SOURCE/projects/dinamite/library/ -linstrumentation"
    
    # Set up variables for runtime, can be set up later before running
    export DINAMITE_TRACE_PREFIX=$WORKING_DIR/din_traces
    export LD_LIBRARY_PATH=$INST_LIB
    
    # Create the directories if they do not exist:
    mkdir -p $DIN_MAPS
    mkdir -p $DINAMITE_TRACE_PREFIX
    

     Note: If you are running on macOS, make the following changes:

    export EXTRA_CFLAGS="-O3 -g -v -Xclang -load -Xclang $LLVM_SOURCE/Release+Asserts/lib/AccessInstrument.dylib"
    export DYLD_LIBRARY_PATH=$INST_LIB
    
  5. Build the RocksDB library and some sample applications.

    make clean
    make static_lib db_bench
    cd examples
    make clean
    make
    cd ..
    
  6. Now run the simple example RocksDB program to get some traces!

    cd examples
    ./simple_example
    

Build and Run in Your Own Environment

If you don’t like Docker (don’t worry, we have the same feeling), you can definitely build a DINAMITE-instrumented RocksDB in your own environment.

  1. Basically, you need to (1) build LLVM; (2) place the DINAMITE pass in the correct location; and (3) build the DINAMITE pass and its runtime library.

     You can copy the commands from the Dockerfile and apply them appropriately to your system. Detailed information is provided in the Environment section of another article, Compiling the WiredTiger key-value store using DINAMITE.

  2. Make sure you assign these environment variables correctly:

    export WORKING_DIR="Your working directory, which will later contain the rocksdb repo"
    export LLVM_SOURCE="Your LLVM source base directory"
    export LLVM_BUILD="The directory containing your LLVM build; it should have a bin subdirectory that contains the clang/clang++ binaries"
    
  3. Follow the same steps as in the Build and Run With Docker section, starting from (and including) Step 3.

Appendix

DINAMITE Variables Explanation:

* `DIN_FILTERS` points to a configuration file in JSON format indicating which functions get filtered in or out when injecting instrumentation code.
* `DIN_MAPS` is the directory where DINAMITE stores its ID-to-name maps, which are needed for processing the traces later.
* `ALLOC_IN` is a directory containing alloc.in, which lets you specify your allocation functions.
* `INST_LIB` points to the runtime library DINAMITE uses for writing traces.

If you run into trouble, get in touch. If you’d like to find out how to control what’s instrumented, take a look at the User Guide. If you’d like to know what to do with the traces now that you have them, take a look at the other Technical Articles.

Discovering lock contention in Facebook's RocksDB

This post describes how we discovered a source of scaling problems for the multireadrandom benchmark in RocksDB.

A while ago, we ran some scaling tests on a set of RocksDB benchmarks. We noticed that some of the benchmarks didn’t scale well as we increased the number of threads. We decided to instrument the runs and use FlowViz to visualize what is going on with multithreaded performance. The discovery helped open two issues in RocksDB’s repository: 1809 and 1811.

Here’s what we did:

(This tutorial was tested on DINAMITE’s docker container. To set it up, follow our Quick start guide. If any of the instructions do not work in your environment, please contact us.)

Setting up RocksDB

Configuring RocksDB to use DINAMITE’s compiler is straightforward. The following is a set of environment variables that need to be set in order to do this. Save them into a file, adjust the paths to point to your DINAMITE installation, and source the file before running ./configure.

############################################################
## source this file before building RocksDB with DINAMITE ##
############################################################

export DIN_FILTERS=/dinamite_workspace/rocksdb/rocksdb_filters.json
export DINAMITE_TRACE_PREFIX=<path_to_trace_output_dir>
export DIN_MAPS=<path_to_json_maps_output_dir>
export EXTRA_CFLAGS="-O0 -g -v -Xclang -load -Xclang ${LLVM_SOURCE}/Release+Asserts/lib/AccessInstrument.so"
export EXTRA_CXXFLAGS=$EXTRA_CFLAGS
export EXTRA_LDFLAGS="-L${INST_LIB} -linstrumentation"

Make sure that $LLVM_SOURCE points to the root of your LLVM build, and that $INST_LIB points to the library/ subdirectory of DINAMITE compiler’s root.

The contents of rocksdb_filters.json are as follows:

{
    "blacklist" : {
        "function_filters" : [
            "*atomic*",
            "*SpinMutex*"
        ],
        "file_filters" : [ ]
        },
    "whitelist": {
        "file_filters" : [ ],
        "function_filters" : {
            "*" : {
                "events" : [ "function" ]
            },
            "*SpinMutex*" : {
                "events" : [ "function" ]
            },
            "*InstrumentedMutex*" : {
                "events" : [ "function" ]
            }
        }
    }
}

Run ./configure in RocksDB root, and then run make release.

After the build is done, you should have a db_bench executable in RocksDB root.

Running the instrumented benchmark

The following is the script we used to obtain our results:

#!/bin/bash
BENCH=$1
THREADS=$2
./db_bench --db=<path_to_your_db_storage> --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=/tmpfs/rocksdb/wal --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --benchmarks=${BENCH} --use_existing_db=1 --num=524880 --threads=${THREADS}

Make sure you fix the --db argument to point to where your database will be stored. In our case, the first argument ($BENCH) was set to multireadrandom and the number of threads varied from 1 to 8 in powers of 2.

Results

The output logs were processed and visualized with FlowViz. The guide on how to use this tool is available here and will be omitted from this post for brevity.

Here is the FlowViz output for runs with 1 and 8 threads:

1 thread:

RocksDB 1 thread run

8 threads:

RocksDB 8 threads run

From these, it is immediately obvious that the amount of time spent performing locking operations increases with the number of threads.

These results were used to infer inefficiencies in workloads using MultiGet() and led to the opening of the two aforementioned issues on the RocksDB tracker.

Visualizing DINAMITE traces with execution flow diagrams

In this post we introduce DINAMITE’s execution flow charts – charts that visually convey both how the program works and how it performs.

Our previous post described how to parse binary traces created in the process of running the DINAMITE-instrumented program. The parsing step will convert the binary traces to text, create summary files, and generate execution flow charts showing how the program transitions between different functions and how much time it spends there.

The process-logs.py script in the binary trace toolkit (see previous post) will now produce a directory called HTML in the same directory where you ran the script. By opening HTML/index.html in your browser, you will see execution flow charts for all threads on a single page; clicking on any image takes you to a new page where the image is interactive: you can click on the rectangles corresponding to different functions to get summaries of their execution. Let us go through all of this in more detail.

First, let’s take another look at an execution flow chart. The one below is for one of the application threads in a multi-table workload executing on top of the WiredTiger key-value store.

WiredTiger execution flow diagram

The image shows a state transition diagram, where states are function entry/exit points. Edges between states show how many times each transition occurred. Nodes corresponding to states are coloured and annotated according to the percentage of execution time spent by this thread in the corresponding function. Percentage values are cumulative. That is, if foo() took 100% of the execution, and bar(), called by foo(), took 90%, they will be annotated as 100% and 90% respectively.

From this diagram we can intuit how the program works, even if we have never seen the code and have only a vague understanding of how key-value stores work. We observe that it searches a btree (100% of the time is spent in __wt_btcur_search), which probably has a row-based layout (__wt_row_search, __cursor_row_search). We can guess that in this execution the thread performed 118,907 searches (the number of transitions between enter __cursor_row_search and enter __wt_row_search) and of those searches most of the records (118,806, to be precise) were found in memory: we can guess the latter from the fact that there are 118,806 transitions directly from enter __wt_row_search to enter __wt_lex_compare_skip. 110 of the searches, on the other hand, did not find the needed page in memory and caused the system to read the data from disk: hence the right subtree emanating from the enter __wt_row_search function and dominated by __wt_page_in_func (38% of the execution time) – the function that, judging by its name, pages the data from disk.

Furthermore, we get interesting statistics about the effect of locks on performance.

  • ~19% of the time is spent waiting on a spinlock somewhere in __wt_row_modify function (or its children).
  • A lock acquired by __evict_get_ref takes about 3% of the execution time.
  • There is also a read/write spinlock, acquired by __wt_txn_get_snapshot (or its children) — it takes about 9% of the execution time for this thread.

Note that lock releases are not shown here, because they take less than 2% of the execution time, and by default the trace-processing script filters out any functions with small contributions to the overall time, to avoid having to show a huge number of nodes.

Now, if you follow this link, you will be taken to a page that includes interactive versions of the execution flow charts for all threads for the same workload.

If you click on any image, you will be taken to a page where the same image is interactive. You can click on any node of the flow diagram and be taken to a page containing a short summary of this function’s execution. Those summaries are not great yet; we are working on making them more useful, but one curious thing we discover from them, by clicking on a summary for any lock-acquiring function, is that the largest time to acquire a lock was more than a whole second (!!!) – much larger than the average time to acquire the lock. This is definitely a red flag that a performance engineer would want to investigate.

For another example, take a look at the following execution flow charts for RocksDB. We executed db_bench with the multireadrandom workload using one, two, four and eight threads. Note how the time spent acquiring locks increases as we increase the number of threads.

Note: If you have already converted binary traces to text and want to re-generate the HTML, simply go into the directory containing text traces and run:

   % $DINAMITE_BINTRACE_TOOLKIT/process_logs.py trace.bin.*.txt

By default, all functions taking less than 2% of the execution time will not be included in the diagrams. To remove this filter, run the script like this:

   % $DINAMITE_BINTRACE_TOOLKIT/process_logs.py -p 0.0 trace.bin.*.txt

The -p argument controls the percent value for the filter. Without filtering, the diagrams may become very large and not suitable for viewing.

See the previous post if you need information on how to download the DINAMITE binary trace toolkit.