Apache Druid®’s best-kept secret – The Tuple Sketch: Part 1 – Student’s t test

The tuple sketch is an intriguing capability in Druid. It enables several useful analyses:

  1. Distinct Counts
  2. Mean and Variances for different metrics
  3. Set operations
  4. Student’s t test

In this post I will be looking at the Student’s t test.

Student’s t test

The Student’s t test checks whether the difference between the means of two groups is statistically significant. For instance, suppose a life-saving drug is tested, and the average life expectancy is 5 years for the control group and 6 years for the group receiving the drug. The question is whether this one-year difference is statistically significant or just random variation between the two samples.
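To make the drug example concrete, here is a minimal Python sketch of a two-sample t test computed from summary statistics only (mean, variance, group size). It uses the unequal-variance (Welch) form of the test, and the p value is obtained by numerically integrating the t density so that only the standard library is needed. The means of 5 and 6 years come from the example above; the variances and group sizes are made-up assumptions for illustration.

```python
import math


def t_pdf(x, dof):
    """Density of Student's t distribution with `dof` degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(x * x / dof))


def two_sided_p(t, dof, steps=2000):
    """P(|T| >= |t|): integrate the density over [0, |t|] with Simpson's rule."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, dof) + t_pdf(a, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    central_half = total * h / 3
    return max(0.0, 1.0 - 2.0 * central_half)


def welch_t_test(mean1, var1, n1, mean2, var2, n2):
    """Return (t statistic, degrees of freedom, two-sided p value)."""
    se1, se2 = var1 / n1, var2 / n2
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, dof, two_sided_p(t, dof)


# Hypothetical numbers: both groups have 50 patients and a variance of 4 years^2.
t_stat, dof, p_value = welch_t_test(5.0, 4.0, 50, 6.0, 4.0, 50)
print(f"t = {t_stat:.3f}, dof = {dof:.1f}, p = {p_value:.4f}")
```

A small p value means a difference this large would be unlikely if both groups had the same true mean. With larger variances or smaller groups, the same one-year difference would give a larger p value.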

Student’s t test with Druid

Install Druid and load the wikiticker-2015-09-12-sampled data. The metric “added” in the dataset captures the number of lines added by a user to a channel at a specific time. The difference in average lines added per user across channels can indicate a difference in the level of engagement. Run the following query:

SELECT
  SUM("added") FILTER (WHERE channel = '#en.wikipedia')
    / APPROX_COUNT_DISTINCT_DS_THETA("user") FILTER (WHERE channel = '#en.wikipedia') AS en_avg,
  SUM("added") FILTER (WHERE channel = '#hi.wikipedia')
    / APPROX_COUNT_DISTINCT_DS_THETA("user") FILTER (WHERE channel = '#hi.wikipedia') AS hi_avg,
  SUM("added") FILTER (WHERE channel = '#fr.wikipedia')
    / APPROX_COUNT_DISTINCT_DS_THETA("user") FILTER (WHERE channel = '#fr.wikipedia') AS fr_avg
FROM "wikiticker-2015-09-12-sampled"

This gives us the output below.

The question is whether the difference between the average lines added per user is statistically significant or random. If the difference is significant, then we can conclude that the English wikipedia may have lower engagement than the Hindi wikipedia, and that the French wikipedia may have higher engagement than the English wikipedia. The dataset has a number of different metrics (commentLength, lines deleted, etc.), and each can be analysed to get a different perspective on the engagement level. To understand the statistical significance of the above result, I ran a Student’s t test using the query below (at the time of writing, the tuple sketch is available through the native Druid query and has no SQL support).

{
  "queryType": "groupBy",
  "dataSource": "wikiticker-2015-09-12-sampled",
  "intervals": [
    "2010-06-27T00:00:00.000Z/2016-06-28T00:00:00.000Z"
  ],
  "granularity": "all",
  "dimensions": [],
  "aggregations": [
    {
      "type": "filtered",
      "aggregator": {
        "type": "arrayOfDoublesSketch",
        "name": "sketch_en",
        "fieldName": "user",
        "metricColumns": ["added"],
        "nominalEntries": 16384
      },
      "filter": {
        "type": "selector",
        "dimension": "channel",
        "value": "#en.wikipedia",
        "extractionFn": null
      }
    },
    {
      "type": "filtered",
      "aggregator": {
        "type": "arrayOfDoublesSketch",
        "name": "sketch_hi",
        "fieldName": "user",
        "metricColumns": ["added"],
        "nominalEntries": 16384
      },
      "filter": {
        "type": "selector",
        "dimension": "channel",
        "value": "#hi.wikipedia",
        "extractionFn": null
      }
    }
  ],
  "postAggregations": [
    {
      "type": "arrayOfDoublesSketchTTest",
      "name": "t_test",
      "fields": [
        {"type": "fieldAccess", "fieldName": "sketch_en"},
        {"type": "fieldAccess", "fieldName": "sketch_hi"}
      ]
    }
  ]
}
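One way to submit a native query like the one above is to POST it as JSON to Druid's native query endpoint (/druid/v2). The Python sketch below uses only the standard library. The host and port are assumptions based on a default local quickstart, where the router listens on localhost:8888, and the helper names are made up for illustration.

```python
import json
import urllib.request

# Assumption: a local quickstart deployment with the router on port 8888.
DRUID_NATIVE_URL = "http://localhost:8888/druid/v2"


def build_native_request(query, url=DRUID_NATIVE_URL):
    """Wrap a native query (a Python dict) in an HTTP POST request."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_native_query(query, url=DRUID_NATIVE_URL):
    """Send the query to Druid and return the parsed JSON response."""
    with urllib.request.urlopen(build_native_request(query, url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the query saved to a file, something like `run_native_query(json.load(open("ttest_query.json")))` would return the result rows; the file name here is hypothetical.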

   



This is a groupBy query on a Druid data source called wikiticker-2015-09-12-sampled over a large time interval (the dataset covers just one day within that interval).

{
  "type": "filtered",
  "aggregator": {
    "type": "arrayOfDoublesSketch",
    "name": "sketch_en",
    "fieldName": "user",
    "metricColumns": ["added"],
    "nominalEntries": 16384
  },
  "filter": {
    "type": "selector",
    "dimension": "channel",
    "value": "#en.wikipedia",
    "extractionFn": null
  }
}

This aggregation creates a tuple sketch (arrayOfDoublesSketch) on user, using added as the metric column and filtered to the channel #en.wikipedia. In a similar way, I have also created a tuple sketch for #hi.wikipedia. The post aggregation

"postAggregations": [
  {
    "type": "arrayOfDoublesSketchTTest",
    "name": "t_test",
    "fields": [
      {"type": "fieldAccess", "fieldName": "sketch_en"},
      {"type": "fieldAccess", "fieldName": "sketch_hi"}
    ]
  }
]

runs a t test between the two distributions (#en.wikipedia and #hi.wikipedia). When I run this query, I get a result with a t_test column.

The t_test column is essentially the p value: the probability of seeing a difference in means at least this large if the English and Hindi wikipedias actually had the same average. The p value lies between 0 and 1, and the smaller it is, the less plausible it is that the difference is just random. A value of 0.07… tells me that the difference in average lines added per user between the Hindi and English wikipedia is unlikely to be purely random: it is significant at the 10% level, although it falls just short of the conventional 5% threshold. Hence one can say there is some evidence of a difference in the level of engagement between the channels, based on the number of lines added per user.

The other interesting thing is that the number of distinct users for the Hindi wikipedia (16) is much smaller than the number for English (4343). So we get a different picture of the engagement (at least in terms of adding content). The point is that the tuple sketch allows us to look at both of these perspectives in one query that returns estimates. When I do a similar exercise between #fr.wikipedia and #en.wikipedia, I get a different result.

The above shows a p value of 0.52…. This tells me that the difference in average lines added per user between the English and French wikipedia is not statistically significant.
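Since the same query is repeated for different channel pairs, it can help to generate it. The Python sketch below builds the groupBy query as a dictionary for any two channels, mirroring the structure of the query shown earlier. The function names and the sketch names sketch_a and sketch_b are made up for illustration; the data source, interval, field names, and aggregator types are taken from the query in this post.

```python
import json


def filtered_sketch(name, channel, nominal_entries=16384):
    """A tuple sketch on "user" with "added" as the metric, for one channel."""
    return {
        "type": "filtered",
        "aggregator": {
            "type": "arrayOfDoublesSketch",
            "name": name,
            "fieldName": "user",
            "metricColumns": ["added"],
            "nominalEntries": nominal_entries,
        },
        "filter": {"type": "selector", "dimension": "channel", "value": channel},
    }


def build_ttest_query(channel_a, channel_b,
                      data_source="wikiticker-2015-09-12-sampled"):
    """Native groupBy query that t-tests "added" per user for two channels."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "intervals": ["2010-06-27T00:00:00.000Z/2016-06-28T00:00:00.000Z"],
        "granularity": "all",
        "dimensions": [],
        "aggregations": [
            filtered_sketch("sketch_a", channel_a),
            filtered_sketch("sketch_b", channel_b),
        ],
        "postAggregations": [
            {
                "type": "arrayOfDoublesSketchTTest",
                "name": "t_test",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "sketch_a"},
                    {"type": "fieldAccess", "fieldName": "sketch_b"},
                ],
            }
        ],
    }


# Print the JSON for the English vs. French comparison.
query = build_ttest_query("#en.wikipedia", "#fr.wikipedia")
print(json.dumps(query, indent=2))
```

The printed JSON can be pasted into the Druid console or posted to the native query endpoint.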

Key takeaways

I have written out a tuple sketch query in Druid to run the Student’s t test on the wikipedia dataset for different channels, measuring the statistical significance of the difference in average lines added per user. The tuple sketch allows us to look at both the t test and the number of distinct users for each channel. This gives us a different picture of the level of engagement in terms of adding content to each channel. The tuple sketch uses an approximation algorithm and hence is very fast on large datasets.
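To give a feel for the approximation, here is a toy Python illustration of the idea behind a tuple sketch. It is not the DataSketches implementation that Druid uses, and all names in it are hypothetical. It keeps only the k keys with the smallest hash values, together with a running sum of the metric for each retained key. The distinct count is then estimated from how tightly those k hashes are packed near zero, and the per-key mean is estimated from the retained sample.

```python
import hashlib


def hash01(key):
    """Map a key to a pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0 ** 64


class ToyTupleSketch:
    """Keeps the k smallest key hashes, each with a summed metric value."""

    def __init__(self, k):
        self.k = k
        self.entries = {}  # hash value -> sum of the metric for that key

    def update(self, key, value):
        h = hash01(key)
        if h in self.entries:
            self.entries[h] += value        # key already retained: accumulate
        elif len(self.entries) < self.k:
            self.entries[h] = value         # still room: keep every key
        else:
            largest = max(self.entries)
            if h < largest:                 # evict the largest retained hash
                del self.entries[largest]
                self.entries[h] = value

    def distinct_estimate(self):
        if len(self.entries) < self.k:
            return float(len(self.entries))  # exact while under capacity
        # Classic k-minimum-values estimator: (k - 1) / k-th smallest hash.
        return (self.k - 1) / max(self.entries)

    def mean_per_key(self):
        # The retained keys act as a uniform random sample of all keys.
        return sum(self.entries.values()) / len(self.entries)


sketch = ToyTupleSketch(k=256)
for i in range(10000):
    sketch.update(f"user{i}", 5.0)
print(len(sketch.entries), round(sketch.distinct_estimate()))
```

The sketch stores 256 entries regardless of how many users it sees, which is why this style of aggregation stays fast and small on large datasets. The price is that the counts and means are estimates.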

Key links

  1. Druid – https://druid.apache.org/docs/latest/tutorials/index.html
  2. Druid tuple sketch – https://druid.apache.org/docs/latest/development/extensions-core/datasketches-tuple.html
  3. Data sketches – https://datasketches.apache.org/
