Implement ANY aggregator #9187

maytasm · 2020-01-15T00:38:15Z

Implement ANY aggregator

Description

Implement an ANY function that returns the first value it finds.

AnySqlAggregator can reuse much of the code in EarliestLatestSqlAggregator, so I added ANY to EarliestLatestSqlAggregator and renamed it to EarliestLatestAnySqlAggregator. (There may be a better name for it.)
The SQL function keyword is ANY_VALUE, since ANY is already a reserved word in SQL.
Created LongAnyAggregatorFactory, FloatAnyAggregatorFactory, DoubleAnyAggregatorFactory, and StringAnyAggregatorFactory. The combine logic returns lhs if it is not null and otherwise returns rhs. (lhs and rhs are just the column values.)
Created LongAnyAggregator, FloatAnyAggregator, DoubleAnyAggregator, StringAnyAggregator, LongAnyBufferAggregator, FloatAnyBufferAggregator, DoubleAnyBufferAggregator, and StringAnyBufferAggregator. Once a non-null value has been seen, the aggregate logic keeps returning that value without calling getObject on the selector. LongAnyBufferAggregator, FloatAnyBufferAggregator, and DoubleAnyBufferAggregator use 0 as the default value.
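The aggregate and combine behavior described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the class `AnyAggregatorSketch`, its methods, and the use of `Supplier<Double>` as a stand-in for Druid's value selector are all assumptions made for the example.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical, simplified stand-in for an on-heap "any" aggregator.
// Supplier<Double> plays the role of the column value selector.
final class AnyAggregatorSketch
{
  private Double found; // stays null until a non-null value has been seen

  // Only consult the selector while nothing has been found yet.
  void aggregate(Supplier<Double> selector)
  {
    if (found == null) {
      found = selector.get();
    }
  }

  Double get()
  {
    return found;
  }

  // Combine two partial results: keep lhs unless it is null.
  static Double combine(Double lhs, Double rhs)
  {
    return lhs != null ? lhs : rhs;
  }

  // Convenience helper for the example: run the aggregator over a list of rows.
  static Double anyOf(List<Double> rows)
  {
    AnyAggregatorSketch agg = new AnyAggregatorSketch();
    for (Double row : rows) {
      agg.aggregate(() -> row);
    }
    return agg.get();
  }
}
```

In this sketch, `anyOf` over the rows `null, 3.0, 7.0` yields `3.0`, and the selector is never consulted for the row holding `7.0`.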

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added or updated version, license, or notice information in licenses.yaml
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths.
added integration tests.
been tested in a test Druid cluster.

suneet-s

I've only reviewed the Double*Aggregator* since I assume the pattern is the same across the other aggregators.

I learnt a bunch about aggregators reviewing this PR 🎉 I'll read through the tests later, but posting now to unblock you.

processing/src/main/java/org/apache/druid/query/aggregation/AggregatorUtil.java

processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyAggregator.java

processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyAggregatorFactory.java

processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyBufferAggregator.java

processing/src/main/java/org/apache/druid/query/aggregation/first/StringAggregatorUtils.java

maytasm · 2020-01-16T00:01:32Z

Note: ANY on a Double/Float/Long column with useDefaultValueForNull=true does not prefer non-null values over the default value for null (i.e. 0).
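A rough model of why this happens, assuming that in default-value mode a missing numeric is read as 0 before the aggregator ever sees it. The class `DefaultNullModeSketch` and its methods are hypothetical and only mirror the behavior stated in the note.

```java
// Hypothetical model of the two null-handling modes for a numeric column.
final class DefaultNullModeSketch
{
  // Assumption: in default-value mode a stored null is read as 0.0;
  // in SQL-compatible mode it is read as null.
  static Double read(Double stored, boolean useDefaultValueForNull)
  {
    return (stored == null && useDefaultValueForNull) ? Double.valueOf(0.0) : stored;
  }

  // "Any" keeps the first non-null value it reads.
  static Double any(Double[] column, boolean useDefaultValueForNull)
  {
    Double found = null;
    for (Double stored : column) {
      if (found == null) {
        found = read(stored, useDefaultValueForNull);
      }
    }
    return found;
  }
}
```

For a column holding `null, 5.0`, this sketch returns `0.0` in default-value mode, because the 0 is indistinguishable from a real value, and `5.0` in SQL-compatible mode.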

processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyAggregatorFactory.java

processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyBufferAggregator.java

docs/querying/sql.md

jon-wei · 2020-01-16T02:49:05Z

docs/querying/sql.md

@@ -203,6 +203,10 @@ Only the COUNT aggregation can accept DISTINCT.
 |`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
 |`LATEST(expr)`|Returns the latest non-null value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.|
 |`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
+|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will returns any non-null value of `expr`|
+|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|


This should mention that the default maxBytesPerString is 1024

What do you mean by default? You always need to pass the value. There is no default value for maxBytesPerString.

You have this block in StringAnyAggregatorFactory:

this.maxStringBytes = maxStringBytes == null ? StringFirstAggregatorFactory.DEFAULT_MAX_STRING_SIZE : maxStringBytes;

I would give the SQL function consistent behavior.

Currently, for LATEST and EARLIEST (and ANY, since I based it on them), maxStringBytes is optional if you use the JSON query, and defaults to 1024 when absent (as per the docs in docs/querying/aggregations.md).
However, SQL does not work the same way. To use LATEST, EARLIEST (or ANY) on strings in SQL, you must pass maxStringBytes as the second argument. If you do not, the column is cast to double, which is surprising.

Let's discuss. We can change this behaviour for LATEST, EARLIEST (and ANY).

jon-wei · 2020-01-16T02:49:46Z

docs/querying/sql.md

@@ -203,6 +203,10 @@ Only the COUNT aggregation can accept DISTINCT.
 |`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
 |`LATEST(expr)`|Returns the latest non-null value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.|
 |`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
+|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will returns any non-null value of `expr`|


Can you also add entries for the new aggregators under docs/querying/aggregations.md?

Done. Btw I saw filterNullValues for stringLast and stringFirst. Is that still true?

Hm, looks like the docs are out of date for those, we can fix those later

clintropolis

overall lgtm

clintropolis · 2020-01-16T04:55:29Z

processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyBufferAggregator.java

+  public void init(ByteBuffer buf, int position)
+  {
+    buf.put(position, BYTE_FLAG_IS_NOT_SET);
+    buf.putDouble(position + Byte.BYTES, NULL_VALUE);


Instead of NULL_VALUE, maybe use NullHandling.ZERO_DOUBLE or just 0, since this is the only place it is used.
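For context, the buffer layout implied by the hunk above is a one-byte flag followed by the double value. Here is a minimal sketch of that layout with a literal 0 as the initial value. The class name, the flag values, and the `aggregate`/`get` bodies are assumptions for illustration and are not taken from the PR.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a flag-plus-value buffer layout for a double "any" aggregator.
final class DoubleAnyBufferSketch
{
  // Assumed flag values; the constants in the PR may differ.
  static final byte BYTE_FLAG_IS_NOT_SET = 0;
  static final byte BYTE_FLAG_IS_SET = 1;

  // Layout at `position`: [1 byte flag][8 byte double]
  static void init(ByteBuffer buf, int position)
  {
    buf.put(position, BYTE_FLAG_IS_NOT_SET);
    buf.putDouble(position + Byte.BYTES, 0.0d);
  }

  // Store the value only if nothing has been stored yet.
  static void aggregate(ByteBuffer buf, int position, double value)
  {
    if (buf.get(position) == BYTE_FLAG_IS_NOT_SET) {
      buf.put(position, BYTE_FLAG_IS_SET);
      buf.putDouble(position + Byte.BYTES, value);
    }
  }

  static double get(ByteBuffer buf, int position)
  {
    return buf.getDouble(position + Byte.BYTES);
  }
}
```

Because the flag records whether a value was stored, the initial double is only ever read when nothing was aggregated, which is why a plain 0 is enough there.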

processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyBufferAggregator.java

clintropolis · 2020-01-16T05:05:27Z

processing/src/main/java/org/apache/druid/query/aggregation/any/StringAnyAggregator.java

+  @Override
+  public Object get()
+  {
+    return StringAggregatorUtils.chop(foundValue, maxStringBytes);


nit: It is probably worth pushing chop down into StringUtils rather than renaming and widening the usage of StringAggregatorUtils

sounds good to me. Done
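For readers unfamiliar with `chop`, a byte-limited truncation helper of this kind might look like the sketch below. It is an assumption about the general shape of such a utility, not the actual method in Druid's StringUtils, and the class name `ChopSketch` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a byte-limited string truncation helper.
final class ChopSketch
{
  // Return s unchanged if its UTF-8 encoding fits in maxBytes,
  // otherwise decode only the first maxBytes bytes.
  static String chop(String s, int maxBytes)
  {
    if (s == null) {
      return null;
    }
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    if (bytes.length <= maxBytes) {
      return s;
    }
    // Note: cutting inside a multi-byte character yields a replacement character.
    return new String(bytes, 0, maxBytes, StandardCharsets.UTF_8);
  }
}
```

Since the helper depends only on a string and a byte limit, it has nothing aggregator-specific in it, which fits the suggestion to keep it in a general string utility class.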

clintropolis · 2020-01-16T05:06:47Z

processing/src/main/java/org/apache/druid/query/aggregation/first/StringAggregatorUtils.java

@@ -29,7 +29,7 @@
 import java.nio.ByteBuffer;
 import java.nio.charset.StandardCharsets;

-public class StringFirstLastUtils
+public class StringAggregatorUtils


If you end up moving chop to StringUtils, please revert this rename.

clintropolis · 2020-01-16T05:10:26Z

sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java

@@ -1297,6 +1301,46 @@ public void testLatestAggregators() throws Exception
    );
  }

+  // This test the on-heap version of the AnyAggregator (Double/Float/Long/String)
+  @Test
+  public void testAnyAggregator() throws Exception


nit: It would probably be worth adding a test that runs against the druid.numfoo table, since it contains numeric columns that have null values when run in SQL-compatible null mode, and that also tests ordering by each 'any' aggregator.

clintropolis · 2020-01-16T09:05:43Z

processing/src/main/java/org/apache/druid/query/aggregation/any/StringAnyAggregator.java

+  {
+    if (foundValue == null) {
+      final Object object = valueSelector.getObject();
+      if (object != null) {


Sorry I missed this earlier: this check isn't necessary, since DimensionHandlerUtils.convertObjectToString has its own null check.

clintropolis · 2020-01-16T09:06:53Z

processing/src/main/java/org/apache/druid/query/aggregation/any/StringAnyBufferAggregator.java

+    int stringSizeBytes = buf.getInt(position);
+    if (stringSizeBytes < 0) {
+      final Object object = valueSelector.getObject();
+      if (object != null) {


same comment about unnecessary check
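To illustrate the point, here is a sketch of a string buffer aggregator in which the conversion helper handles null itself, so only the converted result needs checking. Everything here is hypothetical: the class, the `-1` "not found" sentinel, and the local `convertObjectToString` stand-in are assumptions, not the PR's code or Druid's DimensionHandlerUtils.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: length-prefixed string storage with a negative length as "not found".
final class StringAnyBufferSketch
{
  static final int NOT_FOUND = -1; // assumed sentinel

  static void init(ByteBuffer buf, int position)
  {
    buf.putInt(position, NOT_FOUND);
  }

  // Stand-in for a converter that already handles null input.
  static String convertObjectToString(Object o)
  {
    return o == null ? null : o.toString();
  }

  static void aggregate(ByteBuffer buf, int position, Object object, int maxStringBytes)
  {
    if (buf.getInt(position) < 0) {
      // No separate null check on `object`: the converter returns null for null.
      String s = convertObjectToString(object);
      if (s != null) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int len = Math.min(bytes.length, maxStringBytes);
        buf.putInt(position, len);
        for (int i = 0; i < len; i++) {
          buf.put(position + Integer.BYTES + i, bytes[i]);
        }
      }
    }
  }

  static String get(ByteBuffer buf, int position)
  {
    int len = buf.getInt(position);
    if (len < 0) {
      return null;
    }
    byte[] out = new byte[len];
    for (int i = 0; i < len; i++) {
      out[i] = buf.get(position + Integer.BYTES + i);
    }
    return new String(out, StandardCharsets.UTF_8);
  }
}
```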

clintropolis · 2020-01-16T09:10:54Z

sql/src/test/java/org/apache/druid/sql/calcite/util/CalciteTests.java

@@ -377,6 +377,15 @@ public AuthenticationResult createEscalatedAuthenticationResult()
  );

  public static final List<InputRow> ROWS1_WITH_NUMERIC_DIMS = ImmutableList.of(
+      createRow(


Hmm, it seems like this change is causing some unrelated test failures

It seems VarianceSqlAggregatorTest uses this data too, and when the ordering of the rows changed, the variance changed as well. I created a new datasource that has the numeric dim first for my test and changed the numfoo datasource back to how it was. I wanted the numeric null first because ANY selects the first row and skips everything after it, so if the first row is not null, there is little point in the test (if we want to exercise the numeric null handling).

Actually, I think it's fine to just test with the same numfoo datasource (with first row being non-null)

jon-wei

lgtm after CI

* Implement ANY aggregator * Add copyright headers * Add unit tests * fix BufferAggregator * Fix bug in BufferAggregator * hook up the SQL command * add check for buffer aggregator * Address comment * address comments * add docs * Address comments * add more tests for numeric columns that have null values when run in sql compatible null mode * fix checkstyle errors * fix failing tests * fix failing tests

This reverts commit d07159d.

…nd adding custom optimization to only preform ANY aggregations on a single row Summary: Note this diff contains a custom optimization for ANY aggregator and needs to be treated with care during an update. We optimized ANY aggregations to only be applied to a single row in the response. The custom code for this is in NumericTopNColumnSelectorStrategy and StringTopNColumnSelectorStrategy and BaseTopNAlgorithm Reviewers: ericnguyen, O1139 Druid Reviewed By: ericnguyen, O1139 Druid Subscribers: jgu, jwang, jenkins, mleonard Differential Revision: https://phabricator.pinadmin.com/D595982 (cherry picked from commit c42dbe2)

…nd adding custom optimization to only preform ANY aggregations on a single row Summary: Note this diff contains a custom optimization for ANY aggregator and needs to be treated with care during an update. We optimized ANY aggregations to only be applied to a single row in the response. The custom code for this is in NumericTopNColumnSelectorStrategy and StringTopNColumnSelectorStrategy and BaseTopNAlgorithm Reviewers: ericnguyen, O1139 Druid Reviewed By: ericnguyen, O1139 Druid Subscribers: jgu, jwang, jenkins, mleonard Differential Revision: https://phabricator.pinadmin.com/D595982

EnsDeLiz

@jihoonson

maytasm added 5 commits January 14, 2020 16:37

Implement ANY aggregator

5da72a0

Add copyright headers

57dd3cb

Add unit tests

b6d1dbb

fix BufferAggregator

ebd8fc0

Fix bug in BufferAggregator

e3579d3

maytasm requested review from gianm and clintropolis January 15, 2020 06:51

hook up the SQL command

d27db11

suneet-s reviewed Jan 15, 2020

add check for buffer aggregator

a9b7113

clintropolis reviewed Jan 15, 2020

Address comment

2fa5a51

maytasm added 2 commits January 15, 2020 17:17

address comments

e20a743

add docs

3436332

jon-wei reviewed Jan 16, 2020

clintropolis reviewed Jan 16, 2020

clintropolis added Area - Querying Area - SQL labels Jan 16, 2020

maytasm added 3 commits January 15, 2020 23:21

Address comments

a370ffc

add more tests for numeric columns that have null values when run in …

bcae86e

…sql compatible null mode

fix checkstyle errors

d6f9543

clintropolis reviewed Jan 16, 2020

maytasm added 2 commits January 16, 2020 11:21

fix failing tests

9532752

fix failing tests

a3925dc

jon-wei approved these changes Jan 16, 2020

clintropolis approved these changes Jan 16, 2020

jon-wei merged commit 42359c9 into apache:master Jan 16, 2020

maytasm added a commit to implydata/druid-public that referenced this pull request Jan 25, 2020

Revert "Implement ANY aggregator (apache#9187) (#37)"

fb04615

This reverts commit d07159d.

jihoonson added this to the 0.18.0 milestone Mar 26, 2020

EnsDeLiz reviewed Aug 28, 2023
