-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement ANY aggregator #9187
Implement ANY aggregator #9187
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've only reviewed the Double*Aggregator* since I assume the pattern is the same across the other aggregators.
I learnt a bunch about aggregators reviewing this PR 🎉 I'll read through the tests later, but posting now to unblock you.
processing/src/main/java/org/apache/druid/query/aggregation/AggregatorUtil.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyAggregator.java
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyAggregator.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyAggregator.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyAggregator.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyAggregatorFactory.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyAggregatorFactory.java
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyBufferAggregator.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyBufferAggregator.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/aggregation/first/StringAggregatorUtils.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/aggregation/AggregatorUtil.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyAggregator.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyAggregator.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyAggregatorFactory.java
Outdated
Show resolved
Hide resolved
Note: ANY on Double/Float/Long column with the useDefaultValueForNull=true will not prefer non-null values over default value for null (i.e. 0) |
processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyAggregatorFactory.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyBufferAggregator.java
Show resolved
Hide resolved
@@ -203,6 +203,10 @@ Only the COUNT aggregation can accept DISTINCT. | |||
|`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| | |||
|`LATEST(expr)`|Returns the latest non-null value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.| | |||
|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| | |||
|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will returns any non-null value of `expr`| | |||
|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should mention that the default maxBytesPerString is 1024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean default? You need to always pass the value. There is no default value for maxBytesPerString
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you have this block in StringAnyAggregatorFactory:
this.maxStringBytes = maxStringBytes == null
? StringFirstAggregatorFactory.DEFAULT_MAX_STRING_SIZE
: maxStringBytes;
I would give the SQL function consistent behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the implementation for LATEST, EARLIEST (and ANY since I based it off LATEST, EARLIEST) is that if you use the json stuff, then maxStringBytes is optional and if not present will default to 1024 (as per the docs in docs/querying/aggregations.md).
However, this does not work the same if you issue the query through SQL. To use LATEST, EARLIEST (and ANY) in SQL, you must give the maxStringBytes as the second argument. If you do not, then the column actually gets cast into double (super weird).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss. We can change this behaviour for LATEST, EARLIEST (and ANY)
docs/querying/sql.md
Outdated
@@ -203,6 +203,10 @@ Only the COUNT aggregation can accept DISTINCT. | |||
|`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| | |||
|`LATEST(expr)`|Returns the latest non-null value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.| | |||
|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| | |||
|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will returns any non-null value of `expr`| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add entries for the new aggregators under docs/querying/aggregations.md
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Btw I saw filterNullValues for stringLast and stringFirst. Is that still true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, looks like the docs are out of date for those, we can fix those later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall lgtm
public void init(ByteBuffer buf, int position) | ||
{ | ||
buf.put(position, BYTE_FLAG_IS_NOT_SET); | ||
buf.putDouble(position + Byte.BYTES, NULL_VALUE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of NULL_VALUE
maybe use NullHandling.ZERO_DOUBLE
or like just 0 since this is the only place this is used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
processing/src/main/java/org/apache/druid/query/aggregation/any/DoubleAnyBufferAggregator.java
Show resolved
Hide resolved
@Override | ||
public Object get() | ||
{ | ||
return StringAggregatorUtils.chop(foundValue, maxStringBytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It is probably worth pushing chop
down into StringUtils
rather than renaming and widening the usage of StringAggregatorUtils
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good to me. Done
@@ -29,7 +29,7 @@ | |||
import java.nio.ByteBuffer; | |||
import java.nio.charset.StandardCharsets; | |||
|
|||
public class StringFirstLastUtils | |||
public class StringAggregatorUtils |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you decide to end up moving chop
to StringUtils
, please revert this rename
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -1297,6 +1301,46 @@ public void testLatestAggregators() throws Exception | |||
); | |||
} | |||
|
|||
// This test the on-heap version of the AnyAggregator (Double/Float/Long/String) | |||
@Test | |||
public void testAnyAggregator() throws Exception |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It would probably be worth adding an additional test that tests numeric columns agains druid.numfoo
table since it contains numeric columns that have null values when run in sql compatible null mode, and also tests for ordering by each 'any' aggregator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
{ | ||
if (foundValue == null) { | ||
final Object object = valueSelector.getObject(); | ||
if (object != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I missed this earlier, this check isn't necessary, DimensionHandlerUtils.convertObjectToString
has it's own null check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
int stringSizeBytes = buf.getInt(position); | ||
if (stringSizeBytes < 0) { | ||
final Object object = valueSelector.getObject(); | ||
if (object != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment about unnecessary check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -377,6 +377,15 @@ public AuthenticationResult createEscalatedAuthenticationResult() | |||
); | |||
|
|||
public static final List<InputRow> ROWS1_WITH_NUMERIC_DIMS = ImmutableList.of( | |||
createRow( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it seems like this change is causing some unrelated test failures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the VarianceSqlAggregatorTest is using this data too and when the ordering of the rows changed, the variance also changed. I created a new datasource that have the numeric dim first for my test and changed the numfoo datasource back to how it was. The reason I wanted to have numeric null first is because the ANY will select the first row and skip everything after. So if the first row is not null, then there is not really any point in testing (if we want to test the numeric null stuff)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think it's fine to just test with the same numfoo datasource (with first row being non-null)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm after CI
* Implement ANY aggregator * Add copyright headers * Add unit tests * fix BufferAggregator * Fix bug in BufferAggregator * hook up the SQL command * add check for buffer aggregator * Address comment * address comments * add docs * Address comments * add more tests for numeric columns that have null values when run in sql compatible null mode * fix checkstyle errors * fix failing tests * fix failing tests
* Implement ANY aggregator * Add copyright headers * Add unit tests * fix BufferAggregator * Fix bug in BufferAggregator * hook up the SQL command * add check for buffer aggregator * Address comment * address comments * add docs * Address comments * add more tests for numeric columns that have null values when run in sql compatible null mode * fix checkstyle errors * fix failing tests * fix failing tests
This reverts commit d07159d.
…nd adding custom optimization to only preform ANY aggregations on a single row Summary: Note this diff contains a custom optimization for ANY aggregator and needs to be treated with care during an update. We optimized ANY aggregations to only be applied to a single row in the response. The custom code for this is in NumericTopNColumnSelectorStrategy and StringTopNColumnSelectorStrategy and BaseTopNAlgorithm Reviewers: ericnguyen, O1139 Druid Reviewed By: ericnguyen, O1139 Druid Subscribers: jgu, jwang, jenkins, mleonard Differential Revision: https://phabricator.pinadmin.com/D595982 (cherry picked from commit c42dbe2)
…nd adding custom optimization to only preform ANY aggregations on a single row Summary: Note this diff contains a custom optimization for ANY aggregator and needs to be treated with care during an update. We optimized ANY aggregations to only be applied to a single row in the response. The custom code for this is in NumericTopNColumnSelectorStrategy and StringTopNColumnSelectorStrategy and BaseTopNAlgorithm Reviewers: ericnguyen, O1139 Druid Reviewed By: ericnguyen, O1139 Druid Subscribers: jgu, jwang, jenkins, mleonard Differential Revision: https://phabricator.pinadmin.com/D595982 (cherry picked from commit c42dbe2)
…nd adding custom optimization to only preform ANY aggregations on a single row Summary: Note this diff contains a custom optimization for ANY aggregator and needs to be treated with care during an update. We optimized ANY aggregations to only be applied to a single row in the response. The custom code for this is in NumericTopNColumnSelectorStrategy and StringTopNColumnSelectorStrategy and BaseTopNAlgorithm Reviewers: ericnguyen, O1139 Druid Reviewed By: ericnguyen, O1139 Druid Subscribers: jgu, jwang, jenkins, mleonard Differential Revision: https://phabricator.pinadmin.com/D595982 (cherry picked from commit c42dbe2)
…nd adding custom optimization to only preform ANY aggregations on a single row Summary: Note this diff contains a custom optimization for ANY aggregator and needs to be treated with care during an update. We optimized ANY aggregations to only be applied to a single row in the response. The custom code for this is in NumericTopNColumnSelectorStrategy and StringTopNColumnSelectorStrategy and BaseTopNAlgorithm Reviewers: ericnguyen, O1139 Druid Reviewed By: ericnguyen, O1139 Druid Subscribers: jgu, jwang, jenkins, mleonard Differential Revision: https://phabricator.pinadmin.com/D595982 (cherry picked from commit c42dbe2)
…nd adding custom optimization to only preform ANY aggregations on a single row Summary: Note this diff contains a custom optimization for ANY aggregator and needs to be treated with care during an update. We optimized ANY aggregations to only be applied to a single row in the response. The custom code for this is in NumericTopNColumnSelectorStrategy and StringTopNColumnSelectorStrategy and BaseTopNAlgorithm Reviewers: ericnguyen, O1139 Druid Reviewed By: ericnguyen, O1139 Druid Subscribers: jgu, jwang, jenkins, mleonard Differential Revision: https://phabricator.pinadmin.com/D595982 (cherry picked from commit c42dbe2)
…nd adding custom optimization to only preform ANY aggregations on a single row Summary: Note this diff contains a custom optimization for ANY aggregator and needs to be treated with care during an update. We optimized ANY aggregations to only be applied to a single row in the response. The custom code for this is in NumericTopNColumnSelectorStrategy and StringTopNColumnSelectorStrategy and BaseTopNAlgorithm Reviewers: ericnguyen, O1139 Druid Reviewed By: ericnguyen, O1139 Druid Subscribers: jgu, jwang, jenkins, mleonard Differential Revision: https://phabricator.pinadmin.com/D595982
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implement ANY aggregator
Description
Implement ANY function that returns the first value that we can find
AnySqlAggregator can reuse a lot of code in EarliestLatestSqlAggregator. Hence, added ANY to EarliestLatestSqlAggregator and rename EarliestLatestSqlAggregator to EarliestLatestAnySqlAggregator. (Maybe there can be a better name for that)
SQL function keyword is ANY_VALUE since ANY is SQL reserved already.
Create LongAnyAggregatorFactory, FloatAnyAggregatorFactory, DoubleAnyAggregatorFactory, StringAnyAggregatorFactory. The combine logic will just return lhs if not null otherwise return rhs. (lhs, rhs are just the column value)
Create LongAnyAggregator, FloatAnyAggregator, DoubleAnyAggregator, StringAnyAggregator, LongAnyBufferAggregator, FloatAnyBufferAggregator, DoubleAnyBufferAggregator, StringAnyBufferAggregator. Aggregate logic will just return value if we already see non-null value without getObject from selector. LongAnyBufferAggregator, FloatAnyBufferAggregator, and DoubleAnyBufferAggregator have 0 as default values.
This PR has: