Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ANY aggregator #9187

Merged
merged 15 commits into from
Jan 16, 2020
Merged

Implement ANY aggregator #9187

merged 15 commits into from
Jan 16, 2020

Conversation

maytasm
Copy link
Contributor

@maytasm maytasm commented Jan 15, 2020

Implement ANY aggregator

Description

Implement ANY function that returns the first value that we can find

  • AnySqlAggregator can reuse a lot of code in EarliestLatestSqlAggregator. Hence, added ANY to EarliestLatestSqlAggregator and rename EarliestLatestSqlAggregator to EarliestLatestAnySqlAggregator. (Maybe there can be a better name for that)

  • SQL function keyword is ANY_VALUE since ANY is SQL reserved already.

  • Create LongAnyAggregatorFactory, FloatAnyAggregatorFactory, DoubleAnyAggregatorFactory, StringAnyAggregatorFactory. The combine logic will just return lhs if not null otherwise return rhs. (lhs, rhs are just the column value)

  • Create LongAnyAggregator, FloatAnyAggregator, DoubleAnyAggregator, StringAnyAggregator, LongAnyBufferAggregator, FloatAnyBufferAggregator, DoubleAnyBufferAggregator, StringAnyBufferAggregator. Aggregate logic will just return value if we already see non-null value without getObject from selector. LongAnyBufferAggregator, FloatAnyBufferAggregator, and DoubleAnyBufferAggregator have 0 as default values.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.
Copy link
Contributor

@suneet-s suneet-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only reviewed the Double*Aggregator* since I assume the pattern is the same across the other aggregators.

I learnt a bunch about aggregators reviewing this PR 🎉 I'll read through the tests later, but posting now to unblock you.

@maytasm
Copy link
Contributor Author

maytasm commented Jan 16, 2020

Note: ANY on Double/Float/Long column with the useDefaultValueForNull=true will not prefer non-null values over default value for null (i.e. 0)

docs/querying/sql.md Outdated Show resolved Hide resolved
@@ -203,6 +203,10 @@ Only the COUNT aggregation can accept DISTINCT.
|`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
|`LATEST(expr)`|Returns the latest non-null value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.|
|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will returns any non-null value of `expr`|
|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should mention that the default maxBytesPerString is 1024

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean default? You need to always pass the value. There is no default value for maxBytesPerString

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have this block in StringAnyAggregatorFactory:


    this.maxStringBytes = maxStringBytes == null
                          ? StringFirstAggregatorFactory.DEFAULT_MAX_STRING_SIZE
                          : maxStringBytes;

I would give the SQL function consistent behavior

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the implementation for LATEST, EARLIEST (and ANY since I based it off LATEST, EARLIEST) is that if you use the json stuff, then maxStringBytes is optional and if not present will default to 1024 (as per the docs in docs/querying/aggregations.md).
However, this does not work the same if you issue the query through SQL. To use LATEST, EARLIEST (and ANY) in SQL, you must give the maxStringBytes as the second argument. If you do not, then the column actually gets cast into double (super weird).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss. We can change this behaviour for LATEST, EARLIEST (and ANY)

@@ -203,6 +203,10 @@ Only the COUNT aggregation can accept DISTINCT.
|`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
|`LATEST(expr)`|Returns the latest non-null value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.|
|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
|`ANY_VALUE(expr)`|Returns any value of `expr`, which must be numeric. If `druid.generic.useDefaultValueForNull=true` this can returns the default value for null and does not prefer "non-null" values over the default value for null. If `druid.generic.useDefaultValueForNull=false`, then this will returns any non-null value of `expr`|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add entries for the new aggregators under docs/querying/aggregations.md?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Btw I saw filterNullValues for stringLast and stringFirst. Is that still true?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, looks like the docs are out of date for those, we can fix those later

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm

public void init(ByteBuffer buf, int position)
{
buf.put(position, BYTE_FLAG_IS_NOT_SET);
buf.putDouble(position + Byte.BYTES, NULL_VALUE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of NULL_VALUE maybe use NullHandling.ZERO_DOUBLE or like just 0 since this is the only place this is used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Override
public Object get()
{
return StringAggregatorUtils.chop(foundValue, maxStringBytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It is probably worth pushing chop down into StringUtils rather than renaming and widening the usage of StringAggregatorUtils

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good to me. Done

@@ -29,7 +29,7 @@
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class StringFirstLastUtils
public class StringAggregatorUtils
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you decide to end up moving chop to StringUtils, please revert this rename

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -1297,6 +1301,46 @@ public void testLatestAggregators() throws Exception
);
}

// This test the on-heap version of the AnyAggregator (Double/Float/Long/String)
@Test
public void testAnyAggregator() throws Exception
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It would probably be worth adding an additional test that tests numeric columns agains druid.numfoo table since it contains numeric columns that have null values when run in sql compatible null mode, and also tests for ordering by each 'any' aggregator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

{
if (foundValue == null) {
final Object object = valueSelector.getObject();
if (object != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I missed this earlier, this check isn't necessary, DimensionHandlerUtils.convertObjectToString has it's own null check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

int stringSizeBytes = buf.getInt(position);
if (stringSizeBytes < 0) {
final Object object = valueSelector.getObject();
if (object != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about unnecessary check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -377,6 +377,15 @@ public AuthenticationResult createEscalatedAuthenticationResult()
);

public static final List<InputRow> ROWS1_WITH_NUMERIC_DIMS = ImmutableList.of(
createRow(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it seems like this change is causing some unrelated test failures

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the VarianceSqlAggregatorTest is using this data too and when the ordering of the rows changed, the variance also changed. I created a new datasource that have the numeric dim first for my test and changed the numfoo datasource back to how it was. The reason I wanted to have numeric null first is because the ANY will select the first row and skip everything after. So if the first row is not null, then there is not really any point in testing (if we want to test the numeric null stuff)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think it's fine to just test with the same numfoo datasource (with first row being non-null)

Copy link
Contributor

@jon-wei jon-wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm after CI

@jon-wei jon-wei merged commit 42359c9 into apache:master Jan 16, 2020
maytasm added a commit to implydata/druid-public that referenced this pull request Jan 16, 2020
* Implement ANY aggregator

* Add copyright headers

* Add unit tests

* fix BufferAggregator

* Fix bug in BufferAggregator

* hook up the SQL command

* add check for buffer aggregator

* Address comment

* address comments

* add docs

* Address comments

* add more tests for numeric columns that have null values when run in sql compatible null mode

* fix checkstyle errors

* fix failing tests

* fix failing tests
maytasm added a commit to implydata/druid-public that referenced this pull request Jan 17, 2020
* Implement ANY aggregator

* Add copyright headers

* Add unit tests

* fix BufferAggregator

* Fix bug in BufferAggregator

* hook up the SQL command

* add check for buffer aggregator

* Address comment

* address comments

* add docs

* Address comments

* add more tests for numeric columns that have null values when run in sql compatible null mode

* fix checkstyle errors

* fix failing tests

* fix failing tests
maytasm added a commit to implydata/druid-public that referenced this pull request Jan 25, 2020
@jihoonson jihoonson added this to the 0.18.0 milestone Mar 26, 2020
debasatwa29 pushed a commit to debasatwa29/druid that referenced this pull request Apr 12, 2022
…nd adding custom optimization to only preform ANY aggregations on a single row

Summary: Note this diff contains a custom optimization for ANY aggregator and needs to be treated with care during an update. We optimized ANY aggregations to only be applied to a single row in the response. The custom code for this is in NumericTopNColumnSelectorStrategy and StringTopNColumnSelectorStrategy and BaseTopNAlgorithm

Reviewers: ericnguyen, O1139 Druid

Reviewed By: ericnguyen, O1139 Druid

Subscribers: jgu, jwang, jenkins, mleonard

Differential Revision: https://phabricator.pinadmin.com/D595982

(cherry picked from commit c42dbe2)
debasatwa29 pushed a commit to debasatwa29/druid that referenced this pull request Apr 26, 2022
…nd adding custom optimization to only preform ANY aggregations on a single row

Summary: Note this diff contains a custom optimization for ANY aggregator and needs to be treated with care during an update. We optimized ANY aggregations to only be applied to a single row in the response. The custom code for this is in NumericTopNColumnSelectorStrategy and StringTopNColumnSelectorStrategy and BaseTopNAlgorithm

Reviewers: ericnguyen, O1139 Druid

Reviewed By: ericnguyen, O1139 Druid

Subscribers: jgu, jwang, jenkins, mleonard

Differential Revision: https://phabricator.pinadmin.com/D595982

(cherry picked from commit c42dbe2)
debasatwa29 pushed a commit to debasatwa29/druid that referenced this pull request Apr 26, 2022
…nd adding custom optimization to only preform ANY aggregations on a single row

Summary: Note this diff contains a custom optimization for ANY aggregator and needs to be treated with care during an update. We optimized ANY aggregations to only be applied to a single row in the response. The custom code for this is in NumericTopNColumnSelectorStrategy and StringTopNColumnSelectorStrategy and BaseTopNAlgorithm

Reviewers: ericnguyen, O1139 Druid

Reviewed By: ericnguyen, O1139 Druid

Subscribers: jgu, jwang, jenkins, mleonard

Differential Revision: https://phabricator.pinadmin.com/D595982

(cherry picked from commit c42dbe2)
debasatwa29 pushed a commit to debasatwa29/druid that referenced this pull request Apr 27, 2022
…nd adding custom optimization to only preform ANY aggregations on a single row

Summary: Note this diff contains a custom optimization for ANY aggregator and needs to be treated with care during an update. We optimized ANY aggregations to only be applied to a single row in the response. The custom code for this is in NumericTopNColumnSelectorStrategy and StringTopNColumnSelectorStrategy and BaseTopNAlgorithm

Reviewers: ericnguyen, O1139 Druid

Reviewed By: ericnguyen, O1139 Druid

Subscribers: jgu, jwang, jenkins, mleonard

Differential Revision: https://phabricator.pinadmin.com/D595982

(cherry picked from commit c42dbe2)
debasatwa29 pushed a commit to debasatwa29/druid that referenced this pull request Apr 27, 2022
…nd adding custom optimization to only preform ANY aggregations on a single row

Summary: Note this diff contains a custom optimization for ANY aggregator and needs to be treated with care during an update. We optimized ANY aggregations to only be applied to a single row in the response. The custom code for this is in NumericTopNColumnSelectorStrategy and StringTopNColumnSelectorStrategy and BaseTopNAlgorithm

Reviewers: ericnguyen, O1139 Druid

Reviewed By: ericnguyen, O1139 Druid

Subscribers: jgu, jwang, jenkins, mleonard

Differential Revision: https://phabricator.pinadmin.com/D595982

(cherry picked from commit c42dbe2)
debasatwa29 pushed a commit to debasatwa29/druid that referenced this pull request Jun 2, 2022
…nd adding custom optimization to only preform ANY aggregations on a single row

Summary: Note this diff contains a custom optimization for ANY aggregator and needs to be treated with care during an update. We optimized ANY aggregations to only be applied to a single row in the response. The custom code for this is in NumericTopNColumnSelectorStrategy and StringTopNColumnSelectorStrategy and BaseTopNAlgorithm

Reviewers: ericnguyen, O1139 Druid

Reviewed By: ericnguyen, O1139 Druid

Subscribers: jgu, jwang, jenkins, mleonard

Differential Revision: https://phabricator.pinadmin.com/D595982
Copy link

@EnsDeLiz EnsDeLiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment