Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add cum_count_*() operations to update_by and Numeric library #6270

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

lbooker42
Copy link
Contributor

@lbooker42 lbooker42 commented Oct 23, 2024

Adds new operation to update_by to output the cumulative count of values encountered.

  • cum_count_all - cumulative count of all values (including null)
  • cum_count_non_null - cumulative count of non-null values
  • cum_count_null - cumulative count of null values
  • cum_count_neg - cumulative count of non-null negative values
  • cum_count_pos - cumulative count of non-null positive values
  • cum_count_zero - cumulative count of zero values
  • cum_count_nan - cumulative count of NaN values
  • cum_count_inf - cumulative count of +Inf/-Inf values
  • cum_count_finite - cumulative count of non-null finite values

Example Code

Groovy

t = emptyTable(100).update("A=ii%11==0?null:ii%11-5", "B=(double)A", "C=A==0?Double.NaN:A==2?Double.POSITIVE_INFINITY:A==4?Double.NEGATIVE_INFINITY:(double)A")

opArr = [
    CumCountAll("countA=A","countB=B","countC=C"),
    CumCountNonNull("countA=A","countB=B","countC=C"),
    CumCountNull("countNullA=A","countNullB=B","countNullC=C"),
    CumCountNegative("countNegA=A","countNegB=B","countNegC=C"),
    CumCountPositive("countPosA=A","countPosB=B","countPosC=C"),
    CumCountZero("countZeroA=A","countZeroB=B","countZeroC=C"),
    CumCountNaN("countNaNA=A","countNaNB=B","countNaNC=C"),
    CumCountInfinite("countInfA=A","countInfB=B","countInfC=C"),
    CumCountFinite("countFiniteA=A","countFiniteB=B","countFiniteC=C"),
]

t_count = t.updateBy(opArr)

Python

from deephaven import empty_table
from deephaven.updateby import cum_count, cum_count_null, cum_count_neg, cum_count_pos, cum_count_zero, cum_count_nan, cum_count_inf, cum_count_finite

t = empty_table(100).update(["A=ii%11==0?null:ii%11-5", "B=(double)A","C=A==0?Double.NaN:A==2?Double.POSITIVE_INFINITY:A==4?Double.NEGATIVE_INFINITY:(double)A"])

op_arr = [
    cum_count_all(cols=["count_A=A","count_B=B", "count_C=C"]),
    cum_count_non_null(cols=["count_A=A","count_B=B", "count_C=C"]),
    cum_count_null(cols=["count_null_A=A","count_null_B=B", "count_null_C=C"]),
    cum_count_neg(cols=["count_neg_A=A","count_neg_B=B", "count_neg_C=C"]),
    cum_count_pos(cols=["count_pos_A=A","count_pos_B=B", "count_pos_C=C"]),
    cum_count_zero(cols=["count_zero_A=A","count_zero_B=B", "count_zero_C=C"]),
    cum_count_nan(cols=["count_nan_A=A","count_nan_B=B", "count_nan_C=C"]),
    cum_count_inf(cols=["count_inf_A=A","count_inf_B=B", "count_inf_C=C"]),
    cum_count_finite(cols=["count_finite_A=A","count_finite_B=B", "count_finite_C=C"]),
]

t_count = t.update_by(ops=op_arr)

@lbooker42 lbooker42 changed the title feat!: Add cum_count() operation to update_by feat!: Add cum_count_*() operations to update_by Oct 24, 2024
@lbooker42 lbooker42 changed the title feat!: Add cum_count_*() operations to update_by feat: Add cum_count_*() operations to update_by Oct 25, 2024
@lbooker42 lbooker42 changed the title feat: Add cum_count_*() operations to update_by feat: Add cum_count_*() operations to update_by and Numeric library Oct 28, 2024
Copy link
Member

@devinrsmith devinrsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not take a look at anything except TableSpec stuff.

import org.immutables.value.Value.Immutable;

/**
* A {@link UpdateBySpec} for performing a Cumulative Count of non-null values in the specified columns.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"non-null", seems incorrect.

Comment on lines 19 to 21
public enum CumCountType {
NON_NULL, NULL, NEGATIVE, POSITIVE, ZERO, NAN, INFINITE, FINITE
}
Copy link
Member

@devinrsmith devinrsmith Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pointing out (I think we do this in a couple other places), having values that are only applicable for some numeric types is a bit of a "code smell" - but something I guess we are "okay" with.

I'm worried about double logic; does ZERO include -0.0? Is -0.0 counted for negative?

Instead of an enum, I think we might want an EnumSet here where the user could say "I want to count NULL | NAN, or NULL | NAN | INFINITE, etc (and then, NON_NULL doesn't exist, it would just be represented by the lack of NULL).

If we really wanted to be more generally, we could even consider using a Filter/Expression - of course, we could limit it the scope of what would be allowable in this context, but it could give us room in the future to support more advance counts like "count number of odd longs", etc.

* @param pairs The input/output column name pairs
* @return The aggregation
*/
static UpdateByOperation CumCount(String... pairs) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having a method for per enum is not desirable... just expose enum (or, however it is configured) to the caller?

@lbooker42
Copy link
Contributor Author

Performance Note

Extensive use of lambdas does not penalize performance in testing. Comparing to CumSum which has type-specific operators and evaluates every value and CumCountAll which does no evaluation.

image

Results show that the new operators match the CumSum performance in the bucketed tests (in which the partitioning cost dominates) and exceed it in zero-key tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants