Using components: Aggregate

Use the Aggregate component to group the input dataset by one or more fields and apply aggregate functions such as Count, Average, Min, and Max to each group. For example, you can count the number of users in each country who downloaded a file.
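
The country example can be sketched in plain Python to show what grouping and counting produce. This is only an illustration of the semantics, not the component's implementation; the `downloads` records and the `count_by_country` helper are hypothetical.

```python
from collections import Counter

# Hypothetical download records: (user, country) pairs.
downloads = [
    ("alice", "US"),
    ("bob", "US"),
    ("carol", "DE"),
    ("dave", "FR"),
    ("erin", "DE"),
    ("frank", "US"),
]


def count_by_country(records):
    """Group the records by country and count the rows in each group."""
    return dict(Counter(country for _user, country in records))


users_per_country = count_by_country(downloads)
print(users_per_country)
```

Each key in the result corresponds to one group (one unique value of the group by field), and each value is the aggregate computed for that group.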

To aggregate records:

  1. Add an Aggregate component where required in your package.
  2. Open the component and name it.
  3. Under group by, select the fields by which to group the records. The output contains one record for each unique combination of values in the group by fields.
  4. Under function, field and projected field (projected field applies to Min By and Max By only), select the aggregate function to apply to each group defined under group by:
    • Count - returns the number of non-null values in the field you specify in the field column, according to the groupings. Return value data type is long.
    • Count Distinct - returns the number of unique values in the field you specify in the field column, according to the groupings. Return value data type is long.
    • Count All - returns the number of records, according to the groupings. Return value data type is long.
    • HLL - uses the HyperLogLog++ algorithm to return a cardinality estimate or an approximate number of distinct values in the field you specify, according to the groupings. Return value data type is long.
    • Average - returns the average for numeric fields you specify in the field column, according to the groupings. See the following table for return value data types:
      Argument field data type    Return value data type
      int, long                   long
      float, double               double
    • Sum - returns the sum for numeric fields you specify in the field column, according to the groupings. See the following table for return value data types:
      Argument field data type    Return value data type
      int, long                   long
      float, double               double
    • Min - returns the minimum value for the field you specify in the field column, according to the groupings. Return value data type is the same as the input argument's data type.
    • Min By - for the minimum value in the field you specify in the field column, and according to the groupings, returns the value defined by projected field. Return value data type is the same as the projected field's data type.
    • Max - returns the maximum value for the field you specify in the field column, according to the groupings. Return value data type is the same as the input argument's data type.
    • Max By - for the maximum value in the field you specify in the field column, and according to the groupings, returns the value defined by projected field. Return value data type is the same as the projected field's data type.
    • VAR - returns the statistical variance for all values in the field you specify in the field column and according to the groupings. Return value data type is double.
    • VARP - returns the statistical variance for the population of all values in the field you specify in the field column and according to the groupings. Return value data type is double.
    • STDEV - returns the statistical standard deviation for all values in the field you specify in the field column and according to the groupings. Return value data type is double.
    • STDEVP - returns the statistical standard deviation for the population of all values in the field you specify in the field column and according to the groupings. Return value data type is double.
    • Collect - returns a collection (bag) of the values in the field you specify in the field column, according to the groupings. The bag can be manipulated further in a Select component using bag functions. Returned data type is bag.
  5. Pick the field(s) to apply the function to.
  6. Type an alias for the field that contains the resulting value for the function.
  7. Add another function if required.
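
To make the differences between several of the functions above concrete, here is a minimal Python sketch that mimics them on a small in-memory dataset. It is an illustration under stated assumptions, not the component's implementation: the `rows` data and the `aggregate` helper are hypothetical, `None` stands in for a null value, a list stands in for a bag, and VAR is assumed to be the sample variance while VARP is the population variance.

```python
from collections import defaultdict
from statistics import pvariance, variance

# Hypothetical input rows: (country, user, file_size). None stands in for null.
rows = [
    ("US", "alice", 10),
    ("US", "bob", 30),
    ("US", "alice", None),
    ("DE", "carol", 20),
    ("DE", "erin", 40),
]


def aggregate(records):
    """Group by country, then compute one value per function for each group."""
    groups = defaultdict(list)
    for country, user, size in records:
        groups[country].append((user, size))

    result = {}
    for country, members in groups.items():
        # Keep only the rows whose file_size is not null.
        # Assumes every group has at least two non-null sizes.
        with_size = [(user, size) for user, size in members if size is not None]
        sizes = [size for _user, size in with_size]
        result[country] = {
            "count_all": len(members),                            # Count All: every record
            "count": len(sizes),                                  # Count: non-null file_size values
            "count_distinct": len({user for user, _ in members}),  # Count Distinct on user
            "sum": sum(sizes),
            "min": min(sizes),
            "max": max(sizes),
            # Min By / Max By: find the row with the smallest/largest file_size,
            # then return the projected field (user) from that row.
            "min_by": min(with_size, key=lambda m: m[1])[0],
            "max_by": max(with_size, key=lambda m: m[1])[0],
            "var": variance(sizes),    # assumption: VAR is the sample variance
            "varp": pvariance(sizes),  # population variance
            "collect": [user for user, _ in members],  # list stands in for a bag
        }
    return result


summary = aggregate(rows)
print(summary["US"])
```

In the US group, Count All counts three records while Count on file_size counts two, because one value is null. Min By returns the user on the row with the smallest file_size rather than the file_size itself.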
