Use the Join component to combine records from two different dataflows, referred to (in the Join component) as the left source and right source.
To join two dataflows, drag and drop the component menu button onto the Join component.
A join condition is formed by selecting a field from the left dataflow and a field with a similar datatype from the right dataflow.
The join condition is an equijoin. It compares the values in the joined fields for equality, and the output includes all the fields from both the left and right sources.
If you define more than one join condition, the Join component returns records with values that meet all the join conditions specified (logical "AND").
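As an illustration of how several equijoin conditions combine, here is a minimal Python sketch. It is not the Join component's implementation: the function name `equi_join`, the sample records, and the `left::`/`right::` output prefixes are all assumptions made for the example.

```python
def equi_join(left, right, conditions):
    """Inner equijoin: keep record pairs for which every condition matches.

    `conditions` is a list of (left_field, right_field) pairs (hypothetical shape).
    """
    joined = []
    for l in left:
        for r in right:
            # Every condition must hold (logical AND).
            # None stands in for null and is never treated as a match.
            if all(l[lf] is not None and l[lf] == r[rf] for lf, rf in conditions):
                # Illustrative prefixes keep left and right fields distinct.
                row = {f"left::{k}": v for k, v in l.items()}
                row.update({f"right::{k}": v for k, v in r.items()})
                joined.append(row)
    return joined


orders = [
    {"customer_id": 1, "region": "EU", "total": 40},
    {"customer_id": 2, "region": "US", "total": 15},
]
customers = [
    {"id": 1, "region": "EU", "name": "Ada"},
    {"id": 2, "region": "EU", "name": "Bob"},
]

# Two conditions: customer_id = id AND region = region.
result = equi_join(orders, customers, [("customer_id", "id"), ("region", "region")])
```

With this sample data, only the first order is returned: the second order matches a customer on `id` but not on `region`, so it fails the combined condition.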
The records returned by the join depend on the join type, as follows:
- Inner - returns only those records that have a matching value in the joined fields from both the left and right sources. Note that null values are not considered matching values.
- Left - returns the same records as an inner join, as well as records from the left source that have no matches in the joined field in the right source. Such records will have null values in the right source fields.
- Right - returns the same records as an inner join, as well as records from the right source that have no matches in the joined field in the left source. Such records will have null values in the left source fields.
- Full - returns all records that would be returned by inner, left and right joins combined (all records from both sources are returned).
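The four join types can be compared with a small Python sketch. This is only an illustration under stated assumptions: the `join` function, its `how` parameter, and the sample data are hypothetical, and a missing side is represented as `None` rather than as a record with null fields.

```python
def join(left, right, left_key, right_key, how="inner"):
    """Return (left_record, right_record) pairs; None marks a missing side."""
    rows = []
    matched_right = set()
    for l in left:
        found = False
        for i, r in enumerate(right):
            # Null (None) keys are never considered a match.
            if l[left_key] is not None and l[left_key] == r[right_key]:
                rows.append((l, r))
                matched_right.add(i)
                found = True
        # Left and full joins keep unmatched left records.
        if not found and how in ("left", "full"):
            rows.append((l, None))
    # Right and full joins keep unmatched right records.
    if how in ("right", "full"):
        for i, r in enumerate(right):
            if i not in matched_right:
                rows.append((None, r))
    return rows


left = [{"k": 1, "v": "a"}, {"k": 2, "v": "b"}, {"k": None, "v": "c"}]
right = [{"k": 1, "w": "x"}, {"k": 3, "w": "y"}, {"k": None, "w": "z"}]

counts = {how: len(join(left, right, "k", "k", how))
          for how in ("inner", "left", "right", "full")}
```

In this sample, the records with a null key on each side do not match each other, so they appear only in the left, right, and full results as unmatched records.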
Hint: After a Join, you usually add a Select component to fix up the aliases and field order.
To join two sources by one or more fields:
- Add a Join component in your package where two dataflows can be joined.
- Open the component and name it.
- Under general options, define the type of join.
- In the first row, define the fields to be joined for each source (left and right).
- If required, add fields for additional join criteria.
To optimize a join according to the data in the records:
- Under join type, click advanced options.
- Select one of the optimization types as follows:
- Default - uses a hash join. Both inputs are read, tagged by source, sorted, and put into buckets according to the join keys. Then, for each key, the records are cross joined by source tag.
- Replicated - use when one input is small enough to fit into main memory, thereby improving efficiency. The large relation should be the left source and the small one should be the right source. If the small relation doesn't fit into main memory, the process fails and an error is generated. Replicated join only works with inner or left joins.
- Skewed - use if the underlying key values are very skewed, so that processing isn't evenly distributed. Such skew hurts performance and may cause the reducer that handles most of the data to run out of memory. When Skewed join is used, a histogram is computed on the join key using the left source, and this data is used to allocate more reducers for a given key.
- Merge - use if both inputs are already sorted on the join key, enabling a significant performance improvement. Merge join only works with inner joins.
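The ideas behind three of these strategies can be sketched in single-process Python. This is a simplified illustration, not how the platform executes joins across a cluster: the function names are hypothetical, all three perform an inner join only, and the sample keys are assumed to be non-null and comparable.

```python
from collections import defaultdict


def tagged_bucket_join(left, right, left_key, right_key):
    """Default-style join: bucket records by key and source, then cross join."""
    buckets = defaultdict(lambda: {"left": [], "right": []})
    for l in left:
        buckets[l[left_key]]["left"].append(l)
    for r in right:
        buckets[r[right_key]]["right"].append(r)
    out = []
    for key in sorted(buckets):
        b = buckets[key]
        out.extend((l, r) for l in b["left"] for r in b["right"])
    return out


def replicated_join(large_left, small_right, left_key, right_key):
    """Replicated-style join: hold the small right input in memory, stream the left."""
    lookup = defaultdict(list)
    for r in small_right:
        lookup[r[right_key]].append(r)
    return [(l, r) for l in large_left for r in lookup.get(l[left_key], [])]


def merge_join(sorted_left, sorted_right, left_key, right_key):
    """Merge-style join: both inputs must already be sorted on the join key."""
    out = []
    i = j = 0
    while i < len(sorted_left) and j < len(sorted_right):
        lk, rk = sorted_left[i][left_key], sorted_right[j][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair the current left record with the whole run of equal right keys.
            j_end = j
            while j_end < len(sorted_right) and sorted_right[j_end][right_key] == lk:
                out.append((sorted_left[i], sorted_right[j_end]))
                j_end += 1
            i += 1  # j stays put so a repeated left key sees the same run
    return out


left = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": 4}]
right = [{"id": 2, "t": "a"}, {"id": 2, "t": "b"}, {"id": 3, "t": "c"}, {"id": 4, "t": "d"}]

by_bucket = tagged_bucket_join(left, right, "id", "id")
by_lookup = replicated_join(left, right, "id", "id")
by_merge = merge_join(left, right, "id", "id")
```

All three functions return the same pairs for the sample data. They differ in what they require: the replicated sketch needs the right input to fit in memory, and the merge sketch needs both inputs pre-sorted but never holds more than its current position in each.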