Xplenty’s platform allows organizations to integrate, process, and prepare data for analytics on the cloud. By providing a coding- and jargon-free environment, Xplenty’s scalable platform ensures businesses can quickly and easily benefit from the opportunities offered by big data without having to invest in hardware, software, or related personnel. With Xplenty, every company can have immediate connectivity to a variety of data stores and a rich set of out-of-the-box data transformation components.
The main terms you will encounter in the Xplenty documentation include:
Connections define the data repositories or services your Xplenty account can read data from or write data to. The connections contain access information that is stored securely and can only be used by your account's members.
An Xplenty package is a data flow definition. Each package is either a dataflow or a workflow.
- Dataflow - A dataflow describes the data sources, the transformations to perform, and the output destinations (location, schema).
- Workflow - A workflow is a package that defines dependencies between tasks, such as executing a SQL query or running a dataflow package. You can define the dependencies and the conditions for executing each task; for example, a task can run only if the previous task completed successfully.
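To make the dependency idea concrete, here is a minimal sketch of workflow-style task execution, where each task runs only if the task it depends on completed successfully. This is an illustration of the concept only, not Xplenty's actual engine or API; `run_workflow` and the task tuples are hypothetical names invented for this example.

```python
def run_workflow(tasks):
    """Run tasks in order; each task runs only if its dependency completed."""
    results = {}
    for name, depends_on, action in tasks:
        # Condition: skip this task if the task it depends on did not complete.
        if depends_on is not None and results.get(depends_on) != "completed":
            results[name] = "skipped"
            continue
        try:
            action()
            results[name] = "completed"
        except Exception:
            results[name] = "failed"
    return results

# Example: run a SQL task first, then a dataflow only if the SQL task succeeded.
tasks = [
    ("truncate_staging", None, lambda: None),           # e.g. execute a SQL query
    ("load_orders", "truncate_staging", lambda: None),  # e.g. run a dataflow package
]
print(run_workflow(tasks))  # both tasks complete in order
```

If `truncate_staging` raised an error, `load_orders` would be marked `skipped` rather than executed, which is exactly the "only if the previous task completed successfully" condition described above.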
You can create your own package from scratch or use one of our pre-defined templates. To create a package from a template, create a new package and select the desired template from the 'Templates' dropdown.
An Xplenty cluster is a group of machines (nodes) that is allocated exclusively to your account's users. You can create one or more clusters, and you can run one or more jobs on each cluster. A cluster that you've created remains allocated to your account until you request to terminate the cluster.
Your account includes a free sandbox cluster for testing jobs on relatively small amounts of data during the package development cycle.
An Xplenty job is a process that is responsible for running a specific package on a cluster. The job is a batch process that processes a finite amount of data and then terminates. Several jobs can run the same package simultaneously. When you run a new job, you select the name of the package whose workflow the job should perform, and the cluster on which to run.
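Jobs can also be launched programmatically through Xplenty's REST API. The sketch below builds (but does not send) such a request; the endpoint path, field names, auth scheme, and media type are assumptions for illustration, so check your account's API documentation before relying on them.

```python
import base64
import json
import urllib.request

API_KEY = "YOUR_API_KEY"   # assumption: API-key basic auth
ACCOUNT = "my_account"     # your Xplenty account ID (hypothetical value)

def build_run_job_request(cluster_id, package_id, variables=None):
    """Build (but do not send) a 'run job' request for the assumed endpoint."""
    url = f"https://api.xplenty.com/{ACCOUNT}/api/jobs"  # assumed endpoint
    body = json.dumps({
        "cluster_id": cluster_id,    # the cluster to run on
        "package_id": package_id,    # the package whose flow the job performs
        "variables": variables or {},
    }).encode("utf-8")
    auth = base64.b64encode(f"{API_KEY}:".encode()).decode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {auth}",
            "Content-Type": "application/json",
            "Accept": "application/vnd.xplenty+json",  # assumed media type
        },
    )

req = build_run_job_request(cluster_id=1234, package_id=5678)
# To actually start the job you would send it: urllib.request.urlopen(req)
```

Note that the request mirrors the UI flow: you name the package to run and the cluster to run it on, which is all a job needs.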
After signing in to your account, you will be taken to Xplenty’s dashboard, where you can access all components of Xplenty: connections, packages, jobs, clusters, schedules, and settings.
- The first step is to create a package: click the "New Package" button (2) under Packages (1) and select Dataflow (3).
You can create your own package from scratch or use one of our pre-defined templates. Templates are packages that capture all of the important data from a particular source and push it into a chosen destination.
To create a package from a template, create a new package and select the desired template from the 'Templates' dropdown. Follow the instructions in the package notes to change variable values or adjust the template to your needs.
- In the dataflow UI, select ‘+ Add Component’ and choose which data source you would like to pull data from.
- After choosing your source component, click in the middle of the striped rectangle to configure the component (4).
- Click on ‘+ New’ and select the connection type. NOTE: if your source isn’t listed in the connections list, you can use the REST API component to connect to most SaaS applications and other data stores that support a REST API, and you may find a template to help speed up the process.
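As an illustration of what the REST API component works with, the sketch below builds a paginated GET request of the kind such a component would send. The endpoint, token, and pagination parameter names are hypothetical placeholders, not a real service or Xplenty's exact configuration.

```python
import urllib.parse
import urllib.request

def build_rest_source_request(base_url, token, page=1, per_page=100):
    """Build a paginated GET request like the one a REST API source would send."""
    query = urllib.parse.urlencode({"page": page, "per_page": per_page})
    return urllib.request.Request(
        f"{base_url}?{query}",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Accept": "application/json",
        },
    )

# Hypothetical SaaS endpoint; the component would send this request, parse
# the JSON response, and advance `page` until the API returns no more rows.
req = build_rest_source_request("https://example.com/api/v1/contacts", "TOKEN", page=2)
```

The key point is that a REST source is just a URL, an auth header, and a pagination rule, which is why the component can cover most SaaS applications.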
- On the selected connection form, add the required details. Note that for some connection types, you must allow Xplenty access to the service or data repository before creating the connection. This may require setting up firewall rules, starting SSH tunnels, or creating users with the minimum required permissions. Read more about allowing Xplenty access to your data repositories here.
- After selecting the desired connection, define your source properties and the schema. You will be able to view the detected schema and a preview of the data, which should assist in selecting the desired fields.
- Hover over the source component and click the `+` sign below it (5). You can add transformation and destination components to the flow. Use transformations to manipulate, shape, standardize, and enrich your data. Read more about using transformations here.
For now, let’s choose the destination of interest (6).
- Click the destination component and configure the endpoint. After selecting the target connection and defining the destination properties, you will be able to map the input schema to the target table columns. Note that clicking the `Auto-fill` button in the upper right corner of the schema mapping step auto-detects the schema of the data pipeline, saving you the time and effort of populating it manually. When completed, click `Save`.
- In the upper right corner of the package editor, click the checkmark button to `save and validate` (8) the package. The validation checks the package for design-time errors and saves the changes made in the package. Read more about validating a package here. After the package completes validation, you can click ‘Save & Run job’ (9) to save the package and run it as a job on a suitable cluster.
- In the ‘Run job’ dialog, select the desired cluster. You can select a sandbox (which is free in all accounts and meant for development purposes) or a distributed cluster. If needed, you can create a new cluster. Read more about creating a cluster here.
- Then, select the package to execute (by default, the current package is selected), and click `Run job`.
You can monitor the progress and details of each job and, once it completes, view a sample of the job's outputs. If the job fails, you will be able to see the error messages that caused it to fail.
Congratulations, you have completed your first job!