How Do I Read Data from S3 Incrementally?

This article shows you how to design a data flow that reads data incrementally from an S3 bucket, so that only new or changed files are read each time the package executes.

What you’ll need

  1. An S3 bucket with a directory that contains the files to read incrementally. Xplenty needs at least read-only permissions on this bucket.
  2. An S3 bucket where Xplenty has read-write permissions, so that it can write its metadata (the manifest file).

How to incrementally read data from S3

  1. Start by adding a File Storage source component to your package, then click the new component to edit its properties.

  2. Fill in the bucket and path for the source data. In this example, the bucket is xplenty.public and the path is /twitter.

  3. Change the pre-process action to “Copy” and add a files manifest path. The manifest path must be in a bucket where Xplenty has read-write permissions. In our example, it’s xplenty.dumpster/manifests/twitter_reader.gz.

  4. That’s essentially it! This tells Xplenty to list the files in the source path, compare the list to the manifest file, and read only the new or changed files. If the manifest file doesn’t exist, Xplenty reads all files in the source path. That’s what happens the first time you execute the package, or after you delete the manifest file. Once the package executes successfully, the files it read are added to the manifest file.

Note that the source path can contain a pattern or a variable, and incremental reading still works. However, files that are no longer found in the source path are removed from the manifest. This helps keep the manifest file small, but if you later point the source component back at a path you previously read from, those files will be read again.

  5. Finally, complete your data flow and execute it.
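
The manifest comparison described in step 4 and in the note can be pictured with a short sketch. This is only an illustration of the idea, not Xplenty's actual implementation: the manifest format is not documented here, so the sketch assumes a hypothetical mapping from file path to a fingerprint (size and last-modified marker), and the function name `plan_incremental_read` is made up for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical manifest shape (an assumption, not Xplenty's real format):
# file path -> fingerprint, here a (size, last_modified) pair.
Fingerprint = Tuple[int, str]
Manifest = Dict[str, Fingerprint]


def plan_incremental_read(listing: Manifest, manifest: Manifest) -> Tuple[List[str], Manifest]:
    """Return (files_to_read, next_manifest) for one run."""
    # A file is read if it is missing from the manifest (new)
    # or its fingerprint differs from the recorded one (changed).
    to_read = sorted(path for path, fp in listing.items() if manifest.get(path) != fp)
    # The next manifest mirrors the current listing, so files that are no
    # longer in the source path drop out of it. It should be saved only
    # after the run succeeds.
    next_manifest = dict(listing)
    return to_read, next_manifest


# First run: there is no manifest yet, so every file is read.
run1 = {"twitter/a.json": (10, "t1"), "twitter/b.json": (20, "t1")}
read1, manifest = plan_incremental_read(run1, {})

# Second run: b.json changed and c.json is new, so a.json is skipped.
run2 = {"twitter/a.json": (10, "t1"), "twitter/b.json": (25, "t2"), "twitter/c.json": (5, "t2")}
read2, manifest = plan_incremental_read(run2, manifest)

print(read1)
print(read2)
```

Because the next manifest is rebuilt from the current listing, a file that disappears from the source path and later returns is treated as new, which matches the re-read behavior described in the note.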

