Using components: File Storage Source

Use the file storage source component to define the file data source for your data flow: which connection to use, where the files reside and which fields will be used in the package.
Source components are always the first components in a package.

To define the cloud storage source:

  1. Add a file storage source component to begin your data flow.
  2. Open the component and name it.

Source location

Define the source connection parameters:

  • connection - either click the drop-down arrow and select an existing connection, or click create new to create a new connection (see Defining connections).
  • bucket/container - The name of your cloud storage bucket or container that contains the folders and objects. Only relevant in the case of object stores such as Amazon S3, Google Cloud Storage and Swift based object stores.
  • path - The path to your folder or object in the form of:
    • folder: sales/2015/01/
    • file or object: sales/2015/01/log.csv
    • pattern: sales/2015/{01,02}/
      You can use wild card characters for pattern globbing (see Using pattern-matching in source component paths).
      Note - file and directory names that begin with an underscore (_) or a dot (.) are ignored, and the data contained in them will not be read.
  • pre-process action - You can select copy as a pre-process action for Amazon S3 connections. This can be very useful if the data consists of many small files or if the path (with or without pattern globbing) contains many objects. When copy is selected, the files are read, then concatenated and compressed onto the cluster's local storage. Note that the process fails if you try to copy a single file, and that you may get unexpected results if your delimited data doesn't end in a line break or if the files contain header rows.
  • files manifest path - To process only new or changed files in your source path, select the copy pre-process action and add a files manifest path. The files manifest path consists of a bucket, a path and a gzip-compressed file name: bucket/path/file.gz. Every time the job executes successfully, the list of files processed is appended to the manifest file. In the next job execution, only the files in the source path that don't appear in the manifest are processed. Note that the connection used in the source component must have read+write access to the files manifest path.
  • source path field name - Name the field that holds the input file's path. Leave empty if you don't need the source files' path in the data flow.
Then click the Test Connection button to check that the connection works and that the bucket/container and path exist.
Note: Paths that contain variables are not validated correctly.
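The path pattern above (for example, sales/2015/{01,02}/) amounts to brace expansion followed by glob matching, with underscore- and dot-prefixed names skipped as described in the note. The Python sketch below is illustrative only; the function names are hypothetical and not part of the product:

```python
import fnmatch
import re

def expand_braces(pattern):
    """Expand each {a,b} alternation into concrete patterns, recursively."""
    m = re.search(r"\{([^{}]*)\}", pattern)
    if not m:
        return [pattern]
    head, tail = pattern[:m.start()], pattern[m.end():]
    results = []
    for option in m.group(1).split(","):
        results.extend(expand_braces(head + option + tail))
    return results

def matches_source_path(key, pattern):
    """True if an object key matches the path pattern and is not hidden.

    Names beginning with '_' or '.' are ignored, mirroring the note above.
    """
    parts = key.split("/")
    if any(p.startswith(("_", ".")) for p in parts if p):
        return False
    return any(fnmatch.fnmatch(key, pat + "*") for pat in expand_braces(pattern))
```

For example, sales/2015/01/log.csv matches sales/2015/{01,02}/, while sales/2015/03/log.csv and sales/2015/01/_SUCCESS do not.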
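The files-manifest behaviour described above boils down to a set difference between the files in the source path and the files listed in the manifest, plus an append after a successful run. Below is a hedged Python sketch of that logic (files_to_process and update_manifest are hypothetical names; the real service keeps the manifest in cloud storage, not on a local disk):

```python
import gzip

def files_to_process(source_keys, manifest_path):
    """Return only the source files not yet listed in the manifest."""
    try:
        with gzip.open(manifest_path, "rt", encoding="utf-8") as f:
            already_done = {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        already_done = set()  # first run: nothing has been processed yet
    return [k for k in source_keys if k not in already_done]

def update_manifest(processed_keys, manifest_path):
    """After a successful job, append the processed files to the manifest."""
    with gzip.open(manifest_path, "at", encoding="utf-8") as f:
        for k in processed_keys:
            f.write(k + "\n")
```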

Source type

Define the type of your source object:

  • record delimiter - defines what breaks the data into records.
    • New line (\n, \r\n, \r) - each line in your files is a record.
    • End of file - each file is treated as a record.
  • record type - defines the format of the record.
    • Delimited values - fields are delimited by a delimiter you define such as tab, comma or otherwise (see delimited values parameters below).
    • Json object - each record is a Json object (enclosed in curly brackets).
    • Raw - the record is read in its entirety into a single string field.

The examples below provide an easy guide to picking record delimiter and type.

My data → Source type parameters

id,name
1,john
2,bruce
3,ashley
→ New line, Delimited values, delimiter [,]

{"id":1,"name":"john"}
{"id":2,"name":"bruce"}
{"id":3,"name":"ashley"}
→ New line, Json object

{"id":1,
"name":"john"}
{"id":2,
"name":"bruce"}
→ End of file, Json object

Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
→ End of file, Raw

actor
actress
actual
actually
ad
adapt
add
addition
additional
add on
address
add up
add up to
adequate
adequately
adjust
→ New line, Raw

Note: The source data can be compressed (gzip or bzip2) or uncompressed. The source data must be UTF-8 encoded.
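The record delimiter and record type combinations above can be sketched in a few lines of Python. This is an illustrative reader only, not the product's implementation; read_records is a hypothetical helper:

```python
import csv
import io
import json

def read_records(text, record_delimiter, record_type, field_delimiter=","):
    """Sketch of the delimiter/type combinations above.

    record_delimiter: "new line" or "end of file"
    record_type:      "delimited", "json", or "raw"
    """
    if record_delimiter == "end of file":
        chunks = [text]                 # the whole file is one record
    else:
        chunks = text.splitlines()      # each line is a record
    if record_type == "raw":
        return chunks                   # one string field per record
    if record_type == "json":
        return [json.loads(c) for c in chunks]
    # delimited values: split each record on the field delimiter
    return [row for chunk in chunks
            for row in csv.reader(io.StringIO(chunk), delimiter=field_delimiter)]
```

For instance, the first example above ("New line, Delimited values") yields one two-field record per line, while the multi-line JSON example needs "End of file, Json object" because a single record spans several lines.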

Delimited values parameters

If you selected new line record delimiter and delimited values record type, you need to define the delimiter character used to separate fields in your objects and whether the object data is quote enclosed.

  1. In the field delimiter drop-down list, select one of the preset characters (comma or tab). You can also type a single character or one of the following escape sequences:
    • \b (backspace)
    • \f (formfeed)
    • \n (newline)
    • \r (carriage return)
    • \t (tab)
    • \' (single quote)
    • \" (double quote)
    • \\ (backslash)

  2. If some or all of the fields are enclosed in single or double quotes, select ' or " in the string qualifier drop-down list. If the fields may also contain line breaks, select " (newline inside) or ' (newline inside), according to the string qualifier used in the files. Use the "newline inside" options with caution, as unbalanced quotes may have undesired effects on job performance.
  3. Check First row contains column names if there is a header row in each source file and you wish to skip it.
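The three settings above map naturally onto Python's csv module, used here purely as an illustration (parse_delimited is a hypothetical helper; the string qualifier plays the role of csv's quotechar, which also tolerates newlines inside quoted fields):

```python
import csv
import io

def parse_delimited(text, field_delimiter=",", string_qualifier='"',
                    first_row_is_header=False):
    """Sketch of the delimited-values settings described above."""
    reader = csv.reader(io.StringIO(text),
                        delimiter=field_delimiter,
                        quotechar=string_qualifier)
    rows = list(reader)
    if first_row_is_header and rows:
        rows = rows[1:]  # "First row contains column names": skip the header
    return rows
```

With a double-quote string qualifier, a field such as "smith, john" keeps its embedded comma instead of being split into two fields.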

Json parameters

  • json path - You can use a custom JSONPath expression to define the base record and extract nested objects and arrays. Using the object preset, the response fields are the keys of the JSON object. Using the array preset, the response fields are the keys of the JSON objects within the array and each object is returned as a record in the component's output.
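To illustrate the object and array presets, here is a minimal sketch that supports only simple dotted paths such as $.data.results; a real JSONPath engine accepts a far richer syntax, and extract_records is a hypothetical name:

```python
def extract_records(document, json_path="$"):
    """Walk a simple dotted JSONPath and return the records at that node.

    If the value at the path is an array, each element becomes a record
    (the array preset); if it is an object, the object itself is the
    single record and its keys become the fields (the object preset).
    """
    node = document
    for part in json_path.lstrip("$").strip(".").split("."):
        if part:
            node = node[part]
    return node if isinstance(node, list) else [node]
```

For example, applying $.data.results to {"data": {"results": [{"id": 1}, {"id": 2}]}} yields two records, one per object in the array.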

Fields

  1. After defining the source location and type you can optionally either:
    • click the green Auto-detect Schema button to automatically populate the field names and data types and open a window to preview the data
    • click the Show data preview button to preview the data.
  2. Define the source data fields as follows:
    • for a flat file, you must define all fields that exist in the source.
    • for a Json file, you only need to define the fields that you wish to use in your package.
  3. Define how you will refer to these fields (alias) in the other components of your package. If you use illegal characters, we'll let you know before you close the dialog.
Defining fields for use in your components
  • For delimited values or raw data:
    • alias - type the alias you will use for this field in other components
    • type - select the data type that matches the source field
  • For Json objects:
    • key - type the source field name in the Json file
    • alias - type the alias you will use for this field in other components
    • type - select the data type that matches the source field

In addition to the usual data types (string, integer, long, float, etc.), you can select the map data type if the field's value is a Json object and bag data type if the field's value is a Json array.
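As an illustration of how JSON values line up with these data types, the hypothetical helper below maps a JSON object to map and a JSON array to bag; the scalar mappings are assumptions for the sketch, and the product's actual type list may differ:

```python
def infer_field_type(value):
    """Illustrative mapping from a parsed JSON value to a field data type."""
    if isinstance(value, dict):
        return "map"          # Json object -> map
    if isinstance(value, list):
        return "bag"          # Json array -> bag
    if isinstance(value, bool):
        return "boolean"      # check bool before int: bool subclasses int
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    return "string"
```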
