Using components: File Storage Source

Use the file storage source component to define the file data source for your data flow: which connection to use, where the files reside and which fields will be used in the package.
Source components are always the first components in a package.

To define the cloud storage source:

  1. Add a file source component to begin your dataflow.
  2. Open the component and name it.

source location

Define the source connection parameters:

  • connection - either click the drop-down arrow and select an existing connection, or click create new to create a new connection (see Defining connections).
  • bucket/container - The name of the cloud storage bucket or container that holds your folders and objects. Relevant only for object stores such as Amazon S3 and Google Cloud Storage.
  • path - The path to your folder or object in the form of:
    • folder: sales/2015/01/
      Note: This is NOT supported for GCS. Use a pattern instead to specify multiple files.
    • file or object: sales/2015/01/log.csv
    • pattern: sales/2015/{01,02}/
      You can use wildcard characters for pattern globbing (see Using pattern-matching in source component paths).
      Note - file and directory names that begin with an underscore (_) or a dot (.) are ignored and data contained in them will not be read. 
  • pre-process action - For Amazon S3 connections, you can select copy as a pre-process action. This is useful when the data consists of many small files or when the path (with or without pattern globbing) matches many objects. When copy is selected, the files are read, then concatenated and compressed onto the cluster's local storage. Note that the process fails if you try to copy a single file, and that you may get unexpected results if your delimited data doesn't end in a line break or if the files contain header rows.
  • files manifest path - To process only new or changed files in your source path, select the copy pre-process action and add a files manifest path. The files manifest path consists of a bucket, path and gzip-compressed file name: bucket/path/file.gz. Every time the job executes successfully, the list of processed files is appended to the manifest file. In subsequent job executions, only the files in the source path that don't appear in the manifest are processed. Note that the connection used in the source component must have read+write access to the files manifest path.
  • source path field name - Name the field that holds the input file's path. Leave empty if you don't need the source files' path in the data flow.
Then click the Test Connection button to verify that the connection works and that the bucket/container and path exist.
Note: Paths that contain variables are not validated correctly.

source type

Define the type of your source object:

  • record delimiter - defines what breaks the data into records.
    • New line (\n,\n\r,\r) - each line in your files is a record.
    • End of file - each file is treated as a record.
  • record type - defines the format of the record.
    • Delimited values - fields are delimited by a delimiter you define such as tab, comma or otherwise (see delimited values parameters below).
    • Json object - each record is a Json object (enclosed in curly brackets).
    • Raw - the record is read in its entirety into a single string field.

The examples below provide an easy guide for picking the record delimiter and record type.

  • New line, Delimited values, delimiter [,] - each line is a record of comma-delimited fields.
  • New line, Json object - each line is a record containing a single Json object.
  • End of file, Json object - the whole file is a single record containing one Json object.
  • End of file, Raw - the whole file is read as a single string record. For example, this data is one record:
    Lorem ipsum dolor sit amet,
    consectetur adipiscing elit,
    sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
  • New line, Raw - each line is read as a single string record. For example, this data is three records:
    add on
    add up
    add up to
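The interaction between record delimiter and record type can be sketched as two independent steps: first split the input into records, then parse each record. This is a simplified illustration of the choices above, not the product's actual parser.

```python
import json

def split_records(text, record_delimiter):
    # "New line": each line is a record; "End of file": the whole input is one record
    if record_delimiter == "New line":
        return text.splitlines()
    return [text]

def parse_record(record, record_type, delimiter=","):
    if record_type == "Delimited values":
        return record.split(delimiter)   # simplistic: ignores string qualifiers
    if record_type == "Json object":
        return json.loads(record)        # record must be a {...} object
    return record                        # Raw: the whole record as one string field
```

For instance, the "add on" example above would use split_records with "New line" and parse_record with "Raw", yielding three one-string records.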

Note: The source data can be compressed (gzip or bzip2) or uncompressed. The source data must be UTF-8 encoded.

Delimited values parameters

If you selected new line record delimiter and delimited values record type, you need to define the delimiter character used to separate fields in your objects and whether the object data is quote enclosed.

  1. In the field delimiter drop-down list, select one of the preset characters (comma or tab). You can also type a single character or one of the following escape sequences:
    • \b (backspace)
    • \f (formfeed)
    • \n (newline)
    • \r (carriage return)
    • \t (tab)
    • \' (single quote)
    • \" (double quote)
    • \\ (backslash)

  2. If some or all of the fields are enclosed in single or double quotes, select ' or " in the string qualifier drop-down list. If the fields may also contain line breaks, select " (newline inside) or ' (newline inside), according to the string qualifier used in the files. Use the "newline inside" options with caution, as unbalanced quotes may have undesired effects on job performance.
  3. Check First row contains column names if there is a header row in each source file and you wish to skip it.
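The delimiter, string qualifier and header-row settings behave much like the corresponding options of Python's csv module, which makes for a convenient illustration. This sketch is only an analogy, not the product's parser; the sample data is made up.

```python
import csv
import io

# Comma-delimited data with a double-quote string qualifier and a header row.
data = 'name,amount\n"Smith, J.",100\nLee,250\n'

reader = csv.reader(io.StringIO(data), delimiter=",", quotechar='"')
rows = list(reader)

# "First row contains column names": skip the header row.
header, body = rows[0], rows[1:]
```

Note how the quoted field "Smith, J." keeps its embedded comma instead of being split into two fields.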

Json parameters

  • json path - You can use a custom JSONPath expression to define the base record and extract nested objects and arrays. Using the object preset, the response fields are the keys of the JSON object. Using the array preset, the response fields are the keys of the JSON objects within the array and each object is returned as a record in the component's output.
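The object and array presets can be illustrated with a tiny path resolver in plain Python. This is a simplified stand-in for a real JSONPath expression (the dotted-path syntax here is an assumption for the sketch, not the component's syntax): an object at the path yields one record, while an array yields one record per element.

```python
import json

def extract_records(doc, path):
    """Resolve a dotted path like "results" or "data.items"; "$" means the root.

    Returns a list of records: an array contributes one record per element,
    anything else is a single record.
    """
    node = doc
    if path != "$":
        for key in path.split("."):
            node = node[key]
    return node if isinstance(node, list) else [node]

doc = json.loads('{"status": "ok", "results": [{"id": 1}, {"id": 2}]}')
```

With the array preset at "results", each object in the array becomes a record whose fields are its keys; with the object preset at "$", the whole document is one record.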


  1. After defining the source location and type, you can optionally either:
    • click the green Auto-detect Schema button to automatically populate the field names and data types and open a window to preview the data
    • click the Show data preview button to preview the data.
  2. Define the source data fields as follows:
    • for a flat file, you must define all fields that exist in the source.
    • for a Json file, you only need to define the fields that you wish to use in your package.
  3. Define how you will refer to these fields (alias) in the other components of your package. If you use illegal characters, we'll let you know before you close the dialog.
Defining fields for use in your components
  • For delimited values or raw data:
    • alias - type the alias you will use for this field in other components
    • type - select the data type that matches the source field
  • For Json objects:
    • key - type the source field name in the Json file
    • alias - type the alias you will use for this field in other components
    • type - select the data type that matches the source field

In addition to the usual data types (string, integer, long, float, etc.), you can select the map data type if the field's value is a Json object and bag data type if the field's value is a Json array.
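A small helper shows how a field's data type might be chosen from a sample value. The map/bag choice follows the rule above; the scalar mappings (e.g. int to long) are illustrative assumptions, not the product's exact inference rules.

```python
def pick_type(value):
    # map for a Json object, bag for a Json array, otherwise a scalar type
    if isinstance(value, dict):
        return "map"
    if isinstance(value, list):
        return "bag"
    if isinstance(value, bool):   # check bool before int: True is an int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    return "string"
```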
