Allowing Integrate.io ETL access to my data on Hadoop Distributed File System (HDFS)

Integrate.io ETL can access data residing on any Hadoop Distributed File System (HDFS). This article describes how to create an HDFS connection in Integrate.io ETL.

You must provide Integrate.io ETL access to the cluster's HDFS. Please consult our support team if the HDFS is behind a firewall.
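Before contacting support, it can help to confirm whether the HttpFS gateway is reachable at all from outside your network. Below is a minimal Python sketch of such a check; the hostname and port are placeholder values to replace with your own gateway's details.

    import socket

    # Placeholder values -- substitute your own HttpFS gateway details.
    HTTPFS_HOST = "httpfs.example.com"
    HTTPFS_PORT = 14000  # default HttpFS port

    # Attempt a plain TCP connection; success means the port is open and
    # reachable from wherever this script runs.
    try:
        with socket.create_connection((HTTPFS_HOST, HTTPFS_PORT), timeout=5):
            print(f"{HTTPFS_HOST}:{HTTPFS_PORT} is reachable")
    except OSError as exc:
        print(f"Cannot reach {HTTPFS_HOST}:{HTTPFS_PORT}: {exc}")

Keep in mind that this only tells you about the network the script runs from: Integrate.io ETL connects from its own platform, so a port that is open locally may still be blocked externally by your firewall.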

To create a Hadoop Distributed File System (HDFS) connection in Integrate.io ETL:

  1. Click the Connections icon (lightning bolt) in the top left menu.
  2. Click New connection and choose Hadoop Distributed File System (HDFS).
  3. In the new HDFS connection window, name the connection and enter the connection information:
  • User Name - the user name to use when connecting to HDFS (Kerberos authentication is not currently supported).
  • NameNode Hostname - the host name of the NameNode server, or the logical name of the NameNode in a high availability configuration.
  • NameNode Port - the TCP port of the NameNode. Leave empty if the NameNode is in a high availability configuration.
  • HttpFS Hostname - the host name of the Hadoop HttpFS gateway node. This host must be reachable from Integrate.io ETL's platform (see the verification sketch after these steps).
  • HttpFS Port - the TCP port of the Hadoop HttpFS gateway node (default: 14000).
  4. Click Test connection. If the credentials are correct, a message appears confirming that the connection test was successful.
  5. Click Create HDFS connection.

The connection is created and appears in the list of file storage connections. You can now create a package and test it on your actual data stored in HDFS.
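Because HttpFS exposes the standard WebHDFS REST API, you can also independently verify the hostname, port, and user name you entered above with a simple directory listing. The following is a minimal Python sketch assuming the third-party requests library; the host, port, and user values are placeholders for your own settings.

    import requests  # third-party library: pip install requests

    # Placeholder values -- substitute the same values you entered in the form.
    HTTPFS_HOST = "httpfs.example.com"
    HTTPFS_PORT = 14000
    HDFS_USER = "etl_user"

    # HttpFS speaks the WebHDFS REST API; a LISTSTATUS call on the user's
    # home directory is a quick end-to-end check of host, port, and user name.
    url = f"http://{HTTPFS_HOST}:{HTTPFS_PORT}/webhdfs/v1/user/{HDFS_USER}"
    resp = requests.get(
        url,
        params={"op": "LISTSTATUS", "user.name": HDFS_USER},
        timeout=10,
    )
    resp.raise_for_status()

    # Print each entry's type (FILE or DIRECTORY) and name.
    for entry in resp.json()["FileStatuses"]["FileStatus"]:
        print(entry["type"], entry["pathSuffix"])

A successful listing confirms that the gateway is reachable and that the user name is accepted under simple (pseudo) authentication; an HTTP 401 or 403 response suggests the cluster requires an authentication scheme, such as Kerberos, that this connection type does not support.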