Secure FTP for Azure Blob Storage (2024)

By: John Miner | Updated: 2022-12-14


Problem

It is not uncommon for companies to grow through mergers and acquisitions, meaning a single parent company might have several child companies with different ERP systems. To get a holistic view for reporting, IT departments in the past used a Secure File Transfer Protocol (SFTP) server at the parent company to collect data files from the child companies. If the data for a given table type is in the same format regardless of the source system, it is very easy to ingest the data into a database table for reporting.

Nowadays, the centralized database might reside in the cloud to reduce the total cost of ownership. With that in mind, how do we support the SFTP protocol using Azure Blob Storage?

Solution

On October 22, 2022, Microsoft announced the general availability of SFTP support for Azure Blob Storage. This service supports passwords, key pairs (private/public), or both for authentication. Once the data lands in Azure Data Lake Storage (ADLS) Gen 2, tools like Azure Databricks can read and process the data. The image below shows how the SFTP protocol supports hierarchical namespaces.

Secure FTP for Azure Blob Storage (1)

Business Problem

There are several objectives that our manager wants us to learn today. First, how do we deploy and configure Secure File Transfer Protocol for Azure Blob Storage? Second, how can we configure password versus key pair security? Third, can we write a batch file to call our SFTP utility to automate data sending from remote servers to the cloud? Fourth, how can we manually transfer files using a graphical user interface (GUI)? This eliminates the need for the end users to learn command line syntax. Last but not least, how can we look at some sample data using Azure Databricks?

The IT team will be ready to transition the on-premises FTP workloads to in-cloud systems at the end of the proof of concept.

Create and Configure Storage

Today, we will add SFTP support to our existing data lake in the Azure cloud. We currently have a storage account named sa4adls2030 with a storage container named sc4adls2030. The image below shows the addition of a new container called sc4sftp.

Secure FTP for Azure Blob Storage (2)

There are two sample datasets for the proof of concept. We have the data files for the Adventure Works SalesLt database in parquet format. Also, we have the S&P 500 stock data for several years. In this hypothetical test, each child company owns a single dataset. We want to set up two distinct users: one using password security and the other using key-based security. The image below shows two directories, advwrks and stocks, created using the Azure portal.

Secure FTP for Azure Blob Storage (3)

Taking one last look at the containers, we can see we are ready to enable SFTP for the storage account.

Secure FTP for Azure Blob Storage (4)

Enable SFTP Protocol

The option to enable the SFTP protocol is located under the settings section of the storage account. We can see that both enable options display checkboxes. This means the service and local users are currently disabled. Click the hyperlink to enable the SFTP protocol.

Secure FTP for Azure Blob Storage (5)

The image below is a confirmation to enable both the protocol and local users.

Secure FTP for Azure Blob Storage (6)

The following image shows there are no current local users. In the next section, we will create a user and enable security for each subdirectory under the container named sc4sftp.

Secure FTP for Azure Blob Storage (7)

Create Two Local Users

In the real world, we might have data from several child companies arriving at the SFTP site. We want to enforce isolation by creating an account for each company and allowing each account to access only one subdirectory under the container named sc4sftp. Let's now create the stockuser local account that will use an SSH password for authentication. The image below shows the selection of the local account name as well as the authentication method. Click Next to continue.

Secure FTP for Azure Blob Storage (8)

The next screen allows the administrator to select the container, landing directory, and permissions to give to the local account. The following storage permissions are available: read, write, list, delete, and create. For this demonstration, let's give the account full access. The home directory is sc4sftp/stocks. This ensures segregation of the data between the two fictitious child companies.

Secure FTP for Azure Blob Storage (9)

By clicking Add, the local account will be created. Make sure to copy the SSH password to a secure location. If you forget the password, you can always regenerate it without recreating the local user account.

Secure FTP for Azure Blob Storage (10)

Let's now create the advwrksuser local account that will use an SSH key pair for authentication. The image below shows the selection of the local account name as well as the authentication method. We need to choose which key pair to use; see the documentation for the available options. At this time, let's have the Azure service generate one. We have to give the key a name and an optional comment. Click Next to continue.

Secure FTP for Azure Blob Storage (11)

Just like before, container permissions need to be specified. The home directory will be set to sc4sftp/advwrks. This will keep the isolation between the two child companies' fictitious data.

Secure FTP for Azure Blob Storage (12)

Since the SSH key is larger than a password, a prompt appears to download the file. Again, keep the file for later use.

Secure FTP for Azure Blob Storage (13)

Let's review the newly created accounts. The image below shows the two local accounts. The connection string is important when creating a connection. We can see that the password can be regenerated at will. However, we must recreate the local account to get a new SSH key pair. As we saw, this is not a big deal.

Secure FTP for Azure Blob Storage (14)

Using Putty

The PuTTY organization provides a set of free tools for Telnet, SCP, and SFTP. Please download and install the tools now. The Windows Explorer screenshot below shows the various executables in the Program Files directory after installation. We are interested in the psftp.exe file.

Secure FTP for Azure Blob Storage (15)

In the past, automation was achieved using a command line utility and a batch file. I have five years of S&P 500 data in the c:\stocks directory. We need to copy the SFTP utility into the same directory. This will shorten the paths required in the batch file. We also need to create a script file to perform the necessary actions.

Secure FTP for Azure Blob Storage (16)

The easiest way to get a list of commands is to execute the help command. You can also look at the documentation on the PuTTY website.

Secure FTP for Azure Blob Storage (17)

The script file put-script.txt copies over the five directories in a recursive manner from my local computer to Azure Blob Storage. See the commands below for details.

put -r S&P-2017
put -r S&P-2016
put -r S&P-2015
put -r S&P-2014
put -r S&P-2013
exit
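If more yearly folders are added later, the script file could be generated instead of typed by hand. The following is a minimal Python sketch, assuming the folder names keep the S&P-<year> pattern used in this tip; the function name build_put_script is hypothetical and not part of the author's download.

```python
def build_put_script(years):
    """Return psftp batch-script text that uploads one folder per year."""
    # Newest year first, matching the order used in put-script.txt.
    lines = [f"put -r S&P-{year}" for year in sorted(years, reverse=True)]
    lines.append("exit")  # close the session once the uploads finish
    return "\n".join(lines) + "\n"


# Print the script for 2013 through 2017; redirect it to put-script.txt.
print(build_put_script(range(2013, 2018)), end="")
```

The output can be saved as put-script.txt in the same folder as the psftp executable.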

The batch file put-data.bat connects to the service using the connection string shown in the previous local user window. The only thing you need to do is change the SSH password for your environment. It executes the script file until completion or until an error is encountered.

psftp -b .\put-script.txt <your connection string> -pw <your ssh password>
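A password typed into a batch file stays readable on disk. One alternative, sketched here in Python, reads the password from an environment variable and passes the same -b and -pw options to psftp. The variable name SFTP_PASSWORD and both function names are assumptions for illustration, and the connection string still has to come from the Azure local user window.

```python
import os
import subprocess


def build_psftp_command(connection, password, script=r".\put-script.txt"):
    """Assemble the psftp call from put-data.bat as an argument list."""
    return ["psftp", "-b", script, connection, "-pw", password]


def run_upload(connection):
    """Run psftp and raise CalledProcessError if it exits with an error."""
    # SFTP_PASSWORD is a hypothetical environment variable name.
    password = os.environ["SFTP_PASSWORD"]
    subprocess.run(build_psftp_command(connection, password), check=True)
```

Calling run_upload with the connection string performs the same transfer as the batch file, provided psftp is on the path or in the current directory.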

The next step is to validate that the SFTP script completed the actions correctly. The image below shows the five directories created in ADLS Gen 2.

Secure FTP for Azure Blob Storage (18)

The final step to make this script production ready is to schedule it. We can use Windows Task Scheduler if we do not have a third-party enterprise package.
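As one option, the built-in schtasks command can register the batch file as a daily job. The Python sketch below only assembles the command line; the task name, the start time, and the c:\stocks\put-data.bat path are assumptions based on the folder used in this tip.

```python
def build_schtasks_command(task_name, batch_file, start_time="02:00"):
    """Assemble a schtasks call that runs batch_file once a day."""
    return [
        "schtasks", "/Create",
        "/TN", task_name,    # task name shown in Task Scheduler
        "/TR", batch_file,   # program to run
        "/SC", "DAILY",      # schedule type
        "/ST", start_time,   # start time in 24-hour HH:mm format
    ]


command = build_schtasks_command("UploadStocks", r"c:\stocks\put-data.bat")
print(" ".join(command))
# On the Windows server: subprocess.run(command, check=True)
```

Since a scheduled task may not start in c:\stocks, the batch file would likely need to change to that folder first or use absolute paths for psftp and the script file.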

Using WinSCP

The WinSCP application is a free, award-winning file manager. Please download and install the package now. Begin by creating a new site connection. Enter the connection string from the Azure local user window as the host name. If using an SSH password, enter it now. Since we are using key pair authentication, click Advanced…

Secure FTP for Azure Blob Storage (19)

Under the advanced settings, find the private key for SSH authentication. Choose the SSH key pair file named advwrkskey.ssh, which was downloaded from the Azure Portal after creating the advwrksuser. Unfortunately, WinSCP wants the key in a different format.

Secure FTP for Azure Blob Storage (20)

Save the key in PuTTY format in the c:\temp directory. The parquet files for the SalesLt schema already exist as a sub-directory.

Secure FTP for Azure Blob Storage (21)

Click OK to accept the key in the new format.

Secure FTP for Azure Blob Storage (22)

Most of the time, you will want to save the site settings. This allows you to connect without entering the connection information. The image below shows the settings saved as the site named "Azure Test".

Secure FTP for Azure Blob Storage (23)

I like using a file manager since all actions are drag and drop or cut and paste. Drag the "advwrks-parquet" sub-directory to an empty destination.

Secure FTP for Azure Blob Storage (24)

We are prompted to send the files from on-premises to the cloud using the binary method. When we used the command line utility, we did not need to choose a file format since all files were ASCII. However, since parquet is a binary format, the file manager is smart enough to change the transfer type. Click OK to proceed with the copy action.

Secure FTP for Azure Blob Storage (25)

The image below shows that the copy action has been completed. We have nine dimensional files and two logical facts. The Internet Sales data has been partitioned into four files.

Secure FTP for Azure Blob Storage (26)

In short, if you manually upload files, use a file manager like WinSCP. If you want to automate file uploads, use a command line utility such as PuTTY's psftp.

File Processing

There are several ways to process files that have landed in data lake storage. One option is Apache Spark, so let's discuss how to ingest and explore the data with Azure Databricks. In the past, the sc4adls2030 container was mounted to the Databricks workspace. However, sc4sftp is a new container. We must give the service principal both RBAC (contributor and blob storage contributor) and ACL (read, write, and execute) rights to the folders and files. We will explore the stock files that were transferred using the SFTP protocol. The notebook below mounts the remote storage to the Azure Databricks file system (DBFS).
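Because the notebook appears only as an image, here is a minimal sketch of a service principal mount rather than the author's exact code. The configuration keys follow the Databricks pattern for OAuth mounts of ADLS Gen 2; the /mnt/sftp mount point, the helper names, and the ID values passed in are assumptions. The dbutils object exists only inside a Databricks notebook, so it is passed in as a parameter.

```python
def abfss_source(container, account):
    """Build the abfss URI for a container in an ADLS Gen 2 account."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/"


def oauth_configs(client_id, client_secret, tenant_id):
    """Spark settings for authenticating with a service principal."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def mount_container(dbutils, container, account, configs,
                    mount_point="/mnt/sftp"):
    """Mount the container unless the mount point already exists."""
    existing = [m.mountPoint for m in dbutils.fs.mounts()]
    if mount_point not in existing:
        dbutils.fs.mount(
            source=abfss_source(container, account),
            mount_point=mount_point,
            extra_configs=configs,
        )
```

In a notebook, the call might look like mount_container(dbutils, "sc4sftp", "sa4adls2030", oauth_configs(...)), with the client secret read from a secret scope instead of being typed into the cell.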

Secure FTP for Azure Blob Storage (27)

At the end of the notebook is a simple test to ensure the storage is mounted. If we list the files in the sftp directory, we can see that the two sub-directories are available to browse.

Secure FTP for Azure Blob Storage (28)

Most of the time, I write code using Spark SQL. However, there are times when it makes sense to work with the dataframe methods. We have stock data stored in five separate directories. One way to load all this data into a temporary view for querying is to read each directory into a separate dataframe. Afterward, we create a new dataframe called df_all, which combines all the previous dataframes. Please look at the unionAll method for more details.

There are two more items to note. The withColumn method allows the developer to define a new field within the dataset, and the input_file_name function returns the file path from which the data was read. The image below shows the top 10 records from the tmp_stocks view.
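The notebook cell is only visible as an image, so the following is a sketch of the approach described above and not the author's exact code. It assumes the container is mounted at /mnt/sftp, the stock files are CSV with a header row, and the new column is called file_name; all three are assumptions.

```python
from functools import reduce


def stock_paths(years, root="/mnt/sftp/stocks"):
    """Return one directory path per year, e.g. /mnt/sftp/stocks/S&P-2013."""
    return [f"{root}/S&P-{year}" for year in years]


def load_stocks(spark, paths):
    """Read each directory, tag rows with their source file, and union them."""
    # Imported here so stock_paths() works without PySpark installed.
    from pyspark.sql import DataFrame
    from pyspark.sql.functions import input_file_name

    frames = [
        spark.read.csv(path, header=True, inferSchema=True)
        .withColumn("file_name", input_file_name())
        for path in paths
    ]
    # Fold the per-year dataframes into one combined dataframe.
    df_all = reduce(DataFrame.unionAll, frames)
    df_all.createOrReplaceTempView("tmp_stocks")
    return df_all
```

In a Databricks notebook, load_stocks(spark, stock_paths(range(2013, 2018))) would register the tmp_stocks view, after which spark.sql("select * from tmp_stocks limit 10") returns a sample of rows.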

Secure FTP for Azure Blob Storage (29)

Let's count the number of files per year. The date of the stock information is stored as a string, so we can use the substring function to get the four-digit year as a string. We will filter, group, and order by this expression. Note: The set used by the in expression contains text values. Last, we cannot use a plain count on the file name since it would return the number of records; we only want the number of distinct files (companies) per year.
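A query along these lines could be built as shown in the sketch below. The column names trade_date and file_name are assumptions, as is the date layout: the year is taken from the first four characters, which only holds if the string starts with the year (for example 2013-01-02).

```python
def files_per_year_sql(years, view="tmp_stocks"):
    """Build a Spark SQL query counting distinct source files per year."""
    # Years are quoted because the substring result is a string.
    year_list = ", ".join(f"'{year}'" for year in years)
    return (
        "select substring(trade_date, 1, 4) as trade_year, "
        "count(distinct file_name) as total_files "
        f"from {view} "
        f"where substring(trade_date, 1, 4) in ({year_list}) "
        "group by substring(trade_date, 1, 4) "
        "order by substring(trade_date, 1, 4)"
    )


print(files_per_year_sql(range(2013, 2018)))
```

In a notebook, the string can be passed to spark.sql(...). Using count(distinct file_name) instead of count(file_name) is what turns a record count into a file count.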

Secure FTP for Azure Blob Storage (30)

The stock data for 2013 to 2017 was obtained by calling the Yahoo service with a PowerShell program. There is always the possibility of errors when obtaining data through a web call. Since the S&P 500 should have at least 500 separate files, I would have our company analyst validate the files for 2013 and 2014. It seems like we might be a little short.

Summary

The secure file transfer protocol (SFTP) has been around for many years. In fact, I remember using it at an employer about 20 years ago, and many of these old systems are still in existence. How can you land these files into Azure Blob Storage with little change?

Microsoft has recently announced the general availability of the SFTP protocol for Azure Blob Storage. It supports both password and key pair security. In this tip, we saw how easy it was to create accounts for both types of security. I suggest using different home directories or landing zones for each account. This will keep the data isolated between different user groups.

Both command line (CMD) and graphical user interface (GUI) applications support the SFTP protocol. The CMD application can automate the transfer of data using batch files. The GUI application is great for quick ad hoc transfer of data. This tip's focus was on utilities available in Windows. However, other operating systems like Linux support a similar set of tools.

I will continue to write about Apache Spark in the coming year since it will be a huge player in the future. In this tip, we saw that once the remote storage is mounted, our Python program can read the raw data and expose it as a temporary view. It is important to learn both the PySpark dataframe methods and SQL coding techniques. We could have created five separate temporary views and then used a common table expression (CTE) to work with the combined data in Spark SQL. However, it was a lot easier to call a dataframe method to union all the data into one dataframe before turning it into a temporary view.

Enclosed are three zip files containing code/data for the PuTTY, WinSCP, and Databricks examples.

Next Steps
  • Check out these other Azure articles.




About the author

John Miner is a Data Architect at Insight Digital Innovation helping corporations solve their business needs with various data platform solutions.

This author pledges the content of this article is based on professional experience and not AI generated.
