Monday, September 21, 2020

Snowflake - Data Loading Strategies

Snowflake is a key player in the cloud database space, along with Redshift, which is an Amazon offering. Interestingly, Snowflake can use Amazon S3 for storage as part of the AWS cloud, even while Amazon continues to promote Redshift. It will be interesting to see how the relationship between Amazon and Snowflake pans out, and there is another competitor in the mix: the vendor Cloudera. More on these dynamics later; for now, let us move forward with data loading strategies in Snowflake.

At a very high level, Snowflake supports the following file locations:
1. Local environment (files in a local folder) - the files are first moved to a Snowflake stage area and then loaded into a table in the Snowflake database.
2. Amazon S3 - files are loaded from a user-supplied S3 bucket (see the stage example after this list).
3. Microsoft Azure - files are loaded from a user-defined Azure container.
4. Google Cloud Storage - files are loaded from a user-supplied Cloud Storage bucket.
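For the cloud locations, loading goes through an external stage that points at the bucket or container. As a rough sketch, an external stage over a user-supplied S3 bucket might be created like this (the stage name, bucket URL, and credential values below are placeholders, not from this post):

-- Hypothetical example: an external stage over an S3 bucket.
-- Replace the URL and credentials with your own values.
CREATE OR REPLACE STAGE my_s3_stage
  URL = 's3://my-bucket/data/'
  CREDENTIALS = (AWS_KEY_ID = '<aws_key_id>' AWS_SECRET_KEY = '<aws_secret_key>');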

In addition to the above, the file formats that are supported are CSV, JSON, Avro, ORC, and Parquet; XML is a preview feature at this point. There are different ways of loading data into Snowflake; the method I would like to highlight in this blog post is bulk loading using the COPY command.
The bulk load using COPY steps differ slightly for each of the file locations mentioned above.
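Regardless of the file location, COPY needs to know how to parse the files, and a named file format object is a convenient way to capture that once. A minimal sketch (the format name and options here are illustrative):

-- Hypothetical example: a reusable file format for comma-delimited CSV
-- files that carry a single header row.
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;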

In situations where data has to be copied from a local file system, the data is first copied to a Snowflake stage using the PUT command and then moved into a Snowflake table. There are three types of internal stage available in Snowflake:
1. User stages - useful when the files are copied to multiple tables but accessed by a single user.
2. Table stages - used when all the files are copied to a single table but used by multiple users.
3. Internal named stages - provide the maximum flexibility in terms of data loading; based on privileges, the data can be loaded into any table. This is the recommended option for regular data loads that involve multiple users and tables.
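Each stage type is referenced with its own prefix in PUT, LIST, and COPY commands. A small sketch of the three forms (my_table and my_stage are placeholder names):

-- User stage: one per user, referenced with @~
LIST @~;
-- Table stage: one per table, referenced with @% plus the table name
LIST @%my_table;
-- Internal named stage: created explicitly, referenced by its name
CREATE OR REPLACE STAGE my_stage;
LIST @my_stage;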

Once you have decided on the type of stage that is needed, you create the stage, copy the files into it using the PUT command, and then use the COPY command to move the data into the Snowflake table. The steps could vary slightly based on the location of the files: for Amazon S3 storage, you would use AWS tools to move the files into the stage area (the S3 bucket) and then COPY into the Snowflake database; for Google Cloud Storage and Microsoft Azure, use the similar tools available in each cloud platform. For detailed information and support, please refer to the Snowflake documentation.
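Putting this together for the local-file case, an end-to-end session might look like the sketch below. The table, stage, and file path are placeholders, and the PUT command must be run from a client such as SnowSQL rather than the web UI:

-- 1. Create a named internal stage that defaults to the CSV format above.
CREATE OR REPLACE STAGE load_stage FILE_FORMAT = my_csv_format;

-- 2. Upload the local file to the stage (run from SnowSQL;
--    the path and file name are hypothetical).
PUT file:///tmp/customers.csv @load_stage AUTO_COMPRESS = TRUE;

-- 3. Copy the staged data into the target table, skipping bad rows.
COPY INTO customers
  FROM @load_stage
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
  ON_ERROR = 'CONTINUE';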


Loading data into the Snowflake database is the first step in exploring the features and power of this cloud database offering, where one can test out its columnar database features.
