
Fidelity Optimizes Feature Engineering With Snowpark ML


For the past few years, Fidelity Investments has been moving a significant percentage of its applications to a cloud-based infrastructure. As part of that transition, Fidelity has consolidated its analytics data into its Enterprise Analytics Platform, which is engineered using the Snowflake Data Cloud, making it easier for teams and departments across the company to access the data they need.

Fidelity’s data scientists use the AI/ML-enabled Enterprise Analytics Platform to process large volumes of structured and unstructured data for deeper insights and better decision-making. Historically, the platform was housed on physical servers. In 2020, Fidelity kicked off its digital transformation and established an Enterprise Data Lake (EDL) along with Data Labs. More recently, the company wanted to run parallel analytics in the cloud and chose Snowpark and Snowpark ML.

Fidelity has two main enterprise data architecture guiding principles for its data scientists and data engineers:

  • For data storage, Snowflake is the platform for storing all of the company’s structured and semi-structured analytical data in its Enterprise Data Lake and Data Labs. All of Fidelity’s storage abides by its data security framework and data governance policies, and provides a holistic approach to metadata management. 
  • For compute, Fidelity’s principles are to minimize the transfer of data across networks, avoid duplication of data, and process the data in the database — bringing the compute to the data wherever possible.

Feature engineering in focus

Fidelity creates and transforms features to improve the performance of its ML models. Some common feature engineering techniques include encoding, data scaling and correlation analysis. 

The company’s data science architecture team was running into computation pain points, especially around feature engineering. Feature engineering is a stage in the data science process (after expansion and encoding, before refinement) where data can be at its peak volume. Pandas DataFrames offer a flexible data structure for manipulating many types of data and applying a wealth of computations. The trade-off, however, is memory: both the size of the DataFrame itself and the additional memory consumed by the space complexity of the computations applied to it. Single-node processing only exacerbated this, since memory contention left little headroom and there was no way to distribute the work.
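
To make that memory trade-off concrete, here is a small, purely illustrative sketch (the column name and cardinality are hypothetical) showing how a single one-hot expansion can multiply a Pandas DataFrame's in-memory footprint:

```python
import numpy as np
import pandas as pd

# Illustrative only: one categorical column with 500 distinct values.
df = pd.DataFrame({
    "account_type": np.random.choice(
        [f"type_{i}" for i in range(500)], size=1_000_000
    ),
})
print(f"Before encoding: {df.memory_usage(deep=True).sum() / 1e6:.0f} MB")

# One-hot expansion adds a column per category, so peak memory grows
# with cardinality as well as row count.
encoded = pd.get_dummies(df, columns=["account_type"])
print(f"After encoding: {encoded.memory_usage(deep=True).sum() / 1e6:.0f} MB")
```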

The team also considered Spark ML purely for the flexibility of distributed processing, but Spark involves complex configuration and carries maintenance overhead for both hardware and software. Fidelity wanted capabilities like parallel processing without the complexity of Spark, so the company turned to Snowpark ML.

Benefits of Snowpark ML

Snowpark ML includes the Python library and underlying infrastructure for end-to-end ML workflows in Snowflake. Fidelity decided to work with the Snowpark ML Modeling API for feature engineering because of its improved performance and scalability, with distributed execution of common sklearn-style preprocessing functions. In addition to being simple to use, it offered a number of additional benefits (a minimal setup sketch follows the list):

  • All the computation is done within Snowflake, enabling in-database processing.
  • It handles large data volumes and scales both vertically and horizontally.
  • Correlation and preprocessing computation scales linearly with the size of the data and with the Snowflake standard warehouse size.
  • Data is not duplicated nor transferred across the network.
  • It leverages extensive RBAC controls, enabling tightly managed security.
  • Lazy evaluation avoids unnecessary computation and data transfer, and improves memory management. 
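
As a rough illustration of what in-database, lazily evaluated processing looks like, here is a minimal Snowpark session sketch. The connection parameters are placeholders, and the table and column names (FEATURE_TABLE, BALANCE) are hypothetical:

```python
from snowflake.snowpark import Session

# Placeholder credentials: substitute your own account details.
connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# A Snowpark DataFrame is a lazy reference to data in Snowflake; no rows
# move to the client until an action such as .show() or .collect() runs.
features_df = session.table("FEATURE_TABLE")
features_df.filter(features_df["BALANCE"] > 0).show()
```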

Comparing three scenarios

The Fidelity team compared Snowpark ML against its in-memory Pandas processing in three scenarios: MinMax scaling, one-hot encoding and Pearson correlation.

MinMax scaling is a critical preprocessing step to get Fidelity’s data ready for modeling. For numerical values, Fidelity wanted to scale its data into a fixed range between zero and one. With Pandas, the performance is fine for small data sets but does not scale to large data sets with thousands or millions of rows. Snowpark ML eliminates all data movement and scales out execution for much better performance. 
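
The sketch below, continuing with the hypothetical features_df from the session example above, shows roughly how the Snowpark ML Modeling API's MinMaxScaler is invoked; the column names are made up, but both fit and transform execute inside Snowflake:

```python
from snowflake.ml.modeling.preprocessing import MinMaxScaler

# Hypothetical numeric columns; fit and transform both run in-database,
# so the full data set never leaves Snowflake.
scaler = MinMaxScaler(
    feature_range=(0, 1),
    input_cols=["BALANCE", "TENURE_DAYS"],
    output_cols=["BALANCE_SCALED", "TENURE_DAYS_SCALED"],
)
scaled_df = scaler.fit(features_df).transform(features_df)
```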

Figure 1. Performance improvement of 77x with Snowpark ML, compared to in-memory processing for MinMax scaling.

One-hot encoding is a feature transformation technique for categorical values. With Snowpark ML, execution is much faster because the data transformation leverages distributed parallel processing and data read and write times are eliminated.
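
A comparable sketch for one-hot encoding, again with a hypothetical categorical column on the same features_df:

```python
from snowflake.ml.modeling.preprocessing import OneHotEncoder

# Hypothetical categorical column; each distinct value becomes its own
# output column, computed with Snowflake's distributed engine.
encoder = OneHotEncoder(
    sparse=False,
    input_cols=["ACCOUNT_TYPE"],
    output_cols=["ACCOUNT_TYPE_ONEHOT"],
    drop_input_cols=True,
)
encoded_df = encoder.fit(features_df).transform(features_df)
```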

Figure 2. Performance improvement of 50x with Snowpark ML, compared to in-memory processing for one-hot encoding.

By using Snowpark ML to derive the Pearson product-moment correlation matrix, Fidelity achieved an order-of-magnitude performance improvement by scaling the computation both vertically and horizontally. This is especially useful for large, wide data sets, for example one with 29 million rows and over 4,000 columns.
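
Recent Snowpark ML releases expose a correlation helper in snowflake.ml.modeling.metrics that computes the matrix in-database and returns only the small result to the client. A sketch, assuming the same hypothetical features_df as above:

```python
from snowflake.ml.modeling import metrics

# The Pearson correlation across all numeric columns is computed inside
# Snowflake; only the columns-by-columns matrix comes back to the client,
# as a small pandas DataFrame.
corr_matrix = metrics.correlation(df=features_df)
print(corr_matrix)
```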

Figure 3. Performance improvement of 17x with Snowpark ML, compared to in-memory processing for Pearson correlation.

Fidelity achieved significant time, performance and cost benefits by bringing the compute closer to the data and increasing the capacity to handle more load. By speeding up computations, the company’s data scientists now iterate on features faster. Those time savings have allowed the team to become more innovative with feature engineering, explore new and different algorithms, and improve model performance.

For more details, check out Fidelity’s full presentation on Snowpark ML for feature engineering. Ready to start building models of your own with Snowpark ML? Refer to Snowflake’s developer documentation for technical details, or try it for yourself with our step-by-step quickstart.
