Showing results for tags 'snowpark ml'.
-
For the past few years, Fidelity Investments has been moving a significant percentage of its applications to a cloud-based infrastructure. As part of that transition, Fidelity has consolidated its analytics data into its Enterprise Analytics Platform, which is engineered using the Snowflake Data Cloud, making it easier for teams and departments across the company to access the data they need. Fidelity's data scientists use the AI/ML-enabled Enterprise Analytics Platform to process large volumes of structured and unstructured data for deeper insights and better decision-making. Historically, the platform was housed on physical servers. In 2020, Fidelity kicked off its digital transformation and established an Enterprise Data Lake (EDL) along with Data Labs. More recently, the company wanted to conduct parallel analytics in the cloud and decided to use Snowpark and Snowpark ML.

Fidelity has two main enterprise data architecture guiding principles for its data scientists and data engineers:
- For data storage, Snowflake is the platform for all of the company's structured and semi-structured analytical data in its Enterprise Data Lake and Data Labs. All of Fidelity's storage abides by its data security framework and data governance policies and takes a holistic approach to metadata management.
- For compute, the principles are to minimize the transfer of data across networks, avoid duplication of data, and process the data in the database, bringing the compute to the data wherever possible.

Feature engineering in focus

Fidelity creates and transforms features to improve the performance of its ML models. Common feature engineering techniques include encoding, data scaling and correlation analysis. The company's data science architecture team was running into computation pain points, especially around feature engineering. Feature engineering is the stage in the data science process, after expansion and encoding but before refinement, where data can be at its peak volume. Pandas DataFrames offer a flexible data structure for manipulating many types of data and applying a wealth of computations. The trade-off, however, is memory: both the size of the DataFrame itself and the additional memory consumed by the space complexity of the computations applied to it. This was only exacerbated by the speed of single-node processing, where memory contention left limited resources for distributing the work. The team also considered Spark ML purely for the flexibility of distributed processing, but Spark involves complex configuration and carries maintenance overhead for both hardware and software. Fidelity wanted capabilities like parallel processing without the complexity of Spark, so the company turned to Snowpark ML.

Benefits of Snowpark ML

Snowpark ML includes the Python library and underlying infrastructure for end-to-end ML workflows in Snowflake. Fidelity chose the Snowpark ML Modeling API for feature engineering because of its improved performance and scalability, with distributed execution of common sklearn-style preprocessing functions. In addition to being simple to use, it offered a number of additional benefits:
- All computation is done within Snowflake, enabling in-database processing.
- It handles large data volumes and scales both vertically and horizontally; the correlation and preprocessing computations scale linearly with data size on a standard Snowflake warehouse.
- Data is not duplicated or transferred across the network.
- It leverages Snowflake's extensive role-based access control (RBAC), enabling tightly managed security.
- Lazy evaluation avoids unnecessary computation and data transfer, and improves memory management.

Comparing three scenarios

The Fidelity team compared Snowpark ML for three different scenarios: MinMax scaling, one-hot encoding and Pearson correlation.

MinMax scaling is a critical preprocessing step to get Fidelity's data ready for modeling. For numerical values, Fidelity wanted to scale its data into a fixed range between zero and one. With Pandas, performance is fine for small data sets but does not scale to large data sets with thousands or millions of rows. Snowpark ML eliminates the data movement and scales out execution for much better performance.

Figure 1. Performance improvement of 77x with Snowpark ML, compared to in-memory processing for MinMax scaling.

One-hot encoding is a feature transformation technique for categorical values. With Snowpark ML, execution is much faster because the transformation runs with distributed parallel processing and the data read and write times are eliminated.

Figure 2. Performance improvement of 50x with Snowpark ML, compared to in-memory processing for one-hot encoding.
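To make these two preprocessing scenarios concrete, here is a minimal sketch of how MinMax scaling and one-hot encoding can be pushed down to Snowflake with the Snowpark ML Modeling API. It assumes the snowflake-ml-python package is installed; the connection parameters, table name (CUSTOMER_FEATURES) and column names (BALANCE, CHANNEL) are hypothetical placeholders, not details from Fidelity's workload.

```python
# Minimal sketch: in-database MinMax scaling and one-hot encoding with Snowpark ML.
# Table and column names are hypothetical placeholders.
from snowflake.snowpark import Session
from snowflake.ml.modeling.preprocessing import MinMaxScaler, OneHotEncoder

# Fill in your own connection details (or load them from a config/secret store).
connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# A Snowpark DataFrame is a lazy reference to data in Snowflake; nothing is
# pulled into client memory at this point.
df = session.table("CUSTOMER_FEATURES")

# Scale a numeric column into the range [0, 1]; the statistics are computed
# and applied as SQL inside the Snowflake warehouse.
scaler = MinMaxScaler(input_cols=["BALANCE"], output_cols=["BALANCE_SCALED"])
df = scaler.fit(df).transform(df)

# One-hot encode a categorical column, again executed in-database.
encoder = OneHotEncoder(input_cols=["CHANNEL"], output_cols=["CHANNEL_OHE"])
df = encoder.fit(df).transform(df)

# Persist the engineered features as a table without moving raw data
# across the network to the client.
df.write.save_as_table("CUSTOMER_FEATURES_ENGINEERED", mode="overwrite")
```

In this sketch, fit() runs queries in the warehouse to compute the scaling statistics and category values, while transform() stays lazy until the final save_as_table() call, so intermediate results never have to fit in client memory.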
By using Snowpark ML to derive the Pearson product-moment correlation matrix, Fidelity achieved an order-of-magnitude performance improvement by scaling the computation both vertically and horizontally. This is especially useful for large, wide data sets, for example, 29 million rows and over 4,000 columns.

Figure 3. Performance improvement of 17x with Snowpark ML, compared to in-memory processing for Pearson correlation.
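For the third scenario, the snowflake-ml-python package also exposes a distributed Pearson correlation. A minimal sketch follows, reusing the hypothetical session and table from the preprocessing example above; the column names are again made up.

```python
# Minimal sketch: Pearson correlation matrix computed inside the warehouse.
# Reuses the hypothetical `session` from the preprocessing sketch above.
from snowflake.ml.modeling.metrics import correlation

df = session.table("CUSTOMER_FEATURES_ENGINEERED")  # still a lazy Snowpark DataFrame

# The pairwise computation is distributed across the warehouse; only the
# resulting matrix comes back to the client as a small pandas DataFrame.
corr_matrix = correlation(
    df=df,
    columns=["BALANCE_SCALED", "TENURE_MONTHS", "TXN_COUNT"],
)
print(corr_matrix)
```

Because the heavy pairwise math happens in the warehouse, scaling to wide tables becomes a matter of sizing the warehouse rather than the client machine.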
Fidelity achieved significant time, performance and cost benefits by bringing the compute closer to the data and increasing the capacity to handle more load. By speeding up computations, the company's data scientists now iterate on features faster. Those time savings have allowed the team to become more innovative with feature engineering, explore new and different algorithms, and improve model performance. For more details, check out Fidelity's full presentation on Snowpark ML for feature engineering. Ready to start building models of your own with Snowpark ML? Refer to Snowflake's developer documentation for technical details, or try it for yourself with our step-by-step quickstart.

The post Fidelity Optimizes Feature Engineering With Snowpark ML appeared first on Snowflake.

View the full article
-
Snowflake has invested heavily in extending the Data Cloud to AI/ML workloads, starting in 2021 with the introduction of Snowpark, the set of libraries and runtimes in Snowflake that securely deploy and process Python and other popular programming languages. Since then, we've significantly opened up the ways Snowflake's platform, including its elastic compute engine, can be used to accelerate the path from AI/ML development to production. Because Snowpark takes advantage of the scale and performance of Snowflake's logically integrated but physically separated storage and compute, our customers are seeing a median of 3.5 times faster performance and 34% lower costs for their AI/ML and data engineering use cases. As of September 2023, we've already seen many organizations benefit from bringing processing directly to the data, with over 35% of Snowflake customers using Snowpark on a weekly basis. To further accelerate the entire ML workflow from development to production, the Snowflake platform continues to evolve with a new development interface and more functionality to securely productionize both features and models. Let's unpack these announcements! ... View the full article
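As a concrete illustration of the "bringing processing directly to the data" model described in the excerpt above, here is a minimal hypothetical Snowpark sketch: DataFrame transformations are built lazily on the client and compiled into SQL that runs on Snowflake's compute engine. The connection parameters, table and column names are placeholders.

```python
# Minimal sketch of Snowpark's pushdown model (names are hypothetical placeholders).
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col

connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Build the pipeline lazily; no data is read yet.
orders = session.table("ORDERS")
summary = (
    orders.filter(col("STATUS") == "SHIPPED")
    .group_by("REGION")
    .agg(avg(col("AMOUNT")).alias("AVG_AMOUNT"))
)

# show() compiles the pipeline to a single SQL query executed in the warehouse;
# only the aggregated result rows are returned to the client.
summary.show()
```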