Jump to content

Search the Community

Showing results for tags 'feature engineering'.

  • Search By Tags

    Type tags separated by commas.
  • Search By Author

Content Type


Forums

There are no results to display.

There are no results to display.


Find results in...

Find results that contain...


Date Created

  • Start

    End


Last Updated

  • Start

    End


Filter by number of...

Joined

  • Start

    End


Group


Website URL


LinkedIn Profile URL


About Me


Cloud Platforms


Cloud Experience


Development Experience


Current Role


Skills


Certifications


Favourite Tools


Interests

Found 7 results

  1. Fabric Madness part 3Image by author and ChatGPT. “Design an illustration, featuring a Paralympic basketball player in action, this time the theme is on data pipelines” prompt. ChatGPT, 4, OpenAI, 15April. 2024. https://chat.openai.com.In the previous post, we discussed how to use Notebooks with PySpark for feature engineering. While spark offers a lot of flexibility and power, it can be quite complex and requires a lot of code to get started. Not everyone is comfortable with writing code or has the time to learn a new programming language, which is where Dataflow Gen2 comes in. What is Dataflow Gen2?Dataflow Gen2 is a low-code data transformation and integration engine that allows you to create data pipelines for loading data from a wide variety of sources into Microsoft Fabric. It’s based on Power Query, which is integrated into many Microsoft products, such as Excel, Power BI, and Azure Data Factory. Dataflow Gen2 is a great tool for creating data pipelines without code via a visual interface, making it easy to create data pipelines quickly. If you are already familiar with Power Query or are not afraid of writing code, you can also use the underlying M (“Mashup”) language to create more complex transformations. In this post, we will walk through how to use Dataflow Gen2 to create the same features needed to train our machine learning model. We will use the same dataset as in the previous post, which contains data about college basketball games. Fig. 1 — The final result. Image by author.The ChallengeThere are two datasets that we will be using to create our features: the regular season games and the tournament games. These two datasets are also split into the Men’s and Women’s tournaments, which will need to be combined into a single dataset. In total there are four csv files, that need to be combined and transformed into two separate tables in the Lakehouse. Using Dataflows there are multiple ways to solve this problem, and in this post I want to show three different approaches: a no code approach, a low code approach and finally a more advanced all code approach. The no code approachThe first and simplest approach is to use the Dataflow Gen2 visual interface to load the data and create the features. The DataThe data we are looking at is from the 2024 US college basketball tournaments, which was obtained from the on-going March Machine Learning Mania 2024 Kaggle competition, the details of which can be found here, and is licensed under CC BY 4.0 Loading the dataThe first step is to get the data from the Lakehouse, which can be done by selecting the “Get Data” button in the Home ribbon and then selecting More… from the list of data sources. Fig. 2 — Choosing a data source. Image by author.From the list, select OneLake data hub to find the Lakehouse and then once selected, find the csv file in the Files folder. Fig. 3 — Select the csv file. Image by author.This will create a new query with four steps, which are: Source: A function that queries the Lakehouse for all the contents.Navigation 1: Converts the contents of the Lakehouse into a table.Navigation 2: Filters the table to retrieve the selected csv file by name.Imported CSV: Converts the binary file into a table.Fig. 4 — Initial load. Image by author.Now that the data is loaded we can start with some basic data preparation to get it into a format that we can use to create our features. The first thing we need to do is set the column names to be based on the first row of the dataset. This can be done by selecting the “Use first row as headers” option in either the Transform group on the Home ribbon or in the Transform menu item. The next step is to rename the column “WLoc” to “location” by either selecting the column in the table view, or by right clicking on the column and selecting “Rename”. The location column contains the location of the game, which is either “H” for home, “A” for away, or “N” for neutral. For our purposes, we want to convert this to a numerical value, where “H” is 1, “A” is -1, and “N” is 0, as this will make it easier to use in our model. This can be done by selecting the column and then using the Replace values… transform in the Transform menu item. Fig. 5 — Replace Values. Image by author.This will need to be done for the other two location values as well. Finally, we need to change the data type of the location column to be a Whole number instead of Text. This can be done by selecting the column and then selecting the data type from the drop down list in the Transform group on the Home ribbon. Fig. 6 — Final data load. Image by author.Instead of repeating the rename step for each of the location types, a little bit of M code can be used to replace the values in the location column. This can be done by selecting the previous transform in the query (Renamed columns) and then selecting the Insert step button in the formula bar. This will add a new step, and you can enter the following code to replace the values in the location column. Table.ReplaceValue(#"Renamed columns", each [location], each if Text.Contains([location], "H") then "1" else if Text.Contains([location], "A") then "-1" else "0", Replacer.ReplaceText, {"location"})Adding featuresWe’ve got the data loaded, but it’s still not right for our model. Each row in the dataset represents a game between two teams, and includes the scores and statistics for both the winning and losing team in a single wide table. We need to create features that represent the performance of each team in the game and to have a row per team per game. To do this we need to split the data into two tables, one for the winning team and one for the losing team. The simplest way to do this is to create a new query for each team and then merge them back together at the end. There are a few ways that this could be done, however to keep things simple and understandable (especially if we ever need to come back to this later), we will create two references to the source query and then append them together again, after doing some light transformations. Referencing a column can be done either from the Queries panel on the left, or by selecting the context menu of the query if using Diagram view. This will create a new query that references the original query, and any changes made to the original query will be reflected in the new query. I did this twice, once for the winning team and once for the losing team and then renamed the columns by prefixing them with “T1_” and “T2_” respectively. Fig. 7 — Split the dataset. Image by author.Once the column values are set, we can then combine the two queries back together by using Append Queries and then create our first feature, which is the point difference between the two teams. This can be done by selecting the T1_Score and T2_Score columns and then selecting “Subtract” from the “Standard” group on the Add column ribbon. Now that’s done, we can then load the data into the Lakehouse as a new table. The final result should look something like this: Fig. 8 — All joined up. Image by author.There are a few limitations with the no code approach, the main one is that it’s not easy to reuse queries or transformations. In the above example we would need to repeat the same steps another three times to load each of the individual csv files. This is where copy / paste comes in handy, but it’s not ideal. Let’s look at a low code approach next. The low code approachIn the low code approach we will use a combination of the visual interface and the M language to load and transform the data. This approach is more flexible than the no code approach, but still doesn’t require a lot of code to be written. Loading the dataThe goal of the low code approach is to reduce the number of repeated queries that are needed and to make it easier to reuse transformations. To do this we will take advantage of the fact that Power Query is a functional language and that we can create functions to encapsulate the transformations that we want to apply to the data. When we first loaded the data from the Lakehouse there were four steps that were created, the second step was to convert the contents of the Lakehouse into a table, with each row containing a reference to a binary csv file. We can use this as the input into a function, which will load the csv into a new table, using the Invoke custom function transformation for each row of the table. Fig. 9 — Lakehouse query with the binary csv files in a column called Content. Image by author.To create the function, select “Blank query” from the Get data menu, or right click the Queries panel and select “New query” > “Blank query”. In the new query window, enter the following code: (TableContents as binary) =>let Source = Csv.Document(TableContents, [Delimiter = ",", Columns = 34, QuoteStyle = QuoteStyle.None]), PromoteHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]) in PromoteHeadersThe code of this function has been copied from our initial no code approach, but instead of loading the csv file directly, it takes a parameter called TableContents, reads it as a csv file Csv.Document and then sets the first row of the data to be the column headers Table.PromoteHeaders. We can then use the Invoke custom function transformation to apply this function to each row of the Lakehouse query. This can be done by selecting the “Invoke custom function” transformation from the Add column ribbon and then selecting the function that we just created. Fig. 10 — Invoke custom function. Image by author.This will create a new column in the Lakehouse query, with the entire contents of the csv file loaded into a table, which is represented as [Table] in the table view. We can then use the expand function on the column heading to expand the table into individual columns. Fig. 11 — Expand columns. Image by author.The result effectively combines the two csv files into a single table, which we can then continue to create our features from as before. There are still some limitations with this approach, while we’ve reduced the number of repeated queries, we still need to duplicate everything for both the regular season and tournament games datasets. This is where the all code approach comes in. The all code approachThe all code approach is the most flexible and powerful approach, but also requires the most amount of code to be written. This approach is best suited for those who are comfortable with writing code and want to have full control over the transformations that are applied to the data. Essentially what we’ll do is grab all the M code that was generated in each of the queries and combine them into a single query. This will allow us to load all the csv files in a single query and then apply the transformations to each of them in a single step. To get all the M code, we can select each query and then click on the Advanced Editor from the Home ribbon, which displays all the M code that was generated for that query. We can then copy and paste this code into a new query and then combine them all together. To do this, we need to create a new blank query and then enter the following code: (TourneyType as text) => let Source = Lakehouse.Contents(null){[workspaceId = "..."]}[Data]{[lakehouseId = "..."]}[Data], #"Navigation 1" = Source{[Id = "Files", ItemKind = "Folder"]}[Data], #"Filtered rows" = Table.SelectRows(#"Navigation 1", each Text.Contains([Name], TourneyType)), #"Invoked custom function" = Table.AddColumn(#"Filtered rows", "Invoked custom function", each LoadCSV([Content])), #"Removed columns" = Table.RemoveColumns(#"Invoked custom function", {"Content", "Name", "Extension", "Date accessed", "Date modified", "Date created", "Attributes", "Folder Path", "ItemKind", "IsLeaf"}), #"Expanded Invoked custom function" = Table.ExpandTableColumn(#"Removed columns", "Invoked custom function", {"Season", "DayNum", "WTeamID", "WScore", "LTeamID", "LScore", "WLoc", "NumOT", "WFGM", "WFGA", "WFGM3", "WFGA3", "WFTM", "WFTA", "WOR", "WDR", "WAst", "WTO", "WStl", "WBlk", "WPF", "LFGM", "LFGA", "LFGM3", "LFGA3", "LFTM", "LFTA", "LOR", "LDR", "LAst", "LTO", "LStl", "LBlk", "LPF"}, {"Season", "DayNum", "WTeamID", "WScore", "LTeamID", "LScore", "WLoc", "NumOT", "WFGM", "WFGA", "WFGM3", "WFGA3", "WFTM", "WFTA", "WOR", "WDR", "WAst", "WTO", "WStl", "WBlk", "WPF", "LFGM", "LFGA", "LFGM3", "LFGA3", "LFTM", "LFTA", "LOR", "LDR", "LAst", "LTO", "LStl", "LBlk", "LPF"}), #"Renamed columns" = Table.RenameColumns(#"Expanded Invoked custom function", {{"WLoc", "location"}}), Custom = Table.ReplaceValue(#"Renamed columns", each [location], each if Text.Contains([location], "H") then "1" else if Text.Contains([location], "A") then "-1" else "0", Replacer.ReplaceText, {"location"}), #"Change Types" = Table.TransformColumnTypes(Custom, {{"Season", Int64.Type}, {"DayNum", Int64.Type}, {"WTeamID", Int64.Type}, {"WScore", Int64.Type}, {"LTeamID", Int64.Type}, {"LScore", Int64.Type}, {"location", Int64.Type}, {"NumOT", Int64.Type}, {"WFGM", Int64.Type}, {"WFGA", Int64.Type}, {"WFGM3", Int64.Type}, {"WFGA3", Int64.Type}, {"WFTM", Int64.Type}, {"WFTA", Int64.Type}, {"WOR", Int64.Type}, {"WDR", Int64.Type}, {"WAst", Int64.Type}, {"WTO", Int64.Type}, {"WStl", Int64.Type}, {"WBlk", Int64.Type}, {"WPF", Int64.Type}, {"LFGM", Int64.Type}, {"LFGA", Int64.Type}, {"LFGM3", Int64.Type}, {"LFGA3", Int64.Type}, {"LFTM", Int64.Type}, {"LFTA", Int64.Type}, {"LOR", Int64.Type}, {"LDR", Int64.Type}, {"LAst", Int64.Type}, {"LTO", Int64.Type}, {"LStl", Int64.Type}, {"LBlk", Int64.Type}, {"LPF", Int64.Type}}), Winners = Table.TransformColumnNames(#"Change Types", each if Text.StartsWith(_, "W") then Text.Replace(_, "W", "T1_") else Text.Replace(_, "L", "T2_")), #"Rename L" = Table.TransformColumnNames(#"Change Types", each if Text.StartsWith(_, "W") then Text.Replace(_, "W", "T2_") else Text.Replace(_, "L", "T1_")), #"Replaced Value L" = Table.ReplaceValue(#"Rename L", each [location], each if [location] = 1 then -1 else if Text.Contains([location], -1) then 1 else [location], Replacer.ReplaceValue, {"location"}), Losers = Table.TransformColumnTypes(#"Replaced Value L", {{"location", Int64.Type}}), Combined = Table.Combine({Winners, Losers}), PointDiff = Table.AddColumn(Combined, "PointDiff", each [T1_Score] - [T2_Score], Int64.Type) in PointDiffNote: the Lakehouse connection values have been removed What’s happening here is that we’re: Loading the data from the Lakehouse;Filtering the rows to only include the csv files that match the TourneyType parameter;Loading the csv files into tables;Expanding the tables into columns;Renaming the columns;Changing the data types;Combining the two tables back together;Calculating the point difference between the two teams.Using the query is then as simple as selecting it, and then invoking the function with the TourneyType parameter. Fig. 12 — Invoke function. Image by author.This will create a new query with the function as it’s source, and the data loaded and transformed. It’s then just a case of loading the data into the Lakehouse as a new table. Fig. 13 — Function load. Image by author.As you can see, the LoadTournamentData function is invoked with the parameter “RegularSeasonDetailedResults” which will load both the Men’s and Women’s regular season games into a single table. ConclusionAnd that’s it! Hopefully this post has given you a good overview of how to use Dataflow Gen2 to prepare data and create features for your machine learning model. Its low code approach makes it easy to create data pipelines quickly, and it contains a lot of powerful features that can be used to create complex transformations. It’s a great first port of call for anyone who needs to transform data, but more importantly, has the benefit of not needing to write complex code that is prone to errors, is hard to test, and is difficult to maintain. At the time of writing, Dataflows Gen2 are unsupported with the Git integration, and so it’s not possible to version control or share the dataflows. This feature is expected to be released in Q4 2024. Originally published at https://nobledynamic.com on April 15, 2024. Feature Engineering with Microsoft Fabric and Dataflow Gen2 was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story. View the full article
  2. Fabric Madness part 2Image by author and ChatGPT. “Design an illustration, focusing on a basketball player in action, this time the theme is on using pyspark to generate features for machine leaning models in a graphic novel style” prompt. ChatGPT, 4, OpenAI, 4 April. 2024. https://chat.openai.com.A Huge thanks to Martim Chaves who co-authored this post and developed the example scripts. In our previous post we took a high level view of how to train a machine learning model in Microsoft Fabric. In this post we wanted to dive deeper into the process of feature engineering. Feature engineering is a crucial part of the development lifecycle for any Machine Learning (ML) systems. It is a step in the development cycle where raw data is processed to better represent its underlying structure and provide additional information that enhance our ML models. Feature engineering is both an art and a science. Even though there are specific steps that we can take to create good features, sometimes, it is only through experimentation that good results are achieved. Good features are crucial in guaranteeing a good system performance. As datasets grow exponentially, traditional feature engineering may struggle with the size of very large datasets. This is where PySpark can help — as it is a scalable and efficient processing platform for massive datasets. A great thing about Fabric is that it makes using PySpark easy! In this post, we’ll be going over: How does PySpark Work?Basics of PySparkFeature Engineering in ActionBy the end of this post, hopefully you’ll feel comfortable carrying out feature engineering with PySpark in Fabric. Let’s get started! How does PySpark work?Spark is a distributed computing system that allows for the processing of large datasets with speed and efficiency across a cluster of machines. It is built around the concept of a Resilient Distributed Dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. RDDs are the fundamental data structure of Spark, and they allow for the distribution of data across a cluster of machines. PySpark is the Python API for Spark. It allows for the creation of Spark DataFrames, which are similar to Pandas DataFrames, but with the added benefit of being distributed across a cluster of machines. PySpark DataFrames are the core data structure in PySpark, and they allow for the manipulation of large datasets in a distributed manner. At the core of PySpark is the SparkSession object, which is what fundamentally interacts with Spark. This SparkSession is what allows for the creation of DataFrames, and other functionalities. Note that, when running a Notebook in Fabric, a SparkSession is automatically created for you, so you don't have to worry about that. Having a rough idea of how PySpark works, let’s get to the basics. Basics of PySparkAlthough Spark DataFrames may remind us of Pandas DataFrames due to their similarities, the syntax when using PySpark can be a bit different. In this section, we’ll go over some of the basics of PySpark, such as reading data, combining DataFrames, selecting columns, grouping data, joining DataFrames, and using functions. The DataThe data we are looking at is from the 2024 US college basketball tournaments, which was obtained from the on-going March Machine Learning Mania 2024 Kaggle competition, the details of which can be found here, and is licensed under CC BY 4.0 [1] Reading dataAs mentioned in the previous post of this series, the first step is usually to create a Lakehouse and upload some data. Then, when creating a Notebook, we can attach it to the created Lakehouse, and we’ll have access to the data stored there. PySpark Dataframes can read various data formats, such as CSV, JSON, Parquet, and others. Our data is stored in CSV format, so we’ll be using that, like in the following code snippet: # Read women's data w_data = ( spark.read.option("header", True) .option("inferSchema", True) .csv(f"Files/WNCAATourneyDetailedResults.csv") .cache() )In this code snippet, we’re reading the detailed results data set of the final women’s basketball college tournament matches. Note that the "header" option being true means that the names of the columns will be derived from the first row of the CSV file. The inferSchema option tells Spark to guess the data types of the columns - otherwise they would all be read as strings. .cache() is used to keep the DataFrame in memory. If you’re coming from Pandas, you may be wondering what the equivalent of df.head() is for PySpark - it's df.show(5). The default for .show() is the top 20 rows, hence the need to specifically select 5. Combining DataFramesCombining DataFrames can be done in multiple ways. The first we will look at is a union, where the columns are the same for both DataFrames: # Read women's data ... # Read men's data m_data = ( spark.read.option("header", True) .option("inferSchema", True) .csv(f"Files/MNCAATourneyDetailedResults.csv") .cache() ) # Combine (union) the DataFrames combined_results = m_data.unionByName(w_data)Here, unionByName joins the two DataFrames by matching the names of the columns. Since both the women's and the men's detailed match results have the same columns, this is a good approach. Alternatively, there's also union, which combines two DataFrames, matching column positions. Selecting ColumnsSelecting columns from a DataFrame in PySpark can be done using the .select() method. We just have to indicate the name or names of the columns that are relevant as a parameter. Here’s the output for w_scores.show(5): # Selecting a single column w_scores = w_data.select("WScore") # Selecting multiple columns teamid_w_scores = w_data.select("WTeamID", "WScore") ``` Here's the output for `w_scores.show(5)`: ``` +------+ |Season| +------+ | 2010| | 2010| | 2010| | 2010| | 2010| +------+ only showing top 5 rowsThe columns can also be renamed when being selected using the .alias() method: winners = w_data.select( w_data.WTeamID.alias("TeamID"), w_data.WScore.alias("Score") )Grouping DataGrouping allows us to carry out certain operations for the groups that exist within the data and is usually combined with a aggregation functions. We can use .groupBy() for this: # Grouping and aggregating winners_average_scores = winners.groupBy("TeamID").avg("Score")In this example, we are grouping by "TeamID", meaning we're considering the groups of rows that have a distinct value for "TeamID". For each of those groups, we're calculating the average of the "Score". This way, we get the average score for each team. Here’s the output of winners_average_scores.show(5), showing the average score of each team: +------+-----------------+ |TeamID| avg(Score)| +------+-----------------+ | 3125| 68.5| | 3345| 74.2| | 3346|79.66666666666667| | 3376|73.58333333333333| | 3107| 61.0| +------+-----------------+Joining DataJoining two DataFrames can be done using the .join() method. Joining is essentially extending the DataFrame by adding the columns of one DataFrame to another. # Joining on Season and TeamID final_df = matches_df.join(stats_df, on=['Season', 'TeamID'], how='left')In this example, both stats_df and matches_df were using Season and TeamID as unique identifiers for each row. Besides Season and TeamID, stats_df has other columns, such as statistics for each team during each season, whereas matches_df has information about the matches, such as date and location. This operation allows us to add those interesting statistics to the matches information! FunctionsThere are several functions that PySpark provides that help us transform DataFrames. You can find the full list here. Here’s an example of a simple function: from pyspark.sql import functions as F w_data = w_data.withColumn("HighScore", F.when(F.col("Score") > 80, "Yes").otherwise("No"))In the code snippet above, a "HighScore" column is created when the score is higher than 80. For each row in the "Score" column (indicated by the .col() function), the value "Yes" is chosen for the "HighScore" column if the "Score" value is larger than 80, determined by the .when() function. .otherwise(), the value chosen is "No". Feature Engineering in ActionNow that we have a basic understanding of PySpark and how it can be used, let’s go over how the regular season statistics features were created. These features were then used as inputs into our machine learning model to try to predict the outcome of the final tournament games. The starting point was a DataFrame, regular_data, that contained match by match statistics for the regular seasons, which is the United States College Basketball Season that happens from November to March each year. Each row in this DataFrame contained the season, the day the match was held, the ID of team 1, the ID of team 2, and other information such as the location of the match. Importantly, it also contained statistics for each team for that specific match, such as "T1_FGM", meaning the Field Goals Made (FGM) for team 1, or "T2_OR", meaning the Offensive Rebounds (OR) of team 2. The first step was selecting which columns would be used. These were columns that strictly contained in-game statistics. # Columns that we'll want to get statistics from boxscore_cols = [ 'T1_FGM', 'T1_FGA', 'T1_FGM3', 'T1_FGA3', 'T1_OR', 'T1_DR', 'T1_Ast', 'T1_Stl', 'T1_PF', 'T2_FGM', 'T2_FGA', 'T2_FGM3', 'T2_FGA3', 'T2_OR', 'T2_DR', 'T2_Ast', 'T2_Stl', 'T2_PF' ]If you’re interested, here’s what each statistic’s code means: FGM: Field Goals MadeFGA: Field Goals AttemptedFGM3: Field Goals Made from the 3-point-lineFGA3: Field Goals Attempted for 3-point-line goalsOR: Offensive Rebounds. A rebounds is when the ball rebounds from the board when a goal is attempted, not getting in the net. If the team that attempted the goal gets possession of the ball, it’s called an “Offensive” rebound. Otherwise, it’s called a “Defensive” Rebound.DR: Defensive ReboundsAst: Assist, a pass that led directly to a goalStl: Steal, when the possession of the ball is stolenPF: Personal Foul, when a player makes a foulFrom there, a dictionary of aggregation expressions was created. Basically, for each column name in the previous list of columns, a function was stored that would calculate the mean of the column, and rename it, by adding a suffix, "mean". from pyspark.sql import functions as F from pyspark.sql.functions import col # select a column agg_exprs = {col: F.mean(col).alias(col + 'mean') for col in boxscore_cols}Then, the data was grouped by "Season" and "T1_TeamID", and the aggregation functions of the previously created dictionary were used as the argument for .agg(). season_statistics = regular_data.groupBy(["Season", "T1_TeamID"]).agg(*agg_exprs.values())Note that the grouping was done by season and the ID of team 1 — this means that "T2_FGAmean", for example, will actually be the mean of the Field Goals Attempted made by the opponents of T1, not necessarily of a specific team. So, we actually need to rename the columns that are something like "T2_FGAmean" to something like "T1_opponent_FGAmean". # Rename columns for T1 for col in boxscore_cols: season_statistics = season_statistics.withColumnRenamed(col + 'mean', 'T1_' + col[3:] + 'mean') if 'T1_' in col \ else season_statistics.withColumnRenamed(col + 'mean', 'T1_opponent_' + col[3:] + 'mean')At this point, it’s important to mention that the regular_data DataFrame actually has two rows per each match that occurred. This is so that both teams can be "T1" and "T2", for each match. This little "trick" is what makes these statistics useful. Note that we “only” have the statistics for “T1”. We “need” the statistics for “T2” as well — “need” in quotations because there are no new statistics being calculated. We just need the same data, but with the columns having different names, so that for a match with “T1” and “T2”, we have statistics for both T1 and T2. So, we created a mirror DataFrame, where, instead of “T1…mean” and “T1_opponent_…mean”, we have “T2…mean” and “T2_opponent_…mean”. This is important because, later on, when we’re joining these regular season statistics to tournament matches, we’ll be able to have statistics for both team 1 and team 2. season_statistics_T2 = season_statistics.select( *[F.col(col).alias(col.replace('T1_opponent_', 'T2_opponent_').replace('T1_', 'T2_')) if col not in ['Season'] else F.col(col) for col in season_statistics.columns] )Now, there are two DataFrames, with season statistics for “both” T1 and T2. Since the final DataFrame will contain the “Season”, the “T1TeamID” and the “T2TeamID”, we can join these newly created features with a join! tourney_df = tourney_df.join(season_statistics, on=['Season', 'T1_TeamID'], how='left') tourney_df = tourney_df.join(season_statistics_T2, on=['Season', 'T2_TeamID'], how='left')Elo RatingsFirst created by Arpad Elo, Elo is a rating system for zero-sum games (games where one player wins and the other loses), like basketball. With the Elo rating system, each team has an Elo rating, a value that generally conveys the team’s quality. At first, every team has the same Elo, and whenever they win, their Elo increases, and when they lose, their Elo decreases. A key characteristic of this system is that this value increases more with a win against a strong opponent than with a win against a weak opponent. Thus, it can be a very useful feature to have! We wanted to capture the Elo rating of a team at the end of the regular season, and use that as feature for the tournament. To do this, we calculated the Elo for each team on a per match basis. To calculate Elo for this feature, we found it more straightforward to use Pandas. Central to Elo is calculating the expected score for each team. It can be described in code like so: # Function to calculate expected score def expected_score(ra, rb): # ra = rating (Elo) team A # rb = rating (Elo) team B # Elo function return 1 / (1 + 10 ** ((rb - ra) / 400))Considering a team A and a team B, this function computes the expected score of team A against team B. For each match, we would update the teams’ Elos. Note that the location of the match also played a part — winning at home was considered less impressive than winning away. # Function to update Elo ratings, keeping T1 and T2 terminology def update_elo(t1_elo, t2_elo, location, T1_Score, T2_Score): expected_t1 = expected_score(t1_elo, t2_elo) expected_t2 = expected_score(t2_elo, t1_elo) actual_t1 = 1 if T1_Score > T2_Score else 0 actual_t2 = 1 - actual_t1 # Determine K based on game location # The larger the K, the bigger the impact # team1 winning at home (location=1) less impressive than winning away (location = -1) if actual_t1 == 1: # team1 won if location == 1: k = 20 elif location == 0: k = 30 else: # location = -1 k = 40 else: # team2 won if location == 1: k = 40 elif location == 0: k = 30 else: # location = -1 k = 20 new_t1_elo = t1_elo + k * (actual_t1 - expected_t1) new_t2_elo = t2_elo + k * (actual_t2 - expected_t2) return new_t1_elo, new_t2_eloTo apply the Elo rating system, we iterated through each season’s matches, initializing teams with a base rating and updating their ratings match by match. The final Elo available for each team in each season will, hopefully, be a good descriptor of the team’s quality. def calculate_elo_through_seasons(regular_data): # For this feature, using Pandas regular_data = regular_data.toPandas() # Set value of initial elo initial_elo = 1500 # DataFrame to collect final Elo ratings final_elo_list = [] for season in sorted(regular_data['Season'].unique()): print(f"Season: {season}") # Initialize elo ratings dictionary elo_ratings = {} print(f"Processing Season: {season}") # Get the teams that played in the season season_teams = set(regular_data[regular_data['Season'] == season]['T1_TeamID']).union(set(regular_data[regular_data['Season'] == season]['T2_TeamID'])) # Initialize season teams' Elo ratings for team in season_teams: if (season, team) not in elo_ratings: elo_ratings[(season, team)] = initial_elo # Update Elo ratings per game season_games = regular_data[regular_data['Season'] == season] for _, row in season_games.iterrows(): t1_elo = elo_ratings[(season, row['T1_TeamID'])] t2_elo = elo_ratings[(season, row['T2_TeamID'])] new_t1_elo, new_t2_elo = update_elo(t1_elo, t2_elo, row['location'], row['T1_Score'], row['T2_Score']) # Only keep the last season rating elo_ratings[(season, row['T1_TeamID'])] = new_t1_elo elo_ratings[(season, row['T2_TeamID'])] = new_t2_elo # Collect final Elo ratings for the season for team in season_teams: final_elo_list.append({'Season': season, 'TeamID': team, 'Elo': elo_ratings[(season, team)]}) # Convert list to DataFrame final_elo_df = pd.DataFrame(final_elo_list) # Separate DataFrames for T1 and T2 final_elo_t1_df = final_elo_df.copy().rename(columns={'TeamID': 'T1_TeamID', 'Elo': 'T1_Elo'}) final_elo_t2_df = final_elo_df.copy().rename(columns={'TeamID': 'T2_TeamID', 'Elo': 'T2_Elo'}) # Convert the pandas DataFrames back to Spark DataFrames final_elo_t1_df = spark.createDataFrame(final_elo_t1_df) final_elo_t2_df = spark.createDataFrame(final_elo_t2_df) return final_elo_t1_df, final_elo_t2_dfIdeally, we wouldn’t calculate Elo changes on a match-by-match basis to determine each team’s final Elo for the season. However, we couldn’t come up with a better approach. Do you have any ideas? If so, let us know! Value AddedThe feature engineering steps demonstrated show how we can transform raw data — regular season statistics — into valuable information with predictive power. It is reasonable to assume that a team’s performance during the regular season is indicative of its potential performance in the final tournaments. By calculating the mean of observed match-by-match statistics for both the teams and their opponents, along with each team’s Elo rating in their final match, we were able to create a dataset suitable for modelling. Then, models were trained to predict the outcome of tournament matches using these features, among others developed in a similar way. With these models, we only need the two team IDs to look up the mean of their regular season statistics and their Elos to feed into the model and predict a score! ConclusionIn this post, we looked at some of the theory behind Spark and PySpark, how that can be applied, and a concrete practical example. We explored how feature engineering can be done in the case of sports data, creating regular season statistics to use as features for final tournament games. Hopefully you’ve found this interesting and helpful — happy feature engineering! The full source code for this post and others in the series can be found here. Originally published at https://nobledynamic.com on April 8, 2024. References[1] Jeff Sonas, Ryan Holbrook, Addison Howard, Anju Kandru. (2024). March Machine Learning Mania 2024. Kaggle. https://kaggle.com/competitions/march-machine-learning-mania-2024 Feature Engineering with Microsoft Fabric and PySpark was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story. View the full article
  3. For the past few years, Fidelity Investments has been moving a significant percentage of its applications to a cloud-based infrastructure. As part of that transition, Fidelity has consolidated its analytics data into its Enterprise Analytics Platform, which is engineered using the Snowflake Data Cloud, making it easier for teams and departments across the company to access the data they need. Fidelity’s data scientists use the AI/ML-enabled Enterprise Analytics Platform to process a large volume of structured and unstructured data for deeper insights and better decision-making. Historically, the platform was housed in physical servers. In 2020, Fidelity kicked off its digital transformation and established an Enterprise Data Lake (EDL) along with Data Labs. Recently, the company wanted to conduct parallel analytics in the cloud and decided to use Snowpark and Snowpark ML. Fidelity has two main enterprise data architecture guiding principles for its data scientists and data engineers: For data storage, Snowflake is the platform for storing all of the company’s structured and semi-structured analytical data in its Enterprise Data Lake and Data Labs. All of Fidelity’s storage abides by its data security framework and data governance policies, and provides a holistic approach to metadata management. For compute, Fidelity’s principles are to minimize the transfer of data across networks, avoid duplication of data, and process the data in the database — bringing the compute to the data wherever possible. Feature engineering in focus Fidelity creates and transforms features to improve the performance of its ML models. Some common feature engineering techniques include encoding, data scaling and correlation analysis. The company’s data science architecture team was running into computation pain points, especially around feature engineering. Feature engineering is a stage in the data science process — before refinement and after expansion and encoding — where data can be at its peak volume. Pandas DataFrames offer a flexible data structure to manipulate various types of data and apply a wealth of computations. However, the trade-off with Pandas DataFrames is the restriction of memory, including the size of the DataFrame in memory and the expansion of memory due to the space complexity of the computation being applied to the data. This was only exacerbated by the speed of single-node processing, where memory contention and distribution of work had limited resources. The team also considered Spark ML purely for the flexibility of distributed processing, but Spark involves complex configuration and tasks and required maintenance overhead for both the hardware and software. Fidelity wanted to leverage capabilities like parallel processing without the complexity of Spark, so the company turned to Snowpark ML. Benefits of Snowpark ML Snowpark ML includes the Python library and underlying infrastructure for end-to-end ML workflows in Snowflake. Fidelity decided to work with the Snowpark ML Modeling API for feature engineering because of the improved performance and scalability with distributed execution for common sklearn-style preprocessing functions. In addition to being simple to use, it offered a number of additional benefits: All the computation is done within Snowflake, enabling in-database processing. It handles large data volumes and scales both vertically and horizontally. The correlation and preprocessing computation linearly scales the size of data to Snowflake standard warehouse size. Data is not duplicated nor transferred across the network. It leverages extensive RBAC controls, enabling tightly managed security. Lazy evaluation avoids unnecessary computation and data transfer, and improves memory management. Comparing three scenarios The Fidelity team compared Snowpark ML for three different scenarios: MinMax scaling, one-hot encoding and Pearson correlation. MinMax scaling is a critical preprocessing step to get Fidelity’s data ready for modeling. For numerical values, Fidelity wanted to scale its data into a fixed range between zero and one. With Pandas, the performance is fine for small data sets but does not scale to large data sets with thousands or millions of rows. Snowpark ML eliminates all data movement and scales out execution for much better performance. Figure 1. Performance improvement of 77x with Snowpark ML, compared to in-memory processing for MinMax scaling. One-hot encoding is a feature transformation technique for categorical values. With Snowpark ML, the execution is much faster by leveraging the distributed parallel processing for the data transformation and eliminating the data read and write times. Figure 2. Performance improvement of 50x with Snowpark ML, compared to in-memory processing for one-hot encoding. By using Snowpark ML to derive Pearson product moment or Pearson correlation matrix, Fidelity achieved a magnitude of performance improvement by scaling the computation both vertically and horizontally. This is especially useful for use cases with large and wide data sets in which there are, for example, 29 million rows and over 4,000 columns. Figure 3. Performance improvement of 17x with Snowpark ML, compared to in-memory processing for Pearson correlation. Fidelity achieved significant time, performance and cost benefits by bringing the compute closer to the data and increasing the capacity to handle more load. By speeding up computations, the company’s data scientists now iterate on features faster. Those time savings have allowed the team to become more innovative with feature engineering, explore new and different algorithms, and improve model performance. For more details, check out Fidelity’s full presentation on Snowpark ML for feature engineering. Ready to start building models of your own with Snowpark ML? Refer to Snowflake’s developer documentation for technical details, or try it for yourself with our step-by-step quickstart. The post Fidelity Optimizes Feature Engineering With Snowpark ML appeared first on Snowflake. View the full article
  4. Here at Wayfair, our data scientists rely on multiple sources of data to obtain features for model training. An ad hoc approach to feature engineering led to multiple versions of feature definitions, making it challenging to share features between different models. Most of the features were stored and used with minimal oversight on freshness, schema, and data guarantees. As a result, our data scientists frequently encountered discrepancies in model performance between development and production environments, making the feedback loop for retraining cumbersome. The whole process of curating new stable features and developing new model versions often took several months. To address these issues, the Service Intelligence team at Wayfair decided to create a centralized feature engineering system. Our goal was to standardize feature definitions, automate ingestion processes, and simplify maintenance. We worked with Google to adopt different Vertex AI offerings, especially Vertex AI Feature Store and Vertex AI Pipelines. The former provides a centralized repository for organizing, storing, and serving ML features, and the latter helps to automate, monitor, and manage ML workflows. These offerings became the two main components of our feature engineering architecture. On the data side, we developed workflows to streamline the flow of raw features data into BigQuery tables. We created a centralized repository of feature definitions that specify how each feature should be pulled, processed, and stored in the feature store. Using the Vertex AI Feature Store’s API, we automatically create features based on the given definitions. We use GitHub’s PR approval process to enforce governance and track changes. Sample feature definition We set up Vertex AI Pipelines to transform raw data in BigQuery into features in the feature store. These pipelines run SQL queries to extract the data, transform it, and then ingest it into the feature store. The pipelines run on different cadences depending on how frequently the features change, and what level of recency is required by the models that consume them. The pipelines are triggered by Cloud Functions that listen for Pub/Sub messages. These messages are generated both on a static schedule from Cloud Scheduler, and dynamically from other pipelines and processes. Feature Engineering System Diagram The Vertex AI Feature Store enables both training and inference. For training it allows data scientists to export historical feature values via point-in-time lookup to retrain their models. For inference it serves features at low latency to production models that make their predictions in real-time. Furthermore, it ensures consistency between our development and production environments, avoiding training-serving skew. Data scientists are able to confidently iterate on new model versions without worrying about data-related issues. Our new feature engineering system makes it easy for data scientists to share and reuse features, while helping to provide guarantees around offline-online consistency and feature freshness. We are looking forward to adopting the new version of Vertex AI Feature Store that is now in public preview, as it will provide more transparent access to the underlying data and should reduce our cloud costs by allowing us to use BigQuery resources dedicated to our project. The authors would like to thank Duncan Renfrow-Symon and Sandeep Kandekar from Wayfair for their technical contributions and Neela Chaudhari, Kieran Kavanagh, and Brij Dhanda from Google for their support with Google Cloud.
  5. The preprocessing and transformation of raw data into features constitutes a pivotal yet time-intensive phase within the machine learning (ML) process. This holds particularly true when data scientists or data engineers are required to transfer data across diverse platforms for the purpose of carrying out MLOps. In February 2023 we announced the preview of two new capabilities for BigQuery ML: more data preprocessing functions and the ability to export the BigQuery ML TRANSFORM clause as part of the model artifact. Today, these features are going GA and have even more capabilities for optimizing your ML workflow. In this blogpost, we describe how we streamline feature engineering by keeping it close to ML training and serving, with the following new functionalities: More manual preprocessing functions that give the flexibility users need to prepare their data as features for ML while also enabling simplified serving by embedding the preprocessing steps directly in the model. More seamless integration with Vertex AI amplifies this embedded preprocessing by making it fast to host BigQuery ML models on Vertex AI Prediction Endpoints for serverless online predictions that scale to meet your applications demand. Ability to export the BigQuery ML TRANSFORM clause as part of the model artifact which makes the BigQuery ML models portable and can be used in other workflows where the same preprocessing steps are needed. Feature EngineeringThe manual preprocessing functions are big timesavers for setting up your data columns as features for ML. The list of available preprocessing functions now includes: ML.MAX_ABS_SCALER Scale a numerical column to the range [-1, 1] without centering by dividing by the maximum absolute value. ML.ROBUST_SCALER Scale a numerical column by centering with the median (optional) and dividing by the quantile range of choice ([25, 75] by default). ML.NORMALIZER Turn a numerical array into a unit norm array for any p-norm: 0, 1, >1, +inf. The default is 2 resulting in a normalized array where the sum of squares is 1. ML.IMPUTER Replace missing values in a numerical or categorical input with the mean, median or mode (most frequent). ML.ONE_HOT_ENCODER One-hot encodes a categorical input. Also, it optionally does dummy encoding by dropping the most frequent value. It is also possible to limit the size of the encoding by specifying k for the k most frequent categories and/or a lower threshold for the frequency of categories. ML.MULTI_HOT_ENCODER Encode an array of strings with integer values representing categories. It is possible to limit the size of the encoding by specifying k for the k most frequent categories and/or a lower threshold for the frequency of categories. ML.LABEL_ENCODER Encode a categorical input to integer values [0, n categories] where 0 represents NULL and excluded categories. You can exclude categories by specifying k for k most frequent categories and/or a lower threshold for the frequency of categories. Step-by-step examples of all preprocessing functionsThis first tutorial shows how to use each of the preprocessing functions. In the interactive notebook a data sample and multiple uses of each function are used to highlight the operation and options available to adapt these functions to any feature engineering tasks. For example, the task of imputing missing values has different options depending on the data type of the column (string or numeric). The example below (from the interactive notebook) shows each possible way to impute missing value for each data type: code_block[StructValue([(u'code', u"SELECT\r\n num_column,\r\n ML.IMPUTER(num_column, 'mean') OVER() AS num_imputed_mean,\r\n ML.IMPUTER(num_column, 'median') OVER() AS num_imputed_median,\r\n ML.IMPUTER(num_column, 'most_frequent') OVER() AS num_imputed_mode,\r\n string_column,\r\n ML.IMPUTER(string_column, 'most_frequent') OVER() AS string_imputed_mode,\r\n FROM\r\n UNNEST([1, 1, 2, 3, 4, 5, NULL]) AS num_column WITH OFFSET pos1,\r\n UNNEST(['a', 'a', 'b', 'c', 'd', 'e', NULL]) AS string_column WITH OFFSET pos2\r\n WHERE pos1 = pos2\r\n ORDER BY num_column"), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3e259af45a10>)])]The table that follows shows the inputs with missing values highlighted in red and the outputs with imputed values for the different strategies highlighted in green. Visit the notebook linked above for this and more examples of all the preprocessing functions. Training with the TRANSFORM clauseNow, when exporting models with a TRANSFORM clause even more SQL functions are supported for the accompanying exported preprocessing model. Supported SQL functions include: Manual preprocessing functions Operators Conditional expressions Mathematical functions Conversion functions String functions Date, Datetime, Time, and Timestamp functions To host a BigQuery ML trained model on Vertex AI you can bypass the export steps and automatically register the model to the Vertex AI Model Registry during training. Then, when you deploy the model to a Vertex AI Prediction Endpoint for online prediction the TRANSFORM clauses preprocessing is also included in the endpoint for seamless training-serving workflows. This means there is no need to apply preprocessing functions again before getting predictions from the online endpoint! Serving models is also as simple as always within BigQuery ML using the PREDICT function. Step-by-step guide to incorporating manual preprocessing inside the model with the inline TRANSFORM clause:In this tutorial, we will use the bread recipe competition dataset to predict judges rating using linear regression and boosted tree models. Objective: To demonstrate how to preprocess data using the new functions, register the model with Vertex AI Model Registry, and deploy the model for online prediction with Vertex AI Prediction endpoints. Dataset: Each row represents a bread recipe with columns for each ingredient (flour, salt, water, yeast) and procedure (mixing time, mixing speed, cooking temperature, resting time). There are also columns that include judges ratings of the final product from each recipe. Overview of the tutorial: Step 1 shows how to use the TRANSFORM statement while training the model. Step 2 demonstrates how to deploy the model for online prediction using Vertex AI Prediction Endpoints. A final example is given to show how to export the model and access the transform model directly. For the best learning experience, follow this blog post alongside the tutorial notebook. Step 1: Create models using an inline TRANSFORM clauseUsing the BigQuery ML manual preprocessing function highlighted above and additional BigQuery functions to prepare input columns into features within a TRANSFORM clause is very similar to writing SQL. The added benefit of having the preprocessing logic embedded within the trained model is that the preprocessing is incorporated in the prediction routine both within BigQuery with ML.PREDICT and outside of BigQuery, like the Vertex AI Model Registry for deployment to Vertex AI Prediction Endpoints. The query below creates a model to predict judge A’s rating for bread recipes. The TRANSFORM statement uses multiple numerical preprocessing functions to scale columns into features. The values needed for scaling are stored and used at prediction to scale prediction instances as well. The contestant_id column is not particularly helpful for prediction as new seasons will have new contestants but the order of contestants could be helpful if, perhaps, contestants are getting generally better at bread baking. To transform contestants into ordered labels the ML.LABEL_ENCODER function is used. Using columns like season and round as features might not be helpful for predicting future values. A more general indicator of time would be the year and week within the year. Turning the airdate (date on which the episode aired) into features with the EXTRACT function is done directly in the TRANSFORM clause as well. code_block[StructValue([(u'code', u"CREATE OR REPLACE MODEL `statmike-mlops-349915.feature_engineering.bqml_feature_engineering_transform`\r\nTRANSFORM (\r\n JUDGE_A,\r\n ML.LABEL_ENCODER(contestant_id) OVER() as contestant,\r\n EXTRACT(YEAR FROM airdate) as year,\r\n EXTRACT(ISOWEEK FROM airdate) as week,\r\n\r\n ML.MIN_MAX_SCALER(flourAmt) OVER() as scale_flourAmt, \r\n ML.ROBUST_SCALER(saltAmt) OVER() as scale_saltAmt,\r\n ML.MAX_ABS_SCALER(yeastAmt) OVER() as scale_yeastAmt,\r\n ML.STANDARD_SCALER(water1Amt) OVER() as scale_water1Amt,\r\n ML.STANDARD_SCALER(water2Amt) OVER() as scale_water2Amt,\r\n\r\n ML.STANDARD_SCALER(waterTemp) OVER() as scale_waterTemp,\r\n ML.ROBUST_SCALER(bakeTemp) OVER() as scale_bakeTemp,\r\n ML.MIN_MAX_SCALER(ambTemp) OVER() as scale_ambTemp,\r\n ML.MAX_ABS_SCALER(ambHumidity) OVER() as scale_ambHumidity,\r\n\r\n ML.ROBUST_SCALER(mix1Time) OVER() as scale_mix1Time,\r\n ML.ROBUST_SCALER(mix2Time) OVER() as scale_mix2Time,\r\n ML.ROBUST_SCALER(mix1Speed) OVER() as scale_mix1Speed,\r\n ML.ROBUST_SCALER(mix2Speed) OVER() as scale_mix2Speed,\r\n ML.STANDARD_SCALER(proveTime) OVER() as scale_proveTime,\r\n ML.MAX_ABS_SCALER(restTime) OVER() as scale_restTime,\r\n ML.MAX_ABS_SCALER(bakeTime) OVER() as scale_bakeTime\r\n)\r\nOPTIONS (\r\n model_type = 'BOOSTED_TREE_REGRESSOR',\r\n booster_type = 'GBTREE',\r\n num_parallel_tree = 25,\r\n early_stop = TRUE,\r\n min_rel_progress = 0.01,\r\n tree_method = 'HIST',\r\n subsample = 0.85, \r\n input_label_cols = ['JUDGE_A'],\r\n enable_global_explain = TRUE,\r\n data_split_method = 'AUTO_SPLIT',\r\n l1_reg = 10,\r\n l2_reg = 10,\r\n MODEL_REGISTRY = 'VERTEX_AI',\r\n VERTEX_AI_MODEL_ID = 'bqml_bqml_feature_engineering_transform',\r\n VERTEX_AI_MODEL_VERSION_ALIASES = ['run-20230705114026']\r\n ) AS\r\nSELECT *\r\nFROM `statmike-mlops-349915.feature_engineering.bread`"), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3e259af45a50>)])]Note that the model training used options to directly register the model in Vertex AI Model Registry. This bypasses the need to export and subsequently register the model artifacts in the Vertex AI Model Registry while also keeping the two locations connected so that if the model is removed from BigQuery it is also removed from Vertex AI. It also enables a very simple path to online predictions as shown in Step 2 below. In the interactive notebook the resulting model is also used with the many other functions to enable an end-to-end MLOps journey directly in BigQuery: ML.FEATURE_INFO to review summary information for each input feature used to train the model ML.TRAINING_INFO to see details from each training iteration of the model ML.EVALUATE to review model metrics ML.FEATURE_IMPORTANCE to review the feature importance scores from the construction of the boosted tree ML.GLOBAL_EXPLAIN to get aggregated feature attribution for features across the evaluation data ML.EXPLAIN_PREDICT to get prediction and feature attributions for each instance of the input ML.PREDICT to get predictions for input instances Step 2: Serve online predictions with Vertex AI Prediction EndpointsBy using options to register the resulting model in the Vertex AI Model Registry during step 1 the path to online predictions is made very simple. Models in the Vertex AI Model Registry can be deployed to Vertex AI Prediction Endpoints where they can serve predictions from Vertex AI API using any of the client libraries (Python, Java, Node.js), gcloud ai, REST or gRPC. The process can be done directly from the Vertex AI console as shown here and is demonstrated below with the popular Python client for Vertex AI, named google-cloud-aiplatform. Setting up the Python environment to work with the Vertex AI client requires just an import and setting the project and region for resources: code_block[StructValue([(u'code', u'from google.cloud import aiplatform\r\naiplatform.init(project = PROJECT_ID, location = REGION)'), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3e259af45890>)])]Connecting to the model in the Vertex AI Model Registry is done using the model name which was specified in the CREATE MODEL statement with the option VERTEX_AI_MODEL_ID: code_block[StructValue([(u'code', u"vertex_model = aiplatform.Model(model_name = 'bqml_bqml_bqml_feature_engineering_transform')"), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3e259af53d50>)])]Creating a Vertex AI Prediction Endpoints requires just a display_name: code_block[StructValue([(u'code', u'endpoint = aiplatform.Endpoint.create(display_name = "bqml_feature_engineering")'), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3e25989a4f50>)])]The action of deploying the model to the endpoint requires specifying the compute environment with: traffic_percentage: percentage of requests routed to the model machine_type: the compute specification min_replica_count and max_replica_count: the compute environment's minimum and maximum number of machines used in scaling to meet the demand for predictions. code_block[StructValue([(u'code', u"endpoint.deploy(\r\n model = vertex_model,\r\n deployed_model_display_name = vertex_model.display_name,\r\n traffic_percentage = 100,\r\n machine_type = 'n1-standard-2',\r\n min_replica_count = 1,\r\n max_replica_count = 1\r\n)"), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3e2598ad0f90>)])]Request a prediction by sending an input instance with key:value pairs for each feature. Note that the features are the raw features rather than needing to preprocess them into the model features like contestant, year, week and other scaled features: code_block[StructValue([(u'code', u"endpoint.predict(instances = ['contestant_id': 'c_1',\r\n 'airdate': '2003-05-26',\r\n 'flourAmt': 484.28986452656386,\r\n 'saltAmt': 9,\r\n 'yeastAmt': 10,\r\n 'mix1Time': 5,\r\n 'mix1Speed': 3,\r\n 'mix2Time': 5,\r\n 'mix2Speed': 5,\r\n 'water1Amt': 311.66349401065276,\r\n 'water2Amt': 98.61283742264706,\r\n 'waterTemp': 46,\r\n 'proveTime': 105.67304373851782,\r\n 'restTime': 44,\r\n 'bakeTime': 28,\r\n 'bakeTemp': 435.39349280229476,\r\n 'ambTemp': 51.27996072412186,\r\n 'ambHumidity': 61.44333141984406\r\n])"), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3e25989831d0>)])]The response returned is the predicted score from Judge A of a 73.5267944335937 which is also confirmed in the tutorial notebook using the model in BigQuery with ML.PREDICT. Not the best bread, but a great prediction since the actual answer is 75.0! (Optional) Exporting Models With Inline TRANSFORM clauseWhile there is no longer a need to export the model for use in Vertex AI thanks to the direct registration options available during model creation, it can still be very helpful to make BigQuery ML models portable for use elsewhere or in more complex workflows like model co-hosting with deployment resource pools or workflows with multiple models using NVIDIA Triton on Vertex AI Prediction. When exporting BigQuery ML models to GCS the TRANSFORM clause is also exported as a separate model in a subfolder named /transform. This means even the transform model is portable and can be used in other workflows where the same preprocessing steps are needed. If you used BigQuery time or date functions (Date functions, Datetime functions, Time functions and Timestamp functions) then you might wonder how the exported TensorFlow model that represents the TRANSFORM clause handles those data types. We implemented a TensorFlow Custom op that can be easily added to your custom serving environment via the bigquery-ml-utils Python package. To initiate the export to GCS use the BigQuery EXPORT MODEL statement: code_block[StructValue([(u'code', u"EXPORT MODEL `statmike-mlops-349915.feature_engineering.bqml_feature_engineering_transform`\r\n OPTIONS (URI = 'gs://statmike-mlops-349915-us-central1-bqml-exports/bqml/model')"), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3e2598983350>)])]The tutorial notebooks show the folder structure and contents and how to use the TensorFlow SavedModel CLIto review the transform models input and output signature. ConclusionBigQuery ML preprocessing functions give the flexibility users need to prepare their data as features for ML while also enabling simplified serving by embedding the preprocessing steps directly in the model. Creating a seamless integration with Vertex AI amplifies this embedded preprocessing by making it fast to host BigQuery ML models on Vertex AI Prediction Endpoints for serverless online predictions that scale to meet your applications demand. Ultimately making building models easy while making the models useful through simple serving options. In the future you can expect to see even more ways to simplify ML workflows with BigQuery ML while seamlessly integrating with Vertex AI.
  6. Preprocessing and transforming raw data into features is a critical but time consuming step in the ML process. This is especially true when a data scientist or data engineer has to move data across different platforms to do MLOps. In this blogpost, we describe how we streamline this process by adding two feature engineering capabilities in BigQuery ML Our previous blog outlines the data to AI journey with BigQuery ML, highlighting two powerful features that simplify MLOps - data preprocessing functions for feature engineering and the ability to export BigQuery ML TRANSFORM statement as part of the model artifact. In this blog post, we share how to use these features for creating a seamless experience from BigQuery ML to Vertex AI. Data Preprocessing Functions Preprocessing and transforming raw data into features is a critical but time consuming step when operationalizing ML. We recently announced the public preview of advanced feature engineering functions in BigQuery ML. These functions help you impute, normalize or encode data. When this is done inside the database, BigQuery, the entire process becomes easier, faster, and more secure to preprocess data. Here is a list of the new functions we are introducing in this release. The full list of preprocessing functions can be found here. ML.MAX_ABS_SCALER Scale a numerical column to the range [-1, 1] without centering by dividing by the maximum absolute value. ML.ROBUST_SCALER Scale a numerical column by centering with the median (optional) and dividing by the quantile range of choice ([25, 75] by default). ML.NORMALIZER Turn an input numerical array into a unit norm array for any p-norm: 0, 1, >1, +inf. The default is 2 resulting in a normalized array where the sum of squares is 1. ML.IMPUTER Replace missing values in a numerical or categorical input with the mean, median or mode (most frequent). ML.ONE_HOT_ENCODER One-hot encode a categorical input. Also, it optionally does dummy encoding by dropping the most frequent value. It is also possible to limit the size of the encoding by specifying k for the k most frequent categories and/or a lower threshold for the frequency of categories. ML.LABEL_ENCODER Encode a categorical input to integer values [0, n categories] where 0 represents NULL and excluded categories. You can exclude categories by specifying k for k most frequent categories and/or a lower threshold for the frequency of categories. Model Export with TRANSFORM Statement You can now export BigQuery ML models that include a feature TRANSFORM statement. The ability to include TRANSFORM statements makes models more portable when exporting them for online prediction. This capability also works when BigQuery ML models are registered with Vertex AI Model Registry and deployed to Vertex AI Prediction endpoints. More details about exporting models can be found in BigQuery ML Exporting models. These new features are available through the Google Cloud Console, BigQuery API, and client libraries. Step-by-step guide to use the two features In this tutorial, we will use the bread recipe competition dataset to predict judges rating using linear regression and boosted tree models. Objective: To demonstrate how to preprocess data using the new functions, register the model with Vertex AI Model Registry, and deploy the model for online prediction with Vertex AI Prediction endpoints. Dataset: Each row represents a bread recipe with columns for each ingredient (flour, salt, water, yeast) and procedure (mixing time, mixing speed, cooking temperature, resting time). There are also columns that include judges ratings of the final product from each recipe. Overview of the tutorial: Steps 1 and 2 show how to use the TRANSFORM statement. Steps 3 and 4 demonstrate how to manually export and register the models. Steps 5 through 7 show how to deploy a model to Vertex AI Prediction endpoint. For the best learning experience, follow this blog post alongside the tutorial notebook. Step 1: Transform BigQuery columns into ML features with SQL Before training an ML model, exploring the data within columns is essential to identifying the data type, distribution, scale, missing patterns, and extreme values. BigQuery ML enables this exploratory analysis with SQL. With the new preprocessing functions it is now even easier to transform BigQuery columns into ML features with SQL while iterating to find the optimal transformation. For example, when using the ML.MAX_ABS_SCALER function for an input column, each value is divided by the maximum absolute value (10 in the example): code_block [StructValue([(u'code', u'SELECT\r\n input_column,\r\n ML.MAX_ABS_SCALER (input_column) OVER() AS scale_column\r\nFROM\r\n UNNEST([0, -1, 2, -3, 4, -5, 6, -7, 8, -9, 10]) as input_column\r\nORDER BY input_column'), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3ecc7cf7a710>)])] Once the input columns for an ML model are identified and the feature transformations are chosen, it is enticing to apply the transformation and save the output as a view. But this has an impact on our predictions later on because these same transformations will need to be applied before requesting predictions. Step 2 shows how to prevent this separation of processing and model training. Step 2: Iterate through multiple models with inline TRANSFORM functions Building on the preprocessing explorations in Step 1, the chosen transformations are applied inline with model training using the TRANSFORM statement. This interlocks the model iteration with the preprocessing explorations while making any candidate ready for serving with BigQuery or beyond. This means you can immediately try multiple model types without any delayed impact of feature transformations on predictions. In this step, two models, linear regression and boosted tree, are trained side-by-side with identical TRANSFORM statements: Training with linear regression - Model a code_block [StructValue([(u'code', u"CREATE OR REPLACE MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2a`\r\nTRANSFORM (\r\n JUDGE_A,\r\n\r\n ML.MIN_MAX_SCALER(flourAmt) OVER() as scale_flourAmt, \r\n ML.ROBUST_SCALER(saltAmt) OVER() as scale_saltAmt,\r\n ML.MAX_ABS_SCALER(yeastAmt) OVER() as scale_yeastAmt,\r\n ML.STANDARD_SCALER(water1Amt) OVER() as scale_water1Amt,\r\n ML.STANDARD_SCALER(water2Amt) OVER() as scale_water2Amt,\r\n\r\n ML.STANDARD_SCALER(waterTemp) OVER() as scale_waterTemp,\r\n ML.ROBUST_SCALER(bakeTemp) OVER() as scale_bakeTemp,\r\n ML.MIN_MAX_SCALER(ambTemp) OVER() as scale_ambTemp,\r\n ML.MAX_ABS_SCALER(ambHumidity) OVER() as scale_ambHumidity,\r\n\r\n ML.ROBUST_SCALER(mix1Time) OVER() as scale_mix1Time,\r\n ML.ROBUST_SCALER(mix2Time) OVER() as scale_mix2Time,\r\n ML.ROBUST_SCALER(mix1Speed) OVER() as scale_mix1Speed,\r\n ML.ROBUST_SCALER(mix2Speed) OVER() as scale_mix2Speed,\r\n ML.STANDARD_SCALER(proveTime) OVER() as scale_proveTime,\r\n ML.MAX_ABS_SCALER(restTime) OVER() as scale_restTime,\r\n ML.MAX_ABS_SCALER(bakeTime) OVER() as scale_bakeTime\r\n)\r\nOPTIONS (\r\n model_type = 'LINEAR_REG',\r\n input_label_cols = ['JUDGE_A'],\r\n enable_global_explain = TRUE,\r\n data_split_method = 'AUTO_SPLIT',\r\n MODEL_REGISTRY = 'VERTEX_AI',\r\n VERTEX_AI_MODEL_ID = 'bqml_03_feature_engineering_2a',\r\n VERTEX_AI_MODEL_VERSION_ALIASES = ['run-20230112234821']\r\n ) AS\r\nSELECT * EXCEPT(Recipe, JUDGE_B)\r\nFROM `statmike-mlops-349915.feature_engineering.bread`"), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3ecc7cf7add0>)])] Training with boosted tree - Model b code_block [StructValue([(u'code', u"CREATE OR REPLACE MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2b`\r\nTRANSFORM (\r\n JUDGE_A,\r\n\r\n ML.MIN_MAX_SCALER(flourAmt) OVER() as scale_flourAmt, \r\n ML.ROBUST_SCALER(saltAmt) OVER() as scale_saltAmt,\r\n ML.MAX_ABS_SCALER(yeastAmt) OVER() as scale_yeastAmt,\r\n ML.STANDARD_SCALER(water1Amt) OVER() as scale_water1Amt,\r\n ML.STANDARD_SCALER(water2Amt) OVER() as scale_water2Amt,\r\n\r\n ML.STANDARD_SCALER(waterTemp) OVER() as scale_waterTemp,\r\n ML.ROBUST_SCALER(bakeTemp) OVER() as scale_bakeTemp,\r\n ML.MIN_MAX_SCALER(ambTemp) OVER() as scale_ambTemp,\r\n ML.MAX_ABS_SCALER(ambHumidity) OVER() as scale_ambHumidity,\r\n\r\n ML.ROBUST_SCALER(mix1Time) OVER() as scale_mix1Time,\r\n ML.ROBUST_SCALER(mix2Time) OVER() as scale_mix2Time,\r\n ML.ROBUST_SCALER(mix1Speed) OVER() as scale_mix1Speed,\r\n ML.ROBUST_SCALER(mix2Speed) OVER() as scale_mix2Speed,\r\n ML.STANDARD_SCALER(proveTime) OVER() as scale_proveTime,\r\n ML.MAX_ABS_SCALER(restTime) OVER() as scale_restTime,\r\n ML.MAX_ABS_SCALER(bakeTime) OVER() as scale_bakeTime\r\n)\r\nOPTIONS (\r\n model_type = 'BOOSTED_TREE_REGRESSOR',\r\n booster_type = 'GBTREE',\r\n num_parallel_tree = 1,\r\n max_iterations = 30,\r\n early_stop = TRUE,\r\n min_rel_progress = 0.01,\r\n tree_method = 'HIST',\r\n subsample = 0.85, \r\n input_label_cols = ['JUDGE_A'],\r\n enable_global_explain = TRUE,\r\n data_split_method = 'AUTO_SPLIT',\r\n l1_reg = 10,\r\n l2_reg = 10,\r\n MODEL_REGISTRY = 'VERTEX_AI',\r\n VERTEX_AI_MODEL_ID = 'bqml_03_feature_engineering_2b',\r\n VERTEX_AI_MODEL_VERSION_ALIASES = ['run-20230112234926']\r\n ) AS\r\nSELECT * EXCEPT(Recipe, JUDGE_B)\r\nFROM `statmike-mlops-349915.feature_engineering.bread`"), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3ecc7cf7ae90>)])] Identical input columns that have the same preprocessing means you can easily compare the accuracy of the models. Using the BigQuery ML function ML.EVALUATE makes this comparison as simple as a single SQL query that stacks these outcomes with the UNION ALL set operator: code_block [StructValue([(u'code', u"SELECT 'Manual Feature Engineering - 2A' as Approach, mean_squared_error, r2_score\r\nFROM ML.EVALUATE(MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2a`)\r\nUNION ALL\r\nSELECT 'Manual Feature Engineering - 2B' as Approach, mean_squared_error, r2_score\r\nFROM ML.EVALUATE(MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2b`)"), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3ecc7cf7af50>)])] The results of the evaluation comparison show that using the boosted tree model results in a much better model than linear regression with drastically lower mean squared error and higher r2. Both models are ready to serve predictions, but the clear choice is the boosted tree regressor. Once we decide which model to use, you can predict directly within BigQuery ML using the ML.PREDICT function. In the rest of the tutorial, we show how to export the model outside of BigQuery ML and predict using Google Cloud Vertex AI. Using BigQuery Models for Inference Outside of BigQuery Once your model is trained, if you want to do online inference for low latency responses in your application for online prediction, you have to deploy the model outside of BigQuery. The following steps demonstrate how to deploy the models to Vertex AI Prediction endpoints. This can be accomplished in one of two ways: Manually export the model from BigQuery ML and set up a Vertex AI Prediction Endpoint. To do this, you need to do steps 3 and 4 first. Register the model and deploy from Vertex AI Model Registry automatically. The capability is not available yet but will be available in a forthcoming release. Once it’s available steps 3 and 4 can be skipped. Step 3. Manually export models from BigQuery BigQuery ML supports an EXPORT MODEL statement to deploy models outside of BigQuery. A manual export includes two models - a preprocessing model that reflects the TRANSFORM statement and a prediction model. Both models are exported with a single export statement in BigQuery ML. code_block [StructValue([(u'code', u"EXPORT MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2b`\r\n OPTIONS (URI = 'gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model')"), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3ecc7cf80350>)])] The preprocessing model that captures the TRANSFORM statement is exported as a TensorFlow SavedModel file. In this example it is exported to a GCS bucket located at ‘gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model/transform’. The prediction models are saved in portable formats that match the frameworks in which they were trained by BigQuery ML. The linear regression model is exported as a TensorFlow SavedModel and the boosted tree regressor is exported as Booster file (XGBoost). In this example, the boost tree model is exported to a GCS bucket located at ‘gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model’ These export files are in a standard open format of the native model types making them completely portable to be deployed anywhere - they can be deployed to Vertex AI (Steps 4-7 below), on your own infrastructure, or even in edge applications. Steps 4 through 7 show how to register and deploy a model to Vertex AI Prediction endpoint. These steps need to be repeated separately for the preprocessing models and the prediction models. Step 4. Register models to Vertex AI Model Registry To deploy the models in Vertex AI Prediction, they first need to be registered with the Vertex AI Model Registry To do this two inputs are needed - the links to the model files and a URI to a pre-built container. Go to Step 4 in the tutorial to see how exactly it’s done. The registration can be done with the Vertex AI console or programmatically with one of the clients. In the example below, the Python client for Vertex AI is used to register the models like this: code_block [StructValue([(u'code', u'vertex_model = aiplatform.Model.upload(\r\n display_name = \'gcs_03_feature_engineering_2b\',\r\n serving_container_image_uri = \'us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-1:latest\',\r\n artifact_uri = "gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model"\r\n)'), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3ecc7cf80810>)])] Step 5. Create Vertex AI Prediction endpoints Vertex AI includes a service forhosting models for online predictions. To host a model on a Vertex AI Prediction endpoint you first create an endpoint. This can also be done directly from the Vertex AI Model Registry console or programmatically with one of the clients. In the example below, the Python client for Vertex AI is used to create the endpoint like this: code_block [StructValue([(u'code', u'vertex_endpoint = aiplatform.Endpoint.create (\r\n display_name = \u201803_feature_engineering_manual_2b\u2019\r\n)'), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3ecc7cf80f90>)])] Step 6. Deploy models to endpoints Deploying a model from the Vertex AI Model Registry (Step 4) to a Vertex AI Prediction endpoint (Step 5) is done in a single deployment action where the model definition is supplied to the endpoint along with the type of machine to utilize. Vertex AI Prediction endpoints can automatically scale up or down to handle prediction traffic needs by providing the number of replicas to utilize (default is 1 for min and max). In the example below, the Python client for Vertex AI is being used with the deploy method for the endpoint (Step 5) using the models (Step 4): code_block [StructValue([(u'code', u"vertex_endpoint.deploy(\r\n model = vertex_model,\r\n deployed_model_display_name = vertex_model.display_name,\r\n traffic_percentage = 100,\r\n machine_type = 'n1-standard-2',\r\n min_replica_count = 1,\r\n max_replica_count = 1\r\n)"), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3ecc7cf800d0>)])] Step 7. Request predictions from endpoints Once the model is deployed to a Vertex AI Prediction endpoint (Step 6) it can serve predictions. Rows of data, called instances, are passed to the endpoint and results are returned that include the processed information: preprocessing result or prediction. Getting prediction results from Vertex AI Prediction endpoints can be done with any of the Vertex AI API interfaces (REST, gRPC, gcloud, Python, Java, Node.js). Here, the request is demonstrated directly with the predict method of the endpoint (Step 6) using the Python client for Vertex AI as follows: code_block [StructValue([(u'code', u"results = vertex_endpoint.predict(instances = [\r\n{'flourAmt': 511.21695405324624,\r\n 'saltAmt': 9,\r\n 'yeastAmt': 11,\r\n 'mix1Time': 6,\r\n 'mix1Speed': 4,\r\n 'mix2Time': 5,\r\n 'mix2Speed': 4,\r\n 'water1Amt': 338.3989183746999,\r\n 'water2Amt': 105.43955159464981,\r\n 'waterTemp': 48,\r\n 'proveTime': 92.27755071811586,\r\n 'restTime': 43,\r\n 'bakeTime': 29,\r\n 'bakeTemp': 462.14028505497805,\r\n 'ambTemp': 38.20572852497746,\r\n 'ambHumidity': 63.77836403396154}])"), (u'language', u''), (u'caption', <wagtail.wagtailcore.rich_text.RichText object at 0x3ecc7cf80550>)])] The result of an endpoint with a preprocessing model will be identical to applying the TRANSFORM statement from BigQuery ML. The results can then be pipelined to an endpoint with the prediction model to serve predictions that match the results of the ML.PREDICT function in BigQuery ML. The results of both methods, Vertex AI Prediction endpoints and BigQuery ML with ML.PREDICT are shown side-by-side in the tutorial to show that the results of the model are replicated. Now the model can be used for online serving with extremely low latency. This even includes using private endpoints for even lower latency and secure connections with VPC Network Peering. Conclusion With the new preprocessing functions, you can simplify data exploration and feature preprocessing. Further, by embedding preprocessing within model training using the TRANSFORM statement, the serving process is simplified by using prepped models without needing additional steps. In other words, predictions are done right inside BigQuery or alternatively the models can be exported to any location outside of BigQuery such as Vertex AI Prediction for online serving. The tutorial demonstrated how BigQuery ML works with Vertex AI Model Registry and Prediction to create a seamless end-to-end ML experience. In the future you can expect to see more capabilities that bring BigQuery, BigQuery ML and Vertex AI together. Click here to access the tutorial or check out the documentation to learn more about BigQuery ML Thanks to Ian Zhao, Abhinav Khushraj, Yan Sun, Amir Hormati, Mingge Deng and Firat Tekiner from the BigQuery ML team
  7. With a true SQL HTAP database, the Feature Strore streamlines the most time-consuming and expensive task of the data science life cycle and enables ML models to integrate past and present data to predict the most accurate future San Francisco, CA, January 19, 2021 – Splice Machine, the only scale-out SQL database with built-in machine learning, today […] The post Splice Machine Launches The Splice Machine Feature Store To Simplify Feature Engineering And Democratize Machine Learning appeared first on DevOps.com. View the full article
  • Forum Statistics

    67.4k
    Total Topics
    65.3k
    Total Posts
×
×
  • Create New...