Search the Community
Showing results for tags 'data transformation'.
-
Transform Data with Hyperbolic Sine
TDS posted a topic in Databases, Data Engineering & Data Science
Why handling negative values should be a cinchPhoto by Osman Rana on UnsplashMany models are sensitive to outliers, such as linear regression, k-nearest neighbor, and ARIMA. Machine learning algorithms suffer from over-fitting and may not generalize well in the presence of outliers.¹ However, the right transformation can shrink these extreme values and improve your model’s performance. Transformations for data with negative values include: Shifted LogShifted Box-CoxInverse Hyperbolic SineSinh-arcsinhLog and Box-Cox are effective tools when working with positive data, but inverse hyperbolic sine (arcsinh) is much more effective on negative values. Sinh-arcsinh is even more powerful. It has two parameters that can adjust the skew and kurtosis of your data to make it close to normal. These parameters can be derived using gradient descent. See an implementation in python at the end of this post. Shifted LogThe log transformation can be adapted to handle negative values with a shifting term α. Throughout the article, I use log to mean natural log.Visually, this is moving the log’s vertical asymptote from 0 to α. Plot of shifted log transformation with offset of -5, made with Desmos available under CC BY-SA 4.0. Equation text added to image.Forecasting Stock PricesImagine you are a building a model to predict the stock market. Hosenzade and Haratizadeh tackle this problem with a convolutional neural network using a large set of feature variables that I have pulled from UCI Irvine Machine Learning Repository². Below is distribution of the change of volume feature — an important technical indicator for stock market forecasts. made with MatplotlibThe quantile-quantile (QQ) plot reveals heavy right and left tails. The goal of our transformation will be to bring the tails closer to normal (the red line) so that it has no outliers. Using a shift value of -250, I get this log distribution. The right tail looks a little better, but the left tail still shows deviation from the red line. Log works by applying a concave function to the data which skews the data left by compressing the high values and stretching out the low values. The log transformation only makes the right tail lighter.While this works well for positively skewed data, it is less effective for data with negative outliers. made with Desmos available under CC BY-SA 4.0. Text and arrows added to image.In the stock data, skewness is not the issue. The extreme values are on both left and right sides. The kurtosis is high, meaning that both tails are heavy. A simple concave function is not equipped for this situation. Shifted Box-CoxBox-Cox is a generalized version of log, which can also be shifted to include negative values, written as The λ parameter controls the concavity of the transformation allowing it to take on a variety of forms. Box-cox is quadratic when λ = 2. It’s linear when λ = 1, and log as λ approaches 0. This can be verified by using L’Hôpital’s rule. Plot of shifted box-cox transformation with shift -5 and five different values for λ, made with Desmos available under CC BY-SA 4.0. Text added to image.To apply this transformation on our stock price data, I use a shift value -250 and determine λ with scipy's boxcox function. from scipy.stats import boxcox y, lambda_ = boxcox(x - (-250))The resulting transformed data looks like this: Despite the flexibility of this transformation, it fails to reduce the tails on the stock price data. Low values of λ skew the data left, shrinking the right tail. High values of λ skew the data right, shrinking the left tail, but there isn’t any value that can shrink both simultaneously. Inverse Hyperbolic SineThe hyperbolic sine function (sinh) is defined as and its inverse is In this case, the inverse is a more helpful function because it’s approximately log for large x (positive or negative) and linear for small values of x. In effect, this shrinks extremes while keeping the central values, more or less, the same. Arcsinh reduces both positive and negative tails.For positive values, arcsinh is concave, and for negative values, it’s convex. This change in curvature is the secret sauce that allows it to handle positive and negative extreme values simultaneously. plot of inverse hyperbolic sine (arcsinh) compared to a log function, made with Desmos available under CC BY-SA 4.0. Text, arrows, and box shape added to image.Using this transformation on the stock data results in near normal tails. The new data has no outliers! Scale MattersConsider how your data is scaled before it’s passed into arcsinh. For log, your choice of units is irrelevant. Dollars or cents, grams or kilograms, miles or feet —it’s all the same to the log function. The scale of your inputs only shifts the transformed values by a constant value. The same is not true for arcsinh. Values between -1 and 1 are left almost unchanged while large numbers are log-dominated. You may need to play around with different scales and offsets before feeding your data into arcsinh to get a result you are satisfied with. At the end of the article, I implement a gradient descent algorithm in python to estimate these transformation parameters more precisely. Sinh-arcsinhProposed by Jones and Pewsey³, the sinh-arcsinh transformation is Jones and Pewsey do not include the constant 1/δ term at the front. However, I include it here because it makes it easier to show arcsinh as a limiting case.Parameter ε adjusts the skew of the data and δ adjusts the kurtosis³, allowing the transformation to take on many forms. For example, the identity transformation f(x) = x is a special case of sinh-arcsinh when ε = 0 and δ = 1. Arcsinh is a limiting case for ε = 0 and δ approaching zero, as can be seen using L’Hôpital’s rule again. plots of sinh-arcsinh for different values of ε and δ. On the left, ε is fixed at zero, and on the right, δ is fixed at 0.5, made with Desmos available under CC BY-SA 4.0. Text added to image.Scale Still MattersJust like with arcsinh, there are meaningful differences in the results of the sinh-arcsinh transformation based on how your input data is shifted or scaled, meaning there are not two, but four parameters that can be chosen. Parameter EstimationWe’ve seen how data transformations can make the tails of your data more Gaussian. Now, let’s take it to the next level by picking the parameters that maximize the normal log likelihood. Let T(x) be my transformation, and let N(x | μ, σ) be the probability density function for a normal distribution with mean μ and standard deviation σ. Assuming independence, the likelihood of the entire dataset X is where I’ve made use of the Jacobian of the transformation. The log likelihood is where I can drop the absolute value signs because the derivative of the transformation is always positive. Gradient DescentI estimate my parameters with gradient descent, setting the loss function to the negative mean log likelihood. Tensorflow’s GradientTape automatically calculates the partial derivatives with respect to the four parameters of sinh-arcsinh as well as μ and σ from the normal probability density function. Parameters β, δ, and σ are represented in log form to ensure they stay positive. You may want to try a few initializations of the variables in case the algorithm gets stuck at a local minimum. I also recommend normalizing your inputs to mean zero and standard deviation one before running the script for the best performance. import tensorflow as tf import numpy as np @tf.function def log_prob(x, params): # extract parameters alpha, log_beta = params[0], params[1] # rescaling params beta = tf.math.exp(log_beta) epsilon, log_delta = params[2], params[3] # transformation params delta = tf.math.exp(log_delta) mu, log_sigma = params[4], params[5] # normal dist params sigma = tf.math.exp(log_sigma) # rescale x_scaled = (x - alpha)/beta # transformation sinh_arg = epsilon + delta * tf.math.asinh(x_scaled) x_transformed = (1/delta) * tf.math.sinh(sinh_arg) # log jacobian of transformation d_sinh = tf.math.log(tf.math.cosh(sinh_arg)) d_arcsinh = - 0.5*tf.math.log(x_scaled**2 + 1) d_rescaling = - log_beta jacobian = d_sinh + d_arcsinh + d_rescaling # chain rule # normal likelihood z = (x_transformed - mu)/sigma # standardized normal_prob = -0.5*tf.math.log(2*np.pi) - log_sigma -0.5*z**2 return normal_prob + jacobian # Learning rate and number of epochs learning_rate = 0.1 epochs = 1000 # Initialize variables tf.random.set_seed(892) params = tf.Variable(tf.random.normal(shape=(6,), mean=0.0, stddev=1.0), dtype=tf.float32) # Use the Adam optimizer optimizer = tf.optimizers.Adam(learning_rate=learning_rate) # Perform gradient descent for epoch in range(epochs): with tf.GradientTape() as tape: loss = - tf.reduce_mean(log_prob(x_tf, params)) # Compute gradients gradients = tape.gradient(loss, [params]) # Apply gradients to variables optimizer.apply_gradients(zip(gradients, [params])) if (epoch % 100) == 0: print(-loss.numpy()) print(f"Optimal vals: {params}")This optimized approach resulted in a distribution very close to Gaussian — not only the tails, but the mid-section too! ConclusionLog and Box-Cox are powerful transformations when working with positive data, but merely shifting these transformations to include negative values has severe limitations. The arcsinh transformation is much better at handling extreme positive and negative values simultaneously. If you are willing to increase the complexity, the sinh-arcsinh transformation is a more powerful function that generalizes arcsinh. When normality is very important, its parameters can also be derived using gradient descent to match a Gaussian distribution. Arcsinh doesn’t get much attention, but it’s an essential transformation that should be a part of every data engineer’s tool kit. If you’ve found these transformation techniques useful or have any questions about applying them to your own datasets, please share your thoughts and experiences in the comments below. Unless otherwise noted, all images are by the author. Note from Towards Data Science’s editors: While we allow independent authors to publish articles in accordance with our rules and guidelines, we do not endorse each author’s contribution. You should not rely on an author’s works without seeking professional advice. See our Reader Terms for details. [1] Jabbar, H. K., & Khan, R. Z. (2014). Methods to avoid over-fitting and under-fitting in supervised machine learning (Comparative study). [2] CNNpred: CNN-based stock market prediction using a diverse set of variables. (2019). UCI Machine Learning Repository. [3] Jones, M & Pewsey, Arthur. (2009). Sinh-arcsinh distributions: a broad family giving rise to powerful tests of normality and symmetry. Biometrika. Transform Data with Hyperbolic Sine was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story. View the full article -
Data transformation is the process of converting data from one format to another, the “T” in ELT, or extract, load, transform, which enables organizations to get their data analytics-ready and derive insights and value from it. As companies collect more data, from disparate sources and in disparate formats, building and managing transformations has become exponentially more complex and time-consuming. The Snowflake Data Cloud includes powerful capabilities for transforming data and orchestrating data pipelines, and we partner with best-in-class providers to give customers a choice in the data transformation technologies they use. Today, we are excited to announce that Snowflake Ventures is investing in our partner, Coalesce, which offers an intuitive, low-code transformation platform for developing and managing data pipelines. The Coalesce platform is uniquely built for Snowflake. Coalesce allows data teams to build complex transformations quickly and efficiently without deep coding expertise, while still providing all the extensibility the most technical Snowflake users will need. This expands the number of users who can contribute to data projects and enhances collaboration. Coalesce automatically generates Snowflake-native SQL and supports Snowflake data engineering features such as Snowpark, Dynamic Tables, AI/ML capabilities, and more. Our investment helps Coalesce to continue providing first-class experiences for Snowflake users, including integrating closely to take advantage of the latest Data Cloud innovations. Coalesce will also lean into Snowpark Container Services and the Snowflake Native App Framework to provide a seamless user experience. With Snowflake Native Apps, customers can instantly deploy Coalesce on their Snowflake account and transact directly through Snowflake Marketplace. Our goal at Snowflake is to provide developers, data engineers, and other users with optimal choice in the tools they use to prepare and manage data. We will continue to add new transformation capabilities to the Data Cloud and look forward to working with Coalesce to provide the best possible experience for transforming data so organizations can unlock the full potential of their data. The post Snowflake Ventures Invests in Coalesce to Enable Simplified Data Transformation Development and Management Natively on the Data Cloud appeared first on Snowflake. View the full article
-
Editor’s note: Earlier this year, we heard from Gojek, the on-demand services platform, about the open-source data ingestion tool it developed for use with data warehouses like BigQuery. Today, Gojek VP of Engineering Ravi Suhag is back to discuss the open-source data transformation tool it is building. In a recent post, we introduced Firehose, an open source solution by Gojek for ingesting data to cloud data warehouses like Cloud Storage and BigQuery. Today, we take a look at another project within the data transformation and data processing flow. As Indonesia’s largest hyperlocal on-demand services platform, Gojek has diverse data needs across transportation, logistics, food delivery, and payments processing. We also run hundreds of microservices across billions of application events. While Firehose solved our need for smarter data ingestion across different use cases, our data transformation tool, Optimus, ensures the data is ready to be accessed with precision wherever it is needed. The challenges in implementing simplicity At Gojek, we run our data warehousing across a large number of data layers within BigQuery to standardize and model data that’s on its way to being ready for use across our apps and services. Gojek’s data warehouse has thousands of BigQuery tables. More than 100 analytics engineers run nearly 4,000 jobs on a daily basis to transform data across these tables. These transformation jobs process more than 1 petabyte of data every day. Apart from the transformation of data within BigQuery tables, teams also regularly export the cleaned data to other storage locations to unlock features across various apps and services. This process addresses a number of challenges: Complex workflows: The large number of BigQuery tables and hundreds of analytics engineers writing transformation jobs simultaneously creates a huge dependency on very complex database availability groups (DAGs) to be scheduled and processed reliably. Support for different programming languages: Data transformation tools must ensure standardization of inputs and job configurations, but they must also comfortably support the needs of all data users. They cannot, for instance, limit users to only a single programming language. Difficult to use transformation tools: Some transformation tools are hard to use for anyone that’s not a data warehouse engineer. Having easy-to-use tools helps remove bottlenecks and ensure that every data user can produce their own analytical tables. Integrating changes to data governance rules: Decentralizing access to transformation tools requires strict adherence to data governance rules. The transformation tool needs to ensure columns and tables have personally identifiable information (PII) and non-PII data classifications correctly inserted, across a high volume of tables. Time-consuming manual feature updates: New requirements for data extraction and transformation for use in new applications and storage locations are part of Gojek’s operational routine. We need to design a data transformation tool that could be updated and extended with minimal development time and disruption to existing use cases. Enabling reliable data transformation on data warehouses like BigQuery With Optimus, Gojek created an easy-to-use and reliable performance workflow orchestrator for data transformation, data modeling, data pipelines, and data quality management. If you’re using BigQuery as your data warehouse, Optimus makes data transformation more accessible for your analysts and engineers. This is made possible through simple SQL queries and YAML configurations, with Optimus handling many key demands including dependency management, and scheduling data transformation jobs to run at scale. Key features include: Command line interface (CLI): The Optimus command line tool offers effective access to services and job specifications. Users can create, run, and replay jobs, dump a compiled specification for a scheduler, create resource specifications for data stores, add hooks to existing jobs, and more. Optimized scheduling: Optimus offers an easy way to schedule SQL transformation through YAML based configuration. While it recommends Airflow by default, it is extensible enough to support other schedulers that can execute Docker containers. Dependency resolution and dry runs: Optimus parses data transformation queries and builds dependency graphs automatically. Deployment queries are given a dry-run to ensure they pass basic sanity checks. Powerful templating: Users can write complex transformation logic with compile time template options for variables, loops, IF statements, macros, and more. Cross-tenant dependency: With more than two tenants registered, Optimus can resolve cross-tenant dependencies automatically. Built-in hooks: If you need to sink a BigQuery table to Kafka, Optimus can make it happen thanks to hooks for post-transformation logic that extend the functionality of your transformations. Extensibility with plugins: By focusing on the building blocks, Optimus leaves governance for how to execute a transformation to its plugin system. Each plugin features an adapter and a Docker image, and Optimus supports Python transformation for easy custom plugin development. Key advantages of Optimus Like Google Cloud, Gojek is all about flexibility and agility, so we love to see open source software like Optimus helping users take full advantage of multi-tenancy solutions to meet their specific needs. Through a variety of configuration options and a robust CLI, Optimus ensures that data transformation remains fast and focused by preparing SQL correctly. Optimus handles all scheduling, dependencies, and table creation. With the capability to build custom features quickly based on new needs through Optimus plugins, you can explore more possibilities. Errors are also minimized with a configurable alert system that flags job failures immediately. Whether to email or Slack, you can trigger alerts based on specific requirements – from point of failure to warnings based on SLA requirements. How you can contribute With Firehose and Optimus working in tandem with Google Cloud, Gojek is helping pave the way in building tools that enable data users and engineers to achieve fast results in complex data environments. Optimus is developed and maintained at Github and uses Requests for Comments (RFCs) to communicate ideas for its ongoing development. The team is always keen to receive bug reports, feature requests, assistance with documentation, and general discussion as part of its Slack community. Related Article Introducing Firehose: An open source tool from Gojek for seamless data ingestion to BigQuery and Cloud Storage The Firehose open source tool allows Gojek to turbocharge the rate it streams its data into BigQuery and Cloud Storage. Read Article
-
- optimus
- open source
-
(and 1 more)
Tagged with:
-
Forum Statistics
63.6k
Total Topics61.7k
Total Posts