Showing results for tags 'google bigquery'.


Found 11 results

  1. MongoDB is a popular NoSQL database that requires data to be modeled in JSON format. If your application’s data model has a natural fit to MongoDB’s recommended data model, it can provide good performance, flexibility, and scalability for transaction types of workloads. However, due to a few restrictions that you can face while analyzing data, […] View the full article
  2. In today’s data-driven world, data storage and analysis are essential to derive deeper insights for smarter decision-making. As data volumes increase, organizations consider shifting transactional data from Oracle databases on AWS RDS to a powerful platform like Google BigQuery. It can be due to several reasons, which include AWS RDS Oracle’s storage limits, high query […] View the full article
  3. Do you rely heavily on GA4 data for analyzing the metrics of your website engagement? If yes, then you may face problems while collecting all the GA4 data and performing advanced analytics on it. If you want to gain business-critical insights from your GA4 data, then you can’t simply manipulate it. You need to have […] View the full article
  4. Data is one of the most valuable assets, allowing corporations to make insightful decisions to boost their business performance. By efficiently utilizing their on-premise data, companies are transitioning towards an advanced analytical environment to extract more profound insights. AWS Relational Database Service (RDS) is an Amazon data management web service that can help you manage […] View the full article
  5. BigQuery allows you to analyze your data using a range of large language models (LLMs) hosted in Vertex AI, including Gemini 1.0 Pro, Gemini 1.0 Pro Vision and text-bison. These models work well for several tasks such as text summarization and sentiment analysis using only prompt engineering. However, in some scenarios, additional customization via model fine-tuning is needed, such as when the expected behavior of the model is hard to define concisely in a prompt, or when prompts do not produce expected results consistently enough. Fine-tuning also helps the model learn specific response styles (e.g., terse or verbose), learn new behaviors (e.g., answering as a specific persona), or update itself with new information.

Today, we are announcing support for customizing LLMs in BigQuery with supervised fine-tuning. Supervised fine-tuning via BigQuery uses a dataset which has examples of input text (the prompt) and the expected ideal output text (the label), and fine-tunes the model to mimic the behavior or task implied by these examples. Let’s see how this works.

Feature walkthrough

To illustrate model fine-tuning, let’s look at a classification problem using text data. We’ll use a medical transcription dataset and ask our model to classify a given transcript into one of 17 categories, e.g. ‘Allergy / Immunology’, ‘Dentistry’, ‘Cardiovascular / Pulmonary’, etc.

Dataset

Our dataset is from mtsamples.com as provided on Kaggle.
To fine-tune and evaluate our model, we first create an evaluation table and a training table in BigQuery using a subset of this data available in Cloud Storage as follows:

    -- Create an eval table
    LOAD DATA INTO
      bqml_tutorial.medical_transcript_eval
    FROM FILES( format='NEWLINE_DELIMITED_JSON',
      uris = ['gs://cloud-samples-data/vertex-ai/model-evaluation/peft_eval_sample.jsonl'] );

    -- Create a train table
    LOAD DATA INTO
      bqml_tutorial.medical_transcript_train
    FROM FILES( format='NEWLINE_DELIMITED_JSON',
      uris = ['gs://cloud-samples-data/vertex-ai/model-evaluation/peft_train_sample.jsonl'] );

The training and evaluation datasets have an ‘input_text’ column that contains the transcript, and an ‘output_text’ column that contains the label, or ground truth.

Baseline performance of text-bison model

First, let’s establish a performance baseline for the text-bison model. You can create a remote text-bison model in BigQuery using a SQL statement like the one below. For more details on creating a connection and remote models, refer to the documentation.

    -- LOCATION.ConnectionID is a placeholder for your own connection
    CREATE OR REPLACE MODEL
      `bqml_tutorial.text_bison_001` REMOTE
    WITH CONNECTION `LOCATION.ConnectionID`
    OPTIONS (ENDPOINT = 'text-bison@001')

For inference on the model, we first construct a prompt by concatenating the task description for our model and the transcript from the tables we created. We then use the ML.GENERATE_TEXT function to get the output. While the model gets many classifications correct out of the box, it classifies some transcripts erroneously. Here’s a sample response where it classifies incorrectly.
    Prompt

    Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult - History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT - Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology - Oncology, Hospice - Palliative Care, IME-QME-Work Comp etc., Lab Medicine - Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics - Neonatal, Physical Medicine - Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech - Language, Surgery, Urology]. TRANSCRIPT:
    INDICATIONS FOR PROCEDURE:, The patient has presented with atypical type right arm discomfort and neck discomfort. She had noninvasive vascular imaging demonstrating suspected right subclavian stenosis. Of note, there was bidirectional flow in the right vertebral artery, as well as 250 cm per second velocities in the right subclavian. Duplex ultrasound showed at least a 50% stenosis.,APPROACH:, Right common femoral artery.,ANESTHESIA:, IV sedation with cardiac catheterization protocol. Local infiltration with 1% Xylocaine.,COMPLICATIONS:, None.,ESTIMATED BLOOD LOSS:, Less than 10 ml.,ESTIMATED CONTRAST:, Less than 250 ml.,PROCEDURE PERFORMED:, Right brachiocephalic angiography, right subclavian angiography, selective catheterization of the right subclavian, selective aortic arch angiogram, right iliofemoral angiogram, 6 French Angio-Seal placement.,DESCRIPTION OF PROCEDURE:, The patient was brought to the cardiac catheterization lab in the usual fasting state. She was laid supine on the cardiac catheterization table, and the right groin was prepped and draped in the usual sterile fashion. 1% Xylocaine was infiltrated into the right femoral vessels. Next, a #6 French sheath was introduced into the right femoral artery via the modified Seldinger technique.,AORTIC ARCH ANGIOGRAM:, Next, a pigtail catheter was advanced to the aortic arch. Aortic arch angiogram was then performed with injection of 45 ml of contrast, rate of 20 ml per second, maximum pressure 750 PSI in the 4 degree LAO view.,SELECTIVE SUBCLAVIAN ANGIOGRAPHY:, Next, the right subclavian was selectively cannulated. It was injected in the standard AP, as well as the RAO view. Next pull back pressures were measured across the right subclavian stenosis. No significant gradient was measured.,ANGIOGRAPHIC DETAILS:, The right brachiocephalic artery was patent. The proximal portion of the right carotid was patent. The proximal portion of the right subclavian prior to the origin of the vertebral and the internal mammary showed 50% stenosis.,IMPRESSION:,1. Moderate grade stenosis in the right subclavian artery.,2. Patent proximal edge of the right carotid.

    Response
    Radiology

In the above case the correct classification should have been ‘Cardiovascular / Pulmonary’.

Metrics-based evaluation for base model

To perform a more robust evaluation of the model’s performance, you can use BigQuery’s ML.EVALUATE function to compute metrics on how the model responses compare against the ideal responses from a test/eval dataset.
You can do so as follows:

    -- Evaluate base model
    SELECT
      *
    FROM
      ml.evaluate(MODEL bqml_tutorial.text_bison_001,
        (
        SELECT
          CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult - History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT - Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology - Oncology, Hospice - Palliative Care, IME-QME-Work Comp etc., Lab Medicine - Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics - Neonatal, Physical Medicine - Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech - Language, Surgery, Urology]. ", input_text) AS input_text,
          output_text
        FROM
          `bqml_tutorial.medical_transcript_eval` ),
        STRUCT("classification" AS task_type))

In the above code we provided an evaluation table as input and chose ‘classification’ as the task type on which we evaluate the model. We left other inference parameters at their defaults, but they can be modified for the evaluation.

The evaluation metrics that are returned are computed for each class (label). Focusing on the F1 score (the harmonic mean of precision and recall), the results show that model performance varies between classes. For example, the baseline model performs well for ‘Autopsy’, ‘Diets and Nutritions’, and ‘Dentistry’, but performs poorly for the ‘Consult - History and Phy.’, ‘Chiropractic’, and ‘Cardiovascular / Pulmonary’ classes.
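To make the per-class F1 score concrete, here is a minimal Python sketch of how precision, recall and F1 can be computed per label, along with their unweighted ("macro") average. It is an illustration only and is not how ML.EVALUATE is implemented; the function names and the four-row sample are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gets a false positive
            fn[truth] += 1  # true label gets a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        # F1 is the harmonic mean of precision and recall.
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted average of the per-class F1 scores."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical sample: one 'Cardiovascular / Pulmonary' transcript
# is misclassified as 'Radiology'.
y_true = ["Radiology", "Cardiovascular / Pulmonary",
          "Cardiovascular / Pulmonary", "Dentistry"]
y_pred = ["Radiology", "Radiology",
          "Cardiovascular / Pulmonary", "Dentistry"]

scores = per_class_f1(y_true, y_pred)
for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1 = {macro_f1(scores):.2f}")
```

A single misclassification lowers the F1 score of two classes at once: the true label loses recall and the predicted label loses precision.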
Now let’s fine-tune our model and see if we can improve on this baseline performance.

Creating a fine-tuned model

Creating a fine-tuned model in BigQuery is simple. You can perform fine-tuning by specifying training data with ‘prompt’ and ‘label’ columns in the CREATE MODEL statement. We use the same prompt for fine-tuning that we used in the evaluation earlier. Create a fine-tuned model as follows:

    -- Fine-tune a text-bison model
    -- LOCATION.ConnectionID is a placeholder for your own connection
    CREATE OR REPLACE MODEL
      `bqml_tutorial.text_bison_001_medical_transcript_finetuned` REMOTE
    WITH CONNECTION `LOCATION.ConnectionID`
    OPTIONS (endpoint="text-bison@001",
      max_iterations=300,
      data_split_method="no_split") AS
    SELECT
      CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult - History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT - Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology - Oncology, Hospice - Palliative Care, IME-QME-Work Comp etc., Lab Medicine - Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics - Neonatal, Physical Medicine - Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech - Language, Surgery, Urology]. ", input_text) AS prompt,
      output_text AS label
    FROM
      `bqml_tutorial.medical_transcript_train`

The CONNECTION you use to create the fine-tuned model should have (a) Storage Object User and (b) Vertex AI Service Agent roles attached.
In addition, your Compute Engine (GCE) default service account should have Editor access to the project. Refer to the documentation for guidance on working with BigQuery connections.

BigQuery performs model fine-tuning using a technique known as Low-Rank Adaptation (LoRA). LoRA tuning is a parameter-efficient tuning (PET) method that freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture to reduce the number of trainable parameters. The model fine-tuning itself happens on Vertex AI compute, and you have the option to choose GPUs or TPUs as accelerators. You are billed by BigQuery for the data scanned or slots used, as well as by Vertex AI for the Vertex AI resources consumed. The fine-tuning job creates a new model endpoint that represents the learned weights. The Vertex AI inference charges you incur when querying the fine-tuned model are the same as for the baseline model.

This fine-tuning job may take a couple of hours to complete, varying based on training options such as ‘max_iterations’. Once completed, you can find the details of your fine-tuned model in the BigQuery UI, where you will see a different remote endpoint for the fine-tuned model.

[Figure: endpoint for the baseline model vs. a fine-tuned model]

Currently, BigQuery supports fine-tuning of text-bison-001 and text-bison-002 models.
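The LoRA idea described above can be sketched in a few lines of plain Python: a frozen weight matrix W is left untouched, and only two small matrices B and A, whose product has the same shape as W, are trained. This is a toy illustration under stated assumptions, not the tuning code that BigQuery or Vertex AI runs. The dimensions, the rank and the function names are hypothetical, and the initialization (A random, B zero) follows the common LoRA convention.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, b, a, scale=1.0):
    """Return W + scale * (B @ A); W itself is never modified."""
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, rank):
    """Parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical toy sizes; real Transformer layers are far larger.
d_out, d_in, rank = 6, 4, 2
rng = random.Random(0)

W = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]   # trainable
B = [[0.0] * rank for _ in range(d_out)]                               # trainable, zero-initialized

# With B initialized to zero, the adapted layer starts out identical to W.
W_eff = lora_effective_weight(W, B, A)

print("full fine-tuning params:", d_out * d_in)
print("LoRA trainable params:  ", trainable_params(d_out, d_in, rank))
print("4096 x 4096 layer, rank 8:",
      trainable_params(4096, 4096, 8), "vs", 4096 * 4096)
```

For a square layer the saving grows with the layer size: at rank 8, a 4096 x 4096 layer needs 65,536 trainable values instead of roughly 16.8 million.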
Evaluating performance of fine-tuned model

You can now generate predictions from the fine-tuned model using code such as the following:

    SELECT
      ml_generate_text_llm_result,
      label,
      prompt
    FROM
      ml.generate_text(MODEL bqml_tutorial.text_bison_001_medical_transcript_finetuned,
        (
        SELECT
          CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult - History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT - Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology - Oncology, Hospice - Palliative Care, IME-QME-Work Comp etc., Lab Medicine - Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics - Neonatal, Physical Medicine - Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech - Language, Surgery, Urology]. ", input_text) AS prompt,
          output_text AS label
        FROM
          `bqml_tutorial.medical_transcript_eval`
        ),
        STRUCT(TRUE AS flatten_json_output))

Let us look at the response to the sample prompt we evaluated earlier. Using the same prompt, the model now classifies the transcript as ‘Cardiovascular / Pulmonary’, which is the correct response.

Metrics-based evaluation for fine-tuned model

Now, we will compute metrics on the fine-tuned model using the same evaluation data and the same prompt we previously used for evaluating the base model.
    -- Evaluate fine-tuned model
    SELECT
      *
    FROM
      ml.evaluate(MODEL bqml_tutorial.text_bison_001_medical_transcript_finetuned,
        (
        SELECT
          CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult - History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT - Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology - Oncology, Hospice - Palliative Care, IME-QME-Work Comp etc., Lab Medicine - Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics - Neonatal, Physical Medicine - Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech - Language, Surgery, Urology]. ", input_text) AS prompt,
          output_text AS label
        FROM
          `bqml_tutorial.medical_transcript_eval`),
        STRUCT("classification" AS task_type))

Even though the fine-tuning (training) dataset we used for this blog contained only 519 examples, the metrics from the fine-tuned model already show a marked improvement in performance. F1 scores on the labels where the model had performed poorly earlier have improved, with the “macro” F1 score (a simple average of the F1 scores across all labels) jumping from 0.54 to 0.66.

Ready for inference

The fine-tuned model can now be used for inference using the ML.GENERATE_TEXT function, which we used in the previous steps to get the sample responses. You don’t need to manage any additional infrastructure for your fine-tuned model, and you are charged the same inference price as you would have incurred for the base model.
To try fine-tuning for text-bison models in BigQuery, check out the documentation. Have feedback or need fine-tuning support for additional models? Let us know at bqml-feedback@google.com. Special thanks to Tianxiang Gao for his contributions to this blog. View the full article
  6. Navigating the complexities of the data-to-insights journey can be frustrating. Data professionals spend valuable time sifting through data sources, reinventing the wheel with each new question that comes their way. They juggle multiple tools, hop between coding languages, and collaborate with a wide array of teams across their organizations. This fragmented approach is riddled with bottlenecks, preventing analysts from generating insights and doing high-impact work as quickly as they should.

Yesterday at Google Cloud Next ‘24, we introduced BigQuery data canvas, which reimagines how data professionals work with data. This novel user experience helps customers create graphical data workflows that map to their mental model, while AI innovations accelerate finding, preparing, analyzing, visualizing and sharing data and insights. Watch this video for a quick overview of BigQuery data canvas.

BigQuery data canvas: an NL-driven analytics experience

BigQuery data canvas makes data analytics faster and easier with a unified, natural language-driven experience that centralizes data discovery, preparation, querying, and visualization. Rather than toggling between multiple tools, you can now use data canvas to focus on the insights that matter most to your business. Data canvas addresses the challenges of the traditional data analysis workflow in two areas:

  • Natural language-centric experience: Instead of writing code, you can now speak directly to your data. Ask questions, direct tasks, and let the AI guide you through various analytics tasks.
  • Reimagined user experience: Data canvas rethinks the notebook concept. Its expansive canvas workspace fosters iteration and easy collaboration, allowing you to refine your work, chain results, and share workspaces with colleagues.
For example, to analyze a recent marketing campaign with BigQuery data canvas, you could use natural language prompts to discover campaign data sources, integrate them with existing customer data, derive insights, collaborate with teammates and share visual reports with executives, all within a single canvas experience.

[Figure: natural language-based visual workflow with BigQuery data canvas]

Do more with BigQuery data canvas

BigQuery data canvas provides a variety of features that can help analysts accelerate their analytics tasks:

  • Search and discover: Easily find the specific data asset, visualization, table or view that you need to work with, or search for the most relevant data assets. Data canvas works with all data that can be managed with BigQuery, including BigQuery managed storage, BigLake, Google Cloud Storage objects, and BigQuery Omni tables. For example, you could use either of the following inputs to pull data with data canvas:
    Specific table: project_name.dataset_name.table_name
    Search: "customer transaction data" or "projectid:my-project-name winter jacket sales Atlanta"
  • Explore data assets: Review the table schema, review table details, or preview data and compare it side by side.
  • Generate SQL queries: Iterate with NL inputs to generate the exact SQL query you need to accomplish the analytics task at hand. You can also edit the SQL before executing it.
  • Combine results: Define joins with plain language instructions and refine the generated SQL as needed. Use query results as a starting point for further analysis with prompts like "Join this data with our customer demographics on order id."
  • Visualize: Use natural language prompts to easily create and customize charts and graphs to visualize your data, e.g., “create a bar chart with gradient”. Then, seamlessly share your findings by exporting your results to Looker Studio or Google Sheets.
  • Automated insights: Data canvas can interpret query results and chart data and generate automated insights from them. For example, it can look at the query results of sales deal sizes and automatically provide the insight “the median deal size is $73,500.”
  • Share to collaborate: Data analytics projects are often a team effort. You can simply save your canvas and share it with others using a link.

Popular use cases

While BigQuery data canvas can accelerate many analytics tasks, it’s particularly helpful for:

  • Ad hoc analysis: When working on a tight deadline, data canvas makes it easy to pull data from various sources.
  • Exploratory data analysis (EDA): This critical early step in the data analysis process focuses on summarizing the main characteristics of a dataset, often visually. Data canvas helps find data sources and then presents the results visually.
  • Collaboration: Data canvas makes it easy to share an analytics project with multiple people.

What our customers are saying

Companies large and small have been experimenting with BigQuery data canvas for their day-to-day analytics tasks and their feedback has been very positive.

Wunderkind, a performance marketing channel that powers one-to-one customer interactions, has been using BigQuery data canvas across their analytics team for several weeks and is experiencing significant time savings.

“For any sort of investigation or exploratory exercise resulting in multiple queries there really is no replacement [for data canvas]. [It] Saves us so much time and mental capacity!” - Scott Schaen, VP of Data & Analytics, Wunderkind

[Video: How Wunderkind accelerates time to insights with BigQuery data canvas]

Veo, a micro mobility company that operates in 50+ locations across the USA, is seeing immediate benefits from the AI capabilities in data canvas.

“I think it's been great in terms of being able to turn ideas in the form of NL to SQL to derive insights. And the best part is that I can review and edit the query before running it - that’s a very smart and responsible design. It gives me the space to confirm it and ensure accuracy as well as reliability!” - Tim Velasquez, Head of Analytics, Veo

Give BigQuery data canvas a try

To learn more, watch this video and check out the documentation. BigQuery data canvas is launching in preview and will be rolled out to all users starting on April 15th. Submit this form to get early access. For any bugs and feedback, please reach out to the product and engineering team at datacanvas-feedback@google.com. We’re looking forward to hearing how you use the new data canvas! View the full article
  7. The journey of going from data to insights can be fragmented, complex and time consuming. Data teams spend time on repetitive and routine tasks such as ingesting structured and unstructured data, wrangling data in preparation for analysis, and optimizing and maintaining pipelines. They would rather be doing higher-value analysis and insights-led decision making.

At Next ‘23, we introduced Duet AI in BigQuery. This year at Next ‘24, Duet AI in BigQuery becomes Gemini in BigQuery, which provides AI-powered experiences for data preparation, analysis and engineering, as well as intelligent recommendations to enhance user productivity and optimize costs.

"With the new AI-powered assistive features in BigQuery and ease of integrating with other Google Workspace products, our teams can extract valuable insights from data. The natural language-based experiences, low-code data preparation tools, and automatic code generation features streamline high-priority analytics workflows, enhancing the productivity of data practitioners and providing the space to focus on high impact initiatives. Moreover, users with varying skill sets, including our business users, can leverage more accessible data insights to effect beneficial changes, fostering an inclusive data-driven culture within our organization." said Tim Velasquez, Head of Analytics, Veo.

Let’s take a closer look at the new features of Gemini in BigQuery.

Accelerate data preparation with AI

Your business insights are only as good as your data. When you work with large datasets that come from a variety of sources, there are often inconsistent formats, errors, and missing data. As such, cleaning, transforming, and structuring them can be a major hurdle. To simplify data preparation, validation, and enrichment, BigQuery now includes AI-augmented data preparation that helps users cleanse and wrangle their data.
Additionally, we are enabling users to build low-code visual data pipelines, or rebuild legacy pipelines in BigQuery. Once the pipelines are running in production, AI assists with finding and resolving issues such as schema or data drift, significantly reducing the toil associated with maintaining a data pipeline. Because the resulting pipelines run in BigQuery, users also benefit from integrated metadata management, automatic end-to-end data lineage, and capacity management.

[Figure: Gemini in BigQuery provides AI-driven assistance for users to clean and wrangle data]

Kickstart the data-to-insights journey

Most data analysis starts with exploration: finding the right dataset, understanding the data’s structure, identifying key patterns, and identifying the most valuable insights you want to extract. This step can be cumbersome and time-consuming, especially if you are working with a new dataset or if you are new to the team. To address this problem, Gemini in BigQuery provides new semantic search capabilities to help you pinpoint the most relevant tables for your tasks. Leveraging the metadata and profiling information of these tables from Dataplex, Gemini in BigQuery surfaces relevant, executable queries that you can run with just one click. You can learn more about BigQuery data insights here.

[Figure: Gemini in BigQuery suggests executable queries for tables that you can run in a single click]

Reimagine analytics workflows with natural language

To boost user productivity, we’re also rethinking the end-to-end user experience. The new BigQuery data canvas provides a reimagined natural language-based experience for data exploration, curation, wrangling, analysis, and visualization, allowing you to explore and scaffold your data journeys in a graphical workflow that mirrors your mental model.
For example, to analyze a recent marketing campaign, you can use simple natural language prompts to discover campaign data sources, integrate with existing customer data, derive insights, and share visual reports with executives, all within a single experience. Watch this video for a quick overview of BigQuery data canvas.

[Figure: BigQuery data canvas allows you to explore and analyze datasets, and create a customized visualization, all using natural language prompts within the same interface]

Enhance productivity with SQL and Python code assistance

Even advanced users sometimes struggle to remember all the details of SQL or Python syntax, and navigating through numerous tables, columns, and relationships can be daunting. Gemini in BigQuery helps you write and edit SQL or Python code using simple natural language prompts, referencing relevant schemas and metadata. You can also leverage BigQuery’s in-console chat interface to explore tutorials, documentation and best practices for specific tasks using simple prompts such as: “How can I use BigQuery materialized views?”, “How do I ingest JSON data?” and “How can I improve query performance?”

Optimize analytics for performance and speed

With growing data volumes, analytics practitioners, including data administrators, find it increasingly challenging to effectively manage capacity and enhance query performance. We are introducing recommendations that can help continuously improve query performance, minimize errors and optimize your platform costs. With these recommendations, you can identify materialized views that can be created or deleted based on your query patterns, and partition or cluster your tables. Additionally, you can autotune Spark pipelines and troubleshoot failures and performance issues.

Get started

To learn more about Gemini in BigQuery, watch this short overview video, refer to the documentation, and sign up to get early access to the preview features.
If you’re at Next ‘24, join our data and analytics breakout sessions and stop by the demo stations to explore further and see these capabilities in action. Pricing details for Gemini in BigQuery will be shared when it becomes generally available to all customers. View the full article
  8. While working with BigQuery for years, I have observed 5 mistakes that are commonly made, even by experienced data scientists ... View the full article
  9. The preprocessing and transformation of raw data into features constitutes a pivotal yet time-intensive phase within the machine learning (ML) process. This holds particularly true when data scientists or data engineers are required to transfer data across diverse platforms for the purpose of carrying out MLOps. In February 2023 we announced the preview of two new capabilities for BigQuery ML: more data preprocessing functions and the ability to export the BigQuery ML TRANSFORM clause as part of the model artifact. Today, these features are going GA and have even more capabilities for optimizing your ML workflow.

In this blog post, we describe how we streamline feature engineering by keeping it close to ML training and serving, with the following new functionalities:

  • More manual preprocessing functions that give users the flexibility they need to prepare their data as features for ML, while also enabling simplified serving by embedding the preprocessing steps directly in the model.
  • More seamless integration with Vertex AI, which amplifies this embedded preprocessing by making it fast to host BigQuery ML models on Vertex AI Prediction Endpoints for serverless online predictions that scale to meet your application’s demand.
  • The ability to export the BigQuery ML TRANSFORM clause as part of the model artifact, which makes BigQuery ML models portable so they can be used in other workflows where the same preprocessing steps are needed.

Feature Engineering

The manual preprocessing functions are big timesavers for setting up your data columns as features for ML. The list of available preprocessing functions now includes:

  • ML.MAX_ABS_SCALER: Scale a numerical column to the range [-1, 1] without centering by dividing by the maximum absolute value.
  • ML.ROBUST_SCALER: Scale a numerical column by centering with the median (optional) and dividing by the quantile range of choice ([25, 75] by default).
  • ML.NORMALIZER: Turn a numerical array into a unit norm array for any p-norm: 0, 1, >1, +inf. The default is 2, resulting in a normalized array where the sum of squares is 1.
  • ML.IMPUTER: Replace missing values in a numerical or categorical input with the mean, median or mode (most frequent).
  • ML.ONE_HOT_ENCODER: One-hot encode a categorical input. It can optionally do dummy encoding by dropping the most frequent value. It is also possible to limit the size of the encoding by specifying k for the k most frequent categories and/or a lower threshold for the frequency of categories.
  • ML.MULTI_HOT_ENCODER: Encode an array of strings with integer values representing categories. It is possible to limit the size of the encoding by specifying k for the k most frequent categories and/or a lower threshold for the frequency of categories.
  • ML.LABEL_ENCODER: Encode a categorical input to integer values [0, n categories] where 0 represents NULL and excluded categories. You can exclude categories by specifying k for the k most frequent categories and/or a lower threshold for the frequency of categories.

Step-by-step examples of all preprocessing functions

This first tutorial shows how to use each of the preprocessing functions. In the interactive notebook, a data sample and multiple uses of each function highlight the operation and options available to adapt these functions to any feature engineering task. For example, the task of imputing missing values has different options depending on the data type of the column (string or numeric).
The example below (from the interactive notebook) shows each possible way to impute missing values for each data type:

    SELECT
      num_column,
      ML.IMPUTER(num_column, 'mean') OVER() AS num_imputed_mean,
      ML.IMPUTER(num_column, 'median') OVER() AS num_imputed_median,
      ML.IMPUTER(num_column, 'most_frequent') OVER() AS num_imputed_mode,
      string_column,
      ML.IMPUTER(string_column, 'most_frequent') OVER() AS string_imputed_mode
    FROM
      UNNEST([1, 1, 2, 3, 4, 5, NULL]) AS num_column WITH OFFSET pos1,
      UNNEST(['a', 'a', 'b', 'c', 'd', 'e', NULL]) AS string_column WITH OFFSET pos2
    WHERE pos1 = pos2
    ORDER BY num_column

The table that follows shows the inputs with missing values highlighted in red and the outputs with imputed values for the different strategies highlighted in green. Visit the notebook linked above for this and more examples of all the preprocessing functions.

Training with the TRANSFORM clause

Now, when exporting models with a TRANSFORM clause, even more SQL functions are supported for the accompanying exported preprocessing model. Supported SQL functions include:

  • Manual preprocessing functions
  • Operators
  • Conditional expressions
  • Mathematical functions
  • Conversion functions
  • String functions
  • Date, Datetime, Time, and Timestamp functions

To host a BigQuery ML trained model on Vertex AI, you can bypass the export steps and automatically register the model to the Vertex AI Model Registry during training. Then, when you deploy the model to a Vertex AI Prediction Endpoint for online prediction, the TRANSFORM clause's preprocessing is also included in the endpoint for seamless training-serving workflows. This means there is no need to apply preprocessing functions again before getting predictions from the online endpoint!
Serving models is also as simple as always within BigQuery ML using the ML.PREDICT function.

Step-by-step guide to incorporating manual preprocessing inside the model with the inline TRANSFORM clause

In this tutorial, we will use the bread recipe competition dataset to predict judges' ratings using linear regression and boosted tree models.

  • Objective: To demonstrate how to preprocess data using the new functions, register the model with Vertex AI Model Registry, and deploy the model for online prediction with Vertex AI Prediction endpoints.
  • Dataset: Each row represents a bread recipe with columns for each ingredient (flour, salt, water, yeast) and procedure (mixing time, mixing speed, cooking temperature, resting time). There are also columns that include judges' ratings of the final product from each recipe.
  • Overview of the tutorial: Step 1 shows how to use the TRANSFORM statement while training the model. Step 2 demonstrates how to deploy the model for online prediction using Vertex AI Prediction Endpoints. A final example is given to show how to export the model and access the transform model directly.

For the best learning experience, follow this blog post alongside the tutorial notebook.

Step 1: Create models using an inline TRANSFORM clause

Using the BigQuery ML manual preprocessing functions highlighted above, together with additional BigQuery functions, to prepare input columns as features within a TRANSFORM clause is very similar to writing SQL. The added benefit of having the preprocessing logic embedded within the trained model is that the preprocessing is incorporated in the prediction routine both within BigQuery with ML.PREDICT and outside of BigQuery, for example when the model is registered in the Vertex AI Model Registry for deployment to Vertex AI Prediction Endpoints.

The query below creates a model to predict judge A's rating for bread recipes. The TRANSFORM statement uses multiple numerical preprocessing functions to scale columns into features.
The values needed for scaling are stored and used at prediction to scale prediction instances as well. The contestant_id column is not particularly helpful for prediction as new seasons will have new contestants but the order of contestants could be helpful if, perhaps, contestants are getting generally better at bread baking. To transform contestants into ordered labels the ML.LABEL_ENCODER function is used. Using columns like season and round as features might not be helpful for predicting future values. A more general indicator of time would be the year and week within the year. Turning the airdate (date on which the episode aired) into features with the EXTRACT function is done directly in the TRANSFORM clause as well.

    CREATE OR REPLACE MODEL `statmike-mlops-349915.feature_engineering.bqml_feature_engineering_transform`
    TRANSFORM (
      JUDGE_A,
      ML.LABEL_ENCODER(contestant_id) OVER() as contestant,
      EXTRACT(YEAR FROM airdate) as year,
      EXTRACT(ISOWEEK FROM airdate) as week,

      ML.MIN_MAX_SCALER(flourAmt) OVER() as scale_flourAmt,
      ML.ROBUST_SCALER(saltAmt) OVER() as scale_saltAmt,
      ML.MAX_ABS_SCALER(yeastAmt) OVER() as scale_yeastAmt,
      ML.STANDARD_SCALER(water1Amt) OVER() as scale_water1Amt,
      ML.STANDARD_SCALER(water2Amt) OVER() as scale_water2Amt,

      ML.STANDARD_SCALER(waterTemp) OVER() as scale_waterTemp,
      ML.ROBUST_SCALER(bakeTemp) OVER() as scale_bakeTemp,
      ML.MIN_MAX_SCALER(ambTemp) OVER() as scale_ambTemp,
      ML.MAX_ABS_SCALER(ambHumidity) OVER() as scale_ambHumidity,

      ML.ROBUST_SCALER(mix1Time) OVER() as scale_mix1Time,
      ML.ROBUST_SCALER(mix2Time) OVER() as scale_mix2Time,
      ML.ROBUST_SCALER(mix1Speed) OVER() as scale_mix1Speed,
      ML.ROBUST_SCALER(mix2Speed) OVER() as scale_mix2Speed,
      ML.STANDARD_SCALER(proveTime) OVER() as scale_proveTime,
      ML.MAX_ABS_SCALER(restTime) OVER() as scale_restTime,
      ML.MAX_ABS_SCALER(bakeTime) OVER() as scale_bakeTime
    )
    OPTIONS (
      model_type = 'BOOSTED_TREE_REGRESSOR',
      booster_type = 'GBTREE',
      num_parallel_tree = 25,
      early_stop = TRUE,
      min_rel_progress = 0.01,
      tree_method = 'HIST',
      subsample = 0.85,
      input_label_cols = ['JUDGE_A'],
      enable_global_explain = TRUE,
      data_split_method = 'AUTO_SPLIT',
      l1_reg = 10,
      l2_reg = 10,
      MODEL_REGISTRY = 'VERTEX_AI',
      VERTEX_AI_MODEL_ID = 'bqml_bqml_feature_engineering_transform',
      VERTEX_AI_MODEL_VERSION_ALIASES = ['run-20230705114026']
    ) AS
    SELECT *
    FROM `statmike-mlops-349915.feature_engineering.bread`

Note that the model training used options to directly register the model in Vertex AI Model Registry. This bypasses the need to export and subsequently register the model artifacts in the Vertex AI Model Registry while also keeping the two locations connected so that if the model is removed from BigQuery it is also removed from Vertex AI. It also enables a very simple path to online predictions as shown in Step 2 below.
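The idea in Step 1 that scaling values are learned during training and then reused on prediction instances can be illustrated outside BigQuery. The following is a minimal Python sketch, not BigQuery ML's implementation: the class name `FittedStandardScaler`, the helper `date_features`, the sample values, and the use of the population standard deviation are all assumptions made for illustration.

```python
import statistics
from dataclasses import dataclass
from datetime import date


@dataclass
class FittedStandardScaler:
    """Statistics learned once at training time and reused at prediction time."""
    center: float
    scale: float

    @classmethod
    def fit(cls, values):
        # Population standard deviation is an assumption of this sketch.
        return cls(center=statistics.mean(values), scale=statistics.pstdev(values))

    def transform(self, value):
        return (value - self.center) / self.scale


def date_features(airdate):
    # Calendar year and ISO week number, in the spirit of
    # EXTRACT(YEAR FROM airdate) and EXTRACT(ISOWEEK FROM airdate).
    return {"year": airdate.year, "week": airdate.isocalendar()[1]}


training_water_temp = [40.0, 44.0, 46.0, 50.0]  # made-up training values
scaler = FittedStandardScaler.fit(training_water_temp)

# A new raw instance is scaled with the stored statistics, not refitted.
scaled_water_temp = scaler.transform(46.0)
features = date_features(date(2003, 5, 26))
print(scaled_water_temp, features)
```

Keeping the fitted statistics next to the model is what lets a caller send raw values at prediction time; if the scaler were refitted on each request, the same raw input could map to different features.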
In the interactive notebook, the resulting model is also used with many other functions to enable an end-to-end MLOps journey directly in BigQuery:

  • ML.FEATURE_INFO to review summary information for each input feature used to train the model
  • ML.TRAINING_INFO to see details from each training iteration of the model
  • ML.EVALUATE to review model metrics
  • ML.FEATURE_IMPORTANCE to review the feature importance scores from the construction of the boosted tree
  • ML.GLOBAL_EXPLAIN to get aggregated feature attribution for features across the evaluation data
  • ML.EXPLAIN_PREDICT to get prediction and feature attributions for each instance of the input
  • ML.PREDICT to get predictions for input instances

Step 2: Serve online predictions with Vertex AI Prediction Endpoints

By using options to register the resulting model in the Vertex AI Model Registry during Step 1, the path to online predictions is made very simple. Models in the Vertex AI Model Registry can be deployed to Vertex AI Prediction Endpoints, where they can serve predictions from the Vertex AI API using any of the client libraries (Python, Java, Node.js), gcloud ai, REST, or gRPC. The process can be done directly from the Vertex AI console and is demonstrated below with the popular Python client for Vertex AI, named google-cloud-aiplatform.
Setting up the Python environment to work with the Vertex AI client requires just an import and setting the project and region for resources:

    from google.cloud import aiplatform
    aiplatform.init(project = PROJECT_ID, location = REGION)

Connecting to the model in the Vertex AI Model Registry is done using the model name, which was specified in the CREATE MODEL statement with the option VERTEX_AI_MODEL_ID:

    vertex_model = aiplatform.Model(model_name = 'bqml_bqml_feature_engineering_transform')

Creating a Vertex AI Prediction Endpoint requires just a display_name:

    endpoint = aiplatform.Endpoint.create(display_name = "bqml_feature_engineering")

Deploying the model to the endpoint requires specifying the compute environment with:

  • traffic_percentage: percentage of requests routed to the model
  • machine_type: the compute specification
  • min_replica_count and max_replica_count: the compute environment's minimum and maximum number of machines used in scaling to meet the demand for predictions

    endpoint.deploy(
        model = vertex_model,
        deployed_model_display_name = vertex_model.display_name,
        traffic_percentage = 100,
        machine_type = 'n1-standard-2',
        min_replica_count = 1,
        max_replica_count = 1
    )

Request a prediction by sending an input instance with key:value pairs for each feature.
Note that the features are the raw features; there is no need to preprocess them into the model features such as contestant, year, week and the other scaled features:

    endpoint.predict(instances = [{
        'contestant_id': 'c_1',
        'airdate': '2003-05-26',
        'flourAmt': 484.28986452656386,
        'saltAmt': 9,
        'yeastAmt': 10,
        'mix1Time': 5,
        'mix1Speed': 3,
        'mix2Time': 5,
        'mix2Speed': 5,
        'water1Amt': 311.66349401065276,
        'water2Amt': 98.61283742264706,
        'waterTemp': 46,
        'proveTime': 105.67304373851782,
        'restTime': 44,
        'bakeTime': 28,
        'bakeTemp': 435.39349280229476,
        'ambTemp': 51.27996072412186,
        'ambHumidity': 61.44333141984406
    }])

The response returned is the predicted score from Judge A of 73.5267944335937, which is also confirmed in the tutorial notebook using the model in BigQuery with ML.PREDICT. Not the best bread, but a great prediction since the actual answer is 75.0!

(Optional) Exporting Models With Inline TRANSFORM clause

While there is no longer a need to export the model for use in Vertex AI thanks to the direct registration options available during model creation, it can still be very helpful to make BigQuery ML models portable for use elsewhere or in more complex workflows like model co-hosting with deployment resource pools or workflows with multiple models using NVIDIA Triton on Vertex AI Prediction. When exporting BigQuery ML models to GCS, the TRANSFORM clause is also exported as a separate model in a subfolder named /transform. This means even the transform model is portable and can be used in other workflows where the same preprocessing steps are needed.
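One way to picture reusing an exported transform model in another workflow is as a two-stage pipeline: raw instances go through the preprocessing stage, and its output feeds the prediction stage. The sketch below uses plain Python callables as stand-ins; `chain`, `toy_transform`, and `toy_model` are hypothetical names with made-up arithmetic and do not call BigQuery ML or Vertex AI.

```python
from typing import Callable, Dict, List

Instance = Dict[str, float]


def chain(preprocess: Callable[[List[Instance]], List[Instance]],
          predict: Callable[[List[Instance]], List[float]]):
    """Compose a preprocessing stage and a prediction stage into one callable."""
    def serve(raw_instances: List[Instance]) -> List[float]:
        return predict(preprocess(raw_instances))
    return serve


def toy_transform(instances: List[Instance]) -> List[Instance]:
    # Stand-in for the exported transform model: scale one raw column.
    return [{"scale_saltAmt": row["saltAmt"] / 10.0} for row in instances]


def toy_model(instances: List[Instance]) -> List[float]:
    # Stand-in for the prediction model: a made-up linear score.
    return [70.0 + 5.0 * row["scale_saltAmt"] for row in instances]


serve = chain(toy_transform, toy_model)
predictions = serve([{"saltAmt": 9}])
print(predictions)
```

The caller only ever supplies raw columns such as saltAmt; the scaled feature names stay internal to the pipeline, which mirrors how the embedded TRANSFORM clause hides preprocessing from clients.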
If you used BigQuery time or date functions (Date functions, Datetime functions, Time functions and Timestamp functions), you might wonder how the exported TensorFlow model that represents the TRANSFORM clause handles those data types. We implemented a TensorFlow custom op that can be easily added to your custom serving environment via the bigquery-ml-utils Python package.

To initiate the export to GCS, use the BigQuery EXPORT MODEL statement:

    EXPORT MODEL `statmike-mlops-349915.feature_engineering.bqml_feature_engineering_transform`
      OPTIONS (URI = 'gs://statmike-mlops-349915-us-central1-bqml-exports/bqml/model')

The tutorial notebooks show the folder structure and contents, and how to use the TensorFlow SavedModel CLI to review the transform model's input and output signature.

Conclusion

BigQuery ML preprocessing functions give users the flexibility they need to prepare their data as features for ML, while also simplifying serving by embedding the preprocessing steps directly in the model. Seamless integration with Vertex AI amplifies this embedded preprocessing by making it fast to host BigQuery ML models on Vertex AI Prediction Endpoints for serverless online predictions that scale to meet your application's demand. Ultimately, this makes building models easy while making the models useful through simple serving options. In the future you can expect to see even more ways to simplify ML workflows with BigQuery ML while seamlessly integrating with Vertex AI.
  10. Preprocessing and transforming raw data into features is a critical but time-consuming step in the ML process. This is especially true when a data scientist or data engineer has to move data across different platforms to do MLOps. In this blog post, we describe how we streamline this process by adding two feature engineering capabilities in BigQuery ML.

Our previous blog outlines the data to AI journey with BigQuery ML, highlighting two powerful features that simplify MLOps: data preprocessing functions for feature engineering and the ability to export the BigQuery ML TRANSFORM statement as part of the model artifact. In this blog post, we share how to use these features for creating a seamless experience from BigQuery ML to Vertex AI.

Data Preprocessing Functions

Preprocessing and transforming raw data into features is a critical but time-consuming step when operationalizing ML. We recently announced the public preview of advanced feature engineering functions in BigQuery ML. These functions help you impute, normalize or encode data. When this is done inside the database, BigQuery, preprocessing data becomes easier, faster, and more secure.

Here is a list of the new functions we are introducing in this release. The full list of preprocessing functions can be found in the BigQuery ML documentation.

  • ML.MAX_ABS_SCALER: Scale a numerical column to the range [-1, 1] without centering by dividing by the maximum absolute value.
  • ML.ROBUST_SCALER: Scale a numerical column by centering with the median (optional) and dividing by the quantile range of choice ([25, 75] by default).
  • ML.NORMALIZER: Turn an input numerical array into a unit norm array for any p-norm: 0, 1, >1, +inf. The default is 2, resulting in a normalized array where the sum of squares is 1.
  • ML.IMPUTER: Replace missing values in a numerical or categorical input with the mean, median or mode (most frequent).
  • ML.ONE_HOT_ENCODER: One-hot encode a categorical input. It optionally does dummy encoding by dropping the most frequent value. It is also possible to limit the size of the encoding by specifying k for the k most frequent categories and/or a lower threshold for the frequency of categories.
  • ML.LABEL_ENCODER: Encode a categorical input to integer values [0, n categories] where 0 represents NULL and excluded categories. You can exclude categories by specifying k for the k most frequent categories and/or a lower threshold for the frequency of categories.

Model Export with TRANSFORM Statement

You can now export BigQuery ML models that include a feature TRANSFORM statement. The ability to include TRANSFORM statements makes models more portable when exporting them for online prediction. This capability also works when BigQuery ML models are registered with Vertex AI Model Registry and deployed to Vertex AI Prediction endpoints. More details about exporting models can be found in the BigQuery ML documentation on exporting models.

These new features are available through the Google Cloud Console, BigQuery API, and client libraries.

Step-by-step guide to use the two features

In this tutorial, we will use the bread recipe competition dataset to predict judges' ratings using linear regression and boosted tree models.

  • Objective: To demonstrate how to preprocess data using the new functions, register the model with Vertex AI Model Registry, and deploy the model for online prediction with Vertex AI Prediction endpoints.
  • Dataset: Each row represents a bread recipe with columns for each ingredient (flour, salt, water, yeast) and procedure (mixing time, mixing speed, cooking temperature, resting time). There are also columns that include judges' ratings of the final product from each recipe.
  • Overview of the tutorial: Steps 1 and 2 show how to use the TRANSFORM statement. Steps 3 and 4 demonstrate how to manually export and register the models. Steps 5 through 7 show how to deploy a model to a Vertex AI Prediction endpoint.
For the best learning experience, follow this blog post alongside the tutorial notebook.

Step 1: Transform BigQuery columns into ML features with SQL

Before training an ML model, exploring the data within columns is essential to identifying the data type, distribution, scale, missing patterns, and extreme values. BigQuery ML enables this exploratory analysis with SQL. With the new preprocessing functions it is now even easier to transform BigQuery columns into ML features with SQL while iterating to find the optimal transformation. For example, when using the ML.MAX_ABS_SCALER function for an input column, each value is divided by the maximum absolute value (10 in the example):

    SELECT
      input_column,
      ML.MAX_ABS_SCALER(input_column) OVER() AS scale_column
    FROM
      UNNEST([0, -1, 2, -3, 4, -5, 6, -7, 8, -9, 10]) as input_column
    ORDER BY input_column

Once the input columns for an ML model are identified and the feature transformations are chosen, it is tempting to apply the transformation and save the output as a view. But this has an impact on predictions later on, because these same transformations will need to be applied before requesting predictions. Step 2 shows how to prevent this separation of processing and model training.

Step 2: Iterate through multiple models with inline TRANSFORM functions

Building on the preprocessing explorations in Step 1, the chosen transformations are applied inline with model training using the TRANSFORM statement. This interlocks the model iteration with the preprocessing explorations while making any candidate ready for serving with BigQuery or beyond. This means you can immediately try multiple model types without any delayed impact of feature transformations on predictions.
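To make the Step 1 arithmetic concrete, the max-abs scaling described above can be reproduced in a few lines of Python. This is only an illustrative sketch of the formula (divide by the maximum absolute value), not the BigQuery ML function itself; the name `max_abs_scale` and the handling of an all-zero column are assumptions.

```python
def max_abs_scale(values):
    """Divide each value by the largest absolute value in the column."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        # Assumption of this sketch: an all-zero column is returned unchanged.
        return [float(v) for v in values]
    return [v / max_abs for v in values]


input_column = [0, -1, 2, -3, 4, -5, 6, -7, 8, -9, 10]
scale_column = max_abs_scale(input_column)

# Print raw and scaled values side by side, ordered by the raw value.
for raw, scaled in sorted(zip(input_column, scale_column)):
    print(raw, scaled)
```

With the sample column from the query, the maximum absolute value is 10, so every scaled value falls between -0.9 and 1.0 and the sign of each input is preserved.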
In this step, two models, linear regression and boosted tree, are trained side-by-side with identical TRANSFORM statements:

Training with linear regression - Model a

    CREATE OR REPLACE MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2a`
    TRANSFORM (
      JUDGE_A,

      ML.MIN_MAX_SCALER(flourAmt) OVER() as scale_flourAmt,
      ML.ROBUST_SCALER(saltAmt) OVER() as scale_saltAmt,
      ML.MAX_ABS_SCALER(yeastAmt) OVER() as scale_yeastAmt,
      ML.STANDARD_SCALER(water1Amt) OVER() as scale_water1Amt,
      ML.STANDARD_SCALER(water2Amt) OVER() as scale_water2Amt,

      ML.STANDARD_SCALER(waterTemp) OVER() as scale_waterTemp,
      ML.ROBUST_SCALER(bakeTemp) OVER() as scale_bakeTemp,
      ML.MIN_MAX_SCALER(ambTemp) OVER() as scale_ambTemp,
      ML.MAX_ABS_SCALER(ambHumidity) OVER() as scale_ambHumidity,

      ML.ROBUST_SCALER(mix1Time) OVER() as scale_mix1Time,
      ML.ROBUST_SCALER(mix2Time) OVER() as scale_mix2Time,
      ML.ROBUST_SCALER(mix1Speed) OVER() as scale_mix1Speed,
      ML.ROBUST_SCALER(mix2Speed) OVER() as scale_mix2Speed,
      ML.STANDARD_SCALER(proveTime) OVER() as scale_proveTime,
      ML.MAX_ABS_SCALER(restTime) OVER() as scale_restTime,
      ML.MAX_ABS_SCALER(bakeTime) OVER() as scale_bakeTime
    )
    OPTIONS (
      model_type = 'LINEAR_REG',
      input_label_cols = ['JUDGE_A'],
      enable_global_explain = TRUE,
      data_split_method = 'AUTO_SPLIT',
      MODEL_REGISTRY = 'VERTEX_AI',
      VERTEX_AI_MODEL_ID = 'bqml_03_feature_engineering_2a',
      VERTEX_AI_MODEL_VERSION_ALIASES = ['run-20230112234821']
    ) AS
    SELECT * EXCEPT(Recipe, JUDGE_B)
    FROM `statmike-mlops-349915.feature_engineering.bread`

Training with boosted tree - Model b

    CREATE OR REPLACE MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2b`
    TRANSFORM (
      JUDGE_A,

      ML.MIN_MAX_SCALER(flourAmt) OVER() as scale_flourAmt,
      ML.ROBUST_SCALER(saltAmt) OVER() as scale_saltAmt,
      ML.MAX_ABS_SCALER(yeastAmt) OVER() as scale_yeastAmt,
      ML.STANDARD_SCALER(water1Amt) OVER() as scale_water1Amt,
      ML.STANDARD_SCALER(water2Amt) OVER() as scale_water2Amt,

      ML.STANDARD_SCALER(waterTemp) OVER() as scale_waterTemp,
      ML.ROBUST_SCALER(bakeTemp) OVER() as scale_bakeTemp,
      ML.MIN_MAX_SCALER(ambTemp) OVER() as scale_ambTemp,
      ML.MAX_ABS_SCALER(ambHumidity) OVER() as scale_ambHumidity,

      ML.ROBUST_SCALER(mix1Time) OVER() as scale_mix1Time,
      ML.ROBUST_SCALER(mix2Time) OVER() as scale_mix2Time,
      ML.ROBUST_SCALER(mix1Speed) OVER() as scale_mix1Speed,
      ML.ROBUST_SCALER(mix2Speed) OVER() as scale_mix2Speed,
      ML.STANDARD_SCALER(proveTime) OVER() as scale_proveTime,
      ML.MAX_ABS_SCALER(restTime) OVER() as scale_restTime,
      ML.MAX_ABS_SCALER(bakeTime) OVER() as scale_bakeTime
    )
    OPTIONS (
      model_type = 'BOOSTED_TREE_REGRESSOR',
      booster_type = 'GBTREE',
      num_parallel_tree = 1,
      max_iterations = 30,
      early_stop = TRUE,
      min_rel_progress = 0.01,
      tree_method = 'HIST',
      subsample = 0.85,
      input_label_cols = ['JUDGE_A'],
      enable_global_explain = TRUE,
      data_split_method = 'AUTO_SPLIT',
      l1_reg = 10,
      l2_reg = 10,
      MODEL_REGISTRY = 'VERTEX_AI',
      VERTEX_AI_MODEL_ID = 'bqml_03_feature_engineering_2b',
      VERTEX_AI_MODEL_VERSION_ALIASES = ['run-20230112234926']
    ) AS
    SELECT * EXCEPT(Recipe, JUDGE_B)
    FROM `statmike-mlops-349915.feature_engineering.bread`

Identical input columns with the same preprocessing mean you can easily compare the accuracy of the models.
Using the BigQuery ML function ML.EVALUATE makes this comparison as simple as a single SQL query that stacks these outcomes with the UNION ALL set operator:

    SELECT 'Manual Feature Engineering - 2A' as Approach, mean_squared_error, r2_score
    FROM ML.EVALUATE(MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2a`)
    UNION ALL
    SELECT 'Manual Feature Engineering - 2B' as Approach, mean_squared_error, r2_score
    FROM ML.EVALUATE(MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2b`)

The results of the evaluation comparison show that using the boosted tree model results in a much better model than linear regression, with drastically lower mean squared error and higher r2. Both models are ready to serve predictions, but the clear choice is the boosted tree regressor. Once you decide which model to use, you can predict directly within BigQuery ML using the ML.PREDICT function. In the rest of the tutorial, we show how to export the model outside of BigQuery ML and predict using Google Cloud Vertex AI.

Using BigQuery Models for Inference Outside of BigQuery

Once your model is trained, if you want low-latency online inference in your application, you have to deploy the model outside of BigQuery. The following steps demonstrate how to deploy the models to Vertex AI Prediction endpoints. This can be accomplished in one of two ways:

  • Manually export the model from BigQuery ML and set up a Vertex AI Prediction Endpoint. To do this, you need to do steps 3 and 4 first.
  • Register the model and deploy from Vertex AI Model Registry automatically. The capability is not available yet but will be available in a forthcoming release. Once it's available, steps 3 and 4 can be skipped.

Step 3. Manually export models from BigQuery

BigQuery ML supports an EXPORT MODEL statement to deploy models outside of BigQuery. A manual export includes two models: a preprocessing model that reflects the TRANSFORM statement and a prediction model. Both models are exported with a single export statement in BigQuery ML.

    EXPORT MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2b`
      OPTIONS (URI = 'gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model')

The preprocessing model that captures the TRANSFORM statement is exported as a TensorFlow SavedModel file. In this example it is exported to a GCS bucket located at 'gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model/transform'.

The prediction models are saved in portable formats that match the frameworks in which they were trained by BigQuery ML. The linear regression model is exported as a TensorFlow SavedModel and the boosted tree regressor is exported as a Booster file (XGBoost). In this example, the boosted tree model is exported to a GCS bucket located at 'gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model'.

These export files are in a standard open format of the native model types, making them completely portable: they can be deployed to Vertex AI (Steps 4-7 below), on your own infrastructure, or even in edge applications.

Steps 4 through 7 show how to register and deploy a model to a Vertex AI Prediction endpoint. These steps need to be repeated separately for the preprocessing models and the prediction models.

Step 4. Register models to Vertex AI Model Registry

To deploy the models in Vertex AI Prediction, they first need to be registered with the Vertex AI Model Registry. To do this, two inputs are needed: the links to the model files and a URI to a pre-built container.
Go to Step 4 in the tutorial to see exactly how it's done. The registration can be done with the Vertex AI console or programmatically with one of the clients. In the example below, the Python client for Vertex AI is used to register the models like this:

    vertex_model = aiplatform.Model.upload(
        display_name = 'gcs_03_feature_engineering_2b',
        serving_container_image_uri = 'us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-1:latest',
        artifact_uri = "gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model"
    )

Step 5. Create Vertex AI Prediction endpoints

Vertex AI includes a service for hosting models for online predictions. To host a model on a Vertex AI Prediction endpoint you first create an endpoint. This can also be done directly from the Vertex AI Model Registry console or programmatically with one of the clients. In the example below, the Python client for Vertex AI is used to create the endpoint like this:

    vertex_endpoint = aiplatform.Endpoint.create(
        display_name = '03_feature_engineering_manual_2b'
    )

Step 6. Deploy models to endpoints

Deploying a model from the Vertex AI Model Registry (Step 4) to a Vertex AI Prediction endpoint (Step 5) is done in a single deployment action where the model definition is supplied to the endpoint along with the type of machine to utilize. Vertex AI Prediction endpoints can automatically scale up or down to handle prediction traffic needs by providing the number of replicas to utilize (default is 1 for min and max).
In the example below, the Python client for Vertex AI is being used with the deploy method for the endpoint (Step 5) using the models (Step 4):

    vertex_endpoint.deploy(
        model = vertex_model,
        deployed_model_display_name = vertex_model.display_name,
        traffic_percentage = 100,
        machine_type = 'n1-standard-2',
        min_replica_count = 1,
        max_replica_count = 1
    )

Step 7. Request predictions from endpoints

Once the model is deployed to a Vertex AI Prediction endpoint (Step 6) it can serve predictions. Rows of data, called instances, are passed to the endpoint and results are returned that include the processed information: preprocessing result or prediction. Getting prediction results from Vertex AI Prediction endpoints can be done with any of the Vertex AI API interfaces (REST, gRPC, gcloud, Python, Java, Node.js). Here, the request is demonstrated directly with the predict method of the endpoint (Step 6) using the Python client for Vertex AI as follows:

    results = vertex_endpoint.predict(instances = [{
        'flourAmt': 511.21695405324624,
        'saltAmt': 9,
        'yeastAmt': 11,
        'mix1Time': 6,
        'mix1Speed': 4,
        'mix2Time': 5,
        'mix2Speed': 4,
        'water1Amt': 338.3989183746999,
        'water2Amt': 105.43955159464981,
        'waterTemp': 48,
        'proveTime': 92.27755071811586,
        'restTime': 43,
        'bakeTime': 29,
        'bakeTemp': 462.14028505497805,
        'ambTemp': 38.20572852497746,
        'ambHumidity': 63.77836403396154
    }])

The result of an endpoint with a preprocessing model will be identical to applying the TRANSFORM statement from BigQuery ML.
The results can then be pipelined to an endpoint with the prediction model to serve predictions that match the results of the ML.PREDICT function in BigQuery ML. The results of both methods, Vertex AI Prediction endpoints and BigQuery ML with ML.PREDICT, are shown side by side in the tutorial to show that the results of the model are replicated. Now the model can be used for online serving with extremely low latency. This includes the option of using private endpoints for even lower latency and secure connections with VPC Network Peering.

Conclusion

With the new preprocessing functions, you can simplify data exploration and feature preprocessing. Further, by embedding preprocessing within model training using the TRANSFORM statement, the serving process is simplified by using prepped models without needing additional steps. In other words, predictions are done right inside BigQuery, or alternatively the models can be exported to any location outside of BigQuery, such as Vertex AI Prediction for online serving. The tutorial demonstrated how BigQuery ML works with Vertex AI Model Registry and Prediction to create a seamless end-to-end ML experience. In the future you can expect to see more capabilities that bring BigQuery, BigQuery ML and Vertex AI together.

Click here to access the tutorial or check out the documentation to learn more about BigQuery ML.

Thanks to Ian Zhao, Abhinav Khushraj, Yan Sun, Amir Hormati, Mingge Deng and Firat Tekiner from the BigQuery ML team.
  11. Editor’s note: Here we take a look at how Branch, a fintech startup, built their data platform with BigQuery and other Google Cloud solutions that democratized data for their analysts and scientists.

As a startup in the fintech sector, Branch helps redefine the future of work by building innovative, simple-to-use tech solutions. We’re an employer payments platform, helping businesses provide faster pay and fee-free digital banking to their employees. As head of the Behavioral and Data Science team, I was tapped last year to build out Branch’s team and data platform. I brought my enthusiasm for Google Cloud and its easy-to-use solutions to the first day on the job.

We chose Google Cloud for ease-of-use, data & savings

I had worked with Google Cloud previously, and one of the primary mandates from our CTO was “Google Cloud-first,” with the larger goal of simplifying unnecessary complexity in the system architecture and controlling the costs associated with being on multiple cloud platforms. From the start, Google Cloud’s suite of solutions supported my vision of how to design a data team. There’s no one-size-fits-all approach. It starts with asking questions: what does Branch need? Which stage are we at? Will we be distributed or centralized? But above all, what parameters in the product will need to be optimized with analytics and data science approaches?

With team design, product parameterization is critical. With a product-driven company, the data science team can be most effective by tuning a product’s parameters. For example, a recommendation engine for an ecommerce site is driven by algorithms and underlying models that are updating parameters. In “Show X to this type of person but Y to this type of person,” X and Y are the parameters optimized by modeling behavioral patterns. Data scientists behind the scenes can model how that engine should work and determine which changes are needed.
By focusing on tuning parameters, the team is designed around determining and optimizing an objective function. That of course relies heavily on the data behind it. How do we label the outcome variable? Is a whole labeling service required? Is it clean data with a pipeline that won’t require a lot of engineering work? What data augmentation will be needed?

With that data science team design envisioned, I started by focusing on user behavior: deciding how to monitor and track it, how to partner with the product team to ensure it’s in line with the product objectives, then spinning up A/B testing and monitoring.

On the optimization side, transaction monitoring is critical in fintech. We need to look for low-probability events and abnormal patterns in the data, and then take action, either reaching out to the user as quickly as possible to inform them, or stopping the transaction directly. In the design phase, we need to determine if these actions need to be done in real time or after the fact. Is it useful to the user to have that information in real time? For example, if we are working to encourage engagement, and we miss an event or an interaction, it’s not the end of the world. It’s different with a fraud monitoring system, for which you’ve got to be much more strict about real-time notifications.

Our data infrastructure

There are many use cases at Branch for data cloud technologies from Google Cloud. One is with “basic” data work. It’s been incredibly easy to use BigQuery, Google’s serverless data warehouse, which is where we’ve replicated all of our SQL databases, and Cloud Scheduler, the fully managed enterprise-grade cron job scheduler. These two tools, working together, make it easy to organize data pipelining. And because of their deep integration, they play well with other Google Cloud solutions like Cloud Composer and Dataform, as well as with services, like Airflow, from other providers.
Especially for us as a startup, the whole Google Cloud suite of products accelerates the process of getting established and up and running, so we can perform the “bread-and-butter” work of data science. We also use BigQuery as a holder of heavier stats, and we train our models there, weekly, monthly, or nightly, depending on how much data we collect. Then we leverage the messaging and ingestion tool Pub/Sub and its event systems to get the response in real time. We evaluate the output for that model in a Dataproc cluster or Dataform, and run all of that in Python notebooks, which can call out to BigQuery to train a model, or get evaluated and pass the event system through.

Full integration of data solutions

At the next level, you need to push data out to your internal teams. We are growing and evolving, so I looked for ways to save on costs during this transition. We do a heavy amount of work in Google Sheets because it integrates well with other Google services, getting data and visuals out to the people who need them and enabling them to access raw data and refresh as needed.

Google Groups also makes it easy to restrict access to data tables, which is a vital concern in the fintech space. The infrastructure management and integration of Google Groups make it super useful. If an employee departs the organization, we can easily delete or control their level of access. We can add new employees to a group that has a certain level of rights, or read and write access to the underlying databases. As we grow with Google Cloud, I also envision being able to track the user levels, including who’s running which SQL queries and who’s straining the database and raising our costs.

A streamlined data science team saves costs

I’d estimate that Google Cloud’s solutions have saved us the equivalent of one full-time engineer we’d otherwise need to hire to link the various tools together, making sure that they are functional and adding more monitoring.
Because of the fully managed features of many of Google Cloud’s products, that work is done for us, and we can focus on expanding our customer products. We’re now 100% Google Cloud for all production systems, having consolidated from IBM, AWS, and other cloud point solutions.

For example, Branch is now expanding financial wellness offerings for our customers to encourage better financial behavior through transaction monitoring, forecasting their spend and deposits, and notifying them of risks or anomalies. With those products and others, we’ll be using and benefiting from the speed, scalability, and ease of use of Google Cloud solutions, where they always keep data, and data teams, top of mind.

Learn more about Branch. Curious about other use cases for BigQuery? Read how retailers can use BigQuery ML to create demand forecasting models.