Showing results for tags 'bigquery'.

Found 18 results

  1. As analytics in your company graduates from MySQL/PostgreSQL/SQL Server, a pertinent question you need to answer is which data warehouse is best suited for you. This blog compares Redshift and BigQuery, two of the most popular cloud data warehouses today. In this post, we are going to talk about the two most […] View the full article
  2. Embeddings represent real-world objects, like entities, text, images, or videos, as an array of numbers (a.k.a. vectors) that machine learning models can easily process. Embeddings are the building blocks of many ML applications such as semantic search, recommendations, clustering, outlier detection, named entity extraction, and more. Last year, we introduced support for text embeddings in BigQuery, allowing machine learning models to understand real-world data domains more effectively, and earlier this year we introduced vector search, which lets you index and work with billions of embeddings and build generative AI applications on BigQuery.

At Next ’24, we announced further enhancement of embedding generation capabilities in BigQuery with support for:

  • Multimodal embeddings generation in BigQuery via Vertex AI’s multimodalembedding model, which lets you embed text and image data in the same semantic space
  • Embedding generation for structured data using PCA, Autoencoder or Matrix Factorization models that you train on your data in BigQuery

Multimodal embeddings

Multimodal embedding generates embedding vectors for text and image data in the same semantic space (vectors of items similar in meaning are closer together), and the generated embeddings have the same dimensionality (text and image embeddings are the same size). This enables a rich array of use cases such as embedding and indexing your images and then searching for them via text. You can start using multimodal embedding in BigQuery using the following simple flow. If you like, you can take a look at our overview video, which walks through a similar example.

Step 0: Create an object table which points to your unstructured data

You can work with unstructured data in BigQuery via object tables. For example, if you have your images stored in a Google Cloud Storage bucket on which you want to generate embeddings, you can create a BigQuery object table that points to this data without needing to move it. To follow along with the steps in this blog, you will need to reuse an existing BigQuery connection or create a new one following the instructions here. Ensure that the principal of the connection has the ‘Vertex AI User’ role and that the Vertex AI API is enabled for your project. Once the connection is created, you can create an object table as follows:

    CREATE OR REPLACE EXTERNAL TABLE
      `bqml_tutorial.met_images`
    WITH CONNECTION `LOCATION.CONNECTION_ID`
    OPTIONS
      ( object_metadata = 'SIMPLE',
        uris = ['gs://gcs-public-data--met/*']
      );

In this example, we are creating an object table that contains public domain art images from The Metropolitan Museum of Art (a.k.a. “The Met”) using a public Cloud Storage bucket that contains this data. The resulting object table’s schema and a sample of the images are shown in the original post; you can browse them yourself from a BigQuery Studio Colab notebook by following the instructions in this tutorial. As you can see, the images represent a wide range of objects and art pieces. (Image source: The Metropolitan Museum of Art)

Now that we have the object table with images, let’s create embeddings for them.

Step 1: Create model

To generate embeddings, first create a BigQuery model that uses the Vertex AI hosted ‘multimodalembedding@001’ endpoint:
    CREATE OR REPLACE MODEL
      bqml_tutorial.multimodal_embedding_model REMOTE
    WITH CONNECTION `LOCATION.CONNECTION_ID`
    OPTIONS (endpoint = 'multimodalembedding@001');

Note that while the multimodalembedding model supports embedding generation for text, it is specifically designed for cross-modal semantic search scenarios, for example, searching images given text. For text-only use cases, we recommend using the textembedding-gecko model instead.

Step 2: Generate embeddings

You can generate multimodal embeddings in BigQuery via the ML.GENERATE_EMBEDDING function. This function also works for generating text embeddings (via the textembedding-gecko model) and structured data embeddings (via PCA, AutoEncoder and Matrix Factorization models). To generate embeddings, simply pass in the embedding model and the object table you created in the previous steps to the ML.GENERATE_EMBEDDING function:

    CREATE OR REPLACE TABLE `bqml_tutorial.met_image_embeddings`
    AS
    SELECT * FROM ML.GENERATE_EMBEDDING(
      MODEL `bqml_tutorial.multimodal_embedding_model`,
      TABLE `bqml_tutorial.met_images`)
    WHERE content_type = 'image/jpeg'
    LIMIT 10000

To reduce the tutorial’s runtime, we limit embedding generation to 10,000 images. This query will take 30 minutes to 2 hours to run. Once this step is completed, you can see a preview of the output in BigQuery Studio. The generated embeddings have a dimension of 1408.

Step 3 (optional): Create a vector index on generated embeddings

While the embeddings generated in the previous step can be persisted and used directly in downstream models and applications, we recommend creating a vector index to improve embedding search performance and enable the nearest-neighbor query pattern. You can learn more about vector search in BigQuery here.

    -- Create a vector index on the embeddings
    CREATE OR REPLACE VECTOR INDEX `met_images_index`
    ON bqml_tutorial.met_image_embeddings(ml_generate_embedding_result)
    OPTIONS(index_type = 'IVF',
            distance_type = 'COSINE')

Step 4: Use embeddings for text-to-image (cross-modality) search

You can now use these embeddings in your applications.
For example, to search for “pictures of white or cream colored dress from victorian era”, you first embed the search string like so:

    -- embed search string
    CREATE OR REPLACE TABLE `bqml_tutorial.search_embedding`
    AS
    SELECT * FROM ML.GENERATE_EMBEDDING(
      MODEL `bqml_tutorial.multimodal_embedding_model`,
      (
        SELECT "pictures of white or cream colored dress from victorian era" AS content
      )
    )

You can now use the embedded search string to find similar (nearest) image embeddings as follows:

    -- use the embedded search string to search for images
    CREATE OR REPLACE TABLE
      `bqml_tutorial.vector_search_results` AS
    SELECT
      base.uri AS gcs_uri,
      distance
    FROM
      VECTOR_SEARCH( TABLE `bqml_tutorial.met_image_embeddings`,
        "ml_generate_embedding_result",
        TABLE `bqml_tutorial.search_embedding`,
        "ml_generate_embedding_result",
        top_k => 5)

Step 5: Visualize results

Now let’s visualize the results along with the computed distance and see how we performed on the search query “pictures of white or cream colored dress from victorian era”. Refer to the accompanying tutorial for how to render this output using a BigQuery notebook. (Image source: The Metropolitan Museum of Art) The results look quite good!

Wrapping up

In this blog, we demonstrated a common vector search usage pattern, but there are many other use cases for embeddings. For example, with multimodal embeddings you can perform zero-shot classification of images by converting a table of images and a separate table containing sentence-like labels to embeddings. You can then classify images by computing the distance between each image and each descriptive label’s embedding. You can also use these embeddings as input for training other ML models, such as clustering models in BigQuery, to help you discover hidden groupings in your data. Embeddings are also useful wherever you have free text input as a feature: for example, embeddings of user reviews or call transcripts can be used in a churn prediction model, embeddings of images of a house can be used as input features in a price prediction model, etc. You can even use embeddings instead of categorical text data when such categories have semantic meaning, for example, product categories in a deep-learning recommendation model.

In addition to multimodal and text embeddings, BigQuery also supports generating embeddings on structured data using PCA, AUTOENCODER and Matrix Factorization models that have been trained on your data in BigQuery. These embeddings have a wide range of use cases. For example, embeddings from PCA and AUTOENCODER models can be used for anomaly detection (embeddings further away from other embeddings are deemed anomalies) and as input features to other models, for example, a sentiment classification model trained on embeddings from an autoencoder. Matrix Factorization models are classically used for recommendation problems, and you can use them to generate user and item embeddings. Then, given a user embedding you can find the nearest item embeddings and recommend these items, or cluster users so that they can be targeted with specific promotions.
To generate such embeddings, first use the CREATE MODEL statement to create a PCA, AutoEncoder or Matrix Factorization model with your data as input, and then use the ML.GENERATE_EMBEDDING function, providing the model and a table input, to generate embeddings on this data (a rough sketch of this flow is shown below).

Getting started

Support for multimodal embeddings and support for embeddings on structured data in BigQuery are now available in preview. Get started by following our documentation and tutorials. Have feedback? Let us know what you think at bqml-feedback@google.com. View the full article
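As a rough sketch of the structured-data flow described above (not from the original post), the following example trains a PCA model and then generates embeddings from it. The dataset and table names (bqml_tutorial.customer_features, bqml_tutorial.customer_pca) and the num_principal_components value are placeholder assumptions for illustration.

    -- Train a PCA model on structured data (illustrative names and options)
    CREATE OR REPLACE MODEL `bqml_tutorial.customer_pca`
    OPTIONS (
      model_type = 'PCA',
      num_principal_components = 16
    ) AS
    SELECT * EXCEPT(customer_id)
    FROM `bqml_tutorial.customer_features`;

    -- Generate embeddings (principal-component scores) for each row
    CREATE OR REPLACE TABLE `bqml_tutorial.customer_embeddings` AS
    SELECT *
    FROM ML.GENERATE_EMBEDDING(
      MODEL `bqml_tutorial.customer_pca`,
      TABLE `bqml_tutorial.customer_features`);

The same pattern should apply to AUTOENCODER and MATRIX_FACTORIZATION models, with only the CREATE MODEL options changing.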
  3. Delta Lake is an open-source optimized storage layer that provides a foundation for tables in lakehouses and brings reliability and performance improvements to existing data lakes. It sits on top of your data lake storage (like cloud object stores) and provides a performant and scalable metadata layer on top of data stored in the Parquet format.

Organizations use BigQuery to manage and analyze all data types, structured and unstructured, with fine-grained access controls. In the past year, customer use of BigQuery to process multiformat, multicloud, and multimodal data using BigLake has grown over 60x. Support for open table formats gives you the flexibility to use existing open source and legacy tools while getting the benefits of an integrated data platform. This is enabled via BigLake — a storage engine that allows you to store data in open file formats on cloud object stores such as Google Cloud Storage, and run Google-Cloud-native and open-source query engines on it in a secure, governed, and performant manner. BigLake unifies data warehouses and lakes by providing an advanced, uniform data governance model. This week at Google Cloud Next '24, we announced that this support now extends to the Delta Lake format, enabling you to query Delta Lake tables stored in Cloud Storage or Amazon Web Services S3 directly from BigQuery, without having to export, copy, or use manifest files to query the data.

Why is this important?

If you have existing dependencies on Delta Lake and prefer to continue utilizing it, you can now leverage BigQuery native support. Google Cloud provides an integrated and price-performant experience for Delta Lake workloads, encompassing unified data management, centralized security, and robust governance. Many customers already harness the capabilities of Dataproc or Serverless Spark to manage Delta Lake tables on Cloud Storage. Now, BigQuery's native Delta Lake support enables seamless delivery of data for downstream applications such as business intelligence and reporting, as well as integration with Vertex AI. This lets you do a number of things, including:

  • Build a secure and governed lakehouse with BigLake's fine-grained security model
  • Securely exchange Delta Lake data using Analytics Hub
  • Run data science workloads on Delta Lake using BigQuery ML and Vertex AI

How to use Delta Lake with BigQuery

Delta Lake tables follow the same table creation process as BigLake tables.

Required roles: To create a BigLake table, you need the following BigQuery identity and access management (IAM) permissions:

  • bigquery.tables.create
  • bigquery.connections.delegate

Prerequisites: Before you create a BigLake table, you need to have a dataset and a Cloud resource connection that can access Cloud Storage.

Table creation using DDL: Here is the DDL statement to create a Delta Lake table:

    CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`
    WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
    OPTIONS (
      format = "DELTA_LAKE",
      uris = ['DELTA_TABLE_GCS_BASE_PATH']);

Querying Delta Lake tables

After creating a Delta Lake BigLake table, you can query it using GoogleSQL syntax, the same as you would a standard BigQuery table.
For example:

    SELECT FIELD1, FIELD2 FROM `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`

You can also enforce fine-grained security at the table level, including row-level and column-level security (a rough sketch of a row access policy is shown at the end of this post). For Delta Lake tables based on Cloud Storage, you can also use dynamic data masking.

Conclusion

We believe that BigQuery's support for Delta Lake is a major step forward for customers building lakehouses using Delta Lake. This integration will make it easier for you to get insights from your data and make data-driven decisions. We are excited to see how you use Delta Lake and BigQuery together to solve your business challenges. For more information on how to use Delta Lake with BigQuery, please refer to the documentation.

Acknowledgments: Mahesh Bogadi, Garrett Casto, Yuri Volobuev, Justin Levandoski, Gaurav Saxena, Manoj Gunti, Sami Akbay, Nic Smith and the rest of the BigQuery Engineering team. View the full article
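As a rough illustration of the row-level security mentioned above (this sketch is not from the original post; the policy name, group, and column values are placeholder assumptions), a row access policy on the table might look like this:

    -- Illustrative: only members of the EMEA sales group see EMEA rows
    CREATE ROW ACCESS POLICY emea_only
    ON `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`
    GRANT TO ('group:emea-sales@example.com')
    FILTER USING (region = 'EMEA');

Column-level security and dynamic data masking, by contrast, are configured through policy tags rather than DDL on the table itself.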
  4. RudderStack explains how churn prediction can be done using Google’s BigQuery ML together with the clickstream data gathered and delivered by RudderStack. View the full article
  5. Navigating the complexities of the data-to-insights journey can be frustrating. Data professionals spend valuable time sifting through data sources, reinventing the wheel with each new question that comes their way. They juggle multiple tools, hop between coding languages, and collaborate with a wide array of teams across their organizations. This fragmented approach is riddled with bottlenecks, preventing analysts from generating insights and doing high-impact work as quickly as they should.

Yesterday at Google Cloud Next ‘24, we introduced BigQuery data canvas, which reimagines how data professionals work with data. This novel user experience helps customers create graphical data workflows that map to their mental model, while AI innovations accelerate finding, preparing, analyzing, visualizing and sharing data and insights. Watch this video for a quick overview of BigQuery data canvas.

BigQuery data canvas: an NL-driven analytics experience

BigQuery data canvas makes data analytics faster and easier with a unified, natural language-driven experience that centralizes data discovery, preparation, querying, and visualization. Rather than toggling between multiple tools, you can now use data canvas to focus on the insights that matter most to your business. Data canvas addresses the challenges of the traditional data analysis workflow in two areas:

  • Natural language-centric experience: Instead of writing code, you can now speak directly to your data. Ask questions, direct tasks, and let the AI guide you through various analytics tasks.
  • Reimagined user experience: Data canvas rethinks the notebook concept. Its expansive canvas workspace fosters iteration and easy collaboration, allowing you to refine your work, chain results, and share workspaces with colleagues.

For example, to analyze a recent marketing campaign with BigQuery data canvas, you could use natural language prompts to discover campaign data sources, integrate them with existing customer data, derive insights, collaborate with teammates and share visual reports with executives — all within a single canvas experience.

Natural language-based visual workflow with BigQuery data canvas

Do more with BigQuery data canvas

BigQuery data canvas provides a variety of features that can help analysts accelerate their analytics tasks:

  • Search and discover: Easily find the specific data asset, visualization, table or view that you need to work with, or search for the most relevant data assets. Data canvas works with all data that can be managed with BigQuery, including BigQuery managed storage, BigLake, Google Cloud Storage objects, and BigQuery Omni tables. For example, you could use either of the following inputs to pull data with data canvas: a specific table (project_name.dataset_name.table_name), or a search such as "customer transaction data" or "projectid:my-project-name winter jacket sales Atlanta".
  • Explore data assets: Review the table schema, review the details or preview data, and compare them side by side.
  • Generate SQL queries: Iterate with NL inputs to generate the exact SQL query you need to accomplish the analytics task at hand. You can also edit the SQL before executing it.
  • Combine results: Define joins with plain language instructions and refine the generated SQL as needed. Use query results as a starting point for further analysis with prompts like "Join this data with our customer demographics on order id." (A purely illustrative example of such a generated query appears at the end of this post.)
  • Visualize: Use natural language prompts to easily create and customize charts and graphs to visualize your data, e.g., “create a bar chart with gradient”. Then, seamlessly share your findings by exporting your results to Looker Studio or Google Sheets.
  • Automated insights: Data canvas can interpret query results and chart data and generate automated insights from them. For example, it can look at the query results of sales deal sizes and automatically provide the insight “the median deal size is $73,500.”
  • Share to collaborate: Data analytics projects are often a team effort. You can simply save your canvas and share it with others using a link.

Popular use cases

While BigQuery data canvas can accelerate many analytics tasks, it’s particularly helpful for:

  • Ad hoc analysis: When working on a tight deadline, data canvas makes it easy to pull data from various sources.
  • Exploratory data analysis (EDA): This critical early step in the data analysis process focuses on summarizing the main characteristics of a dataset, often visually. Data canvas helps find data sources and then presents the results visually.
  • Collaboration: Data canvas makes it easy to share an analytics project with multiple people.

What our customers are saying

Companies large and small have been experimenting with BigQuery data canvas for their day-to-day analytics tasks, and their feedback has been very positive. Wunderkind, a performance marketing channel that powers one-to-one customer interactions, has been using BigQuery data canvas across their analytics team for several weeks and is experiencing significant time savings.

“For any sort of investigation or exploratory exercise resulting in multiple queries there really is no replacement [for data canvas]. [It] Saves us so much time and mental capacity!” - Scott Schaen, VP of Data & Analytics, Wunderkind

How Wunderkind accelerates time to insights with BigQuery data canvas

Veo, a micromobility company that operates in 50+ locations across the USA, is seeing immediate benefits from the AI capabilities in data canvas.

“I think it's been great in terms of being able to turn ideas in the form of NL to SQL to derive insights. And the best part is that I can review and edit the query before running it - that’s a very smart and responsible design. It gives me the space to confirm it and ensure accuracy as well as reliability!” - Tim Velasquez, Head of Analytics, Veo

Give BigQuery data canvas a try

To learn more, watch this video and check out the documentation. BigQuery data canvas is launching in preview and will be rolled out to all users starting on April 15th. Submit this form to get early access. For any bugs and feedback, please reach out to the product and engineering team at datacanvas-feedback@google.com. We’re looking forward to hearing how you use the new data canvas! View the full article
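As a purely illustrative sketch of the NL-to-SQL flow described above (the project, dataset, table and column names are invented, and the SQL that data canvas actually generates will differ), a prompt like "Join this data with our customer demographics on order id" might produce something along these lines, which you can review and edit before running:

    -- Illustrative only: join campaign results with customer demographics
    SELECT
      o.order_id,
      o.campaign_id,
      o.order_total,
      d.age_bracket,
      d.region
    FROM `my_project.marketing.campaign_orders` AS o
    JOIN `my_project.crm.customer_demographics` AS d
      ON o.order_id = d.order_id;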
  6. The journey of going from data to insights can be fragmented, complex and time consuming. Data teams spend time on repetitive and routine tasks such as ingesting structured and unstructured data, wrangling data in preparation for analysis, and optimizing and maintaining pipelines. Obviously, they’d prefer doing higher-value analysis and insights-led decision making. At Next ‘23, we introduced Duet AI in BigQuery. This year at Next ‘24, Duet AI in BigQuery becomes Gemini in BigQuery, which provides AI-powered experiences for data preparation, analysis and engineering, as well as intelligent recommendations to enhance user productivity and optimize costs.

"With the new AI-powered assistive features in BigQuery and ease of integrating with other Google Workspace products, our teams can extract valuable insights from data. The natural language-based experiences, low-code data preparation tools, and automatic code generation features streamline high-priority analytics workflows, enhancing the productivity of data practitioners and providing the space to focus on high impact initiatives. Moreover, users with varying skill sets, including our business users, can leverage more accessible data insights to effect beneficial changes, fostering an inclusive data-driven culture within our organization." said Tim Velasquez, Head of Analytics, Veo

Let’s take a closer look at the new features of Gemini in BigQuery.

Accelerate data preparation with AI

Your business insights are only as good as your data. When you work with large datasets that come from a variety of sources, there are often inconsistent formats, errors, and missing data. As such, cleaning, transforming, and structuring them can be a major hurdle. To simplify data preparation, validation, and enrichment, BigQuery now includes AI-augmented data preparation that helps users cleanse and wrangle their data. Additionally, we are enabling users to build low-code visual data pipelines, or rebuild legacy pipelines in BigQuery. Once the pipelines are running in production, AI assists with finding and resolving issues such as schema or data drift, significantly reducing the toil associated with maintaining a data pipeline. Because the resulting pipelines run in BigQuery, users also benefit from integrated metadata management, automatic end-to-end data lineage, and capacity management.

Gemini in BigQuery provides AI-driven assistance for users to clean and wrangle data

Kickstart the data-to-insights journey

Most data analysis starts with exploration — finding the right dataset, understanding the data’s structure, identifying key patterns, and identifying the most valuable insights you want to extract. This step can be cumbersome and time-consuming, especially if you are working with a new dataset or if you are new to the team. To address this problem, Gemini in BigQuery provides new semantic search capabilities to help you pinpoint the most relevant tables for your tasks. Leveraging the metadata and profiling information of these tables from Dataplex, Gemini in BigQuery surfaces relevant, executable queries that you can run with just one click. You can learn more about BigQuery data insights here.

Gemini in BigQuery suggests executable queries for tables that you can run in a single click

Reimagine analytics workflows with natural language

To boost user productivity, we’re also rethinking the end-to-end user experience.
The new BigQuery data canvas provides a reimagined natural language-based experience for data exploration, curation, wrangling, analysis, and visualization, allowing you to explore and scaffold your data journeys in a graphical workflow that mirrors your mental model. For example, to analyze a recent marketing campaign, you can use simple natural language prompts to discover campaign data sources, integrate with existing customer data, derive insights, and share visual reports with executives — all within a single experience. Watch this video for a quick overview of BigQuery data canvas.

BigQuery data canvas allows you to explore and analyze datasets, and create a customized visualization, all using natural language prompts within the same interface

Enhance productivity with SQL and Python code assistance

Even advanced users sometimes struggle to remember all the details of SQL or Python syntax, and navigating through numerous tables, columns, and relationships can be daunting. Gemini in BigQuery helps you write and edit SQL or Python code using simple natural language prompts, referencing relevant schemas and metadata. You can also leverage BigQuery’s in-console chat interface to explore tutorials, documentation and best practices for specific tasks using simple prompts such as: “How can I use BigQuery materialized views?”, “How do I ingest JSON data?” and “How can I improve query performance?”

Optimize analytics for performance and speed

With growing data volumes, analytics practitioners, including data administrators, find it increasingly challenging to effectively manage capacity and enhance query performance. We are introducing recommendations that can help continuously improve query performance, minimize errors and optimize your platform costs. With these recommendations, you can identify materialized views that can be created or deleted based on your query patterns, and partition or cluster your tables (a rough sketch of acting on such a recommendation is shown below). Additionally, you can autotune Spark pipelines and troubleshoot failures and performance issues.

Get started

To learn more about Gemini in BigQuery, watch this short overview video, refer to the documentation, and sign up to get early access to the preview features. If you’re at Next ‘24, join our data and analytics breakout sessions and stop by the demo stations to explore further and see these capabilities in action. Pricing details for Gemini in BigQuery will be shared when generally available to all customers. View the full article
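As a rough sketch of what acting on a materialized-view recommendation could look like (not from the original post; the project, dataset, table and column names are placeholder assumptions), you might precompute an aggregate that many of your queries repeat:

    -- Illustrative: precompute a daily revenue aggregate reused by many queries
    CREATE MATERIALIZED VIEW `my_project.sales.daily_revenue_mv` AS
    SELECT
      order_date,
      store_id,
      SUM(order_total) AS revenue,
      COUNT(*) AS order_count
    FROM `my_project.sales.orders`
    GROUP BY order_date, store_id;

BigQuery can then transparently rewrite eligible queries against the base table to read from the materialized view instead.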
  7. Torpedo Labs leveraged RudderStack and BigQuery ML to increase revenue for Wynn Casino’s Wynn Slots app to the tune of $10,000 a day by reducing customer churn. View the full article
  8. The rise of data collaboration and use of external data sources highlights the need for robust privacy and compliance measures. In this evolving data ecosystem, businesses are turning to clean rooms to share data in low-trust environments. Clean rooms enable secure analysis of sensitive data assets, allowing organizations to unlock insights without compromising on privacy. To facilitate this type of data collaboration, we launched the preview of data clean rooms last year. Today, we are excited to announce that BigQuery data clean rooms is now generally available.

Backed by BigQuery, customers can now share data in place with analysis rules to protect the underlying data. This launch includes a streamlined data contributor and subscriber experience in the Google Cloud console, as well as highly requested capabilities such as:

  • Join restrictions: Limits joins to specific columns for data shared in a clean room, preventing unintended or unauthorized connections between data.
  • Differential privacy analysis rule: Enforces that all queries on your shared data use differential privacy with the parameters that you specify. The privacy budget that you specify also prevents further queries on that data when the budget is exhausted.
  • List overlap analysis rule: Restricts the output to only display the intersecting rows between two or more views joined in a query.
  • Usage metrics on views: Data owners or contributors see aggregated metrics on the views and tables shared in a clean room.

Using data clean rooms in BigQuery does not require creating copies of or moving sensitive data. Instead, the data can be shared directly from your BigQuery project and you remain in full control. Any updates you make to your shared data are reflected in the clean room in real time, ensuring everyone is working with the most current data.

Create and deploy clean rooms in BigQuery

BigQuery data clean rooms are available in all BigQuery regions. You can set up a clean room environment using the Google Cloud console or using APIs. During this process, you set permissions and invite collaborators within or outside organizational boundaries to contribute or subscribe to the data.

Enforce analysis rules to protect underlying data

When sharing data into a clean room, you can configure analysis rules to protect the underlying data and determine how the data can be analyzed. BigQuery data clean rooms support multiple analysis rules including aggregation, differential privacy, list overlap, and join restrictions. The new user experience within the Cloud console lets data contributors configure these rules without needing to use SQL. Lastly, by default, a clean room employs restricted egress to prevent subscribers from exporting or copying the underlying data. However, data contributors can choose to allow the export and copying of query results for specific use cases, such as activation.

Monitor usage and stay in control of your data

The data owner or contributor is always in control of their respective data in a clean room. At any time, a data contributor can revoke access to their data. Additionally, as the clean room owner, you can adjust access using subscription management or privacy budgets to prevent subscribers from performing further analysis. Data contributors also receive aggregated logs and metrics, giving them insights into how their data is being used within the clean room. This promotes both transparency and a clearer understanding of the collaborative process.
What BigQuery data clean room customers are saying

Customers across all industries are already seeing tremendous success with BigQuery data clean rooms. Here’s what some of our early adopters and partners had to say:

“With BigQuery data clean rooms, we are now able to share and monetize more impactful data with our partners while maintaining our customers' and strategic data protection.” - Guillaume Blaquiere, Group Data Architect, Carrefour

“Data clean rooms in BigQuery is a real accelerator for L'Oréal to be able to share, consume, and manage data in a secure and sustainable way with our partners.” - Antoine Castex, Enterprise Data Architect, L’Oréal

“BigQuery data clean rooms equip marketing teams with a powerful tool for advancing privacy-focused data collaboration and advanced analytics in the face of growing signal loss. LiveRamp and Habu, which independently were each early partners of BigQuery data clean rooms, are excited to build on top of this foundation with our combined interoperable solutions: a powerful application layer, powered by Habu, accelerates the speed to value for technical and business users alike, while cloud-native identity, powered by RampID in Google Cloud, maximizes data fidelity and ecosystem connectivity for all collaborators. With BigQuery data clean rooms, enterprises will be empowered to drive more strategic decisions with actionable, data-driven insights.” - Roopak Gupta, VP of Engineering, LiveRamp

“In today’s marketing landscape, where resources are limited and the ecosystem is fragmented, solutions like the data clean room we are building with Google Cloud can help reduce friction for our clients. This collaborative clean room ensures privacy and security while allowing Stagwell to integrate our proprietary data to create custom audiences across our product and service offerings in the Stagwell Marketing Cloud. With the continued partnership of Google Cloud, we can offer our clients integrated Media Studio solutions that connect brands with relevant audiences, improving customer journeys and making media spend more efficient.” - Mansoor Basha, Chief Technology Officer, Stagwell Marketing Cloud

“We are extremely excited about the general availability announcement of BigQuery data clean rooms. It's been great collaborating with Google Cloud on this initiative and it is great to see it come to market. This release enables production-grade secure data collaboration for the media and advertising industry, unlocking more interoperable planning, activation and measurement use cases for our ecosystem.” - Bosko Milekic, Chief Product Officer, Optable

Next steps

Whether you're an advertiser trying to optimize your advertising effectiveness with a publisher, or a retailer improving your promotional strategy with a CPG, BigQuery data clean rooms can help. Get started today by using this guide, starting a free trial with BigQuery, or contacting the Google Cloud sales team. View the full article
  9. We are excited to announce that differential privacy enforcement with privacy budgeting is now available in BigQuery data clean rooms to help organizations prevent data from being reidentified when it is shared.

Differential privacy is an anonymization technique that limits the personal information that is revealed in a query output. Differential privacy is considered to be one of the strongest privacy protections that exist today because it:

  • is provably private
  • supports multiple differentially private queries on the same dataset
  • can be applied to many data types

Differential privacy is used by advertisers, healthcare companies, and education companies to perform analysis without exposing individual records. It is also used by public sector organizations that comply with the General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA), the Family Educational Rights and Privacy Act (FERPA), and the California Consumer Privacy Act (CCPA).

What can I do with differential privacy?

With differential privacy, you can:

  • protect individual records from re-identification without moving or copying your data
  • protect against privacy leaks and re-identification
  • use one of the anonymization standards most favored by regulators

BigQuery customers can use differential privacy to:

  • share data in BigQuery data clean rooms while preserving privacy
  • anonymize query results on AWS and Azure data with BigQuery Omni
  • share anonymized results with Apache Spark stored procedures and Dataform pipelines so they can be consumed by other applications
  • enhance differential privacy implementations with technology from Google Cloud partners Gretel.ai and Tumult Analytics
  • call frameworks like PipelineDP.io

So what is BigQuery differential privacy exactly?

BigQuery differential privacy is three capabilities:

  • Differential privacy in GoogleSQL – You can use differential privacy aggregate functions directly in GoogleSQL (a rough sketch is shown at the end of this post)
  • Differential privacy enforcement in BigQuery data clean rooms – You can apply a differential privacy analysis rule to enforce that all queries on your shared data use differential privacy in GoogleSQL with the parameters that you specify
  • Parameter-driven privacy budgeting in BigQuery data clean rooms – When you apply a differential privacy analysis rule, you also set a privacy budget to limit the data that is revealed when your shared data is queried. BigQuery uses parameter-driven privacy budgeting to give you more granular control over your data than query thresholds do, and to prevent further queries on that data when the budget is exhausted.

BigQuery differential privacy enforcement in action

Here’s how to enable the differential privacy analysis rule and configure a privacy budget when you add data to a BigQuery data clean room. Subscribers of that clean room must then use differential privacy to query your shared data, and they cannot query your shared data once the privacy budget is exhausted.

Get started with BigQuery differential privacy

BigQuery differential privacy is configured when a data owner or contributor shares data in a BigQuery data clean room. A data owner or contributor can share data using any compute pricing model and does not incur compute charges when a subscriber queries that data. Subscribers of a data clean room incur compute charges when querying shared data that is protected with a differential privacy analysis rule.
Those subscribers are required to use on-demand pricing (charged per TB) or the Enterprise Plus edition (charged per slot hour). Create a clean room where all queries are protected with differential privacy today, and let us know where you need help. View the full article
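As a rough sketch of the GoogleSQL differential privacy clause mentioned above (the table and column names are placeholders, and the epsilon and delta values are illustrative rather than recommended settings):

    -- Illustrative: differentially private average purchase amount per region
    SELECT
      WITH DIFFERENTIAL_PRIVACY
        OPTIONS (epsilon = 1.0, delta = 1e-5, privacy_unit_column = customer_id)
      region,
      AVG(purchase_amount) AS avg_purchase
    FROM `my_project.sales.transactions`
    GROUP BY region;

In a clean room with the differential privacy analysis rule enabled, subscribers would run queries of this shape against the shared views, and each query draws against the privacy budget the contributor configured.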
  10. Editor’s note: This post is part of a series highlighting our partners, and their solutions, that are Built with BigQuery.

To fully leverage the data that’s critical for modern businesses, it must be accurate, complete, and up to date. Since 2007, ZoomInfo has provided B2B teams with the accurate firmographic, technographic, contact, and intent data they need to hit their marketing, sales, and revenue targets. While smart analytics teams have used ZoomInfo data sets in Google BigQuery to integrate them with other sources to deliver reliable and actionable insights powered by machine learning, Google Cloud and ZoomInfo recently partnered to give organizations even richer data sets and more powerful analytics tools. Today, customers have instant access to ZoomInfo data and intelligence directly within Google BigQuery. ZoomInfo is available as a virtual view in BigQuery, so analysts can explore the data there even before importing it. Once ZoomInfo data has been imported into BigQuery, data and operations teams will be able to use it in their workflows quickly and easily, saving their sales and marketing teams time, money, and resources.

ZoomInfo data sets include:

  • Contact and company: Capture essential prospect and customer data — from verified email addresses and direct-dial business phone and mobile numbers, to job responsibilities and web mentions. Get B2B company insights, including organizational charts, employee and revenue growth rates, and look-alike companies.
  • Technographics and scoops: Uncover the technologies that prospects use — and how they use them — to inform your marketing and sales efforts. Discover trends to shape the right outreach messaging and determine a buyer’s needs before making the first touch.
  • Buyer intent: ZoomInfo’s buyer intent engine captures real-time buying signals from companies researching relevant topics and keywords related to your business solution across the web.
  • Website IP traffic: Enrich data around traffic from your digital properties, so your customer-facing teams can take immediate action and turn traffic into sales opportunities.

In the future, ZoomInfo data sets will be available in the Google Cloud Marketplace as well as Google Cloud Analytics Hub (now in preview) alongside popular Google and third-party data sets including Google Trends, Google Analytics, and Census Bureau data.

The new features will help ZoomInfo and Google Cloud customers such as Wayfair Professional, one of the world’s largest home retailers. Wayfair Professional is a long-time user of ZoomInfo’s Company Data Brick, API, and enrichment services, and has historically accessed ZoomInfo data through file transfer, which involved shuffling encrypted CSVs back and forth over SFTP and manual file processing to ingest it into Google BigQuery. Ryan Sigurdson, senior analytics manager at Wayfair Professional, shared that moving their monthly offline company enrichment workflow to BigQuery could save them weeks of manual work and maintenance every month.

Built with BigQuery

ZoomInfo is one of over 700 tech companies powering their products and businesses using data cloud products from Google, such as BigQuery, Looker, Spanner, and Vertex AI. Recently at the Data Cloud Summit, Google Cloud announced Built with BigQuery, which helps ISVs like ZoomInfo get started building applications using data and machine learning products.
By providing dedicated access to technology, expertise, and go-to-market programs, this initiative can help tech companies accelerate, optimize, and amplify their success.

ZoomInfo’s SaaS solutions have been built on Google Cloud for years. By partnering with Google Cloud, ZoomInfo can leverage an all-in-one cloud platform to develop its data collection, data processing, data storage, and data analytics solutions. “Enabling customers to gain superior insights and intelligence from data is core to the ZoomInfo strategy. We are excited about the innovation Google Cloud is bringing to market and how it is creating a differentiated ecosystem that allows customers to gain insights from their data securely, at scale, and without having to move data around,” says Henry Schuck, ZoomInfo’s chief executive officer. “Working with the Built with BigQuery team enables us to rapidly gain deep insight into the opportunities available and accelerate our speed to market.”

Google Cloud provides a platform for building data-driven applications like ZoomInfo, from simplified data ingestion, processing, and storage to powerful analytics, AI/ML, and data sharing capabilities, all integrated with the open, secure, and sustainable Google Cloud platform. With a diverse partner ecosystem and support for multicloud, open source tools, and APIs, Google Cloud provides technology companies the portability and extensibility they need to avoid data lock-in.

To learn more about ZoomInfo on Google Cloud, visit https://www.zoominfo.com/offers/google-bigquery. To learn more about Built with BigQuery, visit https://cloud.google.com/solutions/data-cloud-isvs
  11. Make the most out of your BigQuery usage: burn data rather than money to create real value with some practical techniques.

· Introduction
· BigQuery basics and understanding costs
  ∘ Storage
  ∘ Compute
· Data modeling
  ∘ Data types
  ∘ The shift towards de-normalization
  ∘ Partitioning
  ∘ Clustering
  ∘ Nested repeated columns
  ∘ Indexing
  ∘ Physical Bytes Storage Billing
  ∘ Join optimizations with primary keys and foreign keys
· Data operations
  ∘ Copy data / tables
  ∘ Load data
  ∘ Delete partitions
  ∘ Get distinct partitions for a table
  ∘ Do not persist calculated measures
· Summary
  ∘ Embrace data modeling best practices
  ∘ Master data operations for cost-effectiveness
  ∘ Design for efficiency and avoid unnecessary data persistence

Disclaimer: BigQuery is a product which is constantly being developed and pricing might change at any time; this article is based on my own experience.

Introduction

In the field of data warehousing, there’s a universal truth: managing data can be costly. Like a dragon guarding its treasure, each byte stored and each query executed demands its share of gold coins. But let me give you a magical spell to appease the dragon: burn data, not money! In this article, we will unravel the arts of BigQuery sorcery, to reduce costs while increasing efficiency, and beyond. Join me as we journey through the depths of cost optimization, where every byte is a precious coin.

BigQuery basics and understanding costs

BigQuery is not just a tool but a package of scalable compute and storage technologies, with a fast network, everything managed by Google. At its core, BigQuery is a serverless data warehouse for analytical purposes with built-in features like machine learning (BigQuery ML). BigQuery separates storage and compute, with Google’s Jupiter network in between to utilize 1 Petabit/sec of total bisection bandwidth. The storage system uses Capacitor, a proprietary columnar storage format by Google for semi-structured data, and the file system underneath is Colossus, the distributed file system by Google. The compute engine is based on Dremel and uses Borg for cluster management, running thousands of Dremel jobs across cluster(s).

The following illustration shows the basic architecture of how BigQuery is structured:

BigQuery architecture (by author)

Data can be stored in Colossus; however, it is also possible to create BigQuery tables on top of data stored in Google Cloud Storage. In that case, queries are still processed using the BigQuery compute infrastructure but read data from GCS instead. Such external tables come with some disadvantages, but in some cases it can be more cost efficient to have the data stored in GCS. Also, sometimes it is not about Big Data but simply about reading data from existing CSV files that are somehow ingested to GCS. For simplicity, it can also be beneficial to use these kinds of tables.

BigQuery external tables (by author)

To utilize the full potential of BigQuery, the regular case is to store data in the BigQuery storage. The main drivers for costs are storage and compute; Google is not charging you for other parts, like the network transfer in between storage and compute.

Storage

Storage costs you $0.02 per GB – $0.04 per GB for active and $0.01 per GB – $0.02 per GB for inactive data (which means not modified in the last 90 days).
If you have a table or partition that is not modified for 90 consecutive days, it is considered long-term storage, and the price of storage automatically drops by 50%. The discount is applied on a per-table, per-partition basis. Modification resets the 90-day counter.

Compute

BigQuery charges for data scanned and not for the runtime of the query; transfer from storage to the compute cluster is also not charged. Compute costs depend on the location; the costs for europe-west3 are $8.13 per TB, for example. This means: we want to minimize the data to be scanned for each query!

When executing a query, BigQuery estimates the data to be processed. After entering your query in the BigQuery Studio query editor, you can see the estimate on the top right. If it says 1.27 GB like in the screenshot in the original article and the query is processed in the location europe-west3, the costs can be calculated like this:

    1.27 GB / 1024 = 0.0012 TB; 0.0012 TB * $8.13 ≈ $0.01 total costs

The estimate is mostly a pessimistic calculation; often the optimizer is able to use cached results, materialized views or other techniques, so that the actual bytes billed are lower than the estimate. It is still a good practice to check this estimate in order to get a rough feeling of the impact of your work.

It is also possible to set a maximum for the bytes billed for your query. If your query exceeds the limit, it will fail and create no costs at all. The setting can be changed by navigating to More -> Query settings -> Advanced options -> Maximum bytes billed. Unfortunately, up until now, it is not possible to set a default value per query. It is only possible to limit the bytes billed for each day per user per project, or for all bytes billed combined per day for a project.

When you start using BigQuery for the first projects, you will most likely stick with the on-demand compute pricing model. With on-demand pricing, you will generally have access to up to 2000 concurrent slots, shared among all queries in a single project, which is more than enough in most cases. A slot is like a virtual CPU working on a unit of work of your query DAG. When reaching a certain spending per month, it is worth looking into the capacity pricing model, which gives you more predictable costs.

Data modeling

Data types

To reduce the costs for storage but also compute, it is very important to always use the smallest datatype possible for your columns.
You can easily estimate the costs for a certain number of rows following this overview:

    Type       | Size
    -----------|----------------------------------------------------------------
    ARRAY      | Sum of the size of its elements
    BIGNUMERIC | 32 logical bytes
    BOOL       | 1 logical byte
    BYTES      | 2 logical bytes + logical bytes in the value
    DATE       | 8 logical bytes
    DATETIME   | 8 logical bytes
    FLOAT64    | 8 logical bytes
    GEOGRAPHY  | 16 logical bytes + 24 logical bytes * vertices in the geo type
    INT64      | 8 logical bytes
    INTERVAL   | 16 logical bytes
    JSON       | Logical bytes in UTF-8 encoding of the JSON string
    NUMERIC    | 16 logical bytes
    STRING     | 2 logical bytes + the UTF-8 encoded string size
    STRUCT     | 0 logical bytes + the size of the contained fields
    TIME       | 8 logical bytes
    TIMESTAMP  | 8 logical bytes

NULL is calculated as 0 logical bytes.

Example:

    CREATE TABLE gold.some_table (
      user_id INT64,
      other_id INT64,
      some_string STRING,    -- max 10 chars
      country_code STRING(2),
      user_name STRING,      -- max 20 chars
      day DATE
    );

With this definition and the table of datatypes, it is possible to estimate the logical size of 100,000,000 rows:

    100,000,000 rows * (
      8 bytes (INT64) +
      8 bytes (INT64) +
      2 bytes + 10 bytes (STRING) +
      2 bytes + 2 bytes (STRING(2)) +
      2 bytes + 20 bytes (STRING) +
      8 bytes (DATE)
    ) = 6,200,000,000 bytes / 1024 / 1024 / 1024 = 5.78 GB

Assuming we are running a SELECT * on this table, it would cost us 5.78 GB / 1024 = 0.0056 TB * $8.13 = $0.05 in europe-west3. It is a good idea to make these calculations before designing your data model, not only to optimize the datatype usage but also to get an estimate of the costs for the project that you are working on.

The shift towards de-normalization

In the realm of database design and management, data normalization and de-normalization are fundamental concepts aimed at optimizing data structures for efficient storage, retrieval, and manipulation. Traditionally, normalization has been hailed as a best practice, emphasizing the reduction of redundancy and the preservation of data integrity. However, in the context of BigQuery and other modern data warehouses, the dynamics shift, and de-normalization often emerges as the preferred approach.

In normalized databases, data is structured into multiple tables, each representing a distinct entity or concept, and linked through relationships such as one-to-one, one-to-many, or many-to-many. This approach adheres to the principles laid out by database normalization forms, such as the First Normal Form (1NF), Second Normal Form (2NF), and Third Normal Form (3NF), among others. It comes with the advantages of reduced redundancy, data integrity and, consequently, less storage usage.

While data normalization holds merit in traditional relational databases, the paradigm shifts when dealing with modern analytics platforms like BigQuery. BigQuery is designed for handling massive volumes of data and performing complex analytical queries at scale. In this environment, the emphasis shifts from minimizing storage space to optimizing query performance. In BigQuery, de-normalization emerges as a preferred strategy for several reasons:

  • Query performance: BigQuery’s distributed architecture excels at scanning large volumes of data in parallel.
    De-normalized tables reduce the need for complex joins, resulting in faster query execution times.
  • Cost efficiency: By minimizing the computational resources required for query processing, de-normalization can lead to cost savings, as query costs in BigQuery are based on the amount of data processed.
  • Simplified data modeling: De-normalized tables simplify the data modeling process, making it easier to design and maintain schemas for analytical purposes.
  • Optimized for analytical workloads: De-normalized structures are well-suited for analytical workloads, where aggregations, transformations, and complex queries are common.

Also, storage is much cheaper than compute, and that means: with pre-joined datasets, you exchange compute for storage resources!

De-normalization (by author)

Partitioning

Partitions divide a table into segments based on one specific column. The partition column can use one of 3 approaches:

  • Integer range partitioning: Partition by an integer column based on a range with start, end and interval
  • Time-unit partitioning: Partition by a date, timestamp or datetime column in the table with hourly, daily, monthly or yearly granularity
  • Ingestion time partitioning: Automatically assign the partition when inserting data based on the current time, with a pseudocolumn named _PARTITIONTIME

It is up to you to define the partition column, but it is highly recommended to choose this wisely as it can eliminate a lot of bytes processed / billed.

Partitioning example (by author)

Example:

    CREATE TABLE IF NOT EXISTS silver.some_partitioned_table (
      title STRING,
      topic STRING,
      day DATE
    )
    PARTITION BY day
    OPTIONS (
      partition_expiration_days = 365
    );

In the above example you can also see how to set the partition_expiration_days option, which will remove partitions older than X days.

Clustering

Clusters sort the data within each partition based on one or more columns. When using cluster columns in your query filter, this technique will speed up the execution since BigQuery can determine which blocks to scan. This is especially recommended for use with high-cardinality columns such as the title column in the following example. You can define up to four cluster columns.

Example:

    CREATE TABLE IF NOT EXISTS silver.some_partitioned_table (
      title STRING,
      topic STRING,
      day DATE
    )
    PARTITION BY day
    CLUSTER BY topic
    OPTIONS (
      partition_expiration_days = 365
    );

Nested repeated columns

With data de-normalization, duplication of information is often introduced as well. This data redundancy adds additional storage and bytes to be processed in our queries. However, there is a way to have a de-normalized table design without redundancy, using nested repeated columns. A nested column uses the type STRUCT and combines certain attributes into one object. A nested repeated column is an array of structs stored for a single row in the table.

For example: if you have a table storing one row per login of a user, together with the user ID and the registration country of that user, you would have redundancy in the form of the ID and country per login for each user. Instead of storing one row per login, with a nested repeated column you can store one single row per user, and in a column of type ARRAY<STRUCT<...>> you store an array of all logins of that user. The struct holds all attributes attached to the login, like the date and device.
The following illustration visualizes this example:

Nested repeated column example (by author)

Example:

    CREATE TABLE silver.logins (
      user_id INT64,
      country STRING(2),
      logins ARRAY<STRUCT<
        login_date DATE,
        login_device STRING
      >>,
      day DATE
    )
    PARTITION BY day
    CLUSTER BY country, user_id
    OPTIONS (
      require_partition_filter = true
    );

The above example also shows the utilization of require_partition_filter, which will prevent any queries without a filter on the partition column. This data modeling technique can reduce the stored and processed bytes drastically. However, it is not the silver bullet for all de-normalization or data modeling cases. The major downside is: you can’t set cluster or partition columns on attributes of structs. That means: in the example above, if a user filters by login_device, a full table scan is necessary and we do not have the option to optimize this with clustering. This can be an issue especially if your table is used as a data source for third-party software like Excel or PowerBI. In such cases, you should carefully evaluate whether the benefit of removing redundancy with nested repeated columns compensates for the lack of optimizations via clustering.

Indexing

By defining a search index on one or multiple columns, BigQuery can use it to speed up queries using the SEARCH function. A search index can be created with the CREATE SEARCH INDEX statement:

    CREATE SEARCH INDEX example_index ON silver.some_table(ALL COLUMNS);

With ALL COLUMNS, the index is automatically created for all STRING and JSON columns. It is also possible to be more selective and add a list of column names instead. With the SEARCH function, the index can be utilized to search within all or specific columns:

    SELECT * FROM silver.some_table WHERE SEARCH(some_table, 'needle');

A new feature, which is in preview at the time of writing this article, is to also utilize the index for operators such as =, IN, LIKE, and STARTS_WITH. This can be very beneficial for data structures that are directly used by end users via third-party tools like PowerBI or Excel, to further increase speed and reduce costs for certain filter operations. More information about this can be found in the official search index documentation.

Physical Bytes Storage Billing

BigQuery offers two billing models for storage: standard and Physical Bytes Storage Billing. Choosing the right model depends on your data access patterns and compression capabilities.

The standard model is straightforward. You pay a set price per gigabyte of data, with a slight discount for data that hasn’t been modified in 90 days. This is simple to use and doesn’t require managing different storage categories. However, it can be more expensive if your data is highly compressed or if you don’t access it very often.

Physical Bytes Storage Billing takes a different approach. Instead of paying based on how much logical data you store, you pay based on the physical space it occupies on disk, regardless of how often you access it or how well it’s compressed. This model can be significantly cheaper for highly compressed data or data you don’t access frequently. However, it requires you to manage two separate storage classes: one for frequently accessed data and another for long-term storage, which can add complexity.

So, which model should you choose?
Here's a quick guide:
Choose the standard model if:
- Your data isn't highly compressed.
- You prefer a simple and easy-to-manage approach.
Choose Physical Bytes Storage Billing (PBSB) if:
- Your data is highly compressed.
- You're comfortable managing different storage classes to optimize costs.
You can change the billing model in the advanced options of your datasets. You can also check the logical vs. physical bytes in the table details view, which makes it easier to decide on a model.
Dataset advanced options for Storage Billing Model

Join optimizations with primary keys and foreign keys
In July 2023, BigQuery introduced unenforced primary key and foreign key constraints. Keep in mind that BigQuery is not a classical RDBMS, even though defining a data model with this feature might give you the feeling that it is. If the keys are not enforced and this is not a relational database as we know it, what is the point? The answer: the query optimizer may use this information to better optimize queries, namely with the concepts of inner join elimination, outer join elimination and join reordering. Defining constraints is similar to other SQL dialects, except that you have to specify them as NOT ENFORCED:
CREATE TABLE gold.inventory (
  date INT64 REFERENCES dim_date(id) NOT ENFORCED,
  item INT64 REFERENCES item(id) NOT ENFORCED,
  warehouse INT64 REFERENCES warehouse(id) NOT ENFORCED,
  quantity INT64,
  PRIMARY KEY(date, item, warehouse) NOT ENFORCED
);

Data operations

Copy data / tables
Copying data from one place to another is a typical part of our daily business as Data Engineers. Let's assume the task is to copy data from a BigQuery dataset called bronze to another dataset called silver within a Google Cloud Platform project called project_x. The naive approach would be a simple SQL query like:
CREATE OR REPLACE TABLE project_x.silver.login_count AS
SELECT
  user_id,
  platform,
  login_count,
  day
FROM project_x.bronze.login_count;
Even though this allows for transformation, in many cases we simply want to copy data from one place to another. The bytes billed for the query above would essentially be the amount of data we have to read from the source. However, we can also get this for free with the following query:
CREATE TABLE project_x.silver.login_count
COPY project_x.bronze.login_count;
Alternatively, the bq CLI tool can be used to achieve the same result:
bq cp project_x:bronze.login_count project_x:silver.login_count
That way, you can copy data at zero cost.

Load data
For data ingestion, Google Cloud Storage is a pragmatic way to solve the task, no matter whether it is a CSV file, ORC / Parquet files from a Hadoop ecosystem, or any other source. Data can easily be uploaded and stored at low cost. It is also possible to create BigQuery tables on top of data stored in GCS. These external tables still utilize the compute infrastructure of BigQuery but do not offer some of its features and performance. Let's assume we upload data from a partitioned Hive table using the ORC storage format. Uploading the data can be achieved using distcp, or simply by getting the data from HDFS first and then uploading it to GCS using one of the available CLI tools for interacting with Cloud Storage.
Assuming we have a partition structure with one partition column called month, the files might look like the following:
/some_orc_table/month=2024-01/000000_0.orc
/some_orc_table/month=2024-01/000000_1.orc
/some_orc_table/month=2024-02/000000_0.orc
Once this data is uploaded to GCS, an external table definition can be created like this:
CREATE EXTERNAL TABLE IF NOT EXISTS project_x.bronze.some_orc_table
WITH PARTITION COLUMNS
OPTIONS(
  format="ORC",
  hive_partition_uri_prefix="gs://project_x/ingest/some_orc_table",
  uris=["gs://project_x/ingest/some_orc_table/*"]
);
It will derive the schema from the ORC files and even detect the partition column. The naive approach to move this data from GCS to BigQuery storage might now be to create a table in BigQuery and then follow the pragmatic INSERT INTO ... SELECT FROM approach. However, similar to the previous example, the bytes billed would reflect the amount of data stored in gs://project_x/ingest/some_orc_table. There is another way, which achieves the same result at zero cost, using the LOAD DATA SQL statement:
LOAD DATA OVERWRITE project_x.silver.some_orc_table (
  user_id INT64,
  column_1 STRING,
  column_2 STRING,
  some_value INT64
)
CLUSTER BY column_1, column_2
FROM FILES (
  format="ORC",
  hive_partition_uri_prefix="gs://project_x/ingest/some_orc_table",
  uris=["gs://project_x/ingest/some_orc_table/*"]
)
WITH PARTITION COLUMNS (
  month STRING
);
Using this statement, we directly get a BigQuery table with the data ingested; there is no need to create an external table first. This query also comes at zero cost. The OVERWRITE is optional, since data can also be appended instead of overwriting the table on every run. As you can see, partition columns can also be specified. Even though no transformation can be applied, there is one major advantage: we can already define cluster columns. That way, we can create an efficient version of the target table for further downstream processing, for free!

Delete partitions
In certain ETL or ELT scenarios, a typical workflow is to have a table partitioned by day and then replace specific partitions based on new data coming from a staging / ingestion table.
Ingest partition example (by author)
BigQuery offers the MERGE statement, but the naive approach is to first delete the affected partitions from the target table and then insert the data. Deleting partitions in such a scenario can be achieved like this:
DELETE FROM silver.target
WHERE day IN (
  SELECT DISTINCT day FROM bronze.ingest
);
Even if day is a partition column in both tables, this operation incurs costs. However, there is again an alternative solution that comes at zero cost:
DROP TABLE silver.target$20240101
With DROP TABLE you can also drop one single partition by appending the suffix $<partition_id>. Of course, the example above drops just one partition. However, with BigQuery's procedural language, we can easily execute the statement in a loop:
FOR x IN (SELECT DISTINCT day FROM bronze.ingest)
DO
  SELECT x; -- replace with DROP TABLE
END FOR;
Alternatively, use Airflow and/or dbt to first select the partitions and then run a templated query in a loop. Getting the distinct partitions of a partitioned table can be done as in the examples above, but this will still cause some costs, even if we only read a single column. Yet again, there is a way to get this almost for free, which we will explore in the next chapter.
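For completeness, here is a sketch of how the loop above could actually execute the DROP statements instead of just selecting the partitions. It assumes the partition-suffix form of DROP TABLE shown above is accepted in your environment, and it builds the $YYYYMMDD suffix with FORMAT_DATE:
FOR x IN (SELECT DISTINCT day FROM bronze.ingest)
DO
  -- Assumption: DROP TABLE with a partition suffix works as described above;
  -- the suffix is derived from the partition column as YYYYMMDD.
  EXECUTE IMMEDIATE FORMAT(
    'DROP TABLE `silver.target$%s`',
    FORMAT_DATE('%Y%m%d', x.day)
  );
END FOR;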
Get distinct partitions for a table
In the examples above, we used the following approach to get the distinct partitions of a partitioned BigQuery table:
SELECT DISTINCT day FROM bronze.ingest
This is how much the query cost me in an example use case I worked on:
Bytes billed: 149.14 GB (= $1.18 depending on location)
BigQuery maintains a lot of valuable metadata about tables, columns and partitions. This can be accessed via the INFORMATION_SCHEMA. We can achieve the very same result by simply using this metadata:
SELECT PARSE_DATE('%Y%m%d', partition_id) AS day
FROM bronze.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'ingest'
Compared with the same use case mentioned above, this is how much the query cost:
Bytes billed: 10 MB (= $0.00008 depending on location)
As you can see, 149 GB vs. 10 MB is a huge difference. With this method, you can get the distinct partitions even for huge tables at almost zero cost.

Do not persist calculated measures
When you start using BigQuery for your first projects, you will most likely stick with the on-demand compute pricing model. With on-demand pricing, you generally have access to up to 2,000 concurrent slots, shared among all queries in a single project. Even with capacity pricing, you will have a minimum of 100 slots. For a lot of the daily ETL / ELT workload, these slots are actually not the performance bottleneck. You can check this yourself by navigating to BigQuery -> Administration -> Monitoring, selecting the correct location and changing the chart to Slot Usage under Chart Configuration. In a lot of cases you will be surprised how few slots you are actually using.
BigQuery Monitoring for slots
How does that relate to saving costs? Let's assume you have a classic fact table, or some table in general, that delivers certain KPIs. This table is then used for analysis / reporting in Looker, Excel, Power BI or other tools. Often these tools automatically generate queries to serve the report or dashboard with the necessary data. These generated queries might not be ideal when it comes to applying BigQuery best practices. In other words, they might end up scanning more data than necessary, which increases the bytes billed. We can avoid this by introducing a view layer on top of our fact tables. Serving tools with data from a view rather than the actual table is a very valuable best practice, as it gives you more flexibility when it comes to schema changes, and it also makes it possible to introduce calculated measures within the view without persisting the data. Of course, this might increase CPU usage when these measures are used, but on the other hand it can drastically reduce the total size of the underlying table.
To illustrate this principle, take the following fact table as a basis:
CREATE TABLE IF NOT EXISTS gold.some_fact_table (
  user_id INT64,
  payment_count INT64,
  value_1 INT64,
  value_2 INT64,
  day DATE
)
PARTITION BY day
CLUSTER BY user_id
OPTIONS (
  partition_expiration_days = 365
);
The basic idea is to introduce a view for stakeholders accessing this data and extend it with calculated measures:
CREATE OR REPLACE VIEW gold.some_fact_view AS
SELECT
  user_id,
  payment_count,
  value_1,
  value_2,
  payment_count > 0 AS is_paying_user,
  value_1 + value_2 AS total_value,
  day
FROM gold.some_fact_table;
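Downstream tools then query the view instead of the table. A hypothetical dashboard-style query could look like the following; the calculated measures are computed on the fly, and the filter on day still benefits from partition pruning:
SELECT
  day,
  COUNTIF(is_paying_user) AS paying_users,  -- measure defined only in the view
  SUM(total_value)        AS total_value    -- measure defined only in the view
FROM gold.some_fact_view
WHERE day >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY day
ORDER BY day;
Because the measures live only in the view definition, changing or adding one is a metadata-only operation and does not require reprocessing the fact table.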
In this example we were able to avoid persisting two INT64 values, each of which uses 8 logical bytes. If our fact table has 1,000,000,000 rows, this means we save:
1,000,000,000 rows * 8 B * 2 columns / 1024 / 1024 / 1024 = ~15 GB
This is not a huge amount of data, but it can mean that BigQuery has to scan 15 GB less in certain situations. In practice, there can be calculated measures that save you much more scanned data.

Summary
Forget hoarding every byte like a dragon guarding its treasure. Instead, learn to burn data through smart management and optimization. By embracing this fiery approach, you'll transform BigQuery from a cost center into a powerful engine for data exploration, allowing you to burn data, not money!
Embrace data modeling best practices
- Utilize the smallest data types possible to minimize storage and processing costs.
- Leverage de-normalization when appropriate to optimize query performance and reduce storage usage.
- Implement partitioning and clustering to enable BigQuery to efficiently scan only the relevant data for your queries.
- Explore nested repeated columns as a way to eliminate redundancy while maintaining data integrity, but be mindful of the limitations regarding clustering.
Master data operations for cost-effectiveness
- Employ CREATE TABLE ... COPY or bq cp commands to copy data between tables without incurring charges.
- Utilize LOAD DATA statements to directly load data from Cloud Storage into BigQuery tables, again at no cost.
- Leverage the power of DROP TABLE with partition suffixes to efficiently remove specific partitions.
- Utilize INFORMATION_SCHEMA to retrieve table metadata like distinct partition values, significantly reducing costs compared to traditional queries.
Design for efficiency and avoid unnecessary data persistence
- Implement a view layer to serve data with calculated measures, preventing the storage of redundant data.
- Monitor your BigQuery slot usage to understand whether slot limitations are a concern, allowing you to focus on optimizing query structures.
By adopting these strategies, you can unlock the true potential of BigQuery, transforming it into a cost-effective engine for data exploration and analysis. Remember, in the realm of BigQuery, it's all about burning data, not money!
Feel free to share your experiences in the comments!
A Definitive Guide to Using BigQuery Efficiently was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
View the full article
  12. Editor's note: The post is part of a series showcasing partner solutions that are Built with BigQuery.
Blockchains generate a lot of data with every transaction. The beauty of Web3 is that all of that data is publicly available. But the multichain and modular expansion of the space has increased the complexity of accessing data, where any project looking to build cross-chain decentralized apps (DApps) has to figure out how to tap into on-chain data that is stored in varying locations and formats. Meanwhile, running indexers to extract the data and make it readable is a time-consuming, resource-intensive endeavor often beyond small Web3 teams' capabilities, since proficiency in coding smart contracts and building indexers are entirely different skills. Having recognized the challenges for builders to leverage one of the most valuable pieces of Web3 (its data!), the Subsquid team set out to build a fully decentralized solution that drastically increases access to data in a permissionless manner.
Subsquid explained
One way to think about the Subsquid Network is as Web3's largest decentralized data lake, existing to ingest, normalize, and structure data from over 100 Ethereum Virtual Machine (EVM) and non-EVM chains. It allows devs to quickly access ('query') data more granularly, and vastly more efficiently, than via legacy RPC node infrastructure. Subsquid Network is horizontally scalable, meaning it can grow alongside archival blockchain data storage. Its query engine is optimized to extract large amounts of data and is a perfect fit for both dApp development (indexing) and analytics. In fact, a total of over 11 billion dollars in decentralized application and L1/L2 value depends on Subsquid indexing. Since September, Subsquid has been shifting from its initial architecture to a permissionless and decentralized format. So far during the testnet, 30,000 participants, including tens of thousands of developers, have built and deployed over 40,000 indexers. Now, the Subsquid team is determined to bring this user base and its data to Google BigQuery.
BigQuery and blockchain
BigQuery is a powerful enterprise data warehouse solution that allows companies and individuals to store and analyze petabytes of data. Designed for large-scale data analytics, BigQuery supports multi-cloud deployments and offers built-in machine learning capabilities, enabling data scientists to create ML models with simple SQL. BigQuery is also fully integrated with Google's own suite of business intelligence and external tools, empowering users to run their own code inside BigQuery using Jupyter Notebooks or Apache Zeppelin. Since 2018, Google has added support for blockchains like Ethereum and Bitcoin to BigQuery. Then, earlier this year, the on-chain data of 11 additional layer-1 blockchain architectures was integrated into BigQuery, including Avalanche, Fantom, NEAR, Polkadot, and Tron. But while it's great to be able to run analytics on public blockchain data, this might not always offer exactly the data a particular developer needs for their app. This is where Subsquid comes in.
Data superpowers for Web3 devs and analysts
Saving custom-curated data to BigQuery lets developers leverage Google's analytics tools to gain insights into how their product is used, beyond the context of one chain or platform. Multi-chain projects can leverage Subsquid in combination with BigQuery to quickly analyze their usage on different chains and gain insights into fees, operating costs, and trends.
With BigQuery, they aren't limited to on-chain data either. After all, Google is the company behind Google Analytics, an advanced analytics suite for website traffic.
Video: Web3 Data Unlocked: Indexing Web3 Data with Subsquid & Google BigQuery. Subsquid developer relations engineer Daria A. demonstrates how to store data indexed with Subsquid in BigQuery and other tools.
Analyzing across domains by combining sets of on-chain activity with social media data and website traffic can help projects understand retention and conversion while identifying points where users drop off, to further improve their workflows.
"BigQuery is quickly becoming an essential product in Web3, as it enables builders to query and analyze one's own data, as well as to leverage a rich collection of datasets already compiled by others. Since it's SQL based, it's extremely easy to explore any data and then run more and more complex queries. With a rich API and complete developer toolkit, it can be connected to virtually anything," writes Dmitry Zhelezov, Subsquid CEO and co-founder. "Now, with the addition of Subsquid indexing, Web3 developers literally have data superpowers. They can build a squid indexer from scratch or use an existing one to get exactly the data they need extremely efficiently. We can't wait to see what this unlocks for builders."
Get started with Subsquid on BigQuery today
Subsquid's support for BigQuery is already feature-complete. Are you interested in incorporating this tool into your Web3 projects? Find out more in the documentation. You can also view an example project demoed on YouTube and open-sourced on GitHub.
The Built with BigQuery advantage for Data Providers and ISVs
Built with BigQuery helps companies like Subsquid build innovative applications with Google Data and AI Cloud. Participating companies can:
- Accelerate product design and architecture through access to designated experts who can provide insight into key use cases, architectural patterns, and best practices.
- Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption.
BigQuery gives Data Providers and ISVs the advantage of a powerful, highly scalable unified AI lakehouse that's integrated with Google Cloud's open, secure, sustainable platform. Click here to learn more about Built with BigQuery.
View the full article
  13. While working with BigQuery for years, I have observed 5 mistakes that are commonly made, even by experienced Data Scientists ... View the full article
  14. Have you ever tried to load a JSON file into BigQuery, only to find out that the file wasn't in the proper newline-delimited format (each line being an object, not an array) that BigQuery expects? Well, you might still be able to load the file using only BigQuery (no other tools necessary) and a little bit of creativity! View the full article
  15. Are you looking to migrate a large number of Hive ACID tables to BigQuery? ACID-enabled Hive tables support transactions and accept update and delete DML operations. In this blog, we will explore migrating Hive ACID tables to BigQuery. The approach explored in this blog works for both compacted (major / minor) and non-compacted Hive tables. Let's first understand the term ACID and how it works in Hive. ACID stands for four traits of database transactions:
- Atomicity (an operation either succeeds completely or fails; it does not leave partial data)
- Consistency (once an application performs an operation, the results of that operation are visible to it in every subsequent operation)
- Isolation (an incomplete operation by one user does not cause unexpected side effects for other users)
- Durability (once an operation is complete, it will be preserved even in the face of machine or system failure)
Starting with version 0.14, Hive supports all ACID properties, which enables it to use transactions, create transactional tables, and run queries like Insert, Update, and Delete on tables. Under the hood, the files of a Hive ACID table are stored in the ORC ACID format. To support ACID features, Hive stores table data in a set of base files and all the insert, update, and delete operation data in delta files. At read time, the reader merges the base file and delta files to present the latest data. As operations modify the table, a lot of delta files are created and need to be compacted to maintain adequate performance. There are two types of compaction, minor and major. Minor compaction takes a set of existing delta files and rewrites them to a single delta file per bucket. Major compaction takes one or more delta files and the base file for the bucket and rewrites them into a new base file per bucket. Major compaction is more expensive but more effective. Organizations configure automatic compactions, but they also need to perform manual compactions when automatic compaction fails. If compaction is not performed for a long time after a failure, it results in a lot of small delta files. Running compaction on this large number of small delta files can become a very resource-intensive operation and can run into failures as well. Some of the issues with Hive ACID tables are:
- NameNode capacity problems due to small delta files.
- Table locks during compaction.
- Running major compactions on Hive ACID tables is a resource-intensive operation.
- Longer time taken for data replication to DR due to small files.
Benefits of migrating Hive ACID tables to BigQuery
Some of the benefits of migrating Hive ACID tables to BigQuery are:
- Once data is loaded into managed BigQuery tables, BigQuery manages and optimizes the data stored in its internal storage and handles compaction. So there will not be any small-file issue like we have with Hive ACID tables.
- The locking issue is resolved, as the BigQuery Storage Read API is gRPC based and highly parallelized.
- As ORC files are completely self-describing, there is no dependency on Hive Metastore DDL. BigQuery has a built-in schema inference feature that can infer the schema from an ORC file and supports schema evolution without any need for tools like Apache Spark to perform schema inference.
Hive ACID table structure and sample data
Here is the schema of the sample Hive ACID table "employee_trans":
hive> show create table employee_trans;
OK
CREATE TABLE `employee_trans`(
  `id` int,
  `name` string,
  `age` int,
  `gender` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans'
TBLPROPERTIES (
  'bucketing_version'='2',
  'transactional'='true',
  'transactional_properties'='default',
  'transient_lastDdlTime'='1657906607')
This sample ACID table "employee_trans" has 3 records:
hive> select * from employee_trans;
OK
1  James  30  M
3  Jeff   45  M
2  Ann    40  F
Time taken: 0.1 seconds, Fetched: 3 row(s)
For every insert, update and delete operation, small delta files are created. This is the underlying directory structure of the Hive ACID-enabled table:
hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans/delete_delta_0000005_0000005_0000
hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans/delete_delta_0000006_0000006_0000
hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans/delta_0000001_0000001_0000
hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans/delta_0000002_0000002_0000
hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans/delta_0000003_0000003_0000
hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans/delta_0000004_0000004_0000
hdfs://hive-cluster-m/user/hive/warehouse/aciddb.db/employee_trans/delta_0000005_0000005_0000
The ORC files in an ACID table are extended with several columns:
struct<
  operation: int,
  originalTransaction: bigInt,
  bucket: int,
  rowId: bigInt,
  currentTransaction: bigInt,
  row: struct<...>
>
Steps to migrate Hive ACID tables to BigQuery
Migrate the underlying Hive table HDFS data
Copy the files present under the employee_trans HDFS directory and stage them in GCS. You can use either the HDFS2GCS solution or DistCp. The HDFS2GCS solution uses open source technologies to transfer data and provides several benefits like status reporting, error handling, fault tolerance, incremental/delta loading, rate throttling, start/stop, checksum validation, byte-to-byte comparison, etc. Here is the high-level architecture of the HDFS2GCS solution. Please refer to the public GitHub URL HDFS2GCS to learn more about this tool.
The source location may contain extra files that we don't necessarily want to copy. Here, we can use filters based on regular expressions to do things such as copying files with the .ORC extension only.
Load ACID tables as-is to BigQuery
Once the underlying Hive ACID table files are copied to GCS, use the BQ load tool to load the data into a BigQuery base table.
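As an illustration, the staged files could also be loaded with the LOAD DATA SQL statement instead of the bq CLI. In the sketch below, the target table name matches the query used in the next step, while the GCS staging path is hypothetical; BigQuery infers the schema, including the extended ACID columns, from the ORC files:
LOAD DATA INTO `hiveacid-sandbox.hivetobq.basetable`
FROM FILES (
  format = 'ORC',
  -- hypothetical staging location of the copied base/delta ORC files
  uris = ['gs://hiveacid-sandbox-staging/employee_trans/*']
);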
This base table will have all the change events.
Data verification
Run "select *" on the base table to verify that all the changes are captured. Note: "select * ..." is used here for demonstration purposes only and is not a stated best practice.
Loading to the target BigQuery table
The following query selects only the latest version of all records from the base table, discarding the intermediate delete and update operations. You can either load the results of this query into a target table using an on-demand scheduled query with the overwrite option, or alternatively create this query as a view on the base table to get the latest records from the base table directly.
WITH
  latest_records_desc AS (
    SELECT
      Row.*,
      operation,
      ROW_NUMBER() OVER (PARTITION BY originalTransaction ORDER BY originalTransaction ASC, bucket ASC, rowId ASC, currentTransaction DESC) AS rownum
    FROM
      `hiveacid-sandbox.hivetobq.basetable` )
SELECT id, name, age, gender
FROM
  latest_records_desc
WHERE
  rownum = 1
  AND operation != 2
Once the data is loaded into the target BigQuery table, you can perform validation using the steps below:
a. Use the Data Validation Tool (DVT) to validate the Hive ACID table and the target BigQuery table. DVT provides an automated and repeatable solution to perform schema and validation tasks. This tool supports the following validations:
- Column validation (count, sum, avg, min, max, group by)
- Row validation (BQ, Hive, and Teradata only)
- Schema validation
- Custom query validation
- Ad hoc SQL exploration
b. If you have analytical HiveQLs running on this ACID table, translate them using the BigQuery SQL translation service and point them to the target BigQuery table.
Hive DDL migration (optional)
Since ORC is self-contained, leverage BigQuery's schema inference feature when loading; there is no dependency on extracting Hive DDLs from the Metastore. But if you have an organization-wide policy to pre-create datasets and tables before migration, this step will be useful and a good starting point.
a. Extract Hive ACID DDL dumps and translate them using the BigQuery translation service to create equivalent BigQuery DDLs. There is a batch SQL translation service to bulk translate exported HQL (Hive Query Language) scripts from a source metadata bucket in Google Cloud Storage into BigQuery-equivalent SQL in a target GCS bucket. You can also use the BigQuery interactive SQL translator, a live, real-time SQL translation tool across multiple SQL dialects, to translate a query in the HQL dialect into a BigQuery Standard SQL query. This tool can reduce the time and effort needed to migrate SQL workloads to BigQuery.
b. Create managed BigQuery tables using the translated DDLs. Here is a screenshot of the translation service in the BigQuery console: click "Translate" to translate the HiveQLs and "Run" to execute the query. For creating tables from batch-translated bulk SQL queries, you can use the Airflow BigQuery operator (BigQueryInsertJobOperator) to run multiple queries.
After the DDLs are converted, copy the ORC files to GCS and perform ELT in BigQuery. The pain points of Hive ACID tables are resolved when migrating to BigQuery. Once you migrate the ACID tables to BigQuery, you can leverage BigQuery ML and GeoViz capabilities for real-time analytics. If you are interested in exploring more, please check out the additional resources section.
Additional Resources
- Hive ACID
- ACID ORC Format
- HDFS2GCS Solution
- DistCp
- Data Validation Tool
- BigQuery Translation Service
  16. At Google Cloud, we’re invested in building data analytics products with a customer-first mindset. Our engineering team is thrilled to share recent feature enhancements and product updates that we’ve made to help you get even more value out of BigQuery, Google Cloud’s enterprise data warehouse...
  17. BigQuery is positioned as a unique analytics data warehousing service with its serverless architecture that allows it to operate at scale and speed to provide incredibly fast SQL analytics over large datasets. We started a new blog series—BigQuery Explained—to uncover and explain BigQuery's concepts, features and improvements. This blog post is the home page for the series, with links to the existing and upcoming posts for readers to refer to. Here are links to the blog posts in this series:
- Overview: This post dives into how data warehouses change business decision making, how BigQuery solves problems with traditional data warehouses, and gives a high-level overview of BigQuery architecture and how to quickly get started with BigQuery.
- Storage Overview: This post dives into BigQuery storage organization and storage format, and introduces partitioning and clustering data for optimal performance.
- Data Ingestion: In this post, we cover options to load data into BigQuery. It dives into batch ingestion and introduces streaming, the data transfer service and query materialization.
- Querying your Data: This post covers querying data with BigQuery, the lifecycle of a SQL query, standard & materialized views, and saving and sharing queries.
- Working with Joins, Nested & Repeated Data: This post looks into joins with BigQuery, optimizing join patterns and nested and repeated fields for denormalizing data.
We have more articles coming soon covering BigQuery's features and concepts. Stay tuned. Thank you for reading! Have a question or want to chat? Find me on Twitter or LinkedIn. Many thanks to Alicia Williams for helping with the posts.