Google Cloud Next made a big splash in Las Vegas this week! From our opening keynote showcasing incredible customer momentum to exciting product announcements, we covered how AI is transforming the way that companies work. You can catch up on the highlights in our 14-minute keynote recap! Developers were front and center at our Developer keynote and in our buzzing Innovators Hive on the Expo floor (which was triple the size this year!). Our nearly 400 partner sponsors were also deeply integrated throughout Next, bringing energy from the show floor to sessions and evening events throughout the week. Last year, we talked about the exciting possibilities of generative AI, and this year it was great to showcase how customers are now using it to transform the way they work. At Next ‘24, we featured 300+ customer and partner AI stories, 500+ breakout sessions, hands-on demos, interactive training sessions, and so much more. It was a jam-packed week, so we’ve put together a summary of our announcements which highlight how we’re delivering the new way to cloud. Read on for a complete list of the 218 (yes, you read that right) announcements from Next ‘24: Gemini for Google Cloud We shared how Google's Gemini family of models will help teams accomplish more in the cloud, including: 1. Gemini for Google Cloud, a new generation of AI assistants for developers, Google Cloud services, and applications. 2. Gemini Code Assist, which is the evolution of Duet AI for Developers. 3. Gemini Cloud Assist, which helps cloud teams design, operate, and optimize their application lifecycle. 4. Gemini in Security Operations, generally available at the end of this month, converts natural language to new detections, summarizes event data, recommends actions to take, and navigates users through the platform via conversational chat. 5. Gemini in BigQuery, in preview, enables data analysts to be more productive, improve query performance and optimize costs throughout the analytics lifecycle.
6. Gemini in Looker, in private preview, provides a dedicated space in Looker to initiate a chat on any topic with your data and derive insights quickly. 7. Gemini in Databases, also in preview, helps developers, operators, and database administrators build applications faster using natural language; manage, optimize and govern an entire fleet of databases from a single pane of glass; and accelerate database migrations. Customer Stories We shared new customer announcements, including: 8. Cintas is leveraging Google Cloud’s gen AI to develop an internal knowledge center that will allow its customer service and sales employees to easily find key information. 9. Bayer will build a radiology platform that will help Bayer and other companies create and deploy AI-first healthcare apps that assist radiologists, ultimately improving efficiency and diagnosis turn-around time. 10. Best Buy is leveraging Google Cloud’s Gemini large language model to create new and more convenient ways to give customers the solutions they need, starting with gen AI virtual assistants that can troubleshoot product issues, reschedule order deliveries, and more. 11. Citadel Securities used Google Cloud to build the next generation of its quantitative research platform that increased its research productivity and price-performance ratio. 12. Discover Financial is transforming customer experience by bringing gen AI to its customer contact centers to improve agent productivity through personalized resolutions, intelligent document summarization, real-time search assistants, and enhanced self-service options. 13. IHG Hotels & Resorts is using Gemini to build a generative AI-powered chatbot to help guests easily plan their next vacation directly in the IHG Hotels & Rewards mobile app. 14. Mercedes-Benz will expand its collaboration with Google Cloud, using our AI and gen AI technologies to advance customer-facing use cases across e-commerce, customer service, and marketing. 15. 
Orange is expanding its partnership with Google Cloud to deploy generative AI closer to Orange’s and its customers’ operations to help meet local requirements for trusted cloud environments and accelerate gen AI adoption and benefits across autonomous networks, workforce productivity, and customer experience. 16. WPP will leverage Google Cloud’s gen AI capabilities to deliver personalization, creativity, and efficiency across the business. Following the adoption of Gemini, WPP is already seeing internal impacts, including real-time campaign performance analysis, streamlined content creation processes, AI narration, and more. 17. Covered California, California’s health insurance marketplace, will simplify the healthcare enrollment process using Google Cloud’s Document AI, enabling the organization to verify more than 50,000 healthcare documents with an 84% verification rate per month. Workspace and collaboration The next wave of innovations and enhancements is coming to Google Workspace: 18. Google Vids, a key part of our Google Workspace innovations, is a new AI-powered video creation app for work that sits alongside Docs, Sheets and Slides. Vids will be released to Workspace Labs in June. 19. Gemini is coming to Google Chat in preview, giving you an AI-powered teammate to summarize conversations, answer questions, and more. 20. The new AI Meetings and Messaging add-on is priced at $10 per user, per month, and includes: Take notes for me, now in preview; Translate for me, coming in June, which automatically detects and translates captions in Meet, with support for 69 languages; and automatic translation of messages and on-demand conversation summaries in Google Chat, coming later this year. 21. Using large language models, Gmail can now block 20% more spam and evaluate 1,000 times more user-reported spam every day. 22.
A new AI Security add-on allows IT teams to automatically classify and protect sensitive files in Google Drive, and is available for $10 per user, per month. 23. We’re extending DLP controls and classification labels to Gmail in beta. 24. We’re adding experimental support for post-quantum cryptography (PQC) in client-side encryption with our partners Thales and Fortanix. 25. Voice prompting and instant polish in Gmail: Send emails easily when you’re on the go with voice input in Help me write, and convert rough notes to a complete email with one click. 26. A new tables feature in Sheets (generally available in the coming weeks) formats and organizes data with a sleek design and a new set of building blocks, from project management to event planning templates, with automatic alerts based on custom triggers like a change in a status field. 27. Tabs in Docs (generally available in the coming weeks) allow you to organize information in a single document rather than linking to multiple documents or searching through Drive. 28. Docs now supports full-bleed cover images that extend from one edge of your browser to the other; generally available in the coming weeks. 29. Generally available in the coming weeks, Chat will support increased member capacity of up to 500,000 in spaces. 30. Messaging interoperability for Slack and Teams is now generally available through our partner Mio. AI infrastructure 31. Cloud TPU v5p is now generally available. 32. Google Kubernetes Engine (GKE) now supports Cloud TPU v5p and TPU multi-host serving, also generally available. 33. A3 Mega compute instance powered by NVIDIA H100 GPUs offers double the GPU-to-GPU networking bandwidth of A3, and will be generally available in May. 34. Confidential Computing is coming to the A3 VM family, in preview later this year. 35.
The NVIDIA Blackwell GPU platform will be available on the AI Hypercomputer architecture in two configurations: NVIDIA HGX B200 for the most demanding AI, data analytics, and HPC workloads; and the liquid-cooled GB200 NVL72 GPU for real-time LLM inference and training massive-scale models. 36. New caching capabilities for Cloud Storage FUSE improve training throughput and serving performance, and are generally available. 37. The Parallelstore high-performance parallel filesystem now includes caching in preview. 38. Hyperdisk ML in preview is a next-generation block storage service optimized for AI inference/serving workloads. 39. The open-source MaxDiffusion is a new high-performance and scalable reference implementation for diffusion models. 40. MaxText, a JAX LLM, now supports new LLMs including Gemma, GPT3, Llama 2 and Mistral across both Cloud TPUs and NVIDIA GPUs. 41. PyTorch/XLA 2.3 will follow the upstream release later this month, bringing single program, multiple data (SPMD) auto-sharding, and asynchronous distributed checkpointing features. 42. For Hugging Face PyTorch users, the Hugging Face Optimum-TPU package lets you train and serve Hugging Face models on TPUs. 43. JetStream is a new open-source, throughput- and memory-optimized LLM inference engine for XLA devices (starting with TPUs); it supports models trained with both JAX and PyTorch/XLA, with optimizations for popular open models such as Llama 2 and Gemma. 44. Google models will be available as NVIDIA NIM inference microservices. 45. Dynamic Workload Scheduler now offers two modes: flex start mode (in preview), and calendar mode (in preview). 46. We shared the latest performance results from MLPerf™ Inference v4.0 using A3 virtual machines (VMs) powered by NVIDIA H100 GPUs. 47. We shared performance benchmarks for Gemma models using Cloud TPU v5e and JetStream. 48.
We introduced ML Productivity Goodput, a new metric to measure the efficiency of an overall ML system, as well as an API to integrate into your projects, and methods to maximize ML Productivity Goodput. Vertex AI 49. Gemini 1.5 Pro is now available in public preview in Vertex AI, bringing the world’s largest context window to developers everywhere. 50. Gemini 1.5 Pro on Vertex AI can now process audio streams including speech, and the audio portion of videos. 51. Imagen 2.0, our family of image generation models, can now be used to create short, 4-second live images from text prompts. 52. Image editing is generally available in Imagen 2.0, including inpainting/outpainting and digital watermarking powered by Google DeepMind’s SynthID. 53. We added CodeGemma, a new model from our Gemma family of lightweight models, to Vertex AI. 54. Vertex AI has expanded grounding capabilities, including the ability to directly ground responses with Google Search, now in public preview. 55. Vertex AI Prompt Management, in preview, helps teams improve prompt performance. 56. Vertex AI Rapid Evaluation, in preview, helps users evaluate model performance when iterating on the best prompt design. 57. Vertex AI AutoSxS is now generally available, and helps teams compare the performance of two models. 58. We expanded data residency guarantees for data stored at-rest for Gemini, Imagen, and Embeddings APIs on Vertex AI to 11 new countries: Australia, Brazil, Finland, Hong Kong, India, Israel, Italy, Poland, Spain, Switzerland, and Taiwan. 59. When using Gemini 1.0 Pro and Imagen, you can now limit machine-learning processing to the United States or European Union. 60. Vertex AI hybrid search, in preview, integrates vector-based and keyword-based search techniques to ensure relevant and accurate responses for users. 61. 
The new Vertex AI Agent Builder, in preview, lets developers build and deploy gen AI experiences using natural language or open-source frameworks like LangChain on Vertex AI. 62. Vertex AI includes two new text embedding models in public preview: the English-only text-embedding-preview-0409, and the multilingual text-multilingual-embedding-preview-0409. Core infrastructure 63. We expanded Google Cloud’s compute portfolio, with major product releases spanning compute and storage for general-purpose workloads, as well as for more specialized workloads like SAP and high-performance databases. 64. Google Axion is our first custom Arm-based CPU designed for the data center, and will be in preview in the coming months. 65. Now in preview, the Compute Engine C4 general-purpose VM provides high performance paired with a controlled maintenance experience for your mission-critical workloads. 66. The general-purpose N4 machine series is built for price-performance with Dynamic Resource Management, and is generally available. 67. C3 bare-metal machines, available in an upcoming preview, provide workloads with direct access to the underlying server’s CPU and memory resources. 68. New X4 memory-optimized instances are now in preview, through this interest form. 69. Z3 VMs are designed for storage-dense workloads that require SSD, and are generally available. 70. Hyperdisk Storage Pools Advanced Capacity, in general availability, and Advanced Performance in preview, allow you to purchase and manage block storage capacity in a pool that’s shared across workloads. 71. Coming to general availability in May, Hyperdisk Instant Snapshots provide near-zero RPO/RTO for Hyperdisk volumes. 72. Google Compute Engine users can now use zonal flexibility, VM family flexibility, and mixed on-demand and spot consumption to deploy their VMs. As part of Google Distributed Cloud (GDC) offering, we announced: 73.
A generative AI search packaged solution powered by Gemma open models will be available in preview in Q2 2024 on GDC to help customers retrieve and analyze data at the edge or on-premises. 74. GDC has achieved ISO27001 and SOC2 compliance certifications. 75. A new managed Intrusion Detection and Prevention Solution (IDPS) integrates Palo Alto Networks threat prevention technology with GDC, and is now generally available. 76. GDC Sandbox, in preview, helps application developers build and test services designed for GDC in a Google Cloud environment, without needing to navigate the air-gap and physical hardware. 77. A preview GDC storage flexibility feature can help you grow your storage independent of compute, with support for block, file, or object storage. 78. GDC can now run in disconnected mode for up to seven days, and offers a suite of offline management features to help ensure deployments and workloads are accessible and working while they are disconnected; this capability is generally available. 79. New Managed GDC Providers who can sell GDC as a managed service include Clarence, T-Systems, and WWT, and a new Google Cloud Ready - Distributed Cloud badge signals that a solution has been tuned for GDC. 80. GDC servers are now available with an energy-efficient NVIDIA L4 Tensor Core GPU. 81. Google Distributed Cloud Hosted (GDC Hosted) is now authorized to host Top Secret and Secret missions for the U.S. Intelligence Community, and Top Secret missions for the Department of Defense (DoD). From our Google Cloud Networking family, we announced: 82. Gemini Cloud Assist, in preview, provides AI-based assistance to solve a variety of networking tasks such as generating configurations, recommending capacity, correlating changes with issues, identifying vulnerabilities, and optimizing performance. 83.
Now generally available, the Model as a Service Endpoint solution, which uses Private Service Connect, Cloud Load Balancing, and App Hub, lets model creators own the model service endpoint to which application developers then connect. 84. Later this year, Cloud Load Balancing will add enhancements for inference workloads: Cloud Load Balancing with custom metrics, Cloud Load Balancing for streaming inference, and Cloud Load Balancing with traffic management for AI models. 85. Cloud Service Mesh is a fully managed service mesh that combines Traffic Director’s control plane and Google’s open-source Istio-based service mesh, Anthos Service Mesh. A service-centric Cross-Cloud Network delivers a consistent, secure experience from any cloud to any service, and includes the following enhancements: 86. Private Service Connect transitivity over Network Connectivity Center, available in preview this quarter, enables services in a spoke VPC to be transitively accessible from other spoke VPCs. 87. Cloud NGFW Enterprise (formerly Cloud Firewall Plus), now GA, provides network threat protection powered by Palo Alto Networks, plus network security posture controls for org-wide perimeter and Zero Trust microsegmentation. 88. Identity-based authorization with mTLS integrates the Identity-Aware Proxy with our internal application Load Balancer to support Zero Trust network access, including client-side and, soon, back-end mutual TLS. 89. In-line network data-loss prevention (DLP), in preview soon, integrates Symantec DLP into Cloud Load Balancers and Secure Web Proxy using Service Extensions. 90. Partners Imperva, HUMAN Security, Palo Alto Networks and Traceable are integrating their advanced web protection services into Service Extensions, as are web services providers Cloudinary, Nagra, Queue-it, and Datadog. 91. Service Extensions now has a library of code examples to customize origin selection, adjust headers, and more. 92.
Private Service Connect is now fully integrated with Cloud SQL, and generally available. There are many improvements to our storage offerings: 93. Generate insights with Gemini lets you use natural language to analyze your storage footprint, optimize costs, and enhance security across billions of objects. It is available now through the Google Cloud console as an allowlist experimental release. 94. Google Cloud NetApp Volumes is expanding to 15 new Google Cloud regions in Q2’24 (GA) and includes a number of enhancements: dynamically migrating files by policy to lower-cost storage based on access frequency (in preview Q2’24); increasing Premium and Extreme service levels up to 1PB in size, with throughput performance up to 3X (preview Q2’24). NetApp Volumes also includes a new Flex service level enabling volumes as small as 1GiB. 95. Filestore now supports single-share backup for Filestore Persistent Volumes and GKE (generally available) and NFS v4.1 (preview), plus expanded Filestore Enterprise capacity up to 100TiB. For Cloud Storage: 96. Cloud Storage Anywhere Cache now uses zonal SSD read cache across multiple regions within a continent (allowlist GA). 97. Cloud Storage soft delete protects against accidental or malicious deletion of data by preserving deleted items for a configurable period of time (generally available). 98. The new Cloud Storage managed folders resource type allows granular IAM permissions to be applied to groups of objects (generally available). 99. Tag-based at-scale backup helps manage data protection for Compute Engine VMs (generally available). 100. The new high-performance backup option for SAP HANA leverages persistent disk (PD) snapshot capabilities for database-aware backups (generally available). 101. As part of Backup and DR Service Report Manager, you can now customize reports with data from Google Cloud Backup and DR using Cloud Monitoring, Cloud Logging, and BigQuery (generally available). Databases 102. 
Database Studio, a part of Gemini in Databases, brings SQL generation and summarization capabilities to our rich SQL editor in the Google Cloud console, as well as an AI-driven chat interface. 103. Database Center lets operators manage an entire fleet of databases through intelligent dashboards that proactively assess availability, data protection, security, and compliance issues, as well as smart recommendations to optimize performance and troubleshoot issues. 104. Database Migration Service is also integrated with Gemini in Databases, including assistive code conversion (e.g., from Oracle to PostgreSQL) and explainability features. Likewise, AlloyDB gains a lot of new functionality: 105. AlloyDB AI lets gen AI developers build applications that accurately query data with natural language, just like they do with SQL; available now in AlloyDB Omni. 106. AlloyDB AI now includes a new pgvector-compatible index based on Google’s approximate nearest neighbor algorithms, or ScaNN; it’s available as a technology preview in AlloyDB Omni. 107. AlloyDB model endpoint management makes it easier to call remote Vertex AI, third-party, and custom models; available in AlloyDB Omni today and soon on AlloyDB in Google Cloud. 108. AlloyDB AI “parameterized secure views” secures data based on end-users’ context; available now in AlloyDB Omni. Bigtable, which turns 20 this year, got several new features: 109. Bigtable Data Boost, a pre-GA offering, delivers high-performance, workload-isolated, on-demand processing of transactional data, without disrupting operational workloads. 110. Bigtable authorized views, now generally available, allow multiple teams to leverage the same tables and securely share data directly from the database. 111. New Bigtable distributed counters in preview process high-frequency event data like clickstreams directly in the database. 112.
Bigtable large nodes, the first of several workload-optimized node shapes, offer more performance stability at higher server utilization rates, and are in private preview. Memorystore for Redis Cluster, meanwhile: 113. Now supports both AOF (Append Only File) and RDB (Redis Database)-based persistence and has new node shapes that offer better performance and cost management. 114. Offers ultra-fast vector search, now generally available. 115. Includes new configuration options to tune max clients, max memory, max memory policies, and more, now in preview. Firestore users, take note: 116. Gemini Code Assist now incorporates assistive capabilities for developing with Firestore. 117. Firestore now has built-in support for vector search using exact nearest neighbors, the ability to automatically generate vector embeddings using popular embedding models via a turn-key extension, and integrations with popular generative AI libraries such as LangChain and LlamaIndex. 118. Firestore Query Explain in preview can help you troubleshoot your queries. 119. Firestore now supports Customer Managed Encryption Keys (CMEK) in preview, which allows you to encrypt data stored at rest using your own specified encryption key. 120. You can now deploy Firestore in any available supported Google Cloud region, and Firestore’s Scheduled Backup feature can now retain backups for up to 98 days, up from seven days. 121. Cloud SQL Enterprise Plus edition now offers advanced failover capabilities such as orchestrated switchover and switchback. Data analytics 122. BigQuery is now Google Cloud’s single integrated platform for data to AI workloads, with BigLake, BigQuery’s unified storage engine, providing a single interface across BigQuery native and open formats for analytics and AI workloads. 123. BigQuery better supports Iceberg, DDL, DML and high-throughput support in preview, while BigLake now supports the Delta file format, also in preview. 124.
BigQuery continuous queries are in preview, providing continuous SQL processing over data streams, enabling real-time pipelines with AI operators or reverse ETL. The above-mentioned Gemini in BigQuery enables all manner of new capabilities and offerings: 125. New BigQuery integrations with Gemini models in Vertex AI support multimodal analytics and vector embeddings, and fine-tuning of LLMs. 126. BigQuery Studio provides a collaborative data workspace, the choice of SQL, Python, Spark or natural language directly, and new integrations for real-time streaming and governance; it is now generally available. 127. The new BigQuery data canvas provides a notebook-like experience with embedded visualizations and natural language support courtesy of Gemini. 128. BigQuery can now connect models in Vertex AI with enterprise data, without having to copy or move data out of BigQuery. 129. You can now use BigQuery with Gemini 1.0 Pro Vision to analyze both images and videos by combining them with your own text prompts using familiar SQL statements. 130. Column-level lineage in BigQuery and expanded lineage capabilities for Vertex AI pipelines will be in preview soon. Other updates to our data analytics portfolio include: 131. Apache Kafka for BigQuery as a managed service is in preview, to enable streaming data workloads based on open source APIs. 132. A serverless engine for Apache Spark integrated within BigQuery Studio is now in preview. 133. Dataplex features expanded data-to-AI governance capabilities in preview. Developers & operators Gemini Code Assist includes several new enhancements: 134. Full codebase awareness, in preview, uses Gemini 1.5 Pro to make complex changes, add new features, and streamline updates to your codebase. 135. A new code transformation feature available today in Cloud Workstations and Cloud Shell Editor lets you use natural language prompts to tell Gemini Code Assist to analyze, refactor, and optimize your code. 136. 
Gemini Code Assist now has extended local context, automatically retrieving relevant local files from your IDE workspace and displaying references to the files used. 137. With code customization in private preview, Gemini Code Assist lets you integrate private codebases and repositories for hyper-personalized code generation and completions, and connects to GitLab, GitHub, and Bitbucket source-code repositories. 138. Gemini Code Assist extends to Apigee and Application Integration in preview, to access and connect your applications. 139. We extended our partnership with Snyk to Gemini Code Assist, letting you learn about vulnerabilities and common security topics right within your IDE. 140. The new App Hub provides an accurate, up-to-date representation of deployed applications and their resource dependencies. Integrated with Gemini Cloud Assist, App Hub is generally available. Users of our Cloud Run and Google Kubernetes Engine (GKE) runtime environments can look forward to a variety of features: 141. Cloud Run application canvas lets developers generate, modify and deploy Cloud Run applications with integrations to Vertex AI, Firestore, Memorystore, and Cloud SQL, as well as load balancing and Gemini Cloud Assist. 142. GKE now supports container and model preloading to accelerate workload cold starts. 143. GPU sharing with NVIDIA Multi-Process Service (MPS) is now offered in GKE, enabling concurrent processing on a single GPU. 144. GKE supports GCS FUSE read caching, now generally available, using a local directory as a cache to accelerate repeat reads for small and random I/Os. 145. GKE Autopilot mode now supports NVIDIA H100 GPUs, TPUs, reservations, and Compute Engine committed use discounts (CUDs). 146. Gemini Cloud Assist in GKE is available to help with optimizing costs, troubleshooting, and synthetic monitoring. Cloud Billing tools help you track and understand Google Cloud spending, pay your bill, and optimize your costs; here are a few new features: 147.
Support for Cloud Storage costs at the bucket level and storage tags is included out of the box with Cloud Billing detailed data exports to BigQuery. 148. A new BigQuery data view for FOCUS allows users to compare costs and usage across clouds. 149. You can now convert cost management reports into BigQuery billing queries right from the Cloud Billing console. 150. A new Cloud FinOps Anomaly Detection feature is in private preview. 151. FinOps hub is now generally available and adds support to view top savings opportunities, and a preview of our FinOps hub dashboard lets you analyze costs by project, region, or machine type. 152. A new CUD Analysis solution is available across Google Compute Engine resource families including TPU v5e, TPU v5p, A3, H3, and C3D. 153. There are new spend-based CUDs available for Memorystore, AlloyDB, Bigtable, and Dataflow. Security Building on natural language search and case summaries in Chronicle, Gemini in Security Operations is coming to the entire investigation lifecycle, including: 154. A new assisted investigation feature, generally available at the end of this month, that guides analysts through their workflow in Chronicle Enterprise and Chronicle Enterprise Plus. 155. The ability to ask Gemini for the latest threat intelligence from Mandiant directly in-line, including any indicators of compromise found in their environment. 156. Gemini in Threat Intelligence, in public preview, allows you to tap into Mandiant’s frontline threat intelligence using conversational search. 157. VirusTotal now automatically ingests OSINT reports, which Gemini summarizes directly in the platform; generally available now. 158. Gemini in Security Command Center now lets security teams search for threats and other security events using natural language in preview, provides summaries of critical- and high-priority misconfiguration and vulnerability alerts, and summarizes attack paths. 159.
Gemini Cloud Assist also helps with security tasks, via: IAM Recommendations, which can provide straightforward, contextual recommendations to remove roles from over-permissioned users or service accounts; Key Insights, which help during encryption key creation based on its understanding of your data, your encryption preferences, and your compliance needs; and Confidential Computing Insights, which recommends options for adding confidential computing protection to sensitive workloads based on your data and your compute usage. Other security news includes: 160. The new Chrome Enterprise Premium, now generally available, combines the popular browser with Google threat and data protection, Zero Trust access controls, enterprise policy controls, and security insights and reporting. 161. Applied threat intelligence in Google Security Operations, now generally available, automatically takes global threat visibility and applies it to each customer’s unique environment. 162. Security Command Center Enterprise is now generally available and includes Mandiant Hunt, now in preview. 163. Identity and Access Management Privileged Access Manager (PAM), now available in preview, provides just-in-time, time-bound, and approval-based access elevations. 164. Identity and Access Management Principal Access Boundary (PAB) is a new, identity-centered control now in preview that enforces restrictions on IAM principals. 165. Cloud Next-Gen Firewall (NGFW) Enterprise is now generally available, including threat protection from Palo Alto Networks. 166. Cloud Armor Enterprise is now generally available and offers a pay-as-you-go model that includes advanced network DDoS protection, web application firewall capabilities, network edge policy, adaptive protection, and threat intelligence. 167. Sensitive Data Protection integration with Cloud SQL is now generally available, and is deeply integrated into the Security Command Center Enterprise risk engine. 168.
Key management with Autokey is now in preview, simplifying the creation and management of customer-managed encryption keys (CMEK). 169. Bare metal HSM deployments in PCI-compliant facilities are now available in more regions. 170. Regional Controls for Assured Workloads is now in preview and is available in 32 cloud regions in 14 countries. 171. Audit Manager automates control verification with proof of compliance for workloads and data on Google Cloud, and is in preview. 172. Advanced API Security, part of Apigee API Management, now offers shadow API detection in preview. As part of our Confidential Computing portfolio, we announced: 173. Confidential VMs on Intel TDX are now in preview and available on the C3 machine series with Intel TDX. For AI and ML workloads, we support Intel AMX, which provides CPU-based acceleration by default on C3 series Confidential VMs. 174. Confidential VMs on general-purpose N2D machine series with AMD Secure Encrypted Virtualization-Secure Nested Paging (SEV-SNP) are now in preview. 175. Live Migration on Confidential VMs is now in general availability on N2D machine series across all regions. 176. Confidential VMs on the A3 machine series with NVIDIA Tensor Core H100 GPUs will be in private preview later this year. Migration 177. The Rapid Migration Program (RaMP) now covers migration and modernization use cases that span across applications and the underlying infrastructure, data and analytics. For example, as part of RaMP for Storage: Storage egress costs from Amazon S3 to Google Cloud Storage are now completely free. Cloud Storage's client libraries for Python, Node.js, and Java now support parallelization of uploads and downloads. Migration Center also includes several excellent new additions: 178. Migration use case navigator, for mapping out how to migrate your resources (servers, databases, data warehouses, etc.)
from on-prem and other clouds directly into Google Cloud, including new Cloud Spend Estimators for rapid TCO assessments of on-premises VMware and Exadata environments. 179. Database discovery and assessment for Microsoft SQL Server, PostgreSQL and MySQL to Cloud SQL migrations. Google Cloud VMware Engine, an integrated VMware service on Google Cloud now offers: 180. The intent to support VMware Cloud Foundation License Portability 181. General availability of larger instance type (ve2-standard-128) offerings. 182. Networking enhancements including next-gen VMware Engine Networking, automated zero-config VPC peering, and Cloud DNS for workloads. 183. Terraform Infrastructure as Code Automation. Migrate to Virtual Machines helps teams migrate their workloads. Here’s what we announced: 184. A new Disk Migration solution for migrating disk volumes to Google Cloud. 185. Image Import (preview) as a managed service. 186. BIOS to UEFI Conversion in preview, which automatically converts bootloaders to the newer UEFI format. 187. Amazon Linux Conversion in preview, for converting Amazon Linux to Rocky Linux in Google Compute Engine. 188. CMEK support, so you maintain control over your own encryption keys. When replatforming VMs to containers in GKE or Cloud Run, there’s: 189. The new Migrate to Containers (M2C) CLI, which generates artifacts that you can deploy to either GKE or Cloud Run. 190. M2C Cloud Code Extension, in preview, which migrates applications from VMs to containers running on GKE directly in Visual Studio. Here are the enhancements to our Database Migration Service: 191. Database Migration Service now offers AI-powered last-mile code conversion from Oracle to PostgreSQL. 192. Database Migration Service now performs migration from SQL Server (on any platform) to Cloud SQL for SQL Server, in preview. 193. In Datastream, SQL Server as a source for CDC performs data movement to BigQuery destinations. Migrating from a mainframe? 
Here are some new capabilities: 194. The Mainframe Assessment Tool (MAT) now powered by gen AI analyzes the application codebase, performing fit assessment and creating application-level summarization and test cases. 195. Mainframe Connector sends a copy of your mainframe data to BigQuery for off-mainframe analytics. 196. G4 refactors mainframe application code (COBOL, RPG, JCL etc.) and data from their original state/programming language to a modern stack (JAVA). 197. Dual Run lets you run a new system side by side with your existing mainframe, duplicating all transactions and checking for completeness, quality and effectiveness of the new solution. Partners & ecosystem 198. Partners showcased more than 100 solutions that leverage Google AI on the Next ‘24 show floor. 199. We announced the 2024 Google Cloud Partner of the Year winners. 200. Gemini models will be available in the SAP Generative AI Hub. 201. GitLab announced that its authentication, security, and CI/CD integrations with Google Cloud are now in public beta for customers. 202. Palo Alto Networks named Google Cloud its AI provider of choice and will use Gemini models to improve threat analysis and incident summarization for its Cortex XSIAM platform. 203. Exabeam is using Google Cloud AI to improve security outcomes for customers. 204. Global managed security services company Optiv is expanding support for Google Cloud products. 205. Alteryx, Dynatrace, and Harness are launching new features built with Google Cloud AI to automate workflows, support data governance, and enable users to better observe and manage the data. 206. A new Generative AI Services Specialization is available for partners who demonstrate the highest level of technical proficiency with Google Cloud gen AI. 207. We introduced new Generative AI Delivery Excellence and Technical Bootcamps, and advanced Challenge Labs in generative AI. 208. 
The Google Cloud Ready - BigQuery initiative has 21 new partners: Actable, AgileData, Amplitude, Boostkpi, CaliberMind, Calibrate Analytics, CloudQuery, DBeaver, Decube, DinMo, Estuary, Followrabbit, Gretel, Portable, Precog, Retool, SheetGo, Tecton, Unravel Data, Vallidio, and Vaultree 209. The Google Cloud Ready - AlloyDB initiative has six new partners: Boostkpi, DBeaver, Estuary, Redis, Thoughtspot, and SeeBurger 210. The Google Cloud Ready - Cloud SQL initiative has five new partners: BoostKPI, DBeaver, Estuary, Redis, and Thoughtspot 211. Crowdstrike is integrating its Falcon Platform with Google Cloud products. Members of our Google for Startups program, meanwhile, will be interested to learn that: 212. The Google for Startups Cloud Program has a new partnership with the NVIDIA Inception startup program. The benefits include providing Inception members with access to Google Cloud credits, go-to-market support, technical expertise, and fast-tracked onboarding to Google Cloud Marketplace. 213. As part of the NVIDIA Inception partnership, Google for Startups Cloud Program members can join NVIDIA Inception and gain access to technological expertise, NVIDIA Deep Learning Institute course credits, NVIDIA hardware and software, and more. Eligible members of the Google for Startups Cloud Program also can participate in NVIDIA Inception Capital Connect, a platform that gives startups exposure to venture capital firms interested in the space. 214. The new Google for Startups Accelerator: AI-First program for startups building AI solutions based in the U.S. and Canada has launched, and its cohort includes 15 AI startups: Aptori, Augmend, Backpack Healthcare, BrainLogic AI, Cicerai, CLIKA, Easel AI, Findly, Glass Health, Kodif, Liminal, mbue, Modulo Bio, Rocket Doctor, and Sibli. 215. 
The Startup Learning Center provides startups with curated content to help them grow with Google Cloud, and will be launching an offering for startup developers and future founders via Innovators Plus in the coming months. Finally, Google Cloud Consulting has the following services to help you build out your Google Cloud environment: 216. Google Cloud Consulting is offering no-cost, on-demand training to top customers through Google Cloud Skills Boost, including new gen AI skill badges: Prompt Design in Vertex AI, Develop Gen AI Apps with Gemini and Streamlit, and Inspect Rich Documents with Gemini Multimodality and Multimodal RAG. 217. The new Isolator solution protects healthcare data used in collaborations between parties using a variety of Google Cloud technologies including Chrome Enterprise Premium, VPC Service Controls, Chrome Enterprise, and encryption. 218. Google Cloud Consulting’s Delivery Navigator is now generally available to all Google Cloud qualified services partners. Phew. What a week! On behalf of Google Cloud, we’re so grateful you joined us at Next ‘24, and can’t wait to host you again next year back in Las Vegas at the Mandalay Bay on April 9-11, 2025! View the full article
-
-
- google cloud next
- google gemini
-
Healthcare providers have an opportunity to improve the patient experience by collecting and analyzing broader and more diverse datasets. This includes patient medical history, allergies, immunizations, family disease history, and individuals’ lifestyle data such as workout habits. Having access to those datasets and forming a 360-degree view of patients allows healthcare providers such as claim analysts to see a broader context about each patient and personalize the care they provide for every individual. This is underpinned by building a complete patient profile that enables claim analysts to identify patterns, trends, potential gaps in care, and adherence to care plans. They can then use the result of their analysis to understand a patient’s health status, treatment history, and past or upcoming doctor consultations to make more informed decisions, streamline the claim management process, and improve operational outcomes. Achieving this will also improve general public health through better and more timely interventions, identify health risks through predictive analytics, and accelerate the research and development process. AWS has invested in a zero-ETL (extract, transform, and load) future so that builders can focus more on creating value from data, instead of having to spend time preparing data for analysis. The solution proposed in this post follows a zero-ETL approach to data integration to facilitate near real-time analytics and deliver a more personalized patient experience. The solution uses AWS services such as AWS HealthLake, Amazon Redshift, Amazon Kinesis Data Streams, and AWS Lake Formation to build a 360 view of patients. These services enable you to collect and analyze data in near real time and put a comprehensive data governance framework in place that uses granular access control to secure sensitive data from unauthorized users. 
Zero-ETL refers to a set of features on the AWS Cloud that enable integrating different data sources with Amazon Redshift: Integration between Amazon Redshift and Amazon Simple Storage Service (Amazon S3) via Amazon Redshift Spectrum and auto-copy features Integration between Amazon Redshift and Amazon Aurora, Amazon Relational Database Service (Amazon RDS), and Amazon DynamoDB via the zero-ETL feature Integration between Amazon Redshift and streaming sources like Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK) via streaming ingestion Solution overview Organizations in the healthcare industry are currently spending a significant amount of time and money on building complex ETL pipelines for data movement and integration. This means data will be replicated across multiple data stores via bespoke and in some cases hand-written ETL jobs, resulting in data inconsistency, latency, and potential security and privacy breaches. With support for querying cross-account Apache Iceberg tables via Amazon Redshift, you can now build a more comprehensive patient-360 analysis by querying all patient data from one place. This means you can seamlessly combine information such as clinical data stored in HealthLake with data stored in operational databases such as a patient relationship management system, together with data produced from wearable devices in near real-time. Having access to all this data enables healthcare organizations to form a holistic view of patients, improve care coordination across multiple organizations, and provide highly personalized care for each individual. The following diagram depicts the high-level solution we build to achieve these outcomes. Deploy the solution You can use the following AWS CloudFormation template to deploy the solution components: This stack creates the following resources and necessary permissions to integrate the services: A Kinesis data stream. 
You can send data from your streaming source to this resource for ingesting the data into a Redshift data warehouse. We use on-demand capacity mode. An Amazon Aurora MySQL-Compatible Edition cluster version 8.0. This will be your online transaction processing (OLTP) data store for transactional data. To set up zero-ETL integration for ingesting transaction data to the Redshift data warehouse, see Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift. The required parameter groups for source and target are already created as part of the CloudFormation stack. An Amazon Redshift Serverless workgroup and associated namespace. The CloudFormation stack also deploys a provisioned Redshift cluster. If you would like to work with Redshift Serverless, you can remove the provisioned cluster from the template and vice versa. An AWS Identity and Access Management (IAM) role with required policies and trust relationships. Network components, including VPC, subnets, route table, and associations. You can customize these resources as per your organization’s rules. AWS Solution setup AWS HealthLake AWS HealthLake enables organizations in the health industry to securely store, transform, transact, and analyze health data. It stores data in HL7 FHIR format, which is an interoperability standard designed for quick and efficient exchange of health data. When you create a HealthLake data store, a Fast Healthcare Interoperability Resources (FHIR) data repository is made available via a RESTful API endpoint. Simultaneously and as part of AWS HealthLake managed service, the nested JSON FHIR data undergoes an ETL process and is stored in Apache Iceberg open table format in Amazon S3. To create an AWS HealthLake data store, refer to Getting started with AWS HealthLake. Make sure to select the option Preload sample data when creating your data store. 
In real-world scenarios and when you use AWS HealthLake in production environments, you don’t need to load sample data into your AWS HealthLake data store. Instead, you can use FHIR REST API operations to manage and search resources in your AWS HealthLake data store. We use two tables from the sample data stored in HealthLake: patient and allergyintolerance. Query AWS HealthLake tables with Redshift Serverless Amazon Redshift is the data warehousing service available on the AWS Cloud that provides up to six times better price-performance than any other cloud data warehouses in the market, with a fully managed, AI-powered, massively parallel processing (MPP) data warehouse built for performance, scale, and availability. With continuous innovations added to Amazon Redshift, it is now more than just a data warehouse. It enables organizations of different sizes and in different industries to access all the data they have in their AWS environments and analyze it from one single location with a set of features under the zero-ETL umbrella. Amazon Redshift integrates with AWS HealthLake and data lakes through Redshift Spectrum and Amazon S3 auto-copy features, enabling you to query data directly from files on Amazon S3. Query AWS HealthLake data with Amazon Redshift Amazon Redshift makes it straightforward to query the data stored in S3-based data lakes with automatic mounting of an AWS Glue Data Catalog in the Redshift query editor v2. This means you no longer have to create an external schema in Amazon Redshift to use the data lake tables cataloged in the Data Catalog. To get started with this feature, see Querying the AWS Glue Data Catalog. After it is set up and you’re connected to the Redshift query editor v2, complete the following steps: Validate that your tables are visible in the query editor V2. The Data Catalog objects are listed under the awsdatacatalog database. FHIR data stored in AWS HealthLake is highly nested. 
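To make "highly nested" concrete, here is a minimal Python sketch of what un-nesting a patient and an allergy record involves. The two records are hypothetical and heavily trimmed (the IDs, the `X-123` code, and the helper names `or_null`, `flatten_patient`, and `flatten_allergy` are all assumptions made for illustration, not HealthLake output or an AWS API). It only mirrors the shape of the work: each nested array multiplies rows, a missing array still yields one row with a null, and the patient ID is what follows the `Patient/` prefix of the reference.

```python
from itertools import product

# Hypothetical, heavily trimmed FHIR-style records (illustrative only).
patient = {
    "id": "p-001",
    "gender": "female",
    "name": [{"family": "Doe", "given": ["Jane", "Q"], "prefix": ["Ms."]}],
}
allergy = {
    "resourcetype": "AllergyIntolerance",
    "patient": {"reference": "Patient/p-001"},
    "category": ["food"],
    "reaction": [
        {"manifestation": [{"coding": [{"code": "X-123", "display": "Example rash"}]}]}
    ],
}


def or_null(items):
    """Behave like a LEFT JOIN on a nested array: no elements still gives one None row."""
    return items if items else [None]


def flatten_patient(p):
    """One row per (name, given, prefix) combination."""
    rows = []
    for name in or_null(p.get("name")):
        name = name or {}
        givens = or_null(name.get("given"))
        prefixes = or_null(name.get("prefix"))
        for given, prefix in product(givens, prefixes):
            rows.append({
                "id": p["id"],
                "gender": p.get("gender"),
                "given_name": given,
                "family_name": name.get("family"),
                "prefix": prefix,
            })
    return rows


def flatten_allergy(a):
    """One row per category x reaction x manifestation x coding."""
    reference = a["patient"]["reference"]
    rows = []
    for category in or_null(a.get("category")):
        for reaction in or_null(a.get("reaction")):
            for manifestation in or_null((reaction or {}).get("manifestation")):
                for coding in or_null((manifestation or {}).get("coding")):
                    coding = coding or {}
                    rows.append({
                        # Drop the 8-character "Patient/" prefix to get the bare ID.
                        "patient_id": reference[8:],
                        "allergy_category": category,
                        "allergy_code": coding.get("code") or "NA",
                        "allergy_description": coding.get("display") or "NA",
                    })
    return rows


# Inner join of the two flattened row sets on the patient ID.
joined = [
    {**p, **a}
    for p in flatten_patient(patient)
    for a in flatten_allergy(allergy)
    if p["id"] == a["patient_id"]
]
```

Two given names and one prefix produce two patient rows, so a single allergy appears twice after the join. That row multiplication is worth keeping in mind when you read results from nested data.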
To learn about how to un-nest semi-structured data with Amazon Redshift, see Tutorial: Querying nested data with Amazon Redshift Spectrum. Use the following query to un-nest the allergyintolerance and patient tables, join them together, and get patient details and their allergies: WITH patient_allergy AS ( SELECT resourcetype, c AS allery_category, a."patient"."reference", SUBSTRING(a."patient"."reference", 9, LEN(a."patient"."reference")) AS patient_id, a.recordeddate AS allergy_record_date, NVL(cd."code", 'NA') AS allergy_code, NVL(cd.display, 'NA') AS allergy_description FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."allergyintolerance" a LEFT JOIN a.category c ON TRUE LEFT JOIN a.reaction r ON TRUE LEFT JOIN r.manifestation m ON TRUE LEFT JOIN m.coding cd ON TRUE ), patinet_info AS ( SELECT id, gender, g as given_name, n.family as family_name, pr as prefix FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."patient" p LEFT JOIN p.name n ON TRUE LEFT JOIN n.given g ON TRUE LEFT JOIN n.prefix pr ON TRUE ) SELECT DISTINCT p.id, p.gender, p.prefix, p.given_name, p.family_name, pa.allery_category, pa.allergy_code, pa.allergy_description from patient_allergy pa JOIN patinet_info p ON pa.patient_id = p.id ORDER BY p.id, pa.allergy_code ; To eliminate the need for Amazon Redshift to un-nest data every time a query is run, you can create a materialized view to hold un-nested and flattened data. Materialized views are an effective mechanism to deal with complex and repeating queries. They contain a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database. Use the following SQL to create a materialized view. 
You use it later to build a complete view of patients: CREATE MATERIALIZED VIEW patient_allergy_info AUTO REFRESH YES AS WITH patient_allergy AS ( SELECT resourcetype, c AS allery_category, a."patient"."reference", SUBSTRING(a."patient"."reference", 9, LEN(a."patient"."reference")) AS patient_id, a.recordeddate AS allergy_record_date, NVL(cd."code", 'NA') AS allergy_code, NVL(cd.display, 'NA') AS allergy_description FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."allergyintolerance" a LEFT JOIN a.category c ON TRUE LEFT JOIN a.reaction r ON TRUE LEFT JOIN r.manifestation m ON TRUE LEFT JOIN m.coding cd ON TRUE ), patinet_info AS ( SELECT id, gender, g as given_name, n.family as family_name, pr as prefix FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."patient" p LEFT JOIN p.name n ON TRUE LEFT JOIN n.given g ON TRUE LEFT JOIN n.prefix pr ON TRUE ) SELECT DISTINCT p.id, p.gender, p.prefix, p.given_name, p.family_name, pa.allery_category, pa.allergy_code, pa.allergy_description from patient_allergy pa JOIN patinet_info p ON pa.patient_id = p.id ORDER BY p.id, pa.allergy_code ; You have confirmed you can query data in AWS HealthLake via Amazon Redshift. Next, you set up zero-ETL integration between Amazon Redshift and Amazon Aurora MySQL. Set up zero-ETL integration between Amazon Aurora MySQL and Redshift Serverless Applications such as front-desk software, which are used to schedule appointments and register new patients, store data in OLTP databases such as Aurora. To get data out of OLTP databases and have them ready for analytics use cases, data teams might have to spend a considerable amount of time to build, test, and deploy ETL jobs that are complex to maintain and scale. 
With the Amazon Redshift zero-ETL integration with Amazon Aurora MySQL, you can run analytics on the data stored in OLTP databases and combine them with the rest of the data in Amazon Redshift and AWS HealthLake in near real time. In the next steps in this section, we connect to a MySQL database and set up zero-ETL integration with Amazon Redshift. Connect to an Aurora MySQL database and set up data Connect to your Aurora MySQL database using your editor of choice using AdminUsername and AdminPassword that you entered when running the CloudFormation stack. (For simplicity, it is the same for Amazon Redshift and Aurora.) When you’re connected to your database, complete the following steps: Create a new database by running the following command: CREATE DATABASE front_desk_app_db; Create a new table. This table simulates storing patient information as they visit clinics and other healthcare centers. For simplicity and to demonstrate specific capabilities, we assume that patient IDs are the same in AWS HealthLake and the front-of-office application. In real-world scenarios, this can be a hashed version of a national health care number: CREATE TABLE patient_appointment ( patient_id varchar(250), gender varchar(1), date_of_birth date, appointment_datetime datetime, phone_number varchar(15), PRIMARY KEY (patient_id, appointment_datetime) ); Having a primary key in the table is mandatory for zero-ETL integration to work. Insert new records into the source table in the Aurora MySQL database. To demonstrate the required functionalities, make sure the patient_id of the sample records inserted into the MySQL database match the ones in AWS HealthLake. 
Replace [PATIENT_ID_1] and [PATIENT_ID_2] in the following query with patient IDs returned by the Redshift query you ran previously (the query that joined allergyintolerance and patient). Keep the surrounding single quotes, because patient_id is a varchar column: INSERT INTO front_desk_app_db.patient_appointment (patient_id, gender, date_of_birth, appointment_datetime, phone_number) VALUES ('[PATIENT_ID_1]', 'F', '1988-07-04', '2023-12-19 10:15:00', '0401401401'), ('[PATIENT_ID_1]', 'F', '1988-07-04', '2023-09-19 11:00:00', '0401401401'), ('[PATIENT_ID_1]', 'F', '1988-07-04', '2023-06-06 14:30:00', '0401401401'), ('[PATIENT_ID_2]', 'F', '1972-11-14', '2023-12-19 08:15:00', '0401401402'), ('[PATIENT_ID_2]', 'F', '1972-11-14', '2023-01-09 12:15:00', '0401401402'); Now that your source table is populated with sample records, you can set up zero-ETL and have data ingested into Amazon Redshift. Set up zero-ETL integration between Amazon Aurora MySQL and Amazon Redshift Complete the following steps to create your zero-ETL integration: On the Amazon RDS console, choose Databases in the navigation pane. Choose the DB identifier of your cluster (not the instance). On the Zero-ETL Integration tab, choose Create zero-ETL integration. Follow the steps to create your integration. Create a Redshift database from the integration Next, you create a target database from the integration. You can do this by running a couple of simple SQL commands on Amazon Redshift. Log in to the query editor V2 and run the following commands: Get the integration ID of the zero-ETL integration you set up between your source database and Amazon Redshift: SELECT * FROM svv_integration; Create a database using the integration ID, replacing [INTEGRATION_ID] with the value returned by the previous query: CREATE DATABASE ztl_demo FROM INTEGRATION '[INTEGRATION_ID]'; Query the database and validate that a new table is created and populated with data from your source MySQL database: SELECT * FROM ztl_demo.front_desk_app_db.patient_appointment; It might take a few seconds for the first set of records to appear in Amazon Redshift. This shows that the integration is working as expected.
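Because the source table's primary key is (patient_id, appointment_datetime), any extra test record needs a new combination of those two values. The following is a minimal sketch of that constraint using Python's built-in sqlite3 module purely as a local stand-in for Aurora MySQL (an assumption for illustration; SQLite column types and error classes differ from MySQL's, and the patient ID `patient-a` is hypothetical).

```python
import sqlite3

# In-memory SQLite database standing in for the Aurora MySQL source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE patient_appointment (
        patient_id TEXT,
        gender TEXT,
        date_of_birth TEXT,
        appointment_datetime TEXT,
        phone_number TEXT,
        PRIMARY KEY (patient_id, appointment_datetime)
    )
    """
)

# Parameter placeholders avoid hand-quoting values inside the SQL text.
insert_sql = "INSERT INTO patient_appointment VALUES (?, ?, ?, ?, ?)"

# Same patient, two different appointment times: both rows are accepted.
conn.execute(insert_sql, ("patient-a", "F", "1988-07-04", "2023-12-19 10:15:00", "0401401401"))
conn.execute(insert_sql, ("patient-a", "F", "1988-07-04", "2024-01-15 09:00:00", "0401401401"))

# Same patient and same appointment time again: the composite key rejects it.
try:
    conn.execute(insert_sql, ("patient-a", "F", "1988-07-04", "2023-12-19 10:15:00", "0401401401"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM patient_appointment").fetchone()[0]
```

A patient can therefore have many appointments, but only one per timestamp, which is what gives every source row a stable identity for replication.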
To validate it further, you can insert a new record in your Aurora MySQL database, and it will be available in Amazon Redshift for querying in near real time within a few seconds. Set up streaming ingestion for Amazon Redshift Another aspect of zero-ETL on AWS, for real-time and streaming data, is realized through Amazon Redshift Streaming Ingestion. It provides low-latency, high-speed ingestion of streaming data from Kinesis Data Streams and Amazon MSK. It lowers the effort required to have data ready for analytics workloads, lowers the cost of running such workloads on the cloud, and decreases the operational burden of maintaining the solution. In the context of healthcare, understanding an individual’s exercise and movement patterns can help with overall health assessment and better treatment planning. In this section, you send simulated data from wearable devices to Kinesis Data Streams and integrate it with the rest of the data you already have access to from your Redshift Serverless data warehouse. For step-by-step instructions, refer to Real-time analytics with Amazon Redshift streaming ingestion. Note the following steps when you set up streaming ingestion for Amazon Redshift: Select wearables_stream and use the following template when sending data to Amazon Kinesis Data Streams via Kinesis Data Generator, to simulate data generated by wearable devices. 
Replace [PATIENT_ID_1] and [PATIENT_ID_2] with the patient IDs you used earlier when inserting new records into your Aurora MySQL table: { "patient_id": "{{random.arrayElement(["[PATIENT_ID_1]","[PATIENT_ID_2]"])}}", "steps_increment": "{{random.arrayElement( [0,1] )}}", "heart_rate": {{random.number( { "min":45, "max":120} )}} } Create an external schema called from_kds by running the following query and replacing [IAM_ROLE_ARN] with the ARN of the role created by the CloudFormation stack (Patient360BlogRole): CREATE EXTERNAL SCHEMA from_kds FROM KINESIS IAM_ROLE '[IAM_ROLE_ARN]'; Use the following SQL when creating a materialized view to consume data from the stream: CREATE MATERIALIZED VIEW patient_wearable_data AUTO REFRESH YES AS SELECT approximate_arrival_timestamp, JSON_PARSE(kinesis_data) as Data FROM from_kds."wearables_stream" WHERE CAN_JSON_PARSE(kinesis_data); To validate that streaming ingestion works as expected, refresh the materialized view to get the data you already sent to the data stream and query the table to make sure data has landed in Amazon Redshift: REFRESH MATERIALIZED VIEW patient_wearable_data; SELECT * FROM patient_wearable_data ORDER BY approximate_arrival_timestamp DESC; Query and analyze patient wearable data The results in the data column of the preceding query are in JSON format. Amazon Redshift makes it straightforward to work with semi-structured data in JSON format. It uses the PartiQL language to offer SQL-compatible access to relational, semi-structured, and nested data. Use the following query to flatten data: SELECT data."patient_id"::varchar AS patient_id, data."steps_increment"::integer as steps_increment, data."heart_rate"::integer as heart_rate, approximate_arrival_timestamp FROM patient_wearable_data ORDER BY approximate_arrival_timestamp DESC; The result looks like the following screenshot. Now that you know how to flatten JSON data, you can analyze it further.
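The flattening step, and the kind of per-minute heart-rate aggregation used to count active minutes, can be sketched in plain Python. This is an illustration under stated assumptions, not the Redshift implementation: the sample payloads, the patient IDs, and the function names `flatten` and `active_minutes` are hypothetical, and the sketch uses a floating-point mean, whereas a warehouse may average integer columns differently.

```python
import json
from collections import defaultdict
from datetime import date, datetime


def flatten(raw):
    """Parse one payload into typed columns; return None when it is not valid JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return {
        "patient_id": str(data["patient_id"]),
        # The generator template quotes this field, so it arrives as a string.
        "steps_increment": int(data["steps_increment"]),
        "heart_rate": int(data["heart_rate"]),
    }


def active_minutes(rows, threshold=80):
    """Count, per patient and day, the minutes whose mean heart rate exceeds threshold."""
    per_minute = defaultdict(list)
    for arrived_at, row in rows:
        key = (row["patient_id"], arrived_at.date(), arrived_at.hour, arrived_at.minute)
        per_minute[key].append(row["heart_rate"])

    counts = defaultdict(int)
    for (patient_id, day, _hour, _minute), rates in per_minute.items():
        if sum(rates) / len(rates) > threshold:  # floating-point mean (assumption)
            counts[(patient_id, day)] += 1
    return dict(counts)


# Hypothetical (arrival timestamp, payload) pairs, including one malformed payload.
records = [
    (datetime(2024, 1, 1, 9, 0, 5), '{"patient_id": "patient-a", "steps_increment": "1", "heart_rate": 90}'),
    (datetime(2024, 1, 1, 9, 0, 35), '{"patient_id": "patient-a", "steps_increment": "0", "heart_rate": 100}'),
    (datetime(2024, 1, 1, 9, 1, 10), '{"patient_id": "patient-a", "steps_increment": "0", "heart_rate": 60}'),
    (datetime(2024, 1, 1, 9, 2, 0), "not json"),
    (datetime(2024, 1, 1, 9, 3, 0), '{"patient_id": "patient-b", "steps_increment": "1", "heart_rate": 110}'),
]

rows = []
for arrived_at, raw in records:
    flat = flatten(raw)
    if flat is not None:  # skip unparseable payloads
        rows.append((arrived_at, flat))

result = active_minutes(rows)
```

Grouping by minute before counting means several readings inside the same minute count once, so the result reflects minutes of activity rather than the number of messages a device happened to send.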
Use the following query to get the number of minutes a patient has been physically active per day, based on their heart rate (greater than 80): WITH patient_wearble_flattened AS ( SELECT data."patient_id"::varchar AS patient_id, data."steps_increment"::integer as steps_increment, data."heart_rate"::integer as heart_rate, approximate_arrival_timestamp, DATE(approximate_arrival_timestamp) AS date_received, extract(hour from approximate_arrival_timestamp) AS hour_received, extract(minute from approximate_arrival_timestamp) AS minute_received FROM patient_wearable_data ), patient_active_minutes AS ( SELECT patient_id, date_received, hour_received, minute_received, avg(heart_rate) AS heart_rate FROM patient_wearble_flattened GROUP BY patient_id, date_received, hour_received, minute_received HAVING avg(heart_rate) > 80 ) SELECT patient_id, date_received, COUNT(heart_rate) AS active_minutes_count FROM patient_active_minutes GROUP BY patient_id, date_received ORDER BY patient_id, date_received; Create a complete patient 360 Now that you are able to query all patient data with Redshift Serverless, you can combine the three datasets you used in this post and form a comprehensive patient 360 view with the following query: WITH patient_appointment_info AS ( SELECT "patient_id", "gender", "date_of_birth", "appointment_datetime", "phone_number" FROM ztl_demo.front_desk_app_db.patient_appointment ), patient_wearble_flattened AS ( SELECT data."patient_id"::varchar AS patient_id, data."steps_increment"::integer as steps_increment, data."heart_rate"::integer as heart_rate, approximate_arrival_timestamp, DATE(approximate_arrival_timestamp) AS date_received, extract(hour from approximate_arrival_timestamp) AS hour_received, extract(minute from approximate_arrival_timestamp) AS minute_received FROM patient_wearable_data ), patient_active_minutes AS ( SELECT patient_id, date_received, hour_received, minute_received, avg(heart_rate) AS heart_rate FROM patient_wearble_flattened GROUP BY 
patient_id, date_received, hour_received, minute_received HAVING avg(heart_rate) > 80 ), patient_active_minutes_count AS ( SELECT patient_id, date_received, COUNT(heart_rate) AS active_minutes_count FROM patient_active_minutes GROUP BY patient_id, date_received ) SELECT pai.patient_id, pai.gender, pai.prefix, pai.given_name, pai.family_name, pai.allery_category, pai.allergy_code, pai.allergy_description, ppi.date_of_birth, ppi.appointment_datetime, ppi.phone_number, pamc.date_received, pamc.active_minutes_count FROM patient_allergy_info pai LEFT JOIN patient_active_minutes_count pamc ON pai.patient_id = pamc.patient_id LEFT JOIN patient_appointment_info ppi ON pai.patient_id = ppi.patient_id GROUP BY pai.patient_id, pai.gender, pai.prefix, pai.given_name, pai.family_name, pai.allery_category, pai.allergy_code, pai.allergy_description, ppi.date_of_birth, ppi.appointment_datetime, ppi.phone_number, pamc.date_received, pamc.active_minutes_count ORDER BY pai.patient_id, pai.gender, pai.prefix, pai.given_name, pai.family_name, pai.allery_category, pai.allergy_code, pai.allergy_description, ppi.date_of_birth DESC, ppi.appointment_datetime DESC, ppi.phone_number DESC, pamc.date_received, pamc.active_minutes_count You can use the solution and queries used here to expand the datasets used in your analysis. For example, you can include other tables from AWS HealthLake as needed. Clean up To clean up resources you created, complete the following steps: Delete the zero-ETL integration between Amazon RDS and Amazon Redshift. Delete the CloudFormation stack. Delete AWS HealthLake data store Conclusion Forming a comprehensive 360 view of patients by integrating data from various different sources offers numerous benefits for organizations operating in the healthcare industry. It enables healthcare providers to gain a holistic understanding of a patient’s medical journey, enhances clinical decision-making, and allows for more accurate diagnosis and tailored treatment plans. 
With zero-ETL features for data integration on AWS, you can build a view of patients securely, cost-effectively, and with minimal effort. You can then use visualization tools such as Amazon QuickSight to build dashboards or use Amazon Redshift ML to enable data analysts and database developers to train machine learning (ML) models with the data integrated through Amazon Redshift zero-ETL. The result is a set of ML models that are trained with a broader view into patients, their medical history, and their lifestyle, and therefore enable you to make more accurate predictions about their upcoming health needs. About the Authors Saeed Barghi is a Sr. Analytics Specialist Solutions Architect specializing in architecting enterprise data platforms. He has extensive experience in the fields of data warehousing, data engineering, data lakes, and AI/ML. Based in Melbourne, Australia, Saeed works with public sector customers in Australia and New Zealand. Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 17 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe. View the full article
-
KX and Databricks have partnered to develop time series analytics solutions for the capital markets sector to support many use cases including quant... View the full article
-
- kx
- databricks
-
-
If you’re considering pursuing the AWS Data Analytics Specialty certification, you’re in the right place! I have compiled a list of 25 free questions that will help you test your knowledge and prepare for the exam. If you’re considering a career in AWS data analytics, you need to be comfortable with statistical analysis, data visualization, and machine learning. You also need to have strong problem-solving skills and be able to think creatively. What does an AWS Data Analyst do? AWS Data Analytics Specialists are responsible for managing and analyzing data on the AWS platform. They work with data from a variety of sources, including relational databases, NoSQL databases, and streaming data. AWS Data Analytics Specialists use a variety of tools and techniques to gain insights into data and to make recommendations to businesses. What to expect in the AWS Data Analytics Specialty exam? If you’re considering taking the AWS Data Analytics Specialty exam, here’s what you can expect. The exam is divided into five sections: collection, storage and data management, processing, analysis and visualization, and security. Each section has a different weightage. To pass the exam, you’ll need to demonstrate your knowledge and skills in collection systems, as well as analytics and visualization. The section on processing will test your ability to design a data processing solution. The analytics and visualization section will test your ability to use data to generate insights and visualize those insights in a way that is easy to understand. To prepare for the AWS Data Analytics Specialty exam, it is recommended that you have experience working with AWS data analytics services, such as Amazon Kinesis, Amazon Redshift, and Amazon Athena. You should also be familiar with common data analysis techniques, such as regression, classification, and clustering.
The AWS Data Analytics Specialty exam is a challenging exam, but if you prepare correctly by taking the AWS Data Analytics Specialty practice exam, you can pass it with flying colors. Let us start learning through these AWS Data Analytics Specialty exam free questions and answers! Domain : Security Question 1 : You work as a data engineer for an international banking firm where you are responsible for building a Redshift data warehouse to allow the bank management team to produce business insights via dashboards and reports. Much of the data stored in your Redshift data warehouse is highly confidential, for example Personally Identifiable Information (PII). Also, some of the data needed to produce the management insights is stored in S3 and accessed via Redshift Spectrum. To achieve the highest level of security, the Glue data catalog used by Redshift Spectrum to access your tables on S3 is encrypted. What must you do to gain access to the S3 tables via Redshift Spectrum? A. Use the KMS key for Redshift to access the Glue data catalog B. Nothing, Redshift and Redshift Spectrum can access the data in S3 via the Glue data catalog regardless of whether the Glue data catalog is encrypted or not C. Create a KMS key for Redshift Spectrum and use it to access the Glue data catalog D. Use the KMS key for Glue to access the Glue data catalog Correct Answer: D Explanation: Option A is incorrect. If the Glue catalog is encrypted, you need the KMS key for Glue to access the Glue data catalog. A key associated with Redshift will not allow you to access the encrypted Glue data catalog. Option B is incorrect. If the Glue catalog is encrypted, you need the KMS key for Glue to access the Glue data catalog. Option C is incorrect. If the Glue catalog is encrypted, you need the KMS key for Glue to access the Glue data catalog. A key associated with Redshift will not allow you to access the encrypted Glue data catalog. Option D is correct. 
If the Glue catalog is encrypted, you need the KMS key for Glue to access the Glue data catalog. Reference: Please see the Amazon Redshift database developer guide titled Querying external data using Amazon Redshift Spectrum (https://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum.html) Domain : Processing Question 2 : You work as a data engineer for a financial services company that receives near real-time streaming data for stock and derivative security master data including prices, symbol, contract info, etc. You receive these data stream feeds from several data providers. You have set up Kinesis Data Firehose delivery streams to receive the data from your providers. Currently you have your delivery streams configured to send the streaming data to an S3 destination. Your management team wants you to change the delivery stream destination for one of your feeds from S3 to Redshift. How would you do this in the least disruptive manner? A. Use the StopDeliveryStream API call to stop the delivery stream, then change the destination to Redshift using the UpdateDestination API call, then use the StartDeliveryStream API to restart the delivery stream. B. Use the UpdateDestination API call to change the destination from S3 to Redshift. The target delivery stream remains active while the configuration is updated; data writes to the delivery stream can continue during the change. The updated configuration completes within a few minutes. C. Create a new delivery stream using the CreateDeliveryStream API call that has Redshift as its destination, use the StopDeliveryStream API call to stop the delivery stream that writes to S3, and start the new delivery stream using the StartDeliveryStream API call. D. Use the ChangeDeliveryStream API call to change the destination from S3 to Redshift. The target delivery stream remains active while the configuration is updated; data writes to the delivery stream can continue during the change. 
The updated configuration completes within a few minutes. Correct Answer: B Explanation: Option A is incorrect. There is no StopDeliveryStream API call. Also, there is no StartDeliveryStream API call. Option B is correct. You can change the delivery stream destination without interrupting the flow of data through the delivery stream by using the UpdateDestination API call. Option C is incorrect. There is no StopDeliveryStream or StartDeliveryStream API call. Option D is incorrect. There is no ChangeDeliveryStream API call. References: Please see the Amazon Kinesis Data Firehose developer guide titled Creating an Amazon Kinesis Data Firehose Delivery Stream (https://docs.aws.amazon.com/firehose/latest/dev/basic-create.html), and the Amazon Kinesis Data Firehose API reference titled UpdateDestination (https://docs.aws.amazon.com/firehose/latest/APIReference/API_UpdateDestination.html), and the Amazon Kinesis Data Firehose API reference titled Actions (https://docs.aws.amazon.com/firehose/latest/APIReference/API_Operations.html) Domain : Collection Question 3 : You work as a data engineer for an online retailer that wishes to capture customer clickstream activity for its website and mobile platforms. Your marketing department plans to gain insights from the clickstream data through the use of your data warehouse. You have built a Kinesis Data Streams pipeline that streams your data to Redshift through the use of a Kinesis Producer Library (KPL) application and a Kinesis Client Library (KCL) application that uses the Kinesis Connector Library to write the clickstream data to Redshift. As your KPL code writes your clickstream data to your Kinesis Data stream, you need to monitor to ensure even load distributions across your fleet of EC2 instances running your KPL application. How can you monitor the load distribution of your KPL fleet of EC2 instances most efficiently? A. 
Use the CloudWatch metrics published by the KPL with a metric level of DETAILED and a granularity of STREAM to monitor your load distribution B. Use the CloudWatch metrics published by the KPL with a metric level of SUMMARY and a granularity of SHARD to monitor your load distribution C. Use the CloudWatch metrics published by the KPL with a metric level of DETAILED and a granularity of SHARD to monitor your load distribution; add the EC2 hostname as a dimension D. Use the CloudWatch metrics published by the KPL with the default metric level and granularity Correct Answer: C Explanation: Option A is incorrect. The granularity level of STREAM is not a granular enough level of metric to efficiently monitor for uneven distribution across your EC2 fleet. You need to use fine-grained metrics so that you can capture an identifier, like the hostname of the KPL instance, to allow you to identify an uneven load distribution across your fleet. Option B is incorrect. The metric level of SUMMARY will not send granular-level metrics to CloudWatch. You need granular-level metrics to efficiently monitor for uneven distribution across your EC2 fleet. You need to use fine-grained metrics so that you can capture an identifier, like the hostname of the KPL instance, to allow you to identify an uneven load distribution across your fleet. Option C is correct. The DETAILED metric level and the granularity of SHARD will allow you to use fine-grained metrics so that you can capture an identifier, like the hostname of the KPL instance, to allow you to identify an uneven load distribution across your fleet. Adding the hostname as a dimension to your CloudWatch metrics will allow you to identify the distribution of your load. Option D is incorrect. 
The default metric level (DETAILED) and granularity (SHARD) will get you the level of granularity needed to monitor for load distribution, however, adding the hostname as a dimension to your CloudWatch metrics is a more efficient way to identify uneven load distribution. References: Please see the Amazon Kinesis Data Streams developer guide titled Developing Producers Using the Amazon Kinesis Producer Library (https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-kpl.html), and the Amazon Kinesis Data Streams developer guide titled Writing Data to Amazon Kinesis Data Stream (https://docs.aws.amazon.com/streams/latest/dev/building-producers.html), and the Amazon Kinesis Data Streams developer guide titled Monitoring Amazon Kinesis Data Streams (https://docs.aws.amazon.com/streams/latest/dev/monitoring.html), and the Amazon Kinesis Data Streams developer guide titled Monitoring the Kinesis Producer Library with Amazon CloudWatch (https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-kpl.html), and the Amazon Kinesis Data Streams developer guide titled Using the Kinesis Client Library (https://docs.aws.amazon.com/streams/latest/dev/shared-throughput-kcl-consumers.html), and the Amazon Kinesis Data Streams page titled Getting started with Amazon Kinesis Data Streams (https://aws.amazon.com/kinesis/data-streams/getting-started/#:~:text=Amazon%20Kinesis%20Client%20Library%20(KCL,S3%2C%20and%20Amazon%20Elasticsearch%20Service.) Domain : Security Question 4 : You work as a data engineer for a government agency that compiles data on the Gross Domestic Product (GDP) for the country. To facilitate the building of the data lake that houses the GDP data, your team is responsible for managing the configuration and deployment of all EMR clusters used by your government analysts. You are responsible for centralizing governance and compliance requirements, and providing a common set of policies on how EMR instances should be set up. 
Your goal is to enable your analysts to quickly deploy only your agency’s approved EMR cluster configurations on a self-service basis while staying within the governance and compliance requirements of your agency. What is the most efficient way to implement your EMR cluster management system? A. Create a set of CloudFormation templates, one for each configuration, to be used as a self-service deployment configuration B. Use AWS Systems Manager to create a portfolio of products used by your analysts to provision the products needed to build their EMR clusters. C. Use AWS OpsWorks and build a Puppet master server to create a portfolio of products used by your analysts to provision the products needed to build their EMR clusters. D. Use AWS Service Catalog to create a portfolio of products used by your analysts to provision the products needed to build their EMR clusters. Correct Answer: D Explanation: Option A is incorrect. This approach is very inefficient. Your team would have to write and maintain all of the templates. Using AWS Service Catalog to create and manage your deployment configurations as products is much more efficient. Option B is incorrect. AWS Systems Manager is not used to create portfolios of products for systematic distribution; AWS Service Catalog is used for this purpose. Option C is incorrect. AWS OpsWorks Puppet gives you a set of tools for enforcing the desired state of your infrastructure, and automating on-demand tasks. However, it would be far more time-consuming to use this approach than to use AWS Service Catalog. Option D is correct. You can use AWS Service Catalog to centrally manage your analysts’ commonly deployed EMR cluster configurations. This approach helps you achieve consistent governance and meet your compliance requirements, while at the same time enabling your analysts to quickly deploy only the approved EMR cluster configurations on a self-service basis. 
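To make the self-service flow in Option D a little more concrete, here is a minimal Python sketch of how an analyst-facing script might provision an approved product through the Service Catalog provision_product call. The function name, product ID, artifact ID, and parameter keys are hypothetical placeholders, not values from the question, and the client is passed in so the function can be tried without AWS credentials.

```python
def provision_emr_product(sc_client, product_id, artifact_id, name, params):
    """Provision an approved Service Catalog product (for example an EMR cluster).

    sc_client is expected to behave like boto3.client("servicecatalog").
    All IDs and parameter keys passed in are hypothetical placeholders.
    """
    response = sc_client.provision_product(
        ProductId=product_id,
        ProvisioningArtifactId=artifact_id,
        ProvisionedProductName=name,
        # Service Catalog expects parameters as a list of Key/Value pairs.
        ProvisioningParameters=[
            {"Key": key, "Value": value} for key, value in sorted(params.items())
        ],
    )
    # The record ID can be used later to poll the provisioning status.
    return response["RecordDetail"]["RecordId"]
```

With real credentials this would be called as provision_emr_product(boto3.client("servicecatalog"), ...); the analyst only chooses among products the administrators have published, which is what keeps deployments inside the approved configurations.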
References: Please see the AWS Big Data blog titled Build a self-service environment for each line of business using Amazon EMR and AWS Service Catalog (https://aws.amazon.com/blogs/big-data/build-a-self-service-environment-for-each-line-of-business-using-amazon-emr-and-aws-service-catalog/), and the AWS OpsWorks user guide titled What Is AWS OpsWorks? (https://docs.aws.amazon.com/opsworks/latest/userguide/welcome.html) Domain : Security Question 5 : You work as a data engineer for an automobile manufacturer. Your company is building a data lake using EMR as the big data platform to enable company analysts to run petabyte-scale analysis on the company’s car sales and manufacturing data. For your analysts’ access to your EMR cluster nodes, you need to provide strong authentication so that passwords or other credentials aren’t sent over the network in an unencrypted format, therefore you have chosen to use Kerberos authentication. You also need to allow your analysts to connect to your EMR cluster nodes, however you do not want to have your analysts use an EC2 private key file when connecting to your EMR cluster. Which Kerberos architecture options allow you to meet your security requirements? (Select TWO) A. Cluster-dedicated KDC (KDC on master node) B. Cross-realm trust C. External KDC – MIT KDC D. External KDC – master node on a different cluster E. External KDC – cluster KDC on a different cluster with Active Directory cross-realm trust Correct Answers: B and E Explanation: Option A is incorrect. With this architecture your analysts would have to use an EC2 private key file and kinit credentials to connect to the cluster. Option B is correct. Cross-realm trusts are most commonly implemented using Active Directory. With this architecture, if your analysts are in your Active Directory domain they can use kinit credentials to access your clusters that are protected via kerberos, without using an EC2 private key file. Option C is incorrect. 
With this architecture your analysts would have to use an EC2 private key file and kinit credentials to connect to the cluster. Option D is incorrect. With this architecture your analysts would have to use an EC2 private key file and kinit credentials to connect to the cluster. Option E is correct. An Active Directory cross-realm trust is implemented using Active Directory. With this architecture analysts in the Active Directory domain can access your Kerberized clusters using kinit credentials, without the EC2 private key file. References: Please see the Amazon EMR management guide titled Kerberos architecture options (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-kerberos-options.html), and the Amazon EMR management guide titled Use Kerberos authentication (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-kerberos.html) Domain : Analysis and Visualization Question 6 : You work as a data scientist for a data analytics firm. Your firm collects data about product usage, consumer behavior, global supply chains, etc. As you gather the data from your sources, you need to transform and aggregate the data for use by your clients. You have written AWS Glue jobs to perform the transformation and aggregation. You need to gather metrics from your Glue jobs to ensure they are performing as expected by tracking runtime metrics such as bytes read and written, memory usage and CPU load of the driver and executors, and data shuffles among executors. How do you enable the gathering of Glue metrics? A. Enable the job metrics option in your Glue job definition, resulting in the job script initializing a GlueContext class B. Enable the job metrics option in your Glue job definition, resulting in the job script initializing a GlueTransform class C. Enable the job metrics option in your Glue job definition, resulting in the job script initializing a DynamicFrame class D. 
Enable the job metrics option in your Glue job definition, resulting in the job script initializing a GlueMetrics class Correct Answer: A Explanation: Option A is correct. When you enable job metrics in your Glue job definition, the job script initializes a GlueContext class which is then used to initialize the Spark session. Option B is incorrect. The GlueTransform class is used to transform data, not to gather metrics. Option C is incorrect. The DynamicFrame class is used to manipulate your data in a dataframe. Option D is incorrect. There is no GlueMetrics class. References: Please see the AWS Glue developer guide titled Monitoring AWS Glue Using Amazon CloudWatch Metrics (https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html), and the AWS announcement titled AWS Glue now provides additional ETL job metrics (https://aws.amazon.com/about-aws/whats-new/2018/07/aws-glue-now-provides-additional-ETL-job-metrics/), and the AWS Glue developer guide titled Job Monitoring and Debugging (https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-glue-job-cloudwatch-metrics.html), and the AWS Glue developer guide titled GlueContext Class (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html) Domain : Collection Question 7 : You work as a data engineer for an international airline. Your company is building a data lake to house flight data including airplane travel routes, passenger capacity, weather patterns, fuel consumption, etc. You and your data engineering team have decided to use AWS Lake Formation to build your data lake. You are loading data from your operational systems and you need to decide which Lake Formation blueprint to use. If you choose to use a Database Snapshot type of Lake Formation blueprint, which of the following characteristics describe your requirements? (Select TWO) A. Schema evolution is flexible, e.g. 
columns are re-named, previous columns are deleted, and new columns are added in their place B. Schema evolution is incremental, e.g. there is only successive addition of columns C. Complete consistency is needed between the source and the destination D. Only new rows are added; previous rows are not updated E. Only new rows are added; previous rows are updated Correct Answers: A and C Option A is correct. A Database Snapshot blueprint loads or reloads data from all tables into the data lake from a JDBC source. Therefore, schema evolution can be flexible. Option B is incorrect. Incremental schema evolution is more suited to the Incremental Database blueprint. Option C is correct. A Database Snapshot blueprint loads or reloads data from all tables into the data lake from a JDBC source. Therefore, you can achieve complete consistency between your source and your destination. Option D is incorrect. When only adding new rows and not updating previous rows, the Incremental Database blueprint is a better choice. Option E is incorrect. This option does not match a use case for any of the three Lake Formation blueprints: Database Snapshot, Incremental Database, or Log File. References: Please see the AWS Lake Formation developer guide titled AWS Lake Formation: How It Works (https://docs.aws.amazon.com/lake-formation/latest/dg/how-it-works.html), and the AWS Lake Formation developer guide titled Importing Data Using Workflows in Lake Formation (https://docs.aws.amazon.com/lake-formation/latest/dg/workflows.html), and the AWS Lake Formation developer guide titled Blueprints and Workflows in Lake Formation (https://docs.aws.amazon.com/lake-formation/latest/dg/workflows-about.html) Domain : Processing Question 8 : You work as a data engineer for a security surveillance company that provides video security for business and residential properties. You are building a live video streaming service that will be used for real-time video analysis of security camera footage. 
The camera devices you are using run a proprietary operating system, also they don’t run a Java virtual machine. You and your team are writing your video processing code that extracts data from your camera video and sends the video fragments to your Kinesis Video stream. Which coding approach is the most efficient for you to use? A. Use the Kinesis Video Streams Producer Client B. Use the Kinesis Producer Library (KPL) C. Use the Kinesis Video Streams Producer Library D. Use the Kinesis Video Streams Media Source Library Correct Answer: C Explanation: Option A is incorrect. The Kinesis Video Streams Producer Client is used when your producing device, or camera in your case, runs either Java or Android applications. Your devices run a proprietary operating system and don’t run a Java virtual machine. Option B is incorrect. Since you are sending your data to a Kinesis Video stream, you should use one of the Kinesis Video Streams producer libraries. Using these libraries that are built for Kinesis Video Streams will be more efficient than using the KPL. Option C is correct. You should use the Kinesis Video Streams Producer Library directly when the device on which you are running the application doesn’t have a Java virtual machine, and when your application is running on a device with a proprietary operating system. Option D is incorrect. There is no Kinesis Video Streams Media Source Library. 
References: Please see the Amazon Kinesis Video Streams developer guide titled Amazon Kinesis Video Streams: How It Works (https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/how-it-works.html), and the Amazon Kinesis Video Streams developer guide titled Kinesis Video Streams API and Producer Libraries Support (https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/how-it-works-kinesis-video-api-producer-sdk.html), and the Amazon Kinesis Video Streams developer guide titled Kinesis Video Streams Producer Libraries (https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/producer-sdk.html), and the Amazon Kinesis Video Streams developer guide titled Using the C++ Producer Library (https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/producer-sdk-cpp.html) Domain : Storage and Data Management Question 9 : You work as a data scientist for an online retailer where you are responsible for managing the company’s product catalog. The catalog data of approximately 500,000 products is stored in their DynamoDB database. The DynamoDB tables that hold the product catalog data use 2GB of storage. You are building an Elasticsearch cluster to allow your analysts to efficiently search the product catalog. You are assuming a compression ratio of 1.0 for your indexed data. Also, you plan to use an m4.2xlarge Elasticsearch instance node type. You need to make sure the Elasticsearch cluster is highly available and that it is configured for optimal performance. How many Elasticsearch shards should you set for your index shard count and how many nodes should you create? A. 1 shard and 2 nodes B. 2 shards and 2 nodes C. 2 shards and 1 node D. 1 shard and 1 node Correct Answer: A Explanation: Option A is correct. The calculation for your shards and nodes: total storage to be indexed 2GB multiplied by your compression ratio of 1.0 gives you 2GB for your index size. To get the number of required shards, divide your index storage by 30GB. 
Therefore, 2GB/30GB means you can get away with having only one shard. Also, you need to have one replica for redundancy so your index storage becomes 2 X 2GB = 4GB. Your m4.2xlarge instance has 8GB storage so you could use one node. However, you need a highly available solution so you should add an additional node. Therefore, you need 2 nodes. Option B is incorrect. Based on the shard calculation number of shards = index size / 30GB, your calculation is 2GB/30GB. Therefore, you only need one shard. Option C is incorrect. Based on the shard calculation number of shards = index size / 30GB, your calculation is 2GB/30GB. Therefore, you only need one shard. You should have two nodes for high availability. Option D is incorrect. You should have two nodes for high availability. References: Please see the AWS Database blog titled Get Started with Amazon Elasticsearch Service: How Many Shards Do I Need? (https://aws.amazon.com/blogs/database/get-started-with-amazon-elasticsearch-service-how-many-shards-do-i-need/), and the AWS Database blog titled Get Started with Amazon Elasticsearch Service: How Many Data Instances Do I Need? (https://aws.amazon.com/blogs/database/get-started-with-amazon-elasticsearch-service-how-many-data-instances-do-i-need/) Domain : Analysis and Visualization Question 10 : You work as a data engineer for a marketing firm. You and your engineering team have been given the task of creating a dashboard of the data behind the marketing firm’s Objectives and Key Results (OKRs) and the progress toward achieving those OKRs. The source data for the OKR dashboard comes from many of the firm’s operational systems. You have created the initial visualization of your dashboard in QuickSight. How would you construct an architecture to visualize your dashboard in QuickSight that refreshes the data in your dashboard as soon as the data is available? A. 
Operational data loaded into an S3 bucket; an EventBridge rule triggers a Lambda function which uses the CreateIngestion API operation to refresh the data in QuickSight SPICE B. Operational data loaded into an S3 bucket; use the options on Datasets page to refresh the data in QuickSight SPICE C. Operational data loaded into an S3 bucket; an EventBridge rule triggers a Lambda function which uses the CreateDataSet API operation to refresh the data in QuickSight SPICE D. Operational data loaded into an S3 bucket; schedule refreshes in the dataset settings to refresh the data in QuickSight SPICE Correct Answer: A Explanation: Option A is correct. To have the latest data displayed in your dashboard, you need to refresh the SPICE data. There are 4 ways to refresh the SPICE data: use the options on Datasets page in the QuickSight UI, refresh the dataset by editing the dataset, schedule refreshes in the dataset settings, or use the CreateIngestion API operation to refresh the data. The best option to have the data refresh as soon as the data is available is to trigger a Lambda function to run the CreateIngestion API operation to refresh the data in QuickSight SPICE. Option B is incorrect. This option requires manual intervention, therefore making it very inefficient and very unlikely that it would allow you to visualize the data as soon as it’s available. Option C is incorrect. The CreateDataSet API operation creates a new SPICE dataset, you wouldn’t use this API operation to refresh an existing dataset. Option D is incorrect. Scheduling refreshes would eventually visualize your data, however your requirement is to visualize the data as soon as the data is available. 
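The event-driven refresh described in Option A can be sketched as a small Lambda handler that calls the QuickSight CreateIngestion operation. This is only an illustration: the environment variable names QS_ACCOUNT_ID and QS_DATASET_ID are assumptions, and the client is injectable so the logic can be exercised without AWS access.

```python
import os
import uuid


def refresh_spice_dataset(qs_client, account_id, dataset_id):
    """Start a SPICE refresh and return the initial ingestion status.

    qs_client is expected to behave like boto3.client("quicksight").
    """
    response = qs_client.create_ingestion(
        AwsAccountId=account_id,
        DataSetId=dataset_id,
        # Each ingestion needs its own ID; a random UUID is one simple choice.
        IngestionId=str(uuid.uuid4()),
    )
    return response["IngestionStatus"]


def lambda_handler(event, context, qs_client=None):
    """Entry point invoked by the EventBridge rule when new data lands in S3."""
    if qs_client is None:
        import boto3  # imported lazily so the helper above is testable offline

        qs_client = boto3.client("quicksight")
    # Hypothetical environment variables holding the account and dataset IDs.
    status = refresh_spice_dataset(
        qs_client, os.environ["QS_ACCOUNT_ID"], os.environ["QS_DATASET_ID"]
    )
    return {"ingestionStatus": status}
```

Because the refresh is started by the event rather than by a schedule or a person, the dashboard picks up new data shortly after it arrives.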
References: Please see the Amazon QuickSight user guide titled Refreshing Data (https://docs.aws.amazon.com/quicksight/latest/user/refreshing-imported-data.html), and the Amazon QuickSight API reference titled CreateIngestion (https://docs.aws.amazon.com/quicksight/latest/APIReference/API_CreateIngestion.html), and the AWS Big Data blog titled Event-driven refresh of SPICE datasets in Amazon QuickSight (https://aws.amazon.com/blogs/big-data/event-driven-refresh-of-spice-datasets-in-amazon-quicksight/) Domain : Storage and Data Management Question 11 : You work as a data engineer for a shipping company. Your company tracks shipments, shipping containers, shipping contractors, and other related operational data in your data warehouse. You and your engineering team have chosen to use Redshift to house your data warehouse. Your company ingests its operational data every day but the initial storage requirement is relatively small. You have estimated that your data warehouse will grow over time, but will never exceed 1 petabyte in size. Your management team has mandated that you build the most cost-effective storage and processing architecture for your Redshift cluster. Which storage node type gives you the best price/performance ratio? A. ra3.4xlarge nodes B. ra3.xlplus nodes C. ds2.xlarge nodes D. ds2.8xlarge nodes Correct Answer: B Explanation: Option A is incorrect. The ra3.4xlarge node type gives you a total managed storage capacity of 8 petabytes, but you don’t expect that your storage requirement will ever exceed 1 petabyte. Also, the ra3.4xlarge node type costs $3.26 per hour. This is far more expensive than the ra3.xlplus node type. Option B is correct. The ra3.xlplus node type gives you a total managed storage capacity of 1 petabyte, and it costs $1.086 per hour. Also, the ra3 node types use distributed, hardware-accelerated cache that enables Redshift to run much faster than the ds2 node type. 
Therefore, the ra3.xlplus node type gives you the best price/performance ratio. Option C is incorrect. The ra3 node types use distributed, hardware-accelerated cache that enables Redshift to run much faster than the ds2 node type. Option D is incorrect. The ra3 node types use distributed, hardware-accelerated cache that enables Redshift to run much faster than the ds2 node type. References: Please see the AWS Big Data blog titled Introducing Amazon Redshift RA3.xlplus nodes with managed storage (https://aws.amazon.com/blogs/big-data/introducing-amazon-redshift-ra3-xlplus-nodes-with-managed-storage/), and the Amazon Redshift cluster management guide titled Amazon Redshift clusters (https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html) Domain : Security Question 12 : You work as a data engineer for a financial services firm where you are responsible for the firm’s data warehouse and the associated security of the data warehouse. The data warehouse contains information about the firm’s clients and information about the firm’s trading activity. This data must be monitored for security auditing, specifically authentication attempts and connections/disconnections to the data warehouse. You have enabled audit logging on your Redshift cluster. Which Redshift audit log captures the authentication method for user activity in the data warehouse? A. User Activity log B. User log C. Connection log D. Security log Correct Answer: C Explanation: Option A is incorrect. The User Activity log captures information about the types of queries that users and system tasks perform in the database. Option B is incorrect. The User log captures information about changes to database user definitions. Option C is correct. The Connection log captures information about the users who are connecting to the database, including related connection information, such as the type of authentication used. Option D is incorrect. There is no Redshift audit log named Security log. 
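As a small aid for remembering which log holds what, the three audit logs from the explanation can be written down as a lookup table, together with a sketch of enabling audit logging through the Redshift enable_logging call. The cluster identifier, bucket name, and prefix are hypothetical, and the client is passed in so nothing here needs AWS access to run.

```python
# What each Redshift audit log captures, as summarized in the explanation above.
AUDIT_LOGS = {
    "connection": "connecting users and connection details, including authentication type",
    "user": "changes to database user definitions",
    "user activity": "queries that users and system tasks run in the database",
}


def log_for_authentication():
    """Return the name of the audit log that records the authentication type."""
    for name, contents in AUDIT_LOGS.items():
        if "authentication" in contents:
            return name
    return None


def enable_audit_logging(redshift_client, cluster_id, bucket, prefix="audit/"):
    """Turn on audit logging to S3 for a cluster.

    redshift_client is expected to behave like boto3.client("redshift").
    cluster_id, bucket, and prefix are hypothetical example values.
    """
    response = redshift_client.enable_logging(
        ClusterIdentifier=cluster_id,
        BucketName=bucket,
        S3KeyPrefix=prefix,
    )
    return response["LoggingEnabled"]
```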
References: Please see the Amazon Redshift cluster management guide titled Database audit logging (https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html), and the Amazon Redshift cluster management guide titled Logging and monitoring in Amazon Redshift (https://docs.aws.amazon.com/redshift/latest/mgmt/security-incident-response.html) Domain : Collection Question 13 : You work as a data engineer for a large regional medical insurance firm. Your firm gathers medical and insurance data from several sources that is loaded into your data lake. The data needs to be transformed as you load it into your data lake. You are designing your data loading process and you have identified the need for several small to medium-sized generic tasks that will be part of your ETL (extract, transform, load) workflow. You have chosen to use AWS Glue for your ETL workflow. Which of the types of AWS Glue jobs is the most cost effective, in terms of DPUs (data processing units), for your design? A. Apache Spark B. Python shell C. Spark Streaming D. Scala shell Correct Answer: B Explanation: Option A is incorrect. Using Apache Spark would be more expensive than Python shell scripts. An Apache Spark job run in Glue requires a minimum of 2 DPUs. Each DPU costs $0.44 per DPU-hour in increments of 1 second, rounded up to the nearest second, with a 1-minute minimum billing duration. For Python scripts, Glue allocates 0.0625 DPU to each Python shell job. You are billed $0.44 per DPU-Hour in increments of 1 second, rounded up to the nearest second, with a 1-minute minimum duration for each job of type Python shell. Option B is correct. The type of job you are running, small to medium-sized generic tasks, is best suited to Python shell scripts. Also, Python scripts are less expensive as far as DPU allocation per job. Python shell scripts use either 1 or 0.0625 DPUs, where a Spark Streaming or Apache Spark job requires a minimum of 2 DPUs. Option C is incorrect. 
A Spark Streaming job run in Glue requires a minimum of 2 DPUs. Each DPU costs $0.44 per DPU-hour in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum billing duration. Option D is incorrect. There is no Scala shell script type of Glue job. References: Please see the AWS Glue product page titled AWS Glue pricing (https://aws.amazon.com/glue/pricing/), and the AWS Announcement titled Introducing Python Shell Jobs in AWS Glue (https://aws.amazon.com/about-aws/whats-new/2019/01/introducing-python-shell-jobs-in-aws-glue/) Domain : Processing Question 14 : You work as a data engineer for a transportation company. Your company streams data from several operational sources and data providers to build a data lake. Your management team uses the data in the data lake to create business intelligence dashboards. Your machine learning specialists also use the data lake as the source data for their machine learning models. You have built a real-time streaming data pipeline using Amazon Managed Streaming for Apache Kafka (Amazon MSK). You have created your MSK cluster and have configured MSK to create broker nodes in each Availability Zone in your region. Which of the Amazon MSK components coordinates cluster tasks and maintains state for resources interacting with your Apache Kafka cluster? A. Broker Nodes B. ZooKeeper Nodes C. Data Producer D. Cluster Operator Correct Answer: B Explanation: Option A is incorrect. In Amazon MSK, Apache Kafka partitions topics and replicates the partitions across multiple nodes called broker nodes. Apache Kafka runs on the broker nodes. Option B is correct. In Amazon MSK, the ZooKeeper nodes coordinate cluster tasks and maintain state for resources interacting with an Apache Kafka cluster. Option C is incorrect. In Amazon MSK, Data Producers are the applications that produce streaming data and send it to the cluster. Option D is incorrect. There is no Cluster Operator component in Amazon MSK. 
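To make the Glue cost comparison from Question 13 concrete, here is a minimal Python sketch of the billing arithmetic described in that explanation. The $0.44 per DPU-hour rate, the DPU allocations, and the minimum billing durations are taken from the explanation text, not from current AWS pricing. The function name `glue_job_cost` and the 5-minute runtime are hypothetical and used only for illustration.

```python
import math

# Rate quoted in the explanation above; an assumption, not current AWS pricing.
DPU_HOUR_USD = 0.44


def glue_job_cost(dpus: float, runtime_seconds: float, min_billed_seconds: int = 60) -> float:
    """Estimate the cost of one Glue job run (hypothetical helper).

    Billing is per second, rounded up to the nearest second, with a
    minimum billed duration (60 seconds by default here).
    """
    billed_seconds = max(math.ceil(runtime_seconds), min_billed_seconds)
    return dpus * DPU_HOUR_USD * billed_seconds / 3600


# A small generic task that runs for 5 minutes (illustrative runtime).
python_shell = glue_job_cost(dpus=0.0625, runtime_seconds=300)
spark = glue_job_cost(dpus=2, runtime_seconds=300)

print(f"Python shell: ${python_shell:.4f}")
print(f"Apache Spark: ${spark:.4f}")
print(f"Spark costs {spark / python_shell:.0f}x as much for the same runtime")
```

Under these assumptions the difference comes entirely from the DPU allocation (2 versus 0.0625), which is why the Python shell job is the cheaper choice for small generic tasks.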
References: Please see the Amazon MSK product page titled Amazon Managed Streaming for Apache Kafka (Amazon MSK) (https://aws.amazon.com/msk/), and the Amazon Managed Streaming for Apache Kafka developer guide titled What Is Amazon MSK? (https://docs.amazonaws.cn/en_us/msk/latest/developerguide/what-is-msk.html), and the Amazon MSK FAQs (https://aws.amazon.com/msk/faqs/) Domain : Security Question 15 : You work as a data engineer for a company that offers a property rental service app. Your company’s data analysts need access to your data lake of property information to analyze the rental property data and produce dashboards and operational intelligence visualizations. The analysts need to be able to search through your property information using a fast search engine, so you have set up Elasticsearch for search and Kibana as your visualization tool. Your company uses single sign-on (SSO) technology for access to your internal applications. You need to control access to your Kibana service. Which access control method is the most efficient for your organization? A. IP-based access policy B. IAM users and roles C. SAML authentication D. Cognito authentication Correct Answer: C Explanation: Option A is incorrect. An IP-based access policy is used for public access domains. You are running your Kibana service for your internal analysts. Also, Kibana is a JavaScript application that originates its requests from the user’s IP address. IP-based access control is impractical due to the sheer number of IP addresses you would need to allow in order for each user to have access to Kibana. You could solve this with a proxy server, but using SAML authentication is a simpler approach that also allows for single sign-on. Option B is incorrect. Kibana does not natively support IAM users and roles. Option C is correct. SAML authentication for Kibana lets you use your existing identity provider to offer single sign-on (SSO) for Kibana on your Elasticsearch domain. Option D is incorrect. 
While you could use Cognito user and identity pools, it will be more efficient for you to use SAML authentication since your company already uses SSO. References: Please see the Amazon Elasticsearch Service developer guide titled Using Kibana with Amazon Elasticsearch Service (https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-kibana.html), and the Amazon Elasticsearch Service developer guide titled Configuring Amazon Cognito authentication for Kibana (https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-cognito-auth.html), and the Amazon Elasticsearch Service developer guide titled SAML authentication for Kibana (https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/saml.html) Domain : Analysis and Visualization Question 16 : You work as a data engineer for an online retailer where you are setting up sessionization to track the clickstream data of your users. Your marketing department plans to use clickstream data analysis to help assess the effectiveness of your company’s new online features and marketing campaigns. Your clickstream data arrives at the rate of thousands of messages per second. Your marketing department wants to assess the data in real time so that they can be very nimble in their use of targeted marketing and features. How should you perform your sessionization? A. Send the clickstream data through Kinesis Data Streams to Glue, use Glue to perform data sessionization B. Send the clickstream data through Kinesis Data Streams to EMR, use EMR to perform data sessionization C. Send the clickstream data through Kinesis Data Firehose to S3, use Athena to perform data sessionization D. Send the clickstream data through Kinesis Data Streams to Kinesis Data Analytics, use Kinesis Data Analytics to perform data sessionization Correct Answer: D Explanation: Option A is incorrect. You could perform the sessionization in batch jobs using Glue or Amazon EMR. 
However, this will not give you real-time access to the data. Option B is incorrect. You could perform the sessionization in batch jobs using Glue or Amazon EMR. However, this will not give you real-time access to the data. Option C is incorrect. Streaming the data directly to S3 using Kinesis Data Firehose and accessing the data with Athena will not allow you to efficiently sessionize your clickstream data. Option D is correct. Using Kinesis Data Analytics to sessionize your clickstream data is much faster than the other options. This configuration allows you to provide real-time sessionized data. References: Please see the AWS Big Data blog titled Create real-time clickstream sessions and run analytics with Amazon Kinesis Data Analytics, AWS Glue, and Amazon Athena (https://aws.amazon.com/blogs/big-data/create-real-time-clickstream-sessions-and-run-analytics-with-amazon-kinesis-data-analytics-aws-glue-and-amazon-athena/), and the AWS What’s New page titled New Kinesis Analytics stream processing functions for time series analytics, real time sessionization, and more (https://aws.amazon.com/about-aws/whats-new/2017/09/new-kinesis-analytics-stream-processing-functions-for-time-series-analytics-real-time-sessionization-and-more/) Domain: Analysis and Visualization Question 17 : You work as a data scientist for a logistics company. Your company has a fleet of thousands of trucks on the road at any given time delivering temperature-sensitive freight. You are responsible for building a dashboard that shows any anomaly in the temperatures of any of the trucks on the road. Each truck has a temperature sensor onboard that streams the current temperature of the onboard freight at 1-minute intervals. Which option gives you a streaming data pipeline, including real-time analytics, anomaly detection, and visualization in the most efficient manner? A. 
Sensors send temperature data to Kinesis Data Firehose, Kinesis Data Firehose performs anomaly detection using a Lambda function, Kinesis Data Firehose writes the processed anomaly data to S3, use QuickSight to visualize the processed anomaly data B. Sensors send temperature data to Kinesis Data Streams, Kinesis Data Streams sends the temperature data to Kinesis Data Analytics, use the built-in Random Cut Forest function in Kinesis Data Analytics to detect anomalies in real time, Kinesis Data Analytics sends the processed anomaly data to a Kinesis Data Firehose delivery stream, Kinesis Data Firehose sends the processed anomaly data to an Elasticsearch cluster where you use Kibana to visualize the anomaly data C. Sensors send temperature data to Kinesis Data Firehose, Kinesis Data Firehose streams the data to an S3 bucket, a SageMaker Random Cut Forest model detects anomalies in the data and writes the resulting processed anomaly data to another S3 bucket, use QuickSight to visualize the processed anomaly data D. Sensors send temperature data to Kinesis Data Firehose, Kinesis Data Firehose streams the data to an S3 bucket, a SageMaker Random Cut Forest model detects anomalies in the data and writes the resulting processed anomaly data to an Elasticsearch cluster where you use Kibana to visualize the anomaly data Correct Answer: B Explanation: Option A is incorrect. This option writes your streaming temperature data to an S3 bucket. This step will add latency in the processing, so you won’t get real-time anomaly detection. Also, writing a Lambda function to do anomaly detection is far less efficient than using a Random Cut Forest machine learning model. Option B is correct. Using Kinesis Data Analytics and its built-in Random Cut Forest feature you can detect temperature anomalies in real-time. Using Elasticsearch and Kibana you can easily visualize the anomaly data and provide a real-time dashboard. Option C is incorrect. 
This option writes your streaming temperature data and your processed anomaly data to S3 buckets. These steps will add latency in the processing, so you won’t get real-time anomaly detection. Option D is incorrect. This option writes your streaming temperature data to an S3 bucket. This step will add latency in the processing, so you won’t get real-time anomaly detection. References: Please see the AWS Big Data blog titled Perform Near Real-time Analytics on Streaming Data with Amazon Kinesis and Amazon Elasticsearch Service (https://aws.amazon.com/blogs/big-data/perform-near-real-time-analytics-on-streaming-data-with-amazon-kinesis-and-amazon-elasticsearch-service/), and the AWS Machine Learning blog titled Building a visual search application with Amazon SageMaker and Amazon ES (https://aws.amazon.com/blogs/machine-learning/building-a-visual-search-application-with-amazon-sagemaker-and-amazon-es/) Domain: Storage and Data Management Question 18 : You work as a data engineer for an international airline. Your data engineering team is responsible for the company’s data warehouse, which you have built on a Redshift cluster. The data warehouse stores information about the airline’s travel patterns, customer preferences, miles programs, etc. It is important that the Redshift cluster remains very highly available so you have configured automatic snapshots and automatic snapshot copy from your corporate headquarters (your source region) to your European regional office (your destination region). Due to changes in your corporate strategy, you now need to change your destination region to the Asia Pacific region. Which are the most efficient options (Select TWO)? A. In the AWS console, select your Redshift cluster and specify the new destination AWS Region B. Use the AWS CLI to select your Redshift cluster and specify the new destination AWS Region C. Use the AWS console to disable the automatic copy feature, then re-enable it, specifying the new destination AWS Region D. 
Use the AWS console to disable the automatic snapshot feature, then re-enable it, specifying the new destination AWS Region E. Use the AWS CLI to disable the automatic copy feature, then re-enable it, specifying the new destination AWS Region Correct Answers: C and E Explanation: Option A is incorrect. Through the console or the CLI, you can’t change the destination region while the Redshift automatic copy feature is enabled. You must first disable the automatic copy feature, then re-enable it, specifying the new destination region. Option B is incorrect. Through the console or the CLI, you can’t change the destination region while the Redshift automatic copy feature is enabled. You must first disable the automatic copy feature, then re-enable it, specifying the new destination region. Option C is correct. You can use the AWS console to first disable the automatic copy feature, then re-enable it, specifying the new destination region. Option D is incorrect. You need to change the automatic copy feature, not the automatic snapshot feature. Option E is correct. You can use the AWS CLI to first disable the automatic copy feature, then re-enable it, specifying the new destination region. References: Please see the Amazon Redshift cluster management guide titled Amazon Redshift snapshots (https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-snapshots.html#cross-region-snapshot-copy), and the AWS CLI Command Reference titled disable-snapshot-copy (https://docs.aws.amazon.com/cli/latest/reference/redshift/disable-snapshot-copy.html), and the AWS CLI Command Reference titled enable-snapshot-copy (https://docs.aws.amazon.com/cli/latest/reference/redshift/enable-snapshot-copy.html) Domain : Processing Question 19 : You work as a data engineer for a global travel agency. Your company collects data from resorts across the globe to ingest into your data lake. 
Your marketing analysts use the data lake to generate insights to help produce the most effective marketing campaigns. You and your engineering team use AWS Glue to ingest your travel data into your data lake. You have configured your Glue jobs and development endpoints to use the Glue Data Catalog as an external Apache Hive metastore by checking the Use AWS Glue Data Catalog as the Hive metastore check box in the Catalog options group on the Add job and Add endpoint pages on the console. Which permissions should the IAM role used for your jobs and development endpoints have to allow use of the Glue Data Catalog as the Hive metastore? A. glue:CreateDatabase B. glue:CreateConnection C. glue:CreateEndpoint D. glue:CreateJob Correct Answer: A Explanation: Option A is correct. To enable the Data Catalog access, the IAM role used for your jobs and development endpoints should have glue:CreateDatabase permissions. Option B is incorrect. To enable the Data Catalog access, the IAM role used for your jobs and development endpoints should have glue:CreateDatabase permissions, not the glue:CreateConnection permissions. Option C is incorrect. There is no glue:CreateEndpoint permissions defined in IAM. Option D is incorrect. To enable the Data Catalog access, the IAM role used for your jobs and development endpoints should have glue:CreateDatabase permissions, not the glue:CreateJob permissions. References: Please see the AWS Glue developer guide titled AWS Glue Data Catalog Support for Spark SQL Jobs (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-data-catalog-hive.html), and the AWS Glue developer guide titled AWS Glue API Permissions: Actions and Resources Reference (https://docs.aws.amazon.com/glue/latest/dg/api-permissions-reference.html) Domain : Collection Question 20 : You work as a data engineer for a data analytics company that sells data analytics solutions to marketing companies interested in targeted online marketing. 
Your data engineering department ingests real-time data streams from multiple sources to produce your data lake. You are building a data ingestion pipeline where you need to split data between multiple S3 buckets in your data lake in near real-time. Which is the best option for implementing your data ingestion requirement? A. S3 Replication B. S3 Batch C. Snowball Edge D. DataSync Correct Answer: D Explanation: Option A is incorrect. You should use S3 Replication for continuous replication of data to a specific destination bucket, not to split data across multiple S3 buckets in your data lake. Option B is incorrect. S3 Batch will not meet your near real-time requirement. Option C is incorrect. Snowball Edge is used for offline data transfers where you are transferring data from remote or disconnected environments. Option D is correct. DataSync is the best option for splitting data between multiple buckets. References: Please see the AWS DataSync FAQs page, specifically the question “To transfer objects between my buckets, when do I use AWS DataSync, when do I use S3 Replication, and when do I use S3 Batch Operations?” (https://aws.amazon.com/datasync/faqs/), and the AWS DataSync user guide titled What is AWS DataSync? (https://docs.aws.amazon.com/datasync/latest/userguide/what-is-datasync.html) Domain : Analysis and Visualization Question 21 : You work as a data analyst for an online retail service. Your company uses a data lake housed on S3 to store user clickstream data, client account information, product information, etc. You have been given the assignment of creating a dashboard in QuickSight using your clickstream data to gain insight into user behavior. Which of the following data sources are NOT valid choices to build your QuickSight dashboard? (Select TWO) A. Hive B. Presto C. DynamoDB D. Redshift Spectrum E. S3 Analytics Correct Answers: A and C Explanation: Option A is correct. Hive is NOT a supported data source for QuickSight. Option B is incorrect. 
Presto is a supported data source for QuickSight. Option C is correct. DynamoDB is NOT a supported data source for QuickSight. Option D is incorrect. Redshift Spectrum is a supported data source for QuickSight. Option E is incorrect. S3 Analytics is a supported data source for QuickSight. References: Please see the Amazon QuickSight user guide titled Supported Data Sources (https://docs.aws.amazon.com/quicksight/latest/user/supported-data-sources.html), and the AWS Database blog titled How to perform advanced analytics and build visualizations of your Amazon DynamoDB data by using Amazon Athena (https://aws.amazon.com/blogs/database/how-to-perform-advanced-analytics-and-build-visualizations-of-your-amazon-dynamodb-data-by-using-amazon-athena/) Domain : Security Question 22 : You work as a data analyst for a global financial services company. Your company stores client information in their data lake for clients located in different countries around the world. In order to comply with data sovereignty laws you are required to store data in separate AWS accounts and you are barred from letting your client data leave their specific region. How can you ensure your data in your data lake is highly available? A. Use S3 Cross-Region Replication B. Use S3 Same-Region Replication C. Use S3 Time Control Replication D. Use S3 Batch Replication Correct Answer: B Explanation: Option A is incorrect. You have the requirement to keep your client data in the AWS region that is within the client’s country of origin. Cross-region replication could move the client data out of the client’s country of origin. Option B is correct. Same-region replication allows you to replicate data between buckets within the same region, thus satisfying the requirement to keep your client data in the AWS region that is within the client’s country of origin while also giving you high availability. Option C is incorrect. 
S3 Replication Time Control allows you to meet replication service level agreements (SLAs). Option D is incorrect. There is no S3 Batch Replication. Reference: Please see the Amazon S3 features page titled Amazon S3 Replication (https://aws.amazon.com/s3/features/replication/#:~:text=When%20to%20use%20S3%20Replication,and%20data%20sharing%20across%20accounts.) Domain: Processing Question 23 : You work as a data engineer for a hedge fund that trades on the global derivatives markets. Your firm gathers data from various streaming data services to populate its data lake on S3. The data frequently needs to be transformed before it’s stored in your data lake. You and your engineering team have built a data ingestion pipeline using Kinesis Data Firehose. Your Kinesis Data Firehose stream leverages Lambda functions to perform the necessary transformations. Sometimes your pipeline processes so much data at such a high rate that your AWS account reaches the Lambda invocation limit. What happens when your pipeline reaches the Lambda invocation limit? A. Kinesis Data Firehose skips the failed batch of records, which are treated as unsuccessfully processed records and the records are lost B. Kinesis Data Firehose retries the Lambda invocation three times by default. If the invocation still fails, Kinesis Data Firehose skips the failed batch of records, which are treated as unsuccessfully processed records and the records are lost C. Kinesis Data Firehose retries the Lambda invocation three times by default. If the invocation still fails, Kinesis Data Firehose skips the failed batch of records, which are treated as unsuccessfully processed records, and the unsuccessfully processed records are delivered to your S3 bucket in the processing-failed folder D. 
Kinesis Data Firehose retries the Lambda invocation three times by default. If the invocation still fails, Kinesis Data Firehose skips the failed batch of records, which are treated as unsuccessfully processed records, and the unsuccessfully processed records are delivered to your SQS queue and tagged with the processing-failed label Correct Answer: C Explanation: Option A is incorrect. The records are not lost. Kinesis Data Firehose first retries the Lambda invocation 3 times by default. If the invocation still fails, Kinesis Data Firehose delivers the unsuccessfully processed records to one of your S3 buckets. Option B is incorrect. The records are not lost. Kinesis Data Firehose first retries the Lambda invocation 3 times by default. If the invocation still fails, Kinesis Data Firehose delivers the unsuccessfully processed records to one of your S3 buckets. Option C is correct. Kinesis Data Firehose ensures that your data is not lost. Kinesis Data Firehose first retries the Lambda invocation 3 times by default. If the invocation still fails, Kinesis Data Firehose delivers the unsuccessfully processed records to one of your S3 buckets. Option D is incorrect. Kinesis Data Firehose delivers your unsuccessfully processed records to one of your S3 buckets, not an SQS queue. Reference: Please see the Amazon Kinesis Data Firehose developer guide titled Amazon Kinesis Data Firehose Data Transformation (https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html) Domain : Storage and Data Management Question 24 : You work as a data engineer for a social media software company. You stream data from the company’s websites and mobile apps into your data lake. You also stream data from marketing analytics firms into your data lake. This data is transformed and aggregated and then loaded into your Redshift data warehouse for use in business intelligence dashboards and queries. 
You are now streaming a new data source (which is in the CSV format) using Kinesis Data Firehose and you have decided that the best format for this new data is parquet, since the source data is large and you can take advantage of partitioning and columnar query performance. Which option describes the best way to transform the data and then load it from your data lake to your data warehouse? A. Use Kinesis Data Firehose to transform the streaming data from CSV to parquet and set the destination of the transformed parquet data to your Redshift cluster. B. Have your Kinesis Data Firehose stream leverage a Lambda function to transform the CSV data to JSON, then have your Kinesis Data Firehose stream convert the JSON data to parquet and set the destination of the transformed parquet data to your Redshift cluster. C. Use Kinesis Data Firehose to transform the streaming data from CSV to parquet, then set the destination of the transformed parquet data to an S3 bucket, then use the Redshift COPY command to copy your parquet data to your Redshift cluster. D. Have your Kinesis Data Firehose stream leverage a Lambda function to transform the CSV data to JSON, then have your Kinesis Data Firehose stream convert the JSON data to parquet and set the destination of the transformed parquet data to an S3 bucket, then use the Redshift COPY command to copy your parquet data to your Redshift cluster. Correct Answer: D Explanation: Option A is incorrect. Kinesis Data Firehose cannot convert from CSV directly to parquet. It needs to leverage a Lambda function to first convert the data to JSON. Option B is incorrect. When you enable record format conversion in Kinesis Data Firehose, you can’t set your Kinesis Data Firehose destination to your Redshift cluster. With format conversion enabled, S3 is the only destination that you can use for your Kinesis Data Firehose delivery stream. Option C is incorrect. Kinesis Data Firehose cannot convert from CSV directly to parquet. 
It needs to leverage a Lambda function to first convert the data to JSON. Option D is correct. Kinesis Data Firehose needs to leverage a Lambda function to first convert the data to JSON. You then need to set your destination to S3 because when you enable record format conversion in Kinesis Data Firehose, you can’t set your Kinesis Data Firehose destination to your Redshift cluster. With format conversion enabled, S3 is the only destination that you can use for your Kinesis Data Firehose delivery stream. Once your data is on S3, you can use the Redshift COPY command to load your data into your Redshift tables. References: Please see the Amazon Kinesis Data Firehose developer guide titled Converting Your Input Record Format in Kinesis Data Firehose (https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html), and the AWS What’s New page titled Amazon Redshift Can Now COPY from Parquet and ORC File Formats (https://aws.amazon.com/about-aws/whats-new/2018/06/amazon-redshift-can-now-copy-from-parquet-and-orc-file-formats/), and the Amazon Redshift database developer guide titled Tutorial: Loading data from Amazon S3 (https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data.html) Domain : Collection Question 25 : You work as a data engineer for a data analytics company that produces a Software as a Service (SaaS) that delivers streaming data in near real-time to their SaaS consumer clients. The data needs to remain in sequence. In other words, when your consumers receive the data records they must be in the same order as when your company produced the data records. You have implemented a DynamoDB Streams solution to publish your records as they stream into your DynamoDB tables. You need to now implement the services that deliver the data records to your SaaS consumers. Which option will meet your requirements to build a guaranteed message ordered, near-real-time data processing capability? A. 
Use a Lambda fan-out pattern where you configure a single Lambda function to process the DynamoDB stream and write to a Kinesis Data Stream B. Use an SNS fan-out pattern to SQS where you configure a single Lambda function to process a DynamoDB stream, the Lambda function processes each record and writes it to an SNS topic, have SQS queues that subscribe to the SNS topic C. Use Kinesis Data Streams fan-out pattern where you configure a single Lambda function to process the DynamoDB stream, the Lambda function processes each item and writes to a Kinesis data stream D. FAQ 1. What is the job market like for AWS data analytics specialists? The job market for AWS data analytics specialists is strong. According to the latest data from the U.S. Bureau of Labor Statistics, the median salary for this occupation is $86,010. The job market is expected to grow at a rate of 21 percent through 2026, which is much faster than the average for all occupations. 2. What are the most common AWS data analytics tools? The most common AWS data analytics tools are Amazon Redshift, Amazon Athena, and Amazon EMR. These tools are used to manage and analyze data in the cloud. 3. What are the most common use cases for AWS data analytics? Common use cases for AWS data analytics include data warehousing, data lakes, data mining, and business intelligence. Summary We hope you have enjoyed learning from these AWS Data Analytics Specialty exam questions and answers. Also, we recommend that you do not use any AWS Data Analytics Specialty dumps available online. Those questions are quite out-of-date, and AWS has the right to permanently ban you and cancel your certification at any moment. Instead, spend more time learning the exam objectives and try out the AWS practice exam for the AWS Data Analytics Specialty. 
We at Whizlabs provide AWS Data Analytics Specialty exam preparation guidance with all of the training resources you need to pass the AWS Data Analytics Specialty certification exam successfully, such as video courses, practice tests, hands-on labs, and AWS sandboxes for real-time experiments. Happy Learning! View the full article
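As a recap of the sessionization idea behind Question 16, here is a minimal plain-Python sketch that groups clickstream events into sessions using an inactivity gap. This is not the Kinesis Data Analytics implementation. The function name `sessionize`, the `(user_id, timestamp)` event shape, and the 30-minute gap are assumptions chosen for illustration.

```python
from typing import Iterable, List, Tuple

Event = Tuple[str, int]  # (user_id, timestamp in seconds); hypothetical shape


def sessionize(events: Iterable[Event], gap_seconds: int = 1800) -> List[Tuple[str, int, int]]:
    """Assign a per-user session number to each event.

    A new session starts when a user has been inactive for more than
    gap_seconds (30 minutes by default, an assumed threshold).
    Returns (user_id, session_number, timestamp) tuples in time order.
    """
    last_seen = {}   # user_id -> timestamp of that user's previous event
    session_no = {}  # user_id -> current session number
    result = []
    for user, ts in sorted(events, key=lambda e: e[1]):
        if user not in last_seen or ts - last_seen[user] > gap_seconds:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        result.append((user, session_no[user], ts))
    return result


clicks = [("u1", 0), ("u1", 100), ("u2", 50), ("u1", 5000)]
for row in sessionize(clicks):
    print(row)
```

A streaming service applies the same rule continuously as events arrive, which is what makes the result available in real time instead of after a batch job.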
-
Amazon Kinesis Data Analytics for Apache Flink now provides access to the Apache Flink Dashboard, giving you greater visibility into your applications and advanced monitoring capabilities. You can now view your Apache Flink application’s environment variables, over 120 metrics, logs, and the directed acyclic graph (DAG) of the Apache Flink application in a simple, contextualized user interface. View the full article
-
You can now build and run streaming applications using Apache Flink version 1.11 in Amazon Kinesis Data Analytics for Apache Flink. Apache Flink v1.11 provides improvements to the Table and SQL API, which is a unified, relational API for stream and batch processing and acts as a superset of the SQL language specially designed for working with Apache Flink. Apache Flink v1.11 capabilities also include an improved memory model and RocksDB optimizations for increased application stability, and support for task manager stack traces in the Apache Flink Dashboard. View the full article
-
Editor’s note: Here we take a look at how Branch, a fintech startup, built their data platform with BigQuery and other Google Cloud solutions that democratized data for their analysts and scientists. As a startup in the fintech sector, Branch helps redefine the future of work by building innovative, simple-to-use tech solutions. We’re an employer payments platform, helping businesses provide faster pay and fee-free digital banking to their employees. As head of the Behavioral and Data Science team, I was tapped last year to build out Branch’s team and data platform. I brought my enthusiasm for Google Cloud and its easy-to-use solutions to the first day on the job. We chose Google Cloud for ease-of-use, data & savings I had worked with Google Cloud previously, and one of the primary mandates from our CTO was “Google Cloud-first,” with the larger goal of simplifying unnecessary complexity in the system architecture and controlling the costs associated with being on multiple cloud platforms. From the start, Google Cloud’s suite of solutions supported my vision of how to design a data team. There’s no one-size-fits-all approach. It starts with asking questions: what does Branch need? Which stage are we at? Will we be distributed or centralized? But above all, what parameters in the product will need to be optimized with analytics and data science approaches? With team design, product parameterization is critical. With a product-driven company, the data science team can be most effective by tuning a product’s parameters—for example, a recommendation engine for an ecommerce site is driven by algorithms and underlying models that are updating parameters. “Show X to this type of person but Y to this type of person,” X and Y are the parameters optimized by modeling behavioral patterns. Data scientists behind the scenes can run models as to how that engine should work, and determine which changes are needed. 
By focusing on tuning parameters, the team is designed around determining and optimizing an objective function. That of course relies heavily on the data behind it. How do we label the outcome variable? Is a whole labeling service required? Is the data clean, with a pipeline that won’t require a lot of engineering work? What data augmentation will be needed?

With that data science team design envisioned, I started by focusing on user behavior: deciding how to monitor and track it, how to partner with the product team to ensure it’s in line with the product objectives, and then spinning up A/B testing and monitoring.

On the optimization side, transaction monitoring is critical in fintech. We need to look for low-probability events and abnormal patterns in the data, and then take action, either reaching out to the user as quickly as possible to inform them, or stopping the transaction directly.

In the design phase, we need to determine whether these actions need to happen in real time or after the fact. Is it useful to the user to have that information in real time? For example, if we are working to encourage engagement and we miss an event or an interaction, it’s not the end of the world. It’s different with a fraud monitoring system, where you have to be much stricter about real-time notifications.

Our data infrastructure

There are many use cases at Branch for data cloud technologies from Google Cloud. One is “basic” data work. It’s been incredibly easy to use BigQuery, Google’s serverless data warehouse, which is where we’ve replicated all of our SQL databases, and Cloud Scheduler, the fully managed enterprise-grade cron job scheduler. These two tools, working together, make it easy to organize data pipelining. And because of their deep integration, they play well with other Google Cloud solutions like Cloud Composer and Dataform, as well as with services, like Airflow, from other providers.
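Returning to the transaction monitoring described earlier in this section, one simple way to look for “low-probability events” is a robust outlier score. The sketch below flags amounts whose modified z-score (based on the median and the median absolute deviation) exceeds a threshold. It is an assumption-laden illustration rather than a description of Branch’s system: the function name `flag_unusual_amounts`, the conventional 3.5 threshold, and the sample amounts are all hypothetical choices.

```python
from statistics import median


def flag_unusual_amounts(amounts, threshold=3.5):
    """Return indices of amounts whose modified z-score exceeds `threshold`.

    Uses the median and the median absolute deviation (MAD), which are less
    distorted by the outliers themselves than the mean and standard deviation.
    """
    if not amounts:
        return []
    center = median(amounts)
    mad = median(abs(a - center) for a in amounts)
    if mad == 0:
        # More than half the values equal the median, so there is no scale
        # to compare against; flag anything that differs from the median.
        return [i for i, a in enumerate(amounts) if a != center]
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data are roughly normally distributed.
    return [
        i for i, a in enumerate(amounts)
        if abs(0.6745 * (a - center) / mad) > threshold
    ]


# Hypothetical transaction amounts for one user; index 4 is far from the rest.
amounts = [20.0, 22.5, 19.0, 21.0, 950.0, 20.5, 23.0]
print(flag_unusual_amounts(amounts))
```

Whether a check like this runs inline (to stop a transaction) or in a batch job (to notify the user afterwards) is exactly the real-time versus after-the-fact design question raised above.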
Especially for us as a startup, the whole Google Cloud suite of products accelerates the process of getting established and up and running, so we can perform the “bread-and-butter” work of data science. We also use BigQuery to hold heavier statistics, and we train our models there weekly, monthly, or nightly, depending on how much data we collect. Then we leverage the messaging and ingestion tool Pub/Sub and its event systems to get the response in real time. We evaluate the output for that model in a Dataproc cluster or Dataform, and run all of that in Python notebooks, which can call out to BigQuery to train a model, or get evaluated and pass the event system through.

Full integration of data solutions

At the next level, you need to push data out to your internal teams. We are growing and evolving, so I looked for ways to save on costs during this transition. We do a heavy amount of work in Google Sheets because it integrates well with other Google services, getting data and visuals out to the people who need them and enabling them to access raw data and refresh it as needed.

Google Groups also makes it easy to restrict access to data tables, which is a vital concern in the fintech space. The infrastructure management and integration of Google Groups make it very useful. If an employee leaves the organization, we can easily remove or adjust their access. We can add new employees to a group that has a certain level of rights, or read and write access to the underlying databases. As we grow with Google Cloud, I also envision being able to track usage at the user level, including who’s running which SQL queries and who’s straining the database and raising our costs.

A streamlined data science team saves costs

I’d estimate that Google Cloud’s solutions have saved us the equivalent of one full-time engineer we’d otherwise need to hire to link the various tools together, make sure they are functional, and add more monitoring.
Because of the fully managed features of many of Google Cloud’s products, that work is done for us, and we can focus on expanding our customer products. We’re now 100% Google Cloud for all production systems, having consolidated from IBM, AWS, and other cloud point solutions.

For example, Branch is now expanding financial wellness offerings for our customers to encourage better financial behavior through transaction monitoring, forecasting their spend and deposits, and notifying them of risks or anomalies. With those products and others, we’ll be using and benefiting from the speed, scalability, and ease of use of Google Cloud solutions, where data, and data teams, are always kept top of mind.

Learn more about Branch. Curious about other use cases for BigQuery? Read how retailers can use BigQuery ML to create demand forecasting models.
-