Showing results for tags 'real-time'.

Found 23 results

  1. Today’s fast-paced world demands timely insights and decisions, which is driving the importance of streaming data. Streaming data refers to data that is continuously generated from a variety of sources. The sources of this data, such as clickstream events, change data capture (CDC), application and service logs, and Internet of Things (IoT) data streams, are proliferating.

Snowflake offers two options to bring streaming data into its platform: Snowpipe and Snowflake Snowpipe Streaming. Snowpipe is suitable for file ingestion (batching) use cases, such as loading large files from Amazon Simple Storage Service (Amazon S3) to Snowflake. Snowpipe Streaming, a newer feature released in March 2023, is suitable for rowset ingestion (streaming) use cases, such as loading a continuous stream of data from Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK).

Before Snowpipe Streaming, AWS customers used Snowpipe for both use cases: file ingestion and rowset ingestion. First, you ingested streaming data to Kinesis Data Streams or Amazon MSK, then used Amazon Data Firehose to aggregate and write streams to Amazon S3, followed by using Snowpipe to load the data into Snowflake. However, this multi-step process can result in delays of up to an hour before data is available for analysis in Snowflake. Moreover, it’s expensive, especially when you have small files that Snowpipe has to upload to the Snowflake customer cluster.

To solve this issue, Amazon Data Firehose now integrates with Snowpipe Streaming, enabling you to capture, transform, and deliver data streams from Kinesis Data Streams, Amazon MSK, and Firehose Direct PUT to Snowflake in seconds at a low cost. With a few clicks on the Amazon Data Firehose console, you can set up a Firehose stream to deliver data to Snowflake. There are no commitments or upfront investments to use Amazon Data Firehose, and you only pay for the amount of data streamed.

Some key features of Amazon Data Firehose include:

  • Fully managed serverless service – You don’t need to manage resources, and Amazon Data Firehose automatically scales to match the throughput of your data source without ongoing administration.
  • Straightforward to use with no code – You don’t need to write applications.
  • Real-time data delivery – You can get data to your destinations quickly and efficiently in seconds.
  • Integration with over 20 AWS services – Seamless integration is available for many AWS services, such as Kinesis Data Streams, Amazon MSK, Amazon VPC Flow Logs, AWS WAF logs, Amazon CloudWatch Logs, Amazon EventBridge, AWS IoT Core, and more.
  • Pay-as-you-go model – You only pay for the data volume that Amazon Data Firehose processes.
  • Connectivity – Amazon Data Firehose can connect to public or private subnets in your VPC.

This post explains how you can bring streaming data from AWS into Snowflake within seconds to perform advanced analytics. We explore common architectures and illustrate how to set up a low-code, serverless, cost-effective solution for low-latency data streaming.

Overview of solution

The following are the steps to implement the solution to stream data from AWS to Snowflake:

  1. Create a Snowflake database, schema, and table.
  2. Create a Kinesis data stream.
  3. Create a Firehose delivery stream with Kinesis Data Streams as the source and Snowflake as its destination using a secure private link.
  4. To test the setup, generate sample stream data from the Amazon Kinesis Data Generator (KDG) with the Firehose delivery stream as the destination.
  5. Query the Snowflake table to validate the data loaded into Snowflake.

The solution is depicted in the following architecture diagram.

Prerequisites

You should have the following prerequisites:

  • An AWS account and access to the following AWS services: AWS Identity and Access Management (IAM), Kinesis Data Streams, Amazon S3, and Amazon Data Firehose.
  • Familiarity with the AWS Management Console.
  • A Snowflake account.
  • A key pair generated and your user configured to connect securely to Snowflake. For instructions, refer to the following: Generate the private key; Generate a public key; Store the private and public keys securely; Assign the public key to a Snowflake user; Verify the user’s public key fingerprint.
  • An S3 bucket for error logging.
  • The KDG set up. For instructions, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator.

Create a Snowflake database, schema, and table

Complete the following steps to set up your data in Snowflake:

  1. Log in to your Snowflake account and create the database:

     create database adf_snf;

  2. Create a schema in the new database:

     create schema adf_snf.kds_blog;

  3. Create a table in the new schema:

     create or replace table iot_sensors
     (sensorId number, sensorType varchar,
      internetIP varchar, connectionTime timestamp_ntz,
      currentTemperature number
     );

Create a Kinesis data stream

Complete the following steps to create your data stream:

  1. On the Kinesis Data Streams console, choose Data streams in the navigation pane.
  2. Choose Create data stream.
  3. For Data stream name, enter a name (for example, KDS-Demo-Stream).
  4. Leave the remaining settings as default.
  5. Choose Create data stream.

Create a Firehose delivery stream

Complete the following steps to create a Firehose delivery stream with Kinesis Data Streams as the source and Snowflake as its destination:

  1. On the Amazon Data Firehose console, choose Create Firehose stream.
  2. For Source, choose Amazon Kinesis Data Streams.
  3. For Destination, choose Snowflake.
  4. For Kinesis data stream, browse to the data stream you created earlier.
  5. For Firehose stream name, leave the default generated name or enter a name of your preference.
  6. Under Connection settings, provide the following information to connect Amazon Data Firehose to Snowflake:
      • For Snowflake account URL, enter your Snowflake account URL.
      • For User, enter the user name generated in the prerequisites.
      • For Private key, enter the private key generated in the prerequisites. Make sure the private key is in PKCS8 format. Do not include the PEM BEGIN header and END footer as part of the private key. If the key is split across multiple lines, remove the line breaks.
      • For Role, select Use custom Snowflake role and enter the Snowflake role that has access to write to the database table.

You can connect to Snowflake using public or private connectivity. If you don’t provide a VPC endpoint, the default connectivity mode is public. To allow list Firehose IPs in your Snowflake network policy, refer to Choose Snowflake for Your Destination. If you’re using a private link URL, provide the VPCE ID using SYSTEM$GET_PRIVATELINK_CONFIG:

     select SYSTEM$GET_PRIVATELINK_CONFIG();

This function returns a JSON representation of the Snowflake account information necessary to facilitate the self-service configuration of private connectivity to the Snowflake service, as shown in the following screenshot. For this post, we’re using a private link, so for VPCE ID, enter the VPCE ID.

  7. Under Database configuration settings, enter your Snowflake database, schema, and table names.
  8. In the Backup settings section, for S3 backup bucket, enter the bucket you created as part of the prerequisites.
  9. Choose Create Firehose stream.

Alternatively, you can use an AWS CloudFormation template to create the Firehose delivery stream with Snowflake as the destination rather than using the Amazon Data Firehose console.

Generate sample stream data

Generate sample stream data from the KDG with the Kinesis data stream you created (a minimal boto3 alternative is sketched at the end of this item):

     {
       "sensorId": {{random.number(999999999)}},
       "sensorType": "{{random.arrayElement( ["Thermostat","SmartWaterHeater","HVACTemperatureSensor","WaterPurifier"] )}}",
       "internetIP": "{{internet.ip}}",
       "connectionTime": "{{date.now("YYYY-MM-DDTHH:m:ss")}}",
       "currentTemperature": {{random.number({"min":10,"max":150})}}
     }

Query the Snowflake table

Query the Snowflake table:

     select * from adf_snf.kds_blog.iot_sensors;

You can confirm that the data generated by the KDG and sent to Kinesis Data Streams is loaded into the Snowflake table through Amazon Data Firehose.

Troubleshooting

If data is not loaded into Kinesis Data Streams after the KDG sends data to the Firehose delivery stream, refresh and make sure you are logged in to the KDG.

If you made any changes to the Snowflake destination table definition, recreate the Firehose delivery stream.

Clean up

To avoid incurring future charges, delete the resources you created as part of this exercise if you are not planning to use them further.

Conclusion

Amazon Data Firehose provides a straightforward way to deliver data to Snowpipe Streaming, enabling you to save costs and reduce latency to seconds. To try Amazon Data Firehose with Snowflake, refer to the Amazon Data Firehose with Snowflake as destination lab.

About the Authors

Swapna Bandla is a Senior Solutions Architect in the AWS Analytics Specialist SA Team. Swapna has a passion for understanding customers’ data and analytics needs and empowering them to develop cloud-based well-architected solutions. Outside of work, she enjoys spending time with her family.

Mostafa Mansour is a Principal Product Manager – Tech at Amazon Web Services where he works on Amazon Kinesis Data Firehose. He specializes in developing intuitive product experiences that solve complex challenges for customers at scale. When he’s not hard at work on Amazon Kinesis Data Firehose, you’ll likely find Mostafa on the squash court, where he loves to take on challengers and perfect his dropshots.

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped technology companies design and implement data analytics solutions and products.

View the full article
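If you prefer to generate test data without the KDG, the following is a minimal Python (boto3) sketch that sends records shaped like the KDG template above to the Kinesis data stream from this walkthrough. The stream name and field values are illustrative assumptions, not part of the original post.

# Minimal sketch: send simulated IoT sensor records to the Kinesis data stream.
# Assumes boto3 is installed and AWS credentials are configured; adjust
# STREAM_NAME to match the data stream you created earlier.
import datetime
import json
import random

import boto3

STREAM_NAME = "KDS-Demo-Stream"  # name used in this walkthrough
SENSOR_TYPES = ["Thermostat", "SmartWaterHeater", "HVACTemperatureSensor", "WaterPurifier"]

kinesis = boto3.client("kinesis")

for _ in range(10):
    record = {
        "sensorId": random.randint(0, 999_999_999),
        "sensorType": random.choice(SENSOR_TYPES),
        "internetIP": ".".join(str(random.randint(1, 254)) for _ in range(4)),
        "connectionTime": datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S"),
        "currentTemperature": random.randint(10, 150),
    }
    # PartitionKey determines shard placement; sensorId spreads records evenly.
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(record["sensorId"]),
    )

Because the Firehose stream uses this Kinesis data stream as its source, records sent this way flow to Snowflake exactly as the KDG-generated ones do.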
  2. About the Author

Mona Rakibe is the co-founder and CEO of Telmai, a low-code data reliability platform designed for open architecture, i.e., any batch or streaming source in your data pipeline. Mona is a veteran in the data space, and before starting Telmai, she headed product management at Reltio, a cloud-based master data management company. As we hurtle […] View the full article
  3. Real-time data streaming and event processing present scalability and management challenges. AWS offers a broad selection of managed real-time data streaming services to effortlessly run these workloads at any scale. In this post, Nexthink shares how Amazon Managed Streaming for Apache Kafka (Amazon MSK) empowered them to achieve massive scale in event processing. Experiencing business hyper-growth, Nexthink migrated to AWS to overcome the scaling limitations of on-premises solutions. With Amazon MSK, Nexthink now seamlessly processes trillions of events per day, reaching over 5 GB per second of aggregated throughput.

In the following sections, Nexthink introduces their product and the need for scalability. They then highlight the challenges of their legacy on-premises application and present their transition to a cloud-centered software as a service (SaaS) architecture powered by Amazon MSK. Finally, Nexthink details the benefits achieved by adopting Amazon MSK.

Nexthink’s need to scale

Nexthink is the leader in digital employee experience (DeX). The company is shaping the future of work by providing IT leaders and C-levels with insights into employees’ daily technology experiences at the device and application level. This allows IT to evolve from reactive problem-solving to proactive optimization.

The Nexthink Infinity platform combines analytics, monitoring, automation, and more to manage the employee digital experience. By collecting device and application events, processing them in real time, and storing them, our platform analyzes data to solve problems and boost experiences for over 15 million employees across five continents.

In just 3 years, Nexthink’s business grew tenfold, and with the introduction of more real-time data our application had to scale from processing 200 MB per second to 5 GB per second and trillions of events daily. To enable this growth, we modernized our application from an on-premises single-tenant monolith to a cloud-based scalable SaaS solution powered by Amazon MSK. The next sections detail our modernization journey, including the challenges we faced and the benefits we realized with our new cloud-centered, AWS-based architecture.

The on-premises solution and its challenges

Let’s first explore our previous on-premises solution, Nexthink V6, before examining how Amazon MSK addressed its challenges. The following diagram illustrates its architecture.

V6 was made up of two monolithic, single-tenant Java and C++ applications that were tightly coupled. The portal was a backend-for-frontend Java application, and the core engine was an in-house C++ in-memory database application that was also handling device connections, data ingestion, aggregation, and querying. By bundling all these functions together, the engine became difficult to manage and improve.

V6 also lacked scalability. Although the engine initially supported 10,000 devices, some new tenants had over 300,000 devices. We reacted by deploying multiple V6 engines per tenant, increasing complexity and cost, hampering user experience, and delaying time to market. This also led to longer proof of concept and onboarding cycles, which hurt the business.

Furthermore, the absence of a streaming platform like Kafka created dependencies between teams through tight HTTP/gRPC coupling. Additionally, teams couldn’t access real-time events before ingestion into the database, limiting feature development. We also lacked a data buffer, risking potential data loss during outages.
Such constraints impeded innovation and increased risks. In summary, although the V6 system served its initial purpose, reinventing it with cloud-centered technologies became imperative to enhance scalability and reliability and to foster innovation by our engineering and product teams.

Transitioning to a cloud-centered architecture with Amazon MSK

To achieve our modernization goals, after thorough research and iterations, we implemented an event-driven microservices design on Amazon Elastic Kubernetes Service (Amazon EKS), using Kafka on Amazon MSK for distributed event storage and streaming. Our transition from the V6 on-premises solution to the cloud-centered platform was phased over four iterations:

  • Phase 1 – We lifted and shifted from on premises to virtual machines in the cloud, reducing operational complexities and accelerating proof of concept cycles while transparently migrating customers.
  • Phase 2 – We extended the cloud architecture by implementing new product features with microservices and self-managed Kafka on Kubernetes. However, operating Kafka clusters ourselves proved overly difficult, leading us to Phase 3.
  • Phase 3 – We switched from self-managed Kafka to Amazon MSK, improving stability and reducing operational costs. We realized that managing Kafka wasn’t our core competency or differentiator, and the overhead was high. Amazon MSK enabled us to focus on our core application, freeing us from the burden of undifferentiated Kafka management.
  • Phase 4 – Finally, we eliminated all legacy components, completing the transition to a fully cloud-centered SaaS platform.

This journey of learning and transformation took 3 years. Today, after our successful transition, we use Amazon MSK for two key functions:

  • Real-time data ingestion and processing of trillions of daily events from over 15 million devices worldwide, as illustrated in the following figure.
  • Enabling an event-driven system that decouples data producers and consumers, as depicted in the following figure.

To further enhance our scalability and resilience, we adopted a cell-based architecture using the wide availability of Amazon MSK across AWS Regions. We currently operate over 10 cells, each representing an independent regional deployment of our SaaS solution. This cell-based approach minimizes the area of impact in case of issues, addresses data residency requirements, and enables horizontal scaling across AWS Regions, as illustrated in the following figure.

Benefits of Amazon MSK

Amazon MSK has been critical in enabling our event-driven design. In this section, we outline the main benefits we gained from its adoption.

Improved data resilience

In our new architecture, data from devices is pushed directly to Kafka topics in Amazon MSK, which provides high availability and resilience. This makes sure that events can be safely received and stored at any time. Our services consuming this data inherit the same resilience from Amazon MSK. If our backend ingestion services face disruptions, no event is lost, because Kafka retains all published messages. When our services resume, they seamlessly continue processing from where they left off, thanks to Kafka’s producer semantics, which allow processing messages exactly-once, at-least-once, or at-most-once based on application needs. Amazon MSK enables us to tailor the data retention duration to our specific requirements, ranging from seconds to unlimited duration.
This flexibility grants uninterrupted data availability to our application, which wasn’t possible with our previous architecture. Furthermore, to safeguard data integrity in the event of processing errors or corruption, Kafka enabled us to implement a data replay mechanism, ensuring data consistency and reliability.

Organizational scaling

By adopting an event-driven architecture with Amazon MSK, we decomposed our monolithic application into loosely coupled, stateless microservices communicating asynchronously via Kafka topics. This approach enabled our engineering organization to scale rapidly from just 4–5 teams in 2019 to over 40 teams and approximately 350 engineers today.

The loose coupling between event publishers and subscribers empowered teams to focus on distinct domains, such as data ingestion, identification services, and data lakes. Teams could develop solutions independently within their domains, communicating through Kafka topics without tight coupling. This architecture accelerated feature development by minimizing the risk of new features impacting existing ones. Teams could efficiently consume events published by others, offering new capabilities more rapidly while reducing cross-team dependencies. The following figure illustrates the seamless workflow of adding new domains to our system.

Furthermore, the event-driven design allowed teams to build stateless services that could seamlessly auto scale based on MSK metrics like messages per second. This event-driven scalability eliminated the need for extensive capacity planning and manual scaling efforts, freeing up development time. By using an event-driven microservices architecture on Amazon MSK, we achieved organizational agility, enhanced scalability, and accelerated innovation while minimizing operational overhead.

Seamless infrastructure scaling

Nexthink’s business grew tenfold in 3 years, and many new capabilities were added to the product, leading to a substantial increase in traffic from 200 MB per second to 5 GB per second. This exponential data growth was enabled by the robust scalability of Amazon MSK. Achieving such scale with an on-premises solution would have been challenging and expensive, if not infeasible. Attempting to self-manage Kafka imposed unnecessary operational overhead without providing business value. Running it with just 5% of today’s traffic was already complex and required two engineers. At today’s volumes, we estimated needing 6–10 dedicated staff, increasing costs and diverting resources away from core priorities.

Real-time capabilities

By channeling all our data through Amazon MSK, we enabled real-time processing of events. This unlocked capabilities like real-time alerts, event-driven triggers, and webhooks that were previously unattainable. As such, Amazon MSK was instrumental in facilitating our event-driven architecture and powering impactful innovations.

Secure data access

Transitioning to our new architecture, we met our security and data integrity goals. With Kafka ACLs, we enforced strict access controls, allowing consumers and producers to only interact with authorized topics. We based these granular data access controls on criteria like data type, domain, and team. To securely scale decentralized management of topics, we introduced proprietary Kubernetes Custom Resource Definitions (CRDs). These CRDs enabled teams to independently manage their own topics, settings, and ACLs without compromising security.
Amazon MSK encryption made sure that the data remained encrypted at rest and in transit. We also introduced a Bring Your Own Key (BYOK) option, allowing application-level encryption with customer keys for all single-tenant and multi-tenant topics.

Enhanced observability

Amazon MSK gave us great visibility into our data flows. The out-of-the-box Amazon CloudWatch metrics let us see the amount and types of data flowing through each topic and cluster. This helped us quantify the usage of our product features by tracking data volumes at the topic level. The Amazon MSK operational metrics enabled effortless monitoring and right-sizing of clusters and brokers. Overall, the rich observability of Amazon MSK facilitated data-driven decisions about architecture and product features.

Conclusion

Nexthink’s journey from an on-premises monolith to a cloud SaaS was streamlined by using Amazon MSK, a fully managed Kafka service. Amazon MSK allowed us to scale seamlessly while benefiting from enterprise-grade reliability and security. By offloading Kafka management to AWS, we could stay focused on our core business and innovate faster.

Going forward, we plan to further improve performance, costs, and scalability by adopting Amazon MSK capabilities such as tiered storage and AWS Graviton-based EC2 instance types. We are also working closely with the Amazon MSK team to prepare for upcoming service features. Rapidly adopting new capabilities will help us remain at the forefront of innovation while continuing to grow our business.

To learn more about how Nexthink uses AWS to serve its global customer base, explore the Nexthink on AWS case study. Additionally, discover other customer success stories with Amazon MSK by visiting the Amazon MSK blog category.

About the Authors

Moe Haidar is a principal engineer and special projects lead @ CTO office of Nexthink. He has been involved with AWS since 2018 and is a key contributor to the cloud transformation of the Nexthink platform to AWS. His focus is on product and technology incubation and architecture, but he also loves doing hands-on activities to keep his knowledge of technologies sharp and up to date. He still contributes heavily to the code base and loves to tackle complex problems.

Simone Pomata is a Senior Solutions Architect at AWS. He has worked enthusiastically in the tech industry for more than 10 years. At AWS, he helps customers succeed in building new technologies every day.

Magdalena Gargas is a Solutions Architect passionate about technology and solving customer challenges. At AWS, she works mostly with software companies, helping them innovate in the cloud. She participates in industry events, sharing insights and contributing to the advancement of the containerization field.

View the full article
  4. Healthcare providers have an opportunity to improve the patient experience by collecting and analyzing broader and more diverse datasets. This includes patient medical history, allergies, immunizations, family disease history, and individuals’ lifestyle data such as workout habits. Having access to those datasets and forming a 360-degree view of patients allows healthcare providers such as claim analysts to see a broader context about each patient and personalize the care they provide for every individual. This is underpinned by building a complete patient profile that enables claim analysts to identify patterns, trends, potential gaps in care, and adherence to care plans. They can then use the result of their analysis to understand a patient’s health status, treatment history, and past or upcoming doctor consultations to make more informed decisions, streamline the claim management process, and improve operational outcomes. Achieving this will also improve general public health through better and more timely interventions, identify health risks through predictive analytics, and accelerate the research and development process.

AWS has invested in a zero-ETL (extract, transform, and load) future so that builders can focus more on creating value from data, instead of having to spend time preparing data for analysis. The solution proposed in this post follows a zero-ETL approach to data integration to facilitate near real-time analytics and deliver a more personalized patient experience. The solution uses AWS services such as AWS HealthLake, Amazon Redshift, Amazon Kinesis Data Streams, and AWS Lake Formation to build a 360-degree view of patients. These services enable you to collect and analyze data in near real time and put a comprehensive data governance framework in place that uses granular access control to secure sensitive data from unauthorized users.

Zero-ETL refers to a set of features on the AWS Cloud that enable integrating different data sources with Amazon Redshift:

  • Integration between Amazon Redshift and Amazon Simple Storage Service (Amazon S3) via Amazon Redshift Spectrum and auto-copy features
  • Integration between Amazon Redshift and Amazon Aurora, Amazon Relational Database Service (Amazon RDS), and Amazon DynamoDB via the zero-ETL feature
  • Integration between Amazon Redshift and streaming sources like Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK) via streaming ingestion

Solution overview

Organizations in the healthcare industry are currently spending a significant amount of time and money on building complex ETL pipelines for data movement and integration. This means data will be replicated across multiple data stores via bespoke and in some cases hand-written ETL jobs, resulting in data inconsistency, latency, and potential security and privacy breaches.

With support for querying cross-account Apache Iceberg tables via Amazon Redshift, you can now build a more comprehensive patient-360 analysis by querying all patient data from one place. This means you can seamlessly combine information such as clinical data stored in HealthLake with data stored in operational databases such as a patient relationship management system, together with data produced from wearable devices in near real time. Having access to all this data enables healthcare organizations to form a holistic view of patients, improve care coordination across multiple organizations, and provide highly personalized care for each individual.
The following diagram depicts the high-level solution we build to achieve these outcomes.

Deploy the solution

You can use the following AWS CloudFormation template to deploy the solution components. This stack creates the following resources and necessary permissions to integrate the services:

  • A Kinesis data stream. You can send data from your streaming source to this resource for ingesting the data into a Redshift data warehouse. We use on-demand capacity mode.
  • An Amazon Aurora MySQL-Compatible Edition cluster version 8.0. This will be your online transaction processing (OLTP) data store for transactional data. To set up zero-ETL integration for ingesting transaction data to the Redshift data warehouse, see Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift. The required parameter groups for source and target are already created as part of the CloudFormation stack.
  • An Amazon Redshift Serverless workgroup and associated namespace. The CloudFormation stack also deploys a provisioned Redshift cluster. If you would like to work with Redshift Serverless, you can remove the provisioned cluster from the template and vice versa.
  • An AWS Identity and Access Management (IAM) role with required policies and trust relationships.
  • Network components, including VPC, subnets, route table, and associations. You can customize these resources as per your organization’s rules.

AWS Solution setup

AWS HealthLake

AWS HealthLake enables organizations in the health industry to securely store, transform, transact, and analyze health data. It stores data in HL7 FHIR format, which is an interoperability standard designed for quick and efficient exchange of health data. When you create a HealthLake data store, a Fast Healthcare Interoperability Resources (FHIR) data repository is made available via a RESTful API endpoint. Simultaneously, and as part of the AWS HealthLake managed service, the nested JSON FHIR data undergoes an ETL process and is stored in Apache Iceberg open table format in Amazon S3.

To create an AWS HealthLake data store, refer to Getting started with AWS HealthLake. Make sure to select the option Preload sample data when creating your data store (a minimal boto3 sketch follows this subsection). In real-world scenarios and when you use AWS HealthLake in production environments, you don’t need to load sample data into your AWS HealthLake data store. Instead, you can use FHIR REST API operations to manage and search resources in your AWS HealthLake data store. We use two tables from the sample data stored in HealthLake: patient and allergyintolerance.

Query AWS HealthLake tables with Redshift Serverless

Amazon Redshift is the data warehousing service available on the AWS Cloud that provides up to six times better price-performance than any other cloud data warehouse in the market, with a fully managed, AI-powered, massively parallel processing (MPP) data warehouse built for performance, scale, and availability. With continuous innovations added to Amazon Redshift, it is now more than just a data warehouse. It enables organizations of different sizes and in different industries to access all the data they have in their AWS environments and analyze it from one single location with a set of features under the zero-ETL umbrella. Amazon Redshift integrates with AWS HealthLake and data lakes through Redshift Spectrum and Amazon S3 auto-copy features, enabling you to query data directly from files on Amazon S3.
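For readers who prefer the AWS SDK over the console, the following is a minimal Python (boto3) sketch of creating a HealthLake data store with the Synthea sample data preloaded, matching the Preload sample data option mentioned above. The data store name is an assumption for illustration.

# Minimal sketch: create a HealthLake FHIR R4 data store with preloaded
# Synthea sample data, then poll until it becomes ACTIVE. Assumes boto3 is
# installed and your credentials include HealthLake permissions.
import time

import boto3

healthlake = boto3.client("healthlake")

response = healthlake.create_fhir_datastore(
    DatastoreName="patient-360-demo",                   # illustrative name
    DatastoreTypeVersion="R4",                          # FHIR R4 data store
    PreloadDataConfig={"PreloadDataType": "SYNTHEA"},   # preload sample data
)
datastore_id = response["DatastoreId"]

# Data store creation takes several minutes; wait for a terminal status.
while True:
    status = healthlake.describe_fhir_datastore(DatastoreId=datastore_id)[
        "DatastoreProperties"
    ]["DatastoreStatus"]
    if status in ("ACTIVE", "CREATE_FAILED"):
        break
    time.sleep(60)
print(datastore_id, status)

Once the data store is ACTIVE, the Iceberg tables (including patient and allergyintolerance) become queryable from Redshift as described next.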
Query AWS HealthLake data with Amazon Redshift

Amazon Redshift makes it straightforward to query the data stored in S3-based data lakes with automatic mounting of an AWS Glue Data Catalog in the Redshift query editor v2. This means you no longer have to create an external schema in Amazon Redshift to use the data lake tables cataloged in the Data Catalog. To get started with this feature, see Querying the AWS Glue Data Catalog.

After it is set up and you’re connected to the Redshift query editor v2, complete the following steps:

  1. Validate that your tables are visible in the query editor v2. The Data Catalog objects are listed under the awsdatacatalog database. FHIR data stored in AWS HealthLake is highly nested. To learn about how to un-nest semi-structured data with Amazon Redshift, see Tutorial: Querying nested data with Amazon Redshift Spectrum.
  2. Use the following query to un-nest the allergyintolerance and patient tables, join them together, and get patient details and their allergies:

     WITH patient_allergy AS (
         SELECT resourcetype,
                c AS allery_category,
                a."patient"."reference",
                SUBSTRING(a."patient"."reference", 9, LEN(a."patient"."reference")) AS patient_id,
                a.recordeddate AS allergy_record_date,
                NVL(cd."code", 'NA') AS allergy_code,
                NVL(cd.display, 'NA') AS allergy_description
         FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."allergyintolerance" a
         LEFT JOIN a.category c ON TRUE
         LEFT JOIN a.reaction r ON TRUE
         LEFT JOIN r.manifestation m ON TRUE
         LEFT JOIN m.coding cd ON TRUE
     ),
     patinet_info AS (
         SELECT id,
                gender,
                g AS given_name,
                n.family AS family_name,
                pr AS prefix
         FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."patient" p
         LEFT JOIN p.name n ON TRUE
         LEFT JOIN n.given g ON TRUE
         LEFT JOIN n.prefix pr ON TRUE
     )
     SELECT DISTINCT p.id, p.gender, p.prefix, p.given_name, p.family_name,
            pa.allery_category, pa.allergy_code, pa.allergy_description
     FROM patient_allergy pa
     JOIN patinet_info p ON pa.patient_id = p.id
     ORDER BY p.id, pa.allergy_code;

To eliminate the need for Amazon Redshift to un-nest data every time a query is run, you can create a materialized view to hold un-nested and flattened data. Materialized views are an effective mechanism to deal with complex and repeating queries. They contain a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database. Use the following SQL to create a materialized view.
You use it later to build a complete view of patients:

     CREATE MATERIALIZED VIEW patient_allergy_info AUTO REFRESH YES AS
     WITH patient_allergy AS (
         SELECT resourcetype,
                c AS allery_category,
                a."patient"."reference",
                SUBSTRING(a."patient"."reference", 9, LEN(a."patient"."reference")) AS patient_id,
                a.recordeddate AS allergy_record_date,
                NVL(cd."code", 'NA') AS allergy_code,
                NVL(cd.display, 'NA') AS allergy_description
         FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."allergyintolerance" a
         LEFT JOIN a.category c ON TRUE
         LEFT JOIN a.reaction r ON TRUE
         LEFT JOIN r.manifestation m ON TRUE
         LEFT JOIN m.coding cd ON TRUE
     ),
     patinet_info AS (
         SELECT id,
                gender,
                g AS given_name,
                n.family AS family_name,
                pr AS prefix
         FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."patient" p
         LEFT JOIN p.name n ON TRUE
         LEFT JOIN n.given g ON TRUE
         LEFT JOIN n.prefix pr ON TRUE
     )
     SELECT DISTINCT p.id, p.gender, p.prefix, p.given_name, p.family_name,
            pa.allery_category, pa.allergy_code, pa.allergy_description
     FROM patient_allergy pa
     JOIN patinet_info p ON pa.patient_id = p.id
     ORDER BY p.id, pa.allergy_code;

You have confirmed you can query data in AWS HealthLake via Amazon Redshift. Next, you set up zero-ETL integration between Amazon Redshift and Amazon Aurora MySQL.

Set up zero-ETL integration between Amazon Aurora MySQL and Redshift Serverless

Applications such as front-desk software, which are used to schedule appointments and register new patients, store data in OLTP databases such as Aurora. To get data out of OLTP databases and have them ready for analytics use cases, data teams might have to spend a considerable amount of time to build, test, and deploy ETL jobs that are complex to maintain and scale. With the Amazon Redshift zero-ETL integration with Amazon Aurora MySQL, you can run analytics on the data stored in OLTP databases and combine them with the rest of the data in Amazon Redshift and AWS HealthLake in near real time. In the next steps in this section, we connect to a MySQL database and set up zero-ETL integration with Amazon Redshift.

Connect to an Aurora MySQL database and set up data

Connect to your Aurora MySQL database using your editor of choice, using the AdminUsername and AdminPassword that you entered when running the CloudFormation stack. (For simplicity, it is the same for Amazon Redshift and Aurora.) When you’re connected to your database, complete the following steps:

  1. Create a new database by running the following command:

     CREATE DATABASE front_desk_app_db;

  2. Create a new table. This table simulates storing patient information as they visit clinics and other healthcare centers. For simplicity and to demonstrate specific capabilities, we assume that patient IDs are the same in AWS HealthLake and the front-of-office application. In real-world scenarios, this can be a hashed version of a national health care number:

     CREATE TABLE patient_appointment (
         patient_id varchar(250),
         gender varchar(1),
         date_of_birth date,
         appointment_datetime datetime,
         phone_number varchar(15),
         PRIMARY KEY (patient_id, appointment_datetime)
     );

     Having a primary key in the table is mandatory for zero-ETL integration to work.

  3. Insert new records into the source table in the Aurora MySQL database. To demonstrate the required functionalities, make sure the patient_id of the sample records inserted into the MySQL database match the ones in AWS HealthLake.
Replace [PATIENT_ID_1] and [PATIENT_ID_2] in the following query with the ones from the Redshift query you ran previously (the query that joined allergyintolerance and patient):

     INSERT INTO front_desk_app_db.patient_appointment (patient_id, gender, date_of_birth, appointment_datetime, phone_number)
     VALUES ([PATIENT_ID_1], 'F', '1988-7-04', '2023-12-19 10:15:00', '0401401401'),
            ([PATIENT_ID_1], 'F', '1988-7-04', '2023-09-19 11:00:00', '0401401401'),
            ([PATIENT_ID_1], 'F', '1988-7-04', '2023-06-06 14:30:00', '0401401401'),
            ([PATIENT_ID_2], 'F', '1972-11-14', '2023-12-19 08:15:00', '0401401402'),
            ([PATIENT_ID_2], 'F', '1972-11-14', '2023-01-09 12:15:00', '0401401402');

Now that your source table is populated with sample records, you can set up zero-ETL and have data ingested into Amazon Redshift.

Set up zero-ETL integration between Amazon Aurora MySQL and Amazon Redshift

Complete the following steps to create your zero-ETL integration:

  1. On the Amazon RDS console, choose Databases in the navigation pane.
  2. Choose the DB identifier of your cluster (not the instance).
  3. On the Zero-ETL Integration tab, choose Create zero-ETL integration.
  4. Follow the steps to create your integration.

Create a Redshift database from the integration

Next, you create a target database from the integration. You can do this by running a couple of simple SQL commands on Amazon Redshift. Log in to the query editor v2 and run the following commands:

  1. Get the integration ID of the zero-ETL you set up between your source database and Amazon Redshift:

     SELECT * FROM svv_integration;

  2. Create a database using the integration ID:

     CREATE DATABASE ztl_demo FROM INTEGRATION '[INTEGRATION_ID]';

  3. Query the database and validate that a new table is created and populated with data from your source MySQL database:

     SELECT * FROM ztl_demo.front_desk_app_db.patient_appointment;

It might take a few seconds for the first set of records to appear in Amazon Redshift. This shows that the integration is working as expected. To validate it further, you can insert a new record in your Aurora MySQL database, and it will be available in Amazon Redshift for querying in near real time within a few seconds.

Set up streaming ingestion for Amazon Redshift

Another aspect of zero-ETL on AWS, for real-time and streaming data, is realized through Amazon Redshift streaming ingestion. It provides low-latency, high-speed ingestion of streaming data from Kinesis Data Streams and Amazon MSK. It lowers the effort required to have data ready for analytics workloads, lowers the cost of running such workloads on the cloud, and decreases the operational burden of maintaining the solution.

In the context of healthcare, understanding an individual’s exercise and movement patterns can help with overall health assessment and better treatment planning. In this section, you send simulated data from wearable devices to Kinesis Data Streams and integrate it with the rest of the data you already have access to from your Redshift Serverless data warehouse. For step-by-step instructions, refer to Real-time analytics with Amazon Redshift streaming ingestion. Note the following steps when you set up streaming ingestion for Amazon Redshift:

  1. Select wearables_stream and use the following template when sending data to Amazon Kinesis Data Streams via the Kinesis Data Generator, to simulate data generated by wearable devices (a minimal boto3 alternative appears later in this item).
Replace [PATIENT_ID_1] and [PATIENT_ID_2] with the patient IDs you used earlier when inserting new records into your Aurora MySQL table:

     {
        "patient_id": "{{random.arrayElement(["[PATIENT_ID_1]","[PATIENT_ID_2]"])}}",
        "steps_increment": "{{random.arrayElement( [0,1] )}}",
        "heart_rate": {{random.number( { "min":45, "max":120} )}}
     }

  2. Create an external schema called from_kds by running the following query and replacing [IAM_ROLE_ARN] with the ARN of the role created by the CloudFormation stack (Patient360BlogRole):

     CREATE EXTERNAL SCHEMA from_kds
     FROM KINESIS
     IAM_ROLE '[IAM_ROLE_ARN]';

  3. Use the following SQL when creating a materialized view to consume data from the stream:

     CREATE MATERIALIZED VIEW patient_wearable_data AUTO REFRESH YES AS
     SELECT approximate_arrival_timestamp, JSON_PARSE(kinesis_data) AS Data
     FROM from_kds."wearables_stream"
     WHERE CAN_JSON_PARSE(kinesis_data);

  4. To validate that streaming ingestion works as expected, refresh the materialized view to get the data you already sent to the data stream and query the table to make sure data has landed in Amazon Redshift:

     REFRESH MATERIALIZED VIEW patient_wearable_data;

     SELECT * FROM patient_wearable_data
     ORDER BY approximate_arrival_timestamp DESC;

Query and analyze patient wearable data

The results in the data column of the preceding query are in JSON format. Amazon Redshift makes it straightforward to work with semi-structured data in JSON format. It uses the PartiQL language to offer SQL-compatible access to relational, semi-structured, and nested data. Use the following query to flatten the data:

     SELECT data."patient_id"::varchar AS patient_id,
            data."steps_increment"::integer AS steps_increment,
            data."heart_rate"::integer AS heart_rate,
            approximate_arrival_timestamp
     FROM patient_wearable_data
     ORDER BY approximate_arrival_timestamp DESC;

The result looks like the following screenshot. Now that you know how to flatten JSON data, you can analyze it further.
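As with the IoT example in the first item, you can substitute a small script for the KDG. The following minimal Python (boto3) sketch sends simulated wearable records matching the template above to the wearables_stream data stream; the patient ID placeholders must be replaced with real IDs from your HealthLake sample data.

# Minimal sketch: send simulated wearable records to the wearables_stream
# Kinesis data stream instead of using the KDG. Replace the placeholder
# patient IDs with IDs from your HealthLake sample data.
import json
import random

import boto3

STREAM_NAME = "wearables_stream"
PATIENT_IDS = ["[PATIENT_ID_1]", "[PATIENT_ID_2]"]  # placeholders

kinesis = boto3.client("kinesis")

for _ in range(50):
    record = {
        "patient_id": random.choice(PATIENT_IDS),
        "steps_increment": random.choice([0, 1]),
        "heart_rate": random.randint(45, 120),
    }
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["patient_id"],  # keep a patient's events on one shard
    )

After the materialized view refreshes, these records appear in patient_wearable_data just like the KDG-generated ones.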
Use the following query to get the number of minutes a patient has been physically active per day, based on their heart rate (greater than 80):

     WITH patient_wearble_flattened AS (
         SELECT data."patient_id"::varchar AS patient_id,
                data."steps_increment"::integer AS steps_increment,
                data."heart_rate"::integer AS heart_rate,
                approximate_arrival_timestamp,
                DATE(approximate_arrival_timestamp) AS date_received,
                extract(hour from approximate_arrival_timestamp) AS hour_received,
                extract(minute from approximate_arrival_timestamp) AS minute_received
         FROM patient_wearable_data
     ),
     patient_active_minutes AS (
         SELECT patient_id, date_received, hour_received, minute_received,
                avg(heart_rate) AS heart_rate
         FROM patient_wearble_flattened
         GROUP BY patient_id, date_received, hour_received, minute_received
         HAVING avg(heart_rate) > 80
     )
     SELECT patient_id, date_received, COUNT(heart_rate) AS active_minutes_count
     FROM patient_active_minutes
     GROUP BY patient_id, date_received
     ORDER BY patient_id, date_received;

Create a complete patient 360

Now that you are able to query all patient data with Redshift Serverless, you can combine the three datasets you used in this post and form a comprehensive patient 360 view with the following query:

     WITH patient_appointment_info AS (
         SELECT "patient_id", "gender", "date_of_birth", "appointment_datetime", "phone_number"
         FROM ztl_demo.front_desk_app_db.patient_appointment
     ),
     patient_wearble_flattened AS (
         SELECT data."patient_id"::varchar AS patient_id,
                data."steps_increment"::integer AS steps_increment,
                data."heart_rate"::integer AS heart_rate,
                approximate_arrival_timestamp,
                DATE(approximate_arrival_timestamp) AS date_received,
                extract(hour from approximate_arrival_timestamp) AS hour_received,
                extract(minute from approximate_arrival_timestamp) AS minute_received
         FROM patient_wearable_data
     ),
     patient_active_minutes AS (
         SELECT patient_id, date_received, hour_received, minute_received,
                avg(heart_rate) AS heart_rate
         FROM patient_wearble_flattened
         GROUP BY patient_id, date_received, hour_received, minute_received
         HAVING avg(heart_rate) > 80
     ),
     patient_active_minutes_count AS (
         SELECT patient_id, date_received, COUNT(heart_rate) AS active_minutes_count
         FROM patient_active_minutes
         GROUP BY patient_id, date_received
     )
     SELECT pai.patient_id, pai.gender, pai.prefix, pai.given_name, pai.family_name,
            pai.allery_category, pai.allergy_code, pai.allergy_description,
            ppi.date_of_birth, ppi.appointment_datetime, ppi.phone_number,
            pamc.date_received, pamc.active_minutes_count
     FROM patient_allergy_info pai
     LEFT JOIN patient_active_minutes_count pamc ON pai.patient_id = pamc.patient_id
     LEFT JOIN patient_appointment_info ppi ON pai.patient_id = ppi.patient_id
     GROUP BY pai.patient_id, pai.gender, pai.prefix, pai.given_name, pai.family_name,
              pai.allery_category, pai.allergy_code, pai.allergy_description,
              ppi.date_of_birth, ppi.appointment_datetime, ppi.phone_number,
              pamc.date_received, pamc.active_minutes_count
     ORDER BY pai.patient_id, pai.gender, pai.prefix, pai.given_name, pai.family_name,
              pai.allery_category, pai.allergy_code, pai.allergy_description,
              ppi.date_of_birth DESC, ppi.appointment_datetime DESC, ppi.phone_number DESC,
              pamc.date_received, pamc.active_minutes_count;

You can use the solution and queries used here to expand the datasets used in your analysis. For example, you can include other tables from AWS HealthLake as needed.

Clean up

To clean up resources you created, complete the following steps (a boto3 sketch of these deletions appears at the end of this item):

  1. Delete the zero-ETL integration between Amazon RDS and Amazon Redshift.
  2. Delete the CloudFormation stack.
  3. Delete the AWS HealthLake data store.

Conclusion

Forming a comprehensive 360 view of patients by integrating data from various sources offers numerous benefits for organizations operating in the healthcare industry. It enables healthcare providers to gain a holistic understanding of a patient’s medical journey, enhances clinical decision-making, and allows for more accurate diagnosis and tailored treatment plans. With zero-ETL features for data integration on AWS, it is effortless to build a view of patients securely, cost-effectively, and with minimal effort.

You can then use visualization tools such as Amazon QuickSight to build dashboards or use Amazon Redshift ML to enable data analysts and database developers to train machine learning (ML) models with the data integrated through Amazon Redshift zero-ETL. The result is a set of ML models that are trained with a broader view into patients, their medical history, and their lifestyle, and therefore enable you to make more accurate predictions about their upcoming health needs.

About the Authors

Saeed Barghi is a Sr. Analytics Specialist Solutions Architect specializing in architecting enterprise data platforms. He has extensive experience in the fields of data warehousing, data engineering, data lakes, and AI/ML. Based in Melbourne, Australia, Saeed works with public sector customers in Australia and New Zealand.

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 17 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

View the full article
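The cleanup steps above can also be scripted. The following minimal Python (boto3) sketch is an assumption-laden illustration: the identifiers are placeholders you would replace with your own, and the RDS DeleteIntegration call assumes a recent boto3 version that includes the Aurora zero-ETL integration APIs.

# Minimal cleanup sketch. Identifiers are placeholders; deleting these
# resources is irreversible, so double-check them before running.
import boto3

rds = boto3.client("rds")
cloudformation = boto3.client("cloudformation")
healthlake = boto3.client("healthlake")

# 1. Delete the Aurora zero-ETL integration with Amazon Redshift.
rds.delete_integration(IntegrationIdentifier="<integration-id-or-arn>")

# 2. Delete the CloudFormation stack that created the demo resources.
cloudformation.delete_stack(StackName="<patient-360-stack-name>")

# 3. Delete the AWS HealthLake data store.
healthlake.delete_fhir_datastore(DatastoreId="<healthlake-datastore-id>")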
  5. Foundation models (FMs) are large machine learning (ML) models trained on a broad spectrum of unlabeled and generalized datasets. FMs, as the name suggests, provide the foundation to build more specialized downstream applications, and are unique in their adaptability. They can perform a wide range of different tasks, such as natural language processing, classifying images, forecasting trends, analyzing sentiment, and answering questions. This scale and general-purpose adaptability are what make FMs different from traditional ML models. FMs are multimodal; they work with different data types such as text, video, audio, and images. Large language models (LLMs) are a type of FM and are pre-trained on vast amounts of text data and typically have application uses such as text generation, intelligent chatbots, or summarization.

Streaming data facilitates the constant flow of diverse and up-to-date information, enhancing the models’ ability to adapt and generate more accurate, contextually relevant outputs. This dynamic integration of streaming data enables generative AI applications to respond promptly to changing conditions, improving their adaptability and overall performance in various tasks.

To better understand this, imagine a chatbot that helps travelers book their travel. In this scenario, the chatbot needs real-time access to airline inventory, flight status, hotel inventory, latest price changes, and more. This data usually comes from third parties, and developers need to find a way to ingest this data and process the data changes as they happen. Batch processing is not the best fit in this scenario. When data changes rapidly, processing it in a batch may result in stale data being used by the chatbot, providing inaccurate information to the customer, which impacts the overall customer experience. Stream processing, however, can enable the chatbot to access real-time data and adapt to changes in availability and price, providing the best guidance to the customer and enhancing the customer experience.

Another example is an AI-driven observability and monitoring solution where FMs monitor real-time internal metrics of a system and produce alerts. When the model finds an anomaly or abnormal metric value, it should immediately produce an alert and notify the operator. However, the value of such important data diminishes significantly over time. These notifications should ideally be received within seconds, or even as the anomaly is happening. If operators receive these notifications minutes or hours after they happened, such an insight is not actionable and has potentially lost its value. You can find similar use cases in other industries such as retail, car manufacturing, energy, and the financial industry.

In this post, we discuss why data streaming is a crucial component of generative AI applications due to its real-time nature. We discuss the value of AWS data streaming services such as Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams, Amazon Managed Service for Apache Flink, and Amazon Kinesis Data Firehose in building generative AI applications.

In-context learning

LLMs are trained with point-in-time data and have no inherent ability to access fresh data at inference time. As new data appears, you will have to continuously fine-tune or further train the model. This is not only an expensive operation, but also very limiting in practice because the rate of new data generation far outpaces the speed of fine-tuning.
Additionally, LLMs lack contextual understanding and rely solely on their training data, and are therefore prone to hallucinations. This means they can generate a fluent, coherent, and syntactically sound but factually incorrect response. They are also devoid of relevance, personalization, and context.

LLMs, however, have the capacity to learn from the data they receive from the context to more accurately respond without modifying the model weights. This is called in-context learning, and can be used to produce personalized answers or provide an accurate response in the context of organization policies.

For example, in a chatbot, data events could pertain to an inventory of flights and hotels or price changes that are constantly ingested to a streaming storage engine. Furthermore, data events are filtered, enriched, and transformed to a consumable format using a stream processor. The result is made available to the application by querying the latest snapshot. The snapshot constantly updates through stream processing; therefore, the up-to-date data is provided in the context of a user prompt to the model. This allows the model to adapt to the latest changes in price and availability. The following diagram illustrates a basic in-context learning workflow.

A commonly used in-context learning approach is to use a technique called Retrieval Augmented Generation (RAG). In RAG, you provide the relevant information such as the most relevant policy and customer records along with the user question to the prompt. This way, the LLM generates an answer to the user question using additional information provided as context; a short prompt-assembly sketch below illustrates this. To learn more about RAG, refer to Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart.

A RAG-based generative AI application can only produce generic responses based on its training data and the relevant documents in the knowledge base. This solution falls short when a near-real-time personalized response is expected from the application. For example, a travel chatbot is expected to consider the user’s current bookings, available hotel and flight inventory, and more. Moreover, the relevant customer personal data (commonly known as the unified customer profile) is usually subject to change. If a batch process is employed to update the generative AI’s user profile database, the customer may receive dissatisfying responses based on old data.

In this post, we discuss the application of stream processing to enhance a RAG solution used for building question answering agents with context from real-time access to unified customer profiles and an organizational knowledge base.

Near-real-time customer profile updates

Customer records are typically distributed across data stores within an organization. For your generative AI application to provide a relevant, accurate, and up-to-date customer profile, it is vital to build streaming data pipelines that can perform identity resolution and profile aggregation across the distributed data stores. Streaming jobs constantly ingest new data to synchronize across systems and can perform enrichment, transformations, joins, and aggregations across windows of time more efficiently. Change data capture (CDC) events contain information about the source record, updates, and metadata such as time, source, classification (insert, update, or delete), and the initiator of the change. The following diagram illustrates an example workflow for CDC streaming ingestion and processing for unified customer profiles.
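To make the RAG flow described above concrete, here is a minimal Python sketch of assembling a prompt from retrieved context before calling an LLM. The helper functions (get_customer_profile, search_knowledge_base, call_llm) are hypothetical stand-ins for whatever profile store, vector engine, and model endpoint you use; only the prompt-construction pattern is the point.

# Minimal RAG prompt-assembly sketch (illustrative only). The helpers below
# return canned data; in a real application they would query the unified
# profile store, a vector engine, and a model endpoint (e.g. Bedrock/SageMaker).
from typing import List


def get_customer_profile(customer_id: str) -> dict:
    # Stand-in for a near-real-time lookup against the unified profile store.
    return {"customer_id": customer_id, "tier": "gold", "upcoming_trips": 1}


def search_knowledge_base(question: str, top_k: int = 3) -> List[str]:
    # Stand-in for a vector search over embedded policy/knowledge documents.
    return ["Rebooking is free within 24 hours of purchase."][:top_k]


def call_llm(prompt: str) -> str:
    # Stand-in for the model invocation; replace with your endpoint call.
    return "(model response would appear here)"


def answer(customer_id: str, question: str) -> str:
    profile = get_customer_profile(customer_id)      # fresh, streaming-fed context
    documents = search_knowledge_base(question)      # relevant knowledge chunks
    prompt = (
        "Answer the customer's question using only the context below.\n\n"
        f"Customer profile:\n{profile}\n\n"
        "Relevant documents:\n" + "\n---\n".join(documents) + "\n\n"
        f"Question: {question}\nAnswer:"
    )
    return call_llm(prompt)


print(answer("cust-1", "Can I change my flight?"))

Because the profile and documents are refreshed by streaming pipelines rather than batch jobs, the context injected here stays close to the current state of the source systems.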
In this section, we discuss the main components of a CDC streaming pattern required to support RAG-based generative AI applications.

CDC streaming ingestion

A CDC replicator is a process that collects data changes from a source system (usually by reading transaction logs or binlogs) and writes CDC events, in the exact order they occurred, to a streaming data stream or topic. This involves a log-based capture with tools such as AWS Database Migration Service (AWS DMS) or open source connectors such as Debezium for Apache Kafka Connect. Apache Kafka Connect is part of the Apache Kafka environment, allowing data to be ingested from various sources and delivered to a variety of destinations. You can run your Apache Kafka connector on Amazon MSK Connect within minutes without worrying about configuration, setup, and operating an Apache Kafka cluster. You only need to upload your connector’s compiled code to Amazon Simple Storage Service (Amazon S3) and set up your connector with your workload’s specific configuration.

There are also other methods for capturing data changes. For example, Amazon DynamoDB provides a feature for streaming CDC data to Amazon DynamoDB Streams or Kinesis Data Streams. Amazon S3 provides a trigger to invoke an AWS Lambda function when a new document is stored.

Streaming storage

Streaming storage functions as an intermediate buffer to store CDC events before they get processed. Streaming storage provides reliable storage for streaming data. By design, it is highly available and resilient to hardware or node failures and maintains the order of the events as they are written. Streaming storage can store data events either permanently or for a set period of time. This allows stream processors to read from part of the stream if there is a failure or a need for re-processing. Kinesis Data Streams is a serverless streaming data service that makes it straightforward to capture, process, and store data streams at scale. Amazon MSK is a fully managed, highly available, and secure service provided by AWS for running Apache Kafka.

Stream processing

Stream processing systems should be designed for parallelism to handle high data throughput. They should partition the input stream between multiple tasks running on multiple compute nodes. Tasks should be able to send the result of one operation to the next one over the network, making it possible to process data in parallel while performing operations such as joins, filtering, enrichment, and aggregations. Stream processing applications should be able to process events with regard to event time, for use cases where events could arrive late or where correct computation relies on the time events occur rather than the system time. For more information, refer to Notions of Time: Event Time and Processing Time.

Stream processors continuously produce results in the form of data events that need to be output to a target system. A target system could be any system that can integrate directly with the process or via streaming storage as an intermediary. Depending on the framework you choose for stream processing, you will have different options for target systems depending on available sink connectors. If you decide to write the results to an intermediary streaming storage, you can build a separate process that reads events and applies changes to the target system, such as running an Apache Kafka sink connector. Regardless of which option you choose, CDC data needs extra handling due to its nature.
Because CDC events carry information about updates or deletes, it's important that they are merged into the target system in the right order. If changes are applied in the wrong order, the target system will be out of sync with its source.

Apache Flink is a powerful stream processing framework known for its low latency and high throughput. It supports event time processing, exactly-once processing semantics, and high fault tolerance. Additionally, it provides native support for CDC data via a special structure called dynamic tables. Dynamic tables mimic the source database tables and provide a columnar representation of the streaming data. The data in dynamic tables changes with every event that is processed. New records can be appended, updated, or deleted at any time. Dynamic tables abstract away the extra logic you would otherwise need to implement for each record operation (insert, update, delete) separately (see the sketch at the end of this section). For more information, refer to Dynamic Tables. With Amazon Managed Service for Apache Flink, you can run Apache Flink jobs and integrate with other AWS services. There are no servers or clusters to manage, and there is no compute and storage infrastructure to set up.

AWS Glue is a fully managed extract, transform, and load (ETL) service, which means AWS handles the infrastructure provisioning, scaling, and maintenance for you. Although it's primarily known for its ETL capabilities, AWS Glue can also be used for Spark streaming applications. AWS Glue can interact with streaming data services such as Kinesis Data Streams and Amazon MSK for processing and transforming CDC data. It can also seamlessly integrate with other AWS services such as Lambda, AWS Step Functions, and DynamoDB, providing you with a comprehensive ecosystem for building and managing data processing pipelines.

Unified customer profile

Unifying the customer profile across a variety of source systems requires robust data pipelines that can bring all records together and synchronize them into one data store. This data store provides your organization with the holistic view of customer records that RAG-based generative AI applications need in order to operate efficiently. An unstructured data store is typically the best fit for building it.

An identity graph is a useful structure for creating a unified customer profile because it consolidates and integrates customer data from various sources, ensures data accuracy and deduplication, offers real-time updates, connects insights across systems, enables personalization, enhances customer experience, and supports regulatory compliance. This unified customer profile empowers the generative AI application to understand and engage with customers effectively and to adhere to data privacy regulations, ultimately enhancing customer experiences and driving business growth. You can build your identity graph solution using Amazon Neptune, a fast, reliable, fully managed graph database service. AWS also provides other managed and serverless NoSQL storage services for unstructured key-value objects. Amazon DocumentDB (with MongoDB compatibility) is a fast, scalable, highly available, and fully managed enterprise document database service that supports native JSON workloads. DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.
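To illustrate the dynamic table support referenced earlier, the following PyFlink sketch declares a table over a Debezium CDC topic so that inserts, updates, and deletes arrive as changelog rows, and downstream queries always reflect the latest state per key. The broker address, topic, and schema are placeholders, and the Kafka connector with the debezium-json format must be available on the job's classpath; this is a sketch under those assumptions.

```python
# Sketch: interpret Debezium CDC records as a Flink dynamic table, so inserts,
# updates, and deletes are merged for you instead of handled record by record.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# The 'debezium-json' format turns each CDC event into an insert/update/delete
# changelog row on the dynamic table.
t_env.execute_sql("""
    CREATE TABLE customer_changes (
        customer_id STRING,
        email       STRING,
        tier        STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'crm.public.customers',
        'properties.bootstrap.servers' = 'my-msk-broker:9092',
        'format' = 'debezium-json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Downstream queries see a continuously updated result; merge logic is handled
# by the changelog semantics of the dynamic table.
t_env.execute_sql("""
    SELECT tier, COUNT(*) AS customers
    FROM customer_changes
    GROUP BY tier
""").print()
```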
Near-real-time organizational knowledge base updates

Similar to customer records, internal knowledge repositories such as company policies and organizational documents are siloed across storage systems. This is typically unstructured data, and it is updated in a non-incremental fashion. Unstructured data becomes usable for AI applications through vector embeddings, a technique that represents high-dimensional data such as text files, images, and audio files as multi-dimensional numeric vectors.

AWS provides several vector engine services, such as Amazon OpenSearch Serverless, Amazon Kendra, and Amazon Aurora PostgreSQL-Compatible Edition with the pgvector extension, for storing vector embeddings. Generative AI applications can enhance the user experience by transforming the user prompt into a vector and using it to query the vector engine to retrieve contextually relevant information. Both the prompt and the retrieved vector data are then passed to the LLM to receive a more precise and personalized response. The following diagram illustrates an example stream-processing workflow for vector embeddings.

Knowledge base contents need to be converted to vector embeddings before being written to the vector data store. Amazon Bedrock or Amazon SageMaker can help you access the model of your choice and expose a private endpoint for this conversion. Furthermore, you can use libraries such as LangChain to integrate with these endpoints. Building a batch process can help you convert your knowledge base content to vector data and store it in a vector database initially. However, you then need to rely on an interval to reprocess the documents and synchronize your vector database with changes in your knowledge base content. With a large number of documents, this process can be inefficient. Between these intervals, your generative AI application users will receive answers based on the old content, or will receive an inaccurate answer because the new content is not vectorized yet.

Stream processing is an ideal solution for these challenges. It initially produces events for the existing documents, then monitors the source system and creates a document change event as soon as a change occurs. These events can be stored in streaming storage and wait to be processed by a streaming job. A streaming job reads these events, loads the content of the document, and transforms the content into an array of related tokens of words. Each token is further transformed into vector data via an API call to an embedding FM. Results are sent for storage to the vector store via a sink operator (a sketch of this embedding step follows at the end of this section).

If you're using Amazon S3 for storing your documents, you can build an event-source architecture based on S3 object change triggers for Lambda. A Lambda function can create an event in the desired format and write it to your streaming storage. You can also use Apache Flink to run as a streaming job. Apache Flink provides the native FileSystem source connector, which can discover existing files and read their contents initially. After that, it can continuously monitor your file system for new files and capture their content. The connector supports reading a set of files from distributed file systems such as Amazon S3 or HDFS in formats such as plain text, Avro, CSV, Parquet, and more, and produces a streaming record.
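The following Python sketch illustrates the per-document embedding step that such a streaming job (or Lambda function) would run for each document-change event. The model ID, index name, and endpoint are assumptions; the request and response fields are shown for an Amazon Titan embedding model and may differ for other models, and authentication to the vector store is omitted for brevity.

```python
# Sketch of the per-document embedding step for a document-change event.
import json
import boto3
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime")
vector_store = OpenSearch(
    hosts=[{"host": "my-search-endpoint", "port": 443}],  # placeholder endpoint
    use_ssl=True,
)

def embed(text: str) -> list:
    # Titan embedding request/response format; an assumption for illustration.
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def handle_document_change(doc_id: str, chunk: str) -> None:
    # Re-embedding on every change event keeps the vector store in sync with
    # the knowledge base, instead of waiting for a batch re-processing window.
    vector_store.index(
        index="knowledge-base",
        id=doc_id,
        body={"content": chunk, "embedding": embed(chunk)},
    )
```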
As a fully managed service, Managed Service for Apache Flink removes the operational overhead of deploying and maintaining Flink jobs, allowing you to focus on building and scaling your streaming applications. With seamless integration into AWS streaming services such as Amazon MSK or Kinesis Data Streams, it provides features like automatic scaling, security, and resiliency, delivering reliable and efficient Flink applications for handling real-time streaming data.

Based on your DevOps preference, you can choose between Kinesis Data Streams or Amazon MSK for storing the streaming records. Kinesis Data Streams simplifies the complexities of building and managing custom streaming data applications, allowing you to focus on deriving insights from your data rather than on infrastructure maintenance. Customers using Apache Kafka often opt for Amazon MSK due to its straightforwardness, scalability, and dependability in overseeing Apache Kafka clusters within the AWS environment. As a fully managed service, Amazon MSK takes on the operational complexities associated with deploying and maintaining Apache Kafka clusters, enabling you to concentrate on constructing and expanding your streaming applications.

Because a RESTful API integration suits the nature of this process, you need a framework that supports a stateful enrichment pattern via RESTful API calls, so that failures are tracked and failed requests are retried. Apache Flink, again, is a framework that can perform stateful operations at in-memory speed. To understand the best ways to make API calls via Apache Flink, refer to Common streaming data enrichment patterns in Amazon Kinesis Data Analytics for Apache Flink. Apache Flink provides native sink connectors for writing data to vector data stores such as Amazon Aurora for PostgreSQL with pgvector or Amazon OpenSearch Service with VectorDB. Alternatively, you can stage the Flink job's output (vectorized data) in an MSK topic or a Kinesis data stream. OpenSearch Service provides support for native ingestion from Kinesis data streams or MSK topics. For more information, refer to Introducing Amazon MSK as a source for Amazon OpenSearch Ingestion and Loading streaming data from Amazon Kinesis Data Streams.

Feedback analytics and fine-tuning

It's important for data operation managers and AI/ML developers to get insight into the performance of the generative AI application and the FMs in use. To achieve that, you need to build data pipelines that calculate important key performance indicator (KPI) data based on user feedback and a variety of application logs and metrics. This information helps stakeholders gain real-time insight into the performance of the FM, the application, and overall user satisfaction with the quality of support they receive from your application. You also need to collect and store the conversation history for further fine-tuning of your FMs to improve their ability to perform domain-specific tasks.

This use case fits very well in the streaming analytics domain. Your application should store each conversation in streaming storage. Your application can prompt users for their rating of each answer's accuracy and for their overall satisfaction. This data can be in the form of a binary choice or free-form text. It can be stored in a Kinesis data stream or an MSK topic and processed to generate KPIs in real time (a sketch of the capture side follows at the end of this section). You can also put FMs to work for user sentiment analysis: FMs can analyze each answer and assign a category of user satisfaction.
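As a sketch of that capture side, the following Python snippet writes each answer's feedback as an event to a Kinesis data stream so it can be aggregated into KPIs downstream. The stream name and event fields are illustrative assumptions.

```python
# Sketch: capture each answer's user feedback as an event on a Kinesis data
# stream for downstream KPI aggregation. Stream and field names are placeholders.
import json
import time
import boto3

kinesis = boto3.client("kinesis")

def record_feedback(session_id: str, question: str, answer: str,
                    rating: int, comment: str = "") -> None:
    event = {
        "session_id": session_id,
        "question": question,
        "answer": answer,
        "rating": rating,            # e.g. 1 = helpful, 0 = not helpful
        "comment": comment,          # optional free-form text
        "event_time": int(time.time() * 1000),
    }
    # Partitioning by session keeps a conversation's events ordered together.
    kinesis.put_record(
        StreamName="chatbot-feedback",
        Data=json.dumps(event),
        PartitionKey=session_id,
    )
```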
Apache Flink's architecture allows for complex data aggregation over windows of time, and it provides support for SQL querying over streams of data events. Therefore, by using Apache Flink, you can quickly analyze raw user inputs and generate KPIs in real time by writing familiar SQL queries (the sketch at the end of this section shows one such windowed query). For more information, refer to Table API & SQL.

With Amazon Managed Service for Apache Flink Studio, you can build and run Apache Flink stream processing applications using standard SQL, Python, and Scala in an interactive notebook. Studio notebooks are powered by Apache Zeppelin and use Apache Flink as the stream processing engine. Studio notebooks seamlessly combine these technologies to make advanced analytics on data streams accessible to developers of all skill sets. With support for user-defined functions (UDFs), Apache Flink allows you to build custom operators that integrate with external resources such as FMs for performing complex tasks such as sentiment analysis. You can use UDFs to compute various metrics or enrich raw user feedback data with additional insights such as user sentiment. To learn more about this pattern, refer to Proactively addressing customer concern in real-time with GenAI, Flink, Apache Kafka, and Kinesis.

With Managed Service for Apache Flink Studio, you can deploy your Studio notebook as a streaming job with one click. You can use native sink connectors provided by Apache Flink to send the output to your storage of choice or stage it in a Kinesis data stream or MSK topic. Amazon Redshift and OpenSearch Service are both ideal for storing analytical data. Both engines provide native ingestion support from Kinesis Data Streams and Amazon MSK via a separate streaming pipeline to a data lake or data warehouse for analysis. Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses and data lakes, using AWS-designed hardware and machine learning to deliver the best price-performance at scale. OpenSearch Service offers visualization capabilities powered by OpenSearch Dashboards and Kibana (versions 1.5 to 7.10).

You can use the outcome of such analysis, combined with user prompt data, to fine-tune the FM when needed. SageMaker is the most straightforward way to fine-tune your FMs. Using Amazon S3 with SageMaker provides a powerful and seamless integration for fine-tuning your models. Amazon S3 serves as a scalable and durable object storage solution, enabling straightforward storage and retrieval of large datasets, training data, and model artifacts. SageMaker is a fully managed ML service that simplifies the entire ML lifecycle. By using Amazon S3 as the storage backend for SageMaker, you benefit from the scalability, reliability, and cost-effectiveness of Amazon S3 while seamlessly integrating it with SageMaker training and deployment capabilities. This combination enables efficient data management, facilitates collaborative model development, and makes sure that ML workflows are streamlined and scalable, ultimately enhancing the overall agility and performance of the ML process. For more information, refer to Fine-tune Falcon 7B and other LLMs on Amazon SageMaker with @remote decorator.

With a file system sink connector, Apache Flink jobs can deliver data to Amazon S3 as data objects in open formats (such as JSON, Avro, Parquet, and more). If you prefer to manage your data lake using a transactional data lake framework (such as Apache Hudi, Apache Iceberg, or Delta Lake), all of these frameworks provide a custom connector for Apache Flink. For more details, refer to Create a low-latency source-to-data lake pipeline using Amazon MSK Connect, Apache Flink, and Apache Hudi.
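The following PyFlink sketch shows one such windowed KPI query over the feedback stream, in the spirit of a Managed Service for Apache Flink Studio notebook. The stream name, region, and schema are placeholders, and the Kinesis SQL connector must be available to the job; treat it as a sketch under those assumptions.

```python
# Sketch of a windowed KPI query over the chatbot feedback stream.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE feedback (
        session_id STRING,
        rating     INT,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'chatbot-feedback',
        'aws.region' = 'us-east-1',
        'format' = 'json',
        'scan.stream.initpos' = 'LATEST'
    )
""")

-- One KPI row per 1-minute window: answer volume and share of positive ratings.
""" if False else None  # (comment marker above is SQL-style; the query follows)

t_env.execute_sql("""
    SELECT
        TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
        COUNT(*)                                      AS answers,
        AVG(CAST(rating AS DOUBLE))                   AS satisfaction_rate
    FROM feedback
    GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE)
""").print()
```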
Summary

For a generative AI application based on a RAG model, you need to consider building two data storage systems, and you need to build data operations that keep them up to date with all the source systems. Traditional batch jobs are not sufficient to process the size and diversity of the data you need to integrate with your generative AI application. Delays in processing the changes in source systems result in inaccurate responses and reduce the efficiency of your generative AI application. Data streaming enables you to ingest data from a variety of databases across various systems. It also allows you to transform, enrich, join, and aggregate data across many sources efficiently in near-real time. Data streaming provides a simplified data architecture to collect and transform users' real-time reactions or comments on the application responses, helping you deliver and store the results in a data lake for model fine-tuning. Data streaming also helps you optimize data pipelines by processing only the change events, allowing you to respond to data changes more quickly and efficiently.

Learn more about AWS data streaming services and get started building your own data streaming solution.

About the Authors

Ali Alemi is a Streaming Specialist Solutions Architect at AWS. Ali advises AWS customers with architectural best practices and helps them design real-time analytics data systems that are reliable, secure, efficient, and cost-effective. He works backward from customers' use cases and designs data solutions to solve their business problems. Prior to joining AWS, Ali supported several public sector customers and AWS consulting partners in their application modernization journey and migration to the Cloud.

Imtiaz (Taz) Sayed is the World-Wide Tech Leader for Analytics at AWS. He enjoys engaging with the community on all things data and analytics. He can be reached via LinkedIn.

View the full article
  6. Over the past few years, Apache Kafka has emerged as the leading standard for streaming data. Fast-forward to the present day: Kafka has achieved ubiquity, being adopted by at least 80% of the Fortune 100. This widespread adoption is attributed to Kafka's architecture, which goes far beyond basic messaging. The versatility of Kafka's architecture makes it exceptionally suitable for streaming data at vast "internet" scale, ensuring the fault tolerance and data consistency crucial for supporting mission-critical applications. Flink is a high-throughput, unified batch and stream processing engine, renowned for its capability to handle continuous data streams at scale. It seamlessly integrates with Kafka and offers robust support for exactly-once semantics, ensuring each event is processed precisely once, even amid system failures. Flink is therefore a natural choice as a stream processor for Kafka. While Apache Flink enjoys significant success and popularity as a tool for real-time data processing, accessing sufficient resources and current examples for learning Flink can be challenging. View the full article
  7. StreamNative, a leading Apache Pulsar-based real-time data platform solutions provider, and Databricks, the Data Intelligence Platform, are thrilled to announce the enhanced Pulsar-Spark... View the full article
  8. Discover Real-time with our latest whitepaper: A CTO's Guide to Real-time Linux

Welcome to this two-part blog series on Linux vs RTOS (Real-time Operating System). This series will explain the differences between achieving real-time requirements with a Linux kernel and doing the same with an RTOS. Part I will explain the basics behind a real-time capable kernel running Linux vs RTOS. In Part II, we will delve deeper into the pros and cons of each approach and how to make an informed comparison between the two. Let's get started.

What is real-time all about?

The demand for real-time capabilities is rising in the ever-evolving computing landscape, with forecasts indicating that nearly 30% of the world's data will require real-time processing by 2025 [1]. To understand how to meet real-time requirements, it helps to have a clear definition of the concept. "Real timeliness" is the ability of an Operating System (OS) to provide a required level of service in a bounded response time [2]. In a real-time system, the correctness of a computation depends not only on the logical correctness of the result, but also on the time at which it is produced [3]. On the software side, alternative approaches are available to fulfil real-time requirements. Which one should your business adopt? Where should enterprises land on the Linux vs RTOS debate?

Real-time with Linux

Real-time computing via the Linux kernel is gaining traction as a valuable solution, especially in systems running the latest silicon, with precision and deterministic response times. The concept of preemption lies at the core of real-time Linux. It involves interrupting the current thread of execution to process higher-priority events promptly. Deterministic response times are unattainable in Linux without kernel preemption. PREEMPT_RT, hosted at the Linux Foundation, is the de-facto Linux real-time implementation. While it doesn't aim for the lowest latencies possible, it introduces mechanisms like priority inheritance and replaces locking primitives, making the Linux kernel preemptible with deterministic response times. Real-time Ubuntu, with the out-of-tree PREEMPT_RT patches, brings real-time capabilities to the forefront. Offering the reduced kernel latencies required by demanding workloads, Real-time Ubuntu provides a time-predictable task execution environment. Furthermore, Canonical has supported the Real-time Ubuntu Linux kernel for over 10 years, enabling device manufacturers to focus on their business drivers.

Meeting real-time requirements with an RTOS

An RTOS provides an alternative to real-time Linux solutions. Unlike general-purpose operating systems, RTOSes focus on deterministic response times and precise control over task scheduling. Enterprises often use an RTOS in mission-critical scenarios with extreme latency-dependent use cases, where a missed deadline results in system failure. RTOSes excel at managing task priorities, allowing critical tasks to take precedence over less time-sensitive processes. This prioritisation is crucial in scenarios where the system must guarantee that operations will occur within a specific timeframe. On the other hand, an RTOS' suitability for applications in critical systems, where any form of system failure is intolerable, may come with some drawbacks, which we will touch upon in Part II.
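To make the preemption discussion a little more tangible before comparing the two approaches, here is a small Python sketch of how a latency-sensitive task typically asks a PREEMPT_RT kernel for deterministic scheduling. The /sys/kernel/realtime check and the priority value are assumptions about a typical Real-time Ubuntu setup, and running it requires appropriate privileges.

```python
# Sketch: request real-time scheduling for a latency-sensitive task on a
# PREEMPT_RT kernel. Needs root/CAP_SYS_NICE; the priority and the /sys check
# are illustrative assumptions.
import os

def is_preempt_rt() -> bool:
    """PREEMPT_RT kernels typically expose /sys/kernel/realtime containing '1'."""
    try:
        with open("/sys/kernel/realtime") as f:
            return f.read().strip() == "1"
    except OSError:
        return False

def enter_realtime(priority: int = 80) -> None:
    # SCHED_FIFO lets this task preempt normal (SCHED_OTHER) work, which is
    # what gives bounded response times on a preemptible kernel.
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))

if __name__ == "__main__":
    print("PREEMPT_RT kernel:", is_preempt_rt())
    enter_realtime()
    # ... time-critical loop would run here ...
```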
Is it always possible to achieve the "best performance" with either a real-time capable Linux kernel or an RTOS? And can development teams rest assured that there is one clear choice between the two approaches depending on their systems' requirements? Let's find out.

Reducing latency with Linux vs RTOS

A real-time capable Linux kernel does not guarantee maximum latency, because performance strictly depends on the system at hand. From networking to cache partitioning, every shared resource can affect cycle times and be a source of jitter. Every level, from the hardware to the kernel and the application, can be a source of latency. Similarly, even the most efficient RTOS can be useless in the presence of other latency sinks. Specific tuning for each use case is required, and an optimal combination of tuning configs for particular hardware may still lead to poor results in a different environment. It follows that one can't guarantee a maximum latency "in the abstract", as performance strictly depends on the specific system.

Making an informed choice between real-time Linux vs RTOS

The choice between an RTOS and real-time Linux hinges on the specific latency requirements of a system, balancing the need for determinism, overhead, and resource efficiency. While Real-time Ubuntu with the PREEMPT_RT patch offers a robust solution for many scenarios, a dedicated RTOS may be preferable in a critical embedded system. A real-time capable Linux kernel is ideal for strongly latency-dependent use cases, but for extreme latency requirements an RTOS may be more suitable. Understanding the intricacies of both options is key to making an informed decision that aligns with a business's demands and constraints. Stay tuned for Part II of this mini-series, where we will discuss the relevant considerations to keep in mind when choosing between a Linux kernel or an RTOS in a real-time system. Download the latest whitepaper on real-time Linux now! View the full article
  9. In the rapidly evolving digital landscape, the role of data has shifted from being merely a byproduct of business to becoming its lifeblood. With businesses constantly in the race to stay ahead, the process of integrating this data becomes crucial. However, it's no longer enough to assimilate data in isolated, batch-oriented processes. The new norm is real-time data integration, and it’s transforming the way companies make decisions and conduct their operations. This article delves into the paradigm shift from traditional to real-time data integration, examines its architectural nuances, and contemplates its profound impact on decision-making and business processes… View the full article
  10. In today's tech landscape, where application systems are numerous and complex, real-time monitoring during deployments has transitioned from being a luxury to an absolute necessity. Ensuring that all the components of an application are functioning as expected during and immediately after deployment while also keeping an eye on essential application metrics is paramount to the health and functionality of any software application. This is where Datadog steps in — a leading monitoring and analytics platform that brings visibility into every part of the infrastructure, from front-end apps to the underlying hardware. In tandem with this is Ansible, a robust tool for automation, particularly in deployment and configuration management. In this article, we will discover how Datadog real-time monitoring can be integrated into Ansible-based deployments and how this integration can be leveraged during deployments. This concept and methodology can be applied to similar sets of monitoring and deployment tools as well. Why Integrate Real-Time Monitoring in Deployments? In the ever-evolving realm of DevOps, the line between development and operations is continuously blurring. This integration drives a growing need for continuous oversight throughout the entire lifecycle of an application, not just post-deployment. Here's why integrating Datadog with your deployment processes and within your deployment scripts is both timely and essential: View the full article
  11. In the dynamic landscape of Kubernetes application deployment, GitOps has emerged as a transformative methodology. At its core lies the concept of declarative configuration stored in version-controlled repositories, enabling consistent and automated application management. Argo CD, a leading tool in the GitOps ecosystem, empowers organizations to efficiently deploy and manage Kubernetes applications with precision and reproducibility. As GitOps gains traction, the need for real-time communication and collaboration surrounding deployment updates becomes increasingly apparent. View the full article
  12. Generative AI chatbots like ChatGPT, Bing Chat and Google Bard continue to iterate and improve, and Google's smart assistant is the latest to get a feature update – it can now respond to you in real time, if you want it to. Before now, Bard has always taken the time to compose its responses in full, before putting them on screen. That's in contrast to ChatGPT and Bing Chat, which output text in real time while the answer is still being worked on. Now, Google Bard will do that as well, by default. The update was spotted by 9to5Google, and we've seen it for ourselves too, though Bard's changelog hasn't yet been updated to reflect the different approach... View the full article
  13. The predictive quality of a machine learning model is a direct reflection of the quality of data used to train and serve the... View the full article
  14. Amazon SageMaker Canvas now supports deploying machine learning (ML) models to real-time inferencing endpoints, allowing you to take your ML models to production and drive action based on ML-powered insights. SageMaker Canvas is a no-code workspace that enables analysts and citizen data scientists to generate accurate ML predictions for their business needs. View the full article
  15. With the help of event-driven architecture (EDA) and the Open API economy, businesses can keep up with the world and operate in real-time. View the full article
  16. WebSocket is a common communication protocol used in web applications to facilitate real-time bi-directional data exchange between client and server. However, when the server has to maintain a direct connection with the client, it can limit the server’s ability to scale down when there are long-running clients. This scale down can occur when nodes are underutilized during periods of low usage. In this post, we demonstrate how to redesign a web application to achieve auto scaling even for long-running clients, with minimal changes to the original application... View the full article
  17. Amazon Connect Contact Lens now provides manager alerts on real-time metrics via email notifications, EventBridge events, or Amazon Connect Tasks. These new alerts enable businesses to notify their managers on unexpected changes in contact center operations that could impact the end-customer experience. With this launch, businesses can now configure alerts that include choosing a metric (e.g., service level), defining a metric threshold (e.g., service level of 90 seconds drops below 75% on a business critical queue), and sending an automated email notification or assigning a task to a manager for follow-up action. View the full article
  18. Amazon Personalize has launched a new open source Amazon Personalize Kafka Sink connector that makes it easy to stream data in real time for use in Amazon Personalize. Amazon Personalize enables developers to improve customer engagement through personalized product and content recommendations – no ML expertise required. Amazon Personalize uses data provided by customers to train custom models on their behalf. With this launch, customers can readily ingest their data from Apache Kafka clusters and call the Amazon Personalize-specific APIs to enable real-time data streaming, without the need for custom code. This makes it faster and easier to bring real-time data to Amazon Personalize for those using Apache Kafka. View the full article
  19. Amazon Connect Voice ID (available in preview) provides real-time caller authentication that makes voice interactions in contact centers more secure and efficient. Voice ID provides an additional security layer that doesn’t rely on the caller answering multiple questions (such as birthdate and mother’s maiden name) and makes it easy to enroll and verify customers without changing the natural flow of the conversation. View the full article
  20. Contact Lens for Amazon Connect, launched at re:Invent 2019, provides a set of machine learning (ML) capabilities integrated into Amazon Connect that analyze call recordings for customer sentiment, trends, and compliance of conversations. Now, Contact Lens supports real-time call analytics capabilities, enabling you to detect customer issues during live calls and resolve them faster. View the full article
  21. Real-Time Live Sports Updates Using AWS AppSync is a new AWS Solutions Implementation that helps media and entertainment (M&E) companies deliver sports information to their customers on mobile and web applications in near real-time. Delivering real-time sports updates is a critical workload for many M&E companies. When a fan’s favorite team scores a goal, hits a home run, or makes a touchdown, it is important that this update makes it to fans in as close to real-time as possible. This solution simplifies historically complex and expensive infrastructure and helps support thousands of fans tracking a game or match in a web or mobile application. View the full article
  22. Data engineers and developers face challenges every day to help their organizations digitally transform. To do this, they must deliver real-time data applications faster, better and cheaper. With businesses in every industry now data-driven, data professionals must work more efficiently and accelerate time to market for their products. That’s where DataOps comes in. Development of […] The post DataOps: The Key for Real-Time Data Application Development appeared first on DevOps.com. View the full article
  23. Amazon Connect real-time metric dashboards now allow you to drill down into queue and routing profile data in one click. For example, if a queue has a long wait time, call center managers can create a table in one click to view agents in that queue. With this table, they can quickly identify agents in an error status and work with them to resolve the issue. View the full article