Showing results for tags 'aws data exchange'.
It always pays to know more about your customers, and AWS Data Exchange makes it straightforward to use publicly available census data to enrich your customer dataset. The United States Census Bureau conducts the US census every 10 years and gathers household survey data. This data is anonymized, aggregated, and made available for public use. The smallest geographic area for which the Census Bureau collects and aggregates data is the census block, which is formed by streets, roads, railroads, streams and other bodies of water, other visible physical and cultural features, and the legal boundaries shown on Census Bureau maps. If you know the census block in which a customer lives, you can make general inferences about their demographic characteristics. With these new attributes, you can build a segmentation model to identify distinct groups of customers that you can target with personalized messaging. This data is available to subscribe to on AWS Data Exchange, and with data sharing, you don't need to pay to store a copy of it in your account in order to query it.

In this post, we show how to use customer addresses to enrich a dataset with additional demographic details from the US Census Bureau dataset.

Solution overview

The solution includes the following high-level steps:

1. Set up an Amazon Redshift Serverless endpoint and load customer data.
2. Set up a place index in Amazon Location Service.
3. Write an AWS Lambda user-defined function (UDF) to call Location Service from Amazon Redshift.
4. Subscribe to census data on AWS Data Exchange.
5. Use geospatial queries to tag addresses to census blocks.
6. Create a new customer dataset in Amazon Redshift.
7. Evaluate new customer data in Amazon QuickSight.

The following diagram illustrates the solution architecture.

Prerequisites

You can use the following AWS CloudFormation template to deploy the required infrastructure. Before deployment, you need to sign up for QuickSight access through the AWS Management Console.

Load generic address data to Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Redshift Serverless makes it straightforward to run analytics workloads of any size without having to manage data warehouse infrastructure. To load our address data, we first create a Redshift Serverless workgroup. Then we use Amazon Redshift Query Editor v2 to load customer data from Amazon Simple Storage Service (Amazon S3).

Create a Redshift Serverless workgroup

There are two primary components of the Redshift Serverless architecture:

- Namespace – A collection of database objects and users. Namespaces group together all of the resources you use in Redshift Serverless, such as schemas, tables, users, datashares, and snapshots.
- Workgroup – A collection of compute resources. Workgroups have network and security settings that you can configure using the Redshift Serverless console, the AWS Command Line Interface (AWS CLI), or the Redshift Serverless APIs.

To create your namespace and workgroup, refer to Creating a data warehouse with Amazon Redshift Serverless. For this exercise, name your workgroup sandbox and your namespace adx-demo.

Use Query Editor v2 to load customer data from Amazon S3

You can use Query Editor v2 to submit queries and load data to your data warehouse through a web interface. To configure Query Editor v2 for your AWS account, refer to Data load made easy and secure in Amazon Redshift using Query Editor V2.
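Once Query Editor v2 is connected to the sandbox workgroup, a quick optional query such as the following (not part of the original walkthrough, just an easy check) confirms you're connected to the expected database and user before loading any data:

-- Optional: confirm the Query Editor v2 connection before loading data
SELECT current_database(), current_user, version();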
After it's configured, complete the following steps:

1. Use the following SQL to create the customer_data schema within the dev database in your data warehouse:

CREATE SCHEMA customer_data;

2. Use the following SQL DDL to create your target table into which you'll load your customer address data:

CREATE TABLE customer_data.customer_addresses (
    address character varying(256) ENCODE lzo,
    unitnumber character varying(256) ENCODE lzo,
    municipality character varying(256) ENCODE lzo,
    region character varying(256) ENCODE lzo,
    postalcode character varying(256) ENCODE lzo,
    country character varying(256) ENCODE lzo,
    customer_id integer ENCODE az64
) DISTSTYLE AUTO;

3. Load the address_list.csv file to the table you just created. For instructions, refer to Data load made easy and secure in Amazon Redshift using Query Editor V2. The file has no column headers and is pipe delimited (|).

For information on how to load data from either Amazon S3 or your local desktop, refer to Loading data into a database.

Use Location Service to geocode and enrich address data

Location Service lets you add location data and functionality to applications, including capabilities such as maps, points of interest, geocoding, routing, geofences, and tracking. Our data is in Amazon Redshift, so we need to access the Location Service APIs using SQL statements. Each row of data contains an address that we want to enrich and geotag using the Location Service APIs.

Amazon Redshift allows developers to create UDFs using a SQL SELECT clause, Python, or Lambda. Lambda is a compute service that lets you run code without provisioning or managing servers. With Lambda UDFs, you can write custom functions with complex logic and integrate with third-party components. Scalar Lambda UDFs return one result per invocation of the function; in this case, the Lambda function runs one time for each row of data it receives.

For this post, we write a Lambda function that uses the Location Service API to geotag and validate our customer addresses. Then we register this Lambda function as a UDF with our Redshift instance, allowing us to call the function from a SQL command (a sketch of the registration statement appears at the end of this section). For instructions to create a Location Service place index and create your Lambda function and scalar UDF, refer to Access Amazon Location Service from Amazon Redshift. For this post, we use Esri as a provider and name the place index placeindex.redshift.

Test your new function with the following code, which returns the coordinates of the White House in Washington, DC:

select public.f_geocode_address('1600 Pennsylvania Ave.','Washington','DC','20500','USA');

Subscribe to demographic data from AWS Data Exchange

AWS Data Exchange is a data marketplace with more than 3,500 products from over 300 providers delivered, through files, APIs, or Amazon Redshift queries, directly to the data lakes, applications, analytics, and machine learning models that use it. First, we need to give our Redshift namespace permission via AWS Identity and Access Management (IAM) to access subscriptions on AWS Data Exchange. Then we can subscribe to our sample demographic data. Complete the following steps:

1. On the IAM console, add the AWSDataExchangeSubscriberFullAccess managed policy to the Amazon Redshift commands access role that you assigned when creating the namespace.
2. On the AWS Data Exchange console, navigate to the dataset ACS – Sociodemographics (USA, Census Block Groups, 2019), provided by CARTO.
3. Choose Continue to subscribe, then choose Subscribe.

The subscription may take a few minutes to configure.
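While the subscription is being configured, here is the promised sketch of the scalar UDF registration from the previous section. This is a minimal illustration only, not the exact statement from the linked post: the Lambda function name (f-geocode-address) and the IAM role ARN are placeholders, and Access Amazon Location Service from Amazon Redshift covers the actual function code and permissions.

-- Sketch: register the geocoding Lambda function as a scalar UDF (names are placeholders)
CREATE OR REPLACE EXTERNAL FUNCTION public.f_geocode_address(varchar, varchar, varchar, varchar, varchar)
RETURNS varchar
STABLE
LAMBDA 'f-geocode-address'  -- placeholder: name of the Lambda function that calls Location Service
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-lambda-geocode-role';  -- placeholder: role that lets Redshift invoke the function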
When your subscription is in place, navigate back to the Redshift Serverless console and complete the following steps:

1. In the navigation pane, choose Datashares.
2. On the Subscriptions tab, choose the datashare that you just subscribed to.
3. On the datashare details page, choose Create database from datashare.
4. Choose the namespace you created earlier and provide a name for the new database that will hold the shared objects from the dataset you subscribed to.

In Query Editor v2, you should see the new database you just created and two new tables: one that holds the block group polygons and another that holds the demographic information for each block group.

Join geocoded customer data to census data with geospatial queries

There are two primary types of spatial data: raster and vector. Raster data is represented as a grid of pixels and is beyond the scope of this post. Vector data is composed of vertices, edges, and polygons. With geospatial data, vertices are represented as latitude and longitude points, and edges are the connections between pairs of vertices. Think of the road connecting two intersections on a map. A polygon is a set of vertices with a series of connecting edges that form a continuous shape. A simple rectangle is a polygon, just as the state border of Ohio can be represented as a polygon. The geography_usa_blockgroup_2019 dataset that you subscribed to has 220,134 rows, each representing a single census block group and its geographic shape.

Amazon Redshift supports the storage and querying of vector-based spatial data with the GEOMETRY and GEOGRAPHY data types. You can use Redshift SQL functions to perform queries such as a point-in-polygon operation to determine whether a given latitude/longitude point falls within the boundaries of a given polygon (such as a state or county boundary). In this dataset, you can observe that the geom column in geography_usa_blockgroup_2019 is of type GEOMETRY.

Our goal is to determine which census block (polygon) each of our geotagged addresses falls within so we can enrich our customer records with details that we know about the census block. Complete the following steps:

1. Build a new table with the geocoding results from our UDF:

CREATE TABLE customer_data.customer_addresses_geocoded AS
select address
    ,unitnumber
    ,municipality
    ,region
    ,postalcode
    ,country
    ,customer_id
    ,public.f_geocode_address(address||' '||unitnumber,municipality,region,postalcode,country) as geocode_result
FROM customer_data.customer_addresses;

2. Use the following code to extract the different address fields and latitude/longitude coordinates from the JSON column and create a new table with the results:

CREATE TABLE customer_data.customer_addresses_points AS
SELECT customer_id
    ,geo_address address
    ,unitnumber
    ,municipality
    ,region
    ,postalcode
    ,country
    ,longitude
    ,latitude
    ,ST_SetSRID(ST_MakePoint(Longitude, Latitude),4326) as address_point
        --create new geom column of type POINT, set new point SRID = 4326
FROM
(select customer_id
    ,address
    ,unitnumber
    ,municipality
    ,region
    ,postalcode
    ,country
    ,cast(json_extract_path_text(geocode_result, 'Label', true) as VARCHAR) as geo_address
    ,cast(json_extract_path_text(geocode_result, 'Longitude', true) as float) as longitude
    ,cast(json_extract_path_text(geocode_result, 'Latitude', true) as float) as latitude
        --use json function to extract fields from geocode_result
from customer_data.customer_addresses_geocoded) a;

This code uses the ST_MakePoint function to create a new column from the latitude/longitude coordinates called address_point of type GEOMETRY and subtype POINT.
It uses the ST_SetSRID geospatial function to set the spatial reference identifier (SRID) of the new column to 4326. The SRID defines the spatial reference system to be used when evaluating the geometry data. It's important when joining or comparing geospatial data that the columns have matching SRIDs. You can check the SRID of an existing geometry column by using the ST_SRID function. For more information on SRIDs and GEOMETRY data types, refer to Querying spatial data in Amazon Redshift.

Now that your customer addresses are geocoded as latitude/longitude points in a geometry column, you can use a join to identify which census block shape each new point falls within:

CREATE TABLE customer_data.customer_addresses_with_census AS
select c.*
    ,shapes.geoid as census_group_shape
    ,demo.*
from customer_data.customer_addresses_points c
inner join "carto_census_data"."carto".geography_usa_blockgroup_2019 shapes
    on ST_Contains(shapes.geom, c.address_point)
        --join tables where the address point falls within the census block geometry
inner join carto_census_data.usa_acs.demographics_sociodemographics_usa_blockgroup_2019_yearly_2019 demo
    on demo.geoid = shapes.geoid;

The preceding code creates a new table called customer_addresses_with_census, which joins the customer addresses to the census block in which they belong as well as the demographic data associated with that census block. To do this, you used the ST_Contains function, which accepts two geometry data types as input and returns TRUE if the 2D projection of the first input geometry contains the second input geometry. In our case, we have census blocks represented as polygons and addresses represented as points; the join in the SQL statement succeeds when a point falls within the boundaries of the polygon.

Visualize the new demographic data with QuickSight

QuickSight is a cloud-scale business intelligence (BI) service that you can use to deliver easy-to-understand insights to the people you work with, wherever they are. QuickSight connects to your data in the cloud and combines data from many different sources.

First, let's build some new calculated fields that will help us better understand the demographics of our customer base. We can do this in QuickSight, or we can use SQL to build the columns in a Redshift view. The following is the code for a Redshift view:

CREATE VIEW customer_data.customer_features AS (
SELECT customer_id
    ,postalcode
    ,region
    ,municipality
    ,geoid as census_geoid
    ,longitude
    ,latitude
    ,total_pop
    ,median_age
    ,white_pop/total_pop as perc_white
    ,black_pop/total_pop as perc_black
    ,asian_pop/total_pop as perc_asian
    ,hispanic_pop/total_pop as perc_hispanic
    ,amerindian_pop/total_pop as perc_amerindian
    ,median_income
    ,income_per_capita
    ,median_rent
    ,percent_income_spent_on_rent
    ,unemployed_pop/coalesce(pop_in_labor_force) as perc_unemployment
    ,(associates_degree + bachelors_degree + masters_degree + doctorate_degree)/total_pop as perc_college_ed
    ,(household_language_total - household_language_english)/coalesce(household_language_total) as perc_other_than_english
FROM "dev"."customer_data"."customer_addresses_with_census" t );

To get QuickSight to talk to our Redshift Serverless endpoint, complete the following steps:

1. Manually authorize connections from QuickSight to Redshift clusters. For instructions, refer to Authorizing connections from Amazon QuickSight to Amazon Redshift clusters (stop after Step 19).
2. Configure the VPC connection between QuickSight and the Redshift Serverless endpoint.
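Before creating the QuickSight dataset, it can be worth running a quick, optional sanity check on the view in Query Editor v2 to confirm the derived fields look reasonable. The following query is just an illustration; the columns picked here are an arbitrary subset of the view defined above:

-- Optional spot check of the derived demographic features
SELECT census_geoid,
       median_income,
       perc_college_ed,
       perc_other_than_english
FROM customer_data.customer_features
ORDER BY median_income DESC
LIMIT 10;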
Now you can create a new dataset in QuickSight:

1. On the QuickSight console, choose Datasets in the navigation pane.
2. Choose New dataset.
3. Create a dataset from a new data source using the Redshift: Manual connect option.
4. Provide the connection information for your Redshift Serverless workgroup. You will need the endpoint for your workgroup and the user name and password that you created when you set up your workgroup. You can find your workgroup's endpoint on the Redshift Serverless console by navigating to your workgroup configuration. The following screenshot is an example of the connection settings needed. Notice that the connection type is the name of the VPC connection that you previously configured in QuickSight. When you copy the endpoint from the Redshift console, be sure to remove the database and port number from the end of the URL before entering it in the field.
5. Save the new data source configuration. You'll be prompted to choose the table you want to use for your dataset.
6. Choose the new view that you created that has your new derived fields.
7. Select Directly query your data. This connects your visualizations directly to the data in the database rather than ingesting data into the QuickSight in-memory data store.
8. To create a histogram of median income level, choose the blank visual on Sheet1 and then choose the histogram visual icon under Visual types.
9. Choose median_income under Fields list and drag it to the Value field well.

This builds a histogram showing the distribution of median_income for our customers based on the census block group in which they live.

Conclusion

In this post, we demonstrated how companies can use open census data available on AWS Data Exchange to effortlessly gain a high-level understanding of their customer base from a demographic standpoint. This basic understanding of customers based on where they live can serve as the foundation for more targeted marketing campaigns and even influence product development and service offerings.

As always, AWS welcomes your feedback. Please leave your thoughts and questions in the comments section.

About the Author

Tony Stricker is a Principal Technologist on the Data Strategy team at AWS, where he helps senior executives adopt a data-driven mindset and align their people, process, and technology in ways that foster innovation and drive towards specific, tangible business outcomes. He has a background as a data warehouse architect and data scientist and has delivered solutions into production across multiple industries including oil and gas, financial services, public sector, and manufacturing. In his spare time, Tony likes to hang out with his dog and cat, work on home improvement projects, and restore vintage Airstream campers.
Tagged with: geospatial, amazon redshift (and 2 more)
This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

A new week starts, and Spring is almost here! If you're curious about AWS news from the previous seven days, I've got you covered.

Last Week's Launches

Here are the launches that got my attention last week:

- Amazon S3 – Last week was AWS Pi Day 2023, celebrating 17 years of innovation since Amazon S3 was introduced on March 14, 2006. For the occasion, the team released many new capabilities: S3 Object Lambda now provides aliases that are interchangeable with bucket names and can be used with Amazon CloudFront to tailor content for end users. S3 now supports datasets that are replicated across multiple AWS accounts with cross-account support for S3 Multi-Region Access Points. You can now create and configure replication rules to automatically replicate S3 objects from one AWS Outpost to another. Amazon S3 has also simplified private connectivity from on-premises networks: with private DNS for S3, on-premises applications can use AWS PrivateLink to access S3 over an interface endpoint, while requests from your in-VPC applications access S3 using gateway endpoints. We also released Mountpoint for Amazon S3, a high-performance open source file client. Read more in the blog. Note that Mountpoint isn't a general-purpose networked file system, and comes with some restrictions on file operations.
- Amazon Linux 2023 – Our new Linux-based operating system is now generally available. Sébastien's post is full of tips and info.
- Application Auto Scaling – You can now use arithmetic operations and mathematical functions to customize the metrics used with Target Tracking policies, so you can scale based on your own application-specific metrics. Read how it works with Amazon ECS services.
- AWS Data Exchange for Amazon S3 is now generally available – You can now share and find data files directly from S3 buckets, without the need to create or manage copies of the data.
- Amazon Neptune – Now offers a graph summary API to help understand important metadata about property graphs (PG) and resource description framework (RDF) graphs. Neptune also added support for Slow Query Logs to help identify queries that need performance tuning.
- Amazon OpenSearch Service – The team introduced security analytics that provides new threat monitoring, detection, and alerting features. The service now supports OpenSearch version 2.5, which adds several new features such as support for Point in Time Search and improvements to observability and geospatial functionality.
- AWS Lake Formation and Apache Hive on Amazon EMR – Introduced fine-grained access controls that allow data administrators to define and enforce fine-grained table and column level security for customers accessing data via Apache Hive running on Amazon EMR.
- Amazon EC2 M1 Mac Instances – You can now update guest environments to a specific or the latest macOS version without having to tear down and recreate the existing macOS environments.
- AWS Chatbot – Now integrates with Microsoft Teams to simplify the way you troubleshoot and operate your AWS resources.
- Amazon GuardDuty RDS Protection for Amazon Aurora – Now generally available to help profile and monitor access activity to Aurora databases in your AWS account without impacting database performance.
- AWS Database Migration Service – Now supports validation to ensure that data is migrated accurately to S3, and can now generate an AWS Glue Data Catalog when migrating to S3.
- AWS Backup – You can now back up and restore virtual machines running on VMware vSphere 8 and with multiple vNICs.
- Amazon Kendra – There are new connectors to index documents and search for information across these new content sources: Confluence Server, Confluence Cloud, Microsoft SharePoint OnPrem, and Microsoft SharePoint Cloud. This post shows how to use the Amazon Kendra connector for Microsoft Teams.

For a full list of AWS announcements, be sure to keep an eye on the What's New at AWS page.

Other AWS News

A few more blog posts you might have missed:

- Women founders Q&A – We're talking to six women founders and leaders about how they're making impacts in their communities, industries, and beyond.
- What you missed at the 2023 IMAGINE: Nonprofit conference – Where hundreds of nonprofit leaders, technologists, and innovators gathered to learn and share how AWS can drive a positive impact for people and the planet.
- Monitoring load balancers using Amazon CloudWatch anomaly detection alarms – The metrics emitted by load balancers provide crucial and unique insight into service health, service performance, and end-to-end network performance.
- Extend geospatial queries in Amazon Athena with user-defined functions (UDFs) and AWS Lambda – Using a solution based on Uber's Hexagonal Hierarchical Spatial Index (H3) to divide the globe into equally sized hexagons.
- How cities can use transport data to reduce pollution and increase safety – A guest post by Rikesh Shah, outgoing head of open innovation at Transport for London.

For AWS open-source news and updates, here's the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events

Here are some opportunities to meet:

- AWS Public Sector Day 2023 (March 21, London, UK) – An event dedicated to helping public sector organizations use technology to achieve more with less through the current challenging conditions.
- Women in Tech at Skills Center Arlington (March 23, VA, USA) – Let's celebrate the history and legacy of women in tech.

The AWS Summits season is warming up! You can sign up here to know when registration opens in your area.

That's all from me for this week. Come back next Monday for another Week in Review!

— Danilo
Tagged with: women in tech, s3 (and 23 more)