Showing results for tags 'postgresql'.
  1. Amazon Aurora PostgreSQL-Compatible Edition now supports PostgreSQL versions 16.2, 15.6, 14.11, 13.14, and 12.18. These releases contain product improvements and bug fixes made by the PostgreSQL community, along with Aurora-specific improvements. These releases also offer improved performance through faster COPY operations (for PostgreSQL 16.2), Query Plan Management enhancements for queries with aggregate operations, and optimizations for further reducing the logical replication lag. View the full article
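For readers who want to pick up one of these minor versions on an existing cluster, the upgrade can be applied through the standard RDS API. Below is a minimal boto3 sketch; the cluster identifier and target version are illustrative assumptions, and you should confirm the versions offered for your engine before applying the change.

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # List the Aurora PostgreSQL versions currently offered (sanity check before upgrading).
    versions = rds.describe_db_engine_versions(Engine="aurora-postgresql")
    print(sorted(v["EngineVersion"] for v in versions["DBEngineVersions"]))

    # Apply a minor version upgrade to a hypothetical cluster during the next maintenance window.
    rds.modify_db_cluster(
        DBClusterIdentifier="my-aurora-pg-cluster",  # assumption: replace with your cluster ID
        EngineVersion="15.6",                        # one of the versions named in the announcement
        ApplyImmediately=False,                      # defer the upgrade to the maintenance window
    )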
  2. Effective data migration is the key to overcoming the challenges associated with today’s data-driven world. The AWS Aurora Postgres to Databricks integration offers data storage and analytics solutions that help unlock the full potential of your organization’s operational data. Through this integration, you can achieve the simplicity and cost-effectiveness of AWS Aurora Postgres databases with […]View the full article
  3. Data-driven decision-making is one of the core principles of many organizations these days. Companies spend thousands of dollars working on data and interpreting its behavior to increase sales and scale their business. Moving all the company data from a relational database to a fully managed data warehousing service is an essential step to consider. This […]View the full article
  4. RudderStack explains why it chose PostgreSQL over Apache Kafka when building RudderStack, focusing on the challenges of operating Apache Kafka. View the full article
  5. Editor’s note: Apex FinTech Solutions Inc. ("Apex") enables modern investing and wealth management tools through an ecosystem of frictionless platforms, APIs, and services. As part of its clearing and custody services (provided through Apex Clearing Corporation, a wholly owned subsidiary of Apex Fintech Solutions Inc.), Apex wanted to provide its fintech clients in the trading and investment sector with more timely and accurate margin calculations. To transform this traditionally on-premises process, the company migrated to AlloyDB for PostgreSQL to enable real-time decision-making and risk management for clients and investors.
We know that time is money — especially for investors. In the trading and investment sector, fintech companies are leading the charge as investors demand faster, more secure, and more intuitive experiences. Developing solutions for increasingly tech-savvy clients, however, can take a toll on computing resources. That’s why companies turn to us, Apex FinTech Solutions. Our modular ecosystem of APIs and platforms includes trading solutions that provide the routing for execution exchanges and market makers. To ensure top-notch speed and performance for our margin calculation workloads, we chose AlloyDB for PostgreSQL as our database.
Overhauling processes to invest in our success
Recently, Apex leadership has been focused on transforming on-premises processes into SaaS-based product offerings. We saw an opportunity for improvement in our margin calculation method — a crucial service for brokerages. Margin is the capital that an investor borrows from a broker to purchase an investment and represents the difference between the investment's value and the amount borrowed. Investors should continually monitor and understand margin requirements as they are required to maintain sufficient equity to hold open positions. The amount of required equity fluctuates based upon portfolio mix and optimization, price movements, and underlying market volatility. It is also important to understand that trading firms have a fixed amount of capital they can allocate to their specific margin obligations at any given time, which may affect the requirements and equity needed to maintain their open positions.
In our previous system, we used a Java application with Microsoft SQL Server as its backend to calculate margins as a batch process that provided the start-of-day buying power numbers to our clients and investors. We ran the process in our data center via a database export file, which we chunked up and distributed across several worker nodes to calculate the margins in mini-batches. As part of Apex’s digital transformation and cloud migration, we needed real-time data from APIs to calculate the margin on demand and help our customers determine their risk in seconds. We knew this capability would provide a strategic advantage for Apex and our customers.
AlloyDB — a convincing solution
AlloyDB is a fully managed, PostgreSQL-compatible database solution that provides openness and scalability. As a highly regulated financial institution, we treated total system availability and performance as critical considerations. We quickly adopted AlloyDB with confidence since it was compatible with our high availability (HA) and disaster recovery (DR) requirements right out of the box. Today, a combination of a new microservices architecture and AlloyDB allows us to query directly against the real-time account data stored in the database.
The functionality we use includes AlloyDB read replicas, the Google Kubernetes Engine (GKE) cluster autoscaler, a Pub/Sub event bus, and the AlloyDB columnar engine. The most crucial part of the architecture is its ability to separate write and read operations between the AlloyDB primary instance and AlloyDB read replicas. This capability provides the flexibility to segregate transactional and analytical queries, improving performance and scalability. Since AlloyDB uses a disaggregated, shared storage system to store data, we can spin up read replicas on demand without waiting for data to be copied over from the primary instance. When the workload is complete, we can scale down replicas to minimize costs. Moreover, we enabled the AlloyDB columnar engine since margin calculations are an analytical query from the database’s point of view. The columnar engine reduced the total CPU load on the database, saving costs by minimizing the required number of read replicas. A GKE cluster configured for autoscaling based on the number of messages in a Pub/Sub queue allows resources to scale up dynamically — so we can quickly process the entire workload more efficiently and in parallel — and then spin back down once processing is complete.
Figure 1: Architecture diagram of modern margin calculation in Google Cloud
Crunching the numbers: 50% reduction in processing time
This new, flexible architecture offers Apex customers the choice between on-demand single account margin calculations via an Apigee API or comprehensive nightly batch jobs for all accounts. This in turn gives us an opportunity to expand our existing service offerings and increase customer value. The AlloyDB-based solution achieved a remarkable 50 percent reduction in processing time, enabling margin calculations for 100,000 accounts in just one minute.1 This is a significant improvement in efficiency and scalability. With the flexibility and speed of AlloyDB, we have significant potential to improve these results with further tuning. We are excited to collaborate with Google Cloud to use AlloyDB in our scalable architecture for batch and real-time margin calculation without the need to move or copy data. We are now exploring the potential to migrate additional traditional PostgreSQL instances and continue driving disruption and innovation for Apex — and the future of financial services.
Get started
Learn more about AlloyDB for PostgreSQL and start a free trial today!
1. 50% reduction in processing time is based on internal testing and benchmarking conducted in November 2024 using the Google database and services. All product and company names are trademarks ™ or registered ® trademarks of their respective holders. Use of them does not imply any affiliation with or endorsement by them.
Apex Fintech Solutions is a fintech powerhouse enabling seamless access, frictionless investing, and investor education for all. Apex’s omni-suite of scalable solutions fuels innovation and evolution for hundreds of today’s market leaders, challengers, change makers, and visionaries. The Company’s digital ecosystem creates an environment where clients with the biggest ideas are empowered to change the world. Apex works to ensure their partners succeed on the frontlines of the industry via bespoke solutions through its Apex Clearing™, Apex Advisor Solutions™, Apex Silver™, and Apex CODA Markets™ brands.
Apex Clearing Corporation, a wholly-owned subsidiary of Apex Fintech Solutions Inc., is an SEC registered broker dealer, a member of FINRA and SIPC, and is licensed in 53 states and territories. Securities products and services referenced herein are provided by Apex Clearing Corporation. FINRA BrokerCheck reports for Apex Clearing Corporation are available at: http://www.finra.org/brokercheck Nothing herein shall be construed as a recommendation to buy or sell any security. While we have made every attempt to ensure that the information contained in this document has been obtained from reliable sources, Apex is not responsible for any errors or omissions, or for the results obtained from the use of this information. All information in this document is provided “as is”, with no guarantee of completeness, accuracy, timeliness or of the results obtained due to the use of this information, and without warranty of any kind, express or implied, including but not limited to warranties of performance, merchantability and fitness for a particular purpose. In no event will Apex be liable to you or anyone else for any decision made or taken in reliance on the information in this document or for any consequential, special or similar damages, even if advised of the possibility of such damages. Dissemination of this information is prohibited without Apex’s written permission. View the full article
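The read/write split described in the case study above is easy to mirror in application code with any PostgreSQL driver: transactional writes go to the primary endpoint, analytical queries go to a read-pool endpoint. The sketch below uses psycopg2 against AlloyDB's PostgreSQL-compatible interface; the host names, credentials, and table are illustrative assumptions, not Apex's actual configuration.

    import psycopg2

    # Assumed endpoints: AlloyDB exposes separate addresses for the primary instance
    # and for read pool instances; both speak the PostgreSQL wire protocol.
    PRIMARY_DSN = "host=alloydb-primary.internal dbname=margins user=app password=secret"
    READ_POOL_DSN = "host=alloydb-read-pool.internal dbname=margins user=app password=secret"

    def record_position(account_id: int, symbol: str, qty: int) -> None:
        """Transactional write: always routed to the primary instance."""
        with psycopg2.connect(PRIMARY_DSN) as conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO positions (account_id, symbol, qty) VALUES (%s, %s, %s)",
                (account_id, symbol, qty),
            )

    def total_exposure(account_id: int) -> int:
        """Analytical read: routed to a read pool, where the columnar engine can help."""
        with psycopg2.connect(READ_POOL_DSN) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT COALESCE(SUM(qty), 0) FROM positions WHERE account_id = %s",
                (account_id,),
            )
            return int(cur.fetchone()[0])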
  6. Amazon Relational Database Service (Amazon RDS) for PostgreSQL, MySQL, and MariaDB now supports AWS Graviton3-based M7g and R7g database instances in US West (N. California), Asia Pacific (Hyderabad, Seoul), Canada (Central), Europe (London, Spain), and Middle East (Bahrain). Graviton3-based instances provide up to a 30% performance improvement and up to a 27% price/performance improvement (based on on-demand pricing) over Graviton2-based instances on RDS for open-source databases, depending on database engine, version, and workload. View the full article
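Moving an existing RDS instance onto one of these Graviton3 classes is a single modify call. A hedged boto3 sketch, with an assumed instance identifier and class:

    import boto3

    rds = boto3.client("rds", region_name="us-west-1")  # e.g. US West (N. California)

    # Switch a hypothetical RDS for PostgreSQL instance to a Graviton3-based class.
    # The change involves a short outage unless it is deferred to the maintenance window.
    rds.modify_db_instance(
        DBInstanceIdentifier="my-postgres-instance",  # assumption: your instance ID
        DBInstanceClass="db.r7g.large",               # one of the new R7g classes
        ApplyImmediately=False,
    )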
  7. This article is part of a project that’s split into two main phases. The first phase focuses on building a data pipeline. This involves getting data from an API and storing it in a PostgreSQL database. In the second phase, we’ll develop an application that uses a language model to interact with this database. Ideal for those new to data systems or language model applications, this project is structured into two segments: This initial article guides you through constructing a data pipeline utilizing Kafka for streaming, Airflow for orchestration, Spark for data transformation, and PostgreSQL for storage. To set-up and run these tools we will use Docker.The second article, which will come later, will delve into creating agents using tools like LangChain to communicate with external databases.This first part project is ideal for beginners in data engineering, as well as for data scientists and machine learning engineers looking to deepen their knowledge of the entire data handling process. Using these data engineering tools firsthand is beneficial. It helps in refining the creation and expansion of machine learning models, ensuring they perform effectively in practical settings. This article focuses more on practical application rather than theoretical aspects of the tools discussed. For detailed understanding of how these tools work internally, there are many excellent resources available online. OverviewLet’s break down the data pipeline process step-by-step: Data Streaming: Initially, data is streamed from the API into a Kafka topic.Data Processing: A Spark job then takes over, consuming the data from the Kafka topic and transferring it to a PostgreSQL database.Scheduling with Airflow: Both the streaming task and the Spark job are orchestrated using Airflow. While in a real-world scenario, the Kafka producer would constantly listen to the API, for demonstration purposes, we’ll schedule the Kafka streaming task to run daily. Once the streaming is complete, the Spark job processes the data, making it ready for use by the LLM application.All of these tools will be built and run using docker, and more specifically docker-compose. Overview of the data pipeline. Image by the author.Now that we have a blueprint of our pipeline, let’s dive into the technical details ! Local setupFirst you can clone the Github repo on your local machine using the following command: git clone https://github.com/HamzaG737/data-engineering-project.gitHere is the overall structure of the project: ├── LICENSE ├── README.md ├── airflow │ ├── Dockerfile │ ├── __init__.py │ └── dags │ ├── __init__.py │ └── dag_kafka_spark.py ├── data │ └── last_processed.json ├── docker-compose-airflow.yaml ├── docker-compose.yml ├── kafka ├── requirements.txt ├── spark │ └── Dockerfile └── src ├── __init__.py ├── constants.py ├── kafka_client │ ├── __init__.py │ └── kafka_stream_data.py └── spark_pgsql └── spark_streaming.pyThe airflow directory contains a custom Dockerfile for setting up airflow and a dags directory to create and schedule the tasks.The data directory contains the last_processed.json file which is crucial for the Kafka streaming task. Further details on its role will be provided in the Kafka section.The docker-compose-airflow.yaml file defines all the services required to run airflow.The docker-compose.yaml file specifies the Kafka services and includes a docker-proxy. 
This proxy is essential for executing Spark jobs through a docker-operator in Airflow, a concept that will be elaborated on later.The spark directory contains a custom Dockerfile for spark setup.src contains the python modules needed to run the application.To set up your local development environment, start by installing the required Python packages. The only essential package is psycopg2-binary. You have the option to install just this package or all the packages listed in the requirements.txt file. To install all packages, use the following command: pip install -r requirements.txtNext let’s dive step by step into the project details. About the APIThe API is RappelConso from the French public services. It gives access to data relating to recalls of products declared by professionals in France. The data is in French and it contains initially 31 columns (or fields). Some of the most important are: reference_fiche (reference sheet): Unique identifier of the recalled product. It will act as the primary key of our Postgres database later.categorie_de_produit (Product category): For instance food, electrical appliance, tools, transport means, etc …sous_categorie_de_produit (Product sub-category): For instance we can have meat, dairy products, cereals as sub-categories for the food category.motif_de_rappel (Reason for recall): Self explanatory and one of the most important fields.date_de_publication which translates to the publication date.risques_encourus_par_le_consommateur which contains the risks that the consumer may encounter when using the product.There are also several fields that correspond to different links, such as link to product image, link to the distributers list, etc..You can see some examples and query manually the dataset records using this link. We refined the data columns in a few key ways: Columns like ndeg_de_version and rappelguid, which were part of a versioning system, have been removed as they aren’t needed for our project.We combined columns that deal with consumer risks — risques_encourus_par_le_consommateur and description_complementaire_du_risque — for a clearer overview of product risks.The date_debut_fin_de_commercialisation column, which indicates the marketing period, has been divided into two separate columns. This split allows for easier queries about the start or end of a product’s marketing.We’ve removed accents from all columns except for links, reference numbers, and dates. This is important because some text processing tools struggle with accented characters.For a detailed look at these changes, check out our transformation script at src/kafka_client/transformations.py. The updated list of columns is available insrc/constants.py under DB_FIELDS. Kafka streamingTo avoid sending all the data from the API each time we run the streaming task, we define a local json file that contains the last publication date of the latest streaming. Then we will use this date as the starting date for our new streaming task. To give an example, suppose that the latest recalled product has a publication date of 22 november 2023. If we make the hypothesis that all of the recalled products infos before this date are already persisted in our Postgres database, We can now stream the data starting from the 22 november. Note that there is an overlap because we may have a scenario where we didn’t handle all of the data of the 22nd of November. 
The file is saved in ./data/last_processed.json and has this format: {last_processed:"2023-11-22"}By default the file is an empty json which means that our first streaming task will process all of the API records which are 10 000 approximately. Note that in a production setting this approach of storing the last processed date in a local file is not viable and other approaches involving an external database or an object storage service may be more suitable. The code for the kafka streaming can be found on ./src/kafka_client/kafka_stream_data.py and it involves primarily querying the data from the API, making the transformations, removing potential duplicates, updating the last publication date and serving the data using the kafka producer. The next step is to run the kafka service defined the docker-compose defined below: version: '3' services: kafka: image: 'bitnami/kafka:latest' ports: - '9094:9094' networks: - airflow-kafka environment: - KAFKA_CFG_NODE_ID=0 - KAFKA_CFG_PROCESS_ROLES=controller,broker - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093,EXTERNAL://:9094 - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092,EXTERNAL://localhost:9094 - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,EXTERNAL:PLAINTEXT,PLAINTEXT:PLAINTEXT - KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka:9093 - KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER volumes: - ./kafka:/bitnami/kafka kafka-ui: container_name: kafka-ui-1 image: provectuslabs/kafka-ui:latest ports: - 8800:8080 depends_on: - kafka environment: KAFKA_CLUSTERS_0_NAME: local KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: PLAINTEXT://kafka:9092 DYNAMIC_CONFIG_ENABLED: 'true' networks: - airflow-kafka networks: airflow-kafka: external: trueThe key highlights from this file are: The kafka service uses a base image bitnami/kafka.We configure the service with only one broker which is enough for our small project. A Kafka broker is responsible for receiving messages from producers (which are the sources of data), storing these messages, and delivering them to consumers (which are the sinks or end-users of the data). The broker listens to port 9092 for internal communication within the cluster and port 9094 for external communication, allowing clients outside the Docker network to connect to the Kafka broker.In the volumes part, we map the local directory kafka to the docker container directory /bitnami/kafka to ensure data persistence and a possible inspection of Kafka’s data from the host system.We set-up the service kafka-ui that uses the docker image provectuslabs/kafka-ui:latest . This provides a user interface to interact with the Kafka cluster. This is especially useful for monitoring and managing Kafka topics and messages.To ensure communication between kafka and airflow which will be run as an external service, we will use an external network airflow-kafka.Before running the kafka service, let’s create the airflow-kafka network using the following command: docker network create airflow-kafkaNow everything is set to finally start our kafka service docker-compose up After the services start, visit the kafka-ui at http://localhost:8800/. Normally you should get something like this: Overview of the Kafka UI. Image by the author.Next we will create our topic that will contain the API messages. Click on Topics on the left and then Add a topic at the top left. Our topic will be called rappel_conso and since we have only one broker we set the replication factor to 1. 
We will also set the partitions number to 1 since we will have only one consumer thread at a time so we won’t need any parallelism. Finally, we can set the time to retain data to a small number like one hour since we will run the spark job right after the kafka streaming task, so we won’t need to retain the data for a long time in the kafka topic. Postgres set-upBefore setting-up our spark and airflow configurations, let’s create the Postgres database that will persist our API data. I used the pgadmin 4 tool for this task, however any other Postgres development platform can do the job. To install postgres and pgadmin, visit this link https://www.postgresql.org/download/ and get the packages following your operating system. Then when installing postgres, you need to setup a password that we will need later to connect to the database from the spark environment. You can also leave the port at 5432. If your installation has succeeded, you can start pgadmin and you should observe something like this window: Overview of pgAdmin interface. Image by the author.Since we have a lot of columns for the table we want to create, we chose to create the table and add its columns with a script using psycopg2, a PostgreSQL database adapter for Python. You can run the script with the command: python scripts/create_table.pyNote that in the script I saved the postgres password as environment variable and name it POSTGRES_PASSWORD. So if you use another method to access the password you need to modify the script accordingly. Spark Set-upHaving set-up our Postgres database, let’s delve into the details of the spark job. The goal is to stream the data from the Kafka topic rappel_conso to the Postgres table rappel_conso_table. from pyspark.sql import SparkSession from pyspark.sql.types import ( StructType, StructField, StringType, ) from pyspark.sql.functions import from_json, col from src.constants import POSTGRES_URL, POSTGRES_PROPERTIES, DB_FIELDS import logging logging.basicConfig( level=logging.INFO, format="%(asctime)s:%(funcName)s:%(levelname)s:%(message)s" ) def create_spark_session() -> SparkSession: spark = ( SparkSession.builder.appName("PostgreSQL Connection with PySpark") .config( "spark.jars.packages", "org.postgresql:postgresql:42.5.4,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0", ) .getOrCreate() ) logging.info("Spark session created successfully") return spark def create_initial_dataframe(spark_session): """ Reads the streaming data and creates the initial dataframe accordingly. """ try: # Gets the streaming data from topic random_names df = ( spark_session.readStream.format("kafka") .option("kafka.bootstrap.servers", "kafka:9092") .option("subscribe", "rappel_conso") .option("startingOffsets", "earliest") .load() ) logging.info("Initial dataframe created successfully") except Exception as e: logging.warning(f"Initial dataframe couldn't be created due to exception: {e}") raise return df def create_final_dataframe(df): """ Modifies the initial dataframe, and creates the final dataframe. 
""" schema = StructType( [StructField(field_name, StringType(), True) for field_name in DB_FIELDS] ) df_out = ( df.selectExpr("CAST(value AS STRING)") .select(from_json(col("value"), schema).alias("data")) .select("data.*") ) return df_out def start_streaming(df_parsed, spark): """ Starts the streaming to table spark_streaming.rappel_conso in postgres """ # Read existing data from PostgreSQL existing_data_df = spark.read.jdbc( POSTGRES_URL, "rappel_conso", properties=POSTGRES_PROPERTIES ) unique_column = "reference_fiche" logging.info("Start streaming ...") query = df_parsed.writeStream.foreachBatch( lambda batch_df, _: ( batch_df.join( existing_data_df, batch_df[unique_column] == existing_data_df[unique_column], "leftanti" ) .write.jdbc( POSTGRES_URL, "rappel_conso", "append", properties=POSTGRES_PROPERTIES ) ) ).trigger(once=True) \ .start() return query.awaitTermination() def write_to_postgres(): spark = create_spark_session() df = create_initial_dataframe(spark) df_final = create_final_dataframe(df) start_streaming(df_final, spark=spark) if __name__ == "__main__": write_to_postgres()Let’s break down the key highlights and functionalities of the spark job: First we create the Spark sessiondef create_spark_session() -> SparkSession: spark = ( SparkSession.builder.appName("PostgreSQL Connection with PySpark") .config( "spark.jars.packages", "org.postgresql:postgresql:42.5.4,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0", ) .getOrCreate() ) logging.info("Spark session created successfully") return spark2. The create_initial_dataframe function ingests streaming data from the Kafka topic using Spark's structured streaming. def create_initial_dataframe(spark_session): """ Reads the streaming data and creates the initial dataframe accordingly. """ try: # Gets the streaming data from topic random_names df = ( spark_session.readStream.format("kafka") .option("kafka.bootstrap.servers", "kafka:9092") .option("subscribe", "rappel_conso") .option("startingOffsets", "earliest") .load() ) logging.info("Initial dataframe created successfully") except Exception as e: logging.warning(f"Initial dataframe couldn't be created due to exception: {e}") raise return df3. Once the data is ingested, create_final_dataframe transforms it. It applies a schema (defined by the columns DB_FIELDS) to the incoming JSON data, ensuring that the data is structured and ready for further processing. def create_final_dataframe(df): """ Modifies the initial dataframe, and creates the final dataframe. """ schema = StructType( [StructField(field_name, StringType(), True) for field_name in DB_FIELDS] ) df_out = ( df.selectExpr("CAST(value AS STRING)") .select(from_json(col("value"), schema).alias("data")) .select("data.*") ) return df_out4. The start_streaming function reads existing data from the database, compares it with the incoming stream, and appends new records. 
def start_streaming(df_parsed, spark): """ Starts the streaming to table spark_streaming.rappel_conso in postgres """ # Read existing data from PostgreSQL existing_data_df = spark.read.jdbc( POSTGRES_URL, "rappel_conso", properties=POSTGRES_PROPERTIES ) unique_column = "reference_fiche" logging.info("Start streaming ...") query = df_parsed.writeStream.foreachBatch( lambda batch_df, _: ( batch_df.join( existing_data_df, batch_df[unique_column] == existing_data_df[unique_column], "leftanti" ) .write.jdbc( POSTGRES_URL, "rappel_conso", "append", properties=POSTGRES_PROPERTIES ) ) ).trigger(once=True) \ .start() return query.awaitTermination()The complete code for the Spark job is in the file src/spark_pgsql/spark_streaming.py. We will use the Airflow DockerOperator to run this job, as explained in the upcoming section. Let’s go through the process of creating the Docker image we need to run our Spark job. Here’s the Dockerfile for reference: FROM bitnami/spark:latest WORKDIR /opt/bitnami/spark RUN pip install py4j COPY ./src/spark_pgsql/spark_streaming.py ./spark_streaming.py COPY ./src/constants.py ./src/constants.py ENV POSTGRES_DOCKER_USER=host.docker.internal ARG POSTGRES_PASSWORD ENV POSTGRES_PASSWORD=$POSTGRES_PASSWORDIn this Dockerfile, we start with the bitnami/spark image as our base. It's a ready-to-use Spark image. We then install py4j, a tool needed for Spark to work with Python. The environment variables POSTGRES_DOCKER_USER and POSTGRES_PASSWORD are set up for connecting to a PostgreSQL database. Since our database is on the host machine, we use host.docker.internal as the user. This allows our Docker container to access services on the host, in this case, the PostgreSQL database. The password for PostgreSQL is passed as a build argument, so it's not hard-coded into the image. It’s important to note that this approach, especially passing the database password at build time, might not be secure for production environments. It could potentially expose sensitive information. In such cases, more secure methods like Docker BuildKit should be considered. Now, let’s build the Docker image for Spark: docker build -f spark/Dockerfile -t rappel-conso/spark:latest --build-arg POSTGRES_PASSWORD=$POSTGRES_PASSWORD .This command will build the image rappel-conso/spark:latest . This image includes everything needed to run our Spark job and will be used by Airflow’s DockerOperator to execute the job. Remember to replace $POSTGRES_PASSWORD with your actual PostgreSQL password when running this command. AirflowAs said earlier, Apache Airflow serves as the orchestration tool in the data pipeline. It is responsible for scheduling and managing the workflow of the tasks, ensuring they are executed in a specified order and under defined conditions. In our system, Airflow is used to automate the data flow from streaming with Kafka to processing with Spark. Airflow DAGLet’s take a look at the Directed Acyclic Graph (DAG) that will outline the sequence and dependencies of tasks, enabling Airflow to manage their execution. 
start_date = datetime.today() - timedelta(days=1) default_args = { "owner": "airflow", "start_date": start_date, "retries": 1, # number of retries before failing the task "retry_delay": timedelta(seconds=5), } with DAG( dag_id="kafka_spark_dag", default_args=default_args, schedule_interval=timedelta(days=1), catchup=False, ) as dag: kafka_stream_task = PythonOperator( task_id="kafka_data_stream", python_callable=stream, dag=dag, ) spark_stream_task = DockerOperator( task_id="pyspark_consumer", image="rappel-conso/spark:latest", api_version="auto", auto_remove=True, command="./bin/spark-submit --master local[*] --packages org.postgresql:postgresql:42.5.4,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 ./spark_streaming.py", docker_url='tcp://docker-proxy:2375', environment={'SPARK_LOCAL_HOSTNAME': 'localhost'}, network_mode="airflow-kafka", dag=dag, ) kafka_stream_task >> spark_stream_taskHere are the key elements from this configuration The tasks are set to execute daily.The first task is the Kafka Stream Task. It is implemented using the PythonOperator to run the Kafka streaming function. This task streams data from the RappelConso API into a Kafka topic, initiating the data processing workflow.The downstream task is the Spark Stream Task. It uses the DockerOperator for execution. It runs a Docker container with our custom Spark image, tasked with processing the data received from Kafka.The tasks are arranged sequentially, where the Kafka streaming task precedes the Spark processing task. This order is crucial to ensure that data is first streamed and loaded into Kafka before being processed by Spark.About the DockerOperatorUsing docker operator allow us to run docker-containers that correspond to our tasks. The main advantage of this approach is easier package management, better isolation and enhanced testability. We will demonstrate the use of this operator with the spark streaming task. Here are some key details about the docker operator for the spark streaming task: We will use the image rappel-conso/spark:latest specified in the Spark Set-up section.The command will run the Spark submit command inside the container, specifying the master as local, including necessary packages for PostgreSQL and Kafka integration, and pointing to the spark_streaming.py script that contains the logic for the Spark job.docker_url represents the url of the host running the docker daemon. The natural solution is to set it as unix://var/run/docker.sock and to mount the var/run/docker.sock in the airflow docker container. One problem we had with this approach is a permission error to use the socket file inside the airflow container. A common workaround, changing permissions with chmod 777 var/run/docker.sock, poses significant security risks. To circumvent this, we implemented a more secure solution using bobrik/socat as a docker-proxy. This proxy, defined in a Docker Compose service, listens on TCP port 2375 and forwards requests to the Docker socket: docker-proxy: image: bobrik/socat command: "TCP4-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock" ports: - "2376:2375" volumes: - /var/run/docker.sock:/var/run/docker.sock networks: - airflow-kafkaIn the DockerOperator, we can access the host docker /var/run/docker.sock via thetcp://docker-proxy:2375 url, as described here and here. Finally we set the network mode to airflow-kafka. This allows us to use the same network as the proxy and the docker running kafka. 
This is crucial since the spark job will consume the data from the kafka topic so we must ensure that both containers are able to communicate.After defining the logic of our DAG, let’s understand now the airflow services configuration in the docker-compose-airflow.yaml file. Airflow ConfigurationThe compose file for airflow was adapted from the official apache airflow docker-compose file. You can have a look at the original file by visiting this link. As pointed out by this article, this proposed version of airflow is highly resource-intensive mainly because the core-executor is set to CeleryExecutor that is more adapted for distributed and large-scale data processing tasks. Since we have a small workload, using a single-noded LocalExecutor is enough. Here is an overview of the changes we made on the docker-compose configuration of airflow: We set the environment variable AIRFLOW__CORE__EXECUTOR to LocalExecutor.We removed the services airflow-worker and flower because they only work for the Celery executor. We also removed the redis caching service since it works as a backend for celery. We also won’t use the airflow-triggerer so we remove it too.We replaced the base image ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.7.3} for the remaining services, mainly the scheduler and the webserver, by a custom image that we will build when running the docker-compose.version: '3.8' x-airflow-common: &airflow-common build: context: . dockerfile: ./airflow_resources/Dockerfile image: de-project/airflow:latestWe mounted the necessary volumes that are needed by airflow. AIRFLOW_PROJ_DIR designates the airflow project directory that we will define later. We also set the network as airflow-kafka to be able to communicate with the kafka boostrap servers.volumes: - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config - ./src:/opt/airflow/dags/src - ./data/last_processed.json:/opt/airflow/data/last_processed.json user: "${AIRFLOW_UID:-50000}:0" networks: - airflow-kafkaNext, we need to create some environment variables that will be used by docker-compose: echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_PROJ_DIR=\"./airflow_resources\"" > .envWhere AIRFLOW_UID represents the User ID in Airflow containers and AIRFLOW_PROJ_DIR represents the airflow project directory. Now everything is set-up to run your airflow service. You can start it with this command: docker compose -f docker-compose-airflow.yaml upThen to access the airflow user interface you can visit this url http://localhost:8080 . Sign-in window on Airflow. Image by the author.By default, the username and password are airflow for both. After signing in, you will see a list of Dags that come with airflow. Look for the dag of our project kafka_spark_dag and click on it. Overview of the task window in airflow. Image by the author.You can start the task by clicking on the button next to DAG: kafka_spark_dag. Next, you can check the status of your tasks in the Graph tab. A task is done when it turns green. So, when everything is finished, it should look something like this: Image by the author.To verify that the rappel_conso_table is filled with data, use the following SQL query in the pgAdmin Query Tool: SELECT count(*) FROM rappel_conso_tableWhen I ran this in January 2024, the query returned a total of 10022 rows. Your results should be around this number as well. 
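If you prefer to check the load from Python rather than pgAdmin, a short psycopg2 sketch works too. The connection parameters below are assumptions that should match whatever you configured during the Postgres set-up; the table name rappel_conso_table and the reference_fiche column come from the article.

    import psycopg2

    # Assumed local connection details; adjust host/user/password to your own set-up.
    conn = psycopg2.connect(
        host="localhost", port=5432, dbname="postgres",
        user="postgres", password="your_password",
    )

    with conn, conn.cursor() as cur:
        # Total number of recall records loaded by the Spark job.
        cur.execute("SELECT count(*) FROM rappel_conso_table")
        print("rows:", cur.fetchone()[0])

        # Spot-check that the primary key is actually unique after the leftanti join.
        cur.execute(
            "SELECT count(*) - count(DISTINCT reference_fiche) FROM rappel_conso_table"
        )
        print("duplicate reference_fiche values:", cur.fetchone()[0])

    conn.close()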
Conclusion
This article has successfully demonstrated the steps to build a basic yet functional data engineering pipeline using Kafka, Airflow, Spark, PostgreSQL, and Docker. Aimed primarily at beginners and those new to the field of data engineering, it provides a hands-on approach to understanding and implementing key concepts in data streaming, processing, and storage. Throughout this guide, we’ve covered each component of the pipeline in detail, from setting up Kafka for data streaming to using Airflow for task orchestration, and from processing data with Spark to storing it in PostgreSQL. The use of Docker throughout the project simplifies the setup and ensures consistency across different environments. It’s important to note that while this setup is ideal for learning and small-scale projects, scaling it for production use would require additional considerations, especially in terms of security and performance optimization. Future enhancements could include integrating more advanced data processing techniques, exploring real-time analytics, or even expanding the pipeline to incorporate more complex data sources. In essence, this project serves as a practical starting point for those looking to get their hands dirty with data engineering. It lays the groundwork for understanding the basics, providing a solid foundation for further exploration in the field. In the second part, we’ll explore how to effectively use the data stored in our PostgreSQL database. We’ll introduce agents powered by Large Language Models (LLMs) and a variety of tools that enable us to interact with the database using natural language queries. So, stay tuned!
To reach out
LinkedIn: https://www.linkedin.com/in/hamza-gharbi-043045151/
Twitter: https://twitter.com/HamzaGh25079790
End-to-End Data Engineering System on Real Data with Kafka, Spark, Airflow, Postgres, and Docker was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story. View the full article
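As a companion to the article above, here is a stripped-down sketch of the kind of producer logic its Kafka streaming task performs: pull records from the RappelConso API, keep only those newer than the last processed date, and publish them to the rappel_conso topic. It is a simplification of the repository's kafka_stream_data.py and assumes the kafka-python client; the API URL, query parameters, and field handling are placeholders you would replace with the real endpoint and schema.

    import json
    import requests
    from kafka import KafkaProducer  # kafka-python

    # Assumptions: the external listener exposed by the compose file (localhost:9094),
    # the topic created in the article, and a placeholder API URL that must be replaced
    # with the actual RappelConso records endpoint.
    API_URL = "https://example.invalid/rappel-conso/records"
    TOPIC = "rappel_conso"
    LAST_PROCESSED = "2023-11-22"

    producer = KafkaProducer(
        bootstrap_servers="localhost:9094",
        value_serializer=lambda record: json.dumps(record).encode("utf-8"),
    )

    response = requests.get(API_URL, params={"since": LAST_PROCESSED}, timeout=30)
    response.raise_for_status()

    for record in response.json().get("records", []):
        # Only forward records published on or after the last processed date.
        if record.get("date_de_publication", "") >= LAST_PROCESSED:
            producer.send(TOPIC, value=record)

    producer.flush()
    producer.close()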
  8. Today, we are announcing that your MySQL 5.7 and PostgreSQL 11 database instances running on Amazon Aurora and Amazon Relational Database Service (Amazon RDS) will be automatically enrolled into Amazon RDS Extended Support starting on February 29, 2024. This will help avoid unplanned downtime and compatibility issues that can arise with automatically upgrading to a new major version, and it provides you with more control over when you want to upgrade the major version of your database. This automatic enrollment may mean that you will experience higher charges when RDS Extended Support begins. You can avoid these charges by upgrading your database to a newer DB version before the start of RDS Extended Support.
What is Amazon RDS Extended Support?
In September 2023, we announced Amazon RDS Extended Support, which allows you to continue running your database on a major engine version past its RDS end of standard support date on Amazon Aurora or Amazon RDS at an additional cost. Until community end of life (EoL), the MySQL and PostgreSQL open source communities manage common vulnerabilities and exposures (CVE) identification, patch generation, and bug fixes for the respective engines. The communities release a new minor version every quarter containing these security patches and bug fixes until the database major version reaches community end of life. After the community end of life date, CVE patches or bug fixes are no longer available and the community considers those engines unsupported. For example, MySQL 5.7 and PostgreSQL 11 are no longer supported by the communities as of October and November 2023, respectively. We are grateful to the communities for their continued support of these major versions and a transparent process and timeline for transitioning to the newest major version.
With RDS Extended Support, Amazon Aurora and RDS take on engineering the critical CVE patches and bug fixes for up to three years beyond a major version’s community EoL. For those 3 years, Amazon Aurora and RDS will work to identify CVEs and bugs in the engine, generate patches, and release them to you as quickly as possible. Under RDS Extended Support, we will continue to offer support, such that the open source community’s end of support for an engine’s major version does not leave your applications exposed to critical security vulnerabilities or unresolved bugs.
You might wonder why we are charging for RDS Extended Support rather than providing it as part of the RDS service. It’s because the engineering work for maintaining security and functionality of community EoL engines requires AWS to invest developer resources for critical CVE patches and bug fixes. This is why RDS Extended Support charges only those customers who need the additional flexibility to stay on a version past community EoL.
RDS Extended Support may be useful to help you meet your business requirements for your applications if you have particular dependencies on a specific MySQL or PostgreSQL major version, such as compatibility with certain plugins or custom features. If you are currently running on-premises database servers or self-managed Amazon Elastic Compute Cloud (Amazon EC2) instances, you can migrate to Amazon Aurora MySQL-Compatible Edition, Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, or Amazon RDS for PostgreSQL beyond the community EoL date, and continue to use these versions with RDS Extended Support while benefiting from a managed service.
If you need to migrate many databases, you can also utilize RDS Extended Support to split your migration into phases, ensuring a smooth transition without overwhelming IT resources. In 2024, RDS Extended Support will be available for RDS for MySQL major versions 5.7 and higher, RDS for PostgreSQL major versions 11 and higher, Aurora MySQL-compatible version 2 and higher, and Aurora PostgreSQL-compatible version 11 and higher. For a list of all future supported versions, see Supported MySQL major versions on Amazon RDS and Amazon Aurora major versions in the AWS documentation.
Community major version | RDS/Aurora version | Community end of life date | End of RDS standard support date | Start of RDS Extended Support pricing | End of RDS Extended Support
MySQL 5.7 | RDS for MySQL 5.7 | October 2023 | February 29, 2024 | March 1, 2024 | February 28, 2027
MySQL 5.7 | Aurora MySQL 2 | October 2023 | October 31, 2024 | December 1, 2024 |
PostgreSQL 11 | RDS for PostgreSQL 11 | November 2023 | March 31, 2024 | April 1, 2024 | March 31, 2027
PostgreSQL 11 | Aurora PostgreSQL 11 | November 2023 | February 29, 2024 | |
RDS Extended Support is priced per vCPU per hour. Learn more about pricing details and timelines for RDS Extended Support at Amazon Aurora pricing, RDS for MySQL pricing, and RDS for PostgreSQL pricing. For more information, see the blog posts about Amazon RDS Extended Support for MySQL and PostgreSQL databases in the AWS Database Blog.
Why are we automatically enrolling all databases to Amazon RDS Extended Support?
We had originally informed you that RDS Extended Support would provide the opt-in APIs and console features in December 2023. In that announcement, we said that if you decided not to opt your database in to RDS Extended Support, it would automatically upgrade to a newer engine version starting on March 1, 2024. For example, you would be upgraded from Aurora MySQL 2 or RDS for MySQL 5.7 to Aurora MySQL 3 or RDS for MySQL 8.0 and from Aurora PostgreSQL 11 or RDS for PostgreSQL 11 to Aurora PostgreSQL 15 and RDS for PostgreSQL 15, respectively. However, we heard lots of feedback from customers that these automatic upgrades may cause their applications to experience breaking changes and other unpredictable behavior between major versions of community DB engines. For example, an unplanned major version upgrade could introduce compatibility issues or downtime if applications are not ready for MySQL 8.0 or PostgreSQL 15. Automatic enrollment in RDS Extended Support gives you additional time and more control to organize, plan, and test your database upgrades on your own timeline, providing you flexibility on when to transition to new major versions while continuing to receive critical security and bug fixes from AWS. If you’re worried about increased costs due to automatic enrollment in RDS Extended Support, you can avoid RDS Extended Support and associated charges by upgrading before the end of RDS standard support.
How to upgrade your database to avoid RDS Extended Support charges
Although RDS Extended Support helps you schedule your upgrade on your own timeline, sticking with older versions indefinitely means missing out on the best price-performance for your database workload and incurring additional costs from RDS Extended Support. MySQL 8.0 on Aurora MySQL, also known as Aurora MySQL 3, unlocks support for popular Aurora features, such as Global Database, Amazon RDS Proxy, Performance Insights, Parallel Query, and Serverless v2 deployments.
Upgrading to RDS for MySQL 8.0 provides up to three times higher performance versus MySQL 5.7, along with features such as Multi-AZ cluster deployments, Optimized Reads, Optimized Writes, and support for AWS Graviton2 and Graviton3-based instances. PostgreSQL 15 on Aurora PostgreSQL supports the Aurora I/O Optimized configuration, Aurora Serverless v2, Babelfish for Aurora PostgreSQL, the pgvector extension, Trusted Language Extensions for PostgreSQL (TLE), and AWS Graviton3-based instances, as well as community enhancements. Upgrading to RDS for PostgreSQL 15 provides features such as Multi-AZ DB cluster deployments, RDS Optimized Reads, the HypoPG extension, the pgvector extension, TLEs for PostgreSQL, and AWS Graviton3-based instances. Major version upgrades may make database changes that are not backward-compatible with existing applications. You should manually modify your database instance to upgrade to the major version. It is strongly recommended that you thoroughly test any major version upgrade on non-production instances before applying it to production to ensure compatibility with your applications. For more information about an in-place upgrade from MySQL 5.7 to 8.0, see the incompatibilities between the two versions, Aurora MySQL in-place major version upgrade, and RDS for MySQL upgrades in the AWS documentation. For the in-place upgrade from PostgreSQL 11 to 15, you can use the pg_upgrade method. To minimize downtime during upgrades, we recommend using Fully Managed Blue/Green Deployments in Amazon Aurora and Amazon RDS. With just a few steps, you can use Amazon RDS Blue/Green Deployments to create a separate, synchronized, fully managed staging environment that mirrors the production environment. This involves launching a parallel green environment with higher-version replicas of your lower-version production databases. After validating the green environment, you can shift traffic over to it. Then, the blue environment can be decommissioned. To learn more, see Blue/Green Deployments for Aurora MySQL and Aurora PostgreSQL or Blue/Green Deployments for RDS for MySQL and RDS for PostgreSQL in the AWS documentation. In most cases, Blue/Green Deployments are the best option to reduce downtime, except for limited cases in Amazon Aurora or Amazon RDS. For more information on performing a major version upgrade in each DB engine, see the following guides in the AWS documentation:
  • Upgrading the MySQL DB engine for Amazon RDS
  • Upgrading the PostgreSQL DB engine for Amazon RDS
  • Upgrading the Amazon Aurora MySQL DB cluster
  • Upgrading Amazon Aurora PostgreSQL DB clusters
Now available
Amazon RDS Extended Support is now available for all customers running Amazon Aurora and Amazon RDS instances using MySQL 5.7, PostgreSQL 11, and higher major versions in AWS Regions, including the AWS GovCloud (US) Regions, beyond the end of the standard support date in 2024. You don’t need to opt in to RDS Extended Support, and you get the flexibility to upgrade your databases and continued support for up to 3 years. Learn more about RDS Extended Support in the Amazon Aurora User Guide and the Amazon RDS User Guide. For pricing details and timelines for RDS Extended Support, see Amazon Aurora pricing, RDS for MySQL pricing, and RDS for PostgreSQL pricing. Please send feedback to AWS re:Post for Amazon RDS and Amazon Aurora or through your usual AWS Support contacts. — Channy View the full article
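A practical first step ahead of the dates in the table above is simply finding out which of your instances will be auto-enrolled. A hedged boto3 sketch that scans the current region for RDS instances still running MySQL 5.7 or PostgreSQL 11:

    import boto3

    rds = boto3.client("rds")

    paginator = rds.get_paginator("describe_db_instances")
    for page in paginator.paginate():
        for db in page["DBInstances"]:
            engine = db["Engine"]
            version = db["EngineVersion"]
            # Engines/major versions that enter RDS Extended Support in 2024.
            if (engine == "mysql" and version.startswith("5.7")) or (
                engine == "postgres" and version.startswith("11.")
            ):
                print(f'{db["DBInstanceIdentifier"]}: {engine} {version}')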
  9. Amazon Relational Database Service (Amazon RDS) now supports a Dedicated Log Volume for PostgreSQL, MySQL, and MariaDB databases. An Amazon RDS Dedicated Log Volume allows customers to select a configuration where the most latency sensitive components of their database, the transaction logs, are stored in a separate, dedicated volume. Dedicated Log Volumes work with Provisioned IOPS storage and are recommended for databases with 5,000 GiB or more of allocated storage. View the full article
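A sketch of how this might be enabled programmatically, assuming the DedicatedLogVolume parameter that the RDS create/modify instance APIs expose for this feature (treat the parameter name as an assumption and verify it against the current boto3 documentation), with an illustrative Provisioned IOPS configuration:

    import boto3

    rds = boto3.client("rds")

    # Assumption: DedicatedLogVolume is the API flag for this feature; per the announcement
    # above it pairs with Provisioned IOPS storage and allocations of 5,000 GiB or more.
    rds.create_db_instance(
        DBInstanceIdentifier="pg-with-dlv",   # illustrative name
        Engine="postgres",
        DBInstanceClass="db.r6g.2xlarge",
        MasterUsername="postgres",
        MasterUserPassword="change-me",
        AllocatedStorage=6000,                # GiB
        StorageType="io1",
        Iops=12000,
        DedicatedLogVolume=True,
    )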
  10. Amazon Relational Database Service (Amazon RDS) now supports M6in, M6idn, R6in, and R6idn database (DB) instances for RDS for PostgreSQL, MySQL, and MariaDB. These network optimized DB instances deliver up to 200Gbps network bandwidth, which is 300% more than similar sized M6i and R6i database instances. Enhanced network bandwidth makes M6in and R6in DB instances ideal for write-intensive workloads. M6idn and R6idn support local block storage with up to 7.6 TB of NVMe-based solid state disk (SSD) storage. View the full article
  11. Today, AWS announces the general availability of pgactive: Active-active Replication Extension for PostgreSQL, available for Amazon Relational Database Service (RDS) for PostgreSQL. pgactive lets you use asynchronous active-active replication to stream data between database instances, including writers located in different AWS Regions, giving you additional resiliency and flexibility for operations like switching write traffic to a different instance in order to maintain availability. View the full article
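Like other PostgreSQL extensions, pgactive is enabled per database once the instance-level prerequisites (parameter group settings) are in place. A minimal psycopg2 sketch is below; it deliberately stops at creating the extension and leaves pgactive's own group-setup functions to the AWS documentation, since their exact signatures are not shown here. The connection string is a placeholder.

    import psycopg2

    # Assumed connection to an RDS for PostgreSQL instance whose parameter group
    # has pgactive enabled; replace host and credentials with your own.
    conn = psycopg2.connect(
        "host=mydb.example.us-east-1.rds.amazonaws.com dbname=app user=postgres password=secret"
    )
    conn.autocommit = True

    with conn.cursor() as cur:
        # Confirm the extension ships with this engine version before creating it.
        cur.execute(
            "SELECT name, default_version FROM pg_available_extensions WHERE name = 'pgactive'"
        )
        print(cur.fetchone())

        # Enable active-active replication support in this database.
        cur.execute("CREATE EXTENSION IF NOT EXISTS pgactive")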
  12. Amazon Relational Database Service (RDS) for PostgreSQL now supports v0.5.0 of the pgvector extension to store embeddings from machine learning (ML) models in your database and to perform efficient similarity searches. This version of the extension introduces HNSW indexing support, parallelizes ivfflat index builds, and improves the performance of its distance functions. View the full article
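A short sketch of what the new HNSW support looks like in practice, using psycopg2 and an illustrative three-dimensional embedding column (real ML embeddings would be much wider); connection details are placeholders.

    import psycopg2

    conn = psycopg2.connect("host=localhost dbname=vectors user=postgres password=secret")
    conn.autocommit = True

    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute(
            "CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))"
        )
        cur.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]')")

        # New in pgvector 0.5.0: HNSW indexes (here with the L2 distance operator class).
        cur.execute(
            "CREATE INDEX IF NOT EXISTS items_hnsw ON items USING hnsw (embedding vector_l2_ops)"
        )

        # Nearest-neighbour search ordered by L2 distance.
        cur.execute("SELECT id FROM items ORDER BY embedding <-> '[2,2,2]' LIMIT 1")
        print(cur.fetchone())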
  13. Amazon Relational Database Service (Amazon RDS) for PostgreSQL 16.0 is now available in the Amazon RDS Database Preview Environment, allowing you to evaluate PostgreSQL 16.0 on Amazon RDS for PostgreSQL. You can deploy PostgreSQL 16.0 in the Preview Environment and have the same benefits of a fully managed database, making it simpler to set up, operate, and monitor databases. PostgreSQL 16.0 in the Preview Environment also includes support for logical decoding on read replicas, AWS libcrypto (AWS-LC), and over 80 PostgreSQL extensions such as pgvector, pg_tle, h3-pg, pg_cron, and rdkit. View the full article
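One of the headline PostgreSQL 16 features mentioned above, logical decoding on read replicas, can be exercised with the standard replication-slot functions. A minimal sketch run against a standby; the host, credentials, and slot name are illustrative, and the instance must have logical replication enabled in its parameter group.

    import psycopg2

    # Assumed connection to a PostgreSQL 16.0 read replica in the preview environment.
    conn = psycopg2.connect("host=my-pg16-replica dbname=postgres user=postgres password=secret")
    conn.autocommit = True

    with conn.cursor() as cur:
        # Create a logical replication slot directly on the replica (newly possible in PostgreSQL 16).
        cur.execute(
            "SELECT pg_create_logical_replication_slot('replica_slot', 'test_decoding')"
        )
        # Peek at decoded changes without consuming them.
        cur.execute(
            "SELECT lsn, data FROM pg_logical_slot_peek_changes('replica_slot', NULL, 10)"
        )
        for lsn, data in cur.fetchall():
            print(lsn, data)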
  14. Amazon Relational Database Service (Amazon RDS) PostgreSQL Multi-AZ Deployments with two readable standbys now supports major version upgrades. Starting today, you can upgrade your RDS for PostgreSQL Multi-AZ Deployments with two readable standbys from major version 13.4 and above and 14.5 and above to 15.4 with just a few clicks on the AWS Management Console. View the full article
  15. PostgreSQL is a powerful, free and open-source relational database management system. It is a performance-oriented and versatile relational database that offers various features and extensibility options. Despite all the features and tools that are offered natively by the PostgreSQL server, it allows us to add additional functionality and features through the use of extensions. PostgreSQL extensions refer to a set of external modules or libraries that are used to enhance the functionality of the database engines by providing extra features or data types. This makes the PostgreSQL database encapsulate a more comprehensive range of requirements. This tutorial demonstrates how you can view the installed extensions in a PostgreSQL server using various commands and methods. Prerequisites: To follow this tutorial, you should have the following: Installed and configured PostgreSQL on your system Basic understanding of SQL and PostgreSQL Purpose of PostgreSQL Extensions PostgreSQL extensions serve multiple purposes including: Extending Database Functionality – Extensions provide additional features, data types, or functionalities that are unavailable by default. In addition, they allow us to add specialized capabilities to the database to meet the specific requirements. Simplifying Development – Extensions can streamline the development process by providing pre-packaged functionality that saves time and effort. Instead of building custom solutions, we can leverage the existing extensions to add the advanced application features. Enhancing Database Performance – Some extensions optimize and improve the performance of PostgreSQL databases. For example, they can introduce the advanced indexing techniques, query optimization, or caching mechanisms to speed up the queries and overall database operations. The role of an extension heavily determines the intended purpose. Install the Extension Before we discover how to view the installed database extension, let us demonstrate how to install one for demonstration purposes. One popular PostgreSQL extension we can install is the “PostGIS” extension. PostGIS adds support for geographic objects and allows you to store, query, and manipulate a spatial data in our PostgreSQL database. Start by connecting to the PostgreSQL database using the “psql” command: psql -U username -d database_name Replace the username with your target PostgreSQL username and database_name with the database name to which you want to connect. Then, enter the username’s password on the prompt. The next step is to install the PostGIS which is already available in your database, and all you need to do is enable it. Run the following command: CREATE EXTENSION postgis; The command activates the PostGIS extension on the current database. After the installation, you can start using the PostGIS extension’s features. PostGIS provides a rich set of functions and operators to work with spatial data. Check the following resource to learn more. To check the installed extension version, run the following query: SELECT PostGIS_Full_Version(); This should return the installed PostGIS version. Before installing an extension, checking whether it is already available in your PostgreSQL installation is a good practice. 
For example, we can use the following command to list the available extensions: SELECT * FROM pg_available_extensions; To install an extension in a specific schema, we can specify the schema using the SCHEMA parameter: CREATE EXTENSION extension_name SCHEMA schema_name;
PostgreSQL: Show Installed Extensions
Once you have installed your desired extensions, you can list them from psql with the \dx meta-command or by querying the system catalog: SELECT * FROM pg_extension; Either approach shows information about the installed extensions, including the extension name, version, and schema. You can also view the extensions that are installed in a specific database using the pgAdmin 4 utility: log into pgAdmin and expand Servers -> PostgreSQL -> Target Database -> Extensions.
Conclusion
We explored the role of extensions in PostgreSQL, as well as how to install and view the installed extensions in a PostgreSQL server. View the full article
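The same checks can be scripted. A small psycopg2 sketch that lists what is available versus what is installed (connection parameters are placeholders):

    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="mydb", user="postgres", password="secret")

    with conn, conn.cursor() as cur:
        # Extensions shipped with this server that could be created.
        cur.execute("SELECT name, default_version FROM pg_available_extensions ORDER BY name")
        available = cur.fetchall()

        # Extensions actually installed in the current database.
        cur.execute("SELECT extname, extversion FROM pg_extension ORDER BY extname")
        installed = cur.fetchall()

    print(f"{len(available)} available, {len(installed)} installed")
    for name, version in installed:
        print(f"  {name} {version}")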
  16. PostgreSQL supports several character types: TEXT, VARCHAR(n), and CHAR(n). The TEXT data type holds values of unlimited variable length. VARCHAR(n) holds variable-length values that may not exceed the declared limit. CHAR(n), by contrast, has a fixed length and is blank padded. Understanding CHAR(n) and VARCHAR(n) is essential to understanding the BPCHAR data type. This tutorial discusses BPCHAR in detail: how CHAR(n) and VARCHAR(n) behave and where BPCHAR comes in. Let's begin!

What Is BPCHAR in PostgreSQL
When working with any database, you must specify the data type of the columns you create so the database knows what kind of values to expect in a given table. When you try to insert a value that does not match the expected data type, the database raises an error. In PostgreSQL, BPCHAR is the internal name of the CHAR type and stands for "blank padded character." It mainly comes into play when blank spaces are involved: values shorter than the declared width are padded with spaces rather than causing an error.
To understand BPCHAR in detail, we create a table and see how CHAR and VARCHAR handle different values and where BPCHAR comes into play. In our example, we create a simple table with a CHAR(n) column and a VARCHAR(n) column where "n" is 10, i.e., the number of characters each column can accommodate. Inspecting the table confirms that all columns and data types were created successfully.
Now, let's insert values into the columns. The values in this case are shorter than the declared limit of 10 characters for each data type. We can verify the length of the inserted values using length(column_name). In the output, both the CHAR and the VARCHAR column hold a value with fewer characters than the maximum specified when creating the table. Note that VARCHAR only uses the space the value actually requires, since it is a variable-length type, whereas CHAR is padded with spaces to the full declared width even when the inserted value needs fewer characters.
If we try to insert a value with more characters than the declared length of the CHAR column, we get an error. The same applies to VARCHAR: although its length is variable, it cannot accommodate a value that exceeds the declared size. The value we attempted to insert is longer than 10 characters, so Postgres throws an error and rejects it.
To see how BPCHAR works, let's insert spaces as values and observe how CHAR and VARCHAR treat them. When you insert spaces into a VARCHAR column, they are still treated as characters. Moreover, if the inserted spaces exceed the declared length, no error is thrown: the value is simply truncated to the maximum length and the excess spaces are discarded. In the following example, the value inserted into the VARCHAR column consists of more than ten spaces, yet the insertion succeeds. When we inspect the stored value, we see that only ten spaces were kept.
Unlike VARCHAR, the CHAR data type treats the spaces differently: it regards them as padding. It does not throw an error, but it does not preserve the blank characters as significant data either. Inspecting the table, we see that the CHAR column reports a length of zero for the all-space value. That is one aspect of BPCHAR at work: trailing blanks are treated as padding and are not counted as part of the value.
Going further with spaces, something different happens when we insert a value with spaces between characters. For VARCHAR, the following example shows that every embedded space counts as a character, so we get an error when the value exceeds the declared length of the type. The same applies to CHAR: although it treats trailing spaces as padding, spaces between other characters are regular characters, and an error is raised when the resulting value is too long.
What about spaces both between and after characters, and how does BPCHAR treat those? In the first example, we insert a value that is longer than the declared length, but because the excess characters at the end are all spaces, CHAR treats them as padding and no error is thrown. Inspecting the table, we see that only the leading characters and the spaces between them count toward the stored CHAR value. VARCHAR behaves differently: it keeps trailing spaces as significant characters, storing as many as fit within its declared size. Checking the database, we can see the difference in the reported lengths of the CHAR and VARCHAR values: for CHAR, only four characters are counted, whereas VARCHAR holds ten characters, including trailing spaces, with any space beyond the declared limit discarded.
These examples show how BPCHAR comes into play when working with space-padded values and CHAR; with VARCHAR, no BPCHAR effect is observed. A short sketch reproducing this behavior follows at the end of this post.

Conclusion
BPCHAR is the blank-padded character type behind CHAR. Depending on where the spaces appear in the inserted value, it treats them as padding that is not reflected in the stored length. We have seen BPCHAR in detail through the different examples presented in this post. Hopefully, you now understand how BPCHAR works and how it affects character values when working with PostgreSQL. View the full article
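As referenced above, here is a minimal sketch of the CHAR/VARCHAR behavior described in this post; the table and column names (bpchar_demo, c, v) and the sample strings are illustrative assumptions, not taken from the original article.

-- Fixed-width CHAR (internally "bpchar") vs. variable-length VARCHAR.
CREATE TABLE bpchar_demo (
    c CHAR(10),      -- blank-padded to 10 characters
    v VARCHAR(10)    -- variable length, at most 10 characters
);

-- Values within the limit work for both columns:
INSERT INTO bpchar_demo VALUES ('apple', 'apple');

-- Over-long values fail unless the excess characters are all spaces:
-- INSERT INTO bpchar_demo VALUES ('watermelons!', 'apple');   -- ERROR: value too long
INSERT INTO bpchar_demo VALUES ('kiwi             ', 'kiwi             ');  -- trailing spaces beyond 10 are silently truncated

-- Trailing blanks count for VARCHAR but are ignored when measuring CHAR:
SELECT length(c) AS char_len, length(v) AS varchar_len FROM bpchar_demo;
-- expected: 5 / 5 for the first row, 4 / 10 for the second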
  17. Great article on the fastest methods to load data into a PostgreSQL DB using Python: https://hakibenita.com/fast-load-data-python-postgresql
  18. Amazon Aurora now supports R6i instances powered by 3rd generation Intel Xeon Scalable processors. R6i instances are the 6th generation of Amazon EC2 memory optimized instances, designed for memory-intensive workloads. These instances are built on the AWS Nitro System, a combination of dedicated hardware and lightweight hypervisor, which delivers practically all of the compute and memory resources of the host hardware to your instances. R6i instances are currently available when using Amazon Aurora PostgreSQL-Compatible Edition. View the full article
  19. AWS Database Migration Service (AWS DMS) has expanded functionality by adding support for Babelfish for Aurora PostgreSQL as a target. Babelfish for Aurora PostgreSQL is a new translation layer for Amazon Aurora PostgreSQL-Compatible Edition that enables Aurora to understand commands from applications written for Microsoft SQL Server. Using AWS DMS, you can now perform full load migrations to Babelfish for Aurora PostgreSQL with minimal downtime. View the full article
  20. Amazon Aurora PostgreSQL-Compatible Edition now supports PostgreSQL major version 14 (14.3). PostgreSQL 14 includes performance improvements for parallel queries, heavily-concurrent workloads, partitioned tables, logical replication, and vacuuming. PostgreSQL 14 also improves functionality with new capabilities. For example, you can cancel long-running queries if a client disconnects and you can close idle sessions if they time out. Range types now support multiranges, allowing representation of non-contiguous data ranges, and stored procedures can now return data via OUT parameters. This release includes new features for Babelfish for Aurora PostgreSQL version 2.1. Please refer to Amazon Aurora PostgreSQL updates for more information. View the full article
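To make two of the PostgreSQL 14 capabilities mentioned above concrete, here is a small, hedged sketch of a multirange value and a procedure returning data through an OUT parameter; the names get_answer and answer are made up for the example, and the syntax targets community PostgreSQL 14.

-- Multirange: a single value holding several non-contiguous integer ranges.
SELECT '{[1,3), [10,20)}'::int4multirange @> 15 AS contains_fifteen;   -- true

-- Stored procedure returning data via an OUT parameter.
CREATE PROCEDURE get_answer(OUT answer integer)
LANGUAGE plpgsql
AS $$
BEGIN
    answer := 42;
END;
$$;

-- From SQL, pass NULL as a placeholder for the OUT argument:
CALL get_answer(NULL);   -- returns answer = 42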
  21. PostgreSQL is an open-source, object-relational database system. It is a powerful database that supports both relational and non-relational data types. The Boolean data type is a commonly used type that accepts three kinds of values: True, False, and NULL. Its short form is bool, and one byte is used to store a Boolean value. A True Boolean value can also be written as the string literals 'yes', 'y', 'true', and '1'; a False value as 'no', 'n', 'false', and '0'. Note that these alternatives are accepted as quoted string literals; a bare integer such as 0 or 1 is not implicitly converted to Boolean, as one of the examples below shows.

Prerequisites:
You have to install the latest version of the PostgreSQL packages on your Linux system before executing the SQL statements shown in this tutorial. Run the following commands to install and start PostgreSQL:
$ sudo apt-get -y install postgresql postgresql-contrib
$ sudo systemctl start postgresql.service
Run the following command to log in to PostgreSQL with root permission:
$ sudo -u postgres psql

Use of the Boolean data type in PostgreSQL tables:
Before creating any table with the Boolean data type, you have to create a PostgreSQL database. Run the following command to create a database named 'testdb':
# CREATE DATABASE testdb;

Example 1: Create a table using the Boolean data type
Create a table named 'technicians' in the current database with three fields: tech_id, an integer and the primary key of the table; name, a character field; and available, a Boolean field.
# CREATE TABLE technicians (
tech_id INT NOT NULL PRIMARY KEY,
name CHARACTER(10) NOT NULL,
available BOOLEAN NOT NULL );
Run the following command to insert a record into the technicians table where 'true' is used for the Boolean value:
# INSERT INTO technicians VALUES (1, 'Zafar Ali', 'true');
Run the following INSERT command, where the bare integer 0 is used for the Boolean value. An unquoted 0 is not acceptable as a Boolean value in PostgreSQL, so an error message appears indicating that the expression is of type integer, not boolean:
# INSERT INTO technicians VALUES (2, 'Kabir Hossain', 0);
Run the following INSERT command, where 'false' is used for the Boolean value:
# INSERT INTO technicians VALUES (3, 'Abir Hasan', 'false');
Run the following INSERT command, where 't' is used for the Boolean value:
# INSERT INTO technicians VALUES (5, 'Rebeka Ali', 't');

Example 2: Create a table using the BOOL data type
Create a table named 'products' in the current database with three fields: id, an integer and the primary key of the table; name, a character field; and physical_product, a BOOL field whose default value is 'true'.
# CREATE TABLE products (
id INT NOT NULL PRIMARY KEY,
name CHARACTER(10) NOT NULL,
physical_product BOOL NOT NULL DEFAULT 'true' );
Run the following command to insert a record into the products table where 'f' is used for the BOOL value:
# INSERT INTO products VALUES (1, 'Antivirus', 'f');
Run the following INSERT command to insert a record into the products table where no value is provided for the BOOL field. Here, the default value of the field is used:
# INSERT INTO products VALUES (2, 'Monitor');

Check the content of the tables:
Run the following SELECT query to retrieve all records from the technicians table:
# SELECT * FROM technicians;
Run the following SELECT query to retrieve the records from the technicians table where the value of the available field is 'false':
# SELECT * FROM technicians WHERE available = 'false';
Run the following SELECT query to retrieve the records from the technicians table where the value of the available field is 't' or 'true':
# SELECT * FROM technicians WHERE available = 't' OR available = 'true';
The first query returns all records of the table, the second returns the records where available is false, and the third returns the records where available is true.
Run the following SELECT query to retrieve all records from the products table:
# SELECT * FROM products;
Run the following SELECT query to retrieve the records from the products table where the value of the physical_product field is 'True':
# SELECT * FROM products WHERE physical_product = 'True';
The first query returns all records of the table, and the second returns the records where physical_product is true. A compact recap of the rows each query should return follows below.

Conclusion:
Different uses of the Boolean (BOOL) data type in PostgreSQL tables have been shown in this tutorial through multiple examples that clarify the purpose of using Boolean columns in PostgreSQL database tables. View the full article
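As a stand-in for the screenshots in the original post, here is a sketch of the rows each query above should return, assuming only the successful inserts shown in this tutorial (technicians 1, 3, 5 and products 1, 2); the blank padding of the CHARACTER(10) columns is omitted for readability.

SELECT * FROM technicians;
-- (1, 'Zafar Ali', t), (3, 'Abir Hasan', f), (5, 'Rebeka Ali', t)

SELECT * FROM technicians WHERE available = 'false';
-- (3, 'Abir Hasan', f)

SELECT * FROM technicians WHERE available = 't' OR available = 'true';
-- (1, 'Zafar Ali', t), (5, 'Rebeka Ali', t)

SELECT * FROM products;
-- (1, 'Antivirus', f), (2, 'Monitor', t)

SELECT * FROM products WHERE physical_product = 'True';
-- (2, 'Monitor', t)   -- the row that relied on the column default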
  22. Following the announcement of updates to the PostgreSQL database by the open source community, we have updated Amazon Aurora PostgreSQL-Compatible Edition to support PostgreSQL 13.7, 12.11, 11.16, and 10.21. These releases contain bug fixes and improvements by the PostgreSQL community. Refer to the Aurora version policy to help you to decide how often to upgrade and how to plan your upgrade process. View the full article
  23. Amazon Aurora PostgreSQL-Compatible Edition now supports the Large Objects (LO) module. The LO module provides support for managing Large Objects (also called LOs or BLOBs). View the full article
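As a quick illustration of what the LO module provides, here is a hedged sketch based on the standard PostgreSQL lo extension; the table and column names are made up, and availability of specific behavior on Aurora should be confirmed in the AWS documentation.

-- Enable the lo module and use its "lo" type plus the lo_manage trigger,
-- which unlinks the underlying large object when a row is deleted or the column value is replaced.
CREATE EXTENSION IF NOT EXISTS lo;

CREATE TABLE documents (
    id    integer PRIMARY KEY,
    image lo                    -- reference to a large object (BLOB)
);

CREATE TRIGGER t_image
    BEFORE UPDATE OR DELETE ON documents
    FOR EACH ROW
    EXECUTE FUNCTION lo_manage(image);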
  24. Amazon Aurora PostgreSQL-Compatible Edition now supports zero-downtime patching (ZDP). With ZDP, customers can upgrade to a new PostgreSQL version and apply patches to their Aurora cluster without any downtime. View the full article
  25. Recently, at Google I/O, we announced AlloyDB for PostgreSQL, a fully-managed, PostgreSQL-compatible database for demanding, enterprise-grade transactional and analytical workloads. Imagine PostgreSQL plus the best of the cloud: elastic storage and compute, intelligent caching, and AI/ML-powered management. Further, AlloyDB delivers unmatched price-performance: In our performance tests, it’s more than 4x faster on transactional workloads, and up to 100x faster on analytical queries than standard PostgreSQL, all with simple, predictable pricing. Designed for mission-critical applications, AlloyDB offers extensive data protection and an industry leading 99.99% availability SLA, inclusive of maintenance. Read Article