Jump to content

Search the Community

Showing results for tags 'postgresql'.

  • Search By Tags

    Type tags separated by commas.
  • Search By Author

Content Type


Forums

  • General
    • General Discussion
    • Artificial Intelligence
    • DevOps Forum News
  • DevOps & SRE
    • DevOps & SRE General Discussion
    • Databases, Data Engineering & Data Science
    • Development & Programming
    • CI/CD, GitOps, Orchestration & Scheduling
    • Docker, Containers, Microservices, Serverless & Virtualization
    • Infrastructure-as-Code
    • Kubernetes
    • Linux
    • Logging, Monitoring & Observability
    • Red Hat OpenShift
    • Security
  • Cloud Providers
    • Amazon Web Services
    • Google Cloud Platform
    • Microsoft Azure

Find results in...

Find results that contain...


Date Created

  • Start

    End


Last Updated

  • Start

    End


Filter by number of...

Joined

  • Start

    End


Group


Website URL


LinkedIn Profile URL


About Me


Cloud Platforms


Cloud Experience


Development Experience


Current Role


Skills


Certifications


Favourite Tools


Interests

  1. Finally RudderStack keys "Why they did not prefer Apache Kafka over PostgreSQL for building RudderStack?". Focuses on the challenges using Apache KafkaView the full article
  2. Amazon Relational Database Service (Amazon RDS) for PostgreSQL, MySQL, and MariaDB now support AWS Graviton3-based M7g and R7g database instances in US West (N. California), Asia Pacific (Hyderabad, Seoul), Canada (Central), Europe (London, Spain), and Middle East (Bahrain). Graviton3-based instances provide up to a 30% performance improvement and up to a 27% price/performance improvement (based on on-demand pricing) over Graviton2-based instances on RDS for open-source databases depending on database engine, version, and workload View the full article
  3. This article is part of a project that’s split into two main phases. The first phase focuses on building a data pipeline. This involves getting data from an API and storing it in a PostgreSQL database. In the second phase, we’ll develop an application that uses a language model to interact with this database. Ideal for those new to data systems or language model applications, this project is structured into two segments: This initial article guides you through constructing a data pipeline utilizing Kafka for streaming, Airflow for orchestration, Spark for data transformation, and PostgreSQL for storage. To set-up and run these tools we will use Docker.The second article, which will come later, will delve into creating agents using tools like LangChain to communicate with external databases.This first part project is ideal for beginners in data engineering, as well as for data scientists and machine learning engineers looking to deepen their knowledge of the entire data handling process. Using these data engineering tools firsthand is beneficial. It helps in refining the creation and expansion of machine learning models, ensuring they perform effectively in practical settings. This article focuses more on practical application rather than theoretical aspects of the tools discussed. For detailed understanding of how these tools work internally, there are many excellent resources available online. OverviewLet’s break down the data pipeline process step-by-step: Data Streaming: Initially, data is streamed from the API into a Kafka topic.Data Processing: A Spark job then takes over, consuming the data from the Kafka topic and transferring it to a PostgreSQL database.Scheduling with Airflow: Both the streaming task and the Spark job are orchestrated using Airflow. While in a real-world scenario, the Kafka producer would constantly listen to the API, for demonstration purposes, we’ll schedule the Kafka streaming task to run daily. Once the streaming is complete, the Spark job processes the data, making it ready for use by the LLM application.All of these tools will be built and run using docker, and more specifically docker-compose. Overview of the data pipeline. Image by the author.Now that we have a blueprint of our pipeline, let’s dive into the technical details ! Local setupFirst you can clone the Github repo on your local machine using the following command: git clone https://github.com/HamzaG737/data-engineering-project.gitHere is the overall structure of the project: ├── LICENSE ├── README.md ├── airflow │ ├── Dockerfile │ ├── __init__.py │ └── dags │ ├── __init__.py │ └── dag_kafka_spark.py ├── data │ └── last_processed.json ├── docker-compose-airflow.yaml ├── docker-compose.yml ├── kafka ├── requirements.txt ├── spark │ └── Dockerfile └── src ├── __init__.py ├── constants.py ├── kafka_client │ ├── __init__.py │ └── kafka_stream_data.py └── spark_pgsql └── spark_streaming.pyThe airflow directory contains a custom Dockerfile for setting up airflow and a dags directory to create and schedule the tasks.The data directory contains the last_processed.json file which is crucial for the Kafka streaming task. Further details on its role will be provided in the Kafka section.The docker-compose-airflow.yaml file defines all the services required to run airflow.The docker-compose.yaml file specifies the Kafka services and includes a docker-proxy. This proxy is essential for executing Spark jobs through a docker-operator in Airflow, a concept that will be elaborated on later.The spark directory contains a custom Dockerfile for spark setup.src contains the python modules needed to run the application.To set up your local development environment, start by installing the required Python packages. The only essential package is psycopg2-binary. You have the option to install just this package or all the packages listed in the requirements.txt file. To install all packages, use the following command: pip install -r requirements.txtNext let’s dive step by step into the project details. About the APIThe API is RappelConso from the French public services. It gives access to data relating to recalls of products declared by professionals in France. The data is in French and it contains initially 31 columns (or fields). Some of the most important are: reference_fiche (reference sheet): Unique identifier of the recalled product. It will act as the primary key of our Postgres database later.categorie_de_produit (Product category): For instance food, electrical appliance, tools, transport means, etc …sous_categorie_de_produit (Product sub-category): For instance we can have meat, dairy products, cereals as sub-categories for the food category.motif_de_rappel (Reason for recall): Self explanatory and one of the most important fields.date_de_publication which translates to the publication date.risques_encourus_par_le_consommateur which contains the risks that the consumer may encounter when using the product.There are also several fields that correspond to different links, such as link to product image, link to the distributers list, etc..You can see some examples and query manually the dataset records using this link. We refined the data columns in a few key ways: Columns like ndeg_de_version and rappelguid, which were part of a versioning system, have been removed as they aren’t needed for our project.We combined columns that deal with consumer risks — risques_encourus_par_le_consommateur and description_complementaire_du_risque — for a clearer overview of product risks.The date_debut_fin_de_commercialisation column, which indicates the marketing period, has been divided into two separate columns. This split allows for easier queries about the start or end of a product’s marketing.We’ve removed accents from all columns except for links, reference numbers, and dates. This is important because some text processing tools struggle with accented characters.For a detailed look at these changes, check out our transformation script at src/kafka_client/transformations.py. The updated list of columns is available insrc/constants.py under DB_FIELDS. Kafka streamingTo avoid sending all the data from the API each time we run the streaming task, we define a local json file that contains the last publication date of the latest streaming. Then we will use this date as the starting date for our new streaming task. To give an example, suppose that the latest recalled product has a publication date of 22 november 2023. If we make the hypothesis that all of the recalled products infos before this date are already persisted in our Postgres database, We can now stream the data starting from the 22 november. Note that there is an overlap because we may have a scenario where we didn’t handle all of the data of the 22nd of November. The file is saved in ./data/last_processed.json and has this format: {last_processed:"2023-11-22"}By default the file is an empty json which means that our first streaming task will process all of the API records which are 10 000 approximately. Note that in a production setting this approach of storing the last processed date in a local file is not viable and other approaches involving an external database or an object storage service may be more suitable. The code for the kafka streaming can be found on ./src/kafka_client/kafka_stream_data.py and it involves primarily querying the data from the API, making the transformations, removing potential duplicates, updating the last publication date and serving the data using the kafka producer. The next step is to run the kafka service defined the docker-compose defined below: version: '3' services: kafka: image: 'bitnami/kafka:latest' ports: - '9094:9094' networks: - airflow-kafka environment: - KAFKA_CFG_NODE_ID=0 - KAFKA_CFG_PROCESS_ROLES=controller,broker - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093,EXTERNAL://:9094 - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092,EXTERNAL://localhost:9094 - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,EXTERNAL:PLAINTEXT,PLAINTEXT:PLAINTEXT - KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka:9093 - KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER volumes: - ./kafka:/bitnami/kafka kafka-ui: container_name: kafka-ui-1 image: provectuslabs/kafka-ui:latest ports: - 8800:8080 depends_on: - kafka environment: KAFKA_CLUSTERS_0_NAME: local KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: PLAINTEXT://kafka:9092 DYNAMIC_CONFIG_ENABLED: 'true' networks: - airflow-kafka networks: airflow-kafka: external: trueThe key highlights from this file are: The kafka service uses a base image bitnami/kafka.We configure the service with only one broker which is enough for our small project. A Kafka broker is responsible for receiving messages from producers (which are the sources of data), storing these messages, and delivering them to consumers (which are the sinks or end-users of the data). The broker listens to port 9092 for internal communication within the cluster and port 9094 for external communication, allowing clients outside the Docker network to connect to the Kafka broker.In the volumes part, we map the local directory kafka to the docker container directory /bitnami/kafka to ensure data persistence and a possible inspection of Kafka’s data from the host system.We set-up the service kafka-ui that uses the docker image provectuslabs/kafka-ui:latest . This provides a user interface to interact with the Kafka cluster. This is especially useful for monitoring and managing Kafka topics and messages.To ensure communication between kafka and airflow which will be run as an external service, we will use an external network airflow-kafka.Before running the kafka service, let’s create the airflow-kafka network using the following command: docker network create airflow-kafkaNow everything is set to finally start our kafka service docker-compose up After the services start, visit the kafka-ui at http://localhost:8800/. Normally you should get something like this: Overview of the Kafka UI. Image by the author.Next we will create our topic that will contain the API messages. Click on Topics on the left and then Add a topic at the top left. Our topic will be called rappel_conso and since we have only one broker we set the replication factor to 1. We will also set the partitions number to 1 since we will have only one consumer thread at a time so we won’t need any parallelism. Finally, we can set the time to retain data to a small number like one hour since we will run the spark job right after the kafka streaming task, so we won’t need to retain the data for a long time in the kafka topic. Postgres set-upBefore setting-up our spark and airflow configurations, let’s create the Postgres database that will persist our API data. I used the pgadmin 4 tool for this task, however any other Postgres development platform can do the job. To install postgres and pgadmin, visit this link https://www.postgresql.org/download/ and get the packages following your operating system. Then when installing postgres, you need to setup a password that we will need later to connect to the database from the spark environment. You can also leave the port at 5432. If your installation has succeeded, you can start pgadmin and you should observe something like this window: Overview of pgAdmin interface. Image by the author.Since we have a lot of columns for the table we want to create, we chose to create the table and add its columns with a script using psycopg2, a PostgreSQL database adapter for Python. You can run the script with the command: python scripts/create_table.pyNote that in the script I saved the postgres password as environment variable and name it POSTGRES_PASSWORD. So if you use another method to access the password you need to modify the script accordingly. Spark Set-upHaving set-up our Postgres database, let’s delve into the details of the spark job. The goal is to stream the data from the Kafka topic rappel_conso to the Postgres table rappel_conso_table. from pyspark.sql import SparkSession from pyspark.sql.types import ( StructType, StructField, StringType, ) from pyspark.sql.functions import from_json, col from src.constants import POSTGRES_URL, POSTGRES_PROPERTIES, DB_FIELDS import logging logging.basicConfig( level=logging.INFO, format="%(asctime)s:%(funcName)s:%(levelname)s:%(message)s" ) def create_spark_session() -> SparkSession: spark = ( SparkSession.builder.appName("PostgreSQL Connection with PySpark") .config( "spark.jars.packages", "org.postgresql:postgresql:42.5.4,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0", ) .getOrCreate() ) logging.info("Spark session created successfully") return spark def create_initial_dataframe(spark_session): """ Reads the streaming data and creates the initial dataframe accordingly. """ try: # Gets the streaming data from topic random_names df = ( spark_session.readStream.format("kafka") .option("kafka.bootstrap.servers", "kafka:9092") .option("subscribe", "rappel_conso") .option("startingOffsets", "earliest") .load() ) logging.info("Initial dataframe created successfully") except Exception as e: logging.warning(f"Initial dataframe couldn't be created due to exception: {e}") raise return df def create_final_dataframe(df): """ Modifies the initial dataframe, and creates the final dataframe. """ schema = StructType( [StructField(field_name, StringType(), True) for field_name in DB_FIELDS] ) df_out = ( df.selectExpr("CAST(value AS STRING)") .select(from_json(col("value"), schema).alias("data")) .select("data.*") ) return df_out def start_streaming(df_parsed, spark): """ Starts the streaming to table spark_streaming.rappel_conso in postgres """ # Read existing data from PostgreSQL existing_data_df = spark.read.jdbc( POSTGRES_URL, "rappel_conso", properties=POSTGRES_PROPERTIES ) unique_column = "reference_fiche" logging.info("Start streaming ...") query = df_parsed.writeStream.foreachBatch( lambda batch_df, _: ( batch_df.join( existing_data_df, batch_df[unique_column] == existing_data_df[unique_column], "leftanti" ) .write.jdbc( POSTGRES_URL, "rappel_conso", "append", properties=POSTGRES_PROPERTIES ) ) ).trigger(once=True) \ .start() return query.awaitTermination() def write_to_postgres(): spark = create_spark_session() df = create_initial_dataframe(spark) df_final = create_final_dataframe(df) start_streaming(df_final, spark=spark) if __name__ == "__main__": write_to_postgres()Let’s break down the key highlights and functionalities of the spark job: First we create the Spark sessiondef create_spark_session() -> SparkSession: spark = ( SparkSession.builder.appName("PostgreSQL Connection with PySpark") .config( "spark.jars.packages", "org.postgresql:postgresql:42.5.4,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0", ) .getOrCreate() ) logging.info("Spark session created successfully") return spark2. The create_initial_dataframe function ingests streaming data from the Kafka topic using Spark's structured streaming. def create_initial_dataframe(spark_session): """ Reads the streaming data and creates the initial dataframe accordingly. """ try: # Gets the streaming data from topic random_names df = ( spark_session.readStream.format("kafka") .option("kafka.bootstrap.servers", "kafka:9092") .option("subscribe", "rappel_conso") .option("startingOffsets", "earliest") .load() ) logging.info("Initial dataframe created successfully") except Exception as e: logging.warning(f"Initial dataframe couldn't be created due to exception: {e}") raise return df3. Once the data is ingested, create_final_dataframe transforms it. It applies a schema (defined by the columns DB_FIELDS) to the incoming JSON data, ensuring that the data is structured and ready for further processing. def create_final_dataframe(df): """ Modifies the initial dataframe, and creates the final dataframe. """ schema = StructType( [StructField(field_name, StringType(), True) for field_name in DB_FIELDS] ) df_out = ( df.selectExpr("CAST(value AS STRING)") .select(from_json(col("value"), schema).alias("data")) .select("data.*") ) return df_out4. The start_streaming function reads existing data from the database, compares it with the incoming stream, and appends new records. def start_streaming(df_parsed, spark): """ Starts the streaming to table spark_streaming.rappel_conso in postgres """ # Read existing data from PostgreSQL existing_data_df = spark.read.jdbc( POSTGRES_URL, "rappel_conso", properties=POSTGRES_PROPERTIES ) unique_column = "reference_fiche" logging.info("Start streaming ...") query = df_parsed.writeStream.foreachBatch( lambda batch_df, _: ( batch_df.join( existing_data_df, batch_df[unique_column] == existing_data_df[unique_column], "leftanti" ) .write.jdbc( POSTGRES_URL, "rappel_conso", "append", properties=POSTGRES_PROPERTIES ) ) ).trigger(once=True) \ .start() return query.awaitTermination()The complete code for the Spark job is in the file src/spark_pgsql/spark_streaming.py. We will use the Airflow DockerOperator to run this job, as explained in the upcoming section. Let’s go through the process of creating the Docker image we need to run our Spark job. Here’s the Dockerfile for reference: FROM bitnami/spark:latest WORKDIR /opt/bitnami/spark RUN pip install py4j COPY ./src/spark_pgsql/spark_streaming.py ./spark_streaming.py COPY ./src/constants.py ./src/constants.py ENV POSTGRES_DOCKER_USER=host.docker.internal ARG POSTGRES_PASSWORD ENV POSTGRES_PASSWORD=$POSTGRES_PASSWORDIn this Dockerfile, we start with the bitnami/spark image as our base. It's a ready-to-use Spark image. We then install py4j, a tool needed for Spark to work with Python. The environment variables POSTGRES_DOCKER_USER and POSTGRES_PASSWORD are set up for connecting to a PostgreSQL database. Since our database is on the host machine, we use host.docker.internal as the user. This allows our Docker container to access services on the host, in this case, the PostgreSQL database. The password for PostgreSQL is passed as a build argument, so it's not hard-coded into the image. It’s important to note that this approach, especially passing the database password at build time, might not be secure for production environments. It could potentially expose sensitive information. In such cases, more secure methods like Docker BuildKit should be considered. Now, let’s build the Docker image for Spark: docker build -f spark/Dockerfile -t rappel-conso/spark:latest --build-arg POSTGRES_PASSWORD=$POSTGRES_PASSWORD .This command will build the image rappel-conso/spark:latest . This image includes everything needed to run our Spark job and will be used by Airflow’s DockerOperator to execute the job. Remember to replace $POSTGRES_PASSWORD with your actual PostgreSQL password when running this command. AirflowAs said earlier, Apache Airflow serves as the orchestration tool in the data pipeline. It is responsible for scheduling and managing the workflow of the tasks, ensuring they are executed in a specified order and under defined conditions. In our system, Airflow is used to automate the data flow from streaming with Kafka to processing with Spark. Airflow DAGLet’s take a look at the Directed Acyclic Graph (DAG) that will outline the sequence and dependencies of tasks, enabling Airflow to manage their execution. start_date = datetime.today() - timedelta(days=1) default_args = { "owner": "airflow", "start_date": start_date, "retries": 1, # number of retries before failing the task "retry_delay": timedelta(seconds=5), } with DAG( dag_id="kafka_spark_dag", default_args=default_args, schedule_interval=timedelta(days=1), catchup=False, ) as dag: kafka_stream_task = PythonOperator( task_id="kafka_data_stream", python_callable=stream, dag=dag, ) spark_stream_task = DockerOperator( task_id="pyspark_consumer", image="rappel-conso/spark:latest", api_version="auto", auto_remove=True, command="./bin/spark-submit --master local[*] --packages org.postgresql:postgresql:42.5.4,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 ./spark_streaming.py", docker_url='tcp://docker-proxy:2375', environment={'SPARK_LOCAL_HOSTNAME': 'localhost'}, network_mode="airflow-kafka", dag=dag, ) kafka_stream_task >> spark_stream_taskHere are the key elements from this configuration The tasks are set to execute daily.The first task is the Kafka Stream Task. It is implemented using the PythonOperator to run the Kafka streaming function. This task streams data from the RappelConso API into a Kafka topic, initiating the data processing workflow.The downstream task is the Spark Stream Task. It uses the DockerOperator for execution. It runs a Docker container with our custom Spark image, tasked with processing the data received from Kafka.The tasks are arranged sequentially, where the Kafka streaming task precedes the Spark processing task. This order is crucial to ensure that data is first streamed and loaded into Kafka before being processed by Spark.About the DockerOperatorUsing docker operator allow us to run docker-containers that correspond to our tasks. The main advantage of this approach is easier package management, better isolation and enhanced testability. We will demonstrate the use of this operator with the spark streaming task. Here are some key details about the docker operator for the spark streaming task: We will use the image rappel-conso/spark:latest specified in the Spark Set-up section.The command will run the Spark submit command inside the container, specifying the master as local, including necessary packages for PostgreSQL and Kafka integration, and pointing to the spark_streaming.py script that contains the logic for the Spark job.docker_url represents the url of the host running the docker daemon. The natural solution is to set it as unix://var/run/docker.sock and to mount the var/run/docker.sock in the airflow docker container. One problem we had with this approach is a permission error to use the socket file inside the airflow container. A common workaround, changing permissions with chmod 777 var/run/docker.sock, poses significant security risks. To circumvent this, we implemented a more secure solution using bobrik/socat as a docker-proxy. This proxy, defined in a Docker Compose service, listens on TCP port 2375 and forwards requests to the Docker socket: docker-proxy: image: bobrik/socat command: "TCP4-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock" ports: - "2376:2375" volumes: - /var/run/docker.sock:/var/run/docker.sock networks: - airflow-kafkaIn the DockerOperator, we can access the host docker /var/run/docker.sock via thetcp://docker-proxy:2375 url, as described here and here. Finally we set the network mode to airflow-kafka. This allows us to use the same network as the proxy and the docker running kafka. This is crucial since the spark job will consume the data from the kafka topic so we must ensure that both containers are able to communicate.After defining the logic of our DAG, let’s understand now the airflow services configuration in the docker-compose-airflow.yaml file. Airflow ConfigurationThe compose file for airflow was adapted from the official apache airflow docker-compose file. You can have a look at the original file by visiting this link. As pointed out by this article, this proposed version of airflow is highly resource-intensive mainly because the core-executor is set to CeleryExecutor that is more adapted for distributed and large-scale data processing tasks. Since we have a small workload, using a single-noded LocalExecutor is enough. Here is an overview of the changes we made on the docker-compose configuration of airflow: We set the environment variable AIRFLOW__CORE__EXECUTOR to LocalExecutor.We removed the services airflow-worker and flower because they only work for the Celery executor. We also removed the redis caching service since it works as a backend for celery. We also won’t use the airflow-triggerer so we remove it too.We replaced the base image ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.7.3} for the remaining services, mainly the scheduler and the webserver, by a custom image that we will build when running the docker-compose.version: '3.8' x-airflow-common: &airflow-common build: context: . dockerfile: ./airflow_resources/Dockerfile image: de-project/airflow:latestWe mounted the necessary volumes that are needed by airflow. AIRFLOW_PROJ_DIR designates the airflow project directory that we will define later. We also set the network as airflow-kafka to be able to communicate with the kafka boostrap servers.volumes: - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config - ./src:/opt/airflow/dags/src - ./data/last_processed.json:/opt/airflow/data/last_processed.json user: "${AIRFLOW_UID:-50000}:0" networks: - airflow-kafkaNext, we need to create some environment variables that will be used by docker-compose: echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_PROJ_DIR=\"./airflow_resources\"" > .envWhere AIRFLOW_UID represents the User ID in Airflow containers and AIRFLOW_PROJ_DIR represents the airflow project directory. Now everything is set-up to run your airflow service. You can start it with this command: docker compose -f docker-compose-airflow.yaml upThen to access the airflow user interface you can visit this url http://localhost:8080 . Sign-in window on Airflow. Image by the author.By default, the username and password are airflow for both. After signing in, you will see a list of Dags that come with airflow. Look for the dag of our project kafka_spark_dag and click on it. Overview of the task window in airflow. Image by the author.You can start the task by clicking on the button next to DAG: kafka_spark_dag. Next, you can check the status of your tasks in the Graph tab. A task is done when it turns green. So, when everything is finished, it should look something like this: Image by the author.To verify that the rappel_conso_table is filled with data, use the following SQL query in the pgAdmin Query Tool: SELECT count(*) FROM rappel_conso_tableWhen I ran this in January 2024, the query returned a total of 10022 rows. Your results should be around this number as well. ConclusionThis article has successfully demonstrated the steps to build a basic yet functional data engineering pipeline using Kafka, Airflow, Spark, PostgreSQL, and Docker. Aimed primarily at beginners and those new to the field of data engineering, it provides a hands-on approach to understanding and implementing key concepts in data streaming, processing, and storage. Throughout this guide, we’ve covered each component of the pipeline in detail, from setting up Kafka for data streaming to using Airflow for task orchestration, and from processing data with Spark to storing it in PostgreSQL. The use of Docker throughout the project simplifies the setup and ensures consistency across different environments. It’s important to note that while this setup is ideal for learning and small-scale projects, scaling it for production use would require additional considerations, especially in terms of security and performance optimization. Future enhancements could include integrating more advanced data processing techniques, exploring real-time analytics, or even expanding the pipeline to incorporate more complex data sources. In essence, this project serves as a practical starting point for those looking to get their hands dirty with data engineering. It lays the groundwork for understanding the basics, providing a solid foundation for further exploration in the field. In the second part, we’ll explore how to effectively use the data stored in our PostgreSQL database. We’ll introduce agents powered by Large Language Models (LLMs) and a variety of tools that enable us to interact with the database using natural language queries. So, stay tuned ! To reach outLinkedIn : https://www.linkedin.com/in/hamza-gharbi-043045151/Twitter : https://twitter.com/HamzaGh25079790End-to-End Data Engineering System on Real Data with Kafka, Spark, Airflow, Postgres, and Docker was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story. View the full article
  4. Amazon Relational Database Service (Amazon RDS) now supports a Dedicated Log Volume for PostgreSQL, MySQL, and MariaDB databases. An Amazon RDS Dedicated Log Volume allows customers to select a configuration where the most latency sensitive components of their database, the transaction logs, are stored in a separate, dedicated volume. Dedicated Log Volumes work with Provisioned IOPS storage and are recommended for databases with 5,000 GiB or more of allocated storage. View the full article
  5. Amazon Relational Database Service (Amazon RDS) now supports M6in, M6idn, R6in, and R6idn database (DB) instances for RDS for PostgreSQL, MySQL, and MariaDB. These network optimized DB instances deliver up to 200Gbps network bandwidth, which is 300% more than similar sized M6i and R6i database instances. Enhanced network bandwidth makes M6in and R6in DB instances ideal for write-intensive workloads. M6idn and R6idn support local block storage with up to 7.6 TB of NVMe-based solid state disk (SSD) storage. View the full article
  6. Today, AWS announces the general availability of pgactive: Active-active Replication Extension for PostgreSQL, available for Amazon Relational Database Service (RDS) for PostgreSQL. pgactive lets you use asynchronous active-active replication for streaming data between database instances to provide additional resiliency and flexibility in moving data between database instances, including writers located in different AWS Regions, for the purposes of maintaining availability for operations like switching write traffic to a different instance. View the full article
  7. Amazon Relational Database Service (RDS) for PostgreSQL now supports v0.5.0 of the pgvector extension to store embeddings from machine learning (ML) models in your database and to perform efficient similarity searches. This version of the extension introduces pgvector introduces HNSW indexing support, parallelization of ivfflat index builds, and improves performance of its distance functions. View the full article
  8. Amazon Relational Database Service (Amazon RDS) for PostgreSQL 16.0 is now available in the Amazon RDS Database Preview Environment, allowing you to evaluate PostgreSQL 16.0 on Amazon RDS for PostgreSQL. You can deploy PostgreSQL 16.0 in the Preview Environment and have the same benefits of a fully managed database, making it simpler to set up, operate, and monitor databases. PostgreSQL 16.0 in the Preview Environment also includes support for logical decoding on read replicas, AWS libcrypto (AWS-LC), and over 80 PostgreSQL extensions such as pgvector, pg_tle, h3-pg, pg_cron, and rdkit. View the full article
  9. Amazon Relational Database Service (Amazon RDS) PostgreSQL Multi-AZ Deployments with two readable standbys now supports major version upgrades. Starting today, you can upgrade your RDS for PostgreSQL Multi-AZ Deployments with two readable standbys from major version 13.4 and above and 14.5 and above to 15.4 with just a few clicks on the AWS Management Console. View the full article
  10. PostgreSQL is a powerful, free and open-source relational database management system. It is a performance-oriented and versatile relational database that offers various features and extensibility options. Despite all the features and tools that are offered natively by the PostgreSQL server, it allows us to add additional functionality and features through the use of extensions. PostgreSQL extensions refer to a set of external modules or libraries that are used to enhance the functionality of the database engines by providing extra features or data types. This makes the PostgreSQL database encapsulate a more comprehensive range of requirements. This tutorial demonstrates how you can view the installed extensions in a PostgreSQL server using various commands and methods. Prerequisites: To follow this tutorial, you should have the following: Installed and configured PostgreSQL on your system Basic understanding of SQL and PostgreSQL Purpose of PostgreSQL Extensions PostgreSQL extensions serve multiple purposes including: Extending Database Functionality – Extensions provide additional features, data types, or functionalities that are unavailable by default. In addition, they allow us to add specialized capabilities to the database to meet the specific requirements. Simplifying Development – Extensions can streamline the development process by providing pre-packaged functionality that saves time and effort. Instead of building custom solutions, we can leverage the existing extensions to add the advanced application features. Enhancing Database Performance – Some extensions optimize and improve the performance of PostgreSQL databases. For example, they can introduce the advanced indexing techniques, query optimization, or caching mechanisms to speed up the queries and overall database operations. The role of an extension heavily determines the intended purpose. Install the Extension Before we discover how to view the installed database extension, let us demonstrate how to install one for demonstration purposes. One popular PostgreSQL extension we can install is the “PostGIS” extension. PostGIS adds support for geographic objects and allows you to store, query, and manipulate a spatial data in our PostgreSQL database. Start by connecting to the PostgreSQL database using the “psql” command: psql -U username -d database_name Replace the username with your target PostgreSQL username and database_name with the database name to which you want to connect. Then, enter the username’s password on the prompt. The next step is to install the PostGIS which is already available in your database, and all you need to do is enable it. Run the following command: CREATE EXTENSION postgis; The command activates the PostGIS extension on the current database. After the installation, you can start using the PostGIS extension’s features. PostGIS provides a rich set of functions and operators to work with spatial data. Check the following resource to learn more. To check the installed extension version, run the following query: SELECT PostGIS_Full_Version(); This should return the installed PostGIS version. Before installing an extension, checking whether it is already available in your PostgreSQL installation is a good practice. For example, we can use the following command to list the available extensions: SELECT * FROM pg_available_extensions; To install an extension in a specific schema, we can specify the schema using the SCHEMA parameter: CREATE EXTENSION extension_name SCHEMA schema_name; PostgreSQL Show Extensions Once you installed your desired extensions, you can show them using the SHOW EXTENSIONS command: SHOW EXTENSION The given command displays a table with information about the installed extensions. You can also view the extensions that are installed in a specific database using the PgAdmin 4 utility. Log into PgAdmin and expand to Servers -> PostgreSQL -> Target Database – Extensions. Conclusion We explored the role of extensions in PostgreSQL. We also explored how to install and view the installed extensions in PostgreSQL server. View the full article
  11. PostgreSQL supports various character types: TEXT, VARCHAR(n), and CHAR(n). The TEXT data type takes an unlimited variable length. VARCHAR(n) takes the values with a variable length without exceeding the set length limit. As for CHAR(n), it takes a fixed length and is blank padded. Understanding CHAR(n) and VARCHAR(n) is essential in understanding the BPCHAR data type. This tutorial discusses everything about BPCHAR. We will understand how CHAR(n) and VARCHAR(n) work and where BPCHAR comes in. Let’s begin! What Is BPCHAR in PostgreSQL When working with any database, you must specify the data type of the columns that you create. That way, the database knows what type of values to expect for a given table. Hence, when you try to insert a value that doesn’t match the expected data type, the database raises an error. In the case of PostgreSQL, BPCHAR is a data type that stands for “blank padded character.” It is associated with the CHAR data type and mainly comes into play when we try to add blank spaces as values. In such a case, the empty spaces are blank-padded to avoid raising any error. To understand BPCHAR in detail, we create a table and see how CHAR and VARCHAR handle different values and where BPCHAR comes into play. In our example, we create a simple table that takes CHAR(n) and VARCHAR(n) where “n” is 10. Here, “n” is the number of characters that can be accommodated. If we inspect our table, we can confirm that all data types and columns are successfully created. Now, let’s start by inserting values into our columns. The values, in this case, have a character length that are less than the specified character length of 10 for each data type. We can verify the length of the inserted values using the length (column-name) option. In the following output, CHAR and VARCHAR have a value with fewer characters than the maximum specified length when creating the table. You should note that the VARCHAR data type only uses the required length since it takes a variable length. However, CHAR takes up all the fixed length regardless of whether the inserted value requires fewer characters. Suppose we try to insert a value that takes more characters than the length of the CHAR data type. We get the following error: The same applies to VARCHAR. Although it has a variable length, it can’t accommodate any value that exceeds the size of the data type. The value that we attempted to insert is larger than 10 characters. Hence, Postgres throws an error which blocks us from inserting the value. To comprehend how BPCHAR works, let’s have an example of inserting the spaces as values and see how CHAR and VARCHAR will treat either case. When you insert the spaces for VARCHAR, it still treats them as characters. Besides, when you insert the spaces that exceed the variable length, it won’t throw any error. But it only takes the maximum length of characters and ignores the rest. In the following example, the value that we insert for VARCHAR is spaces which exceeds to ten characters. The insertion proceeds successfully. When we inspect the inserted value, we see that only ten spaces are captured. Unlike VARCHAR, the CHAR data type treats the spaces differently. It takes them as space-padded characters. Thus, it won’t throw an error but won’t save the blank characters either. Inspecting our table, we see that the CHAR column saved no space, and the length is zero. That’s one element of BPCHAR at work. It treats the blank spaces as padded values. Hence, it doesn’t save them in the database. Going further with spaces, something different happens when we try to insert a value with spaces between characters. For VARCHAR, we can see from the following example that every space is counted as a character. Thus, we get an error that our value is larger than the fixed length for the data type. The same applies to CHAR. Even though it treats the spaces as blank-padded values earlier, when we add them between other characters, it treats them as characters and raises an error when we try adding a longer character length. How about adding spaces between and after characters? How does BPCHAR treat it? In the first example, we added a character that is longer than the required variable length. However, since the last character is spaces, CHAR treats them as blank-padded values and doesn’t throw any error. Inspecting our table, we can see that only the first characters and the spaces between them are accounted for as CHAR values. However, we get a different output if we attempt the same for VARCHAR. It still accounts for the spaces after the characters and records the maximum spaces to fit its fixed-length size. Checking our database, we can note the difference in the length of the characters that are stored as CHAR and VARCHAR. For CHAR, only four characters are stored in the table. However, VARCHAR captured ten characters, including spaces, and ignored the space that exceeded the set fixed length. These examples show how BPCHAR comes into play when working with space-padded values with CHAR. As for VARCHAR, no BPCHAR effect is encountered. Conclusion BPCHAR is a data type that applies when working with CHAR. It makes CHAR treat the spaces as space-padded values that can’t be stored in the table, depending on the location of the space in the value that is being inserted. We’ve seen BPCHAR in detail through the different examples that are presented in this post. Hopefully, you now understand how BPCHAR works and how it affects the characters when working with PostgreSQL. View the full article
  12. Great article on the fastest methods to load data into a PostgreSQL DB using Python; https://hakibenita.com/fast-load-data-python-postgresql
  13. Amazon Aurora now supports R6i instances powered by 3rd generation Intel Xeon Scalable processors. R6i instances are the 6th generation of Amazon EC2 memory optimized instances, designed for memory-intensive workloads. These instances are built on the AWS Nitro System, a combination of dedicated hardware and lightweight hypervisor, which delivers practically all of the compute and memory resources of the host hardware to your instances. R6i instances are currently available when using Amazon Aurora PostgreSQL-Compatible Edition. View the full article
  14. AWS Database Migration Service (AWS DMS) has expanded functionality by adding support for Babelfish for Aurora PostgreSQL as a target. Babelfish for Aurora PostgreSQL is a new translation layer for Amazon Aurora PostgreSQL-Compatible Edition that enables Aurora to understand commands from applications written for Microsoft SQL Server. Using AWS DMS, you can now perform full load migrations to Babelfish for Aurora PostgreSQL with minimal downtime. View the full article
  15. Amazon Aurora PostgreSQL-Compatible Edition now supports PostgreSQL major version 14 (14.3). PostgreSQL 14 includes performance improvements for parallel queries, heavily-concurrent workloads, partitioned tables, logical replication, and vacuuming. PostgreSQL 14 also improves functionality with new capabilities. For example, you can cancel long-running queries if a client disconnects and you can close idle sessions if they time out. Range types now support multiranges, allowing representation of non-contiguous data ranges, and stored procedures can now return data via OUT parameters. This release includes new features for Babelfish for Aurora PostgreSQL version 2.1. Please refer to Amazon Aurora PostgreSQL updates for more information. View the full article
  16. Following the announcement of updates to the PostgreSQL database by the open source community, we have updated Amazon Aurora PostgreSQL-Compatible Edition to support PostgreSQL 13.7, 12.11, 11.16, and 10.21. These releases contain bug fixes and improvements by the PostgreSQL community. Refer to the Aurora version policy to help you to decide how often to upgrade and how to plan your upgrade process. View the full article
  17. Amazon Aurora PostgreSQL-Compatible Edition now supports the Large Objects (LO) module. The LO module provides support for managing Large Objects (also called LOs or BLOBs). View the full article
  18. Amazon Aurora PostgreSQL-compatible edition now supports zero-downtime patching (ZDP). With ZDP, customers can upgrade to a new PostgreSQL version and apply patches to their Aurora cluster without any downtime. View the full article
  19. Recently, at Google I/O, we announced AlloyDB for PostgreSQL, a fully-managed, PostgreSQL-compatible database for demanding, enterprise-grade transactional and analytical workloads. Imagine PostgreSQL plus the best of the cloud: elastic storage and compute, intelligent caching, and AI/ML-powered management. Further, AlloyDB delivers unmatched price-performance: In our performance tests, it’s more than 4x faster on transactional workloads, and up to 100x faster on analytical queries than standard PostgreSQL, all with simple, predictable pricing. Designed for mission-critical applications, AlloyDB offers extensive data protection and an industry leading 99.99% availability SLA, inclusive of maintenance. Read Article
  20. Releases CloudNativePG to Power Modern Cloud Applications with Unprecedented Flexibility Bedford, Mass. May 16, 2022 – EDB, the world leader in accelerating Postgres in the enterprise, announced the release of CloudNativePG, the first open source operator optimized for Kubernetes, and the submission of a Sandbox application to the Cloud Native Computing Foundation (CNCF). Built on […] The post EDB Accelerates Innovation with World’s First Openly Governed Postgres Operator for Kubernetes appeared first on DevOps.com. View the full article
  21. Enterprises are struggling to free themselves from legacy database systems, and need an alternative option to modernize their applications. Today at Google I/O, we’re thrilled to announce the preview of AlloyDB for PostgreSQL, a fully-managed, PostgreSQL-compatible database service that provides a powerful option for modernizing your most demanding enterprise database workloads. Compared with standard PostgreSQL, in our performance tests, AlloyDB was more than four times faster for transactional workloads, and up to 100 times faster for analytical queries. AlloyDB was also two times faster for transactional workloads than Amazon’s comparable service. This makes AlloyDB a powerful new modernization option for transitioning off of legacy databases. Read Article
  22. Amazon Relational Database Service (Amazon RDS) for PostgreSQL announces support for PostgreSQL 14 with three levels of cascaded read replicas, 5 replicas per instance, supporting a maximum of up to 155 read replicas per source instance. You can now create Single-AZ or Multi-AZ cascaded read replica DB instances in same region or any one cross region from another read replica instance, enabling you to build a more robust disaster recovery architecture. View the full article
  23. Amazon Relational Database Service (Amazon RDS) for PostgreSQL, version 11 and higher, now supports M6i and R6i instances. M6i instances are the 6th generation of Amazon EC2 x86-based General Purpose compute instances, designed to provide a balance of compute, memory, storage, and network resources. R6i instances are the 6th generation of Amazon EC2 memory optimized instances, designed for memory-intensive workloads. Both M6i and R6i instances are built on the AWS Nitro System, a combination of dedicated hardware and lightweight hypervisor, which delivers practically all of the compute and memory resources of the host hardware to your instances. View the full article
  24. Amazon Relational Database Service (Amazon RDS) now supports AWS Graviton2-based database (DB) instances in the regions of AWS GovCloud (US), Asia Pacific (Seoul), and Europe (Stockholm). Depending on DB engine, version, and workload, Graviton2 instances provide up to 35% performance improvement and up to 52% price/performance improvement over comparable current generation x86-based instances for Amazon RDS for MySQL, MariaDB, and PostgreSQL. View the full article
  25. Following the announcement of updates to the PostgreSQL database, we have updated Amazon RDS for PostgreSQL to support PostgreSQL minor versions 13.2, 12.6, 11.11, 10.16, 9.6.21, and 9.5.25. This release closes security vulnerabilities in PostgreSQL and contains bug fixes and improvements done by the PostgreSQL community. This also includes the final release of PostgreSQL 9.5. View the full article
  • Forum Statistics

    42.4k
    Total Topics
    42.2k
    Total Posts
×
×
  • Create New...