TDS Posted October 12, 2023

Construction engineer investigating his work — Stable Diffusion

Introduction

In our previous publication, From Data Engineering to Prompt Engineering, we demonstrated how to use ChatGPT to solve data preparation tasks. Apart from the positive feedback we received, one critical point was raised: prompt engineering may help with simple tasks, but is it really useful in a more challenging environment? This is a fair point. In recent decades, data architectures have grown increasingly diverse and complex. As a result of this complexity, data engineers increasingly have to integrate a variety of data sources they are not necessarily familiar with. Can prompt engineering help in this context?

This article examines that question based on a real use case from human resources management. We apply few-shot learning to introduce an SAP HCM data model to ChatGPT and analyze the collected information with Apache Spark. This way, we illustrate how prompt engineering can deliver value even in advanced data engineering settings.

About the business case

A common task every medium to large company has to accomplish is to determine the number of its employees and their organizational assignment for any given point in time.
The associated data in our scenario is stored in an SAP HCM system, one of the leading applications for human resource management in enterprise environments. To solve this kind of objective, a data engineer needs to build up a lot of business-related knowledge that is strongly intertwined with the underlying data model.

This article provides a step-by-step guide to solving the described business problem by creating PySpark code that can be used to build the data model and, consequently, the basis for any reporting solution.

Power BI example report showing personnel headcount

Step 1: Determine which information is needed

One of the main challenges in data science is to select the necessary information according to the business use case and to determine its origin in the source systems. To do this, we have to bring some business knowledge into ChatGPT. For this purpose, we teach ChatGPT some information on SAP HCM basic tables, which can be found in the SAP reference manual (Human Resources | SAP Help Portal), combined with a CSV sample record for each table.

In this first scenario, our intention is to report all active employees at a specific point in time. The result should also include each employee's personnel number, name, status and organizational assignment. To gather all the necessary information, we convey a database schema to ChatGPT, including example datasets and field descriptions, using few-shot prompting. We start out by providing the database schema and some example data.

Everyone who knows SAP HCM's data model should be familiar with the concepts of infotypes and transparent tables. An infotype contains the transactional information, whereas the transparent tables contain the business information (master data) of each entity. For the following scenario we will be using OpenAI's GPT-4 to create the code we need. Let's start by providing the basic table information to ChatGPT.

Prompt:

Given the following tables
1. Table PA0000 – Employee Actions

Field;Key;Data Element;Domain;Data Type;Length;Decimal;Short Description;Check table
MANDT;;MANDT;MANDT;CLNT;3;0;Client;T000
PERNR;;PERSNO;PERSNO;NUMC;8;0;Personnel number;PA0003
SUBTY;;SUBTY;SUBTY;CHAR;4;0;Subtype;
OBJPS;;OBJPS;OBJPS;CHAR;2;0;Object Identification
...

containing the following sample datasets:

100;99992228;;;;9999-12-31T00:00:00.0000000;2023-07-01T00:00:00.0000000;0;2023-01-10T00:00:00.0000000;TRP0001;;;;;;;;;;;;;;10;3;;0;0
100;99992229;;;;9999-12-31T00:00:00.0000000;2023-07-01T00:00:00.0000000;0;2023-03-29T00:00:00.0000000;TRP0001;;;;;;;;;;;;;;10;6;;0;0

2. Table T529U – Status Values

Field;Key;Data Element;Domain;Data Type;Length;Decimal Places;Short Description;Check table
MANDT;;MANDT;MANDT;CLNT;3;0;Client;*
SPRSL;;SPRAS;SPRAS;LANG;1;0;Language Key;T002
STATN;;STATN;STATN;NUMC;1;0;Status number;
...

containing the following sample datasets:

MANDT,SPRSL,STATN,STATV,TEXT1
100,D,2,3,active
100,D,2,0,resigned
...

Step 2: Join the necessary base tables and filter active employees only

Now let's create the code to join the base tables.

Prompt:

Give me the pyspark code to restrict "Employee Actions" to contain only employees indicated as active according to their corresponding "Status Values" for English language texts. Use "structuredzone.PA0000" as the source for table PA0000 and "structuredzone.T529U" as the source for table T529U.
Use meaningful column aliases.

This will produce pretty decent and well-formatted PySpark code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initializing Spark session
spark = SparkSession.builder.appName("EmployeeActionsFilter").getOrCreate()

# Reading the tables into DataFrames
df_pa0000 = spark.read.table("structuredzone.PA0000")
df_t529u = spark.read.table("structuredzone.T529U")

# Filtering T529U for the relevant status description and language
df_active_status = df_t529u.filter(
    (col("TEXT1") == "Active") &
    (col("SPRSL") == "E")
)

# Joining PA0000 with the filtered T529U based on the status field STAT2
df_active_employees = df_pa0000.join(
    df_active_status,
    (df_pa0000.STAT2 == df_active_status.STATV) &
    (df_active_status.STATN == 2)  # Assuming STATN=2 indicates employment status based on given data
).select(
    df_pa0000["PERNR"].alias("EmployeeNumber"),
    df_pa0000["MASSN"].alias("ActionType"),
    df_pa0000["MASSG"].alias("ReasonForAction"),
    df_active_status["TEXT1"].alias("StatusDescription")
)

# Show the result
df_active_employees.show()

# Optionally, write this DataFrame to a destination or process it further.
```

Notice that GPT-4 recognized the join criterion for both tables, which is based on column STAT2 of table PA0000 and column STATV of table T529U, the corresponding transparent table. Moreover, the created code uses the business descriptions as column aliases to improve readability.

Step 3: Build a timeline to reflect the company's employee history

Next, we create a DataFrame that contains one date value per day for the period from 2020-01-01 to 2024-01-01 and join all valid employees to it according to their entry date (BEGDA) and possible exit date (ENDDA). In other words, we create an artificial timeline to join the employees to.

Prompt:

In Pyspark: Create a dataframe containing one row for each date beginning at 2020-01-01 ending at 2024-01-01. The Date column should be formatted as a date.
Afterwards join all valid employees according to their start and end date. The output should contain only active employees according to their employment status.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, sequence, to_date, lit, col

# Start a Spark session
spark = SparkSession.builder.appName("EmployeeTimeline").getOrCreate()

# Define start and end dates
start_date = to_date(lit("2020-01-01"))
end_date = to_date(lit("2024-01-01"))

# Create a DataFrame with one row for each date
date_df = (spark.range(1)
           .select(explode(sequence(start_date, end_date)).alias("date")))

# Load the PA0000 and T529U tables
pa0000_df = spark.read.table("structuredzone.PA0000")
t529u_df = spark.read.table("structuredzone.T529U")

# Filter T529U for 'Active' employees with language key = E
active_t529u_df = t529u_df.filter((col("SPRSL") == "E") & (col("TEXT1") == "Active"))

# Join PA0000 with the filtered T529U to get active employees
active_pa0000_df = pa0000_df.join(active_t529u_df, pa0000_df.STAT2 == active_t529u_df.STATV)

# Join with the date range DataFrame to get the timeline of active employees
active_employee_timeline_df = date_df.join(
    active_pa0000_df,
    (date_df.date >= active_pa0000_df.BEGDA) & (date_df.date <= active_pa0000_df.ENDDA)
)

# Show the DataFrame
active_employee_timeline_df.show()
```

Step 4: Dissolve a parent-child relationship table by determining the highest-level organizational object

In this last step, we want to build a DataFrame that represents the organizational structure of the company and determine each object's organizational assignment. Specifically, we want to determine which highest-level organizational unit (e.g. area or division) each child object is assigned to. In a later step, the organizational structure can be joined to the employee timeline to get detailed information on each employee's organizational assignment.
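To make that later integration step concrete, here is a minimal pure-Python sketch of the join semantics. The employee-to-org-unit link column (`OrgUnitID`) is an assumption for this illustration; in a real SAP HCM system that link would typically come from the organizational assignment infotype (table PA0001, field ORGEH), which is not part of this article's prompts:

```python
# Toy rows standing in for the employee timeline (step 3) and the flattened
# organization structure built in this step. All object IDs are invented.
timeline = [
    {"date": "2023-07-01", "EmployeeNumber": "99992228", "OrgUnitID": "O300"},
    {"date": "2023-07-01", "EmployeeNumber": "99992229", "OrgUnitID": "O999"},
]
org_structure = [
    {"ObjectID": "O300", "HighestAvailableOrgUnit": "O400"},
]

# Left join on the assumed OrgUnitID link: every timeline row is kept and
# enriched with its highest-level organizational unit where one is known.
by_object = {row["ObjectID"]: row["HighestAvailableOrgUnit"] for row in org_structure}
enriched = [
    {**row, "HighestAvailableOrgUnit": by_object.get(row["OrgUnitID"])}
    for row in timeline
]
```

In PySpark the same step would be a single left join between the timeline DataFrame and the flattened hierarchy.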
We need to utilize SAP's HRP1001 table to achieve this:

Prompt:

The following table contains the organizational assignment of each employee:

Table HRP1001 – Organization structure

Field;Data Element;Domain;Data Type;Length;Decimal Places;Short Description;Check table
MANDT;MANDT;MANDT;CLNT;3;0;Client;T000
OTYPE;OTYPE;OTYPE;CHAR;2;0;Object Type;T778O
OBJID;HROBJID;HROBJID;NUMC;8;0;Object ID;PLOGI
...

containing the following sample datasets:

MANDT,OTYPE,OBJID,PLVAR,RSIGN,RELAT,ISTAT,PRIOX,BEGDA,ENDDA,VARYF,SEQNR,INFTY,OTJID,SUBTY,AEDTM,UNAME,REASN,HISTO,ITXNR,SCLAS,SOBID,PROZT,ADATANR
100,P,00998833,A,209,1,,2018-01-01T00:00:00.0000000,9999-12-31T00:00:00.0000000,CP50035300,000,1001,P 00100769,A209,2020-12-11T00:00:00.0000000,TRP1000,,,00000000,CP,60095300,0.00,

Prompt:

Table HRP1001 contains parent-child information on the company's organization structure. Column OBJID contains the object ID and SOBID contains the related parent object ID. Column SCLAS contains the information whether the object ID is a position 'S', an organizational unit 'O' or a role 'K'. In Pyspark: Give me the code that creates a dataframe containing the object ID of the highest available organizational unit of each object by recursively dissolving the hierarchy.
Use meaningful column aliases.

The language model produces a recursive function that searches for the highest-level organizational unit ('O') for each object:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Initialize a Spark session
spark = SparkSession.builder.appName("HighestAvailableOrgUnit").getOrCreate()

HRP1001_df = spark.read.table("structuredzone.HRP1001")

# A Spark UDF cannot reference a DataFrame, so the parent-child edges are
# collected into a plain dictionary first: OBJID -> (SOBID, SCLAS)
edges = {
    row["OBJID"]: (row["SOBID"], row["SCLAS"])
    for row in HRP1001_df.select("OBJID", "SOBID", "SCLAS").collect()
}

# Recursive function to find the highest available organizational unit
def find_highest_org_unit(obj_id):
    if obj_id not in edges:
        return None
    # Get the parent object ID (SOBID) and SCLAS for the current object
    parent_obj_id, sclas = edges[obj_id]
    # Recursively find the highest available organizational unit above it
    highest_org_unit = find_highest_org_unit(parent_obj_id)
    # If the current object is an organizational unit ('O') or no higher unit exists, use it
    if sclas == 'O' or highest_org_unit is None:
        highest_org_unit = obj_id
    return highest_org_unit

# Create a UDF (user-defined function) and apply it to each object
find_highest_org_unit_udf = udf(find_highest_org_unit, StringType())
highest_org_unit_df = HRP1001_df.withColumn(
    "HighestAvailableOrgUnit", find_highest_org_unit_udf(col("OBJID"))
)

# Select meaningful columns and aliases
result_df = highest_org_unit_df.select(
    col("OBJID").alias("ObjectID"),
    col("HighestAvailableOrgUnit").alias("HighestAvailableOrgUnit")
)

# Show the resulting DataFrame
result_df.show()

# Stop the Spark session
spark.stop()
```

Even though the created code is not very well optimized, it contains a recursive function to dissolve the hierarchy.
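Before wiring such logic into Spark, the climb through the hierarchy can be prototyped on a plain Python mapping of parent-child edges. The object IDs and the shape of the hierarchy below are invented for illustration and assume an acyclic tree:

```python
# Toy parent-child edges in the shape OBJID -> (SOBID, SCLAS),
# standing in for HRP1001. All IDs are made up.
edges = {
    "P100": ("S200", "P"),   # a person assigned to a position
    "S200": ("O300", "S"),   # the position belongs to an org unit
    "O300": ("O400", "O"),   # the org unit reports to a higher org unit
    "O400": (None, "O"),     # the top-level org unit
}

def highest_org_unit(obj_id):
    """Climb the hierarchy and return the topmost organizational unit ('O')."""
    highest = None
    current = obj_id
    while current in edges:
        parent, sclas = edges[current]
        if sclas == "O":
            highest = current  # remember the highest 'O' seen so far
        current = parent
    # Fall back to the starting object if no org unit was found on the path
    return highest if highest is not None else obj_id
```

For example, `highest_org_unit("P100")` walks P100 → S200 → O300 → O400 and returns "O400", mirroring what the recursive UDF computes per row.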
Users who prefer common table expressions (CTEs) can give a hint ("using a common table expression") in the input prompt to create a more readable and understandable PySpark statement:

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("OrganizationalUnitHierarchy").getOrCreate()

# Define the HRP1001 DataFrame (assuming you have it loaded already)
# Replace 'hrp1001_df' with your actual DataFrame
# Make sure you have the appropriate columns: 'OBJID', 'SOBID', and 'SCLAS'

# Register the HRP1001 DataFrame as a temporary SQL table
hrp1001_df.createOrReplaceTempView("hrp1001")

# Define the recursive query to dissolve the hierarchy and find the highest organizational unit
recursive_query = """
WITH RECURSIVE OrgHierarchy AS (
    SELECT OBJID, SOBID, SCLAS
    FROM hrp1001
    WHERE SCLAS = 'O'  -- Start with organizational units

    UNION ALL

    SELECT h.OBJID, p.SOBID, p.SCLAS
    FROM OrgHierarchy h
    JOIN hrp1001 p ON h.SOBID = p.OBJID
)
SELECT DISTINCT h.OBJID AS EmployeeID, h.SOBID AS HighestOrganizationalUnit
FROM OrgHierarchy h
"""

# Execute the recursive query and create the resulting DataFrame
result_df = spark.sql(recursive_query)

# Show the result
result_df.show()

# Stop the Spark session when done
spark.stop()
```

Be aware that Spark SQL does not support recursive CTEs on older Spark versions, so this variant may need a compatible runtime or manual adaptation. Both code versions create the DataFrame for the flattened hierarchical organization structure, which can be used for further data integration steps by simply joining it to the previously generated DataFrame.

Dataset containing hierarchical information

Conclusion

We have shown how to tackle more advanced data engineering tasks in a practical use case by extracting and integrating data from SAP systems, using ChatGPT to generate PySpark code. Large language models may not yet be perfect, but everyone can already imagine how powerful these techniques can become for data engineers. There are several key takeaways:

ChatGPT is capable of understanding the fundamental principles of data models.
You can refine its understanding by using prompting techniques to supply more in-depth knowledge.

Even if the approach does not produce perfect code on the first try, we can easily adjust the created code to fit our individual scenarios.

Due to the wide availability of open reference documents and SAP knowledge bases, the approach can be expanded into a Retrieval-Augmented Generation (RAG) solution.

When it comes to prompt engineering best practices, try to be as precise as possible and provide the error codes returned by your Spark environment to leverage the LLM's ability to refactor the generated code. Multiple tries might be necessary to refine the code; nevertheless, adding keywords like "precise" to your prompt might help ChatGPT produce better results. Asking for a detailed explanation of the solution approach can also push the model to reason through the problem more thoroughly.

Note: The prompts containing the CSV example datasets had to be cut off due to the length constraints of this article.

About the authors

Markus Stadi is a Senior Cloud Data Engineer at Dehn SE, working in the field of Data Engineering, Data Science and Data Analytics for many years.

Christian Koch is an Enterprise Architect at BWI GmbH and Lecturer at the Nuremberg Institute of Technology Georg Simon Ohm.

Mastering data integration from SAP Systems with prompt engineering was originally published in Towards Data Science on Medium.