TDS Posted October 12, 2023

Construction engineer investigating his work — Stable Diffusion

Introduction

In our previous publication, From Data Engineering to Prompt Engineering, we demonstrated how to use ChatGPT to solve data preparation tasks. Apart from the positive feedback we received, one critical point was raised: prompt engineering may help with simple tasks, but is it really useful in a more challenging environment? This is a fair point. In recent decades, data architectures have grown increasingly diverse and complex. As a result of this complexity, data engineers increasingly have to integrate a variety of data sources they are not necessarily familiar with. Can prompt engineering help in this context?

This article examines that question based on a real use case from human resources management. We apply few-shot learning to introduce an SAP HCM data model to ChatGPT and analyze the collected information with Apache Spark. This way, we illustrate how prompt engineering can deliver value even in advanced data engineering settings.

About the business case

A common task every medium to large company has to accomplish is to determine the number of its employees and their organizational assignment for any given point in time.
The associated data in our scenario is stored in an SAP HCM system, one of the leading applications for human resource management in enterprise environments. To solve this kind of objective, a data engineer needs to build up a lot of business-related knowledge that is strongly intertwined with the underlying data model.

This article provides a step-by-step guide to solving the described business problem by creating PySpark code that can be used to build the data model and, consequently, the basis for any reporting solution.

Power BI example report showing personnel headcount

Step 1: Determine which information is needed

One of the main challenges in data science is to select the necessary information according to the business use case and to determine its origin in the source systems. To do this, we have to bring some business knowledge into ChatGPT. For this purpose, we teach ChatGPT some information on SAP HCM basic tables, which can be found in the SAP reference manual (Human Resources | SAP Help Portal), combined with a CSV sample record for each table.

In this first scenario, our intention is to report all active employees at a specific point in time. The result should also include each employee's personnel number, name, status and organizational assignment. To gather all the necessary information, we convey a database schema to ChatGPT, including example datasets and field descriptions, using few-shot prompting. We start out by providing the database schema and some example data.

Everyone who knows SAP HCM's data model should be familiar with the concepts of infotypes and transparent tables. An infotype contains the transactional information, whereas the transparent tables contain the business information (master data) of each entity. For the following scenario we will be using OpenAI's GPT-4 to create the code we need. Let's start by providing the basic table information to ChatGPT.

Prompt:

Given the following tables
1. Table PA0000 – Employee Actions

Field;Key;Data Element;Domain;Data Type;Length;Decimal;Short Description;Check table
MANDT;;MANDT;MANDT;CLNT;3;0;Client;T000
PERNR;;PERSNO;PERSNO;NUMC;8;0;Personnel number;PA0003
SUBTY;;SUBTY;SUBTY;CHAR;4;0;Subtype;
OBJPS;;OBJPS;OBJPS;CHAR;2;0;Object Identification
...

containing the following sample datasets:

100;99992228;;;;9999-12-31T00:00:00.0000000;2023-07-01T00:00:00.0000000;0;2023-01-10T00:00:00.0000000;TRP0001;;;;;;;;;;;;;;10;3;;0;0
100;99992229;;;;9999-12-31T00:00:00.0000000;2023-07-01T00:00:00.0000000;0;2023-03-29T00:00:00.0000000;TRP0001;;;;;;;;;;;;;;10;6;;0;0

2. Table T529U – Status Values

Field;Key;Data Element;Domain;Data Type;Length;Decimal Places;Short Description;Check table
MANDT;;MANDT;MANDT;CLNT;3;0;Client;*
SPRSL;;SPRAS;SPRAS;LANG;1;0;Language Key;T002
STATN;;STATN;STATN;NUMC;1;0;Status number;
...

containing the following sample datasets:

MANDT,SPRSL,STATN,STATV,TEXT1
100,D,2,3,active
100,D,2,0,resigned
...

Step 2: Join the necessary base tables and filter active employees only

Now let's create the code to join the base tables.

Prompt:

Give me the pyspark code to restrict "Employee Actions" to contain only employees indicated as active according to their corresponding "Status Values" for English language texts. Use "structuredzone.PA0000" as the source for table PA0000 and "structuredzone.T529U" as the source for table T529U.
Use meaningful column aliases.

This will produce pretty decent and well-formatted PySpark code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initializing Spark session
spark = SparkSession.builder.appName("EmployeeActionsFilter").getOrCreate()

# Reading the tables into DataFrames
df_pa0000 = spark.read.table("structuredzone.PA0000")
df_t529u = spark.read.table("structuredzone.T529U")

# Filtering T529U for the relevant status description and language
df_active_status = df_t529u.filter(
    (col("TEXT1") == "Active") &
    (col("SPRSL") == "E")
)

# Joining PA0000 with the filtered T529U based on the status field STAT2
df_active_employees = df_pa0000.join(
    df_active_status,
    (df_pa0000.STAT2 == df_active_status.STATV) &
    (df_active_status.STATN == 2)  # Assuming STATN=2 indicates employment status based on given data
).select(
    df_pa0000["PERNR"].alias("EmployeeNumber"),
    df_pa0000["MASSN"].alias("ActionType"),
    df_pa0000["MASSG"].alias("ReasonForAction"),
    df_active_status["TEXT1"].alias("StatusDescription")
)

# Show the result
df_active_employees.show()

# Optionally, write this DataFrame to a destination or process it further.
```

Notice that GPT-4 recognized the join criterion for both tables, which is based on column STAT2 of table PA0000 and column STATV of table T529U, the corresponding transparent table. Moreover, the created code uses the business descriptions as column aliases to improve readability.

Step 3: Build a timeline to reflect the company's employee history

Next, we create a DataFrame that contains one date value per day for the period from 2020-01-01 to 2024-01-01 and join all valid employees to it according to their entry date (BEGDA) and possible exit date (ENDDA). In other words, we create an artificial timeline to join the employees to.

Prompt:

In Pyspark: Create a dataframe containing one row for each date beginning at 2020-01-01 ending at 2024-01-01. The Date column should be formatted as a date.
Afterwards join all valid employees according to their start and end date. The output should contain only active employees according to their employment status.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, sequence, to_date, lit, col

# Start a Spark session
spark = SparkSession.builder.appName("EmployeeTimeline").getOrCreate()

# Define start and end dates
start_date = to_date(lit("2020-01-01"))
end_date = to_date(lit("2024-01-01"))

# Create a DataFrame with one row for each date
date_df = (spark.range(1)
           .select(explode(sequence(start_date, end_date)).alias("date")))

# Load the PA0000 and T529U tables
pa0000_df = spark.read.table("structuredzone.PA0000")
t529u_df = spark.read.table("structuredzone.T529U")

# Filter T529U for 'Active' employees with language key = E
active_t529u_df = t529u_df.filter((col("SPRSL") == "E") & (col("TEXT1") == "Active"))

# Join PA0000 with the filtered T529U to get active employees
active_pa0000_df = pa0000_df.join(active_t529u_df, pa0000_df.STAT2 == active_t529u_df.STATV)

# Join with the date range DataFrame to get the timeline of active employees
active_employee_timeline_df = date_df.join(
    active_pa0000_df,
    (date_df.date >= active_pa0000_df.BEGDA) & (date_df.date <= active_pa0000_df.ENDDA)
)

# Show the DataFrame
active_employee_timeline_df.show()
```

Step 4: Dissolve a parent-child relationship table by determining the highest-level organizational object

In this last step, we want to build a DataFrame that represents the organizational structure of the company and determine each object's organizational assignment. Specifically, we want to determine which highest-level organizational unit (e.g. area or division) each child object is assigned to. In a later step, the organizational structure can be joined to the employee timeline to get detailed information on each employee's organizational assignment.
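To make that later integration step concrete, here is a minimal pure-Python sketch of the join semantics. The employee-to-org-unit link column (`OrgUnitID`) is an assumption for this illustration; in a real SAP HCM system that link would typically come from the organizational assignment infotype (table PA0001, field ORGEH), which is not part of this article's prompts:

```python
# Toy rows standing in for the employee timeline (step 3) and the flattened
# organization structure built in this step. All object IDs are invented.
timeline = [
    {"date": "2023-07-01", "EmployeeNumber": "99992228", "OrgUnitID": "O300"},
    {"date": "2023-07-01", "EmployeeNumber": "99992229", "OrgUnitID": "O999"},
]
org_structure = [
    {"ObjectID": "O300", "HighestAvailableOrgUnit": "O400"},
]

# Left join on the assumed OrgUnitID link: every timeline row is kept and
# enriched with its highest-level organizational unit where one is known.
by_object = {row["ObjectID"]: row["HighestAvailableOrgUnit"] for row in org_structure}
enriched = [
    {**row, "HighestAvailableOrgUnit": by_object.get(row["OrgUnitID"])}
    for row in timeline
]
```

In PySpark the same step would be a single left join between the timeline DataFrame and the flattened hierarchy.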
We need to utilize SAP's HRP1001 table to achieve this:

Prompt:

The following table contains the organizational assignment of each employee:

Table HRP1001 – Organization structure

Field;Data Element;Domain;Data Type;Length;Decimal Places;Short Description;Check table
MANDT;MANDT;MANDT;CLNT;3;0;Client;T000
OTYPE;OTYPE;OTYPE;CHAR;2;0;Object Type;T778O
OBJID;HROBJID;HROBJID;NUMC;8;0;Object ID;PLOGI
...

containing the following sample datasets:

MANDT,OTYPE,OBJID,PLVAR,RSIGN,RELAT,ISTAT,PRIOX,BEGDA,ENDDA,VARYF,SEQNR,INFTY,OTJID,SUBTY,AEDTM,UNAME,REASN,HISTO,ITXNR,SCLAS,SOBID,PROZT,ADATANR
100,P,00998833,A,209,1,,2018-01-01T00:00:00.0000000,9999-12-31T00:00:00.0000000,CP50035300,000,1001,P 00100769,A209,2020-12-11T00:00:00.0000000,TRP1000,,,00000000,CP,60095300,0.00,

Prompt:

Table HRP1001 contains parent-child information on the company's organization structure. Column OBJID contains the object ID and SOBID contains the related parent object ID. Column SCLAS contains the information whether the object ID is a position 'S', an organizational unit 'O' or a role 'K'. In Pyspark: Give me the code that creates a dataframe containing the object ID of the highest available organizational unit of each object by recursively dissolving the hierarchy.
Use meaningful column aliases.

The language model produces a recursive function that searches for the highest-level organizational unit ('O') for each object:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Initialize a Spark session
spark = SparkSession.builder.appName("HighestAvailableOrgUnit").getOrCreate()

HRP1001_df = spark.read.table("structuredzone.HRP1001")

# A Spark UDF cannot reference a DataFrame, so the parent-child edges are
# collected into a plain dictionary first: OBJID -> (SOBID, SCLAS)
edges = {
    row["OBJID"]: (row["SOBID"], row["SCLAS"])
    for row in HRP1001_df.select("OBJID", "SOBID", "SCLAS").collect()
}

# Recursive function to find the highest available organizational unit
def find_highest_org_unit(obj_id):
    if obj_id not in edges:
        return None
    # Get the parent object ID (SOBID) and SCLAS for the current object
    parent_obj_id, sclas = edges[obj_id]
    # Recursively find the highest available organizational unit above it
    highest_org_unit = find_highest_org_unit(parent_obj_id)
    # If the current object is an organizational unit ('O') or no higher unit exists, use it
    if sclas == 'O' or highest_org_unit is None:
        highest_org_unit = obj_id
    return highest_org_unit

# Create a UDF (user-defined function) and apply it to each object
find_highest_org_unit_udf = udf(find_highest_org_unit, StringType())
highest_org_unit_df = HRP1001_df.withColumn(
    "HighestAvailableOrgUnit", find_highest_org_unit_udf(col("OBJID"))
)

# Select meaningful columns and aliases
result_df = highest_org_unit_df.select(
    col("OBJID").alias("ObjectID"),
    col("HighestAvailableOrgUnit").alias("HighestAvailableOrgUnit")
)

# Show the resulting DataFrame
result_df.show()

# Stop the Spark session
spark.stop()
```

Even though the created code is not very well optimized, it contains a recursive function to dissolve the hierarchy.
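Before wiring such logic into Spark, the climb through the hierarchy can be prototyped on a plain Python mapping of parent-child edges. The object IDs and the shape of the hierarchy below are invented for illustration and assume an acyclic tree:

```python
# Toy parent-child edges in the shape OBJID -> (SOBID, SCLAS),
# standing in for HRP1001. All IDs are made up.
edges = {
    "P100": ("S200", "P"),   # a person assigned to a position
    "S200": ("O300", "S"),   # the position belongs to an org unit
    "O300": ("O400", "O"),   # the org unit reports to a higher org unit
    "O400": (None, "O"),     # the top-level org unit
}

def highest_org_unit(obj_id):
    """Climb the hierarchy and return the topmost organizational unit ('O')."""
    highest = None
    current = obj_id
    while current in edges:
        parent, sclas = edges[current]
        if sclas == "O":
            highest = current  # remember the highest 'O' seen so far
        current = parent
    # Fall back to the starting object if no org unit was found on the path
    return highest if highest is not None else obj_id
```

For example, `highest_org_unit("P100")` walks P100 → S200 → O300 → O400 and returns "O400", mirroring what the recursive UDF computes per row.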
Users who prefer common table expressions (CTEs) can give a hint ("using a common table expression") in the input prompt to create a more readable and understandable PySpark statement:

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("OrganizationalUnitHierarchy").getOrCreate()

# Define the HRP1001 DataFrame (assuming you have it loaded already)
# Replace 'hrp1001_df' with your actual DataFrame
# Make sure you have the appropriate columns: 'OBJID', 'SOBID', and 'SCLAS'

# Register the HRP1001 DataFrame as a temporary SQL table
hrp1001_df.createOrReplaceTempView("hrp1001")

# Define the recursive query to dissolve the hierarchy and find the highest organizational unit
recursive_query = """
WITH RECURSIVE OrgHierarchy AS (
    SELECT OBJID, SOBID, SCLAS
    FROM hrp1001
    WHERE SCLAS = 'O'  -- Start with organizational units

    UNION ALL

    SELECT h.OBJID, p.SOBID, p.SCLAS
    FROM OrgHierarchy h
    JOIN hrp1001 p ON h.SOBID = p.OBJID
)
SELECT DISTINCT h.OBJID AS EmployeeID, h.SOBID AS HighestOrganizationalUnit
FROM OrgHierarchy h
"""

# Execute the recursive query and create the resulting DataFrame
result_df = spark.sql(recursive_query)

# Show the result
result_df.show()

# Stop the Spark session when done
spark.stop()
```

Be aware that Spark SQL does not support recursive CTEs on older Spark versions, so this variant may need a compatible runtime or manual adaptation. Both code versions create the DataFrame for the flattened hierarchical organization structure, which can be used for further data integration steps by simply joining it to the previously generated DataFrame.

Dataset containing hierarchical information

Conclusion

We have shown how to tackle more advanced data engineering tasks in a practical use case by extracting and integrating data from SAP systems, using ChatGPT to generate PySpark code. Large language models may not yet be perfect, but everyone can already imagine how powerful these techniques can become for data engineers. There are several key takeaways:

ChatGPT is capable of understanding the fundamental principles of data models.
You can refine its understanding by using prompting techniques to supply more in-depth knowledge.

Even if the approach does not produce perfect code on the first try, we can easily adjust the created code to fit our individual scenarios.

Due to the wide availability of open reference documents and SAP knowledge bases, the approach can be expanded into a Retrieval-Augmented Generation (RAG) solution.

When it comes to prompt engineering best practices, try to be as precise as possible and provide the error codes returned by your Spark environment to leverage the LLM's ability to refactor the generated code. Multiple tries might be necessary to refine the code; nevertheless, adding keywords like "precise" to your prompt might help ChatGPT produce better results. Asking for a detailed explanation of the solution approach can also push the model to reason through the problem more thoroughly.

Note: The prompts containing the CSV example datasets had to be cut off due to the length constraints of this article.

About the authors

Markus Stadi is a Senior Cloud Data Engineer at Dehn SE, working in the field of Data Engineering, Data Science and Data Analytics for many years.

Christian Koch is an Enterprise Architect at BWI GmbH and Lecturer at the Nuremberg Institute of Technology Georg Simon Ohm.

Mastering data integration from SAP Systems with prompt engineering was originally published in Towards Data Science on Medium.