SAP Data Intelligence Platform 3.0 – Let’s install on Open Source

With the Data Intelligence Platform (DI), SAP enhances the functionality of the Datahub by adding dedicated tools for machine learning and AI.

Although the Datahub has been available for some time now, both as a cloud and as an on-premise system, its successor, the Data Intelligence Platform, followed only recently and is also available to customers on their local network. Since we have already reported on the installation of the Datahub in an Open Source environment, it makes sense to take a closer look at the DI installation, examine the differences, and point out what you need to be aware of in the process.

The first thing you notice is that the supported versions have changed. For the Kubernetes cluster, we can now use version 1.15.12, for example. We also used CentOS 7 as a stable and established distribution when we installed Kubernetes for SAP Datahub 2.7, whereas this time we upgrade the operating system and build the environment on CentOS 8. This upgrade also changes the way we implement our Ceph storage: the tool “ceph-deploy” was still usable on CentOS 7 (many how-to guides out there refer to it), but this has changed in the meantime, and the storage is now, like many other software solutions, container (Docker) based. You can install it using the recommended tool “cephadm”; detailed instructions can be found on the project’s homepage. In our experience, the container deployment was very straightforward and has been running stably ever since!

So, let’s start to build the infrastructure. In general, the main steps haven’t changed from the installation of Datahub – we will therefore try to keep it short.

  1. Initiate Kubernetes cluster
  2. Deploy pod network
  3. Add worker nodes
  4. Configure DNS settings
  5. Connect default storage
  6. Provide a container registry

If you’d like a more detailed description, check out our previous article linked in the first section. Before moving on to the SAP-specific part, a quick health check of the resulting cluster, as sketched below, can save some debugging later.
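
As a minimal sketch, such a health check could use the official kubernetes Python client, assuming a working kubeconfig on the host you run it from; node and pod names are simply whatever your cluster reports:

```python
from kubernetes import client, config

# Load credentials from ~/.kube/config (created during the kubeadm setup).
config.load_kube_config()
v1 = client.CoreV1Api()

# Every node - master and workers - should report the "Ready" condition as True.
for node in v1.list_node().items:
    ready = next(c.status for c in node.status.conditions if c.type == "Ready")
    print(f"{node.metadata.name}: Ready={ready}")

# Pods stuck outside Running/Succeeded (e.g. in kube-system) usually point to
# problems with the pod network, DNS, or storage configuration.
for pod in v1.list_pod_for_all_namespaces().items:
    if pod.status.phase not in ("Running", "Succeeded"):
        print(f"{pod.metadata.namespace}/{pod.metadata.name}: {pod.status.phase}")
```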

When installing the Data Intelligence Platform, the previously used “Host Agent” no longer exists, and the SLC Bridge has now fully replaced the alternative command-line tool “install.sh”. If you have never performed a Datahub or DI installation before, this quote from the SAP installation guide might give you a quick impression of how it works.

“You run any deployment using the Software Lifecycle Container Bridge 1.0 tool with Maintenance Planner. In the following, we use the abbreviated tool name “SLC Bridge”. The SLC Bridge runs as a Pod named “SLC Bridge Base” in your Kubernetes cluster. The SLC Bridge Base downloads the required images from the SAP Registry and deploys them to your private image repository (Container Image Repository) from which they are retrieved by the SLC Bridge to perform an installation, upgrade, or uninstallation.”

After you have completed the first sections and selected the correct product for installation, a “Prerequisite Check” will be executed. If everything is good and all requirements pass the test, you can move on and start to define the DI parameters.

Most of the steps are very intuitive, but we’d still like to zoom in on two points that have changed since the previous installation.

First, we can now configure an external checkpoint store for the Vora database in SAP DI Platform Full and therefore use streaming tables with external storage. This is not necessarily required for development or Proof-of-Concept installations, but it is highly recommended for production use – reason enough to take a closer look!

There are several external storage options to choose from at this point, but since we are performing an on-premise installation and have a Hadoop cluster running internally, HDFS works fine for us. Everything is still Open Source without any extra costs, and our system stays entirely within the local network.

Define our WebHDFS host with port and path:
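
The installer later offers a validation test, but the endpoint can also be probed up front with the standard WebHDFS REST API (op=LISTSTATUS). A small sketch, where host, port, path, and user name are placeholders for your own values:

```python
import requests

# Placeholders - replace with the values entered in the SLC Bridge dialog.
WEBHDFS_HOST = "hadoop-nn.example.local"
WEBHDFS_PORT = 9870                      # 50070 on older Hadoop 2.x clusters
CHECKPOINT_PATH = "/user/vora/checkpoints"

url = (f"http://{WEBHDFS_HOST}:{WEBHDFS_PORT}/webhdfs/v1"
       f"{CHECKPOINT_PATH}?op=LISTSTATUS&user.name=vora")
response = requests.get(url, timeout=10)
response.raise_for_status()

# A valid answer contains a FileStatuses object for the checkpoint directory.
print(response.json()["FileStatuses"]["FileStatus"])
```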

In the last step of this section you can run a validation test for your configuration – if everything is correct you can move on to the next section!

The second part of the installation that might require a little more attention is a checkbox named “Kaniko”, which is activated by default.

“Enable Kaniko Usage” made us work up quite a sweat during the first installation. The reason was that, until then, our private Docker registry had used HTTPS with a self-signed certificate. That had never been a problem before, but to get Kaniko running correctly, a secured registry with a trusted certificate is an important requirement! So, if your installation doesn’t complete and perhaps hangs on the “vflow” verification at the end, this might also be the problem in your on-premise installation. In our opinion there are only two clean solutions: either you don’t use Kaniko, or you use a trusted certificate for your private Docker registry. We therefore first performed a successful installation without enabling the Kaniko checkbox, just to see if it would work, and afterwards started all over again to perform a little trick: we created a valid wildcard certificate through the free Let’s Encrypt service for our external domain and used it internally by adding the related search domain to our local DNS server and exchanging the self-signed certificate in the registry for the new one. It works perfectly and doesn’t require any port forwarding – cool thing! But keep in mind that Let’s Encrypt certificates are only valid for three months, so don’t forget to renew them.
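
Because a forgotten renewal would silently break the Kaniko image builds again, a small check of the registry certificate’s remaining lifetime is worth automating. A minimal sketch (the registry host name is a placeholder):

```python
import socket
import ssl
from datetime import datetime, timezone

# Placeholder - replace with your private Docker registry.
REGISTRY_HOST = "registry.example.com"
REGISTRY_PORT = 443

context = ssl.create_default_context()
with socket.create_connection((REGISTRY_HOST, REGISTRY_PORT), timeout=10) as sock:
    with context.wrap_socket(sock, server_hostname=REGISTRY_HOST) as tls:
        cert = tls.getpeercert()

# 'notAfter' holds the expiry date; warn well before the three-month
# lifetime of the Let's Encrypt certificate runs out.
expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(cert["notAfter"]),
                                 tz=timezone.utc)
days_left = (expires - datetime.now(timezone.utc)).days
print(f"{REGISTRY_HOST}: certificate expires in {days_left} days")
if days_left < 14:
    print("Time to renew the certificate!")
```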

Apart from those points, the installation should run the same way as for the Datahub. After the execution has finished and all checks are completed, the corresponding service / port needs to be exposed again. If everything worked out, you can reach the start page via the browser and log in to your SAP Data Intelligence Platform for the first time – that’s it!

If you have further comments or questions, we’ll be happy to help you. Just leave us feedback in the comments!

Best regards and see you soon for more.

Contact

Oliver Ossenbrink

Managing Director, Sales and HR

Data Products Setup

I’ll start with the Data Products setup. If you’re new to the concept, this recent video is a great starting point, but here’s a short summary: a data product is a well-described, easily discoverable, and consumable collection of data sets.

Creating a Data Product in Datasphere

Note that in this article I create Data Products in the Data Sharing Cockpit in Datasphere. This functionality is expected to move into the Data Product Studio, but that had not taken place at the time of writing.

Before creating a Data Product in Datasphere, I need to set up a Data Provider profile, collecting descriptive metadata like contact and address details, industry, and regional coverage, and, importantly, defining Data Product Visibility. Enabling Formations allows me to share the Data Product with systems across my BDC Formation – Databricks, in this case.

With the Data Provider set up, I can go ahead and create a Data Product. As with the Data Provider, I’ll need to add metadata about the product and define its artifacts – the datasets it contains. Only datasets from a space of SAP HANA Data Lake Files type can be selected. Since this Data Product is visible across the Formation, it is available free of charge.

For this demo, the artifact is a local table containing ten years of Ice Cream sales data. Since this is a File type space, importing a CSV file directly to create a local table isn’t an option (see documentation).

I used a Replication Flow to perform an initial load from a BW aDSO table into a local table.

Once the Data Product is created and listed, it becomes available in the Catalog & Marketplace, from where it can be shared with Databricks by selecting the appropriate connection details.

Jump into Databricks

To use the shared object in Databricks, I need to mount it to the Catalog – either by creating a new Catalog or using an existing one.

Databricks appends a version number to the end of the schema – ‘:v1’ – to maintain versioning in case of any future changes to the Data Product.

Once the share is mounted, the schema is created automatically, and the Sales actual data table becomes available within it. From there, I can access the shared table directly in a Notebook.
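
As an illustration, a Notebook cell reading the shared table could look like the following sketch. The catalog, schema, and table names are placeholders; spark is the SparkSession Databricks provides in every Notebook, and the schema has to be back-quoted because of the ‘:v1’ suffix:

```python
# Placeholder names for the mounted share; the schema carries the ":v1" suffix.
df = spark.sql("""
    SELECT *
    FROM bdc_datasphere_share.`icecream_sales:v1`.sales_actuals
""")

df.show(5)                                  # inspect the shared sales actuals
print(df.count(), "rows of sales actuals")  # ten years of ice cream sales
```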

Creating a Data Product in Databricks

To create a Data Product in Databricks, I first need to create a Share – which I can either do via the Delta Sharing settings in the Catalog:

Or directly out of the table which is going to become a part of the Share:

Since a single Share can contain multiple tables, I have the option to either add the table to an existing Share, or create a new one:
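
The same can also be done with SQL from a Notebook. A minimal sketch with placeholder names for the Share and the forecast table:

```python
# "forecast_share" and "main.forecasts.sales_forecast" are placeholder names.
spark.sql("CREATE SHARE IF NOT EXISTS forecast_share")
spark.sql("ALTER SHARE forecast_share ADD TABLE main.forecasts.sales_forecast")

# Verify what the Share exposes before publishing it as a Data Product.
spark.sql("SHOW ALL IN SHARE forecast_share").show(truncate=False)
```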

To publish the Share as a Data Product, I run a Python script where I define the target table for the forecast and describe the Share in CSN notation, setting the Primary Keys. Primary Keys are required for installing Data Products in Datasphere.
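
The publishing call itself depends on the concrete BDC environment, so the following is only a simplified, hypothetical sketch of the CSN part: a JSON document that describes the forecast entity, where "key": True marks the Primary Key fields Datasphere requires (all field names are invented for this example):

```python
import json

# Hypothetical CSN (Core Schema Notation) description of the shared forecast table.
csn = {
    "definitions": {
        "SalesForecast": {
            "kind": "entity",
            "elements": {
                "CALMONTH":     {"type": "cds.String", "length": 6, "key": True},
                "CITY":         {"type": "cds.String", "length": 40, "key": True},
                "FORECAST_QTY": {"type": "cds.Decimal", "precision": 17, "scale": 3},
            },
        }
    }
}

# The resulting document is handed to the publishing step of the script.
print(json.dumps(csn, indent=2))
```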

Jump back into Datasphere

Once the Databricks Data Product is available in Datasphere, I install it into a Space configured as a HANA Database space – since my intention is to build a view on top of the table and use it for planning in SAC.

There are two installation options: as a Remote table for live data access, or as a Replication Flow, in which case the data is physically copied into the object store in Datasphere.

Since I want live access, I install it as a Remote Table:

and build a Graphical view of type Fact on top:

Forecast calculation

With my Data Products set up and the Sales actual data available in Databricks, I create a Notebook to calculate the Sales Forecast.

The approach combines Sales and Weather data to train a Linear Regression model. I import the Weather data* (https://zenodo.org/records/4770937) from an external server directly into Databricks, select the relevant features from the weather dataset, and combine them with the Sales actual data:

* Klein Tank, A.M.G. and Coauthors, 2002. Daily dataset of 20th-century surface
air temperature and precipitation series for the European Climate Assessment.
Int. J. of Climatol., 22, 1441-1453.
Data and metadata available at http://www.ecad.eu
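
As an illustration of this preparation step, here is a minimal sketch of a Notebook cell that joins the two sources. All table, file, and column names are assumptions (in the ECA&D files, TG is the daily mean temperature in 0.1 °C and Q_TG = 0 marks valid measurements), and the real feature selection will differ:

```python
import pandas as pd

# Shared sales actuals from the mounted BDC catalog (placeholder names).
sales = spark.sql("""
    SELECT * FROM bdc_datasphere_share.`icecream_sales:v1`.sales_actuals
""").toPandas()

# ECA&D weather series downloaded to DBFS; skip the descriptive text header,
# parse the YYYYMMDD date column, and keep only valid measurements.
weather = pd.read_csv("/dbfs/FileStore/weather/TG_STAID000162.txt",
                      skiprows=20, skipinitialspace=True)
weather["DATE"] = pd.to_datetime(weather["DATE"], format="%Y%m%d")
weather = weather[weather["Q_TG"] == 0]
weather["TEMP_C"] = weather["TG"] / 10.0

# Aggregate to calendar months and join with the monthly sales figures.
weather["CALMONTH"] = weather["DATE"].dt.strftime("%Y%m")
monthly_temp = weather.groupby("CALMONTH", as_index=False)["TEMP_C"].mean()
training_df = sales.merge(monthly_temp, on="CALMONTH", how="inner")
```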

Using the “sklearn” library, I build and train a Linear regression model:

Once trained, the model predicts the Sales forecast for Rome in June 2026 based on the weather forecast, and I save the results to my Catalog table:
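
Continuing from the training_df prepared above, the training, prediction, and write-back could look like this sketch (the temperature value and all table and column names are placeholders):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Train a simple linear regression: monthly mean temperature -> sales quantity.
X = training_df[["TEMP_C"]]
y = training_df["SALES_QTY"]
model = LinearRegression().fit(X, y)

# Predict June 2026 sales for Rome from the forecasted mean temperature
# (26.5 °C is a placeholder for the actual weather forecast value).
june_2026 = pd.DataFrame({"TEMP_C": [26.5]})
forecast_qty = float(model.predict(june_2026)[0])

# Append the result to the Catalog table that is shared back with Datasphere.
result = spark.createDataFrame(
    [("202606", "Rome", forecast_qty)],
    ["CALMONTH", "CITY", "FORECAST_QTY"],
)
result.write.mode("append").saveAsTable("main.forecasts.sales_forecast")
```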

Seamless planning data model

The Seamless Planning concept is built around physically storing planning data and public dimensions directly in Datasphere, keeping them alongside the actual data.

Since the QRC4 2025 SAC release, it has also been possible to use live versions and bring reference data into planning models without replication.

In this scenario, I build a seamless planning model on top of the Graphical view I created over the Remote table. This lets me use the forecast generated in Databricks as a reference for the final SAC Forecast version.

The model setup follows these steps:

Create a new model:

Start with data:

Select Datasphere as the data storage:

From there, I define the model structure and can review the data in the preview.

For a deeper dive into Seamless Planning, I recommend this biX blog.

Process Flow automation

Multi-action triggers Datasphere task chain

The final step is automating the entire forecast generation by using SAC Multi-actions and a Task-Chain in Datasphere – so that my user can trigger the calculation with a single button click from an SAC Story.

Triggering Task Chains from Multi-actions is a recently released feature. This blog post walks through how to set it up.

For details on how to trigger a Databricks Notebook from Datasphere, I recommend referring to this blog.

With everything in place, I create a Story, add my Seamless planning Model, and attach the Multi-action:

Running the Multi-action triggers the Task Chain, which in turn triggers the Databricks Notebook.

I can monitor the execution details in Datasphere:

and in Databricks:

Once the calculation completes, the updated forecast appears in the Story:

The end-to-end calculation took 2 minutes 45 seconds in total. The Task Chain in Datasphere was triggered almost instantly by the Multi-action; the Databricks Notebook execution itself took 1 minute 29 seconds, with the remaining time spent on Serverless Cluster startup.

From here, I can copy the calculated forecast into a new private version:

adjust the numbers as needed, and publish it as a new public version to Datasphere:

Conclusion

With SAP Business Data Cloud, it is possible to build a forecasting workflow that feels seamless to the end user — even though it spans multiple systems under the hood.

Companies using BW as the main Data Warehouse and Databricks for ML calculations or Data Science tasks can benefit from using the platform, as the data no longer needs to be physically copied out of BW.

What this scenario demonstrates is that once wrapped as a Data Product, BW sales data can be shared with Databricks via the Delta Sharing protocol. Databricks, in turn, can then create its own Data Products on top of the calculation results and share them back with Datasphere as a Remote Table.

A Seamless Planning model in SAC sits on top of that Remote Table, giving planners live access to the generated forecast. A single Multi-action in an SAC Story ties it all together, triggering a Datasphere Task Chain that kicks off the Databricks Notebook — completing the full cycle in under three minutes.

As SAP Business Data Cloud continues to mature, scenarios like this one are becoming achievable – leaving the complexity in the architecture and not in the workflow.

Contact

Ilya Kirzner
Consultant
biX Consulting
Diese Website verwendet Cookies, damit wir dir die bestmögliche Benutzererfahrung bieten können. Cookie-Informationen werden in deinem Browser gespeichert und führen Funktionen aus, wie das Wiedererkennen von dir, wenn du auf unsere Website zurückkehrst, und hilft unserem Team zu verstehen, welche Abschnitte der Website für dich am interessantesten und nützlichsten sind.