Strategies for Data Replication

2. RE: Strategies for Data Replication

TM Forum Member

Vance Shipley

Posted Aug 07, 2023 01:59

Jonathan,

I tend to favour the option of maintaining a subset of the customer information by subscribing to change events. This is the optimal solution for minimizing load while maximizing performance. If the API producer supports it, you can include a query portion in your subscription, which limits the data transferred to just what you need.
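As a rough illustration of the subset approach, the sketch below applies change events to a local store that keeps only the fields the consumer needs. The event type names and the `event.customer` payload shape loosely follow the TMF notification style, but they, the `NEEDED_FIELDS` tuple and the `project` and `apply_event` helpers are all assumptions made for this example, not taken from a specification.

```python
from typing import Any, Dict

# Hypothetical: the only customer attributes this consumer needs.
NEEDED_FIELDS = ("id", "name", "status", "lastUpdate")

Replica = Dict[str, Dict[str, Any]]


def project(customer: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the needed attributes that are present in the payload."""
    return {k: customer[k] for k in NEEDED_FIELDS if k in customer}


def apply_event(replica: Replica, notification: Dict[str, Any]) -> None:
    """Apply one create/change/delete notification to the local subset."""
    event_type = notification["eventType"]  # assumed field name
    customer = notification["event"]["customer"]  # assumed payload shape
    cid = customer["id"]
    if event_type == "CustomerCreateEvent":
        replica[cid] = project(customer)
    elif event_type == "CustomerAttributeValueChangeEvent":
        # A change event may carry only the changed attributes, so merge.
        replica.setdefault(cid, {"id": cid}).update(project(customer))
    elif event_type == "CustomerDeleteEvent":
        replica.pop(cid, None)


replica: Replica = {}
apply_event(replica, {
    "eventType": "CustomerCreateEvent",
    "event": {"customer": {"id": "42", "name": "Acme", "status": "approved",
                           "contactMedium": [{"mediumType": "email"}]}},
})
apply_event(replica, {
    "eventType": "CustomerAttributeValueChangeEvent",
    "event": {"customer": {"id": "42", "status": "suspended"}},
})
print(replica)
```

The `contactMedium` attribute never reaches the replica, which is the point of keeping a subset: less storage and less personal data held outside the master.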

The downside of this approach is the potential for loss of synchronization. My preference for solving that is to recognize a potential loss and resynchronize through a query on the collection for all items which have been updated since the time of the loss of synchronization (e.g. ?lastUpdate.gt=2023-08-07T04:20:00Z). Unfortunately not all entities have such an attribute, but we polymorphically add one where required in our own implementations. IMHO all API collections should include this (see AP-2223).
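A minimal sketch of that resynchronization query, assuming the collection exposes a `lastUpdate` attribute and supports the `.gt` filter operator as in the example above. The host, the collection path and the `resync_url` helper are hypothetical, and the optional `fields` parameter assumes the producer supports attribute selection.

```python
from datetime import datetime, timezone
from typing import List, Optional
from urllib.parse import urlencode


def resync_url(collection_url: str, lost_at: datetime,
               fields: Optional[List[str]] = None) -> str:
    """Build a query for all items updated since synchronization was lost."""
    if lost_at.tzinfo is None:
        raise ValueError("lost_at must be timezone-aware")
    stamp = lost_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    params = {"lastUpdate.gt": stamp}
    if fields:
        # Assumed: the producer supports a 'fields' attribute selector.
        params["fields"] = ",".join(fields)
    # Keep ':' and ',' readable instead of percent-encoding them.
    return collection_url + "?" + urlencode(params, safe=":,")


# Hypothetical endpoint, for illustration only.
url = resync_url(
    "https://example.com/tmf-api/customerManagement/v4/customer",
    datetime(2023, 8, 7, 4, 20, tzinfo=timezone.utc),
    fields=["id", "status"],
)
print(url)
```

The items returned by such a query can then be applied to the local copy in the same way as ordinary change events.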

------------------------------
Vance Shipley
SigScale
------------------------------

Original Message:
Sent: Aug 06, 2023 10:07
From: Jonathan Goldberg
Subject: Strategies for Data Replication

Very unusually, I'm here to ask for advice. I'd like to share with you a design dilemma and get your feedback. Bottom line is I'm interested to hear how you manage your strategies for data replication: when and why do you replicate (or perhaps you never replicate)? I'm positing a very simple example, not necessarily reflecting a real business case.

Thanks in advance for your thoughts :)

Let's suppose that two software systems are involved in achieving some business capability, say assessing credit risk:
* A credit risk calculation module
* A customer information module

The business requirement is that each customer's credit situation needs to be re-assessed every calendar month. So there is some job that runs each day that carries out the check on a subset of all the customers, such that the entire customer population is covered over the course of each month. Notifications of some sort will be raised for customers whose credit situation is not satisfactory.

The risk module needs customer information to perform its task, and there are multiple strategies for retrieving this information, such as:
* Invoke an API operation retrieve customer by ID against the customer module, on demand, each month
* Invoke a bulk API operation retrieve customers by list of IDs against the customer module, on demand, each month
* Maintain an exact copy of the customer information, and update the copy whenever a change is made in the master, by subscribing to Customer Create and Customer Change events.
* Maintain a subset of the customer information, optimized and transformed for its needs, and update the copy whenever a change is made in the master, by subscribing to Customer Create and Customer Change events.
* Other strategies?

There's a lot going on here:
* What is the cadence of the data changes in the customer module?
* How much does it cost to add storage?
* What are the implications of loss of synchronization between source and user?
* How fault-tolerant can the risk assessment process be?
* more?

And:
* Does it make a difference if the two modules are supplied by one vendor or by two different vendors?
* Does it make a difference if there is an industry standard (e.g. TMF Open API) for the API operations and/or events?

------------------------------
Jonathan Goldberg
Amdocs Management Limited
Any opinions and statements made by me on this forum are purely personal, and do not necessarily reflect the position of the TM Forum or my employer.
------------------------------

3. RE: Strategies for Data Replication

TM Forum Member

Matthieu Hattab

Posted Aug 07, 2023 04:11

When I took the ODA course and masterclass from TM Forum, there were some examples showing the power of ODA. Huawei was one of the examples given. They abandoned data replication, "implemented" ODA and implemented TMF APIs so that data would only exist in a single point of truth. This approach saved them millions of dollars in data storage alone.

For our pre-sales processes, we do a dunning check for existing customers, and we also chose the on-demand API because it didn't make sense to duplicate data in each system that needs the dunning status.

My 2 cents

PS: using data replication also raises more concerns about data privacy and GDPR requirements.

------------------------------
Kind regards,

Matthieu Hattab
Lyse Platform
------------------------------

4. RE: Strategies for Data Replication

TM Forum Member

Koen Peeters

Posted Aug 07, 2023 11:34

Jonathan,

I tend to agree with @Vance Shipley that subscribing to the change events is the better approach. It not only offers the best performance, it also makes it irrelevant whether the two modules are supplied by one vendor or by several.

Let's assume 10M customers with 1M changes each month.

An on-demand API call results in 10M API calls per month and tight coupling to the customer information module. The query volume could adversely impact the performance of the customer information module.
A bulk API call still retrieves 10M customer records, concentrated in a short period of time. It still has tight coupling to the customer information module and might make the customer information module unresponsive during the bulk operation.
Using Notifications (Create, Change, Delete events) to maintain a copy results in 1M events per month and achieves loose coupling with the customer information module. Maintaining an optimised subset will potentially reduce storage requirement and provide higher performance for the credit risk calculation module. Generating Notifications is low cost compared with query operations. This means that the impact on the performance of the customer information module is also negligible.
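The arithmetic behind this comparison can be written out as a small calculation. The customer and change counts are the ones assumed above. The 30-day month and the bulk page size of 1,000 IDs per request are extra assumptions, added only to show that paging reduces the number of requests but not the number of records read.

```python
CUSTOMERS = 10_000_000          # from the assumption above
CHANGES_PER_MONTH = 1_000_000   # from the assumption above
DAYS_PER_MONTH = 30             # assumed, for a rough daily rate
BULK_PAGE_SIZE = 1_000          # assumed IDs per bulk request

# On demand: one retrieval per customer per month.
on_demand_calls = CUSTOMERS
# Bulk: fewer requests, but every customer record is still read.
bulk_requests = CUSTOMERS // BULK_PAGE_SIZE
bulk_records = CUSTOMERS
# Events: one notification per change.
events = CHANGES_PER_MONTH

per_day_on_demand = on_demand_calls // DAYS_PER_MONTH
per_day_events = events // DAYS_PER_MONTH
reduction = on_demand_calls // events

print(f"on demand: {on_demand_calls:,} calls/month (~{per_day_on_demand:,}/day)")
print(f"bulk:      {bulk_requests:,} requests, {bulk_records:,} records/month")
print(f"events:    {events:,} events/month (~{per_day_events:,}/day)")
print(f"events carry {reduction}x fewer records than on-demand retrieval")
```

Under these numbers the event approach moves a tenth of the records, and the gap widens as the change rate falls relative to the customer base.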

Tight coupling means that the customer information module must be available and responsive during the credit risk assessment.

Using Notifications without an event bus (Kafka, MQ, ...) will reverse that dependency: the risk assessment module must be available whenever customer information changes, or synchronisation may be lost. When an event bus is used, loss of synchronisation is only temporary. An event driven architecture is eventually consistent.
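To illustrate why loss of synchronisation is only temporary with an event bus, here is a toy in-memory sketch. The `EventLog` and `Consumer` classes are hypothetical stand-ins for a durable log and a consumer that tracks its own offset. They do not use any real Kafka or MQ client API.

```python
from typing import Dict, List


class EventLog:
    """Toy stand-in for a durable, append-only event bus topic."""

    def __init__(self) -> None:
        self._events: List[Dict[str, str]] = []

    def append(self, event: Dict[str, str]) -> None:
        self._events.append(event)

    def read_from(self, offset: int) -> List[Dict[str, str]]:
        return self._events[offset:]


class Consumer:
    """Risk module side: remembers how far it has read."""

    def __init__(self, log: EventLog) -> None:
        self.log = log
        self.offset = 0
        self.replica: Dict[str, str] = {}

    def poll(self) -> int:
        """Apply every event published since the last poll."""
        batch = self.log.read_from(self.offset)
        for event in batch:
            self.replica[event["id"]] = event["status"]
        self.offset += len(batch)
        return len(batch)


log = EventLog()
consumer = Consumer(log)

# The producer keeps publishing while the consumer is down.
log.append({"id": "42", "status": "approved"})
log.append({"id": "42", "status": "suspended"})
log.append({"id": "7", "status": "approved"})

# When the consumer comes back it catches up from its stored offset.
caught_up = consumer.poll()
print(caught_up, consumer.replica)
```

The producer never waits for the consumer, and the consumer's copy converges once it has replayed the backlog, which is the eventual consistency described above.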

When synchronising legacy applications that don't offer support for Notifications, it is sometimes even better to use CDC (Change Data Capture) techniques to achieve loose coupling. Instead of business Events as in TMF OpenAPI, this uses low level DB information (transaction log or triggers) to generate events. In this case the volume of events will be higher but it still achieves the high performance and loose coupling aspects of the EDA.
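A trigger-based variant of CDC can be sketched with SQLite from the Python standard library. The `customer` and `customer_changes` tables and the trigger names are hypothetical. A real legacy system would more likely use its own database's triggers or a transaction log reader.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id TEXT PRIMARY KEY, name TEXT, status TEXT);

-- Hypothetical change table that a relay process would read and publish.
CREATE TABLE customer_changes (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT NOT NULL,
    customer_id TEXT NOT NULL
);

CREATE TRIGGER customer_ins AFTER INSERT ON customer BEGIN
    INSERT INTO customer_changes (op, customer_id) VALUES ('INSERT', NEW.id);
END;
CREATE TRIGGER customer_upd AFTER UPDATE ON customer BEGIN
    INSERT INTO customer_changes (op, customer_id) VALUES ('UPDATE', NEW.id);
END;
CREATE TRIGGER customer_del AFTER DELETE ON customer BEGIN
    INSERT INTO customer_changes (op, customer_id) VALUES ('DELETE', OLD.id);
END;
""")

# The legacy application writes as usual and knows nothing about the capture.
conn.execute("INSERT INTO customer VALUES ('42', 'Acme', 'approved')")
conn.execute("UPDATE customer SET status = 'suspended' WHERE id = '42'")
conn.execute("UPDATE customer SET name = 'Acme Ltd' WHERE id = '42'")
conn.execute("DELETE FROM customer WHERE id = '42'")

changes = conn.execute(
    "SELECT op, customer_id FROM customer_changes ORDER BY seq"
).fetchall()
print(changes)
```

Each row-level write produces its own change record, which is why the event volume tends to be higher than with business-level notifications.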

Regards

------------------------------
Koen Peeters
OryxGateway FZ LLC
------------------------------

6. RE: Strategies for Data Replication

TM Forum Member

Amit Khare

Posted Aug 08, 2023 22:56

Hi Jonathan,

In the given example, I would take the path of using the API to get customer data, since dunning is not the master of customer data and creating another snapshot of the customer would increase the complexity of the system. Even though event-driven architecture has advanced to the point of handling CDC, event sourcing doesn't fit the given example: it would further increase operational cost and, most of the time, break the rule of simplicity.

But in other scenarios like network profile, billing account etc., where the semantics of customer data are specific to the domain, the better approach is "Maintains a subset of the customer information, optimized and transforms to its needs, and updates the copy whenever a change is made in the master, by subscribing to Customer Create and Customer Change events." All BPM processes and use cases impacting customer data need to cover all domains via API calls, and this is where the TMF specification helps to remove vendor-specific coupling between two systems. Since the semantics of customer data differ for each system, the overhead of data storage should be minimized, along with the compliance exposure around PII and GDPR.

------------------------------
Amit Khare
Tech Mahindra Limited
------------------------------
