How to Use Synthetic Datasets in Machine Learning Models

Banks are eager to stamp out financial fraud. After all, every dollar lost to fraud costs financial institutions four times as much to resolve, according to a LexisNexis Risk Solutions study.

To help stem these losses, banks have begun to adopt AI and machine learning models for detecting fraud patterns. Despite the volume of transactions banks work with, however, the number of fraud cases the ML models can train on is comparatively small. This is where the use of synthetic datasets can help.

What Is Synthetic Data?

A synthetic dataset is a statistically representative version of a real dataset. The synthetic dataset does not contain any of the original’s real information, but it preserves its statistical characteristics. To put it simply, a synthetic dataset looks and acts like the original without any of the original’s information. For modeling and simulation purposes, close enough is good enough.

Synthetic data is generated by first training a generative model to recreate the original dataset as accurately as possible, warts and all. This generative model can then create additional rows of data or amplify and augment only select portions. A bank, for example, might generate synthetic datasets that have a higher prevalence of fraud than the real dataset does. Doing so gives the machine learning fraud detection model more samples to train on.
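To make the idea concrete, here is a minimal sketch in Python of that two-step process: fit a simple generative model to the original rows, then sample new rows with a higher share of fraud. Everything in it is an assumption made for illustration, including the toy transaction amounts, the log-normal model, the function names fit_model and generate, and the 30% target fraud rate. Commercial synthetic data tools use far richer models than this.

```python
import math
import random
import statistics

def fit_model(rows):
    """Fit a log-normal amount distribution per label.

    rows is a list of (amount, is_fraud) tuples. This is a deliberately
    simple stand-in for a real generative model.
    """
    model = {}
    for label in (False, True):
        logs = [math.log(amount) for amount, is_fraud in rows if is_fraud == label]
        model[label] = (statistics.mean(logs), statistics.stdev(logs))
    return model

def generate(model, fraud_rate, n, rng):
    """Sample n synthetic rows; fraud_rate is chosen by the user, not copied from the original."""
    rows = []
    for _ in range(n):
        is_fraud = rng.random() < fraud_rate
        mu, sigma = model[is_fraud]
        rows.append((rng.lognormvariate(mu, sigma), is_fraud))
    return rows

rng = random.Random(0)

# Toy stand-in for real transactions: 980 legitimate, 20 fraudulent (2% fraud).
real = [(rng.lognormvariate(3.5, 0.5), False) for _ in range(980)]
real += [(rng.lognormvariate(6.5, 0.7), True) for _ in range(20)]

model = fit_model(real)
# Hypothetical target: amplify fraud to roughly 30% of the synthetic rows.
synthetic = generate(model, fraud_rate=0.30, n=1000, rng=rng)

real_share = sum(is_fraud for _, is_fraud in real) / len(real)
synthetic_share = sum(is_fraud for _, is_fraud in synthetic) / len(synthetic)
print(f"fraud share: real={real_share:.2%}, synthetic={synthetic_share:.2%}")
```

The synthetic rows are drawn from the fitted distributions rather than copied from the originals, which is why the fraud share can be changed independently of the real data.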

Use of Synthetic Data

In addition to fraud detection, synthetic data has applications for organizations concerned about using datasets with personally identifiable information (PII). “That’s becoming increasingly important because of regulations that are pushing toward more sustainable solutions to laws concerning data protection [like GDPR],” said Harry Keen, CEO and co-founder of Hazy, a startup that provides synthetic data to financial institutions.

Anonymizing PII is one option to preserve privacy. However, where the origin of the data is suspect and compliance is on the line, organizations can lean on synthetic data to sidestep the issue altogether.

Cheaper synthetic data can also be used in instances where real-world data is expensive to source. “It’s going to allow for more agile decision-making when you work with a dataset that’s 95% reflective of actual real-world data,” said Steven Karan, vice president and head of insights and data at Capgemini Canada. The (lower) cost of synthetic data depends on the use case, Karan added. “On common use cases such as geolocation data, synthetic data will generally cost anywhere between 60% to 70% less than actual third-party data.”

The proportion of real and synthetic data that should be piped into a model varies by use case, Keen noted. “From a compliance and trust perspective, you may not want to train a machine vision system for self-driving cars entirely on synthetic data without any real-world use cases and then release it into the wild,” he said. “But when used to detect fraud, you can amplify different cases of fraud and may want to use a much higher proportion of synthetic data because you can prove your algorithms work much better in the real world.”
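As a rough illustration of treating that proportion as a tunable setting, the Python sketch below builds a training set with a chosen synthetic share. The function name mix_training_set, the placeholder rows, and the 20% and 80% shares are all hypothetical values picked for the example. They are not figures from Keen or Hazy.

```python
import random

def mix_training_set(real_rows, synthetic_rows, synthetic_fraction, size, rng):
    """Return a shuffled training set of `size` rows with the requested synthetic share."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    n_synthetic = round(size * synthetic_fraction)
    n_real = size - n_synthetic
    if n_real > len(real_rows) or n_synthetic > len(synthetic_rows):
        raise ValueError("not enough rows to build the requested mix")
    # Sample without replacement from each pool, then shuffle the combined set.
    mixed = rng.sample(real_rows, n_real) + rng.sample(synthetic_rows, n_synthetic)
    rng.shuffle(mixed)
    return mixed

# Placeholder rows tagged with their source so the mix can be inspected.
real_rows = [("real", i) for i in range(500)]
synthetic_rows = [("synthetic", i) for i in range(2000)]
mix_rng = random.Random(42)

# Hypothetical shares: mostly real data for a safety-critical vision model,
# mostly synthetic data for a fraud model that needs amplified fraud cases.
vision_set = mix_training_set(real_rows, synthetic_rows, 0.2, 500, mix_rng)
fraud_set = mix_training_set(real_rows, synthetic_rows, 0.8, 1000, mix_rng)

print(sum(source == "synthetic" for source, _ in vision_set) / len(vision_set))
print(sum(source == "synthetic" for source, _ in fraud_set) / len(fraud_set))
```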

Limitations of Synthetic Data

When using synthetic data to compensate for rare outlier cases, data scientists must tread carefully. In fraud detection, synthetic data can amplify original data to generate greater numbers of fraud examples, but that does not mean it can cover all categories of fraud. Each outlier category of fraud must still be present in the original dataset. While synthetic data can generate more volume, it cannot produce an entirely new kind of data on its own.
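The limitation can be shown with a small Python sketch. It assumes a toy model that learns only the category frequencies of the original labels and then boosts selected categories. The category names, the amplify function, and the boost factors are hypothetical. A category that never appears in the original gets no weight, so no amount of boosting makes it appear in the synthetic output.

```python
import random
from collections import Counter

def amplify(counts, boost, n, rng):
    """Sample n labels from the observed categories, scaling weights by `boost`.

    Only categories seen in `counts` can be sampled; boost entries for
    unseen categories are ignored because the model has nothing to learn from.
    """
    categories = sorted(counts)
    weights = [counts[c] * boost.get(c, 1.0) for c in categories]
    return rng.choices(categories, weights=weights, k=n)

# Toy original labels with two fraud categories (names are illustrative only).
original = ["legit"] * 970 + ["card_not_present"] * 20 + ["account_takeover"] * 10
counts = Counter(original)

# Request a boost for a third fraud category that is absent from the original.
boost = {"card_not_present": 20.0, "account_takeover": 20.0, "synthetic_identity": 20.0}
synthetic_counts = Counter(amplify(counts, boost, 5000, random.Random(1)))

print(synthetic_counts)
print("synthetic_identity" in synthetic_counts)
```

The two fraud categories present in the original become much more common, while the requested third category is still missing from the synthetic labels.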

In addition, a synthetic dataset cannot provide insights down to the individual dataset level. That’s because a synthetic dataset does not map directly to a real-world dataset, Keen explained. So, while the ML models can deliver higher-level demographic insights, they cannot be used for personalization down to an individual customer.

What the Future Holds

Karan is excited about the future use of AI to generate synthetic data. It is an active area of exploration.

Research firm Gartner predicts that by 2024 about 60% of the data used for AI and analytics projects will be synthetically generated.

Keen said synthetic data will see more uses at some point, including in developing ML models for autonomous vehicles, with larger datasets that contain more edge use cases. “It is a really valuable way to give these systems more data to understand how to drive in different scenarios that they may not have seen yet,” he said.

Conclusion

Today, the strongest use of synthetic data is for enterprises sitting on banks of data that might, for various reasons, be unusable in ML models. “Synthetic data can create safe, hyper-realistic datasets so you don’t have to use your production data in nonproduction environments,” Keen said. Enterprises no longer have to play with fire by using sensitive information to develop ML models. Synthetic data offers an effective alternative.
