How to Use Synthetic Datasets in Machine Learning Models

Machine learning models are only as good as the datasets they are built on. Here’s how synthetic data can help build robust and reliable models.

Poornima Apte, Contributor

May 19, 2022

Banks are eager to stamp out financial fraud. After all, every dollar lost to fraud costs financial institutions four times as much to resolve, according to a LexisNexis Risk Solutions study.

To help stem these losses, banks have begun to adopt AI and machine learning models for detecting fraud patterns. Despite the volume of transactions banks work with, however, the number of fraud cases the ML models can train on is comparatively small. This is where the use of synthetic datasets can help.

What Is Synthetic Data?

A synthetic dataset is a statistically representative version of a real dataset. The synthetic dataset does not contain any of the original’s real information, but it preserves its statistical characteristics. To put it simply, a synthetic dataset looks and acts like the original without any of the original’s information. For modeling and simulation purposes, close enough is good enough.
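One way to make "preserves its statistical characteristics" concrete is to compare simple summary statistics of the real and synthetic tables. The sketch below is only an illustration under stated assumptions: the two-column toy table (amount, hour of day), the helper names `column_stats` and `fidelity_gap`, and the choice of per-column mean and standard deviation as the fidelity measure are all hypothetical and not taken from any vendor's method. A production check would also look at correlations between columns.

```python
import random
import statistics


def column_stats(rows):
    """Return (mean, standard deviation) for each column of numeric rows."""
    return [(statistics.fmean(col), statistics.stdev(col)) for col in zip(*rows)]


def fidelity_gap(real_rows, synthetic_rows):
    """Largest absolute difference in any column mean or standard deviation."""
    gaps = []
    pairs = zip(column_stats(real_rows), column_stats(synthetic_rows))
    for (mean_r, std_r), (mean_s, std_s) in pairs:
        gaps.append(abs(mean_r - mean_s))
        gaps.append(abs(std_r - std_s))
    return max(gaps)


# Toy stand-in for a real table: (amount, hour of day). Values are made up.
rng = random.Random(0)
real = [(rng.gauss(50.0, 10.0), rng.gauss(14.0, 3.0)) for _ in range(2000)]

# Synthetic rows are drawn from normals fitted to each real column,
# so no synthetic row is copied from the real table.
fitted = column_stats(real)
synthetic = [tuple(rng.gauss(m, s) for m, s in fitted) for _ in range(2000)]

gap = fidelity_gap(real, synthetic)
print(f"largest gap in mean or standard deviation: {gap:.3f}")
```

A small gap suggests the synthetic table "looks and acts like" the original on these measures, while sharing none of its rows.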

Synthetic data is generated by first fitting a generative model that recreates the original dataset as accurately as possible, warts and all. That model can then create additional rows of data, or amplify and augment only select portions. A bank, for example, might generate synthetic datasets that have a higher prevalence of fraud than the real dataset does. Doing so gives the machine learning fraud detection model more samples to train on.
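The amplification step can be sketched in a few lines. This is a deliberately simple stand-in, not how any particular product works: it assumes numeric columns, fits an independent normal distribution to each column per class, and then samples enough fraud rows to reach a chosen share. The function names (`fit_class_model`, `sample_rows`, `amplify_fraud`), the toy data, and the 20% target are all hypothetical.

```python
import random
import statistics


def fit_class_model(rows):
    """Fit an independent normal (mean, stdev) to each numeric column."""
    return [(statistics.fmean(col), statistics.stdev(col)) for col in zip(*rows)]


def sample_rows(model, n, rng):
    """Draw n synthetic rows from the fitted per-column normals."""
    return [tuple(rng.gauss(m, s) for m, s in model) for _ in range(n)]


def amplify_fraud(legit, fraud, target_fraud_share, rng):
    """Return synthetic (row, label) pairs with roughly the requested fraud share."""
    if not 0.0 < target_fraud_share < 1.0:
        raise ValueError("target_fraud_share must be between 0 and 1")
    n_legit = len(legit)
    # Solve n_fraud / (n_legit + n_fraud) = target_fraud_share for n_fraud.
    n_fraud = round(n_legit * target_fraud_share / (1.0 - target_fraud_share))
    data = [(row, 0) for row in sample_rows(fit_class_model(legit), n_legit, rng)]
    data += [(row, 1) for row in sample_rows(fit_class_model(fraud), n_fraud, rng)]
    rng.shuffle(data)
    return data


# Toy data (made up): 990 legitimate and 10 fraudulent (amount, hour) rows.
rng = random.Random(42)
legit = [(rng.gauss(40.0, 8.0), rng.gauss(13.0, 3.0)) for _ in range(990)]
fraud = [(rng.gauss(300.0, 60.0), rng.gauss(3.0, 1.0)) for _ in range(10)]

synthetic = amplify_fraud(legit, fraud, target_fraud_share=0.2, rng=rng)
fraud_share = sum(label for _, label in synthetic) / len(synthetic)
print(f"fraud share in synthetic set: {fraud_share:.3f}")
```

In the toy data fraud makes up 1% of rows; the synthetic set raises that to about 20%, giving a classifier far more positive examples to learn from.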

Use of Synthetic Data

In addition to fraud detection, synthetic data has applications for organizations concerned about using datasets with personally identifiable information (PII). “That’s becoming increasingly important because of regulations that are pushing toward more sustainable solutions to laws concerning data protection [like GDPR],” said Harry Keen, CEO and co-founder of Hazy, a startup that provides synthetic data to financial institutions.

Anonymizing PII is one option to preserve privacy. However, in instances where data origin is suspect, and compliance is on the line, organizations can lean on synthetic data to skirt the issue altogether.

Cheaper synthetic data can also be used in instances where real-world data is expensive to source. “It’s going to allow for more agile decision-making when you work with a dataset that’s 95% reflective of actual real-world data,” said Steven Karan, vice president and head of insights and data at Capgemini Canada. The (lower) cost of synthetic data depends on the use case, Karan added. “On common use cases such as geolocation data, synthetic data will generally cost anywhere between 60% to 70% less than actual third-party data.”

The proportion of real and synthetic data that should be piped into a model varies by use case, Keen noted. “From a compliance and trust perspective, you may not want to train a machine vision system for self-driving cars entirely on synthetic data without any real-world use cases and then release it into the wild,” he said. “But when used to detect fraud, you can amplify different cases of fraud and may want to use a much higher proportion of synthetic data because you can prove your algorithms work much better in the real world.”
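Since the right proportion varies by use case, it helps to make the mix an explicit, tunable parameter. The following is a minimal sketch under assumptions: the helper `mix_training_set`, its parameters, and the 75% synthetic share are illustrative choices, not recommendations from the people quoted here.

```python
import random


def mix_training_set(real_rows, synthetic_rows, synthetic_share, size, rng):
    """Draw `size` rows without replacement, with the given share synthetic."""
    if not 0.0 <= synthetic_share <= 1.0:
        raise ValueError("synthetic_share must be between 0 and 1")
    n_synthetic = round(size * synthetic_share)
    n_real = size - n_synthetic
    if n_real > len(real_rows) or n_synthetic > len(synthetic_rows):
        raise ValueError("not enough rows to build the requested mix")
    mixed = rng.sample(real_rows, n_real) + rng.sample(synthetic_rows, n_synthetic)
    rng.shuffle(mixed)
    return mixed


# Rows are tagged with their origin only so the mix is easy to inspect.
real_rows = [("real", i) for i in range(100)]
synthetic_rows = [("synthetic", i) for i in range(100)]

training_set = mix_training_set(
    real_rows, synthetic_rows, synthetic_share=0.75, size=80, rng=random.Random(7)
)
n_synthetic = sum(1 for origin, _ in training_set if origin == "synthetic")
print(f"{n_synthetic} of {len(training_set)} training rows are synthetic")
```

Treating the share as a parameter lets a team train the same model at several mixes and compare results on held-out real data before settling on one.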

Limitations of Synthetic Data

When using synthetic data to compensate for rare outlier cases, data scientists must tread carefully. In fraud detection, synthetic data can amplify original data to generate greater numbers of fraud examples, but that does not mean it can cover all categories of fraud. Each outlier category of fraud must still be present in the original dataset. While synthetic data can generate more volume, it cannot produce an entirely new kind of data on its own.
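A toy example shows why. The sketch below assumes a very simple generator that samples fraud categories in proportion to how often they were observed; the category names and the helper `fit_category_sampler` are hypothetical. Real generative models are far more sophisticated, but they share the property that they learn only from what the original data contains.

```python
import random
from collections import Counter


def fit_category_sampler(observed_labels):
    """Build a sampler that reproduces the observed category frequencies."""
    counts = Counter(observed_labels)
    categories = sorted(counts)
    weights = [counts[c] for c in categories]

    def sample(n, rng):
        return rng.choices(categories, weights=weights, k=n)

    return sample


# Hypothetical fraud categories seen in the original data.
observed = ["card_not_present"] * 8 + ["account_takeover"] * 2

sampler = fit_category_sampler(observed)
generated = sampler(10_000, random.Random(0))

# However many rows are generated, no category outside the originals appears.
unseen = set(generated) - set(observed)
print(f"generated {len(generated)} labels, unseen categories: {unseen}")
```

Ten observations become ten thousand, but a fraud type missing from the original data, for example a new scam pattern, never shows up in the output.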

In addition, a synthetic dataset cannot provide insights down to the individual level. That’s because a synthetic dataset does not map directly to a real-world dataset, Keen explained. So, while the ML models can deliver higher-level demographic insights, they cannot be used for personalization down to an individual customer.

What the Future Holds

Karan is excited about the future use of AI to generate synthetic data. It is an active area of exploration.

Research firm Gartner predicts that by 2024 about 60% of the data used for AI and analytics projects will be synthetically generated.

Keen said synthetic data will see more uses at some point, including in developing ML models for autonomous vehicles, with larger datasets that contain more edge use cases. “It is a really valuable way to give these systems more data to understand how to drive in different scenarios that they may not have seen yet,” he said.

Conclusion

Today, the strongest use of synthetic data is for enterprises sitting on banks of data that might, for various reasons, be unusable in ML models. “Synthetic data can create safe, hyper-realistic datasets so you don’t have to use your production data in nonproduction environments,” Keen said. Enterprises no longer have to play with fire by using sensitive information to develop ML models. Synthetic data offers an effective alternative.

About the Author

Poornima Apte

Contributor

Poornima Apte is a trained engineer turned writer who specializes in the fields of robotics, AI, IoT, 5G, cybersecurity, and more. Winner of a reporting award from the South Asian Journalists’ Association, Poornima loves learning and writing about new technologies—and the people behind them. Her client list includes numerous B2B and B2C outlets, who commission features, profiles, white papers, case studies, infographics, video scripts, and industry reports. Poornima reviews literary fiction for industry publications, is a card-carrying member of the Cloud Appreciation Society, and is happy when she makes “Queen Bee” in the New York Times Spelling Bee.

https://www.linkedin.com/in/poornimaapte/
