Data Synthesizers on Amazon SageMaker: An Adversarial Gaussian Mixture Model vs XGBoost Architecture

Posted by Or Hiltch Jul 25, 2019

Synthetic data generation dates back to the 1990s, and is increasingly utilized today as a way of filling in gaps on data-driven platforms where real data is scarce or otherwise difficult to obtain. One prominent example of an innovative sector that has successfully used synthetic data is the autonomous vehicle industry. Data scientists realized that instead of driving around endlessly recording street mapping information, traffic, and other ambient data, it would be much faster and cheaper to train the car's neural networks on artificial data, using it to create a simulation that replicated normal street driving.

Skyline AI builds predictive AI models for the commercial real estate world, and as one might guess, there are far fewer commercial properties out there than residential ones. Synthesizing data can help supplement missing data, bridging the predictive-model gap between data-rich single-family residential units and data-scarcer asset classes such as multifamily and commercial. It also enables better testing of current models, validating their robustness to market dynamics. With the rise of powerful AI algorithms and methods such as deep neural nets and extreme gradient boosting, artificially synthesized data is helping us gain deeper real-world insights than ever before possible.

Synthetic data generators often use deep neural networks, which consume a great deal of compute resources. For this reason, Amazon SageMaker is a perfect platform for developing such models. 

Much has been written about how GANs can be used to generate reliable synthetic data. In this writeup, we’ll see how a somewhat unexpected unsupervised learning model (Gaussian Mixture Model) could be deployed in a similar adversarial architecture together with XGBoost to create reliable synthetic data. 

The Challenge in Creating Reliable Synthetic Data

Creating quality synthetic data can be challenging when dealing with high-dimensional vectors. While it is relatively easy to fit a univariate distribution, it is extremely difficult to infer the joint multivariate one.

Take your own neighborhood, for example: say you wanted to create a "fake" building somewhere near the building where you live. Assuming you know the estimated age of the buildings in your neighborhood, you could probably come up with a reliable "year built" data point. Let's say you pick 1990, an average vintage year for buildings in your area.

Now you are asked to come up with another value: ceiling height. But wait! It turns out that the buildings in your area that were built prior to 1990 feature higher ceilings, as was the style back then.

Now add occupancy, rent, concessions, amenities in the building (pool, gym, etc.), and more. You get the picture: it's hard to come up with reliable synthetic data because we have to produce values that make sense across multiple distributions that depend on each other.

The Adversarial Concept

The idea behind an adversarial architecture data synthesizer is to use two entities that work ‘against’ each other: a generator, which generates synthetic data, and a detector, which tries to differentiate the real data from the synthetic. 

Through this adversarial relationship, both models keep improving up to a point. If the generator becomes so good that the detector has a hard time telling real data from synthetic, we have ourselves a solid generator!

Please note that this post is part of the Skyline AI Engineering Blog Series. For an executive, layperson-friendly overview of Skyline AI's Synth City, please check out our latest White Paper.

The Detector

To build the detector, we’ll be using an XGBoost classifier (available on SageMaker). We will use XGBoost to train a prediction model that learns to classify real data vs fake data.

SageMaker has a neat concept of Estimators, a high-level interface for SageMaker training. You can instantiate an XGBoost estimator with just a few lines (more on this in the SageMaker documentation).
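Something along these lines works; the region lookup, container version, instance type, and S3 output path below are illustrative placeholders:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Pull the URI of the managed XGBoost container for the current region
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

xgb_estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<your-bucket>/synth-detector/output",  # placeholder bucket
    sagemaker_session=session,
)

xgb_estimator.set_hyperparameters(
    objective="binary:logistic",  # real (0) vs. synthetic (1)
    eval_metric="auc",
    num_round=100,
)
```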

Or, if you'd like to do it the old-fashioned way and run XGBoost directly in a SageMaker notebook instance, you can install it by adding the following cell to your notebook:

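A single cell is enough (assuming the notebook's kernel has internet access):

```python
# Install the open-source XGBoost package into the notebook's kernel
!pip install xgboost
```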

In building our classifier, we’ll denote class = 0 as authentic and class = 1 as fake data. To create our labeled data set of real vs fake data, we’ll start off by building a baseline data synthesizer and marking its output as class = 1 (fake) while using some of our authentic data with class = 0 (real).

A Baseline Data Synthesizer

Prior to developing a fancy ML model, it’s always a good idea to start off with a simple model, which will provide us with a comparable baseline to estimate our model’s performance. 

In the case of our data synthesizer, we'll build a naive model that fits a normal (Gaussian) distribution to each of our data columns using its mean and standard deviation, and then samples a random value from each distribution:

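A sketch of that baseline, with `real_df` standing in for our authentic dataset (the name is illustrative):

```python
import numpy as np
import pandas as pd

def naive_synthesize(df: pd.DataFrame, n_rows: int) -> pd.DataFrame:
    """Treat every column as an independent Gaussian and sample from it."""
    rng = np.random.default_rng(42)
    synth = {
        col: rng.normal(loc=df[col].mean(), scale=df[col].std(), size=n_rows)
        for col in df.columns
    }
    return pd.DataFrame(synth)

# Generate as many fake rows as we have real ones
baseline_fake_df = naive_synthesize(real_df, n_rows=len(real_df))
```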

Performance Metrics 

To benchmark our model's performance, we'll be using precision, recall, and ROC-AUC, all available in the sklearn library:

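These come straight from scikit-learn; for convenience we can wrap them in a small helper (purely illustrative):

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def report(y_true, y_score, threshold=0.5):
    """Print ROC-AUC on raw scores, plus precision/recall on thresholded predictions."""
    y_pred = (y_score > threshold).astype(int)
    print("ROC-AUC:  ", roc_auc_score(y_true, y_score))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall:   ", recall_score(y_true, y_pred))
```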

Preparing the Data for XGBoost

Now that we have our performance metrics and our initial model's fake data, let's prepare them for training with XGBoost. XGBoost uses DMatrix, an internal data structure to hold and transform data. We'll build a train set and a test set by randomly sampling 70% of the data for training and holding out the remaining 30% for testing.

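A sketch of that preparation, reusing `real_df` and `baseline_fake_df` from above:

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Label real rows 0 and baseline-synthesized rows 1
data = pd.concat([real_df.assign(label=0), baseline_fake_df.assign(label=1)])
X = data.drop(columns="label")
y = data["label"]

# 70% for training, the remaining 30% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=42
)

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
```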

Now we'll train the XGBoost model to classify real (class=0) vs. synthetic (class=1) data points, keeping an eye on recall over the test set.

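A sketch of the training step; the hyperparameters are illustrative, and the booster optimizes logistic loss while we monitor AUC on the held-out set:

```python
import xgboost as xgb

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "max_depth": 6,
    "eta": 0.1,
}

bst = xgb.train(
    params,
    dtrain,
    num_boost_round=200,
    evals=[(dtest, "test")],
    early_stopping_rounds=20,
)

# How well does the detector separate real from fake on the held-out set?
report(y_test, bst.predict(dtest))
```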

For our baseline model, the precision isn't too bad, but a recall of 0.86 is pretty lame: it means our XGBoost detector was able to correctly flag the vast majority of our synthetic records as fake. Remember, our goal is to fool the detector. Let's see how we can improve on that!

Gaussian Mixture Models (GMMs): How a Clustering Model Can Actually Be Used to Synthesize Data

GMM is a very powerful unsupervised learning algorithm, often used as a clustering method. Fundamentally, though, it is an algorithm for density estimation: depending on its configuration (mainly the number of components and the covariance type), a GMM isn't so much finding separated clusters of data as modeling the overall distribution of the input data. That makes it a generative model of the distribution, meaning a fitted GMM can generate synthetic data that follows a distribution similar to our original data's.

In the Python Data Science Handbook, Jake VanderPlas provides a great example of this with the make_moons dataset found in sklearn. Consider the following scatter plot:

(Scatter plot: the make_moons dataset, two interleaving half-moon clusters)

With the naked eye, we identify two distributions here. Suppose we fit a GMM with two components to cluster them. The results aren’t quite so good:

(Plot: a two-component GMM fit to the moons data)

But what if we use 16 components?

(Plot: a 16-component GMM fit to the moons data)

You can see that the GMM learned the overall distribution of the input data. Now we can use the fitted model to create new reliable artificial data points!
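A sketch of that example, using scikit-learn's toy dataset:

```python
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Two components cannot follow the crescent shapes...
gmm2 = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X_moons)

# ...but 16 components approximate the overall density nicely
gmm16 = GaussianMixture(n_components=16, covariance_type="full", random_state=0).fit(X_moons)

# Because a GMM is a generative density model, we can draw brand-new points from it
X_new, _ = gmm16.sample(400)
print(X_new[:5])
```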

Prior to testing out a few alternatives for the GMM hyperparameters, it's a good idea to reduce the dimensionality of our original dataset with PCA. The fewer variable distributions we have to model, the easier it will be to synthesize a complete artificial vector.

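A sketch of the reduction step; the variance threshold is illustrative and `real_df` is again our authentic dataset:

```python
from sklearn.decomposition import PCA

# Keep enough principal components to explain ~95% of the variance
pca = PCA(n_components=0.95, whiten=True, random_state=0)
X_reduced = pca.fit_transform(real_df.values)
print(X_reduced.shape)
```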

Selecting the Number of Components for GMM

One common metric for measuring a GMM's fit is the Akaike Information Criterion (AIC). Let's fit a few GMMs that vary in the number of components and check their fit to our dataset using AIC:

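A sketch of the scan (the range of component counts is illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

component_counts = range(2, 60, 2)
models = [
    GaussianMixture(n, covariance_type="full", random_state=0).fit(X_reduced)
    for n in component_counts
]
aics = [m.aic(X_reduced) for m in models]

plt.plot(list(component_counts), aics, marker="o")
plt.xlabel("Number of components")
plt.ylabel("AIC")
plt.show()
```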

Finding the Optimal Number of Components for our Data

Plotting AIC (Y-axis) against the number of components (X-axis), the magic number appears to be 20 in our case (note how the AIC behaves just before and after 20 components).


AIC vs the number of components

Now we'll fit a 20-component GMM to our data. Following this, we'll ask the GMM to sample random data from the fitted distribution, and we'll inverse-transform the result using our PCA model to reconstruct a DataFrame in the same format as our original data:

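The fitting step itself is short, reusing `X_reduced` from the PCA step:

```python
from sklearn.mixture import GaussianMixture

# 20 components, as suggested by the AIC plot above
gmm = GaussianMixture(n_components=20, covariance_type="full", random_state=0)
gmm.fit(X_reduced)
```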

Fooling our XGBoost Detector

How will our detector perform with the GMM-synthesized data? We'll use the same XGBoost detector trained before to try to tell real data points from synthetic ones.

Let's create artificial data using our GMM model. We ask our fitted GMM to sample the same number of rows we had in our original data, invert the PCA-reduced samples back into the original feature space, and create a pandas DataFrame from the synthesized data:

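A sketch, reusing `gmm`, `pca`, and `real_df` from the previous steps:

```python
import pandas as pd

# Draw as many synthetic rows as we have real ones
X_sampled, _ = gmm.sample(len(real_df))

# Map the samples back from PCA space into the original feature space
X_synth = pca.inverse_transform(X_sampled)

synth_df = pd.DataFrame(X_synth, columns=real_df.columns)
synth_df.head()
```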

We run the same XGBoost detector on this data, checking the same metrics as before: ROC-AUC, precision, and recall.

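A sketch of that check; `bst` is the detector trained earlier and `real_test_df` stands in for a held-out slice of authentic data (both names are illustrative):

```python
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score

# Mix held-out real rows (label 0) with the GMM-synthesized rows (label 1)
eval_df = pd.concat([real_test_df.assign(label=0), synth_df.assign(label=1)])
X_eval = eval_df.drop(columns="label")
y_eval = eval_df["label"]

# Score with the previously trained detector
scores = bst.predict(xgb.DMatrix(X_eval))
print("ROC-AUC:", roc_auc_score(y_eval, scores))
```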

It seems the ROC has slightly improved. But the interesting question is, has the recall improved?

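Thresholding the same scores at 0.5 (an illustrative cut-off) gives us precision and recall for the fake class:

```python
from sklearn.metrics import precision_score, recall_score

y_pred = (scores > 0.5).astype(int)
print("Precision:", precision_score(y_eval, y_pred))
print("Recall:   ", recall_score(y_eval, y_pred))
```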

The recall dropped from 86% to just 20%: our GMM data synthesizer is significantly better at fooling our XGBoost classifier, which is no longer able to pick up on a significant portion of our artificial records!

Summary

We have seen that powerful ML algorithms can (somewhat unexpectedly) be used to come up with synthetic data generators. Powered by a strong compute platform for ML like SageMaker, we can train and fit XGBoost, PCA, and GMMs over large datasets, creating synthetic data that is capable of fooling even a state-of-the-art ML algorithm such as the XGBoost classifier.