Synthetic data generation dates back to the 1990s, and is increasingly utilized today as a way of filling in gaps on data-heavy platforms where data is scarce or otherwise difficult to obtain. One prominent example of an innovative sector that has successfully used synthetic data is the autonomous vehicle industry. Data scientists realized that instead of driving around endlessly recording street mapping information, traffic, and other ambient data, it would be much faster and cheaper to train the car’s neural networks simply by using artificial data to create a simulation that replicated normal street driving.
Skyline AI builds predictive AI models for the commercial real estate world, and as one might guess, there are far fewer commercial properties out there than residential ones. Synthesizing data can help supplement missing data, bridging the predictive model gap between single-family residential units and smaller asset classes such as multifamily and commercial. It also enables better testing of current models, validating their robustness to market dynamics. With the rise of powerful AI algorithms and methods such as deep neural nets and extreme gradient boosting, artificially synthesized data is helping us gain deeper real-world insights than was ever possible before.
Synthetic data generators often use deep neural networks, which consume a great deal of compute resources. For this reason, Amazon SageMaker is a perfect platform for developing such models.
Much has been written about how GANs can be used to generate reliable synthetic data. In this writeup, we’ll see how a somewhat unexpected unsupervised learning model (Gaussian Mixture Model) could be deployed in a similar adversarial architecture together with XGBoost to create reliable synthetic data.
The Challenge in Creating Reliable Synthetic Data
Creating quality synthetic data can be challenging when dealing with high-dimensional vectors. While it is relatively easy to fit a univariate distribution to a single column, it is extremely difficult to infer the joint multivariate distribution across all columns.
Take your own neighborhood for example: say you wanted to create a “fake” building somewhere near the building where you live. Assuming you know the estimated age of the buildings in your neighborhood, if asked to create a synthetic building in a location you know, you may get a reliable “year built” data point. Let’s say you pick 1990, an average vintage year for buildings in your area.
Now you are asked to come up with another value: ceiling height. But wait! It turns out that the buildings in your area that were built prior to 1990 feature higher ceilings, as was the style back then, so a plausible ceiling height depends on the year you just picked.
Now add occupancy, rent, concessions, amenities in the building (pool, gym, etc.), and more. You get the picture: it’s hard to come up with reliable synthetic data because we have to come up with values that make sense in multiple value distributions which depend on each other.
The Adversarial Concept
The idea behind an adversarial architecture data synthesizer is to use two entities that work ‘against’ each other: a generator, which generates synthetic data, and a detector, which tries to differentiate the real data from the synthetic.
Through this adversarial relationship, both machines continue to improve until a certain stage — and if they become so good that the detector is having a hard time telling the difference between real and synthetic data, we have ourselves a solid generator!
Please note that this is part of the Skyline AI Engineering Blog Series. For an executive overview, check out our latest white paper, which offers a layperson’s introduction to Skyline AI’s Synth City.
The Detector
To build the detector, we’ll be using an XGBoost classifier (available on SageMaker). We will use XGBoost to train a prediction model that learns to classify real data vs fake data.
SageMaker has a neat concept of Estimators, a high-level interface for SageMaker training. You can instantiate an XGBoost estimator like so (more on this here):
Or, if you’d like to do it the old-fashioned way, such as running it directly in a Jupyter notebook on SageMaker, you can install XGBoost by inserting the following cell into your notebook:
In building our classifier, we’ll denote class = 0 as authentic and class = 1 as fake data. To create our labeled data set of real vs fake data, we’ll start off by building a baseline data synthesizer and marking its output as class = 1 (fake) while using some of our authentic data with class = 0 (real).
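The labeling convention above can be sketched in a few lines of pandas. This is a minimal illustration rather than our production code: the helper name build_labeled_set, the label column is_fake, and the toy columns are all hypothetical.

```python
import numpy as np
import pandas as pd


def build_labeled_set(real_df: pd.DataFrame, fake_df: pd.DataFrame) -> pd.DataFrame:
    """Stack real and synthetic rows and add a binary `is_fake` label column."""
    real = real_df.copy()
    fake = fake_df.copy()
    real["is_fake"] = 0  # class 0: authentic records
    fake["is_fake"] = 1  # class 1: synthesized records
    labeled = pd.concat([real, fake], ignore_index=True)
    # Shuffle so the two classes are not stored as two contiguous blocks.
    return labeled.sample(frac=1.0, random_state=0).reset_index(drop=True)


# Toy stand-ins for the real and synthetic tables (hypothetical columns).
rng = np.random.default_rng(0)
real_df = pd.DataFrame({
    "year_built": rng.integers(1950, 2020, size=5),
    "ceiling_height": rng.normal(2.8, 0.2, size=5),
})
fake_df = pd.DataFrame({
    "year_built": rng.integers(1950, 2020, size=5),
    "ceiling_height": rng.normal(2.8, 0.2, size=5),
})

labeled = build_labeled_set(real_df, fake_df)
print(labeled["is_fake"].value_counts().to_dict())
```

Keeping the label in the same frame as the features makes it easy to split the data later without losing track of which rows are synthetic.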
A Baseline Data Synthesizer
Prior to developing a fancy ML model, it’s always a good idea to start off with a simple model, which will provide us with a comparable baseline to estimate our model’s performance.
In the case of our data synthesizer, we’ll build a naive model which builds a normal (Gaussian) distribution by using the mean and stdev of our data columns, and samples a random value from each distribution:
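A minimal sketch of such a naive synthesizer might look like the following. The function name naive_synthesizer and the toy year_built and ceiling_height columns are assumptions made for illustration. The correlation check at the end shows why this baseline is weak: each column looks right on its own, but the dependency between columns is lost.

```python
import numpy as np
import pandas as pd


def naive_synthesizer(real_df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Sample every column independently from N(mean, std) of that column."""
    rng = np.random.default_rng(seed)
    means = real_df.mean()
    stds = real_df.std()
    fake = {
        col: rng.normal(means[col], stds[col], size=n_rows)
        for col in real_df.columns
    }
    return pd.DataFrame(fake, columns=real_df.columns)


# Toy "real" data in which older buildings have higher ceilings.
rng = np.random.default_rng(42)
year_built = rng.normal(1990, 15, size=5000)
ceiling_height = 3.0 - 0.01 * (year_built - 1990) + rng.normal(0, 0.05, size=5000)
real_df = pd.DataFrame({"year_built": year_built, "ceiling_height": ceiling_height})

fake_df = naive_synthesizer(real_df, n_rows=len(real_df))

# The marginals match, but the relationship between the columns does not.
real_corr = real_df["year_built"].corr(real_df["ceiling_height"])
fake_corr = fake_df["year_built"].corr(fake_df["ceiling_height"])
print(f"real correlation: {real_corr:.2f}, synthetic correlation: {fake_corr:.2f}")
```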
Performance Metrics
To benchmark our model’s performance, we’ll be using precision, recall, and AUC-ROC, all available in the sklearn library:
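Here is one way to wrap the three metrics in a small helper. The name detector_report and the 0.5 decision threshold are assumptions for this sketch; the toy labels follow our convention of class 1 meaning fake.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score


def detector_report(y_true, y_prob, threshold=0.5):
    """Score a detector that outputs P(fake) for each row."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Turn probabilities into hard 0/1 predictions for precision and recall.
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        # ROC-AUC is computed on the raw probabilities, not the hard labels.
        "roc_auc": roc_auc_score(y_true, y_prob),
    }


# Toy example: three real rows (0) and four fake rows (1).
y_true = [0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.4, 0.7, 0.2, 0.6, 0.8, 0.9]
report = detector_report(y_true, y_prob)
print(report)
```

Recall is the metric we care about most here: it is the share of synthetic rows the detector catches, so a good generator pushes it down.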
Preparing the Data for XGBoost
Now that we have our performance metrics and our initial model’s fake data, let’s prepare them for training with XGBoost. XGBoost uses DMatrix, an internal data structure to hold and transform data. We’ll build a train set and a test set by randomly sampling 70% of the data for each.
Now we’ll run the XGBoost algorithm to maximize recall on the test set and have the model learn to classify real (class=0) from synthetic (class=1) data points.
The results for our baseline model are:
The precision isn’t too bad, but from the generator’s point of view a recall of 0.86 is pretty lame: it means our XGBoost detector correctly flagged 86% of our synthetic records as fake. Remember, our goal is to fool the detector. Let’s see how we can improve that!
Gaussian Mixture Models (GMMs) — How a Clustering Model Can Actually Be Used to Synthesize Data
GMM is a very powerful unsupervised learning algorithm often used as a clustering method. Despite this, GMM is fundamentally an algorithm for density estimation. Depending on its configuration (mainly the number of clusters/components and covariance type), GMM isn’t meant to find separated clusters of data, but rather to model the overall distribution of the input data. This is a generative model of the distribution — meaning that the GMM gives us the ability to generate synthetic data in a similar distribution to our original data.
In the Python Data Science Handbook, Jake VanderPlas provides a great example of this with the make_moons dataset generator found in sklearn. Consider the following scatter plot:
With the naked eye, we identify two distributions here. Suppose we fit a GMM with two components to cluster them. The results aren’t quite so good:
But what if we use 16 components?
You can see that the GMM learned the overall distribution of the input data. Now we can use the fitted model to create new reliable artificial data points!
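The moons experiment is easy to reproduce without the plots. The sketch below fits a 16-component GMM and samples new points from it; the sample size, noise level, and random seeds are assumptions chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

# Two interleaving half-circles, which no single Gaussian describes well.
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

# With 16 components the mixture models the density, not two "clusters".
gmm = GaussianMixture(n_components=16, covariance_type="full", random_state=0)
gmm.fit(X)

# sample() returns the new points and the component each one was drawn from.
X_new, component_ids = gmm.sample(400)
print(X_new.shape, gmm.converged_)
```

Plotting X_new next to X is a quick visual check that the sampled points follow the same two-moon shape.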
Prior to testing out a few alternatives for the GMM hyperparameters, it would be a good idea to reduce the dimensionality of our original dataset with PCA. The fewer variable distributions we have, the easier it will be to synthesize the complete artificial vector.
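A short sketch of the PCA step, assuming a toy dataset with ten columns that are really driven by three hidden factors. The number of components and the data are illustrative only. The key point is that inverse_transform lets us map reduced vectors back to the original column space later.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 10 observed columns generated from 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

# Project onto 3 principal components (PCA centers the data internally).
pca = PCA(n_components=3, random_state=0)
X_reduced = pca.fit_transform(X)

# Map the reduced vectors back to the original 10 columns.
X_restored = pca.inverse_transform(X_reduced)

explained = pca.explained_variance_ratio_.sum()
mean_abs_error = np.abs(X_restored - X).mean()
print(X_reduced.shape, round(float(explained), 4), round(float(mean_abs_error), 4))
```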
Selecting the Number of Components for GMM
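One common metric to measure a GMM’s fit is the Akaike information criterion (AIC), where lower values indicate a better trade-off between goodness of fit and model complexity. Let’s fit a few different GMMs that vary by the number of components and check their fit to our dataset using AIC: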
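The component search can be sketched as a simple loop over candidate values. The toy data below (three well-separated blobs) and the candidate range of 1 to 7 are assumptions for illustration; on our real data the search covered a wider range.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data drawn from three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [-6.0, 6.0]])
X = np.vstack([c + rng.normal(size=(300, 2)) for c in centers])

candidates = list(range(1, 8))
aic_scores = []
for k in candidates:
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    gmm.fit(X)
    aic_scores.append(gmm.aic(X))  # lower AIC is better

best_k = candidates[int(np.argmin(aic_scores))]
for k, score in zip(candidates, aic_scores):
    print(k, round(float(score), 1))
print("lowest AIC at k =", best_k)
```

In practice it is worth plotting the scores rather than blindly taking the minimum, since AIC often keeps creeping down slowly after the point where extra components stop adding much.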
Finding the Optimal Number of Components for our Data
Plotting AIC (Y-axis) vs the number of components (X-axis), the magic number appears to be 20 in our case (note the AIC before and after 20):
AIC vs the number of components
Now we’ll fit a 20-component GMM to our data. Following this, we’ll ask GMM to sample random data in the fitted distributions and we’ll inverse-transform the result using our PCA model to reconstruct a dataframe similar to our original data’s format:
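Putting the pieces together, a minimal version of the pipeline might look like the following. The three toy columns, the choice of two principal components, and the variable names are assumptions for this sketch; only the 20-component setting comes from the AIC analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Toy "real" table with dependent columns (hypothetical names and values).
rng = np.random.default_rng(0)
n = 2000
year_built = rng.normal(1990, 15, size=n)
ceiling_height = 3.0 - 0.01 * (year_built - 1990) + rng.normal(0, 0.05, size=n)
occupancy = 0.85 + 0.002 * (year_built - 1990) + rng.normal(0, 0.02, size=n)
real_df = pd.DataFrame({
    "year_built": year_built,
    "ceiling_height": ceiling_height,
    "occupancy": occupancy,
})

# 1. Reduce the table with PCA.
pca = PCA(n_components=2, random_state=0)
reduced = pca.fit_transform(real_df.values)

# 2. Fit a 20-component GMM on the reduced data.
gmm = GaussianMixture(
    n_components=20, covariance_type="full", max_iter=200, random_state=0
)
gmm.fit(reduced)

# 3. Sample as many rows as the original and map them back to the columns.
sampled, _ = gmm.sample(len(real_df))
synthetic_df = pd.DataFrame(pca.inverse_transform(sampled), columns=real_df.columns)

real_corr = real_df["year_built"].corr(real_df["ceiling_height"])
synth_corr = synthetic_df["year_built"].corr(synthetic_df["ceiling_height"])
print(f"real correlation: {real_corr:.2f}, synthetic correlation: {synth_corr:.2f}")
```

Unlike the naive baseline, the synthetic rows here are drawn from a joint distribution, so the relationship between year built and ceiling height carries over.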
Fooling our XGBoost Detector
How will our detector perform with the GMM-synthesized data? We’ll use the same xgb_test predictor created before to try to classify real data points from synthetic ones.
Let’s create artificial data using our GMM model. We ask our fitted GMM to sample the same number of rows we had in our original data. We invert the PCA’d data back into the original form and create a Pandas DataFrame from the synthesized data:
We use the same XGBoost detector for classification, checking the same metrics as before — ROC-AUC, precision and recall.
It seems the ROC-AUC has slightly improved. But the more interesting question is: has the recall improved?
The recall was reduced from 86% to just 20% — our GMM data synthesizer is significantly better at fooling our XGBoost classifier, such that it isn’t able to pick up on a significant portion of our artificial records!
Summary
We have seen that powerful ML algorithms can (somewhat unexpectedly) be used to come up with synthetic data generators. Powered by a strong compute platform for ML like SageMaker, we can train and fit XGBoost, PCA, and GMMs over large datasets, creating synthetic data that is capable of fooling even a state-of-the-art ML algorithm such as the XGBoost classifier.