Synthetic Data and Generative AI

The performance of machine learning algorithms such as classification, clustering, regression, decision trees or neural networks can be significantly improved with synthetic data. It enriches training sets, allowing you to make predictions or assign a label to new observations that are significantly different from those in your dataset.

Synthetic data is especially useful if your training set is small or unbalanced. It also allows you to test the limits of your algorithms and find examples where they fail (for instance, failing to identify spam). I will show how to design rich, good-quality synthetic data to meet these goals. In particular, I illustrate how to rebalance data sets with synthetic data when some categories have very few observations (in fraud detection or clinical trials), how to reduce bias by including more observations from underrepresented groups in your data, and how to anonymize your data to improve security and comply with privacy laws.

In this course, taught by Dr. Vincent Granville, you will learn how to create your own synthetic data in Python. One example uses a real-life insurance data set: you will be able to create an alternate (synthetic) data set that closely matches the distribution of the observations in your training set. Other examples include computer vision, time series and animated data sets (agent-based modeling, where you will also learn how to create spectacular data videos in Python).

Dr. Granville is consistently ranked by various media outlets as one of the top machine learning scientists in the world.

Learn from a world-class industry expert

Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET and InfoSpace. Vincent is also a former post-doc at Cambridge University and at the National Institute of Statistical Sciences (NISS).

Vincent has published in the Journal of Number Theory, the Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, available here. He lives in Washington state and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory.

800  1,000  
Level : 
Date : 11 AM EST; Jan 30, Feb 2, Feb 6 and Feb 9
Live Sessions : 4 live courses (2 hours each) with weekly office hours and recordings
Lifetime access
Course teacher : Vincent Granville


Foundations in matrix algebra, time series, calculus and optimization are especially useful.

This course is for:

Machine learning engineers, data analysts, data scientists, and software engineers.

What you will be able to do after this course:

Master a number of techniques to generate and test rich synthetic data, and be able to quickly grasp future developments on this topic. Tasks performed during the training include writing Python code and using Python libraries, modeling and testing with cross-validation methods, implementing model-free techniques, performing feature and model selection, and testing black-box systems using synthetic data.

Additional information about the course

Module 1: Introduction to Synthetic Data and Explainable AI

Mon Jan 30th, 2023 - 04:00PM

What are synthetic data, generative models, explainable AI, and augmented data? What are their benefits and limitations? Applications outlined:

  • Terrain generation, morphing and evolution, see Web API here.
  • Curve fitting: estimating the shape of a meteorite with model-free confidence regions, for meteorite classification.
  • Time series with double periodicity mimicking ocean tides.
  • Synthetic tabular data with pre-specified correlation matrix.
  • Synthetic data to test or benchmark algorithms.
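
To give a rough idea of what a time series with double periodicity can look like, here is a minimal sketch that adds two sine waves with different periods plus Gaussian noise. It is not taken from the course material: the function name `double_period_series` and all default values (two periods close to 12 hours, the amplitudes, the noise level) are illustrative assumptions.

```python
import numpy as np

def double_period_series(n, period_a=12.42, period_b=12.0,
                         amp_a=1.0, amp_b=0.46, noise=0.05, seed=0):
    """Return (t, y): n observations of a noisy sum of two sine waves.

    All defaults are illustrative assumptions, not values from the course.
    Two nearby periods produce a slow "beat", loosely reminiscent of tides.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n, dtype=float)              # time index, e.g. in hours
    wave_a = amp_a * np.sin(2 * np.pi * t / period_a)
    wave_b = amp_b * np.sin(2 * np.pi * t / period_b)
    y = wave_a + wave_b + rng.normal(0.0, noise, size=n)
    return t, y

# Example: 30 days of hourly synthetic observations
t, y = double_period_series(24 * 30)
```

Fixing the seed makes the synthetic series reproducible, which helps when it is used to benchmark an algorithm.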

Module 2: Interpretable Machine Learning

Thu Feb 2nd, 2023 - 04:00PM

Some of the techniques presented here are used in the next two modules, which focus on synthetic data. Before diving into these techniques, Vincent will discuss data cleaning automation, data animation (data videos), and simplicity (illustrated by a case study: marketing attribution without math). The new machine learning techniques introduced include:

  • Generic unsupervised regression: covers all regression techniques and more, including an alternative to K-means.
  • Time series with double period (mimicking ocean tides).
  • Interpretable regression.
  • Simplified ensemble method, alternative to XGBoost.
  • Superimposed spatial point processes and alternative to GMM and GAN.

Module 3: Tabular Data Generation

Mon Feb 6th, 2023 - 04:00PM

This type of data is traditionally used in the banking, insurance and finance industries. Synthetic data has become very popular in these sectors, as it helps reduce discrimination and algorithmic bias, and contributes to the protection of personal data, explainable AI, and compliance with various regulations. In this module we will build a synthetic data set with a pre-specified correlation matrix, such as those estimated on real-life data sets.
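
One common way to obtain tabular data with a prescribed correlation structure is to multiply independent standard normal draws by a Cholesky factor of the target matrix. The sketch below illustrates that generic technique only; it is not necessarily the method taught in the module, and the function name `synth_correlated` and the example matrix `target` are hypothetical.

```python
import numpy as np

def synth_correlated(n, corr, seed=0):
    """Draw n Gaussian rows whose features follow the correlation matrix corr.

    Assumes corr is symmetric positive definite with ones on the diagonal.
    """
    rng = np.random.default_rng(seed)
    corr = np.asarray(corr, dtype=float)
    chol = np.linalg.cholesky(corr)            # corr = chol @ chol.T
    z = rng.standard_normal((n, corr.shape[0]))  # independent features
    return z @ chol.T                          # correlated features

# Hypothetical target, standing in for one estimated on a real data set
target = np.array([[ 1.0, 0.6, -0.3],
                   [ 0.6, 1.0,  0.2],
                   [-0.3, 0.2,  1.0]])

data = synth_correlated(50_000, target)
empirical = np.corrcoef(data, rowvar=False)    # should be close to target
```

This only reproduces correlations between Gaussian features. Matching the marginal distributions of a real data set requires additional steps that are outside the scope of this sketch.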

Module 4: Synthetic Data in Computer Vision

Thu Feb 9th, 2023 - 04:00PM

In this module, Vincent will cover terrain generation, including 3D contour plots, and emulation of GPU clustering with techniques similar to deep neural networks. Depending on the interest of participants, he may also cover shape generation or other evolutionary processes, such as synthetic star clusters used to understand the possible evolution of our universe. He will also discuss nearest neighbor and collision graphs (such as this one), all synthetically generated. Some of the synthetic data videos that you will be able to produce can be seen here and here.
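
For readers who have never generated a synthetic terrain, the sketch below shows one simple, generic approach: white noise is filtered in the Fourier domain so that low frequencies dominate, which yields a smooth height map. This is an assumption-laden illustration, not the technique used in the module; the function name `fractal_terrain` and the parameter `beta` are hypothetical.

```python
import numpy as np

def fractal_terrain(size=128, beta=2.0, seed=0):
    """Return a size x size height map scaled to [0, 1].

    Larger beta damps high frequencies more strongly (smoother terrain);
    beta = 0 leaves the white noise unfiltered.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((size, size))
    freq = np.fft.fftfreq(size)
    radius = np.sqrt(freq[None, :] ** 2 + freq[:, None] ** 2)
    radius[0, 0] = 1.0                         # avoid dividing by zero
    spectrum = np.fft.fft2(noise) / radius ** beta
    spectrum[0, 0] = 0.0                       # drop the constant offset
    height = np.fft.ifft2(spectrum).real
    height = height - height.min()
    return height / height.max()

terrain = fractal_terrain(size=128, beta=2.0)
```

The resulting array can be passed to any plotting routine that draws contour or surface plots.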

Frequently asked questions