The Crossroads of Innovation and Privacy: Private Synthetic Data for Generative AI


By , Principal Research Product Manager, Principal Research Manager, Senior Researcher, Senior Researcher, Partner Research Manager


Introduction

In today’s data-driven world, organizations strive to leverage data to train and adapt AI models. However, this pursuit often faces an important challenge: balancing the value of data with the need to safeguard individuals’ right to privacy and to comply with data privacy regulations like the General Data Protection Regulation (GDPR) and the EU AI Act.

Synthetic data has emerged as a powerful solution to privacy and compliance challenges. It allows organizations to create realistic and useful datasets, tailored to specific use cases, without compromising individual privacy. This enables organizations to:

  • Train and adapt AI models: Synthetic data can be used to train and adapt models to specific domains and industries, even when real-world data is limited or privacy concerns exist.
  • Comply with regulations: Since it doesn’t require user data, synthetic data generation helps organizations adhere to data privacy regulations.
  • Unlock new possibilities: Synthetic data opens doors to innovative AI applications that were previously limited by data availability or privacy constraints.

Microsoft’s Phi-3 small language model (SLM) is a good example of how synthetic data can contribute to responsible AI development, enabling the creation of powerful language models without compromising privacy. Phi-3 leverages a combination of “textbook quality” web data and LLM-generated synthetic content, a strategic approach that doesn’t require real-world personal data.


However, synthetic data carries limitations. It can be difficult to artificially generate realistic data that anticipates a wide range of use cases and individual scenarios. Furthermore, synthetic data generated by pre-trained large language models (LLMs) can sometimes reduce accuracy and increase bias on downstream tasks. So, how could we generate synthetic data that accurately captures the diversity and specificity of private data while maintaining strict privacy protections for data contributors?

Differential privacy: A bridge between innovation and privacy

Differentially private (DP) synthetic data generation is a promising solution. It allows developers to pursue innovations in machine learning while prioritizing privacy. The goal of synthetic data generation is to produce data statistically similar to real-world data sources. However, when the data is too similar, replicating uniquely identifying details of the source data, the promise of preserving privacy is compromised. This is where DP can help. DP is a mathematical framework for providing a guarantee that a particular computation is relatively invariant to the addition or removal of a single data contributor. Using DP techniques, researchers can generate synthetic datasets that retain the statistical properties of the original data while ensuring that information that could help identify data contributors remains obscured. 
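This guarantee has a precise formulation. A randomized mechanism M satisfies (ε, δ)-differential privacy if, for every pair of datasets D and D′ differing in the records of a single contributor, and for every set of outcomes S:

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta
```

Smaller ε (and δ) means the mechanism's output reveals less about any individual contributor; ε = ∞ corresponds to no privacy guarantee at all.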

This blog post explores recent advancements in private synthetic data generation. We examine four recently published research papers that propose innovative techniques for generating synthetic data with strong privacy guarantees, while maintaining its usefulness for analytics, training AI models, and other tasks.

In the remainder of this blog post, we describe each approach in more detail, and present experimental results illustrating their value.

Technical deep dive: Differentially private synthetic data generation 

Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe

Generative LLMs offer the opportunity to produce synthetic text by sampling from LLM outputs. One avenue to generating realistic synthetic text is to fine-tune an LLM using representative data. For example, we could fine-tune a pre-trained LLM on a corpus of scientific papers, enabling the model to more readily produce text that captures the knowledge and writing style of scientific writing. Suppose, however, that we want to produce synthetic text based on a private corpus of documents, such as medical notes or personal emails. What steps can we take to protect the document authors and any sensitive information in their documents? LLMs have a well-known capacity to memorize training examples, and a model that can reproduce samples from its training set poses significant privacy risks.

In the paper Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe, researchers from Microsoft presented an approach to leveraging a private data corpus for synthetic generation, without compromising the privacy of the data subjects. This approach uses differentially private stochastic gradient descent (DP-SGD) to fine-tune an LLM on the private documents with a strong privacy guarantee. Differentially private model training provides a mathematical guarantee that the trained model parameters, and any subsequent model outputs, are relatively unaffected by the addition or removal of any single user’s training examples.
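The heart of DP-SGD is a modified gradient update: each example's gradient is clipped to bound any individual's influence, the clipped gradients are averaged, and calibrated Gaussian noise is added. The sketch below is an illustrative toy in plain NumPy (the `dp_sgd_step` helper and the linear-model setting are our own, not the paper's implementation):

```python
import numpy as np

def dp_sgd_step(weights, per_example_grads, clip_norm=1.0,
                noise_multiplier=1.0, lr=0.1, rng=None):
    """One DP-SGD update: clip each example's gradient to bound any
    individual's influence, average, then add Gaussian noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose norm exceeds clip_norm.
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    avg = np.mean(clipped, axis=0)
    # Noise std is proportional to the clipping bound (the sensitivity).
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(per_example_grads),
                       size=avg.shape)
    return weights - lr * (avg + noise)
```

In practice, libraries such as Opacus automate this per-example clipping and noise addition for PyTorch models and track the cumulative (ε, δ) privacy cost across training steps.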

The synthetic generation approach described in this work was validated by training on restaurant reviews with varying levels of privacy protection, then prompting the model to generate novel reviews. These reviews were then used for downstream classification tasks, such as sentiment prediction and restaurant genre classification, and the results, which are shown in Table 1, demonstrated only small accuracy penalties compared to training on the raw private data. This approach unlocks a powerful way for realistic synthetic data to be generated from private data without compromising privacy or confidentiality.

A flow chart with four successive blocks. Starting with a data owner, private data is provisioned to train a language model with differential privacy. The language model is subsequently prompted to generate novel synthetic data resembling the private data. This data can be used for down-stream applications such as machine learning, feedback analysis or statistical analysis.
Figure 1: By fine-tuning an LLM with differential privacy, the model can be used to generate synthetic examples that resemble the private corpus 
A table of results with columns for data type, data generator, epsilon, rating accuracy, and category accuracy. The row for the original data shows a rating score of 0.733 and a category score of 0.775. The remaining rows cover synthetic data generated by GPT-2, GPT2-Medium, and GPT2-Large, each at ε = 4 and ε = ∞. All synthetic scores fall a few percentage points below the original, with the ε = 4 rows 1-2 points below their ε = ∞ counterparts. Scores at ε = 4 improve with larger GPT-2 models, while the ε = ∞ rows are relatively flat.
Table 1: Various versions of GPT-2 were trained on restaurant reviews both with (ε = 4) and without (ε = ∞) a privacy guarantee. These models were used to produce synthetic training sets, which in turn were used to train classification models for review rating and restaurant category, evaluated for accuracy on a private hold-out set. The results show that models trained on the synthetic data can achieve accuracy competitive with models trained without a privacy guarantee.

Differentially Private Synthetic Data via Foundation Model APIs

While the ACL paper demonstrated a robust approach to synthetic data generation, fine-tuning a large model can be impractical. Model training requires significant computing capacity, and some of the most powerful available models are proprietary and inaccessible for DP training. Recognizing this challenge, researchers at Microsoft explored whether synthetic data can be generated using only inference API access to a model, even when that model is untrusted and controlled by a third party. Crucially, the synthetic data should resemble a targeted private corpus and carry a DP guarantee similar to the one achieved in the previous work based on model training. In two separate papers, the authors demonstrate an approach to this problem using a differentially private sampling technique called Private Evolution (PE).

Two independent flow charts. In the first, private data is applied to a pre-trained model using DP-SGD. The fine-tuned model is used to produce differentially private synthetic data.  In the second chart, a pre-trained model is prompted via its API to produce generic data. Private data is used to inform selection of the generated data, with a strong privacy guarantee, yielding differentially private synthetic data.
Figure 2: Instead of fine-tuning pre-trained models with DP-SGD (top figure), Private Evolution (PE) only requires accessing the inference APIs of a model (bottom figure). Thus, PE is easily compatible with foundation models that are difficult to DP-fine-tune (e.g., because they are too large) or infeasible to fine-tune (e.g., they are only accessible through inference APIs).

Synthetic image generation using foundation model APIs: In Differentially Private Synthetic Data via Foundation Model APIs 1: Images, the authors introduced Private Evolution (PE), an approach that enables DP image synthesis merely through inference APIs of a generative model. PE operates by sampling from a pre-trained diffusion model such as Stable Diffusion, which has no knowledge of the private corpus. PE then iteratively compares these samples to the private corpus, keeps the ones that are most similar to the private corpus, and uses the pre-trained model to generate more such samples. Crucially, the comparison to the private corpus is done with a DP guarantee, so that any information revealed about the private corpus is strictly bounded. Also, all the queries to the foundation model APIs satisfy the same DP guarantee, so that we can safely use APIs provided by (untrusted) third parties. 

Figure 3: Overview of PE. We use two private and synthetic images for illustration. Step 1 (RANDOM_API): we use the model API to generate random images. Step 2: We iteratively go through steps 2.1-2.3 to refine the synthetic images towards the private images. Step 2.1: Each private image votes for its closest synthetic image in the embedding space. In this example, we assume that the bird image gets two votes, and the car image gets zero votes. We then add Gaussian noise to the votes to ensure DP. This gives us the DP Nearest Neighbor Histogram (DP_NN_HISTOGRAM). Step 2.2: We resample the generated images proportional to the histogram. We assume that only the bird image remains. Step 2.3 (VARIATION_API): We use the model API to generate new images similar to the bird image, which become the initial synthetic images in the next iteration.
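The selection step at the core of PE (steps 2.1-2.2 above) can be sketched compactly. The snippet below is a simplified illustration in NumPy, assuming embeddings are already computed; `pe_select` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def pe_select(private_emb, synthetic_emb, noise_scale=1.0, rng=None):
    """One Private Evolution selection step: each private item votes for its
    nearest synthetic candidate, Gaussian noise makes the histogram DP, and
    candidates are resampled in proportion to the noisy votes."""
    rng = np.random.default_rng(0) if rng is None else rng
    votes = np.zeros(len(synthetic_emb))
    for p in private_emb:
        dists = np.linalg.norm(synthetic_emb - p, axis=1)
        votes[np.argmin(dists)] += 1.0           # nearest-neighbor voting
    votes += rng.normal(0.0, noise_scale, size=votes.shape)  # DP_NN_HISTOGRAM
    probs = np.clip(votes, 0.0, None)
    total = probs.sum()
    probs = probs / total if total > 0 else np.full(len(votes), 1.0 / len(votes))
    # Resample candidate indices; survivors are then fed to VARIATION_API.
    return rng.choice(len(synthetic_emb), size=len(synthetic_emb), p=probs)
```

Because the private data only influences the (noised) histogram, the privacy cost of each iteration is bounded, regardless of what the untrusted model API does with the selected candidates.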

Even without doing any model training, PE significantly advances state-of-the-art results on some datasets. For example, on the CIFAR-10 dataset, we achieve an FID score (an image quality measure; lower is better) ≤ 7.9 with a DP privacy cost of ϵ = 0.67, significantly improving on the previous SOTA of ϵ = 32. In the paper, we also show that PE requires fewer computational resources (GPU hours) than DP fine-tuning to achieve these results.

A 2D line chart with six line series, comprising conditional and unconditional variations on the private evolution and DP-MEPF methods, as well as DP-GAN and DP-Diffusion. The x axis presents values of epsilon from 0 to 32. The y axis presents values of the image quality measure FID from 0 to 80, where lower values are better. All six series show decreasing values of FID for increasing values of epsilon. Both of the series corresponding to private evolution show significantly lower FID values, ranging from about epsilon = 0.1 to epsilon = 2.
Figure 4: FID (image quality measure, lower is better) vs. DP privacy cost ϵ on CIFAR-10 (δ = 10⁻⁵). (Un)cond means (un)conditional generation. Ours achieves the best privacy-quality trade-off compared with prior training-based approaches.
An array of ten rows of thumbnails, each row depicting ten instances of generated synthetic images. The rows include birds, cars, cats, dogs, and other animals, planes, boats and trucks.  Most of the images appear to be realistic with some exhibiting unusual artifacts.
Figure 5: Private Evolution-generated samples using CIFAR-10 as the private corpus (ε = 0.67, δ = 10⁻⁵). Each row corresponds to one object class.

Synthetic text generation using foundation model APIs: The PE approach described above works well for images, since it is easy to produce nearby perturbations of promising images. In Differentially Private Synthetic Data via Foundation Model APIs 2: Text, Microsoft researchers explored whether a similar approach could be applied to text. Their method, called Augmented Private Evolution (Aug-PE), operates similarly to the basic PE approach but leverages the power of a pre-trained LLM to produce variations and re-wordings of input text. Aug-PE also introduces some fundamental algorithmic improvements that may benefit future development of PE.

An overview of the Augmented Private Evolution algorithm for synthetic text generation. Step 1 invokes a language model to produce random text. Step 2.1 uses private data and differential privacy to vote on the best candidates from step 1. Step 2.2 samples from this differentially private histogram to produce a selected set of generations. Step 2.3 prompts a language model to produce variants of the selected generations, and steps 2.1 to 2.3 are repeated.
Figure 6: Augmented Private Evolution (Aug-PE) leverages a foundational LLM to synthesize text and compare in a privacy-preserving way with a private corpus. Similar to PE for images, in Aug-PE, samples that more closely resemble the private data are retained and refined to produce new synthetic text with a strong privacy guarantee. The illustration shows how we generate DP synthetic reviews for restaurants given two private samples.
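The overall Aug-PE loop can be sketched as follows. This is a skeleton only: `random_api`, `variation_api`, and `similarity` are caller-supplied stand-ins for the real LLM calls and embedding distance, and the names are our own illustration rather than the paper's code:

```python
import random

def aug_pe(private_texts, random_api, variation_api, similarity,
           iters=3, noise=1.0, seed=0):
    """Sketch of the Aug-PE loop: generate random text via the LLM API,
    keep candidates that private texts (noisily) vote for, ask the LLM for
    rewordings of the survivors, and repeat."""
    rng = random.Random(seed)
    candidates = [random_api() for _ in range(len(private_texts))]
    for _ in range(iters):
        votes = [0.0] * len(candidates)
        for p in private_texts:                     # nearest-candidate voting
            best = max(range(len(candidates)),
                       key=lambda i: similarity(p, candidates[i]))
            votes[best] += 1.0
        votes = [v + rng.gauss(0.0, noise) for v in votes]   # DP noise
        pos = [max(v, 0.0) for v in votes]
        weights = pos if sum(pos) > 0 else [1.0] * len(votes)
        survivors = rng.choices(candidates, weights=weights, k=len(candidates))
        candidates = [variation_api(s) for s in survivors]   # LLM rewording
    return candidates
```

The key difference from image PE is the variation step: instead of perturbing pixels, the LLM itself is prompted to rephrase each surviving candidate.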

Results show that Aug-PE is a promising alternative to DP fine-tuning for DP text synthesis. With the same foundation model, Aug-PE can match or even beat DP fine-tuning in the trade-off between text quality and privacy. Moreover, because Aug-PE only requires inference APIs, it can easily work with the most advanced LLMs, such as GPT-3.5, LLaMA, and Mixtral, to further improve text quality. In terms of computational cost (GPU hours), Aug-PE can achieve up to a 65.7x speedup compared with the DP fine-tuning approach.

A table of results for area and rating classification accuracy for a variety of models and comparing PE with DP synthesis. The table contains the remark that with the same model PE matches or beats DP fine-tuning on text quality vs privacy, and PE works well with advanced LLMs which may be challenging or impossible to fine-tune. The models compared include three sizes of GPT-2, several major open source models, and GPT-3.5. PE on the Mixtral model shows the strongest Area classification accuracy at 43.6 while PE on GPT-3.5 shows the strongest Rating classification accuracy at 43.1.
Table 2: Results on ICLR 2023 paper reviews (ϵ = 1). We use each method to generate DP synthetic paper reviews and test the utility of the data by training downstream paper-area and rating classifiers, evaluating their accuracy on real hold-out data (higher is better). Under the same base model (the GPT-2 family), PE achieves results competitive with DP fine-tuning. PE also supports advanced LLMs that may be challenging to use with DP fine-tuning due to large model sizes or black-box access.

Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation

In-context learning is a technique for performing tasks with an LLM by providing a sample of demonstration examples in the prompt of the LLM before presenting it with a specific task. For example, we might show a few movie plots and their genres and ask the LLM to suggest the genre for a particular plot of interest. In-context learning harnesses the strong generalization capabilities of LLMs, but it requires a sample of labeled demonstration examples at inference time. How can we perform in-context learning when the only available labeled examples are private? A naïve solution might be to use the private examples while hiding the demonstration prompt from the user. However, the threat posed by jailbreak attacks puts these examples at risk of exposure to a malicious user.

In Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation, Microsoft researchers explored how demonstration examples can be synthesized from a private corpus with a privacy guarantee. The method operates by incrementally drawing samples from a token distribution defined by the private examples, with noise added to the distribution. The noise is calibrated to bound the privacy lost with each sample. The research demonstrated that this DP in-context learning can outperform zero-shot learning (querying a model without any demonstration examples) and comes close to matching the case with no privacy mitigations, as shown in Table 3.

An overview of differentially private few-shot generation.  A round of token generation is depicted with four steps. Given the tokens generated so far, step 1 selects the relevant private data. Step 2 takes an M by N sample of the private data, producing M batches of N examples. Step 3 assembles M LLM prompts with task instructions and the N examples appended. Step 4 feeds the M prompts to the LLM and performs noisy aggregation over the LLM’s output probabilities to select the next generated token.
Figure 7: Illustration of DP few-shot generation. The example shows a synthetic demonstration generated token by token for the topic school with a differentially private guarantee. As new tokens are sampled, the private examples inform the sampling probability of each subsequent token, with noise injected to preserve privacy. 
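The noisy-aggregation step (step 4 above) can be illustrated with a short sketch. The snippet below is a simplified toy, assuming each of the M prompts has already produced a next-token probability vector; `dp_next_token` is a hypothetical helper name:

```python
import numpy as np

def dp_next_token(prob_batches, noise_scale=0.1, rng=None):
    """Noisy aggregation for DP few-shot generation: average the next-token
    distributions produced by M prompts (each built from a disjoint batch of
    private examples), add Gaussian noise, and emit the argmax token id."""
    rng = np.random.default_rng(0) if rng is None else rng
    avg = np.mean(prob_batches, axis=0)        # aggregate over the M batches
    noisy = avg + rng.normal(0.0, noise_scale, size=avg.shape)
    return int(np.argmax(noisy))
```

Because each private example contributes to only one of the M averaged distributions, its influence on the selected token is bounded, and the added noise converts that bounded influence into a formal DP guarantee per generated token.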
A table of results for private in-context learning tasks, including text classification on three datasets (AGNews, DBPedia, and TREC) and information extraction on two datasets (MIT-G and MIT-D).  Accuracy is compared across two cases where epsilon = 0 (zero-shot and four-shot) and values of epsilon at 1, 2, 4, 8 and infinity. Generally, accuracy improves as epsilon increases but epsilon = 8 often outperforms epsilon = infinity.
Table 3: For classification and information extraction tasks, DP in-context learning achieves accuracy similar to non-private ICL (ϵ = ∞).

Conclusion

Synthetic data generation presents enormous opportunities to develop AI systems without compromising end-user privacy. In this blog post, we have explored recent innovations in synthetic data generation with strong privacy guarantees. These approaches can enable practitioners to produce synthetic data from private datasets while mitigating the risk that private information might be revealed. While these approaches are highly promising, they do have limitations; for example, they are currently limited to producing relatively short text passages. Future work will continue to explore the opportunities presented by these approaches, with an aim to produce increasingly realistic data with strong privacy guarantees.

Acknowledgments: The authors are grateful for the contributions of the co-authors of the papers reviewed in this blog post: Xiang Yue, Xuechen Li, Girish Kumar, Julia McAnallen, Hoda Shajari, Huan Sun, David Levitan, Chulin Xie, Arturs Backurs, Sivakanth Gopi, Da Yu, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Janardhan Kulkarni, Xinyu Tang, Richard Shin, Andre Manoel, and Niloofar Mireshghallah.
