Privasis: Synthesizing the Largest
'Public' Private Dataset from Scratch

1NVIDIA, 2Carnegie Mellon University, 3University of Southern California, 4University of Washington
*: equal contribution
Image credit: Nano Banana

Privasis, the Privacy Oasis in the Desert of Privacy Data

Figure 1: An overview of our Privasis project.


For years, privacy-focused AI research has faced a fundamental limitation: there's no large-scale public data to work with. And for good reason. Truly private data—medical history, financial records, personal messages—cannot be shared without violating legal and ethical boundaries. As a result, privacy research has been constrained to small, narrow datasets, in stark contrast to the scale that has driven progress across the rest of AI.

At the same time, the stakes have never been higher. Modern AI agents like OpenClaw, ChatGPT Health, and Gemini Agent are rapidly gaining access to our deeply sensitive personal information—often processing this data directly at inference time.

However, the research community has almost no practical way to study or mitigate these risks. Without access to realistic private data, it's incredibly difficult to understand where systems fail, how privacy leaks occur, or what robust defenses actually look like in practice.

That's the gap we aim to close. We introduce a generative process that synthesizes realistic user profiles—complete with personal attributes like names, contact details, medical history, and financial records—using only demographic seeds and no real private data. This process enables the creation of Privasis (short for privacy oasis), a million-scale synthetic dataset with over 55 million annotated attributes. As one demonstration of downstream use, we also release Privasis-Sanitization, a 100K-record parallel corpus that pairs original texts with natural sanitization instructions and their corresponding sanitized outputs—designed specifically for training and evaluating text sanitization models.

Synthesizing a 1M-scale Privacy-forward Corpus from Scratch

Pipeline of Privasis

Figure 2: Our synthesis pipeline and an example of the generated data.

Directly sampling rare, highly specific data from LLMs is difficult because they favor generic, high-probability outputs, especially without reference data to guide generation. We address this by using informed initialization with auxiliary control variables, followed by a diversity-preserving revision and selection process that efficiently explores the space of possible texts.

1. Informed Initialization

We generate records bottom-up, combining a profile, a record type, and background context to define both semantic content and structural format. Profiles are fully synthetic, mixing random sampling with LLM-generated fields. An LLM then proposes a plausible record_type and a background_context explaining why the record exists. Sampling both content and format—especially format—produces more realistic and diverse initial_drafts.
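
To make this concrete, below is a minimal sketch of informed initialization. The `llm` wrapper, the prompt wording, and the field choices are illustrative assumptions, not the exact prompts or schema used to build Privasis.

```python
import random

# Minimal sketch of informed initialization (illustrative only; the real
# prompts, fields, and sampling distributions used for Privasis may differ).
# `llm` is an assumed wrapper: prompt string in, generated string out.

DOMAINS = ["Health & Wellness", "Business & Finance", "Education", "Travel"]

def sample_profile(llm) -> dict:
    """Mix random demographic seeds with LLM-generated free-form fields."""
    profile = {
        "age": random.randint(18, 90),
        "gender": random.choice(["female", "male", "nonbinary"]),
        "country": random.choice(["US", "KR", "DE", "BR", "IN"]),
    }
    profile["persona"] = llm(
        f"Invent a plausible fictional person given {profile}. "
        "Give a name, an occupation, and two personal details."
    )
    return profile

def sample_initial_draft(llm) -> dict:
    profile = sample_profile(llm)
    domain = random.choice(DOMAINS)
    # Auxiliary control variables: what kind of record exists, and why.
    record_type = llm(f"Given {profile} and the domain '{domain}', name one "
                      "specific type of private record this person might have.")
    background = llm(f"In one sentence, explain why this '{record_type}' exists.")
    # Sampling both the content and the document format keeps drafts diverse.
    draft = llm(f"Write the full '{record_type}' for {profile}. "
                f"Context: {background}. Use a realistic document format.")
    return {"profile": profile, "record_type": record_type,
            "background_context": background, "initial_draft": draft}
```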

2. Diversity-Preserving Revision

Initial drafts often look generic, so we iteratively refine each one with the LLM to obtain a new_draft. A judge model compares drafts for realism and specificity, while the Vendi score computed over embeddings measures diversity relative to the existing pool. A new_draft is accepted only if it improves both quality and diversity, preventing collapse into repetitive patterns.
Note: When we repeatedly optimize for quality only, generations tend to converge. Over time, drafts become increasingly similar, collapsing into a narrow set of patterns. While each individual record may improve, the overall diversity of the dataset degrades.
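
Below is a rough sketch of this accept/reject rule. The `llm`, `judge`, and `embed` callables are assumed wrappers, and the Vendi score here follows the standard similarity-kernel formulation; the exact embedding model and acceptance criteria used for Privasis may differ.

```python
import numpy as np

def vendi_score(embeddings: np.ndarray) -> float:
    """Vendi score: effective number of distinct items, computed from the
    eigenvalues of a normalized cosine-similarity kernel."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    K = X @ X.T / len(X)                       # cosine kernel, trace = 1
    lam = np.linalg.eigvalsh(K)
    lam = lam[lam > 1e-12]
    return float(np.exp(-np.sum(lam * np.log(lam))))

def revise_step(draft: str, pool_embs: np.ndarray, llm, judge, embed) -> str:
    """One revision step: keep new_draft only if quality AND diversity improve.
    `embed` is assumed to return a 1-D numpy vector for a single text."""
    new_draft = llm(f"Rewrite this record to be more specific and realistic:\n{draft}")
    better = judge(old=draft, new=new_draft)   # assumed judge model: True if new wins
    with_new = vendi_score(np.vstack([pool_embs, embed(new_draft)]))
    with_old = vendi_score(np.vstack([pool_embs, embed(draft)]))
    return new_draft if better and with_new >= with_old else draft
```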

3. Attribute Annotation and Clustering

We extract implicit attributes from the final record and update the profile in structured JSON. Related attributes are then clustered into higher-level grouped_attributes, capturing natural co-occurrence and adding useful structure for downstream tasks. An example would be clustering clinic_name, pharmacy_name, and room_number under a shared location category.
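
For illustration, an annotated update with grouped attributes might look like the following; the field names and values here are hypothetical, not the exact dataset schema.

```python
# Hypothetical annotation output for one record (illustrative names and values).
annotation = {
    "attributes": {
        "patient_name": "Dana R. Whitfield",
        "clinic_name": "Riverside Family Clinic",
        "pharmacy_name": "Maple Street Pharmacy",
        "room_number": "Exam Room 4",
        "prescription": "Lisinopril 10 mg daily",
    },
    "grouped_attributes": {
        # Related attributes are clustered into higher-level concepts.
        "location": ["clinic_name", "pharmacy_name", "room_number"],
        "medical": ["prescription"],
    },
}
```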

Analysis of Privasis

Statistics

Privasis contains a total of 1,414,871 records and 55,092,084 annotated attributes, averaging 39 attributes per record.

To increase diversity, we generate records with multiple LLMs, including GPT-OSS-120B (67.9%), GPT-4.1-Mini (21.6%), Exaone-3.5-32B (7.2%), Qwen3-80B (1.1%), Llama-3.3-70B (1.1%), GPT-4.1 (0.7%), and other frontier models.

What kind of records are in Privasis?

Explore example records from different domains in the Privasis dataset.


How does it compare to existing human-written datasets?

Diversity: We compare Privasis to human-written datasets using four diversity metrics (MATTR, bigram diversity, Shannon entropy, and cosine similarity). Across all domains, Privasis consistently shows greater diversity: higher MATTR and bigram diversity indicate richer vocabulary and syntactic variation, higher Shannon entropy reflects less repetition, and lower cosine similarity confirms reduced redundancy and greater semantic diversity.

Table 2: Diversity comparison between Privasis and human-written datasets.
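
As a rough illustration, the snippet below computes two of these metrics, bigram diversity and token-level Shannon entropy, for a list of records; the exact tokenization, MATTR window size, and embedding model for cosine similarity used in the paper may differ.

```python
import math
from collections import Counter

def bigram_diversity(texts: list[str]) -> float:
    """Fraction of distinct bigrams among all bigrams (higher = more varied)."""
    bigrams = [b for t in texts for b in zip(t.split(), t.split()[1:])]
    return len(set(bigrams)) / max(len(bigrams), 1)

def shannon_entropy(texts: list[str]) -> float:
    """Token-level Shannon entropy in bits (higher = less repetition)."""
    counts = Counter(tok for t in texts for tok in t.split())
    total = sum(counts.values()) or 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```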

Naturalness and Coherence: We also ran a blind human evaluation of naturalness and coherence. Seven annotators reviewed 128 Privasis records and 128 human-written records from similar domains. Privasis performed comparably to human data, with 113 records judged natural and coherent versus 111 for the human-written set.


Do the profiles correspond to real people?

No. We conducted a large-scale verification using web-enabled GPT-5 on 1K randomly sampled profiles. None were identified as real individuals: while some shared common names, all showed clear discrepancies in attributes such as age, nationality, or contact details. This confirms that the generated profiles are synthetic and do not correspond to real people.


Text Sanitization as a Use Case of Privasis

With Privasis, we now introduce **a new dataset and benchmark for text sanitization: Privasis-Sanitization.** In the real world, sensitive information rarely appears as clean, well-defined PII fields. Instead, it's scattered across long documents—medical notes, emails, financial records—entangled with context that users still want to keep. Our goal in the sanitization setting is therefore more ambitious than classic anonymization. We want models that can **selectively remove or abstract sensitive information**, *while preserving utility, coherence, and instruction-following flexibility*. To achieve this, we introduce Privasis-Sanitization, a large-scale parallel corpus built from Privasis and a decomposition-based pipeline designed specifically for this challenge.

Why Existing Sanitization Falls Short

Most prior sanitization datasets and systems focus on one or more narrow constraints:

- fixed PII categories (e.g., names, phone numbers),
- simple deletion or masking,
- short, single-domain text snippets.

However, real users' privacy needs are contextual. A user may want to:

- remove identities but preserve roles,
- drop exact locations while keeping cities,
- sanitize medical details without destroying clinical meaning.

Even frontier LLMs struggle with these requirements—especially on long documents—often missing at least one sensitive attribute or over-editing non-sensitive content. In privacy-sensitive settings, one miss is enough to fail.

A Decomposition-based Pipeline for building a parallel corpus for text sanitization

Figure 3: Our decomposition-based pipeline for building the parallel corpus for text sanitization. This illustrates a single target sanitization process. Our pipeline operates on multiple targets simultaneously.

To make fine-grained sanitization tractable, we design a decomposition-based pipeline that breaks the problem into manageable, grounded steps.

##### 1. Decomposition

We split each document into small, semantically coherent `chunks` (e.g., paragraphs, lists). Sanitizing at the chunk level preserves local context while improving reliability and enabling parallel processing.

##### 2. Target Selection

Rather than relying on predefined PII labels, we sample multiple sanitization `targets` from the document's annotated attributes. These targets may be individual attributes (e.g., date of birth) or grouped concepts (e.g., locations). Each `target` is labeled with a `sanitization_action`:

- DROP: remove the information entirely
- ABSTRACT: rewrite it at a higher level (e.g., “March 3, 2024” → “early March”)

This allows the dataset to reflect the fact that what counts as sensitive is user- and context-dependent.

##### 3. Grounded Sanitization

For each `target`, the LLM finds all relevant chunks (`chunks_with_target`), extracts the `spans`, and generates a single target-specific `sanitization_instruction` grounded in every occurrence. Applying this instruction across chunks ensures consistent edits and supports parallelization.

##### 4. Reconstruction

We merge sanitized and untouched chunks back together, preserving the original structure and producing a consistent final `sanitized_record`.

##### 5. Instruction Synthesis and Retention

All target-level instructions are combined into a single, natural, user-like instruction, with explicit retention constraints (`sanitization_action`: KEEP) to avoid over-sanitization.

The key idea is **global grounding per target with local, parallel execution per chunk**, allowing us to scale to long documents without missing or inconsistent edits. The result is a high-quality triplet—`(original_record, sanitization_instruction, sanitized_record)`—used to build the **Privasis-Sanitization** dataset of 100K records.
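
To make the flow concrete, here is a minimal sketch of how one record could move through the five steps. The `llm` wrapper, the prompts, the naive chunking, and the `Target` dataclass are assumptions for illustration, not the released implementation.

```python
from dataclasses import dataclass

# Rough sketch of the decomposition-based pipeline for a single record.
# `llm` is an assumed prompt-in/string-out wrapper.

@dataclass
class Target:
    name: str      # e.g. "date_of_birth", or a grouped concept like "locations"
    action: str    # "DROP" or "ABSTRACT"

def build_sanitization_triplet(record: str, targets: list[Target], llm) -> dict:
    chunks = record.split("\n\n")                      # 1. decomposition (naive split)
    per_target_instructions = []
    for t in targets:                                  # 2. targets sampled from annotations
        # 3. grounded sanitization: locate every occurrence first, then write
        #    ONE instruction grounded in all of them.
        hits = [i for i, c in enumerate(chunks)
                if llm(f"Does this chunk mention {t.name}? yes/no\n{c}") == "yes"]
        spans = [llm(f"List the exact spans about {t.name} in:\n{chunks[i]}") for i in hits]
        instruction = llm(f"Write one instruction to {t.action} {t.name}, "
                          f"covering these occurrences: {spans}")
        for i in hits:                                  # chunk-level, parallelizable
            chunks[i] = llm(f"Apply this edit to the chunk:\n{instruction}\n---\n{chunks[i]}")
        per_target_instructions.append(instruction)
    sanitized_record = "\n\n".join(chunks)              # 4. reconstruction
    user_instruction = llm(                             # 5. instruction synthesis
        "Merge these into one natural user request, adding explicit KEEP "
        f"constraints for details that must stay: {per_target_instructions}")
    return {"original_record": record,
            "sanitization_instruction": user_instruction,
            "sanitized_record": sanitized_record}
```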

How does Privasis-Sanitization compare to existing sanitization datasets?

Table 4: Comparison of Privasis-Sanitization to existing sanitization datasets.

Experiments

##### Training

We train a sanitizer, Privasis-Cleaner, that takes a `text` and a `sanitization_instruction` and outputs a `sanitized_text` in which the target attributes are abstracted or removed. Since sanitization models are safer when run locally, we target lightweight Qwen3 models: 0.6B, 1.7B, and 4B. We train them on a 37K subset of Privasis-Sanitization.

##### Evaluation

We evaluate models on the Privasis-Sanitization test set, which includes records generated by four frontier models: Gemini-2.5-pro, GPT-5, Llama-4-Maverick, and Qwen3-235B.

To assess sanitization quality, we use a hierarchical evaluation framework that captures three types of information leakage:

1. **Direct leak**: the sensitive value appears verbatim in the sanitized text (checked via exact string matching).
2. **Inference leak**: the value is inferrable from the sanitized text through reasoning. We test this by prompting an evaluator LLM with only the sanitized text and the attribute type and checking whether it recovers the original value.
3. **Proximity leak**: even if the value cannot be exactly recovered, the sanitized text can remain nearly as informative as the original. We detect this by comparing the evaluator's predictions from the sanitized and original records and checking whether the sanitized prediction is equally close to the true value.

A record is considered successfully sanitized only if none of its target attributes exhibit any of these leakage types. To prevent trivial solutions (e.g., deleting everything), we also measure **information retention**: specified retention attributes must remain present in the sanitized text. A record is fully successful only if it avoids all leakage while preserving all required information.

We report success at both the record and attribute levels and release two evaluation splits. The **Vanilla set** (1,042 records) contains cases where our pipeline achieves perfect sanitization, while the **Hard set** (1,149 records) includes longer, more complex records with a higher fraction of grouped attributes, where even our method fails—providing a challenging benchmark for future work.
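
For concreteness, here is a rough sketch of the hierarchical leakage checks and the all-or-nothing record-level criterion. The `evaluator` and `closeness` callables and the fields on `record` stand in for the evaluator LLM, the similarity measure, and the benchmark schema; they are assumptions, not the released evaluation code.

```python
def leakage(sanitized: str, original: str, attr_type: str, true_value: str,
            evaluator, closeness) -> dict:
    """Hierarchical leakage check for one target attribute (sketch only)."""
    # 1. Direct leak: the value survives verbatim (exact string match).
    direct = true_value.lower() in sanitized.lower()
    # 2. Inference leak: an evaluator LLM sees only the sanitized text plus the
    #    attribute type and still recovers the original value.
    guess_san = evaluator(text=sanitized, attribute=attr_type)
    inference = guess_san.strip().lower() == true_value.strip().lower()
    # 3. Proximity leak: the sanitized text is (nearly) as informative as the
    #    original, judged by comparing both predictions' closeness to the truth.
    guess_orig = evaluator(text=original, attribute=attr_type)
    proximity = closeness(guess_san, true_value) >= closeness(guess_orig, true_value)
    return {"direct": direct, "inference": inference, "proximity": proximity}

def fully_successful(record, evaluator, closeness) -> bool:
    """All-or-nothing record-level success: no leak of any type for any target,
    and every retention attribute still present in the sanitized text."""
    no_leaks = all(
        not any(leakage(record.sanitized, record.original, t.type, t.value,
                        evaluator, closeness).values())
        for t in record.targets)
    retained = all(r.value.lower() in record.sanitized.lower()
                   for r in record.retention_attributes)
    return no_leaks and retained
```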

Explore Evaluation Examples

Explore example records from the evaluation set. The leak examples highlight where the sanitization failed.


#### Do not trust your off-the-shelf LLMs on sanitization

Overall, the results make two things clear: sanitization is still tough for LLMs, and Privasis-trained models are meaningfully more reliable than the best-performing off-the-shelf LLMs.

We use a strict **end-to-end success metric** (called *Full Successful Record* in the paper): a record only counts as "successful" if the model (1) removes or abstracts **every** requested sensitive attribute **and** (2) preserves the non-sensitive "retention" details that are supposed to stay. One miss fails the whole example.

Under this metric, **Privasis-Cleaner-4B—despite not being a reasoning model—beats the strongest reasoning model (o3) on the vanilla test set, and also comes out ahead of GPT-5, which is orders of magnitude larger**. What's especially important is that many models "mostly" succeed—sanitizing most targets—but miss just a few. Because end-to-end success is all-or-nothing, those small misses dominate real privacy risk.
Table 5: Performance comparison on the Privasis-Sanitization benchmark.

LLMs leave sensitive values in place verbatim

Surprisingly, the most common failure mode is direct leakage, where the model simply leaves a sensitive value in place verbatim—like a full name, an exact date, or a user handle that the instruction asked to remove. Even when models avoid verbatim leakage, they can still fail via inference or proximity leaks: the output may keep enough surrounding context for the evaluator to guess the private value, or the sanitized text may end up nearly as revealing as the original. Note that inference leaks are harder to catch than proximity leaks, because we rely on exact string matching against the evaluator's prediction.

How models fail to sanitize sensitive values
Table 6: Ratios across different leakage types.

Structured records are more challenging

Difficulty isn't evenly distributed across domains. Business & Finance and Health & Wellness are consistently the hardest categories for most models. These domains tend to be dense with structured, high-salience details—IDs, transactions, dates, diagnoses—that are both sensitive and tightly woven into the text, making "remove this, keep that" decisions harder than they sound. Privasis-trained models look more balanced across categories, which is a good sign that the gains aren't coming from a narrow slice of easy examples—they're coming from better generalization across messy, real-world-style private documents.

Top 5 challenging domains for sanitization in Privasis
Table 7: Top 5 challenging domains for sanitization.

What's Next in Privacy and Agents operating on Social Records?

Privasis is a starting point. As AI agents gain deeper access to our personal data, one open question is: *how do we decide what to minimize, and how much*? It turns out models can tolerate surprisingly aggressive redaction—[sometimes 85% or more](https://arxiv.org/abs/2510.03662)—without losing functionality. But here's the problem: models themselves are bad at predicting what they actually need. They have a bias toward requesting more information than necessary, which leads to systematic oversharing.

One promising direction is local-remote collaboration: send generic, non-private queries to powerful remote models for reasoning decomposition, but keep all private data processing local with smaller trusted models. The idea is that by letting a powerful model break down problems into sub-queries and reasoning scaffolds, even lightweight local models can perform well—without ever exposing sensitive data.

But we're still early. We need more research into what information is truly necessary for which tasks, how to build models that are aware of their own information requirements, and how to design architectures that treat privacy as a first-class constraint rather than an afterthought.

BibTeX

@article{kim2026privasis,
    title={Privasis: Synthesizing the Largest 'Public' Private Dataset from Scratch},
    author={Kim, Hyunwoo and Mireshghallah, Niloofar and Duan, Michael and Xin, Rui and Li, Shuyue Stella and Jung, Jaehun and Acuna, David and Pang, Qi and Xiao, Hanshen and Suh, G. Edward and Oh, Sewoong and Tsvetkov, Yulia and Koh, Pang Wei and Choi, Yejin},
    journal={arXiv preprint arXiv:2601.12345},
    year={2026}
}