Differential privacy (DP) mechanisms are increasingly proposed to enable
public release of sensitive information, offering strong theoretical privacy
guarantees but limited empirical evidence of utility. Utility is typically
measured as the error on representative proxy tasks, such as descriptive
statistics or performance over a query workload. Whether these results
generalize to practitioners’ experience has been questioned in a number of
settings, including the U.S. Census. In this paper, we propose an evaluation
methodology for synthetic data that avoids assumptions about the
representativeness of proxy tasks, instead measuring the likelihood that
published conclusions would remain unchanged had the authors used synthetic
data, a condition we call epistemic parity.

We instantiate our methodology over a benchmark of recent peer-reviewed
papers that analyze public datasets in the ICPSR social science repository. We
model quantitative claims computationally to automate the experimental
workflow, and model qualitative claims by reproducing visualizations and
comparing the results manually. We then generate DP synthetic datasets using
multiple state-of-the-art mechanisms, and estimate the likelihood that the
published conclusions still hold on them. We find that, for reasonable privacy regimes,
state-of-the-art DP synthesizers are able to achieve high epistemic parity for
several papers in our benchmark. However, some papers, and particularly some
specific findings, are difficult for any of the synthesizers to reproduce.
Given these results, we advocate for a new class of mechanisms that reorder
the priorities for DP data synthesis: favor stronger guarantees for utility (as
measured by epistemic parity) and offer privacy protection with a focus on
application-specific threat models and risk assessment.
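
To make the measurement concrete, the following is a minimal sketch of how
epistemic parity might be estimated for a single quantitative claim. It is an
illustration under stated assumptions, not the pipeline used in the paper: the
synthesizer is a toy Laplace-noised contingency table rather than one of the
state-of-the-art mechanisms, and the names synthesize, claim_holds, and
epistemic_parity, as well as the example finding itself, are hypothetical.

```python
import random

# Public, fixed domain: (group, outcome) with both attributes binary.
CELLS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def synthesize(records, epsilon, rng):
    """Toy DP synthesizer (an assumption for illustration, not a mechanism
    from the paper): add Laplace noise to each contingency-table cell, then
    sample synthetic records from the noisy table."""
    counts = {cell: 0 for cell in CELLS}
    for record in records:
        counts[record] += 1
    noisy = []
    for cell in CELLS:
        # Adding or removing one record changes one cell by 1 (L1 sensitivity 1).
        # The difference of two Exp(epsilon) draws is Laplace with scale 1/epsilon.
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        noisy.append(max(0.0, counts[cell] + noise))
    total = sum(noisy)
    if total <= 0:
        noisy, total = [1.0] * len(CELLS), float(len(CELLS))
    # The synthetic sample size comes from the noisy total (post-processing).
    n = max(1, round(total))
    return rng.choices(CELLS, weights=noisy, k=n)

def outcome_rate(records, group):
    outcomes = [o for g, o in records if g == group]
    return sum(outcomes) / len(outcomes) if outcomes else None

def claim_holds(records):
    """Hypothetical finding: group 1 has a higher outcome rate than group 0."""
    r0, r1 = outcome_rate(records, 0), outcome_rate(records, 1)
    return r0 is not None and r1 is not None and r1 > r0

def epistemic_parity(records, epsilon, trials=200, seed=0):
    """Fraction of synthetic datasets on which the claim still holds."""
    if not claim_holds(records):
        raise ValueError("the claim must hold on the real data")
    rng = random.Random(seed)
    hits = sum(claim_holds(synthesize(records, epsilon, rng)) for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    real = [(0, 1)] * 150 + [(0, 0)] * 350 + [(1, 1)] * 300 + [(1, 0)] * 200
    for eps in (0.01, 0.1, 1.0):
        print(eps, epistemic_parity(real, eps))
```

In this sketch, parity is estimated per claim as a simple frequency over
repeated synthesizer runs; a paper-level score would aggregate over all of a
paper's modeled findings, which is not shown here.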

