Exploring responsible applications of synthetic data to advance online safety research and development

Pica Johansson; Shyam Krishna; Jonathan Bright; David Leslie; Claudia Fischer

Data are the foundational components used to train, test, and validate machine learning applications. As demand for machine learning technologies spreads across more and more areas of human activity and data sources multiply as a result of extensive digitisation, the challenges surrounding the responsible sourcing of large, high-quality datasets have grown more acute and widespread. These challenges are even more pronounced in the collection, management, and use of online safety data, where issues around the privacy and sensitivity of digital trace data, and potential vicarious harms to both annotators and data subjects, are extremely complex.

The difficulties surrounding the collection of data for machine learning have led to a recent wave of interest in how to move beyond traditional data gathering and preprocessing approaches towards other techniques that can leverage more impact from existing data, and reduce the burden of data collection and preparation. One of these techniques is called synthetic data generation: a process by which novel, fabricated data are created, either by humans or by machines, to imitate a genuine dataset.

To take a simple example, imagine needing a large dataset of photos of sunsets, each varying slightly in colour, setting and location. One approach would be to source these photos from genuine photographers; however, this might prove time-consuming and expensive. A second approach would be to collect a small number of photos and use them to train a generative model for creating synthetic data. This model would allow you to use your existing photos as input to create a much larger set of photos that capture the general composition of the originals, not by duplicating them but by replicating them with random variations learned from the existing photos.

Synthetic data generation presents a number of opportunities. It allows augmentation of existing datasets; an increase in both the quantity and diversity of data examples; and novel possibilities for mitigating biases and improving the balance and representativeness of original training datasets (for example, by adding data that represent members of minority groups who have not been properly included in the data collection process). It can allow for quick prototyping and testing of systems. Synthetic datasets also allow the potential fabrication of as yet unseen or hypothetical scenarios, which could help improve system robustness. In theory, synthetic data should also be able to facilitate more responsible data sharing insofar as risks of compromising privacy and potential data leakage are mitigated by the generation of non-identifiable synthetic datasets. Synthetic data have generated much optimism and have already started to be applied widely across different domains in machine learning, with Gartner suggesting that 60% of data used to train models will be synthetic by 2024 (White, 2021).

However, because synthetic data generation is an emerging field, the range of ethical risks and challenges associated with using synthetic data is yet to be fully explored. The aim of this report is to critically examine the ethics of using synthetic data, with a particular focus on its use in online safety technology, for example the technologies used in the detection of harmful content online. The report first outlines the context within which ethics will be discussed by defining key concepts and application areas and by providing a brief overview of how synthetic data are created. Next, this report details the ethical implications associated with creating synthetic data, sharing synthetic data and modelling with synthetic data. Throughout the report, we propose holistic and practical approaches to governing synthetic data, aimed at policymakers and other stakeholders.

The Alan Turing Institute