Supercomputing and Computation
Advanced Data Analysis Capability and Surrogate Generation
May 16, 2013
- Researchers rely on realistic datasets to build new analysis algorithms, but in many cases realistic datasets contain sensitive information that limits who can work with them.
- One solution is to transform the original datasets with the intent to mask or remove the sensitive portions. This method is problematic because it is very difficult to ensure the transformation cannot be reversed.
- A better solution is to create surrogate datasets that have the properties of the original datasets but have no relationship to the content of the original data.
- Compute a comprehensive set of statistical measures that capture the properties of the original dataset.
- Review the statistical measures carefully for any of the sensitive properties in the original dataset and remove them. For example, with geospatial data, if the location is sensitive, translate and randomize the points across a new region.
- Use the statistical measures and a large volume of external data to randomly generate a surrogate dataset with properties similar to those of the original.
- This process has been used with structured datasets and geospatial datasets.
- The process produces a dataset that preserves many of the properties of the original with no direct connection to the content of the original data.
- Reviewing the statistical measures for sensitive information that could possibly carry into the surrogate is a shorter process than a review of the original dataset.
- The fidelity of the surrogate dataset can be tuned by adjusting the statistical measures and the amount of randomness added during generation.
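
The three process steps listed above (compute the measures, review them, generate from them) can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the function names `compute_measures`, `relocate`, and `generate_surrogate`, the use of per-column mean and standard deviation as the "statistical measures", the independent Gaussian sampling, the `noise_scale` parameter, and the sample coordinates are all assumptions made for the example. The sketch also omits the external data the process draws on and ignores correlations between columns.

```python
import random
import statistics


def compute_measures(rows):
    """Step 1: summarize each column by its mean and standard deviation.

    Assumption: these two numbers stand in for the "comprehensive set"
    of measures; a real set would be much richer.
    """
    return [
        {"mean": statistics.fmean(col), "stdev": statistics.pstdev(col)}
        for col in zip(*rows)
    ]


def relocate(measures, new_means):
    """Step 2: drop a sensitive property (here, the location) by
    replacing each column mean with a value from a new region."""
    return [
        {"mean": new_mean, "stdev": m["stdev"]}
        for m, new_mean in zip(measures, new_means)
    ]


def generate_surrogate(measures, n_rows, noise_scale=0.0, seed=None):
    """Step 3: draw rows from the reviewed measures only.

    The original rows are never read here. noise_scale > 0 widens the
    spread, trading fidelity for extra randomness (hypothetical knob).
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(m["mean"], m["stdev"] * (1.0 + noise_scale)) for m in measures]
        for _ in range(n_rows)
    ]


# Hypothetical (latitude, longitude) points standing in for sensitive data.
original = [[35.9, -84.3], [36.0, -84.2], [35.8, -84.4], [36.1, -84.1]]

measures = compute_measures(original)
reviewed = relocate(measures, new_means=[10.0, 20.0])
surrogate = generate_surrogate(reviewed, n_rows=1000, seed=0)
```

In this sketch the surrogate keeps the spread of the original points but is centered on the new region, and only the reviewed measures ever reach the generator. Raising `noise_scale` lowers fidelity by widening the sampled spread.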