Micro Projects 2015

Synthetic Data for Testing, Demos, Benchmarking

There are many occasions when using real data is not feasible, either (1) because doing so would violate privacy or ethical considerations, or (2) because real data is not yet available (this is likely to apply while people are taking their early steps with learning analytics). Simply obfuscating obviously personal data is not sufficient; both practical examples and theoretical studies show that this kind of attempt at anonymisation can be defeated. Issues 1 and 2 both cause problems for integration testing and user-level (dashboard) testing, e.g. in the context of the Jisc Learning Analytics Services project, and for demonstrating software.


An additional issue is benchmarking to compare different predictive models or software. A standardised, openly available synthetic dataset would give everyone the same basis for comparison, and so could provide a useful measure of assurance.


Synthetic data can go some way towards dealing with these issues and would be more meaningful than manually created "testing data". Although synthetic data is not real, it reflects either the statistics of real data or a chosen model.
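As a rough illustration of what "reflecting the statistics of real data" can mean, the sketch below fits a mean and standard deviation to each numeric column of a small table and then samples new rows from those fitted distributions. It is a deliberately minimal example and not the method proposed here: the column names (logins, score) and the function names are hypothetical, the sample values are made up for illustration, and sampling each column independently from a normal distribution reproduces only the per-column statistics, not the correlations between columns that a realistic synthesis would need to preserve.

```python
import random
import statistics


def fit_marginals(rows):
    """Return {column: (mean, stdev)} for a list of dict rows with numeric values."""
    columns = list(rows[0].keys())
    marginals = {}
    for col in columns:
        values = [row[col] for row in rows]
        marginals[col] = (statistics.mean(values), statistics.stdev(values))
    return marginals


def synthesise(marginals, n, seed=0):
    """Draw n synthetic rows, sampling each column independently from a normal
    distribution with the fitted mean and standard deviation.

    Assumption: independence between columns, so correlations are NOT preserved.
    """
    rng = random.Random(seed)  # seeded so that a dataset can be regenerated exactly
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in marginals.items()}
        for _ in range(n)
    ]


if __name__ == "__main__":
    # Illustrative stand-in for a real table; values are invented.
    real_rows = [
        {"logins": 3, "score": 55},
        {"logins": 5, "score": 62},
        {"logins": 8, "score": 71},
        {"logins": 2, "score": 48},
        {"logins": 7, "score": 66},
    ]
    for row in synthesise(fit_marginals(real_rows), n=3):
        print(row)
```

Fixing the seed means the same synthetic dataset can be regenerated on demand, which matters if it is to serve as a shared benchmark.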


Useful synthetic data could take various forms, and the mini-project would focus on #1 or #2 according to feedback:

1. Data suitable for prediction of risk (for guiding interventions). The approaches used for synthesising longitudinal household survey data indicate this is likely to be feasible. The mini-project would explore the adoption of these methods and fine-tune them for the learning analytics context. It would go on to assess the suitability of this kind of simulated data for benchmarking (it is assumed it would be quite adequate for the other uses mentioned above). The project would deliver a packaged "recipe" and scripts to allow (tabular) datasets with arbitrary attributes to be synthesised. For the mini-project, a single set of attributes would be used (guided by available data and existing evidence on appropriate metrics, and prioritised according to stakeholder consultation).

2. Learning activity data, e.g. for storage in a Learning Record Store (LRS). In this case, simulation of activity based on a parameterised learner model and "microsimulation" would be appropriate. There are several established methods and software libraries, some of which are used in health and demographic studies. These would be suitable, but it would not be feasible in a micro-project to calibrate the simulated data so that it is realistic, as opposed to stereotypical. This kind of simulated data would be particularly useful for dashboard prototyping and demonstration. The project would deliver a recipe and source code to allow activity data to be injected into an xAPI LRS (as identified in the Jisc Learning Analytics Services project) using the same JSON template as at least one VLE/LMS. Stakeholder consultation would determine the target VLE and priority activity types. Documentation would also be produced to show how to adapt the scripts for different activity sources and for different activity stores (e.g. to accommodate IMS Caliper). A proof-of-concept online activity data synthesis service would also be developed.
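To give a flavour of option 2, the sketch below simulates activity for one learner from a single-parameter learner model and emits statements in the general shape of an xAPI statement (actor, verb, object, timestamp). Everything specific in it is an assumption made for illustration: the "engagement" parameter, the function name, the vle.example.org URLs and the choice of two ADL verbs. It does not reproduce the JSON template of any particular VLE, and it stops short of sending the statements to an LRS.

```python
import json
import random
import uuid
from datetime import datetime, timedelta, timezone

# Two verb identifiers from the ADL vocabulary; a real template may use others.
VERBS = {
    "experienced": "http://adlnet.gov/expapi/verbs/experienced",
    "completed": "http://adlnet.gov/expapi/verbs/completed",
}


def simulate_learner(learner_id, engagement, days, start, rng):
    """Return a list of xAPI-style statements for one learner.

    engagement: hypothetical model parameter, the probability that the
    learner is active on any given day (at most one event per day here).
    """
    statements = []
    for day in range(days):
        if rng.random() >= engagement:
            continue  # learner inactive on this day
        when = start + timedelta(days=day, minutes=rng.randrange(24 * 60))
        verb = rng.choice(sorted(VERBS))
        resource = rng.randrange(1, 11)  # hypothetical resources numbered 1..10
        statements.append({
            "id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
            "actor": {
                "objectType": "Agent",
                "account": {"homePage": "https://vle.example.org", "name": learner_id},
            },
            "verb": {"id": VERBS[verb], "display": {"en-GB": verb}},
            "object": {
                "objectType": "Activity",
                "id": f"https://vle.example.org/course/demo/resource/{resource}",
            },
            "timestamp": when.isoformat(),
        })
    return statements


if __name__ == "__main__":
    rng = random.Random(42)
    start = datetime(2015, 1, 5, tzinfo=timezone.utc)
    batch = simulate_learner("learner-001", engagement=0.6, days=14, start=start, rng=rng)
    print(json.dumps(batch[:1], indent=2))
```

A population of learners with different engagement values would give the stereotypical, uncalibrated data described above; injecting it would then be a matter of sending each batch to the target LRS's statements endpoint with the appropriate credentials.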


Both of these options would be good to run alongside developer-centred activities (e.g. hackdays) and would complement the system development and piloting phases of the Jisc Learning Analytics Services project.


Synthetic data may also be useful for learning analytics research, for example in support of "reproducible research" and algorithm work, but these are not the focus of the proposal. Any spill-over benefits would be a bonus.



Idea No. 26