Pseudo-randomizing the database

This does not achieve complete randomization of the database, since that would make the database useless for development purposes. In fact, there is a good chance that there are dependencies in the data that make this problematic, and that we will have to redo this at some point.

1   Simple improvements

These are the simple things that can be done without dealing with chained logic. They should be done irrespective of other considerations.

1.1   Size

Perform cascade deletion on persons.

Significantly reduce the number of entities related to persons in the database. We can probably make do with 10-25% of the original. We require a representative and coherent database which reflects the complexity of our data.

Remove auditlog.
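
A minimal sketch of the sampling-plus-cascade idea, using an in-memory SQLite database and an invented miniature schema (the real tables, foreign keys, and sample fraction will differ):

```python
import random
import sqlite3

# Hypothetical miniature schema: person_info plus one dependent table
# with ON DELETE CASCADE. The real schema and foreign keys will differ.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE person_info (person_id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE entity_address (
                    entity_id INTEGER REFERENCES person_info(person_id)
                        ON DELETE CASCADE)""")
conn.executemany("INSERT INTO person_info VALUES (?)",
                 [(i,) for i in range(100)])
conn.executemany("INSERT INTO entity_address VALUES (?)",
                 [(i,) for i in range(100)])

# Keep a random ~20 % sample of persons; the cascade removes their
# dependent rows automatically.
ids = [r[0] for r in conn.execute("SELECT person_id FROM person_info")]
keep = set(random.sample(ids, k=len(ids) // 5))
conn.executemany("DELETE FROM person_info WHERE person_id = ?",
                 [(i,) for i in ids if i not in keep])
conn.commit()
```

Note that SQLite requires `PRAGMA foreign_keys = ON` for cascades to fire; other engines cascade by default once the constraint is declared.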

1.2   Password

Table: password_history
  • entity_id: Shuffle.
  • hash: Empty, or replace with a random hash for all rows. Uniqueness is not required.
Table: account_authentication
  • For each method, shuffle account_id.
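
The two basic operations above, shuffling one column while leaving the rest of each row alone and replacing hashes with random values, can be sketched as follows (the rows are invented for illustration):

```python
import random
import secrets

# Invented (entity_id, hash) rows standing in for password_history.
rows = [(1, "h1"), (2, "h2"), (3, "h3"), (4, "h4")]

# Shuffle the entity_id column independently of the rest of the rows.
ids = [entity_id for entity_id, _ in rows]
random.shuffle(ids)

# Replace every hash with a random value; uniqueness is not required.
scrambled = [(new_id, secrets.token_hex(16)) for new_id in ids]
```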

1.3   Quarantines

Table: entity_quarantine
  • entity_id: Shuffle.
  • description: Empty or replace with a string value for all rows. Uniqueness is not required.

1.4   Postal address shuffling

Table: entity_address
  • entity_id: Shuffle.

2   The harder things

Some things must be handled with a bit more care.

We get a lot of information from other source systems, and each imported value is marked with its source system. If this is not taken into consideration when shuffling values around, pandemonium will ensue.

2.1   Contact info shuffling

Table: entity_contact_info
  • For each contact_type in each source_system, shuffle the corresponding segment of the entity_id column.
  • We want shuffling rather than synthesis, since adequately malformed contact info would be hard to produce artificially.
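
A sketch of such segment-wise shuffling on invented rows and column layout: entity_id only moves within its (source_system, contact_type) group, so a value never crosses systems or types.

```python
import random
from collections import defaultdict

# Invented rows: (entity_id, source_system, contact_type, contact_value).
rows = [
    (1, "SAP", "PHONE", "22 85 50 50"),
    (2, "SAP", "PHONE", "930 86 500"),
    (3, "FS",  "PHONE", "21 07 70 00"),
    (4, "FS",  "EMAIL", "x@example.org"),
]

# Group row indices by (source_system, contact_type), then shuffle the
# entity_id column within each group only.
groups = defaultdict(list)
for idx, (_, src, ctype, _value) in enumerate(rows):
    groups[(src, ctype)].append(idx)

shuffled = list(rows)
for indices in groups.values():
    ids = [rows[i][0] for i in indices]
    random.shuffle(ids)
    for i, new_id in zip(indices, ids):
        shuffled[i] = (new_id,) + rows[i][1:]
```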

2.2   External id scrambling

External ids involve dependencies between various tables, but it is possible to shuffle them with a few precautions.

2.2.1   The data selection

Select the data by joining the following tables from the database:

SELECT *
  FROM person_info pi
  JOIN entity_external_id eid
    ON pi.person_id = eid.entity_id;

2.2.2   The shuffling scheme

The data should subsequently be shuffled according to the following categories:

Column          Category
--------------  --------
entity_type     102
person_id       A
export_id       A
birth_date      B
gender          B
description     B
deceased_date   B
entity_id       A
entity_type     102
id_type         Y
source_system   X
external_id     B
  • For each Y in each X, shuffle all relations between A's and B's.
  • The columns with an integer category (102) remain unchanged.
  • Do not shuffle A or B internally! This internal coherence must be kept intact. For example, if external ID is a Norwegian fnr, then both gender and birth date are dependencies of the fnr value.
  • Cross shuffling between different id_types and source_systems must not occur. For example: An fnr cannot be labeled a student number, and a student number cannot be provided by SAP.
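
One way to implement this, sketched on invented rows: keep the A columns in place and shuffle the B tuples as whole units within each (source_system, id_type) group. This re-pairs A's and B's without ever breaking the internal coherence of either side.

```python
import random
from collections import defaultdict

# Invented joined rows, reduced to the essentials. A-side: person_id;
# B-side: birth_date, gender, external_id; grouping keys:
# X = source_system, Y = id_type.
rows = [
    {"person_id": 1, "source_system": "FS", "id_type": "NO_BIRTHNO",
     "birth_date": "1980-01-01", "gender": "M", "external_id": "01018012345"},
    {"person_id": 2, "source_system": "FS", "id_type": "NO_BIRTHNO",
     "birth_date": "1985-05-05", "gender": "F", "external_id": "05058554321"},
    {"person_id": 3, "source_system": "SAP", "id_type": "NO_BIRTHNO",
     "birth_date": "1990-09-09", "gender": "F", "external_id": "09099098765"},
]

B_COLS = ("birth_date", "gender", "external_id")

# Group row indices by (source_system, id_type); within each group,
# shuffle the B tuples as units while the A side stays put.
groups = defaultdict(list)
for idx, row in enumerate(rows):
    groups[(row["source_system"], row["id_type"])].append(idx)

for indices in groups.values():
    b_tuples = [tuple(rows[i][c] for c in B_COLS) for i in indices]
    random.shuffle(b_tuples)
    for i, b in zip(indices, b_tuples):
        rows[i].update(dict(zip(B_COLS, b)))
```

Since the SAP group has a single row here, its fnr, gender, and birth date cannot move; the two FS rows may swap B tuples, but gender and birth date always travel together with their fnr.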

2.2.3   The desired result

The tables entity_external_id and person_info must reflect the changes described above.

2.3   People

Names are possibly the most important piece of information that needs scrambling. There are two possible courses of action. One is to scramble the connection between entity_id and names. We believe this is inadequate.

A better solution is synthetic names. This will greatly improve the anonymization of the database, but requires a source of plausible names. Is the National Registry a possible source of random names?

2.3.1   Names

Table: person_name

  • For each person_id, replace all names with synthetic names. First name, last name and combined name.
  • We want some names with special characters like Æ, Ø, Å, ß et cetera.
  • We want mostly coherence for each person. That is, first name Peder and last name Ås do not combine into Marte Kirkerud, and do not become something completely different in another source system.
  • Optional: if a modicum of noise can be added, that would be nice! Example: the first name is Petter in one system and Peter-Elias in another (for a handful of people).
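
A sketch of coherent synthetic-name generation with a small noise rate. The name pools and the function are invented for illustration; a real run might draw on National Registry data instead.

```python
import random

# Invented name pools, including names with special characters.
FIRST = ["Peder", "Marte", "Åse", "Ørjan", "Sølvi", "Jürgen"]
LAST = ["Ås", "Kirkerud", "Bø", "Strauß", "Nordmann"]

def synthetic_names(person_ids, noise_rate=0.05, seed=None):
    """Give each person one coherent synthetic name; a small fraction
    gets a hyphenated first-name variant to simulate cross-system noise."""
    rng = random.Random(seed)
    names = {}
    for pid in person_ids:
        first, last = rng.choice(FIRST), rng.choice(LAST)
        variant = f"{first} {last}"
        if rng.random() < noise_rate:
            # The "other system" sees a slightly different first name.
            variant = f"{first}-{rng.choice(FIRST)} {last}"
        names[pid] = {"first": first, "last": last,
                      "full": f"{first} {last}", "variant": variant}
    return names
```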

2.3.2   Account names

Table: entity_name

  • bootstrap_account and bootstrap_group are left untouched.
  • Column values of entity_name are replaced with unique random strings.
  • Keep them short, please!
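
A sketch of generating short, unique random strings; the function name and the six-character length are assumptions:

```python
import random
import string

def unique_short_names(count, length=6, seed=None):
    """Generate `count` unique, short, random lowercase account names."""
    rng = random.Random(seed)
    names = set()
    # The set deduplicates collisions; with 26^6 possible strings,
    # collisions are rare for realistic counts.
    while len(names) < count:
        names.add("".join(rng.choices(string.ascii_lowercase, k=length)))
    return sorted(names)
```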

2.3.3   Email addresses

Table: email_address

  • Shuffle local_part.
  • Shuffle change_date.
  • Empty expire_date.

2.3.4   Groups

Table: group_member

Do not touch this table. The other proposed actions obviate the need to scramble it, and doing so would cause havoc.

Table: posix_user

  • Shuffle posix_uid.
  • Shuffle gid.
  • Shuffle shell.
  • Empty gecos.

3   Access to real data

  • All imports from SAP and FS will break with the proposed scheme.
  • Sometimes, weird user issues meander all the way past Houston and Drift, and end up at our table. Such bugs cannot be fixed without diving into the real data.
  • Development and testing will be significantly harder without access to real data in some steps.
  • More bugs will be introduced and remain undiscovered.

Corollary: We must retain access to the true database for some development purposes.

The above is probably not an exhaustive list of such purposes. However, for most day-to-day development, the proposed scheme is expected to satisfy our needs.

4   What we really want

All we really need is synthetic data from SAP and FS. This would give us truly anonymous data, all imports would work as intended, the overall test environment would correspond to specifications, and there would be no added workload or maintenance.

By ae
Published 13 Jan. 2020 10:01