1 Simple improvements
These are the simple things that can be done without dealing with chained logic. This should be done irrespective of other considerations.
1.1 Size
Perform cascade deletion on persons.
Significantly reduce the number of entities related to persons of the database. We can probably make do with 10-25% of the original. We require a representative and coherent database which reflects the complexity of our data.
Remove auditlog.
1.2 Password
- Table: password_history
- entity_id: shuffle
- hash: empty or replace with a random hash for all rows. Uniqueness is not required.
- Table: account_authentication
- For each method, shuffle account_id.
1.3 Quarantines
- Table: entity_quarantine
- entity_id: Shuffle.
- description: Empty or replace with a string value for all rows. Uniqueness is not required.
1.4 Postal address shuffling
- Table: entity_address
- entity_id: shuffle.
2 The harder things
Some things must be handled with a bit more care.
We get a lot of information from other source systems. This is marked as the source system for various values. If this is not taken into consideration when shuffling values around, pandemonium will ensue.
2.1 Contact info shuffling
- Table: entity_contact_info
- For each contact_type in each source_system, shuffle the corresponding segment of the entity_id column.
- We want shuffling rather than synthetitzation, since adequately malformed contact info will be hard to produce artificially.
2.2 External id scrambling
External ids involve dependencies between various tables, but it is possible to shuffle them with a few precautions.
2.2.1 The data selection
With the following tables joined from the database:
SELECT * FROM person_info pi JOIN entity_external_id eid ON pi.person_id = eid.entity_id;
2.2.2 The shuffling scheme
The data should subsequently be shuffled according to the following categories:
Column | entity_type | person_id | export_id | birth_date | gender | description | deceased_date | entity_id | entity_type | id_type | source_system | external_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Category | 102 | A | A | B | B | B | B | A | 102 | Y | X | B |
- For each Y in each X, shuffle all relations between A's and B's.
- The columns with an integer value remains unchanged
- Do not shuffle A or B internally! This internal coherence must be kept intact. For example, if external ID is a Norwegian fnr, then both gender and birth date are dependencies of the fnr value.
- Cross shuffling between different id_types and source_systems must not occur. For example: An fnr cannot be labeled a student number, and a student number cannot be provided by SAP.
2.2.3 The desired result
The tables entity_external_id and person_id must reflect changes as described above.
2.3 People
Names is possible the most important piece of information that needs scrambling. There are two possible courses of action. One is to scramble the connection between entity_id and names. We believe this is inadequate.
A better solution is synthetic names. This will greatly anonymize the database, but requires synthetic names. Is the National Registry a possible source of random names?
2.3.1 Names
Table: person_name
- For each person_id, replace all names with synthetic names. First name, last name and combined name.
- We want some names with special characters like Æ, Ø, Å, ß et cetera.
- We want mostly coherence for each person. That is, first name Peder and last name Ås does not combine to Marte Kirkerud, or become something completely different in another file system.
- optional However, if a modicum of noise can be added, that would be nice! Example: First name is Petter in one system, and Peter-Elias in another (for a handful of people).
2.3.2 Account names
Table: entity_name
- bootstrap_account and bootstrap_group are left untouched.
- column values of entity_name is replaced with unique random strings.
- Keep them short, please!
2.3.3 Email addresses
Table: email_address
- Shuffle local_part.
- Shuffle change_date.
- Empty expire_date.
2.3.4 Groups
Table: group_member
Do not touch this table. The other proposed actions obviates the need to scramble this table, and doing it will cause havoc.
Table: posix_user
- Shuffle posix_uid.
- Shuffle gid.
- Shuffle shell.
- Empty gecos.
3 Access to real data
- All imports from SAP and FS will break with the proposed scheme.
- Sometimes, weird user issues meander all the way past Houston and Drift, and ends up at our table. Bug fixing cannot be done without diving into the real data.
- Development and testing will be significantly harder without access to real data in some steps.
- More undiscovered bugs will be created.
Corollary: We must retain access to the true database for some development purposes.
The above is probably not an exhaustive list of such purposes. However, for most day-to-day development, the proposed scheme is expected to satisfy our needs.
4 What we really want
All we really need is synthetic data from SAP and FS. This give us truly anonymous data, all imports will work as intended, the overall test environment will correspond to specifications and there will be no added work loads or maintenance.