---
title: "Transformation"
---


TRANSFORM picks up where EXTRACT finished, using the extracted CSV files as its source.

```{dot}
digraph transform {
    // Graph layout - top to bottom
    rankdir = TB;
    nodesep = 1.0;
    splines = false;
    
    // Node styles
    node [shape=box, style="filled,rounded", fillcolor="#f0f0f0", fontname="Arial"];
    edge [fontname="Arial"];
    
    // Nodes
    source_files [label="CSV Files\n(./{hostkeys}/extract)", shape=folder, fillcolor="#ffdead"];
    config [label="Configuration"];
    validate_data [label="Validating Data"];
    clean_data [label="Cleaning Data"];
    add_department [label="Adding 'Department' to Nodes"];
    anonymise_data [label="𝗔𝗡𝗢𝗡𝗬𝗠𝗜𝗦𝗜𝗡𝗚\nPersonal Data", style="filled,bold", fillcolor="#ffcccc"];
    augment_rooms [label="Augmenting Rooms\nwith Archibus Data"];
    create_relationships [label="Creating\nRelationship Tables"];
    processed_files [label="Processed CSV Files", shape=folder, fillcolor="#ffdead"];
    
    // Edges and labels
    source_files -> validate_data;
    validate_data -> clean_data;
    clean_data -> add_department;
    add_department -> anonymise_data;
    anonymise_data -> augment_rooms;
    augment_rooms -> create_relationships;
    create_relationships -> processed_files;
    
    // Configuration connections
    config -> validate_data;
    config -> clean_data;
    config -> add_department;
    config -> anonymise_data;
    config -> augment_rooms;
    config -> create_relationships;
    
    // Positioning config in a separate column
    { rank = same; config; }
}

```

Configuration allows the user to control which nodes and relationships are included and how they are processed.  There are options to specify validation, cleaning, data linking, anonymisation and relationship details. 

It is also possible to specify datatypes.  Unless a value is well-formatted or its type is pre-determined, Neo4j assumes the `string` datatype.  The configuration lets the user assign explicit datatypes such as date, time, point and Boolean.
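As an illustration of datatype configuration, the following is a minimal sketch, not the pipeline's actual code. The `CONFIG` layout, the `CONVERTERS` table, the `apply_datatypes` helper and the sample row are all hypothetical. It converts CSV text to Python values on the assumption that typed values are wanted before load; how the pipeline really passes types to Neo4j is not shown here.

```python
from datetime import date

# Hypothetical configuration layout; the key names are illustrative only.
CONFIG = {
    "nodes": {
        "Room": {
            "include": True,
            "datatypes": {"capacity": "int", "accessible": "boolean", "opened": "date"},
        },
    },
}

# Converters from CSV text to Python values, keyed by the datatype name used in CONFIG.
CONVERTERS = {
    "int": int,
    "float": float,
    "boolean": lambda text: text.strip().lower() in {"true", "1", "yes"},
    "date": date.fromisoformat,
}


def apply_datatypes(row, datatypes):
    """Return a copy of row with configured columns converted; other columns stay strings."""
    typed = dict(row)
    for column, type_name in datatypes.items():
        if typed.get(column):  # leave missing or empty values untouched
            typed[column] = CONVERTERS[type_name](typed[column])
    return typed


# Made-up sample row, as it would come out of csv.DictReader (all values are text).
row = {"name": "B12", "capacity": "40", "accessible": "true", "opened": "2019-09-01"}
typed_row = apply_datatypes(row, CONFIG["nodes"]["Room"]["datatypes"])
print(typed_row)
```

Columns without an entry in `datatypes`, such as `name` here, are left as strings, which mirrors the default described above.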


## All data

1. **Validation** - basic validation of the data is performed.  Validation is extensible, so further checks can be added as requirements are identified.
2. **Cleaning** - basic cleaning of all data is performed using regular expressions, for example stripping surrounding whitespace and removing non-printable characters.  The cleaning functionality is also extensible.
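The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's implementation: `validate_row`, `clean_value` and the column names are hypothetical, and "non-printable" is taken here to mean Unicode control characters other than tab, newline and carriage return.

```python
import re

# Control characters (C0, DEL, C1) except tab, newline and carriage return,
# which are treated as whitespace below. This definition is an assumption.
NON_PRINTABLE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")
WHITESPACE_RUN = re.compile(r"\s+")


def validate_row(row, required):
    """Return a list of problems for one row; an empty list means the row passed."""
    return [
        f"missing value for '{column}'"
        for column in required
        if not (row.get(column) or "").strip()
    ]


def clean_value(value):
    """Remove non-printable characters, collapse whitespace runs, strip both ends."""
    value = NON_PRINTABLE.sub("", value)
    value = WHITESPACE_RUN.sub(" ", value)
    return value.strip()


# Made-up sample row with a blank identifier and a dirty name.
raw = {"room_id": " ", "name": "  Room\t B12\x00 \n"}
problems = validate_row(raw, required=["room_id", "name"])
cleaned = {column: clean_value(text) for column, text in raw.items()}
print(problems)
print(cleaned)
```

Further checks or cleaning rules would be added as extra functions or patterns alongside these, which is one way to keep both steps extensible.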

Once the data is clean, the transformation proper starts.

## Nodes and relationships

1. **Add Organisational Unit** - where appropriate, the University Organisational Unit (e.g. College, School, Department) is added to the node.  It is picked up as a node property during load.
2. **Data Augmentation** - Room data is augmented with additional properties from the location master database, including latitude, longitude and floor area in square metres.  Data augmentation is extensible.
3. **Anonymisation** - Personal data is anonymised.  An anonymisation function was developed to remove and replace any personally identifiable information (PII).  The pipeline extracts only minimal PII, and what it does extract is anonymised.  The function also adds fake emails.  See the [Appendix](appendix-anonymise.qmd) for additional details.
4. **Relationships** - Relationships are extracted as specified in the configuration, together with any optional relationship properties.
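The augmentation, anonymisation and relationship steps above can be sketched together as below. This is an illustration only: every function name, column name and sample value is hypothetical. In particular, the salted SHA-256 token is a stand-in (strictly a pseudonymisation technique) and is not the anonymisation function described in the appendix.

```python
import hashlib


def augment(rooms, location_master):
    """Left join: copy extra properties onto each room, matched on 'room_id'."""
    return [{**room, **location_master.get(room["room_id"], {})} for room in rooms]


def pseudonym(value, salt):
    """Return a stable token; the same value and salt always give the same token."""
    return hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()[:16]


def anonymise(person, salt):
    """Keep only non-identifying fields, plus a pseudonymous id and a fake email."""
    token = pseudonym(person["person_id"], salt)
    return {
        "person_id": token,
        "org_unit": person["org_unit"],
        "room_id": person["room_id"],
        "email": f"{token}@example.com",  # example.com is reserved for documentation
    }


def relationship_rows(rows, start_key, end_key, rel_type, properties=()):
    """Build one relationship-table row per source row, with optional properties."""
    return [
        {"start": r[start_key], "end": r[end_key], "type": rel_type,
         **{p: r[p] for p in properties}}
        for r in rows
    ]


# Made-up sample data.
rooms = [{"room_id": "R1", "name": "B12"}]
location_master = {"R1": {"latitude": 0.0, "longitude": 0.0, "area_m2": 42.5}}
people = [{"person_id": "p1", "name": "Ada Example", "email": "ada@uni.example",
           "org_unit": "School of Examples", "room_id": "R1"}]

augmented_rooms = augment(rooms, location_master)
anonymised_people = [anonymise(p, salt="demo-salt") for p in people]
occupies = relationship_rows(anonymised_people, "person_id", "room_id", "OCCUPIES")
```

Because the relationship rows are built from the already anonymised records, the `start` key in `occupies` carries the token rather than the original identifier, so node and relationship tables stay consistent with each other.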