-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03-03-config.qmd
159 lines (142 loc) · 14.8 KB
/
03-03-config.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
---
title: "Configuration and Logging"
---
Configuration and logging are essential components of the ETL pipeline. Config allows the user to manage different aspects of the ETL pipeline, while logging provides a record of the pipeline's execution. They emerged from initial design and from discovering during development.
### Main Configuration options
* Configuration parameters are centralised in Python scripts.
* The design primarily aims for automatic and dynamic operation with well-structured data, but includes override options.
* A YAML file ([Appendix-config](appendix-config.qmd)) holds configuration options, including general settings for filtering data extraction, dynamic folder/filepath creation, and secure credential storage.
* Config also controls options for validation, data augmentation and graph structures (nodes, relationships) to be created.
### Logging
* Each module has its own log file with customisable log level (DEBUG, INFO, WARNING, ERROR, CRITICAL).
* Timing function which tracks and stores various execution and elapsed times with a view to optimising performance or identifying bottlenecks.
#### Example Extract Log
<pre style="max-height: 200px; overflow-y: auto;">
Setting up logger: extract
Logger extract setup completed with 1 handlers.
Setting up logger: process
Logger process setup completed with 1 handlers.
Setting up logger: load
Logger load setup completed with 1 handlers.
2024-07-04 11:26:01,124 - INFO - extract_main - Starting data extraction process
2024-07-04 11:26:01,124 - INFO - extract_main - Output Directory: C:\Users\pho-lovehagen\OneDrive - UWE Bristol\Personal\data-science\graph-project\code\pipe3\INB112\extract
2024-07-04 11:26:01,124 - INFO - extract_main - Hostkeys: ['INB112']
2024-07-04 11:26:01,124 - INFO - extract_main - Chunksize: 20000
2024-07-04 11:26:01,124 - INFO - extract_data - Starting data extraction process...
2024-07-04 11:26:01,284 - INFO - Connected to SQL Server database.
2024-07-04 11:26:01,284 - INFO - temp_table_loader - Creating global temporary tables...
2024-07-04 11:26:07,878 - INFO - extract_sql_file - Processing SQL file: node-activity-by-pos-temp.sql
2024-07-04 11:26:07,878 - INFO - extract_sql_file - Extracting data for node-activity-by-pos-temp with hostkey: INB112
C:\Users\pho-lovehagen\OneDrive - UWE Bristol\Personal\data-science\graph-project\code\pipe3\extract_sql_file.py:44: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
for chunk in pd.read_sql_query(query, conn, chunksize=CHUNK_SIZE):
2024-07-04 11:26:07,960 - INFO - extract_sql_file - Saved 631 rows to C:\Users\pho-lovehagen\OneDrive - UWE Bristol\Personal\data-science\graph-project\code\pipe3\INB112\extract/node-activity-by-pos-temp_INB112_1.csv
2024-07-04 11:26:07,966 - INFO - extract_sql_file - Query for node-activity-by-pos-temp, INB112 took 0.09 seconds
2024-07-04 11:26:07,966 - INFO - extract_sql_file - Processing SQL file: node-activityType-by-pos-temp.sql
2024-07-04 11:26:07,968 - INFO - extract_sql_file - Extracting data for node-activityType-by-pos-temp with hostkey: INB112
2024-07-04 11:26:07,985 - INFO - extract_sql_file - Saved 16 rows to C:\Users\pho-lovehagen\OneDrive - UWE Bristol\Personal\data-science\graph-project\code\pipe3\INB112\extract/node-activityType-by-pos-temp_INB112_1.csv
2024-07-04 11:26:07,985 - INFO - extract_sql_file - Query for node-activityType-by-pos-temp, INB112 took 0.02 seconds
2024-07-04 11:26:07,985 - INFO - extract_sql_file - Processing SQL file: node-dept-all.sql
2024-07-04 11:26:07,987 - INFO - extract_sql_file - Extracting data for node-dept-all with hostkey: INB112
2024-07-04 11:26:07,990 - INFO - extract_sql_file - Saved 24 rows to C:\Users\pho-lovehagen\OneDrive - UWE Bristol\Personal\data-science\graph-project\code\pipe3\INB112\extract/node-dept-all_INB112_1.csv
2024-07-04 11:26:07,995 - INFO - extract_sql_file - Query for node-dept-all, INB112 took 0.01 seconds
2024-07-04 11:26:07,995 - INFO - extract_sql_file - Processing SQL file: node-module-by-pos-temp.sql
2024-07-04 11:26:07,995 - INFO - extract_sql_file - Extracting data for node-module-by-pos-temp with hostkey: INB112
2024-07-04 11:26:08,009 - INFO - extract_sql_file - Saved 42 rows to C:\Users\pho-lovehagen\OneDrive - UWE Bristol\Personal\data-science\graph-project\code\pipe3\INB112\extract/node-module-by-pos-temp_INB112_1.csv
2024-07-04 11:26:08,009 - INFO - extract_sql_file - Query for node-module-by-pos-temp, INB112 took 0.01 seconds
2024-07-04 11:26:08,009 - INFO - extract_sql_file - Processing SQL file: node-pos-by-pos-temp.sql
2024-07-04 11:26:08,009 - INFO - extract_sql_file - Extracting data for node-pos-by-pos-temp with hostkey: INB112
2024-07-04 11:26:08,020 - INFO - extract_sql_file - Saved 8 rows to C:\Users\pho-lovehagen\OneDrive - UWE Bristol\Personal\data-science\graph-project\code\pipe3\INB112\extract/node-pos-by-pos-temp_INB112_1.csv
2024-07-04 11:26:08,020 - INFO - extract_sql_file - Query for node-pos-by-pos-temp, INB112 took 0.01 seconds
2024-07-04 11:26:08,020 - INFO - extract_sql_file - Processing SQL file: node-room-by-pos-temp.sql
2024-07-04 11:26:08,020 - INFO - extract_sql_file - Extracting data for node-room-by-pos-temp with hostkey: INB112
2024-07-04 11:26:08,037 - INFO - extract_sql_file - Saved 44 rows to C:\Users\pho-lovehagen\OneDrive - UWE Bristol\Personal\data-science\graph-project\code\pipe3\INB112\extract/node-room-by-pos-temp_INB112_1.csv
2024-07-04 11:26:08,037 - INFO - extract_sql_file - Query for node-room-by-pos-temp, INB112 took 0.02 seconds
2024-07-04 11:26:08,037 - INFO - extract_sql_file - Processing SQL file: node-staff-by-pos-temp.sql
2024-07-04 11:26:08,037 - INFO - extract_sql_file - Extracting data for node-staff-by-pos-temp with hostkey: INB112
2024-07-04 11:26:08,055 - INFO - extract_sql_file - Saved 33 rows to C:\Users\pho-lovehagen\OneDrive - UWE Bristol\Personal\data-science\graph-project\code\pipe3\INB112\extract/node-staff-by-pos-temp_INB112_1.csv
2024-07-04 11:26:08,055 - INFO - extract_sql_file - Query for node-staff-by-pos-temp, INB112 took 0.02 seconds
2024-07-04 11:26:08,055 - INFO - extract_sql_file - Processing SQL file: node-student-by-pos-temp.sql
2024-07-04 11:26:08,055 - INFO - extract_sql_file - Extracting data for node-student-by-pos-temp with hostkey: INB112
2024-07-04 11:26:08,075 - INFO - extract_sql_file - Saved 206 rows to C:\Users\pho-lovehagen\OneDrive - UWE Bristol\Personal\data-science\graph-project\code\pipe3\INB112\extract/node-student-by-pos-temp_INB112_1.csv
2024-07-04 11:26:08,075 - INFO - extract_sql_file - Query for node-student-by-pos-temp, INB112 took 0.02 seconds
2024-07-04 11:26:08,075 - INFO - extract_sql_file - Processing SQL file: rel-activity-module-by-pos-temp.sql
2024-07-04 11:26:08,075 - INFO - extract_sql_file - Extracting data for rel-activity-module-by-pos-temp with hostkey: INB112
2024-07-04 11:26:08,085 - INFO - extract_sql_file - Saved 168 rows to C:\Users\pho-lovehagen\OneDrive - UWE Bristol\Personal\data-science\graph-project\code\pipe3\INB112\extract/rel-activity-module-by-pos-temp_INB112_1.csv
2024-07-04 11:26:08,085 - INFO - extract_sql_file - Query for rel-activity-module-by-pos-temp, INB112 took 0.01 seconds
2024-07-04 11:26:08,085 - INFO - extract_sql_file - Processing SQL file: rel-activity-room-by-pos-temp.sql
2024-07-04 11:26:08,085 - INFO - extract_sql_file - Extracting data for rel-activity-room-by-pos-temp with hostkey: INB112
2024-07-04 11:26:08,105 - INFO - extract_sql_file - Saved 611 rows to C:\Users\pho-lovehagen\OneDrive - UWE Bristol\Personal\data-science\graph-project\code\pipe3\INB112\extract/rel-activity-room-by-pos-temp_INB112_1.csv
2024-07-04 11:26:08,105 - INFO - extract_sql_file - Query for rel-activity-room-by-pos-temp, INB112 took 0.02 seconds
2024-07-04 11:26:08,105 - INFO - extract_sql_file - Processing SQL file: rel-activity-staff-by-pos-temp.sql
2024-07-04 11:26:08,105 - INFO - extract_sql_file - Extracting data for rel-activity-staff-by-pos-temp with hostkey: INB112
2024-07-04 11:26:08,120 - INFO - extract_sql_file - Saved 868 rows to C:\Users\pho-lovehagen\OneDrive - UWE Bristol\Personal\data-science\graph-project\code\pipe3\INB112\extract/rel-activity-staff-by-pos-temp_INB112_1.csv
2024-07-04 11:26:08,120 - INFO - extract_sql_file - Query for rel-activity-staff-by-pos-temp, INB112 took 0.02 seconds
2024-07-04 11:26:08,125 - INFO - extract_sql_file - Processing SQL file: rel-activity-student-by-pos-temp.sql
2024-07-04 11:26:08,125 - INFO - extract_sql_file - Extracting data for rel-activity-student-by-pos-temp with hostkey: INB112
2024-07-04 11:26:08,245 - INFO - extract_sql_file - Saved 13423 rows to C:\Users\pho-lovehagen\OneDrive - UWE Bristol\Personal\data-science\graph-project\code\pipe3\INB112\extract/rel-activity-student-by-pos-temp_INB112_1.csv
2024-07-04 11:26:08,245 - INFO - extract_sql_file - Query for rel-activity-student-by-pos-temp, INB112 took 0.12 seconds
2024-07-04 11:26:08,255 - INFO - extract_sql_file - Processing SQL file: rel-mod-pos-by-pos-temp.sql
2024-07-04 11:26:08,255 - INFO - extract_sql_file - Extracting data for rel-mod-pos-by-pos-temp with hostkey: INB112
2024-07-04 11:26:08,255 - INFO - extract_sql_file - Saved 82 rows to C:\Users\pho-lovehagen\OneDrive - UWE Bristol\Personal\data-science\graph-project\code\pipe3\INB112\extract/rel-mod-pos-by-pos-temp_INB112_1.csv
2024-07-04 11:26:08,255 - INFO - extract_sql_file - Query for rel-mod-pos-by-pos-temp, INB112 took 0.00 seconds
2024-07-04 11:26:08,265 - INFO - extract_data - Data extraction completed.
2024-07-04 11:26:08,285 - INFO - extract_main - Data extraction completed.
2024-07-04 11:26:08,285 - INFO - extract_main - Extraction Time Summary:
2024-07-04 11:26:08,285 - INFO - extract_main - Function load_temp_tables took 6.59 seconds
2024-07-04 11:26:08,285 - INFO - extract_main - Function main took 7.17 seconds
</pre>
#### Example Google Drive Log
<pre style="max-height: 200px; overflow-y: auto;">
2024-07-11 13:28:22,339 - INFO - gdrive_upload - Starting Google Drive upload process.
2024-07-11 13:28:22,789 - INFO - gdrive_upload - Found existing folder: INB112
2024-07-11 13:28:23,115 - INFO - gdrive_upload - Found existing folder: node
2024-07-11 13:28:23,366 - INFO - gdrive_upload - Found existing folder: relationship
2024-07-11 13:28:23,576 - INFO - gdrive_upload - Found existing folder: node
2024-07-11 13:28:24,810 - INFO - gdrive_upload - File node-activity-processed.csv uploaded to Google Drive folder ID: 1Rc3vQCF6CwxV3yNjfUTWXv61BgYD1j_3
2024-07-11 13:28:25,046 - INFO - gdrive_upload - Found existing folder: node
2024-07-11 13:28:26,170 - INFO - gdrive_upload - File node-activityType-processed.csv uploaded to Google Drive folder ID: 1Rc3vQCF6CwxV3yNjfUTWXv61BgYD1j_3
2024-07-11 13:28:26,394 - INFO - gdrive_upload - Found existing folder: node
2024-07-11 13:28:27,495 - INFO - gdrive_upload - File node-department-processed.csv uploaded to Google Drive folder ID: 1Rc3vQCF6CwxV3yNjfUTWXv61BgYD1j_3
2024-07-11 13:28:27,715 - INFO - gdrive_upload - Found existing folder: node
2024-07-11 13:28:28,970 - INFO - gdrive_upload - File node-module-processed.csv uploaded to Google Drive folder ID: 1Rc3vQCF6CwxV3yNjfUTWXv61BgYD1j_3
2024-07-11 13:28:29,206 - INFO - gdrive_upload - Found existing folder: node
2024-07-11 13:28:30,294 - INFO - gdrive_upload - File node-programme-processed.csv uploaded to Google Drive folder ID: 1Rc3vQCF6CwxV3yNjfUTWXv61BgYD1j_3
2024-07-11 13:28:30,515 - INFO - gdrive_upload - Found existing folder: node
2024-07-11 13:28:31,595 - INFO - gdrive_upload - File node-room-processed.csv uploaded to Google Drive folder ID: 1Rc3vQCF6CwxV3yNjfUTWXv61BgYD1j_3
2024-07-11 13:28:31,846 - INFO - gdrive_upload - Found existing folder: node
2024-07-11 13:28:32,895 - INFO - gdrive_upload - File node-staff-processed.csv uploaded to Google Drive folder ID: 1Rc3vQCF6CwxV3yNjfUTWXv61BgYD1j_3
2024-07-11 13:28:33,128 - INFO - gdrive_upload - Found existing folder: node
2024-07-11 13:28:34,220 - INFO - gdrive_upload - File node-student-processed.csv uploaded to Google Drive folder ID: 1Rc3vQCF6CwxV3yNjfUTWXv61BgYD1j_3
2024-07-11 13:28:34,220 - WARNING - gdrive_upload - Skipping file with unrecognized prefix: process.log
2024-07-11 13:28:34,416 - INFO - gdrive_upload - Found existing folder: relationship
2024-07-11 13:28:35,565 - INFO - gdrive_upload - File rel-activity_activityType-processed.csv uploaded to Google Drive folder ID: 1w_ea6ETzRcdYz71crLxL9khjLrEfcbuH
</pre>
#### Example Process Log
<pre style="max-height: 200px; overflow-y: auto;">
2024-07-11 13:31:24,284 - INFO - process_utils - Processing relationship data
2024-07-11 13:31:24,284 - INFO - process_utils - Config: {'filename_pattern': 'rel-mod-pos-by-pos-temp_INB112*.csv', 'node1_col': 'modSplusID', 'node2_col': 'posSplusID', 'relationship': 'BELONGS_TO', 'properties': ['modType']}
2024-07-11 13:31:24,285 - INFO - process_utils - Saving processed relationship file for: module_programme
2024-07-11 13:31:24,289 - INFO - process_main - Function process_department took 0.01 seconds
2024-07-11 13:31:24,292 - INFO - process_main - Function process_module took 0.01 seconds
2024-07-11 13:31:24,292 - INFO - process_main - Function process_room took 0.02 seconds
2024-07-11 13:31:24,294 - INFO - process_main - Function process_programme took 0.01 seconds
2024-07-11 13:31:24,294 - INFO - process_main - Function process_activityType took 0.02 seconds
2024-07-11 13:31:24,294 - INFO - process_main - Function process_staff took 1.82 seconds
2024-07-11 13:31:24,294 - INFO - process_main - Function process_student took 8.76 seconds
2024-07-11 13:31:24,294 - INFO - process_main - Function process_activity took 0.10 seconds
</pre>
#### Example Load Log
<pre style="max-height: 200px; overflow-y: auto;">
2024-07-11 16:28:45,220 - INFO - load_main - Neo4j driver closed.
2024-07-11 16:28:45,220 - INFO - load_main - Total execution time: 0.51 seconds
2024-07-11 16:33:31,166 - INFO - connect_to_neo4j_db - Connected to Neo4j database successfully! Driver: <neo4j._sync.driver.Neo4jDriver object at 0x0000023AAB4FA5D0>
2024-07-11 16:33:31,601 - INFO - load_relationships - Found 1 relationship files in Google Drive.
2024-07-11 16:33:31,604 - INFO - load_relationships - Processing file: rel-activity_room-processed.csv (ID: 1fwUicLoaJ5YJk9tZOpDwyNMFAFWh2Q5v)
2024-07-11 16:33:31,605 - INFO - google_drive_utils - Downloading CSV file 1fwUicLoaJ5YJk9tZOpDwyNMFAFWh2Q5v from Google Drive.
2024-07-11 16:33:35,665 - INFO - google_drive_utils - Downloaded CSV file 1fwUicLoaJ5YJk9tZOpDwyNMFAFWh2Q5v.
2024-07-11 16:33:45,428 - INFO - load_relationships - Finished loading activity_room relationships:
2024-07-11 16:33:45,430 - INFO - load_relationships - Total rows processed: 611
2024-07-11 16:33:45,430 - INFO - load_relationships - Relationships created: 1
2024-07-11 16:33:45,431 - INFO - load_main - Neo4j driver closed.
2024-07-11 16:33:45,432 - INFO - load_main - Total execution time: 14.34 seconds
</pre>