-
Notifications
You must be signed in to change notification settings - Fork 0
/
appendix-cypher6.qmd
527 lines (429 loc) · 19.7 KB
/
appendix-cypher6.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
---
title: "T: Soft Constraints"
lightbox:
match: auto
effect: fade
desc-position: bottom
loop: false
---
<br>
Soft constraints in a timetabling context are (strong) preferences. They should be generally met and only violated when absolutely necessary, although there is an argument for a sliding scale of soft constraint adherence.
For example, a member of staff may be unavailable on Fridays, generally, but at a push can be available. Other examples might include ensuring that students have an opportunity to eat lunch by ensuring at least 30 minutes free time between 12:00-14:00 or minimising travel between activities.
This page contains cypher queries that can be used to identify where a timetabling soft constraint has been violated.
Example soft constraints include:
- **Minimal Idle Time (aka no large gaps):** Minimise gaps in staff and student schedules (within reason).
- **Spread Activities (aka maximum consecutive hours):** Avoid clumping all activities for a student or staff member on one day.
- **Preferred Times:** Consider staff and student preferences for morning, afternoon, or evening classes
- **Travel Time:** Minimise the time students need to travel between consecutive classes (especially on large campuses), e.g. between building blocks or by lat/long
- **Lunch Breaks:** Ensure students have sufficient time for lunch breaks.
### Minimal idle time
Identifying time gaps between scheduled activities is very complex and requires several steps, clauses and comprehensions within a single query:
* *grouping and sorting* - activities are grouped by student and date, and sorted within groups to establish the sequence
* *gap calculation* - time difference between the end of one activity and the start of the next is calculated for consecutive pairs of activities within a day
* *filtering and aggregation* - gaps are filtered based on threshold (e.g. 6 hours) and then the maximum gap for each day is identified
* *data restructuring* - output is restructured.
#### Cypher logic for identifying gaps
The below is the logic for identifying gaps between activities, using an example student:
```{cypher}{.scroll-cypher}
MATCH (s:student)-[:ATTENDS]->(a:activity)
WHERE s.stuID_anon = "stu-10085720"
AND a.actStartDate = date("2022-10-03")
WITH s, a
ORDER BY a.actStartTime
// Collecting the start and end times of the activities
WITH s, collect({start: a.actStartTime, end: a.actEndTime}) AS times
// Calculating the gaps in minutes between consecutive activities
WITH s, times,
[i IN range(0, size(times)-2) |
duration.between(times[i].end, times[i+1].start).minutes / 60.0] AS gaps
// Finding the maximum gap
RETURN s.stuID_anon AS student, times, gaps, reduce(maxGap = 0.0, gap IN gaps | CASE WHEN gap > maxGap THEN gap ELSE maxGap END) AS maxGap
```
![Identifying blocks and time gaps](./images/cypher-maxGap-example.png)
#### Python code on graph
The below code cell returns the first 5 rows where a 6-hour maximum gap has been violated.
```{python}
from connect_to_neo4j_db import connect_to_neo4j
from neo4j import GraphDatabase
import pandas as pd
# connect to Neo4j
driver = connect_to_neo4j()
# session
session = driver.session()
# run query (modified RETURN clause)
query = """
// students with gaps between activities
MATCH (s:student)-[:ATTENDS]->(a:activity)
WITH s, a
ORDER BY s.stuFirstName_anon, a.actStartDate, a.actStartTime
// Group activities by student and date
WITH s, a.actStartDate AS date, collect({start: a.actStartTime, end: a.actEndTime, activity: a}) AS times
// calculating the gaps in hours between consecutive activities
WITH s, date, times,
[i IN range(0, size(times)-2) |
{gap: duration.between(times[i].end, times[i+1].start).minutes / 60.0,
firstActivity: times[i].activity,
secondActivity: times[i+1].activity}] AS gaps
// filtering gaps based on a threshold of 6 hours
WITH s, date, gaps
WHERE any(gapRecord IN gaps WHERE gapRecord.gap > 6.0)
// Finding the maximum gap that exceeds the threshold
WITH s, date, reduce(maxGap = {gap: 0.0, firstActivity: null, secondActivity: null}, gapRecord IN gaps |
CASE WHEN gapRecord.gap > maxGap.gap THEN gapRecord ELSE maxGap END) AS maxGapRecord
// group by student to remove duplications
WITH s.stuID_anon AS student,
collect({date: date,
activity1: maxGapRecord.firstActivity.actName,
activity1_time: maxGapRecord.firstActivity.actStartTime + "-" + maxGapRecord.firstActivity.actEndTime,
activity2: maxGapRecord.secondActivity.actName,
activity2_time: maxGapRecord.secondActivity.actStartTime + "-" + maxGapRecord.secondActivity.actEndTime,
maxGapInHours: maxGapRecord.gap}) AS gapRecords
// Unwind the collected records
UNWIND gapRecords AS record
// Returning the result
RETURN student,
record.date AS date,
record.activity1 AS activity1,
record.activity1_time AS activity1_time,
record.activity2 AS activity2,
record.activity2_time AS activity2_time,
record.maxGapInHours AS maxGapInHours
ORDER BY student, date
"""
print("Running query...\n")
result = session.run(query)
# list to hold records
records = []
for record in result:
records.append(record)
# df
df = pd.DataFrame(records, columns=["student", "date", "activity1", "activity1_time",
"activity2", "activity2_time", "maxGapInHours"])
# print
print(df.head(5))
# close session and driver
session.close()
driver.close()
```
#### Python to return total hours and block hours
Alternatively, we can amend the query to return, for each date and student combination, their total scheduled hours, maximum consecutive block hours and the number of activities within the continuous block.
```{python}
from connect_to_neo4j_db import connect_to_neo4j
from neo4j import GraphDatabase
import pandas as pd
# connect to Neo4j
driver = connect_to_neo4j()
# session
session = driver.session()
# run query (modified RETURN clause)
query = """
// calculates - total hours, max block hours, max block activities per day
// to be used for max block hours and max block activities per day
// logic - example
// matches specified student and all attended activities
MATCH (s:student {stuID_anon:"stu-10085720"})-[:ATTENDS]->(a:activity)
// sorts activities by start date and then by start time within each date
WITH s, a ORDER BY a.actStartDate, a.actStartTime
// calculates total hours spent on activities for each date
WITH s, a.actStartDate AS date,
SUM(a.actDurationInMinutes) / 60.0 AS totalHours,
// groups activities into blocks based on time overlaps
REDUCE(
blockInfo = [],
activity IN COLLECT(a)
| CASE
WHEN blockInfo = [] THEN [[activity]]
ELSE CASE
WHEN head(last(blockInfo)).actEndTime >= activity.actStartTime
THEN blockInfo[..-1] + [last(blockInfo) + activity]
ELSE blockInfo + [[activity]]
END
END
) AS blocks
// unwinds the list of blocks, processing each block individually
UNWIND blocks AS block
// calculates the total duration in hours for each block
WITH s, date, totalHours, blocks,
REDUCE(blockHours = 0.0, activity IN block | blockHours + activity.actDurationInMinutes) / 60.0 AS blockHours,
SIZE(block) AS blockActivities
// returns aggregated results
RETURN s.stuFullName_anon AS student, date, totalHours,
MAX(blockHours) AS blockHours,
MAX(blockActivities) AS blockActivities
ORDER BY date;
"""
print("Running query...\n")
result = session.run(query)
# list to hold records
records = []
for record in result:
records.append(record)
# df
df = pd.DataFrame(records, columns=["student", "date", "totalHours", "blockHours", "blockActivities"])
# print
print(df.head(5))
# close session and driver
session.close()
driver.close()
```
#### Example Use case - identifying students with 5+ hours in a single block
This query returns the first five rows where a student has more than 5 consecutive scheduled hours on a date.
```{python}
from connect_to_neo4j_db import connect_to_neo4j
from neo4j import GraphDatabase
import pandas as pd
# connect to Neo4j
driver = connect_to_neo4j()
# session
session = driver.session()
# run query (modified RETURN clause)
query = """
MATCH (s:student)-[:ATTENDS]->(a:activity)<-[:TEACHES]-(:staff) // Filter for teaching activities
WITH s, a ORDER BY a.actStartDate, a.actStartTime
WITH s, a.actStartDate AS date,
SUM(a.actDurationInMinutes) / 60.0 AS totalHours,
REDUCE(
blockInfo = [],
activity IN COLLECT(a)
| CASE
WHEN blockInfo = [] THEN [[activity]]
ELSE CASE
WHEN head(last(blockInfo)).actEndTime >= activity.actStartTime
THEN blockInfo[..-1] + [last(blockInfo) + activity]
ELSE blockInfo + [[activity]]
END
END
) AS blocks
UNWIND blocks AS block
WITH s, date, totalHours, blocks,
REDUCE(blockHours = 0.0, activity IN block | blockHours + activity.actDurationInMinutes) / 60.0 AS blockHours,
SIZE(block) AS blockActivities
WHERE blockHours > 5 // Filter for blocks with more than 5 hours
RETURN s.stuFullName_anon AS student, date, totalHours,
blockHours,
blockActivities
ORDER BY date;
"""
print("Running query...\n")
result = session.run(query)
# list to hold records
records = []
for record in result:
records.append(record)
# df
df = pd.DataFrame(records, columns=["student", "date", "totalHours", "blockHours", "blockActivities"])
# print
print(df.head(5))
# close session and driver
session.close()
driver.close()
```
#### Total hours per day
In contrast, calculating simple total hours per day is achieved by:
```{cypher}{.scroll-cypher}
MATCH (s:student )-[:ATTENDS]->(a:activity)
WITH s, a.actStartDate AS Date, SUM(a.actDurationInMinutes) / 60.0 AS totalHours
RETURN s.stuFullName_anon AS Student, Date, totalHours
ORDER BY Date;
```
![Total hours per day](./images/cypher-hoursPerDay.png)
#### Longest consecutive block of activities per day
We can use the earlier cypher logic to identify the longest consecutive block of activities for a student, or student on a day, etc.
```{cypher}{.scroll-cypher}
MATCH (s:student {stuFullName_anon: "Susan Lopez"})-[:ATTENDS]->(a:activity {actStartDate: date("2022-09-27")})
WITH s, a
ORDER BY a.actStartTime
WITH s, COLLECT(a) AS activities
WITH s, activities,
REDUCE(
state = {currentBlock: {duration: 0, start: null, end: null}, longestBlock: {duration: 0, start: null, end: null}},
activity IN activities |
CASE
WHEN state.currentBlock.end IS NULL OR
activity.actStartTime > state.currentBlock.end
THEN {
currentBlock: {
duration: activity.actDurationInMinutes,
start: activity.actStartTime,
end: activity.actEndTime
},
longestBlock:
CASE
WHEN activity.actDurationInMinutes > state.longestBlock.duration
THEN {
duration: activity.actDurationInMinutes,
start: activity.actStartTime,
end: activity.actEndTime
}
ELSE state.longestBlock
END
}
ELSE {
currentBlock: {
duration: (activity.actEndTime.hour * 60 + activity.actEndTime.minute) -
(state.currentBlock.start.hour * 60 + state.currentBlock.start.minute),
start: state.currentBlock.start,
end: activity.actEndTime
},
longestBlock:
CASE
WHEN ((activity.actEndTime.hour * 60 + activity.actEndTime.minute) -
(state.currentBlock.start.hour * 60 + state.currentBlock.start.minute)) > state.longestBlock.duration
THEN {
duration: (activity.actEndTime.hour * 60 + activity.actEndTime.minute) -
(state.currentBlock.start.hour * 60 + state.currentBlock.start.minute),
start: state.currentBlock.start,
end: activity.actEndTime
}
ELSE state.longestBlock
END
}
END
) AS finalState
RETURN
s.stuFullName_anon AS stuName,
activities[0].actStartDate AS date,
finalState.longestBlock.duration AS longestConsecutiveBlockDuration,
finalState.longestBlock.start AS blockStartTime,
finalState.longestBlock.end AS blockEndTime
```
#### Example - longest consecutive block for 'Susan Lopez' on 2022-09-27
The below finds the longest consecutive block in a day for a student:
```{python}
from connect_to_neo4j_db import connect_to_neo4j
from neo4j import GraphDatabase
import pandas as pd
# connect to Neo4j
driver = connect_to_neo4j()
# session
session = driver.session()
# run query (modified RETURN clause)
query = """
MATCH (s:student {stuFullName_anon: "Susan Lopez"})-[:ATTENDS]->(a:activity{actStartDate: date("2022-09-27")})
WITH s, a
ORDER BY a.actStartTime
WITH s, COLLECT(a) AS activities
WITH s, activities,
REDUCE(
state = {currentBlock: {duration: 0, start: null, end: null}, longestBlock: {duration: 0, start: null, end: null}},
activity IN activities |
CASE
WHEN state.currentBlock.end IS NULL OR
activity.actStartTime > state.currentBlock.end
THEN {
currentBlock: {
duration: activity.actDurationInMinutes,
start: activity.actStartTime,
end: activity.actEndTime
},
longestBlock:
CASE
WHEN activity.actDurationInMinutes > state.longestBlock.duration
THEN {
duration: activity.actDurationInMinutes,
start: activity.actStartTime,
end: activity.actEndTime
}
ELSE state.longestBlock
END
}
ELSE {
currentBlock: {
duration: (activity.actEndTime.hour * 60 + activity.actEndTime.minute) -
(state.currentBlock.start.hour * 60 + state.currentBlock.start.minute),
start: state.currentBlock.start,
end: activity.actEndTime
},
longestBlock:
CASE
WHEN ((activity.actEndTime.hour * 60 + activity.actEndTime.minute) -
(state.currentBlock.start.hour * 60 + state.currentBlock.start.minute)) > state.longestBlock.duration
THEN {
duration: (activity.actEndTime.hour * 60 + activity.actEndTime.minute) -
(state.currentBlock.start.hour * 60 + state.currentBlock.start.minute),
start: state.currentBlock.start,
end: activity.actEndTime
}
ELSE state.longestBlock
END
}
END
) AS finalState
RETURN
s.stuFullName_anon AS student,
activities[0].actStartDate AS date,
finalState.longestBlock.duration AS longestConsecutiveBlockDuration,
finalState.longestBlock.start AS blockStartTime,
finalState.longestBlock.end AS blockEndTime
"""
print("Running query...\n")
result = session.run(query)
# list to hold records
records = []
for record in result:
records.append(record)
# df
df = pd.DataFrame(records, columns=["student", "date", "longestConsecutiveBlockDuration", "blockStartTime", "blockEndTime"])
# print
print(df.head(5))
# close session and driver
session.close()
driver.close()
```
### Max hours in a day
This query calculates the total scheduled hours for each student on a day and returns the results ordered by date. This example filters for students who have more than 7 hours of activities in a day.
```{cypher}{.scroll-cypher}
// sum of activity durations
// does not account for simultaneous activities (clashes) - so could be inflated, e.g. 12.5 hour students
MATCH (s:student)-[:ATTENDS]->(a:activity)
WITH s, a.actStartDate AS Date, SUM(a.actDurationInMinutes) / 60.0 AS totalHours
WHERE totalHours > 7 // Set maximum here
RETURN s.stuFullName_anon AS Student, Date, totalHours
ORDER BY Date;
```
![Max hours in a day](./images/cypher-maxTotalDay.png)
### Travel time between activities
This query calculates the travel time between consecutive activities for a student on a specific date. It uses the lat/long coordinates of the buildings to calculate the distance and time taken to travel between them and a default walking speed of 1.4 m/s to calculate time.
This is a simple example and does not account for factors like traffic, walking speed, or other modes of transport.
```{cypher}{.scroll-cypher}
// Calculate travel time between consecutive activities for a student on a specific date
MATCH (s:student {stuFullName_anon: "David Johnson"})-[:ATTENDS]->(a1:activity)-[:OCCUPIES]->(r1:room),
(s)-[:ATTENDS]->(a2:activity)-[:OCCUPIES]->(r2:room)
WHERE a1.actEndTime = a2.actStartTime AND a1.actStartDate = a2.actStartDate AND a1 <> a2 AND
a1.actStartDate IN [date("2023-01-11"), date("2022-09-27"), date("2023-03-14")]
RETURN DISTINCT
s.stuFullName_anon,
a1.actName AS act1, a1.actStartDate AS date, a1.actStartTime+"-"+a1.actEndTime AS act1Times, a2.actStartTime+"-"+a2.actEndTime AS act2Times, a2.actName AS act2,
point.distance(r1.location, r2.location) AS distance,
round(point.distance(r1.location, r2.location) / 1.4) AS walkingTimeSeconds // Calculate walking time in seconds
```
![Travel time between activities](./images/cypher-b2b-travel.png)
### Lunch breaks
It might be expected that a student (or staff) has a lunch break. The cypher below calculates free and booked tie within a window, in this case 12:00 and 14:00. It can be used to find students who do not have a lunch break or count the number of days that a student does not have a lunch break.
```{cypher}{.scroll-cypher}
MATCH (s:student)-[:ATTENDS]->(a:activity)
WITH s, a
ORDER BY a.actStartDate, a.actStartTime
WITH s, COLLECT(a) AS activities
UNWIND activities AS activity
WITH s.stuFullName_anon AS Student, activity.actStartDate AS Date, time(activity.actStartTime) AS StartTime, time(activity.actEndTime) AS EndTime, duration.between(time('12:00'), time('14:00')).minutes AS BreakWindow_12_14
WITH Student, Date, BreakWindow_12_14, COLLECT([StartTime, EndTime]) AS Activities
UNWIND Activities AS activity
WITH Student, Date, BreakWindow_12_14, activity[0] AS StartTime, activity[1] AS EndTime
WITH Student, Date, BreakWindow_12_14,
CASE
WHEN StartTime >= time('14:00') OR EndTime <= time('12:00') THEN 0
WHEN StartTime < time('12:00') AND EndTime > time('14:00') THEN BreakWindow_12_14
WHEN StartTime >= time('12:00') AND StartTime < time('14:00') THEN duration.between(StartTime, time('14:00')).minutes
WHEN EndTime > time('12:00') AND EndTime <= time('14:00') THEN duration.between(time('12:00'), EndTime).minutes
END AS BookedDurationMinutes
RETURN Student, Date, BreakWindow_12_14, BreakWindow_12_14 - SUM(BookedDurationMinutes) AS FreeTimeMinutes, SUM(BookedDurationMinutes) AS BookedTimeMinutes
ORDER BY Date
```
![Identifying lunch breaks](./images/cypher-lunchBreak.png)
Interestingly, `Michael Johnson` has a negative lunch break! A quick look showed that there are actually two `Michael Johnsons` attending this class and they both have 30 minutes free time in the 2-hour lunch break window.
Because the query was written using student name, it is incorrectly aggregating the two students into one person as follows:
$$
2 \text{ hour lunch window} - (1.5 \text{ hours/class} \times 2 \text{ students}) = -1 \text{ hour}
$$
To remedy this, the query can be updated to use student ID or a different unique identifier. I would also like to update the anonymisation function in the ETL so that it does not duplicate names in the output.