Tilda JSON Syntax: View Realization
One issue with any complex view is that it tends to be slow when it involves more complex joins and millions of records. So how do we deal with this? View materialization is the answer most databases have implemented. Materializing a view is akin to copying all of its data into a table. Fully denormalized, and without any joins, accessing that table becomes MUCH faster in MOST cases. At a high level, it's functionally equivalent to doing
create table X as select * from X_view
Databases provide some automation and maintenance capabilities so that view materialization is often better than the quick and dirty SQL above. However, it might not always be the best answer:
- A view may be rather complex and refreshing the materialized table under the covers may not be trivial, sometimes requiring the entire view to be recomputed.
- The resulting underlying table loses many of the semantics of the tables that made up the view, such as foreign keys, indices, and so on.
- Sometimes, you know your data better and could implement a more effective materialization yourself.
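For comparison, most databases expose native materialization directly; in PostgreSQL, for instance, the quick-and-dirty statement above roughly corresponds to the following (view and table names are just placeholders):
create materialized view X_Materialized as select * from X_View;
-- Later, re-run the view's query and replace the stored rows in one shot:
refresh materialized view X_Materialized;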
With Tilda, we encourage the creation of views as they are ideal for encapsulating concepts and/or transformations on your data. This is important because Tilda can be used in environments where ETL (Extract Transform Load) processes are implemented. Tilda IS NOT an ETL tool, but simpler view-level transformations, formulas, and features like Realization can often lighten the amount of ETL work you may have to do.
Making liberal use of views is a key aspect of good SQL database design. Views allow you to encapsulate the details of the structure of your tables, which might change as your application evolves, behind consistent interfaces.
– PostgreSQL Documentation - Advanced Features - Views
Tilda has a feature called "Realization" (a term chosen to be close to, but distinct from, "Materialization") that helps make this process more efficient, especially at scale. It allows you to dump the contents of a view into a target table, but does so in a way that:
- The semantics of your choice from the underlying tables are "ported" to the target table.
- The process can be customized (in limited ways today) to, for example, implement incremental updates which are critical for scalability.
- The target table exists implicitly in the Tilda model which makes further views or other constructs possible on top of a performing version of the data.
The process of realizing a view is called a "refill".
Declaring realization is simple:
,"realize":{
"primary":{ "columns":["formRefnum"] }
,"foreign":[
{ "name":"Form" , "srcColumns":["formRefnum" ], "destObject": "Form" }
,{ "name":"User" , "srcColumns":["formUserRefnum"], "destObject": "User" }
]
,"indices":[
{ "name":"User", "columns":["formUserRefnum"], "orderBy":["formFillDate desc"] }
,{ "name":"Type", "columns":["formType" ], "orderBy":["formFillDate desc"] }
]
}
The fields are:
- name: An optional name for the target table. If not provided, the name of the view is used, with the "View" or "PivotView" suffix replaced by "Realized".
- primary: Defines a primary key, if applicable.
- foreign: Defines one or more foreign keys, if applicable.
- indices: Defines one or more indices, if applicable.
🎈 NOTE: There were a lot of heated conversations around whether the realized table should be allowed a custom name. On the one hand, enforcing the "Realized" suffix on the target table name proved to be a good thing: people looking at the schema knew exactly how those tables worked and could find the corresponding view without having to look at docs. On the other hand, we faced cases where the target table was intended for groups who needed to dictate the names of the tables they wanted to look at. As a result, we added two new attributes that let you explicitly name the output table and, optionally, the target schema (useful to co-locate a number of realized tables, often used as flattened data marts, in one destination schema):
,"realize":{
"name":"MyRealizedTableX" // optional
,"targetSchema":"MySchema" // optional but only valid if "name" is provided
,"primary":{ "columns":["formRefnum"] }
,"foreign":[
{ "name":"Form" , "srcColumns":["formRefnum" ], "destObject": "Form" }
,{ "name":"User" , "srcColumns":["formUserRefnum"], "destObject": "User" }
]
,"indices":[
{ "name":"User", "columns":["formUserRefnum"], "orderBy":["formFillDate desc"] }
,{ "name":"Type", "columns":["formType" ], "orderBy":["formFillDate desc"] }
]
}
The realization will create and maintain several assets:
- A definition for a new table to host the contents of the view realization.
- The view itself, of course.
- Possibly parallel "R" views (described below).
- A database function to do the "refill" (i.e., realize the view), called Refill_<viewName minus 'View'>Realized. For example, if the view is called MySampleView, the realized table will be called MySampleRealized by default and the refill function will be called Refill_MySampleRealized().
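For example, assuming the default naming above and a PostgreSQL target (schema qualification may be needed depending on your setup), a full refill can be triggered with a plain SQL call:
-- Recompute MySampleView and rewrite the contents of MySampleRealized
select Refill_MySampleRealized();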
🎈 NOTE: An interesting problem to solve is what happens when a view is created over another view that is itself realized. For example, let's say you have View1 as a simple view that is realized. Then you define View2, which uses View1, and realize View2 as well. It would be silly for View2's refill not to take advantage of View1's realization. To achieve this, Tilda maintains a separate hierarchy of "R" views which substitute the appropriate realized table when applicable. Those views exist in a separate schema, are used by the refill functions, and are maintained/migrated automatically by Tilda. They should never be used directly, and there are discussions about eliminating them altogether for a simpler implementation.
X_View
    TableA
    TableB
    TableC
--> X_Realized
--> Refill_X_Realized uses X_View

Y_View
    X_View
    TableY
    TableZ
--> Y_Realized
--> Y_R
        X_Realized
        TableY
        TableZ
--> Refill_Y_Realized uses Y_R
With V2, we added incremental support. The syntax is as follows:
,"realize":{
...
,"incremental":{
"deleteFirst": true
,"whereClause": "lastUpdated >= ${START_DATE} - 30 or status='CHECKEDIN'"
}
}
This is a tricky feature that needs to be used with care and properly tested, as many parameters become important to consider when dealing with incremental logic. First and foremost, such logic requires proper record-keeping timestamps in your tables. Thankfully, Tilda by design automatically adds three "life-cycle" timestamps by default unless otherwise blocked. These columns are "created", "lastUpdated" and "deleted".
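To make that concrete, here is a minimal sketch of what those life-cycle columns amount to on a table (the column names come from Tilda's defaults; the timestamptz types, not-null constraints, and the "Student" table itself are assumptions for a PostgreSQL target):
create table "Student" (
    "refnum"       bigint       not null  -- hypothetical business column
  , "created"      timestamptz  not null  -- set when the row is inserted
  , "lastUpdated"  timestamptz  not null  -- maintained on every update
  , "deleted"      timestamptz            -- null unless the row was soft-deleted
);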
An additional refill function is generated which takes a datetime parameter indicating the starting time for considering records for an incremental update. If the view is called MySampleView, the realized table will be called MySampleRealized by default and the incremental refill function will be called Refill_MySampleRealized(datetime start_time). When run dynamically, a timestamp, ostensibly from the previous run, would be provided and is accessible directly via the "${START_DATE}" placeholder in the where-clause.
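Calling it looks like the full refill, with a start timestamp added; a sketch assuming the default naming (the exact parameter type depends on your target database):
-- Only records "touched" since the supplied timestamp (ostensibly the previous run) are considered
select Refill_MySampleRealized('2024-06-10 02:00:00');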
Let's examine the various scenarios that you need to think about.
Designing the right where-clause for your incremental logic is not necessarily obvious. You must capture the notion of "updated", but that can often go beyond the "lastUpdated" flag of the main table. You may have a very complex view with many dependencies and aggregates, and understanding those dependencies is the key. For example, imagine that you are rolling up aggregate test results at the student level and your view uses data from both tables. A final "realized" record will therefore need updating if either of the underlying records gets updated.
Imagine a simple Student table and a Test table in a one-to-many relationship, where one student can have zero or more tests. Your where-clause would have to look at both sets of "lastUpdated" timestamps, one of which would have to be aggregated, i.e., "GREATEST(Student.lastUpdated, max(Test.lastUpdated))".
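As a sketch only (the table and column names are illustrative, and the real logic belongs in your view definition), the rolled-up timestamp can be exposed as a single column so the incremental where-clause has one thing to test:
select S.refnum                                    as studentRefnum
     , count(T.refnum)                             as testCount
     , greatest(S.lastUpdated, max(T.lastUpdated)) as lastUpdated
  from Student S
  left join Test T on T.studentRefnum = S.refnum
 group by S.refnum, S.lastUpdated
The incremental whereClause can then simply test that one column, e.g., "lastUpdated >= ${START_DATE} - 2".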
There may also be other attributes managing the life-cycle of your data. Imagine an application for a hotel accumulating many data points for current guests. Instead of using "lastUpdated", you may use a status flag such as "checkedIn" and process all current guests as long as they are active. This may be a better and more performant option if you are aggregating data over many related tables.
If you have data that can be deleted (physically, or simply via the "deleted" timestamp marker), it's possible and even likely that a record gets deleted between two realization calls. In that case, the logic in your view might miss such records depending on how you specified your where-clause and joins, so it's important to sweep and delete records first before doing an incremental realization.
Let's assume a system where we keep student tests and, for some reason, some tests can be canceled or deleted. If you create a complex view assembling various metrics about a test, it would make sense not to keep those records around once the test is no longer considered valid. Consider the following sequence:
- The view explicitly excludes those records ("deleted" is null in the where clause)
- On Monday, a test was added
- On Monday night, the realization process occurred and the record for that test was written to the realized table.
- On Tuesday, that test was deleted (marked as deleted) and more tests were added
- On Tuesday night, the realization process runs and doesn't see the deleted record.
Without defining a whole other complex query (which would add too much maintenance complexity), we must delete the records "touched" between Monday night and Tuesday night and then refill the data, which will omit the recently deleted record. The attribute "deleteFirst" does exactly that, applying the supplied "whereClause" against the realized table instead of the source view. Of course, the where-clause logic has to be aware of this since, technically, a record touched on Tuesday wouldn't show up in the table realized on Monday night, so typically some padding is used.
Padding can be implemented either by adding a delay on the "lastUpdated" flag (e.g., "lastUpdated >= ${START_DATE} - 15"), or by running the refill process itself within a rolling time window (e.g., a process run nightly with a 2-day rolling window).
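Putting it together, here is a rough pseudo-SQL sketch of what a padded, delete-first incremental pass boils down to conceptually (the actual statements are generated by Tilda and will differ; names are the defaults from the earlier example, and ${START_DATE} stands for the supplied start timestamp):
-- 1) Sweep: delete from the realized table using the padded where-clause
delete from MySampleRealized where lastUpdated >= ${START_DATE} - 15;
-- 2) Refill: re-insert the matching rows from the view
insert into MySampleRealized select * from MySampleView where lastUpdated >= ${START_DATE} - 15;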
,"realize":{
...
,"indexTemplates":[
{ "name":"i", "columns":["test*"], "orderBy":["formUserRefnum"], "subWhere":"?=1" }
]
}