alias |
---|
DB |
- Databases
- are structured digital information stores
- build indices to efficiently query data
- define relationships between datasets
- Every database has a way of structuring / organizing and interacting with this stored information: Database Management Systems (DBMS)
- Fundamental ways of interacting with databases:
- Create
- Read
- Update
- Delete
- Fundamental ways of interacting with databases:
- Organize data in tabular form with columns (categories / attributes) and rows (entries)
- Uses a structure that allows us to identify and access data in relation to another piece of data in the database.
- Often, data is organized into tables.
- Highly structured
- Strict data types
- Great for complex datasets
- [[SQL]] is used to interact with them
- A [[Database Schema|schema]] needs to be defined before putting anything into a database
- Commonly used DBMSs: MySQL & PostgreSQL.
- Row / Record
- A single entry in a table.
- Column / Field
- The smallest entity of a table.
- Constraint
- A restriction on the type or value that can be assigned to a
column
. - Enforce data integrity and maintain consistency within a relational database.
- Primary Key
- A field which uniquely identifies each row/record in a database table.
- Foreign Key
- aka referencing key.
- Used to link two tables together.
- A column or combination of columns whose values match a primary key in a different table.
- A restriction on the type or value that can be assigned to a
- Complexity
- Implementing and maintaining an RDBMS can be challenging, particularly for larger systems, as it demands specialized expertise to effectively manage, fine-tune, and enhance database performance.
- Cost
- RDBMSs can be expensive, both in terms of the computational and storage resources they require and associated licensing fees.
- Fixed Schema
- RDBMSs operate on a rigid schema structure, which can make changes to the data model time-intensive and complicated.
- Handling of Unstructured Data
- RDBMSs are not suitable for handling unstructured data like multimedia files, social media posts, and sensor data, as their relational structure is optimized for structured data.
- Horizontal Scalability
- Unlike NoSQL databases, RDBMSs are not as easily horizontally scalable.
- Adding more machines to the system can be both costly and complex.
- Each record in one table is related to exactly one record in another table.
- e.g. Each user has one shipping address.
-- Creating
CREATE TABLE users (
user_id INT PRIMARY KEY,
username VARCHAR(50) NOT NULL,
email VARCHAR(100) NOT NULL
);
CREATE TABLE shipping_addresses (
address_id INT PRIMARY KEY,
user_id INT UNIQUE,
street VARCHAR(100) NOT NULL,
city VARCHAR(50) NOT NULL,
country VARCHAR(50) NOT NULL,
FOREIGN KEY (user_id) REFERENCES users(user_id)
);
-- Querying
SELECT u.username, s.street, s.city, s.country
FROM users u
JOIN shipping_addresses s ON u.user_id = s.user_id
WHERE u.user_id = 1;
- One record in a table can be associated with multiple records in another table.
- The most common type of relationship.
- e.g. One user can place many orders.
-- Creating
CREATE TABLE users (
user_id INT PRIMARY KEY,
username VARCHAR(50) NOT NULL,
email VARCHAR(100) NOT NULL
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
user_id INT,
order_date DATE NOT NULL,
total_amount DECIMAL(10, 2) NOT NULL,
FOREIGN KEY (user_id) REFERENCES users(user_id)
);
-- Querying
SELECT o.order_id, o.order_date, o.total_amount
FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE u.user_id = 1
ORDER BY o.order_date DESC;
- Multiple records in one table are related to multiple records in another table.
- Requires a junction table that has a foreign key to both tables.
- e.g. Many products can belong to many categories, and vice versa.
-- Creating
CREATE TABLE products (
product_id INT PRIMARY KEY,
product_name VARCHAR(100) NOT NULL,
price DECIMAL(10, 2) NOT NULL
);
CREATE TABLE categories (
category_id INT PRIMARY KEY,
category_name VARCHAR(50) NOT NULL
);
CREATE TABLE product_categories (
product_id INT,
category_id INT,
PRIMARY KEY (product_id, category_id),
FOREIGN KEY (product_id) REFERENCES products(product_id),
FOREIGN KEY (category_id) REFERENCES categories(category_id)
);
-- Querying
-- All Products in a specific category
SELECT p.product_name, p.price
FROM products p
JOIN product_categories pc ON p.product_id = pc.product_id
JOIN categories c ON pc.category_id = c.category_id
WHERE c.category_id = 1
ORDER BY p.price ASC;
- Occurs when a table has a foreign key that references its own primary key.
- e.g. Employees with managers (who are also employees).
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
name VARCHAR(100),
manager_id INT,
FOREIGN KEY (manager_id) REFERENCES employees(employee_id)
);
-- Querying
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;
![[Normalization]]
- Each table cell should contain a single value (and not a list of values).
- Each record needs to be unique.
- Conclusive of a relational database.
- For a database to be considered relational, all relations in the database are in 1NF.
- A database is considered relational if all the fields in the tables are atomic, every column is a unique attribute, and a unique identifier or primary key is used.
- Deals with the elimination of circular dependencies from a relation.
- A relation is in 2NF if it is in 1NF and if every non-key attribute is completely dependent only on the Primary Key.
- e.g. A single table with
OrderID
,ProductID
,ProductName
,CustomerID
andCustomerName
fields violates 2NF becauseCustomerName
depends onCustomerID
, not the full primary key (OrderID
,ProductID
).- The table can be split into two:
- Orders (
OrderID
,ProductID
andCustomerID
) - Customers (
CustomerID
andCustomerName
)
- Orders (
- The table can be split into two:
Note
A non-key attribute is any column that cannot be used to uniquely identify the table.
- Deals with the elimination of non-key attributes that do not describe the Primary Key.
- For a relation to be in 3NF, the relationship between any two non-key attributes, or groups of non-key attributes, must not be in a one-to-one relation.
- Attributes should be mutually independent which means, none of the attributes should be functionally dependent on any combination of attributes.
- This ensures that any update on the individual attribute will not affect other attributes in a row.
- e.g. A table that contains
EmployeeID
,DepartmentID
,DepartmentName
andManagerID
violates 3NF becauseDepartmentName
is transitively dependent onEmployeeID
throughDepartmentID
.- The table can be split into two:
- Employees (
EmployeeID
andDepartmentID
) - Departments (
DepartmentID
,DepartmentName
andManagerID
)
- Employees (
- The table can be split into two:
- An extension to the third normal form, and is also known as 3.5 Normal Form.
- A table should follow these two conditions to satisfy BCNF:
- It should be in 3NF.
- And, for any dependency A β B, A should be a super key. which means that A should be a non-key attribute if B is a key attribute.
-
Entity Integrity
- Ensures that each row in a table is uniquely identifiable.
- Typically achieved through the use of primary keys.
- e.g., In an employee database, each employee record might have a unique "Employee ID" as the primary key.
- This prevents duplicate entries and ensures each employee can be distinctly identified.
-
Referential Integrity
- Used to build and maintain logical relationships between tables to avoid logical corruption of data.
- Made up of the combination of a primary key and a foreign key.
- ==Requires that if a foreign key in one table references a primary key in another table, the referenced value must exist.==
- e.g., if an "Orders" table has a customer ID foreign key, that ID must exist in the "Customers" table.
- Helps prevent orphaned records (records with foreign key values that don't match any primary key).
- In such relationships, typically, the table with a Primary key is considered as a parent table and the table with a foreign key is considered a child table.
- Updating and Deletion in primary key table and Insertion and Updating in the foreign key table could violate the referential integrity.
- Actions on violation:
- Prevents deletion of referenced data.
- Deletes or updates related records automatically.
- Sets the foreign key to null or a default value if the referenced record is deleted.
- To avoid violating Referential Integrity,
ON DELETE CASCADE
, set the foreign key to NULL, or set foreign key to default are implemented.
-
Domain Integrity
- Enforces valid entries for a given column by specifying acceptable values or data types.
- e.g., A "Date of Birth" column might be restricted to accept only valid dates within a reasonable range, preventing entries like "February 30" or future dates.
-
Physical Integrity
- Safeguards data against physical issues during storage and retrieval.
- Ensures that data remains intact despite hardware failures or environmental factors.
- e.g., Implementing redundant storage systems like RAID
- An index is a database object that provides a fast and efficient way to look up and retrieve data from a table.
- Typically a data structure that stores a sorted or hashed subset of a table's data.
- By default, tables are scanned for data row by row.
- Indexes improve the performance of
SELECT
queries by reducing the amount of data that needs to be scanned. - They can also enhance the performance of join operations.
- DML operations can be slower because indexes need to be updated along with the data.
- Indexes should be limited on frequently updated tables.
- Indexes also consume additional space.
- A covering index includes all columns needed for a query so that SQL can retrieve data directly from the index without accessing the underlying table.
- e.g. a composite index with the required columns in the query.
-- Non-Clustered Index on 'Users'
CREATE INDEX idx_UserID ON Users(user_id);
-- Useful if users are frequently queried by their id.
-- Composite Index on 'Users'
CREATE INDEX idx_City_Balance ON Users(city, balance);
-- Clustered Index on 'Accounts'
CREATE INDEX idx_AccountID ON Accounts(account_id);
CLUSTER Accounts USING idx_AccountID;
Important
The order of operations is crucial.
- Ensure that a series of database operations are treated as a single unit of work, maintaining data integrity and [[Database Consistency|consistency]], even in the face of system failures or concurrent access.
- Follow the [[ACID]] properties.
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;
-- or ROLLBACK; if there's an error
-- With Error Handling
BEGIN TRANSACTION;
BEGIN TRY
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;
END TRY
BEGIN CATCH
ROLLBACK;
-- Log error or handle it appropriately
END CATCH
// Using Node.js
const { Pool } = require('pg')
const pool = new Pool()
const client = await pool.connect()
try {
await client.query('BEGIN');
const queryText = 'INSERT INTO users(name) VALUES($1) RETURNING id';
const res = await client.query(queryText, ['brianc']);
await client.query('COMMIT');
} catch (e) {
await client.query('ROLLBACK');
throw e;
} finally {
client.release();
}
- Savepoints can be used to create intermediate points within a transaction.
BEGIN TRANSACTION;
INSERT INTO orders (customer_id, product_id) VALUES (1, 101);
SAVE TRANSACTION SavePoint1;
INSERT INTO order_items (order_id, item_id) VALUES (SCOPE_IDENTITY(), 1);
IF ERROR
ROLLBACK TRANSACTION SavePoint1;
ELSE
COMMIT;
-
Deadlocks can occur when multiple transactions are waiting for each other to release resources. To minimize the risk of deadlocks:
- Keep transactions short and fast
- Access objects in a consistent order
- Use appropriate isolation levels
-
Isolation Levels
READ UNCOMMITTED
- Lowest isolation level
- Allows dirty reads (reading uncommitted changes from other transactions)
- No shared locks issued; no exclusive locks honored
- Highest concurrency, but lowest data consistency
- Rarely used due to potential data integrity issues
READ COMMITTED
- Default isolation level in most databases
- Prevents dirty reads
- Shared locks are held only while data is being read
- Allows non-repeatable reads and phantom reads
- Balances concurrency and consistency
REPEATABLE READ
- Prevents dirty reads and non-repeatable reads
- Shared locks are held for the duration of the transaction
- Still allows phantom reads
- Lower concurrency than READ COMMITTED, but higher data consistency
SERIALIZABLE
- Highest isolation level
- Prevents dirty reads, non-repeatable reads, and phantom reads
- Places a range lock on the data, preventing other transactions from modifying or inserting data that would affect the current transaction
- Lowest concurrency, but highest data consistency
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN TRANSACTION;
-- operations here
COMMIT;
[!note]- Related Terminology
- Dirty Read - Reading data that has been modified by another transaction but not yet committed.
- Non-repeatable Read - Getting different results when reading the same data multiple times within a transaction due to updates by other transactions.
- Phantom Read - When a transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows has changed due to another recently committed transaction.
- Generally more flexible and less strict
- Creating a schema isn't necessary
- Ideal for
- decentralized / distributed networks of data.
- dealing with large amounts of unstructured data.
- Commonly used databases: MongoDB, Redis.
- Types:
- Graph
- Key-Value
- Document
- Enhances the scalability and performance of databases by distributing data across multiple servers or nodes.
- Involves partitioning a large database into smaller, more manageable pieces called "shards," which can be stored on different machines.
- Each shard contains a subset of the data, allowing for parallel processing and improved response times.
- Horizontal Sharding
- Divides a database table into smaller tables (shards) where each shard contains unique rows but shares the same schema.
- It could be implemented alphabetically, geographically, etc.
- e.g. customer records could be split based on geographic regions.
- Divides a database table into smaller tables (shards) where each shard contains unique rows but shares the same schema.
- Vertical Sharding
- A table's columns are divided into different tables.
- It is useful when certain queries only require specific columns, allowing for more efficient data retrieval.
- Allows to define policies that restrict which rows a user can access in a table.
- Useful for multi-tenant databases or applications where users should only see their own data.
ALTER TABLE table_name ENABLE ROW LEVEL SECURITY;
- Policies define the conditions under which rows can be accessed.
- Logical operators can be used to create multiple policies on a table.
-- Allow users to see only their own data
CREATE POLICY user_data_policy ON users
USING (user_id = current_user);
-- Allow managers to see data for their department
CREATE POLICY dept_data_policy ON employees
USING (department_id IN (
SELECT department_id
FROM managers
WHERE manager_id = current_user
));
- Roles can be used to group privileges and can be assigned to users.
-- Creating roles
CREATE ROLE readonly;
CREATE ROLE readwrite;
-- Granting privileges to roles
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO readwrite;
-- Assigning roles to users
GRANT readonly TO user1;
GRANT readwrite TO user2;
- Users are assigned roles, and permissions are granted to roles rather than individual users.
CREATE ROLE data_analyst;
GRANT SELECT ON sales_data TO data_analyst;
GRANT data_analyst TO jane_doe;
- The owner of an object grants permissions to other users or roles.
GRANT UPDATE ON customers TO john_smith;
- Restrict access to specific columns.
GRANT SELECT (public_col1, public_col2) ON sensitive_table TO public_role;
GRANT SELECT ON sensitive_table TO admin_role;
-- Control which users can execute specific functions
REVOKE ALL ON FUNCTION sensitive_function() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION sensitive_function() TO admin_role;
-- Control access at the schema level
REVOKE ALL ON SCHEMA private FROM PUBLIC;
GRANT USAGE ON SCHEMA private TO authorized_role;
- Use [[Connection Pool|connection pools]] to reduce connection overhead.
- Configure to optimize resource utilization
- Normalize databases to minimize redundancy and maintain data integrity.
- Choose the right data types for fields for both performance and data integrity.
- Utilize lazy loading, eager loading and batch processing for efficient data retrieval.
- Consider denormalization for read-heavy workloads.
- Avoid unnecessary joins.
- Set up database replication for redundancy and improved read performance.
- Utilize sharding.
- Use indexes to improve query performance, but judiciously as they add overhead during insertions and updates.
- Specify only the columns you need with
SELECT
.- Avoid
SELECT *
.
- Avoid
- Use
LIMIT
to control the number of rows returned by a query for large datasets. - Analyze query execution plans using
EXPLAIN
to identify potential bottlenecks. - Avoid using functions in
WHERE
clauses.
- Use the principle of least privilege: Grant only the necessary permissions.
- Regularly audit and review access controls and RLS policies.
- Use roles to group privileges for easier management.
- Implement application-level security in addition to database-level security.
- Use SSL/TLS for encrypted connections to the database.
- Data Modeling
- ER Diagrams
- UML
- ER Diagrams
- Scaling
- Replication
- CAP Theorem
- Isolation Levels
- Security
- Data Masking
- Encryption
- Caching
- Online Transaction Processing (OLTP) vs Online Analytical Processing (OLAP)
- N+1 Problem
- Denormalization
- Local-First Apps
- ElectricSQL
- TinyBase
- Seven Databases in Seven Weeks (Eric Redmond)