Indexes can be used to speed up database queries, but when should you add a new index? Traditionally the answer to this question has been to add an index for every column used for filtering or joining data. For example, consider the following query:
SELECT *
FROM projects
WHERE user_id = 2;
Here we are filtering by the user_id
column and as such a developer may decide
to index this column.
While in certain cases indexing columns using the above approach may make sense it can actually have a negative impact. Whenever you write data to a table any existing indexes need to be updated. The more indexes there are the slower this can potentially become. Indexes can also take up quite some disk space depending on the amount of data indexed and the index type. For example, PostgreSQL offers "GIN" indexes which can be used to index certain data types that can not be indexed by regular btree indexes. These indexes however generally take up more data and are slower to update compared to btree indexes.
Because of all this one should not blindly add a new index for every column used to filter data by. Instead one should ask themselves the following questions:
- Can I write my query in such a way that it re-uses as many existing indexes as possible?
- Is the data going to be large enough that using an index will actually be faster than just iterating over the rows in the table?
- Is the overhead of maintaining the index worth the reduction in query timings?
We'll explore every question in detail below.
The first step is to make sure your query re-uses as many existing indexes as possible. For example, consider the following query:
SELECT *
FROM todos
WHERE user_id = 123
AND state = 'open';
Now imagine we already have an index on the user_id
column but not on the
state
column. One may think this query will perform badly due to state
being
unindexed. In reality the query may perform just fine given the index on
user_id
can filter out enough rows.
The best way to determine if indexes are re-used is to run your query using
EXPLAIN ANALYZE
. Depending on any extra tables that may be joined and
other columns being used for filtering you may find an extra index is not going
to make much (if any) difference. On the other hand you may determine that the
index may make a difference.
In short:
- Try to write your query in such a way that it re-uses as many existing indexes as possible.
- Run the query using
EXPLAIN ANALYZE
and study the output to find the most ideal query.
A database may decide not to use an index despite it existing in case a regular sequence scan (= simply iterating over all existing rows) is faster. This is especially the case for small tables.
If a table is expected to grow in size and you expect your query has to filter out a lot of rows you may want to consider adding an index. If the table size is very small (e.g. only a handful of rows) or any existing indexes filter out enough rows you may not want to add a new index.
Indexes have to be updated on every table write. In case of PostgreSQL all existing indexes will be updated whenever data is written to a table. As a result of this having many indexes on the same table will slow down writes.
Because of this one should ask themselves: is the reduction in query performance worth the overhead of maintaining an extra index?
If adding an index reduces SELECT timings by 5 milliseconds but increases INSERT/UPDATE/DELETE timings by 10 milliseconds then the index may not be worth it. On the other hand, if SELECT timings are reduced but INSERT/UPDATE/DELETE timings are not affected you may want to add the index after all.
To see which indexes are unused you can run the following query:
SELECT relname as table_name, indexrelname as index_name, idx_scan, idx_tup_read, idx_tup_fetch, pg_size_pretty(pg_relation_size(indexrelname::regclass))
FROM pg_stat_all_indexes
WHERE schemaname = 'public'
AND idx_scan = 0
AND idx_tup_read = 0
AND idx_tup_fetch = 0
ORDER BY pg_relation_size(indexrelname::regclass) desc;
This query outputs a list containing all indexes that are never used and sorts them by indexes sizes in descending order. This query can be useful to determine if any previously indexes are useful after all. More information on the meaning of the various columns can be found at https://www.postgresql.org/docs/current/static/monitoring-stats.html.
Because the output of this query relies on the actual usage of your database it may be affected by factors such as (but not limited to):
- Certain queries never being executed, thus not being able to use certain indexes.
- Certain tables having little data, resulting in PostgreSQL using sequence scans instead of index scans.
In other words, this data is only reliable for a frequently used database with plenty of data and with as many GitLab features enabled (and being used) as possible.