diff --git a/ADF.q b/ADF.q
index cf60afb..e27caea 100644
--- a/ADF.q
+++ b/ADF.q
@@ -32,15 +32,18 @@ rs:{([]sym:x;close:first((5#" "),"F";csv) 0:`$":data/stocks/",string[x],".csv")}
 t: `sym xgroup raze rs each syms
 
 // We apply our cointegration function on every pair of symbols from our crossed list
-matrix: fcoint .' 0f^neg[trange]#''@\:[;`close](@/:[t]')syms cross syms
+u:ps where (<).' ps:syms cross syms
+matrix: fcoint .' 0f^neg[trange]#''@\:[;`close](@/:[t]')u
 
 // We create a p-values matrix from the ADF test results and set values above the diagonal to 1
-pvalues: ones (count[syms]*til count syms)_matrix
+pvalues: m,'not null reverse m:-1 rotate sums[til count[syms]] _ reverse matrix
+
+show pvalues
 
 pyhm:.pykx.import[`seaborn]`:heatmap
-pyhm[pvalues;`xticklabels pykw syms;`yticklabels pykw syms;`cmap pykw `RdYlGn_r]
+pyhm[pvalues;`xticklabels pykw reverse syms;`yticklabels pykw reverse syms;`cmap pykw `RdYlGn_r]
 
 pyshow:.pykx.import[`matplotlib.pyplot]`:show
diff --git a/PostPairsTrading.md b/PostPairsTrading.md
index 208c9b0..d98bb52 100644
--- a/PostPairsTrading.md
+++ b/PostPairsTrading.md
@@ -1,26 +1,40 @@
-# A Match Made in Trading: Step-by-Step Pairs Trading Guide
+# Pairs Trading in KDB+ using Tick Architecture in 25 lines
 
-KDB+/Q stands out as **a powerful tool in finance**, renowned for its ability to handle vast volumes of real-time data amidst the relentless dynamics of the market. In this article, we embark on an insightful exploration of Pairs Trading and its implementation in Q, offering a comprehensive guide to one of the most popular strategies in the trading world.
+Kdb+/q stands out as **a powerful tool in finance**, renowned for its ability to handle vast volumes of real-time data amidst the relentless dynamics of the market. In this article, we embark on an insightful exploration of [_Pairs Trading_](https://en.wikipedia.org/wiki/Pairs_trade), one of the most popular strategies in the trading world, and its implementation in q. Our primary goal is to demonstrate how straightforward it is to create a simple real-time implementation of this strategy in just 25 lines of code. We accomplish this by leveraging the language's conciseness and expressiveness, along with reusing typical patterns and tools from the kdb ecosystem.
 
-Our objective is to provide **an implementation of a Pair Trading strategy in q**, which includes identifying cointegrated indexes, implementing a model to calculate spreads, making it work in real-time, and integrating it with the tick architecture.
+In order to achieve this, we have outlined the following steps, corresponding to the main sections of this article:
+* Identifying related indexes
+* Implementing a model to calculate their spreads
+* Visualizing the approach in real-time
 
-We'll proceed methodically, ensuring each question leads to a comprehensive answer. To start, we'll contextualize our current situation by addressing key questions such as **"What do we know about the market and how can we benefit from it?"**. This will lay the foundation for constructing a real-time simulated environment on Pairs Trading that will exemplify everything we have explained thus far.
+We will use the [_Tick Architecture_](https://github.com/KxSystems/kdb-tick), the typical setup found in kdb+ systems, to introduce, contextualize and guide each step. Since the typical reader is likely a quant interested in learning the virtues of q, a concise introduction to this architecture will be provided. We have also aimed to briefly introduce pairs trading so that other developers can follow and benefit from this text. Please feel free to skip the following section if you are familiar with it.
 
-Whether taking a technical or quantitative approach, these insights will provide valuable foundations for constructing this algorithm effectively.
+## What is Pairs Trading?
 
-We will be covering the following aspects using the KDB tick architecture as a base:
+The market has often been described as a **stochastic** (a term which essentially means random) **process** where prices fluctuate irregularly. However, amidst this apparent randomness, we observe that **certain assets move in tandem** due to their inherent relationships.
 
-![Architecture](resources/general-architecture.png)
+For instance, if oil prices rise, airlines' operating costs rise as well, since airlines rely on oil for their operations, illustrating an interconnectedness between the two. Over the long run, such related series tend to follow similar trends, reflecting their underlying relationship.
 
-If this is the first time you see the KDB+/Q tick architecture, **DONT PANIC!** The tick architecture is designed to handle high-frequency trading data efficiently. At its heart is the tickerplant, a crucial component responsible for receiving and timestamping incoming data, then broadcasting it to other components such as the real-time database (RDB) and historical database (HDB). The tickerplant ensures that data is distributed in a timely manner, allowing for real-time analytics and decision-making. The RDB stores recent data for quick access, while the HDB archives older data for long-term storage and analysis. Additionally, the feedhandler plays a vital role by interfacing with external data sources, making sure that the tickerplant receives accurate and up-to-date information. This architecture guarantees seamless data flow and rapid access to both real-time and historical data, making it ideal for high-frequency trading applications.
+The pairs trading technique leverages this phenomenon by identifying two assets whose prices exhibit a stable relationship over time. When these prices deviate from their historical relationship, a trading opportunity arises. This strategy involves buying the undervalued asset (the one that has fallen more than expected) and selling the overvalued asset (the one that has risen more than expected). The expectation is that the prices will eventually revert to their mean, allowing the trader to profit from this convergence.
 
-To develop the pair trading strategy, we have added a couple of new components to the default architecture. So, with all that said, let's get to work!
+This market-neutral approach is particularly attractive because it doesn't rely on the overall market direction. Instead, it focuses on the relative performance of the paired assets, which can provide consistent returns even in volatile market conditions. By maintaining both long and short positions, pairs trading inherently hedges market risk, which can enhance portfolio stability and reduce exposure to broad market movements.
 
-## In the untamed realm of the market
+Before diving into this search for related pairs, let me introduce you to the tick architecture.
 
-The market has often been described as a **stochastic** (a term which essentially means random) **process** where prices fluctuate irregularly. However, amidst this apparent randomness, we observe that **certain assets move in tandem** due to their inherent relationships.
+## What is the Tick Architecture?
 
-For instance, it's logical to expect that if the prices of petrol rise, the prices of cars should also rise. This is because auto companies rely on petrol for their operations, indicating an interconnectedness between the two. Over the long run, they tend to follow similar trends, **reflecting their underlying relationship**.
+The tick architecture can be seen as a series of interconnected q processes commonly used to handle high-frequency trading data very efficiently. At its heart is the tickerplant (TP), a crucial component responsible for receiving and timestamping incoming data, then broadcasting it to other components such as the real-time database (RDB). The TP ensures that data is distributed in a timely manner, allowing for real-time analytics and decision-making. The RDB stores recent data for quick access, while the historical database (HDB) archives older data for long-term storage and analysis. Additionally, the feed handler plays a vital role by interfacing with external data sources, making sure that the TP receives accurate and up-to-date information. This architecture guarantees seamless data flow and rapid access to both real-time and historical data, making it ideal for high-frequency trading applications. In fact, this simple setup can process [large amounts of data in a very short time](https://kx.com/blog/what-makes-time-series-database-kdb-so-fast/) with a small memory footprint.
+
+![Architecture](resources/general-architecture.png)
+*kdb tick architecture [diagram by Alexander Unterrainer](https://www.defconq.tech/docs/architecture/plain), extended by us for this scenario.*
+
+To develop the pairs trading strategy, we have integrated several new components into the default vanilla architecture. These include the Model Server (MS), Real-time Pairs Trading (RPT), and KX Dashboard, which will be introduced as needed. Now, let's begin our journey with the first step: identifying the most suitable pairs.
+
+> 📚 If you are still curious about this topic, you can check out the new [kdb+ architecture course](https://learninghub.kx.com/courses/kdb-architecture/) at the KX Academy by Michaela Woods. If you want to go down to the metal, Alexander Unterrainer does an excellent job dissecting all the components, line by line, in his [KDB Tick Explained](https://www.defconq.tech/docs/tutorials/tick) series.
+
+## Identifying cointegrated indexes
+
+The pairs trading strategy relies heavily on identifying pairs of assets that maintain a long-term equilibrium relationship, as mentioned previously. Can this be described mathematically? **Yes**:
@@ -29,114 +43,109 @@ The concept we're referring to is **cointegration** (although there are other me
 > 💡 Which should not be confused with correlation; cointegration is a statistical property of two time series, indicating a long-term relationship between them despite short-term fluctuations. Cointegrated series move together over time, sharing a common stochastic drift. On the other hand, correlation measures the strength and direction of the linear relationship between two variables at a specific point in time. While correlation captures the degree of association between variables, cointegration reflects a deeper, underlying relationship that persists over time.
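+More formally, in the usual textbook formulation, two price series $X_t$ and $Y_t$, each individually non-stationary, are said to be cointegrated if some linear combination of them is stationary:
+
+```math
+\varepsilon_t = Y_t - \beta X_t \sim I(0)
+```
+
+where $I(0)$ denotes a stationary process. This stationary combination, $\varepsilon_t$, is precisely the spread we will model later on.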
 Hence, we're interested in **cointegrated assets**, which are assets that exhibit the following characteristics:
+* They have a similar trend, meaning the difference between both assets maintains a constant mean, and this difference fluctuates around that same mean.
+* This inherent relationship remains constant in the long run, indicating that the connection between the series is time-invariant.
-1. They have a similar trend, meaning the difference between both assets maintains a constant mean, and this difference fluctuates around that same mean.
-
-2. This inherent relationship persists in the long run, meaning that our series is not dependent on time.
-
-## Identifying cointegrated indexes
+Let's shift our focus to the MS component, which has several responsibilities. In this section, we'll concentrate on its first task: reading data from the HDB and calculating the degree of cointegration among existing pairs. This step is crucial as it lays the foundation for subsequent actions in the strategy. How should we go about implementing this process?
 
-Imagine **we selected 13 world indexes** and aimed to assess whether they are **cointegrated or not**. In KDB+/Q we can start by declaring a variable containing every index and generating the cartesian product (`cross`) of this indexes.
+![Arch 1st Part](resources/general-architecture-ms.png)
+If our objective is to analyze potential pairs for the strategy using index data and their corresponding prices, querying the HDB for this information is a sensible approach. Assuming the HDB maintains a _stock_ table with the closing prices of each index for every date, we can employ the following query to retrieve all closing prices from the last `x` days:
 ```q
-syms:`SP500`NASDAQ100`BFX`FCHI`GDAXI`HSI`KS11`MXX`N100`N225`NYA`RUT`STOXX
-pairs: syms cross syms
+rs:{select close by sym from stock where date within(.z.d-x;.z.d)}
 ```
+As you can see, the query resembles SQL: _select closing price by symbol from the stock table where the date is in the range between x days ago and today_. Indeed, we are taking advantage of the q-SQL syntactic facilities, which enable newcomers to be productive quickly, as evidenced by the [KX Academy](https://learninghub.kx.com/courses/kdb-developer-level-1/). It is worth mentioning that the query is enclosed in curly brackets, indicating that it is actually a function taking `x` as a parameter. The function is named `rs` to signify that it (r)eads (s)tocks.
 
-In this scenario, a crucial tool at our disposal is the Augmented Dickey-Fuller (ADF) test, an essential statistical test for assessing the stationarity of time series data. The more stationary the time series are, the more cointegrated they are likely to be.
-
-This statistical test is a **hypothesis test**, where we use our data to see if we can accept or reject a hypothesis. In our case, the hypothesis is whether the time series is non-stationary. To determine this, we use **p-values**.
-
-**P-values** help us decide whether to reject the null hypothesis. If the p-value is low, it indicates that we can reject the hypothesis that the time series is non-stationary, suggesting that our assets are cointegrated. The lower the p-value, the greater the confidence in rejecting the null hypothesis. It is very common to use a threshold of 0.05 on the p-value to reject hypotheses.
-
-For the sake of simplicity, we will be using [PyKX](https://code.kx.com/pykx/2.4/index.html). This is necessary as we require importing our ADF test function and plotting a heatmap of our results. Developing these functionalities directly in Q would be time-consuming and prone to errors. Although an implementation of the ADF test in KDB+/Q would be more efficient and faster, the effort required would outweigh the benefits for this particular case where performance isn't critical. Therefore, we rely on PyKX to streamline the process by leveraging relevant libraries from the Python ecosystem.
+> 💡 Using short names for variables is a standard convention that, once you get used to it, improves readability.
 
+Now that we have defined a query, we need to execute it on the HDB component. To do so, we use [interprocess communication](https://code.kx.com/q/basics/ipc/) (IPC) to send the query:
 ```q
-system"l pykx.q"
+hdb:`$":",.z.x 0
+cls:hdb(rs;2*365)
 ```
+In the first line, we read the initial command line argument (`.z.x`), which should specify the port number where the HDB is expected to be running. This argument is used to construct the process handle, for instance `::5010`, which references localhost at the specified port.
 
-One such library is **statsmodels**, a prominent tool in Python for statistical modeling and hypothesis testing. It equips analysts with a robust toolkit for regression, time series, and multivariate analysis. Specifically, within the **statsmodels** package, the **statsmodels.tsa.stattools** module features **the Augmented Dickey-Fuller (ADF) test**.
+Following this, we provide the `rs` query along with its parameter to define the range (specifically, two years ago from the current date). These are then transmitted to the HDB using IPC, and the result produced by the HDB is assigned to the variable `cls`, which we show here:
+```
+sym      | close                                                             ..
+---------| -----------------------------------------------------------------..
+BFX      | 3746.07  3726.25  3682.07  3663.02  3711.71  3768.53  3785.46  377..
+FCHI     | 6086.02  6031.48  5922.86  5794.96  5912.38  6006.7   6033.13  599..
+GDAXI    | 13231.82 13003.35 12783.77 12401.2  12594.52 12843.22 13015.23 128..
+HSI      | 22418.97 21996.89 21859.79 21853.07 21586.66 21643.58 21725.78 211..
+N225     | 27049.47 26804.6  26393.04 26423.47 26107.65 26490.53 26517.19 268..
+NASDAQ100| 11637.77 11658.26 11503.72 11779.9  11852.59 12109.05 12125.69 118..
+NYA      | 14667.32 14599.59 14487.64 14499.49 14465.29 14676.5  14642.33 145..
+RUT      | 1738.84  1719.37  1707.99  1741.33  1727.55  1769.6   1769.36  173..
+SP500    | 3821.55  3818.83  3785.38  3831.39  3845.08  3902.62  3899.38  385..
+STOXX    | 416.19   413.42   407.2    400.68   407.34   415.01   417.12   415..
+```
 
-We can proceed to create a function called **fcoint** to call our imported function from PyKX, handle any null values by filling them with 0, using `0f^` and return the second value, which in this case is the p-value.
+> 💡 This approach exemplifies a good practice in kdb+: _keeping computations as close to the data as possible_. Instead of requesting data and then applying a filter or transformation to it, we send the computation to the HDB itself so we avoid transmitting unnecessary data over the wire.
 
+Once we have collected the prices, it is useful to identify all possible pairs:
 ```q
-coint:.pykx.import[`statsmodels.tsa.stattools]`:coint
-fcoint:{@[;1]0f^coint[x;y]`}
+syms:exec sym from cls
+ps:sx where (<).' sx:syms cross syms
 ```
+The first line extracts the sym column from the table. The second line generates their Cartesian product (`cross`) and filters out duplicate and inverse pairs by selecting only those pairs `where` the first element is alphabetically less than the second, ensuring each combination is unique. We are now ready to start analysing the cointegration of our pairs.
 
-As we saw in the introduction, we are working in a tick architecture environment. This means that, in addition to receiving real-time prices for our indices, this architecture provides a tool to store the closing prices of our indices in our historical database (HDB) at the end of the day.
+To assess the level of cointegration between each pair of assets, we'll employ a statistical test. This test will help us quantify the strength of the long-term equilibrium relationship we're looking for in our pairs trading strategy. The cointegration test is a **hypothesis test**, where we use our data to evaluate a specific hypothesis. In this case, our null hypothesis is that the two time series are not cointegrated. To interpret the results of this test, we'll be focusing on **p-values**.
 
-Therefore, we simply need to execute a straightforward query on the HDB to read these data and load them into memory. To achieve this, we define the function `rs`, which takes a date range and the indices for which we want to retrieve data. We can then use the **qSQL syntax** (very similar to SQL) to obtain the desired data.
+P-values help us decide whether to reject the null hypothesis. If the p-value is low, it suggests that we can reject the hypothesis of no cointegration, indicating that our assets are likely cointegrated. The lower the p-value, the stronger the evidence for cointegration. Conventionally, a threshold of 0.05 is often used for the p-value to reject the null hypothesis.
+
+For the sake of simplicity, we will be using [PyKX](https://code.kx.com/pykx/2.4/index.html). We need it to import the cointegration test function and to plot a heatmap of our results. Developing these functionalities directly in q would be time-consuming and prone to errors. Although an implementation of the cointegration test in kdb+/q would be more efficient and faster, the effort required would outweigh the benefits for this particular case where performance isn't critical. Therefore, we rely on PyKX to streamline the process by leveraging relevant libraries from the Python ecosystem.
+```q
+system"l pykx.q"
+```
+One such library is **statsmodels**, a prominent tool in Python for statistical modeling and hypothesis testing. It equips analysts with a robust toolkit for regression, time series, and multivariate analysis. Specifically, within the statsmodels package, the _statsmodels.tsa.stattools_ module features the cointegration test.
+```q
+co:.pykx.import[`statsmodels.tsa.stattools]`:coint
+```
+We can proceed to create a function called `cof` to call our imported function from PyKX and return the second element from the resulting list, which in this case is the p-value, [as described in the official documentation](https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.coint.html).
+```q
+cof:@[;1]co[<]::
+```
+
+> 💡 This function definition doesn't explicitly reference its arguments, thanks to the way kdb+/q enables us to compose functions. This programming style, where function arguments are not named explicitly, is known as tacit programming or point-free style. In kdb+/q, tacit programming is especially powerful and concise, allowing us to create expressive and efficient code. By leveraging the language's capabilities for function composition and implicit argument passing, we can achieve elegant solutions like the one demonstrated.
+
+Next, we will get the prices involved in each pair and use the function above to produce the desired p-values:
+```q
+pv:cof .'({x`close}')cls([]sym:ps)
+```
+The implementation details are not as important as illustrating the concision and terseness achieved when developing code in q.
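+For readers newer to q, the same computation can be unrolled into a more explicit helper. The following sketch is hypothetical (the name `pv2` is ours), assumes the `co`, `cls` and `ps` definitions above, and reuses the backtick conversion and `0f^` null-filling seen in the original `fcoint` definition from ADF.q:
+```q
+/ hypothetical, more verbose equivalent of the tacit pipeline above
+pv2:{[pair]
+  c:(cls([]sym:pair))`close;  / pair of closing-price vectors
+  r:0f^co[c 0;c 1]`;          / run the test, convert to q, fill nulls with 0
+  r 1                         / the second element is the p-value
+ }each ps
+```
+Both expressions should produce the same list of p-values; the tacit version simply trades explicitness for brevity.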
-And plot it:
-
 ```q
-pyhm:.pykx.import[`seaborn]`:heatmap
-pyhm[pvalues;`xticklabels pykw syms;`yticklabels pykw syms;`cmap pykw `RdYlGn_r]
-pyshow:.pykx.import[`matplotlib.pyplot]`:show
-pyshow[::]
+pyhm:.pykx.import[`seaborn]`:heatmap
+mx:flip("f"$1,'not null reverse m),'m:(0,sums[reverse 1_til count[syms]])_ pv
+pyhm[mx;`xticklabels pykw syms;`yticklabels pykw syms;`cmap pykw `RdYlGn_r]
+pysh:.pykx.import[`matplotlib.pyplot]`:show
+pysh[::]
 ```
+After importing the library's heatmap utility and arranging the p-values into a matrix `mx` (filled with dummy values in the top right), we produce the heatmap and show it:
 
-Given our following assets:
-
-| SP500 | NASDAQ100 | BFX | FCHI | GDAXI | HSI | KS11 | MXX | N100 | N225 | NYA | RUT | STOXX |
-| :---: | :-------: | :-----: | :----: | :-----: | :-------: | :-----: | :----: | :----: | :---: | :---: | :---: | :----: |
-| USA | USA | Belgium | France | Germany | Hong Kong | S.Korea | Mexico | Europe | Japan | USA | USA | Europe |
-
-Our heatmap looks like this:
+![ADF heatmap](/resources/heatmap.png)
 
-![ADF heatmap](https://github.com/hablapps/pairstrading/blob/5-Post/resources/ADFgif.gif?raw=true)
 
-As we can observe, there are several cointegrated indices, but our attention will be drawn towards the **NASDAQ100 and SP500** synergy. Both of these indices belong to the American market and share numerous characteristics. They encompass American companies traded within the same scenario, which is what makes them a perfect fit for our case.
-In this heatmap, they exhibit a vibrant green colour, indicative of a high degree of cointegration, or, in simpler terms, a very low probability of not being cointegrated. They demonstrate low p-values suggesting their strength as candidates.
-
-> 💡 As we can see, this pair of indexes is not the best candidate according to our ADF tests. However, we chose it because the tick data for their prices is publicly available. We used TickStory to obtain the data.
+We now draw our attention to the synergy between _FCHI_ and _GDAXI_. They exhibit a vibrant green color, indicative of a high degree of cointegration, or, in simpler terms, a very low probability of not being related. They demonstrate the lowest p-value, suggesting their strength as candidates. The next graph consolidates this intuition:
 
 ![Prices](resources/cointegration.png)
 
+The plotted graph displays the prices of both indexes together, providing a clearer comparison that showcases the cointegration between them. The blue and green lines represent GDAXI and FCHI, respectively. The close alignment of their price movements indicates that they are cointegrated to some extent. This means that, despite short-term deviations, the indexes tend to move together over time, maintaining a stable relationship. This graph was generated by [KX Dashboard](https://code.kx.com/dashboards/), which receives data from the Pairs Trading process and renders visualizations; we'll use this tool again later on.
-The plotted graph displays the prices of both indexes together, providing a clearer comparison that showcases the cointegration between them. The blue line represents SP500, and NASDAQ100 is represented by the green line. The close alignment of their price movements indicates that they are cointegrated to some extent. This means that, despite short-term deviations, the indices tend to move together as time goes on, maintaining a stable relationship. This graph was generated by [KX Dashboard](https://code.kx.com/dashboards/), which receives data from the Pairs Trading process and renders visualizations.
 
-Overall in this section we explained how the first part of our diagram works. We first discovered that, even in the randomness of the market, some assets show some similar movements. Then, based on that information and using the historical data from our tick architecture, we used the cointegration and ADF test to pinpoint a specific pair of them for our analysis: NASDAQ100 and SP500.
+As mentioned earlier, the market is inherently random and doesn't always behave predictably. While FCHI and GDAXI often follow similar trends, their individual values can sometimes diverge significantly. For instance, FCHI may rise while GDAXI falls, or vice versa. However, this presents **an opportunity for profit** because we know that these assets tend to revert to their shared mean over time. If one asset is _overpriced_ and likely to decrease, we may consider selling it (going short). Conversely, if an asset is _underpriced_ and expected to increase, we may consider buying it (going long). That is actually the essence of Pairs Trading.
 
-![Arch 1st Part](resources/general-architecture-top.png)
 
+> 💡 This strategy possesses a valuable financial characteristic: our _profitability remains unaffected by the broader market trends_, as our focus lies solely on the disparity between the two assets. It's about relative movements rather than absolute ones; we're indifferent to whether prices are rising or falling. This quality defines it as a neutral market strategy.
 
-As mentioned earlier, the market is inherently random and doesn't always behave predictably. While NASDAQ100 and SP500 often follow similar trends, their individual values **can sometimes diverge significantly**. For instance, NASDAQ100 may rise while SP500 falls, or vice versa.
 
+As you might guess, our next task is to build the model that helps us determine whether an index is actually overpriced or underpriced.
 
-However, this presents **an opportunity for profit** because we know that these assets tend to revert to their shared mean over time. If one asset is **overpriced** and likely to decrease, we may consider **selling it** (going short). Conversely, if an asset is **underpriced** and expected to increase, we may consider **buying it** (going long). That is what we call Pairs Trading.
 
 ## Determining how to calculate the spreads
 
-> 💡 This strategy possesses financial characteristics: our **profitability remains unaffected by the broader market trends**, as our focus lies solely on the disparity between the two assets. It's about relative movements rather than absolute ones; we're indifferent to whether prices are rising or falling. This quality defines it as a **neutral market strategy**.
 
+Having identified the optimal pair of indices for our pairs trading strategy, let's now explore how to deploy a simple model that detects trading opportunities. Additionally, we'll examine how all of this fits within the MS (Model Server) component.
 
-## Determining how to calculate the spreads
 
+![Arch-ML](resources/general-architecture-ms.png)
 
-After this initial market assesment, we can move on to coding the sctual Pair Trading model that calculates the spreads. The first approach we may try could simply be to subtract them and observe if the difference deviates significantly from zero, considering their scale difference.
+At this point, we can start coding the actual pairs trading model that calculates the relationships between the prices of our indexes, which we'll refer to as **spreads**. Our initial approach might simply involve subtracting the prices and observing whether the difference deviates significantly from zero, taking their scale difference into account.
 
 Indeed, just subtracting the prices of two assets, as in $price_y−price_x$, may not provide a clear understanding of their relationship. Let's illustrate this with an example:
@@ -149,7 +158,6 @@ q)spreads: price_y - price_x
 **These spread values don't offer much insight** into the relationship between the two assets. Are both assets increasing? Are they moving in opposite directions? It's unclear from these numbers alone.
-
 Let's consider **using logarithms**, as they possess favourable properties for our pricing model. They prevent negative values and stabilize variance. Log returns are time-additive and symmetric, simplifying the calculation and analysis of returns. This improves the accuracy of statistical models and ensures non-negative pricing, enhancing model robustness and reliability:
 
 ```q
@@ -161,17 +169,17 @@ q)spreads: log[price_y] - log price_x
 1.526056 1.098612 1.272966 2.014903 1.475907
 ```
-We're making progress, as we observe **numbers now fluctuating within much smaller ranges**. However, we're still missing a clear understanding of the underlying relationship. While we've normalized the data using logarithms, we now need to align their discrepancies to a single asset.
-
-Since both assets are related, **we can leverage linear regression** to our advantage. This enables us to simplify our spreads effectively. So, we'll conduct a basic linear regression analysis using historical data to discern the disparity between them. The generic formulae for one is:
+As we progress in our analysis, we're now seeing **numbers fluctuate within much smaller ranges**, thanks to our logarithmic normalization. However, to fully capitalize on the relationship between our assets, we need to align their discrepancies. This is where **linear regression** becomes a powerful tool in our algorithmic trading arsenal. By applying linear regression, we can find the optimal parameters α (alpha) and β (beta) that minimize the variance of the spread between our assets. This reduced variance is a crucial property for an algorithmic trading strategy, as it indicates a more stable and predictable relationship between the assets. The generic formula for our linear regression is:
 
 $$Y = \alpha + \beta X + \varepsilon$$
 
-In this context, Y represents the NASDAQ 100 index, X represents the S&P 500 index, α is the intercept, β is the slope (which indicates the relationship strength between the two indices), and ε is the error term.
+Where ε represents the spread, whose variance we aim to minimize. By finding the α and β that best fit this equation, we're essentially identifying the most consistent relationship between our assets. This consistency is invaluable in algorithmic trading, as it can lead to more reliable trading signals and potentially lower risk. The resulting spread, with its minimized variance, becomes a key indicator for our trading decisions, allowing us to exploit even small deviations from this optimized relationship.
+
+In this context, Y represents the GDAXI index, X represents the FCHI index, α is the intercept, β is the slope (which indicates the relationship strength between the two indexes), and ε is the error term.
+We illustrate it in the next graph:
 
 ![LinearRegression](resources/linear_regression.png)
 
-The plotted graph above illustrates the relationship between the NASDAQ 100 and the S&P 500 indices, with each purple dot representing a data point of their prices at a given time. The linear trend visible in the scatter plot suggests a strong positive cointegration between the two indices. By applying linear regression, we can model this relationship mathematically, allowing us to predict the NASDAQ 100 index price based on the S&P 500 index price. This predictive power is crucial for pair trading, as it helps identify mispricings and potential trading opportunities.
+As you can see, it shows the relationship between both indexes, with each purple dot representing a data point of their prices at a given time. The linear trend visible in the scatter plot suggests a strong positive cointegration between the two indexes. By applying linear regression, we can model this relationship mathematically, allowing us to predict the GDAXI index price based on the FCHI one. This predictive power is crucial for pairs trading, as it helps identify mispricings and potential trading opportunities.
 
 Linear regression aims to identify relationships between historical data, which we then extrapolate to current data. The differences between these relationships, or deviations, are our spreads. We've already calculated the 𝛼 and 𝛽 using the logarithmic values of our historical data (since real-time price values for price_x and price_y are unknown). Now, we simply combine everything and apply linear regression to our price logarithms:
@@ -182,132 +190,148 @@ q)spreads: log[price_y] - alpha + log[price_x] * beta
 -0.1493929 0.0451223 -0.08835117 0.0451223 0.1579725
 ```
-There are different methods we can use to obtain the best alpha and beta values that minimize the spreads or, in other words, there are mathematical ways to find the line that best fits the prices.
+To find the best relationship between our pair of assets in a pairs trading strategy, we need to determine the optimal values of alpha and beta that minimize the spread. In mathematical terms, we're looking for the line that best fits the prices. The most common approach is the least squares method; let's see how we can tackle the problem.
 
-The most common method to find the best relationships (alpha and beta) is the least squares method, which minimizes the sum of the squared residuals:
-$$S(\alpha, \beta) = (log(priceY) - (\beta \cdot log(priceX)+\alpha)^2$$
-After taking partial derivatives with respect beta and setting to zero, and then solving, we can arrive at this formula:
+Firstly, we can rewrite the linear regression equation as a matrix product:
 
-$$\beta = \frac{{(n \cdot \sum(x \cdot y)) - (\sum x \cdot \sum y)}}{{(n \cdot \sum(x^2)) - (\sum x)^2}}$$
+```math
+Y = (\alpha, \beta)\begin{pmatrix}1 \\ X\end{pmatrix}
+```
 
-Which we can see implemented in the following functions:
+Where α represents the intercept and β the slope of our regression line $Y = α + βX$.
+
+In kdb+/q, we can efficiently solve this equation using the `lsq` (least squares) operator. Here's a compact function that computes the linear regression coefficients:
 
 ```q
-betaF:{dot:{sum x*y};
-      ((n*dot[x;y])-(*/)(sum')(x;y))%
-      ((n:count[x])*dot[x;x])-sum[x]xexp 2}
+lrf:{first enlist[y]lsq x xexp/:0 1}
 ```
 
-Now, following the same steps as before but for alpha, we arrive at:
+This function works by creating a column vector [1, X] using a trick with `xexp`, then applying the `lsq` operator to solve the least squares problem. It leverages q's vectorized operations and built-in least squares solver to efficiently compute the regression coefficients.
 
-$$\alpha = \bar y - \beta \cdot \bar x$$
+Now, let's encapsulate the linear regression fitting and spread calculation processes. This will provide an interface that other components can use to fit a linear regression and calculate spreads.
 
-Which is implemented in the following line of code:
+First, to encapsulate the **linear regression fitting**, we'll develop a function called `ab`. This function takes a pair of indices as input, retrieves data from the `cls` table (which we used in the cointegration test), and uses the `lrf` function to obtain the linear regression parameters. The implementation would look like this:
 
 ```q
-alphaF: {avg[y]-betaF[x;y]*avg[x]}
+ab:{lrf . log(cls([]sym:x))`close}
 ```
 
-Finally, we can encapsulate both parameters in just one function called `lr_fit`. This function only has to apply each fit function to our input data.
+Next, we'll encapsulate the **spread calculation**. We'll create a function that takes a pair of indices and returns another function. This returned function will expect the prices of `x` and `y` as inputs and calculate the spread. Leveraging q's conciseness, we can implement this as follows:
 
 ```q
-lr_fit:{(alphaF;betaF).\:(x;y)}
+sm:{[a;b;x;y]y-a+b*x}. ab@
 ```
 
-Now we simply need to apply `lr_fit` to find the optimal alpha and beta on the historical prices (which we took from HDB) of the indices we choose.
+The result is a powerful and concise interface for fitting linear regressions and calculating spreads, which can be easily used by other components in our system.
 
-```q
-params:lr_fit[t[`SP500]`close;t[`NASDAQ100]`close]
-alpha_lr:params 0;beta_lr:params 1
-```
+> 💡 As you may have noticed, this idea could be generalized and expanded to create a full-fledged model server. This server could host a variety of models, ranging from simple ones like linear regression to more complex models, making them accessible to other components. It could also include additional functions that would transform it into a powerful tool in the context of algorithmic trading and financial systems.
 
 This precisely meets one of our objectives: getting **a comprehensive method for representing relative changes between both assets**. As we can deduce, our mean is now 0 because our assets are normalized, cointegrated and on the same scale. Therefore, ideally, the differential between their prices should be 0. Consequently, when our spread is below 0, we infer that asset X is overpriced, whereas if it's above 0, then asset Y is overpriced.
-
 ## Real-time spread calculation
 
-Now that we have selected a pair of cointegrated indices and understand how to calculate their relationships, let's see how we can create a real-time pair trading scenario.
-
-To do this, we need to focus on the Real-Time Pair Trading Subscriber (RPT). As can be seen in the diagram, this component receives 3 inputs.
+Now that we have selected a pair of cointegrated indexes and built a model to calculate their relationships, it's time to formalize its subscription as a real-time component. Once we start receiving data from the TP, we can use the model to produce the spreads, which will then be sent to the dashboard, as illustrated in the diagram. -Let's start with the easy part: first, we receive which pair of indexes we have chosen in the previous step (this step, in our case, is done AD-HOC since, as we mentioned, we have chosen these two indices not only because of their cointegration but also because they are indices whose tick data is public). - -The next step is to receive the data for this pair of indices. For this, we need to subscribe to the tickerplant using the tick architecture function `.u.sub`. Here, we need to specify which table and which symbols (indexes in our case) we want to subscribe to. This information needs to be passed through the connection to the tickerplant. - -> 💡 `.z.x` in the KDB+/Q language is an environment variable that stores the command-line arguments passed to a q script. +![Arch-bottom](resources/general-architecture-rpt.png) +The first step in implementing this component is to retrieve the most cointegrated pair and its associated model. As we did before, we use IPC to communicate with the MS component: ```q -(hopen `$":",.z.x 0)"(.u.sub[`quote;`SP500`NASDAQ100])" +(ix;sp):ms({enlist[ix],sm ix:ps pv?min pv};::) ``` +Here, we assume that `ms` points to the MS component (similar to how `hdb` was used in a previous section), and we send a query for MS to execute. It calculates the most cointegrated pair by finding the minimum p-value, identifying its position, and using it to find the names of the indexes. Finally, it executes the `sm` function obtained from the previous section to get the spread model for the most cointegrated pair. As a result, we get both the pair of indexes (`ix`) and the spread model (`sp`). -> 💻 Basically, what `.u.sub` does is transfer the information about our process to the tickerplant and add it to its registers. By using `.u.sub`, we inform the tickerplant of our subscription interest, enabling us to receive real-time updates for the specified indices. - -Additionally, we will connect to the HDB (as shown in the ADF testing) to calculate the alpha (`alpha_lr`) and beta (`beta_lr`) of the linear regression. - -We would then only need two things, first to see how the tickerplant sends (or receives) this data, and second to calculate the spread given the data and then publish it again so that the Dashboard can read and plot it. Let's see that we can really do all this in one line of code in a very simple way. - -The first thing we need to know is how tickerplant sends the data. What it does is to send a call to an `upd` function (that we must have declared in our process) with two arguments, the first one is the name of the table to which the incoming data belongs as the second argument. Then the first thing to do is to declare an `upd` function. What this function should do is to publish in a table (`spread`) the spreads of the linear regression. For this we will use the `u.pub` function of the tick architecture. - -> 💻 The `.u.pub` function takes two parameters: the name of the table to publish to and the content to publish, then it publishes the content to the subscribers of the table, again calling the `upd` method that should be declared in the subscriber process. 
- +Real-time components can manifest their interest in a particular table and a subset of symbols. Naturally, we are only interested in getting quotes associated with one of the indexes in our `ix` pair: ```q -upd:{.u.pub[`spread;([]time:1#y`time;spread:sp . y`bid)]}; +tp"(.u.sub[`quote;",(.Q.s1 ix),"])" ``` +As usual, we assume that `tp` is a pointer to the TP process. Essentially, `.u.sub` registers the RPT handle in the TP so it can later notify the recently subscribed component about new events. It assumes that the subscriber has defined an `upd` function as a kind of callback: +```q +d:ix!2#0f +upd:{d^:exec sym!log(ask+bid)%2 from select by sym from y} +``` +Our `upd` function takes ticks as input (`y`), selects the most recent one for each symbol, calculates their mid price, and updates the values (if any) in a simple dictionary `d`. This dictionary acts as a cache, which becomes essential in the following and final step. -We simply need to declare an `sp` function that calculates the spread given the prices, but this is something we already know how to do. - +Now we need to notify the KX Dashboard about the spreads. The dashboard subscribes to the RPT using the same interface that the RPT uses to subscribe to the TP (.u.sub). However, in this case, the dashboard makes this task automatic and transparent to the user by invoking this function once a dashboard element has selected the `spread` table from the RPT process as its data source. With this setup, we can publish events to our subscribers using `.u.pub`: ```q -sp:{y - alpha_lr + beta_lr * x}; +.z.ts:{.u.pub[`spread;([]time:1#.z.N;spread:sp . d ix)]} +\t 16 ``` +As shown, we wrap this publication as part of a `.z.ts` function. This function is special because it can invoked automatically by setting a value for `\t`. In this case, we instruct kdb to publish the last seen spread from `d` every 16 ms. Why 16 ms? Because this allows the dashboard to update at a rate of ~60 frames per second, as illustrated in the [Streaming section](https://code.kx.com/dashboards/datasources/#streaming) from the KX Dashboard documentation. -By using this approach, we only need to connect KX Dashboards to our publisher by setting up a new connection from the connection selector in the UI. -This will allow us to plot our spreads in real time and we will end up with something like this: +> 📚 We highly recommend reading [Building real-time tick subscribers](https://code.kx.com/q/wp/rt-tick/) by Nathan Perrem before building your own real-time components. + +By embracing this approach, it only remains to connect KX Dashboards to our publisher by setting up a new connection from the connection selector in the UI. This will allow us to plot our spreads in real time and we will end up with something like this: ![SpreadsD](resources/spreads.gif) -And there we have it! **A perfectly plotted spread series in real-time**, ready to be utilized for further analysis and exploitation. +And there we have it! A perfectly plotted spread series in real-time, ready to be utilized for further analysis and exploitation. Up until this point in this section, we have taken a look at how we, having previously identified a pair of compatible assets, could reliably calculate a meaningful spread and implemented it in a simulated real-time scenario. Thanks to KX Dashboards we were also able to create a simple plot to show all this information in a way that's easily understandable. 
-![Arch-bottom](resources/general-architecture-bottom.png)
-
+To finish, once we have our spreads accurately calculated and observe how our data is being updated, we'll convert these spreads into **z-scores** for more effective **signal generation**. This conversion allows us to execute buy and sell orders when significant spread discrepancies occur, based on standardized signal windows. To implement this, we first fit the **mean and standard deviation** of our historical spreads. The z-score of a spread is then calculated as its distance from the mean in terms of standard deviations, providing a consistent framework for identifying trading opportunities across different pairs.
+
+```math
+z_t = \frac{\varepsilon_t - \mu_\varepsilon}{\sigma_\varepsilon}
+```
+
-To finish, once we have our spreads accurately calculated and observe how our data is being updated we can **execute buy and sell orders when spread discrepancies occur** based on some signal windows.
-
+We'll use z-score thresholds of -1.96 and 1.96 as our trading signals. These values correspond to the 95% confidence interval in a normal distribution. When the z-score exceeds 1.96, we'll sell the overvalued asset and buy the undervalued one. Conversely, when it falls below -1.96, we'll do the opposite. We'll unwind positions as the z-score returns to 0.
+
-A simple approach to window signals is to set these windows as twice the historical standard deviation of the spreads. Therefore, if either of these limits is reached, we should sell the overvalued index and buy the undervalued one, and then unwind our position when the spread returns to 0. Let's clarify this with a specific example:
+This approach allows us to identify and trade on significant deviations in the relationship between our paired assets, with the expectation that these deviations will eventually revert to the mean. Let's clarify this with a specific example:
 
 ![WSignals](resources/window_signals.gif)
 
-In this instance, we can see that the spread (purple line) is positive and above the signal (blue line), indicating that our Y index (NASDAQ100) is overvalued relative to the SP500. Therefore, we should sell NASDAQ100 and buy SP500. At the end of the gif, it can be observed that the spread returns to 0 (green line), meaning the indexes are no longer overvalued or undervalued, respectively. At this point, we should unwind the positions we acquired earlier.
+In this instance, we can see that the spread (purple line) is positive and above the signal (blue line), indicating that our Y index (GDAXI) is overvalued relative to FCHI. Therefore, we should sell GDAXI and buy FCHI. At the end of the animation, it can be observed that the spread returns to 0 (green line), meaning the indexes are no longer overvalued or undervalued. At this point, we should unwind the positions we acquired earlier.
 
 > 💡 Signal windows play a pivotal role in implementing Pairs Trading strategies. They serve as indicators for determining when to execute buy and sell actions, acting as arbitrary thresholds that guide our algorithm's decision-making process. These windows are derived from the variance of our data, representing a static variance assumption due to our consideration of a time-independent cointegrated series.
 
-## Conclusion and Future Work
+As you can imagine, by taking advantage of the flexibility of the Tick architecture and KX Dashboard, what we have set up for two indexes could be implemented for all pairs of indexes that exhibit some cointegration without losing performance.
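+Before moving on to the conclusions, the signal logic just described can also be sketched in q. The helper names below are hypothetical, and we assume a vector `hs` of historical spreads from which the mean and standard deviation are fitted:
+```q
+/ hypothetical signal helpers around the +-1.96 thresholds discussed above
+zs:{[hs;s](s-avg hs)%dev hs}                        / z-score of a live spread s
+sig:{$[x>1.96;`sellY`buyX;x<neg 1.96;`buyY`sellX;`hold]}
+```
+A caller would evaluate `sig zs[hs] s` on each incoming spread `s` and unwind any open position as the z-score crosses back through 0.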
+## Conclusion
+
+This post has demonstrated that implementing a simple Pairs Trading strategy in kdb+/q is straightforward.
+
+The primary enabler of this simplicity is the Tick Architecture, a robust design pattern that allows us to handle historical and real-time data with ease. We observed this in action while retrieving HDB data through the MS to calculate the spread models, and while subscribing the RPT to the TP to get the most up-to-date ticks. As we have seen, IPC is at the heart of this architecture and is essential for integrating new components into the system.
 
-In this post, we have provided a comprehensive overview of Pairs Trading, covering its implementation and intricacies in KDB+/Q.
+Secondly, the kdb ecosystem provides excellent tools, such as KX Dashboard, which we utilized to produce several graphs, particularly emphasizing real-time visualizations. As a niche platform, kdb cannot compete with Python in terms of ecosystem variety. However, PyKX allows us to overcome this limitation by enabling interaction with the Python ecosystem. This opens up possibilities for leveraging libraries like statsmodels and seaborn, as demonstrated in our examples.
 
-We have discussed:
+Last but not least, the concision and expressiveness of q are impressive. We only needed to extend the vanilla architecture with fewer than 25 lines of q code (skipping blank lines) to implement the [MS](https://github.com/hablapps/pt/blob/main/ms.q) and [RPT](https://github.com/hablapps/pt/blob/main/tick/rpt.q) components. In our view, maintaining 25 lines of code is far preferable to maintaining 250 lines. If you explore the [pt repo](https://github.com/hablapps/pt), you will see that beyond these components, it extends the [kdb+tick vanilla setup](https://github.com/KxSystems/kdb-tick) with a feed handler and a historical database process. These are just dummy plumbing components to help you reproduce the results. Finally, while becoming proficient with this language takes time, you don't need to master it to start analyzing data. q-SQL is a great entry point for beginners.
 
-1. An examination of cointegrated assets within the market.
-2. Multiple Augmented Dickey-Fuller (ADF) tests on real assets.
-3. An introduction to the pairs trading strategy itself.
-4. A clear and guided explanation of spread calculation and interpretation in KDB+/Q.
+However, there are several other benefits of this landscape that we couldn't explore in this article.
 
-One valid concern is that our calculations might be heavily influenced by past data and rely too much on historical changes that may not accurately reflect the present reality. To address this, we could implement a rolling window approach where the linear regression is continuously updated, ensuring our model remains responsive to changes in the underlying data over time. Additionally, using the Kalman Filter to dynamically fit the alpha and beta of the linear regression can effectively filter noise and predict states in a dynamic system, allowing for real-time adjustments and providing a more accurate reflection of current market conditions. We will delve deeper into the topic of window signals as well, exploring more advanced techniques and their applications in real-time pair trading. This will further enhance our model's responsiveness and accuracy, providing a robust framework for effective trading strategies.
+Namely, the performance of these technologies is superb, both in terms of speed and memory footprint. In fact, as stated by KX in one of the articles linked above, kdb+ supports a throughput of 35K ticks per second for a single node. If we scale the solution to several nodes, we should easily accommodate thousands of pairs in the system simultaneously. We leave scaling this solution up as future work.
 
-Our goal was to demonstrate the capabilities of KDB+/Q and its potential in implementing a simplified yet powerful financial strategy. By doing so, we hope to make these concepts more accessible and empower individuals to leverage these tools at their own work. If you have any questions or need further clarification, don't hesitate to reach out.
+Additionally, one of the killer features of this platform is flexibility. We showed a bit of it while extending the architecture with our own components, but we could go further. For example, we could produce more accurate models by embracing the Kalman Filter to dynamically fit the coefficients of the linear regression. This would allow for real-time adjustments and provide a more accurate reflection of current market conditions. We could also delve deeper into the topic of window signals to provide a robust framework for effective trading strategies. Analyzing the complexity of integrating these changes, and more generally analyzing the evolvability of the system, is also left as future work.
 
-Special thanks to [...] for [...]
+We hope you enjoyed reading this article! Please feel free to experiment with the repo and extend it in any way. Feedback is more than welcome!
+
+## Acknowledgements
+
+We wish to express our sincere gratitude to Álvaro Sánchez-Paniagua Ríos for initiating the development and research process of this post; we greatly appreciate the foundational work he established. Furthermore, we extend our deepest thanks to Javier Sabio for introducing us to the topic of pairs trading and generously providing the initial documentation that facilitated our further exploration and development of this subject matter. Finally, we are immensely grateful to Jesús López González for his invaluable assistance in integrating the pairs trading strategy within the tick architecture and for his thorough review of the post's content. His expertise and insights have significantly enhanced the quality and practicality of this work.
 ## References and Documentation
 
 For the technical implementation, we relied on:
 * Kx Documentation: https://code.kx.com/q/ref/
-* Q for mortals: https://code.kx.com/q4m3/
 * PyKX Documentation: https://code.kx.com/pykx/2.4/index.html
+* KDB Tick Explained: https://www.defconq.tech/docs/tutorials/tick
+* kdb+ architecture: https://learninghub.kx.com/courses/kdb-architecture/
 * statsmodels Documentation: https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.coint.html
+* q201: https://q201.org
+* Q for mortals: https://code.kx.com/q4m3/
 
 For the financial implementation, we used:
 * QuantResearch: https://github.com/QuantConnect/Research/blob/master/Analysis/02%20Kalman%20Filter%20Based%20Pairs%20Trading.ipynb
+
+For the data gathering, we used:
+* Yahoo Finance API: https://github.com/ranaroussi/yfinance
+* Tickstory: https://tickstory.com/
+
+## Other articles by Habla you might find interesting
+
+* [Exploring KX Dashboards](https://www.habla.dev/blog/2023/10/03/Exploring-KX-Dashboards.html)
+* [All Roads Lead to PyKX](https://www.habla.dev/blog/2023/07/31/all-roads-lead-to-pykx.html)
+* [All Roads Lead to Kdb: Technical Counterpart](https://www.habla.dev/blog/2023/09/15/all-roads-lead-to-kdb-the-technical-counterpart)
+* [Contributing to PyKX](https://www.habla.dev/blog/2024/04/10/Contributing-to-PyKX.html)
diff --git a/resources/ADFgif.gif b/resources/ADFgif.gif
deleted file mode 100644
index 8152dba..0000000
Binary files a/resources/ADFgif.gif and /dev/null differ
diff --git a/resources/cointegration.png b/resources/cointegration.png
index 494b043..f1c9e4b 100644
Binary files a/resources/cointegration.png and b/resources/cointegration.png differ
diff --git a/resources/general-architecture-bottom.png b/resources/general-architecture-bottom.png
deleted file mode 100644
index f86de5f..0000000
Binary files a/resources/general-architecture-bottom.png and /dev/null differ
diff --git a/resources/general-architecture-ms.png b/resources/general-architecture-ms.png
new file mode 100644
index 0000000..7ea0773
Binary files /dev/null and b/resources/general-architecture-ms.png differ
diff --git a/resources/general-architecture-rpt.png b/resources/general-architecture-rpt.png
new file mode 100644
index 0000000..a8c8b8c
Binary files /dev/null and b/resources/general-architecture-rpt.png differ
diff --git a/resources/general-architecture-top.png b/resources/general-architecture-top.png
deleted file mode 100644
index 84c372c..0000000
Binary files a/resources/general-architecture-top.png and /dev/null differ
diff --git a/resources/general-architecture.png b/resources/general-architecture.png
index acf81d5..0f5506d 100644
Binary files a/resources/general-architecture.png and b/resources/general-architecture.png differ
diff --git a/resources/heatmap.png b/resources/heatmap.png
new file mode 100644
index 0000000..71dcf1e
Binary files /dev/null and b/resources/heatmap.png differ
diff --git a/resources/linear_regression.png b/resources/linear_regression.png
index 7bebb22..a8bfa97 100644
Binary files a/resources/linear_regression.png and b/resources/linear_regression.png differ
diff --git a/resources/tfg18.png b/resources/tfg18.png
deleted file mode 100644
index 9a9a6ec..0000000
Binary files a/resources/tfg18.png and /dev/null differ