What regions to choose for single or multiple unit testing? #319
Awesome thoughts @cetagostini. Just a few quick responses.

Both approaches?

I think it makes a lot of sense to go after both approaches. So we'd end up having something like:
But because I don't have experience of applying this in real situations... do we even need to do this? If we know what test regions to run, then all other regions are by default control regions. So can't we just run our test in the geos that we want to, then run regular synthetic control after the fact, and that will choose the set of control geo weightings? The only difference is that we've not decided those weightings (or which geos will be weighted at zero) in advance of running the test. [Also see next point]

I have had questions from a client who genuinely wants to know what the best test regions are. This is a very general problem, so this doesn't give anything away...
So I don't think we should throw out the idea.

Sparsity

The LASSO idea is an interesting one. There could be some slight complications in using LASSO for geo selection but then ending up analysing with synthetic control. If that does turn out to be problematic, then we could consider altering the hyperparameters of the Dirichlet distribution so as to favour fewer, more strongly weighted regions. So basically take that idea of sparsity from LASSO but implement it in the synthetic control.

Question

I don't have experience of actually using these methods in the real world. It seems like it's 'easier' to calculate the posterior in the situation of no effect, in order to calculate a ROPE and make statements about minimal detectable effects etc. But when it comes to simulating effect sizes, this seems more complex. Should you simulate an immediate and sustained impact, or consider some other time course of impact? We don't know what the time course of the impacts are, so this seems more tricky to get right.
Hey @drbenvincent! Agree we can have both, maybe I'm very biased by my usual work use cases! 🙌🏻
Very valid point, I have been there. To add context: you'd be assuming that using all (or some) of the other geos would give us a good enough model. What sometimes happens is that putting all the elements in the blender and hitting the button gives a very poor result. If we execute an activity in region Y blindly, without first verifying which other regions could serve as a control, we could end up in a situation where, after the test, you want to use regions X and Z as a control but these give a poor counterfactual with a high MDE, which ultimately leaves your experiment with little significance. If you had known this beforehand, you could have restructured the experiment, or simply looked for other regions. I think that is the value of the function.
This is a pretty interesting case, I think it proves my point a bit. They know exactly which regions they want to take for tests (the cheap regions, because they have a small budget) but even within those regions, they need to make a sub-selection! Thanks for sharing the context, definitely can imagine it.

Sparsity

By issues, do you mean maybe LASSO can zero out some predictor coefficients, which might translate into zero weights for some regions in the synthetic control? Just to be sure we are thinking the same!
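To make sure we picture the same mechanism, here is a small illustrative sketch (not from the thread; the region names and data are synthetic): the L1 penalty in LASSO can set some control-region coefficients exactly to zero, which is the sparsity in question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

# Hypothetical panel: a KPI series for one candidate test region plus its
# potential control regions (all names and numbers here are made up)
rng = np.random.default_rng(0)
control_regions = ["region_a", "region_b", "region_c", "region_d"]
df = pd.DataFrame(rng.normal(size=(200, len(control_regions))), columns=control_regions)
df["test_region"] = 0.7 * df["region_a"] + 0.3 * df["region_c"] + rng.normal(scale=0.1, size=200)

X = df[control_regions].to_numpy()
y = df["test_region"].to_numpy()

# The L1 penalty can shrink some coefficients exactly to zero...
lasso = LassoCV(cv=5).fit(X, y)

# ...so the surviving non-zero coefficients pick out a sparse set of controls
selected = [r for r, coef in zip(control_regions, lasso.coef_) if abs(coef) > 1e-8]
print(selected)
```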
I like this idea, do you think we'll end up with two typologies of the

Simulated Effect
Great question, from my experience, indeed ROPE makes things simple. That's why I like it 😅 However, here I have no opinions about what could be better. Sometimes, according to my assumptions and mood, if I think the effect will be continuous then I simulate it constantly and immediately during a fake period without effect and see if the model recovers it. I did a package doing this like one/two years ago. But since that is a very naive way, and sometimes I have the assumption that the effect could vary in time, I have a function that generates a random walk time series of a given length, ensuring that the arithmetic mean of the time series equals a specified target mean (the simulated mean effect).

```python
import numpy as np

def generate_simulated_random_walk_impact(length, mean_effect):
    # Generate standard normal steps
    steps = np.random.normal(loc=0, scale=1.0, size=length)
    # Create the random walk by taking the cumulative sum of the steps
    random_walk = np.cumsum(steps)
    # Shift the whole walk so its arithmetic mean equals the target mean effect
    current_mean = np.mean(random_walk)
    adjustment = mean_effect - current_mean
    adjusted_random_walk = random_walk + adjustment
    return adjusted_random_walk
```

Those random walks with that mean X represent your effect and are multiplied over the period of the same length where no effect exists. Usually, you can say

Once you have those different time series with different patterns but with the same total effect, you can compute the distribution of your daily or cumulative value (you decide), and once you have the posterior of your model, you can compare it with the distribution of your effect. If the posterior and the effect's distribution are significantly different, it indicates that your model is detecting it more accurately. You can actually create a score around it, computing the difference between the distributions, and finally determine the MDE.

However, in the end that whole process is not much different from using ROPE during a period of no effect, and ROPE is much shorter and simpler. But would it still be worth exploring?
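A minimal usage sketch of that function (illustrative only; the number of simulations, the length, and the mean effect are made-up values):

```python
# Illustrative only: generate many effect paths that all share the same mean effect
n_simulations, length, mean_effect = 1000, 30, 0.05

simulated_effects = np.array(
    [generate_simulated_random_walk_impact(length, mean_effect) for _ in range(n_simulations)]
)

# Every path has the same arithmetic mean (the target effect)...
print(simulated_effects.mean(axis=1))  # all equal to mean_effect (up to float error)
# ...but the day-to-day values differ, giving a distribution of plausible daily effects
print(np.percentile(simulated_effects.ravel(), [5, 50, 95]))
```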
I think at this point I'm yet to be convinced how useful that would be.

In terms of sparsity of the synthetic control, I think the simplest way would be just to allow the user to specify the hyperparameters of the Dirichlet distribution over the control-region weights.
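As a rough numerical illustration of that last point (a sketch, not the actual CausalPy API): with a symmetric Dirichlet, a concentration below 1 pushes most of the mass onto a few regions, while a larger concentration spreads the weights more evenly.

```python
import numpy as np

rng = np.random.default_rng(42)
n_regions = 10

# Concentration < 1 favours sparse weight vectors (a few regions dominate)
sparse_weights = rng.dirichlet(np.full(n_regions, 0.1), size=5)
# Concentration > 1 favours more evenly spread weights
dense_weights = rng.dirichlet(np.full(n_regions, 5.0), size=5)

print(sparse_weights.max(axis=1))  # typically close to 1
print(dense_weights.max(axis=1))   # typically much closer to 1 / n_regions
```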
In the context of geo testing... Let's say we have historical data of some KPI (such as sales, or customer sign ups) across multiple geographies. And we are considering running some geo testing on a single (or multiple) regions. That is, we are planning on an intervention in one (or multiple) geographical regions. The question at the top of our minds is "what region (or regions) should we select to receive the intervention?"
This question has been tackled in this GeoLift docs page, which uses power analysis as the basic approach.
We can formalise this a bit as follows: how do we best select `n` test regions out of the total pool of all regions?

The general approach we can use here is to iterate through a list of all possible valid test regions. For each case, we can use some kind of scoring function. We can then evaluate the scoring function for each possible valid set of test region(s), then pick the best. So something along the lines of:
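The snippet that followed here isn't preserved, but a minimal sketch of the idea (region names and set sizes are placeholders) could be:

```python
from itertools import combinations

all_regions = ["A", "B", "C", "D"]  # placeholder region names
n_regions = [1, 2]                  # sizes of test sets we want to consider

candidate_test_sets = [
    list(combo) for n in n_regions for combo in combinations(all_regions, n)
]
```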
That will give us a list which looks like this:
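With the placeholder regions above, that would look something like:

```python
[["A"], ["B"], ["C"], ["D"],
 ["A", "B"], ["A", "C"], ["A", "D"],
 ["B", "C"], ["B", "D"], ["C", "D"]]
```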
Let's continue with our approach:
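Continuing the sketch (still illustrative; `data` is the raw panel dataframe and `scoring_function` is whatever we settle on below):

```python
import pandas as pd

rows = []
for test_regions in candidate_test_sets:
    control_regions = [r for r in all_regions if r not in test_regions]
    score = scoring_function(data, test_regions, control_regions)  # hypothetical signature
    rows.append({"test_regions": test_regions, "score": score})

# One row per candidate set of test regions, best score first
results = pd.DataFrame(rows).sort_values("score", ascending=False)
```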
So the question is, what happens in `scoring_function`? Well, that depends on what we want to do, but it will be along the lines of:

Note: Running synthetic control with multiple treatment regions would benefit from having addressed #320 first.
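The details of the snippet that followed aren't preserved, but one plausible shape for it, going off the rest of the thread (fit a synthetic-control-style model to the candidate test regions on historical data and score how well the controls reconstruct them), might be:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def scoring_function(data, test_regions, control_regions):
    """Hypothetical scoring: how well do the control regions reconstruct the
    candidate test regions on historical (pre-test) data? Higher is better."""
    X = data[control_regions].to_numpy()
    y = data[test_regions].to_numpy().mean(axis=1)  # aggregate KPI of the test set

    # Hold out the last part of the series to mimic a post-test period
    split = int(0.8 * len(data))
    model = LinearRegression(positive=True).fit(X[:split], y[:split])
    predictions = model.predict(X[split:])

    # Smaller out-of-sample error -> better counterfactual -> smaller detectable effect
    rmse = np.sqrt(np.mean((y[split:] - predictions) ** 2))
    return -rmse
```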
All of this should be wrapped into a function (or class) called something like `FindBestTestRegions` with a signature of something like:

Input:
- `data`: the raw panel data
- `region_list`: list of all regions/geos
- `n_regions`: list of the numbers of regions we want to consider testing in, e.g. `[1, 2, 3]`
- `scoring_function`: either an actual function, or a string to dispatch to a specific function that determines the scoring method we are using.
- `model` and `experiment`: although these might be pre-defined and not necessary as inputs.

Output:

- `results`: a dataframe where each row represents a different combination of test regions. There will be a score column which is the result of the scoring function. This will be ordered from best to worst in terms of the scoring function.
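A possible signature that matches the inputs and outputs above (a sketch only; none of these names are settled API):

```python
import pandas as pd

def FindBestTestRegions(
    data: pd.DataFrame,
    region_list: list[str],
    n_regions: list[int],
    scoring_function,   # a callable, or a string naming a built-in scoring method
    model=None,         # optional, might be pre-defined and not needed as an input
    experiment=None,    # optional, might be pre-defined and not needed as an input
) -> pd.DataFrame:
    """Return a dataframe with one row per candidate combination of test regions
    and a 'score' column, sorted from best to worst."""
    ...
```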