Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix timeout caused by nginx #4336

Merged
merged 3 commits into from
Nov 19, 2023

Conversation

aronasorman
Copy link
Collaborator

Summary

Description of the change(s) you made

Fixes https://github.com/learningequality/infrastructure/issues/452. Nginx is the source of the timeout. Extends this to 300 seconds (5 minutes).

First commit is to update and cleanup Sigil, the templating engine we use for nginx.

Docker is having trouble parsing complex env var substitution, so we move that logic to a full fledged script.
@aronasorman aronasorman changed the title Fix timeout Fix timeout caused by nginx Nov 10, 2023
Copy link
Member

@DXCanas DXCanas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely looks like some yakshaving for that one liner lol

Comment on lines +1 to +8
#!/bin/sh
export SIGIL_VERSION=0.10.1
export OS=`sh -c "uname -s | tr '[:upper:]' '[:lower:]'"`
export ARCH=`sh -c "uname -m | tr '[:upper:]' '[:lower:]' | sed 's/aarch64/arm64/'"`


curl -L "https://github.com/gliderlabs/sigil/releases/download/v${SIGIL_VERSION}/gliderlabs-sigil_${SIGIL_VERSION}_${OS}_${ARCH}.tgz" | tar -zxC /tmp
mv /tmp/gliderlabs-sigil-${ARCH} /tmp/sigil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a refactor in our future lol

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually didn't carry out the refactor we talked about, where we can simply dispense with Sigil. Should I do that in this PR @DXCanas ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah. Too big a load, and this should be fixed ASAP. We should also think about the various paths forward. Hence my asking @bjester for his thoughts on moving some of this to infra:
#4336 (comment)

Comment on lines +5 to +7
COPY /k8s/images/nginx/download_sigil.sh /tmp/download_sigil.sh
RUN chmod +x /tmp/download_sigil.sh
RUN /tmp/download_sigil.sh
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it to a script, since complex env vas substitution doesn't work in a Dockerfile.

@@ -82,6 +82,7 @@ http {
proxy_pass http://studio;
proxy_redirect off;
proxy_set_header Host $host;
proxy_read_timeout 300s;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what was the reasoning in choosing 300s?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh, I'd be surprised if the reasoning was anything deeper than "add a 0 at the end"/"it's longer than 30s" and it made for easy testing.

Are you thinking we should have it match the Cloudflare at 100s? Or does the Studio API have its own value we want to match?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I mean if Cloudflare is less than this at 100s, I'm not sure there's any use in having it longer? Just means more resources are tied up longer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I've pushed a commit limiting it at 100s. let me know if it looks good and I'll squash the fixup commit.

@DXCanas
Copy link
Member

DXCanas commented Nov 13, 2023

@bjester this looks alright to me, but I think it'd be a good idea for you to be the one to review.

Would you have any objections to us bringing more of this stuff into the infra repo rather than studio?

RUN curl -L "https://github.com/gliderlabs/sigil/releases/download/v0.4.0/sigil_0.4.0_$(uname -sm|tr \ _).tgz" \ | tar -zxC /usr/bin
COPY /k8s/images/nginx/download_sigil.sh /tmp/download_sigil.sh
RUN chmod +x /tmp/download_sigil.sh
RUN /tmp/download_sigil.sh

FROM nginx:1.11
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to not update nginx while we're at it? I think 1.11 was EOL in 2017...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't want to speak for Aron, but just to hold you over till he's out: I think he mentioned that he was running into some build issues after an upgrade? But I could be conflating nginx and the sigil work during our verbal updates.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can upgrade the version number, but I think it's best to do that in another PR. Keep the PRs small!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the Sigil issue @DXCanas

@bjester
Copy link
Member

bjester commented Nov 14, 2023

@bjester this looks alright to me, but I think it'd be a good idea for you to be the one to review.

Would you have any objections to us bringing more of this stuff into the infra repo rather than studio?

If our plan is to remove nginx at some point, then I don't think it matters. If we don't remove it, I think it is helpful in this repo for replicating production, if we actually had the dev setup to do so (something which I moved us towards in an unmerged PR IIRC). For example, when BCK bugs come up, there's some tension regarding whether the bug is BCK or Kolibri related, which we don't have a good developer setup to help us navigate that through. I hope and envision that for Studio though.

@DXCanas
Copy link
Member

DXCanas commented Nov 14, 2023

when BCK bugs come up, there's some tension regarding whether the bug is BCK or Kolibri related

For Kolibri specifically, this was intentional — at the time of design, everything lived in Kolibri. But this was getting increasingly bloated and difficult to manage. Any little infra-ish fix would necessitate an entirely new tagged Kolibri release, and going through the Kolibri testing just to get an image to deploy. With a Kolibri PR that required a Kolibri dev to review code they didn't understand. Only for the fix not to work and for us to start the process over again.

Having things in the infra repo allows for rapid iteration. And, because there's such a deep integration with values from Kubernetes, postgres, etc, a lot of the code is templated and populated using infra tooling, like Terraform. This is actually the reason Aron had to add Sigil here — he had no templating library to work with.

Processes have changed, and I see your point. But having that permeable membrane between the 2 teams has its benefits. KDP having its production docker image in the KDP repo can get cumbersome as well — @anguyen1234 is starting to see this for herself as she iterates on the Frontend/Worker division. Some things about the KDP image don't match up with industry best practices. But we don't want to go in and break the dev workflow by iterating on it there. I have a hard time even adding a requirement to requirements.txt because I have to go through requirements.in which just straight up won't compile on my machine. Gets yakshavey really quick.

I think do think there's a middle ground here. But that we should be careful not to optimize too much for the workflow of either an app dev or an infra dev. Might be a good talking point for a dev meeting one of these days?

@bjester
Copy link
Member

bjester commented Nov 14, 2023

@DXCanas I think the point is that there is pain on both sides. You bring up the infra pain, but don't forget there's pain on our end too trying to debug production issues. It isn't one-sided

@DXCanas
Copy link
Member

DXCanas commented Nov 14, 2023

@bjester totally agree. I just think it's important we get input from both sides as we come up with a solution.

I think do think there's a middle ground here. But that we should be careful not to optimize too much for the workflow of either an app dev or an infra dev

@aronasorman aronasorman merged commit 35321d9 into learningequality:unstable Nov 19, 2023
9 checks passed
@aronasorman aronasorman deleted the fix-timeout branch November 19, 2023 18:49
@bjester
Copy link
Member

bjester commented Nov 20, 2023

@aronasorman Looks like the Cloud Build failed:

mv: can't rename '/tmp/gliderlabs-sigil-x86_64': No such file or directory
The command '/bin/sh -c /tmp/download_sigil.sh' returned a non-zero code: 1

@aronasorman
Copy link
Collaborator Author

aronasorman commented Nov 20, 2023 via email

@jredrejo
Copy link
Member

when BCK bugs come up, there's some tension regarding whether the bug is BCK or Kolibri related

For Kolibri specifically, this was intentional — at the time of design, everything lived in Kolibri. But this was getting increasingly bloated and difficult to manage. Any little infra-ish fix would necessitate an entirely new tagged Kolibri release, and going through the Kolibri testing just to get an image to deploy. With a Kolibri PR that required a Kolibri dev to review code they didn't understand. Only for the fix not to work and for us to start the process over again.

Having things in the infra repo allows for rapid iteration. And, because there's such a deep integration with values from Kubernetes, postgres, etc, a lot of the code is templated and populated using infra tooling, like Terraform. This is actually the reason Aron had to add Sigil here — he had no templating library to work with.

Processes have changed, and I see your point. But having that permeable membrane between the 2 teams has its benefits. KDP having its production docker image in the KDP repo can get cumbersome as well — @anguyen1234 is starting to see this for herself as she iterates on the Frontend/Worker division. Some things about the KDP image don't match up with industry best practices. But we don't want to go in and break the dev workflow by iterating on it there. I have a hard time even adding a requirement to requirements.txt because I have to go through requirements.in which just straight up won't compile on my machine. Gets yakshavey really quick.

I think do think there's a middle ground here. But that we should be careful not to optimize too much for the workflow of either an app dev or an infra dev. Might be a good talking point for a dev meeting one of these days?

As I mentioned in https://github.com/learningequality/kolibri-data-portal/pull/335#issuecomment-1828500394 I don't see the point in maintaining a Dockerfile in the KDP repo with a cloud deployment target. Dockerfile in KDP repo, if needed, only makes sense for local developers installation, from my particular point of view. Infra must do production deployments based on their own needs/limitations/architecture decissions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants