
NFS: Stale file handle - after log cycling occurs. #505

Open
darkl0rd opened this issue Apr 22, 2021 · 21 comments · Fixed by #514
Labels
bug (This is considered a bug and shall get fixed) · help wanted · mtail-Log Tailing (Issues related to log polling and tailing)

Comments

@darkl0rd

I'm running mtail as a separate Docker container, mounting an NFS (EFS) volume to which another application writes its log files. At some point the mtail containers went to 100% CPU; closer investigation shows that the mtail logs are rapidly filling up with:

I0422 19:46:12.453819      13 filestream.go:116] read /logs/app.log: stale NFS file handle
I0422 19:46:12.453841      13 filestream.go:116] read /logs/app.log: stale NFS file handle
I0422 19:46:12.453872      13 filestream.go:116] read /logs/app.log: stale NFS file handle
...
...

That explains the 100% CPU at least, I guess. However, I found out that I can trigger this behaviour by simply redeploying the application that writes the logs (on deployment, app.log is renamed to app.log.1 and a fresh app.log is written out).

I understand that mtail has some heuristics to follow files after rotation. Is there a particular reason why this fails when the files are served from NFS?

@jaqx0r
Contributor

jaqx0r commented Apr 23, 2021

This is weird. Can you share a simple reproduction, like the docker command line and any Dockerfile if necessary? Or at least examples, so I can try to set up a volume mount and a fake log rotation to reproduce.
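
For illustration only, a fake rotation along the lines described above could be simulated with a small standalone writer like this sketch (not part of this thread; the /logs/app.log path and the rotation interval are just placeholders):

// rotatesim.go: append lines to a log file and periodically "redeploy" by
// renaming it to .1, mimicking the rotation described in the report.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	const logPath = "/logs/app.log" // placeholder path on the NFS/EFS bind mount

	for i := 0; ; i++ {
		f, err := os.OpenFile(logPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
		if err != nil {
			panic(err)
		}
		fmt.Fprintf(f, "log line %d\n", i)
		f.Close()

		time.Sleep(time.Second)

		// Every 30 lines, simulate a redeploy: app.log -> app.log.1, and a
		// fresh app.log is created on the next iteration.
		if i%30 == 29 {
			if err := os.Rename(logPath, logPath+".1"); err != nil {
				panic(err)
			}
		}
	}
}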

@darkl0rd
Author

Contrary to my original post, which I wrote when I had just discovered the problem, I've found that it actually happens at random, not on container restarts / log rotation. My current guess is that it has more to do with EFS temporarily becoming unavailable; the reason I assume this is that it happens for "groups" of applications at once.
It's very strange though, because the application writing the logs (and many other applications I run that use EFS) doesn't experience the "Stale Handle" problem (or maybe it does, but recovers nicely and doesn't log a thing).

Anyway - it's hard to share a fully working reproduction case, since I can't actually reproduce it at will. To give an idea: I have about 30 pairs like this running, and this happens consistently at least once a day for each of them.

The gist of it:

AWS EFS, mounted with "defaults":

my-efs:/ on /some/path type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,local_lock=none,addr=x.x.x.x,_netdev)

Containers

version: "3.8"

services:
  my-app:
    image: "some-app:some-tag"
    volumes: 
      - type: "bind"
        source: "/some/path/application-instance"
        target: "/my/application/logs"
    networks:
      - prometheus
  mtail:
    image: "my-mtail-image:my-tag"
    volumes:
      - type: "bind"
        source: "/some/path/application-instance"
        target: "/logs"
    networks:
      - prometheus
    command: mtail -progs /etc/mtail/progs -logtostderr -logs /logs/my-app.log

I'll be happy to provide any additional debugging information or otherwise.

@jaqx0r
Contributor

jaqx0r commented Apr 28, 2021

I think mtail should handle the stale file handle error and attempt to reopen the file. Thanks for the docker-compose, that should help!
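
For illustration only, a minimal sketch of that idea (a hypothetical helper, not mtail's actual filestream code): check the read error for syscall.ESTALE and reopen the file by path, which forces a fresh lookup and a new handle.

package tailer // illustrative package name, not from mtail

import (
	"errors"
	"log"
	"os"
	"syscall"
)

// readRetryingStale reads from f; if the NFS handle has gone stale, it
// closes the old handle, reopens the file by path and retries the read once.
func readRetryingStale(f *os.File, path string, buf []byte) (*os.File, int, error) {
	n, err := f.Read(buf)
	if err == nil || !errors.Is(err, syscall.ESTALE) {
		return f, n, err
	}
	log.Printf("stale NFS file handle on %s, reopening", path)
	f.Close()
	nf, err := os.Open(path) // fresh lookup by path gets a new handle
	if err != nil {
		return nil, 0, err
	}
	n, err = nf.Read(buf)
	return nf, n, err
}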

Is your my-mtail-image image built from this repo's Dockerfile?

@darkl0rd
Author

No, it uses my own base image (ubuntu:18.04), adds mtail to /usr/local/bin, copies my progs and starts mtail; nothing fancy, no weird options passed to mtail or anything.

I think so too; since a kill -HUP of the mtail process resolves the problem, it should be a matter of handling the ESTALE error. I'd still be curious to find out why the file handle supposedly goes "stale", though.

@jaqx0r
Contributor

jaqx0r commented Apr 28, 2021

I was reading this, and it seems the common solution is to remount the NFS volume in the client. https://www.cyberciti.biz/tips/nfs-stale-file-handle-error-and-solution.html

Thus, I'm not sure mtail can do anything other than close the file handle so that the volume is unused, and then let some other process remount the volume.

But I wonder what happens with the HUP? That signal should only cause the programmes to be reloaded, not the log files! Can you increase the logging in the tailer with the --vmodule=filestream=2,tail=2 flags, redo the HUP, and send me some logs?

Thanks!

@darkl0rd
Author

darkl0rd commented Apr 29, 2021

I read that too - but in my case the client is the host, as these are bind mounts.
There are 2 other containers (minio, nginx) also reading from the same directory and accessing the same file (at times) that do not suffer from the problem. This may well be because these containers open the file only when it's needed (effectively a new file handle) rather than keeping it open. That said, the application writing to the file keeps it open continuously and never suffers from the problem.
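
To illustrate the difference in access pattern (hypothetical code, not taken from any of these projects), only the long-lived handle can ever go stale:

package access // illustrative only

import "os"

// Open-on-access (roughly the nginx/minio pattern): every request does a
// fresh path lookup, so a handle that went stale server-side is never reused.
func readWhole(path string) ([]byte, error) {
	return os.ReadFile(path)
}

// Continuous tail (roughly the mtail/promtail pattern): the same *os.File is
// held open for hours; once its NFS handle goes stale, every Read returns
// ESTALE until the file is reopened by path.
func tailNext(f *os.File, buf []byte) (int, error) {
	return f.Read(buf)
}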

Also worth noting - I'm still pinpointing the exact timings, but -all- my mtail containers suffer from this problem within roughly the same time frame. It mostly seems to start around 21:00/22:00 and at 10:00. I haven't been able to pin it on any cron jobs or the like.

I'll bump the logging and wait for the next occurrence - it happened at 08:00 CEST this morning, and sadly sending a 'HUP' signal turned out to have no effect after all.
I have verified on the host that the mount is not stale, and that the other two containers can still access the file. No log rotation has taken place. Sadly the affected container does not have its logging bumped, so we need to give it a few more hours to run into the same problem again.

@darkl0rd
Author

darkl0rd commented May 4, 2021

Over the past few days I have found out that it always correlates with the following event in syslog:

2021-05-04T17:50:38.131712+00:00 xxx systemd-networkd[5250]: eth0: Configured
2021-05-04T17:50:38.146945+00:00 xxx systemd[1]: Started ntp-systemd-netif.service.

The other containers running on the same node continue to access the file just fine. For reference, I am also running grafana/promtail next to it, tailing the same log; it is also unaffected.

@jaqx0r
Contributor

jaqx0r commented May 5, 2021 via email

@darkl0rd
Author

darkl0rd commented May 5, 2021

All containers have a bind mount from the host (which has the EFS filesystem mounted), which looks like:

  ... 
  volumes:
    - /host/logs:/logs:ro
  ...

I added promtail 2 days back, for testing purposes, because of this issue I'm experiencing with mtail. This way I want to determine whether it is mtail or the EFS mount. So far I have not experienced any issues with promtail.

Minio: serves the files on access, so this is not a permanently open file handle
NGiNX: likewise

mtail: continuous tail of the file
promtail: continuous tail of the file (added to have a scenario more comparable to mtail than nginx/minio)

I am still trying to find out where the systemd messages are coming from. Manually restarting "systemd-networkd" does not result in the same behavior.

@jaqx0r
Contributor

jaqx0r commented May 5, 2021 via email

@darkl0rd
Author

darkl0rd commented May 5, 2021

Not re-exported; the EFS mount from the host is mapped into the container (a bind mount). (Re-)exporting (NFS terminology) would imply that the host exports the filesystem again, which it doesn't.

I'm thinking along the same lines. At this point I think we can safely say that it is definitely related to how mtail deals (or doesn't deal) with stale handles.

@jaqx0r
Contributor

jaqx0r commented May 6, 2021 via email

@darkl0rd
Author

darkl0rd commented May 6, 2021

There's no indication that it's actually being remounted. An explicit remount (mount -o remount ..) does not trigger the behavior either. The only pointer so far is the systemd-networkd log message, though forcefully restarting systemd-networkd does not trigger the behavior.

@jaqx0r
Contributor

jaqx0r commented May 6, 2021 via email

@jaqx0r
Contributor

jaqx0r commented May 6, 2021 via email

@jaqx0r
Contributor

jaqx0r commented May 6, 2021 via email

@jaqx0r
Contributor

jaqx0r commented May 7, 2021

Please build from HEAD and try out the new change. I don't think this is fixed yet, as I haven't tested this code; I would really like to be able to reproduce this before I call it done.

If it works, you should get a log message indicating that we saw ESTALE and attempted to reopen the log file.

@jaqx0r jaqx0r reopened this May 7, 2021
@jaqx0r jaqx0r added the bug label May 7, 2021
@darkl0rd
Author

darkl0rd commented May 8, 2021

Just to let you know that this issue didn't go stale ;-) - I have just built and deployed HEAD and will report back in 24h.

Do I need to bump the log level for that log message, or is it always printed?

@jaqx0r
Contributor

jaqx0r commented May 8, 2021 via email

@darkl0rd
Author

darkl0rd commented May 9, 2021

Sadly there is no additional logging; the output is the same as before:

...
I0509 19:11:19.868818      13 filestream.go:131] read /logs/app.log: stale NFS file handle
I0509 19:11:19.869217      13 filestream.go:131] read /logs/app.log: stale NFS file handle
I0509 19:11:19.869311      13 filestream.go:131] read /logs/app.log: stale NFS file handle
...
# mtail -version
mtail version v3.0.0-rc45-57-g8f0b94fa-dirty git revision 8f0b94fad7f6c8359e31a01d7ee72e8312d915a3 go version go1.16.4 go arch amd64 go os linux

@jaqx0r
Contributor

jaqx0r commented May 9, 2021 via email

@jaqx0r jaqx0r added the mtail-Log Tailing label Jun 20, 2021