Restarting the appliance leaves old podman tmpfs directories causing failures on next boot #116

Open
agrare opened this issue Sep 10, 2024 · 17 comments
Assignees: agrare
Labels: bug (Something isn't working), stale

Comments

@agrare
Member

agrare commented Sep 10, 2024

After restarting the appliance, subsequent podman runs fail with the following error:
Error: current system boot ID differs from cached boot ID; an unhandled reboot has occurred. Please delete directories "/tmp/storage-run-988/containers" and "/tmp/storage-run-988/libpod/tmp" and re-run Podman

The issue is that podman expects its temp directories to be on a tmpfs filesystem, and thus cleared on boot; however, the ManageIQ appliance has an XFS-formatted logical volume for /tmp, so it persists across reboots.
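
For reference, the layout can be confirmed and the failure recovered from manually (a sketch; the storage-run suffix is the UID of the user running podman, 988 in this case):

# Check what backs /tmp -- on the appliance this shows an XFS logical volume, not tmpfs
findmnt -o TARGET,SOURCE,FSTYPE /tmp

# One-off recovery: remove the stale directories named in the error, then re-run podman
rm -rf /tmp/storage-run-988/containers /tmp/storage-run-988/libpod/tmp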

@agrare agrare added the bug Something isn't working label Sep 10, 2024
@agrare agrare self-assigned this Sep 10, 2024
@Fryguy
Member

Fryguy commented Sep 10, 2024

Is there a reason we don't use tmpfs? Maybe @bdunne or @jrafanie know.

@agrare
Member Author

agrare commented Sep 10, 2024

Yeah, that was a surprise to me as well. According to containers/podman#23193 (comment), podman expects its temp dirs to be on tmpfs.

@jrafanie
Member

jrafanie commented Sep 10, 2024

I assume it was never changed. For debugging, some stuff would be retained in /tmp, so it might be helpful not to lose it on reboot.

@Fryguy
Member

Fryguy commented Sep 10, 2024

Interesting - I can't think of what we would put diagnostically in /tmp, except maybe ansible runner things, and those are cleaned up anyway.

@jrafanie
Member

jrafanie commented Sep 10, 2024

Interesting - I can't think of what we would put diagnostically in /tmp, except maybe ansible runner things, and those are cleaned up anyway.

Right, I think that's the question. If it uses the system /tmp by default, things will use it. Ansible was indeed one of the things I was thinking of, but even automate. I'm not sure if anything on the system uses it, as most things should use the journal. Does kickstart use it for storing the kickstart or whatever it's applying? I vaguely recall that might be the home directory. I'm not sure if that changed after we stopped running as root. cloud-init? I'm not sure. It should be using the journal, but I'm not sure if any temporary stuff useful for diagnostics was kept there.

I'm not saying we can't make it tmpfs, only that this was probably the reason we wanted it persisted.

@agrare
Member Author

agrare commented Sep 10, 2024

It appears that an LV-backed /tmp was added here: ManageIQ/manageiq-appliance-build#51

I'm not sure why STIG recommends a persistent /tmp, but if that is no longer a requirement, then it looks like we'd be safe to delete it.

Honestly, a 1GB-partition-backed /tmp isn't guaranteed to be "more space" than a RAM-backed tmpfs anyway, given the amount of memory we recommend, so I don't think this is a "usage" argument. Also, nothing (well designed) should ever depend on /tmp persisting across reboots, and I highly doubt ansible depends on that, but we can always check.

Anything an ansible-runner execution leaves behind when a reboot interrupts it should be safe to clear on boot, since the execution was certainly cut short anyway.
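
If size ever did become a concern, a tmpfs /tmp can be capped explicitly anyway; a sketch of an /etc/fstab entry (the 1g value is only illustrative, chosen to match the current LV):

# /etc/fstab -- mount /tmp as tmpfs; size=1g is illustrative, matching the old 1GB LV
tmpfs  /tmp  tmpfs  defaults,size=1g,nosuid,nodev  0 0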

@jrafanie
Member

I'm not sure why STIG recommends a persistent /tmp, but if that is no longer a requirement, then it looks like we'd be safe to delete it.

It looks like STIG wanted separate filesystems, not specifically persistent ones.

@agrare
Member Author

agrare commented Sep 10, 2024

It looks like STIG wanted separate filesystems, not specifically persistent ones.

👍 Now that I 100% agree with, but we would have already had that (I hope).

@jrafanie
Member

Also, nothing (well designed) should ever depend on /tmp persisting across reboots, and I highly doubt ansible depends on that, but we can always check.

Anything an ansible-runner execution leaves behind when a reboot interrupts it should be safe to clear on boot, since the execution was certainly cut short anyway.

Right. I don't recall anything depending on persistent data in /tmp, only that the diagnostics were there for review if there was a problem. Again, it's been so long, and I have a strong feeling the only thing we'd ever lose is diagnostics. I think we've moved a lot of the diagnostics to actual logging through the journal, so this is less of an issue than it was even at the time of that PR.

@jrafanie
Member

It looks like STIG wanted separate filesystems, not specifically persistent ones.

👍 Now that I 100% agree with, but we would have already had that (I hope).

Before that PR, logs, tmp, home, etc. were all on the / volume. He created separate filesystems for each. I think we're agreeing, but I want to make it clear that it was really about splitting up the different types of data onto different volumes with different filesystems, and less about the implementation of the volume.

@agrare
Member Author

agrare commented Sep 10, 2024

Before that PR, logs, tmp, home, etc. were all on the / volume.

Oh interesting, thanks. Yeah, I didn't realize /tmp was just on /...

@Fryguy
Member

Fryguy commented Sep 10, 2024

Also, nothing (well designed) should ever depend on /tmp persisting across reboots, and I highly doubt ansible depends on that, but we can always check.

We temporarily check out git repos to /tmp and then run ansible-runner in them, so that's not ansible doing it. For diagnostics, you can set an ENV var to tell our Ansible::Runner code not to clean up after itself so you can see what's in there, but even in that case I'd want tmpfs so it goes away and gets cleaned up on reboot.

@Fryguy
Member

Fryguy commented Sep 10, 2024

I'm 100% for /tmp being a separate volume that is tmpfs - and based on the discussion here I'm pretty sure that satisfies STIG as well.
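
One way to get there, assuming the appliance's base OS ships the stock systemd tmp.mount unit (a sketch, not verified against the appliance build; an alternative to an explicit fstab entry):

# Drop or comment out the existing /tmp LV entry in /etc/fstab first, then:
systemctl unmask tmp.mount        # some releases mask it by default
systemctl enable --now tmp.mount
findmnt /tmp                      # should now report tmpfs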

@Fryguy
Member

Fryguy commented Sep 10, 2024

Strangely, I can't find anything STIG-related in our code base. I was pretty sure we had something in the appliance console, but maybe that was CloudForms-specific? I also don't see anything STIG-related in the Appliance Hardening Guide.

That being said, I also recall that we tried to default as much as we could to the STIG and SCAP rules so that we could minimize the changes customers needed to make themselves.

@agrare
Member Author

agrare commented Sep 13, 2024

Interestingly, there is a podman-clean-transient.service oneshot service, but it doesn't clean up anything in /tmp; it assumes those directories are automatically cleared.

[root@manageiq ~]# runuser --login manageiq --command 'podman system prune --external'
Error: current system boot ID differs from cached boot ID; an unhandled reboot has occurred. Please delete directories "/tmp/storage-run-986/containers" and "/tmp/storage-run-986/libpod/tmp" and re-run Podman
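
If /tmp has to stay persistent, a boot-time cleanup via systemd-tmpfiles could be another workaround; a sketch (the drop-in file name and glob are illustrative):

# /etc/tmpfiles.d/podman-stale-tmp.conf (file name illustrative)
# 'R' removes the path recursively; the trailing '!' restricts it to boot time,
# clearing podman's per-user temp dirs even though /tmp itself persists.
R! /tmp/storage-run-*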

@miq-bot
Member

miq-bot commented Dec 16, 2024

This issue has been automatically marked as stale because it has not been updated for at least 3 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.
