Too many tailed files collected #3783

Open
pmoravec opened this issue Sep 24, 2024 · 9 comments

@pmoravec
Contributor

We noticed that certain files are frequently tailed (truncated to the size limit) across different sosreports. Below is a list of the most frequently tailed files, with my suggestion for each. Any comments or suggestions are welcome. The possible options are "leave as is", "increase sizelimit", or "drop the file, or drop some of its data to shrink it".

  • postgresql/var.lib.pgsql.data.log.postgresql-*.log : this is most probably from Satellite / foreman systems that log larger postgres queries. Probably worth increasing the sizelimit; I will raise a PR for it.
  • sar/sa*.xml : we collect these files for legacy reasons only (imho). I would vote for dropping them (until somebody needs them). If that isn't welcome, let's increase the sizelimit, since an incomplete/broken xml file is rather useless.
  • various var/log/* files, namely messages*, audit.log or secure - probably leave as is; maybe audit.log or secure should be collected for the past X days instead of up to a given file size?
  • pacemaker/var.log.pacemaker.pacemaker.log - any suggestions from pacemaker plugin authors @TurboTurtle, @nrwahl2?
  • pulpcore/core_task - we collect all details about the tasks. Since many of the details are now encrypted to prevent password leaks, much of the data is useless and I should improve the query. A TODO item for me.
  • crio/journalctl_--no-pager_--unit_crio - any suggestions from crio plugin authors @TurboTurtle, @vteratipally, @haircommander?
  • openshift/journalctl_--no-pager_--unit_kubelet - any suggestions from openshift plugin authors @TurboTurtle, @vwalek?
  • logs/journalctl_--no-pager - that is expected and reasonable, no action.
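
To illustrate why a tailed XML file is of little use, here is a minimal sketch. It is not how sos implements tailing; it assumes tailing simply keeps the last `sizelimit` bytes of a file, and the `<sysstat>`/`<record>` element names are made up for the demo.

```python
import xml.etree.ElementTree as ET


def tail_bytes(data: bytes, sizelimit: int) -> bytes:
    """Keep only the last `sizelimit` bytes (assumed tailing behaviour)."""
    return data if len(data) <= sizelimit else data[-sizelimit:]


def is_well_formed(data: bytes) -> bool:
    """Return True if `data` parses as a complete XML document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# Hypothetical XML payload standing in for a sar XML export.
records = "".join(f"<record n='{i}'/>" for i in range(1000))
doc = f"<sysstat>{records}</sysstat>".encode()

full_ok = is_well_formed(doc)                     # complete document parses
tailed_ok = is_well_formed(tail_bytes(doc, 200))  # root opening tag is cut off
```

A line-oriented log survives this kind of truncation with at most one damaged line, whereas the XML loses its opening root element and no longer parses as a whole.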
@jcastill
Member

* sar/sa*.xml : we collect these files for legacy reasons only (imho). I would vote for dropping them (until somebody needs them). If that isn't welcome, let's increase the sizelimit, since an incomplete/broken xml file is rather useless.

I'm not sure these files are needed at all, but instead of dropping we could add an option to collect them if needed, in case anyone relies on them for any scripts. "Interpreted/decoded" ones in plain text are more useful.

* various `var/log/*` files, namely `messages*` or `audit.log` or `secure` - probably let it be, maybe audits or secure should be collected for past X days instead of given filesize..?

Agreed, maybe two/three days should be enough by default, or even just one day.
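
A rough sketch of the "past X days instead of file size" idea, for discussion only. It assumes selection by file modification time of rotated logs; `recent_files` and the file names are hypothetical and not part of sos.

```python
import os
import tempfile
import time


def recent_files(paths, max_age_days, now=None):
    """Return the paths whose mtime falls within the last `max_age_days` days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return [p for p in paths if os.path.getmtime(p) >= cutoff]


# Demo with throwaway files standing in for rotated logs.
with tempfile.TemporaryDirectory() as tmp:
    now = time.time()
    ages = {"secure": 0, "secure-1": 1, "secure-2": 5}  # age in days
    paths = []
    for name, age in ages.items():
        path = os.path.join(tmp, name)
        with open(path, "w") as fh:
            fh.write("log line\n")
        # Backdate the file so it looks `age` days old.
        os.utime(path, (now - age * 86400, now - age * 86400))
        paths.append(path)
    selected = [os.path.basename(p) for p in recent_files(paths, 2, now=now)]
```

With a two-day window, only the current file and the one-day-old rotation are selected. A size limit would still be needed as a safety net for very chatty systems.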

* `logs/journalctl_--no-pager` - that is expected and reasonable, no action

Agreed

pmoravec added a commit to pmoravec/sos that referenced this issue Sep 24, 2024
These columns are either empty, containing passwords or some encoded
data.

Get the *remaining* column names and query for them.

If the query for column names fails, fall back to the current "SELECT *".

Relevant: sosreport#3783
Resolves: sosreport#3784

Signed-off-by: Pavel Moravec <pmoravec@redhat.com>
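
The approach described in the commit message can be sketched as follows. This is not the actual plugin code: `build_select` and the column names are hypothetical, and only the `core_task` table name comes from the thread.

```python
def build_select(table, all_columns, skip):
    """Build a SELECT that omits the `skip` columns.

    Falls back to "SELECT *" when the column list is unavailable
    (for example, the column-name query failed) or nothing remains.
    """
    if not all_columns:
        return f"SELECT * FROM {table}"
    keep = [col for col in all_columns if col not in skip]
    if not keep:
        return f"SELECT * FROM {table}"
    return f"SELECT {', '.join(keep)} FROM {table}"


# Hypothetical column names, for illustration only.
columns = ["id", "state", "secret_blob"]
trimmed = build_select("core_task", columns, skip={"secret_blob"})
fallback = build_select("core_task", None, skip={"secret_blob"})
```

Here `trimmed` selects only the remaining columns, while `fallback` shows the behaviour when the column names could not be retrieved.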
@pmoravec
Contributor Author

* `postgresql/var.lib.pgsql.data.log.postgresql-*.log` : this is most probably from Satellite / foreman systems with bigger postgres queries logged. Probably worth increasing the sizelimit, I will raise PR for it

This happens on Satellite / foreman, where we have already increased the sizelimit to 100MB via a preset, and I confirm it is applied to these files. Raising it further is possible, but not really worth it. Usually, what gets cut off is from previous days only, so the collected part is sufficient.

@haircommander
Contributor

From my perspective as a node team member, crio and kubelet logs are the most important pieces for us to debug issues. We don't need them separately if they're captured in the overall journal, though. Is bumping the size limit an option for those? Or we could bump the size limit for the overall journal and drop the crio/kubelet specific journals. What do folks think?

@TurboTurtle
Member

TurboTurtle commented Sep 24, 2024

I'd prefer increasing the size limit of unit-specific journals and/or log files over increasing the system journal collection. It gives us granularity without enforcing potentially very large system journal collections across the board. Granted, I get the point of "well it's going to be the majority of the system journal anyway...", but I think this is the least-bad option overall.

As far as the sar/sa files go, I'd defer to support teams on how often they're used. I know there's been a general shift away from sar but there's a lot of knowledge built around the use of these, at least the plaintext translations. I'd be open to dropping the binary collections since you need to use the same version to translate those as which generated them (hence why we do that during collection at all), but I'd be wary of dropping them entirely.

@jcastill
Member

The plain text ones are used a lot, even though they are not the most accurate output you could get... but as a first step when looking into performance issues, they are good enough.
I've searched internally and I haven't found any reference to the xmls or any tool that may use them, but "absence of proof..." . I don't remember using them for any support case.
I think there's an old tool, kSar, abandoned now, that used to read the xmls, but other than that nothing.

@nrwahl2
Contributor

nrwahl2 commented Sep 24, 2024

Pacemaker: It's been a couple of years since I've worked in support, so I would defer to any support engineers. Whether the limit is sufficient will always depend on how promptly the user opens a support ticket after an issue occurs, and on whether additional verbosity has been configured (it usually hasn't been).

We could increase the size limit to some arbitrary higher number. I don't know what fraction of sosreports have truncated Pacemaker log files currently and whether this would be worth doing.

Support engineers should not hesitate to request the full pacemaker.log file if the relevant timestamps are not present. Ideally, that should introduce only a small delay in investigation, though that depends on both the support team and the user.

@pafernanr
Contributor

pafernanr commented Sep 25, 2024

Hello all,

+1 to removing the sa*.xml files. They are redundant: the binary saXX files are also included, and they contain the full day's dump. Those are sometimes truncated too, but not usually; it can happen if the interval is too short.

I'd also like to suggest increasing the size limit for the foreman plugin. Its CSV files are sometimes truncated, which leads to missing important dynflow steps. Note that the plugin already limits the output to the last 14 days, which should be enough for any support case. That said, although I fully agree a limit is mandatory, in this specific plugin the file size limit is somewhat "redundant". IMO increasing it to 150/200M could be a good choice, so that the 14-day limit is what bounds the output in as many cases as possible.

@pmoravec
Contributor Author

SAR data: I would drop the xml as rarely-if-at-all used (I am asking internally, either way), while I would keep the binary data (the "source of truth" that we can copy to another system with same sysstat version and get whatever we want) and also text saXX files (concise enough text interpretation of the binary data).

Increasing the 100M limit of foreman's dynflow* tables: no strong opinion. @pafernanr, can you evaluate the impact? I.e. generate enough foreman tasks to have 200M of data in each such table, and compare execution time and tarball size for sizelimits of 100MB, 150MB and 200MB? On one hand, we would get some more task history. On the other hand, the data are already ordered by time, so the most recent entries are always present, and I am torn on whether it is worth paying the extra cost in collection time and tarball size to get that info. This sizelimit has only rarely affected my own investigations of foreman/Satellite support cases, hence my reluctance. But if others hit it more often, no objections.

@pmoravec
Contributor Author

SAR: Feedback from two groups of support engineers at Red Hat: "we don't use the XML format, but we heavily use the binary saXX and text sarXX formats". So I would vote for dropping the xml format (with a note in the release notes, so maybe it is worth waiting until after the 4.8.2 tag to mention it in the "more major" 4.9 RN?)

TurboTurtle pushed a commit that referenced this issue Sep 28, 2024
MichaelThamm pushed a commit to MichaelThamm/sos that referenced this issue Nov 19, 2024
filanov pushed a commit to filanov/doca-sosreport that referenced this issue Nov 24, 2024
filanov pushed a commit to NVIDIA/doca-sosreport that referenced this issue Nov 25, 2024