Workaround the flaky downgrade robustness test due to WAL records missing in all members #19147

Open
ahrtr opened this issue Jan 8, 2025 · 0 comments
Assignees
Labels
area/robustness-testing priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. type/feature

ahrtr commented Jan 8, 2025

What would you like to be added?

Background

Currently, the robustness test uses WAL records to rebuild the real history of the etcdserver, and checks correctness by verifying that this history matches what the client side received.

This requires that at least one member has complete WAL records.
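
As an illustration only, the check described above can be sketched as follows. The `Entry` type and the `missingWrites` helper are hypothetical simplifications made up for this example; they are not the actual robustness test code, which works on real WAL records and a richer client report.

```go
package main

import "fmt"

// Entry is a hypothetical, simplified stand-in for one persisted WAL record.
type Entry struct {
	Index uint64
	Data  string
}

// missingWrites returns the client-acknowledged writes that do not appear
// in the history rebuilt from the WAL. A non-empty result means the
// persisted history is incomplete and cannot be used for validation.
func missingWrites(persisted []Entry, acked []string) []string {
	seen := make(map[string]bool, len(persisted))
	for _, e := range persisted {
		seen[e.Data] = true
	}
	var missing []string
	for _, w := range acked {
		if !seen[w] {
			missing = append(missing, w)
		}
	}
	return missing
}

func main() {
	// Assumed sample data: the WAL lost the last write that the client saw succeed.
	persisted := []Entry{{1, "put a=1"}, {2, "put b=2"}}
	acked := []string{"put a=1", "put b=2", "put c=3"}
	fmt.Println("writes missing from WAL:", missingWrites(persisted, acked))
}
```

If every member's WAL is missing some acknowledged write, no member can supply the complete history, which is the situation this issue is about.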

Issue

Based on the discussion in #19095 and #19038, a member might fail to flush some WAL records to disk when it is stopped (and restarted later). This causes the error "failed to read WAL, cannot be repaired, err: wal: slice bounds out of range"; refer to #19038 (comment). That issue was fixed in #19095.

This usually isn't a problem, because the robustness test usually stops or kills only one member, and a single member missing WAL records still leaves other members with complete records.

  • Note that there is NO issue from the user's perspective. A member missing some WAL records isn't a problem, because it can get a snapshot from the leader when it starts again if it lags far behind the leader.

But the downgrade test needs to stop and restart all the members one by one, so each member may be missing some WAL records. It might then be impossible to read the complete WAL records from any member, which causes the error "last succesful client write .... was not persisted, required to validate"; refer to #19095 (comment).

  • Again, there is NO issue from the user's perspective.

Proposed solution

Based on my previous tests, the issue (WAL records failing to be flushed to disk) usually only happens in high-traffic scenarios, so one workaround is to generate only very low traffic when running the downgrade case in the robustness test.

Also, the robustness test currently reads the longest WAL records, but the longest one may not be the correct one. We should ensure that at least a majority of members have the same longest WAL records. Refer to #19095 (comment).
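
One possible reading of the majority rule is sketched below. It assumes each member's persisted log is a contiguous run of entries starting at the same index, possibly with a truncated or divergent tail. The `Entry` type and the `majorityHistory` function are hypothetical names made up for this example, not the existing robustness test API.

```go
package main

import "fmt"

// Entry is a hypothetical, simplified stand-in for one persisted WAL record.
type Entry struct {
	Index uint64
	Data  string
}

// majorityHistory returns the longest history on which a quorum of members
// agree, instead of blindly trusting the single longest log.
func majorityHistory(members [][]Entry) []Entry {
	quorum := len(members)/2 + 1
	// agree[m] tracks whether member m still matches the history chosen so far.
	agree := make([]bool, len(members))
	for m := range agree {
		agree[m] = true
	}
	var history []Entry
	for i := 0; ; i++ {
		// Count votes for the entry at position i among agreeing members.
		votes := make(map[Entry]int)
		for m, log := range members {
			if agree[m] && i < len(log) {
				votes[log[i]]++
			}
		}
		// At most one entry can reach quorum, so map order does not matter.
		var winner Entry
		found := false
		for e, n := range votes {
			if n >= quorum {
				winner, found = e, true
			}
		}
		if !found {
			return history
		}
		// Members that lack this entry or disagree stop voting.
		for m, log := range members {
			if agree[m] && (i >= len(log) || log[i] != winner) {
				agree[m] = false
			}
		}
		history = append(history, winner)
	}
}

func main() {
	// Assumed sample data: member 0 has the longest log, but its last
	// entry is not confirmed by any other member.
	members := [][]Entry{
		{{1, "a"}, {2, "b"}, {3, "c"}, {4, "x"}},
		{{1, "a"}, {2, "b"}, {3, "c"}},
		{{1, "a"}, {2, "b"}},
	}
	fmt.Println("entries agreed by a majority:", len(majorityHistory(members)))
}
```

In the sample data the longest log has four entries, but only the first three are confirmed by a quorum, so the sketch stops there.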

cc @siyuanfoundation @serathius

Related discussion

This is for other contributors' reference. The goal of the robustness test is to verify the correctness of etcd, so ideally it should NOT depend on any data (including WAL files) generated by etcd; it should treat etcd entirely as a black box.

But it's hard and very costly for the robustness test to explore all the exponentially many possibilities (when a client gets a failure response, the request may have failed or succeeded on the server side; when there are multiple failed client requests, the number of possibilities increases exponentially). So a practical approach is to use WAL records to build the real history from the server side.
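
To make the growth concrete, here is a minimal sketch that enumerates the candidate server-side outcomes for a set of failed client requests, assuming each one independently either was or was not applied. The `candidateHistories` function is hypothetical and ignores ordering between requests, which would only add more cases.

```go
package main

import "fmt"

// candidateHistories enumerates every subset of ambiguous (failed from the
// client's point of view) requests that the server may actually have applied.
func candidateHistories(ambiguous []string) [][]string {
	total := 1 << len(ambiguous) // 2^n subsets
	out := make([][]string, 0, total)
	for mask := 0; mask < total; mask++ {
		var applied []string
		for i, req := range ambiguous {
			if mask&(1<<i) != 0 {
				applied = append(applied, req)
			}
		}
		out = append(out, applied)
	}
	return out
}

func main() {
	// Assumed sample data: three requests whose outcome the client cannot know.
	failed := []string{"put a=1", "put b=2", "delete c"}
	for n := 1; n <= len(failed); n++ {
		fmt.Printf("%d failed requests -> %d candidate histories\n",
			n, len(candidateHistories(failed[:n])))
	}
}
```

A black-box checker would have to consider every one of these candidates, whereas the WAL tells the test directly which requests were persisted.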

Why is this needed?

To work around the flakiness of the downgrade robustness test.
