Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate the use of package context manipulation #42

Merged
merged 7 commits into from
Jun 4, 2023

Conversation

johnbelamaric
Copy link
Member

@johnbelamaric johnbelamaric commented Jun 1, 2023

This removes all use of the package variant "package context" manipulation feature. This feature was ill-conceived and causes some issues.

Instead, this uses a new in-cluster resource called WorkloadCluster, which right now contains only the clusterName, but in the future may have a few other fields (like CIDRs perhaps). This resource lives in the "controlling" or "parent" package nephio-workload-cluster. The name we use when we clone that package becomes the name of the workload cluster, and it is used as the basis for the names of all the 4 or 5 "secondary" packages needed to realize the cluster. By putting this name in a real object rather than setting it in the packageContext, we avoid the "flapping" or "thrashing" that happens due to issues with package context described in kptdev/kpt#3973.

The apply-replacements function is used in nephio-workload-cluster to propagate the name from package context to everywhere, including this new resource (note, "name" is an appropriate, intended use of package context). This resource then gets put into the mgmt cluster along with the PV resource themselves, which have been configured to inject this WorkloadCluster resource into the packages they create.

Note that this is not the same as ClusterContext, which has more detailed information about the CNI config of the cluster, though this resource - which is in input to the capi-cluster-kind package - can be used to help construct that.
Hopefully, this:

Partially fixes nephio-project/nephio#248
Partially fixes nephio-project/nephio#249

Still testing!

/hold

@nephio-prow nephio-prow bot requested review from henderiw and tliron June 1, 2023 01:11
@johnbelamaric
Copy link
Member Author

Noticed this while testing: kptdev/kpt#3979

@johnbelamaric
Copy link
Member Author

Found this too: kptdev/kpt#3980 - but not sure if it's a bug or not.

@johnbelamaric
Copy link
Member Author

cc @henderiw

@henderiw
Copy link
Contributor

henderiw commented Jun 1, 2023

Why are we having an injection in the cluster-pv and not in the other pv(s)?

@johnbelamaric
Copy link
Member Author

Accidentally...will fix

@henderiw
Copy link
Contributor

henderiw commented Jun 1, 2023

Do we need the WorkloadCluster also in the downstream packages to start with ?

@johnbelamaric
Copy link
Member Author

Do we need the WorkloadCluster also in the downstream packages to start with ?

We need it in those that vary based on cluster name: cluster-capi-kind, rootsync, repository. I added it there. repository is a little special because we also use it without injection.

Testing now. I don't think it fixes all the problems, but it fixes some. The "flapping" problem looks to be a deeper issue. My suspicion right now is that that is either with the way optimistic concurrency is being used in the Porch client, or within the Porch server itself (or some combination of the too).

Fix annotation name to check for injection
Allow rootsync to be used in non-injected cases.
@johnbelamaric
Copy link
Member Author

/hold cancel

Ok, so, this reduces the noise but doesn't fix all the flakiness. Another step towards fixing flakiness will be nephio-project/nephio#251. I will probably bump the default auto-approve delay up to 2m in the PR for that too (though once that is fixed I am not sure we need the delay).

@johnbelamaric
Copy link
Member Author

/hold cancel

@johnbelamaric
Copy link
Member Author

If you want to try this out:

  1. You can build a sandbox using my branch:
    gcloud compute instances create --machine-type e2-standard-8 \
                                   --boot-disk-size 200GB \
                                   --image-family=ubuntu-2204-lts \
                                   --image-project=ubuntu-os-cloud \
                                   --metadata=startup-script-url=https://raw.githubusercontent.com/johnbelamaric/nephio-test-infra/no-pkg-ctx/e2e/provision/gce_init.sh \
                                   nephio-r1-e2e
    
  2. Manually register my repo (https://github.com/johnbelamaric/nephio-example-packages.git) using the UI (authentication "None", read-only, external blueprints).
  3. Clone nephio-workload-package/v6 to mgmt repo with the cluster name (say, "test01"), and approve it.

nephio-prow bot pushed a commit to nephio-project/nephio that referenced this pull request Jun 2, 2023
- Increase the default delay to two minutes
- Check the Ready status of the owning PackageVariant

Along with
nephio-project/nephio-example-packages#42 this
should:

Fixes #248
Fixes #249
Fixes #251
@henderiw
Copy link
Contributor

henderiw commented Jun 4, 2023

/approve
/lgtm

@nephio-prow
Copy link
Contributor

nephio-prow bot commented Jun 4, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: henderiw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@nephio-prow nephio-prow bot merged commit ce0e222 into nephio-project:main Jun 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sometimes clusterName is not applied correctly Sometimes auto-approval packages produced by PVS/PV get stuck
2 participants