feat(AIP-236): Creating new AIP-236

nitinapi · web-flow · commit 0bfd6877af31 · 2023-04-05T15:59:47.000-07:00
Add a new AIP around policy preview. See #1030
diff --git a/aip/general/0236.md b/aip/general/0236.md
@@ -0,0 +1,381 @@
+---
+id: 236
+state: approved
+created: 2023-03-30
+updated: 2023-03-30
+placement:
+  category: resource-design
+  order: 240
+---
+
+# Policy preview
+
+A policy is a resource that provides rules that admit or deny access to other
+resources. Generally, the outcome of a policy can be evaluated to a specific set
+of outcomes.
+
+Changes to policies without proper validation may have unintended consequences
+that can severely impact a customer’s overall infrastructure setup. To safely
+update resources, it is beneficial to test these changes via policy rollout 
+APIs.
+
+Preview is a rollout safety mechanism for policy resources, which gives the
+customer the ability to validate the effect of their proposed changes against
+production traffic prior to the changes going live. The result of the policy
+evaluation against traffic is logged in order to give the customer the data
+required to test the correctness of the change.
+
+Firewall policies exemplify a case that is suitable for previewing. A new 
+configuration can be evaluated against traffic to observe which IPs would be 
+allowed or denied. This gives the customer the data to guide a decision on
+whether to promote the proposed changes to live.
+
+The expected flow for previewing a policy is as follows:
+
+1. The user creates an experiment containing a new policy configuration
+   intended to replace the live policy.
+2. The user uses the "preview" method to start generating logs which compare
+   the live and experiment policy evaluations against live traffic.
+3. The user inspects the logs to determine whether the experiment has the
+   intended result.
+4. The user uses the "commit" method to promote the experiment to live.
+
+## Guidance
+
+### Non-goals
+
+This proposal is for a safety mechanism for policy rollouts only. Safe rollouts
+for non-policy resources are not in scope.
+
+### Experiments
+
+A new configuration of a policy to be previewed is stored as a nested collection
+under the policy. These nested collections are known as experiments.
+
+A hypothetical policy resource called, `Policy`, is used throughout. It has the
+following resource name pattern:
+
+`projects/{project}/locations/{location}/policies/{policy}`
+
+The experimental versions of the resource used for previewing or other safe
+rollout practices are represented as a nested collection under `Policy` using a
+new resource type. The resource type **must** follow the naming convention
+*RegularResourceType*`Experiment`. 
+
+The following pattern is used for the experiment collection:
+
+`projects/{project}/locations/{location}/policies/{policy}/experiments/{experiment}`
+
+A proto used to represent an experiment **must** contain the following:
+
+  1. The required top-level fields for a resource, like `name` and `etag`
+  2. The policy message that is being tested itself
+  3. The field, `preview_metadata`, which contains metadata specific to
+     previewing the experiment of a specific resource type.
+
+```proto
+message PolicyExperiment {
+
+  // The resource name of the PolicyExperiment.
+  string name = 1;
+ 
+  // The policy experiment. This Policy will be used to preview the effects of
+  // the change but will not affect life traffic.
+  Policy policy = 2;
+
+  // The metadata associated with this policy experiment.
+  PolicyPreviewMetadata preview_metadata = 3
+      [(google.api.field_behavior) = OUTPUT_ONLY];
+}
+```
+
+- The experiment proto **must** have a top-level field with the same type as the
+  live policy. 
+  - It **must** be named as the live resource type. For example, if the
+    experiment is for FirewallPolicy, then this field **must** be named
+    `firewall_policy`.
+  - The name inside the embedded `policy` message **must** be the name of the
+    live policy. 
+- When the user is ready to promote an experiment, they **must** copy the 
+  `policy` message into the live policy and delete the experiment. This can be
+  done manually or via a "commit" custom method.
+- A product **may** support multiple experiments concurrently being previewed 
+  for a single live policy. 
+  - Each experiment must generate logs having each entry preceded by log_prefix
+    so that the user can compare the results of the experiment with the behavior
+    of the live policy.
+  - The number of experimental configurations for a given live policy **may** be
+    capped at a certain number and the cap **must** be documented.
+- Cascading deletes **must** occur: if the live policy is deleted, all
+  experiments **must** also be deleted.
+
+### Metadata
+
+`preview_metadata` tracks all metadata of previewing the experiment. The
+messages **must** follow the convention: *RegularResourceType*`PreviewMetadata`.
+This is so the proto can be defined uniquely for each resource type in the
+same service with experiments.
+ 
+```proto 
+message PolicyPreviewMetadata {
+  // The state of previewing the experiment. The possible values are ACTIVE
+  // and INACTIVE. ACTIVE indicates that results are being logged upstream.
+  string state = 1;
+
+  // An identifying string common to all logs generated when previewing the
+  // experiment. Searching all logs for this string will isolate the results.
+  string log_prefix = 2;
+
+  // The most recent time at which this experiment started previewing.
+  google.protobuf.Timestamp start_time = 3;
+
+  // The most recent time at which this experiment stopped previewing.
+  google.protobuf.Timestamp stop_time = 4;
+}
+```
+
+- `PolicyPreviewMetadata` **must** have the fields defined in the proto above.
+  - `state` **must** be `ACTIVE` or `INACTIVE`.
+  - It **may** have additional fields if the service or resource requires it.
+- When an experiment is first previewed, `preview_metadata` **must** be
+  absent.
+  - It is present on the experiment once the "preview" method is used.
+- All `preview_metadata` fields **must** be output only.
+- `state` changes to and from `ACTIVE` and `INACTIVE` when the experiment is
+  started or stopped, which can only be done by the "preview" and "stop" custom
+  methods.
+- The first time the "preview" custom method is used, the system **must** create
+  `preview_metadata` and do the following: 
+  - It **must** set the `state` to `ACTIVE`
+  - It **must** populate `start_time` with the current time. 
+    - `start_time` **must** be updated every time the status is changed to
+      `ACTIVE`.
+  - It **must** set a system generated `log_prefix` string, which is a
+    predefined constant hard coded by the system developers. 
+  - The same value is used for previewing experiments for the given resource
+    type. For example, "FirewallPolicyPreviewLog" for FirewallPolicy.
+- When the "stop" custom method is used, the system **must** do the following:
+  - It **must** set the `state` to `INACTIVE`
+  - It **must** populate the `stop_time` with the current time. 
+
+### Methods
+
+#### create
+
+- The resource **must** be created using long-running 
+  [Create][aip-133-long-running] and 
+  `google.longrunning.operation_info.response_type` **must** be
+  `PolicyExperiment`.
+- Creating a new experiment to preview **must** support the following use
+cases:
+  - Preview a new policy.
+  - Preview an update to an already live policy.
+  - Preview a deletion of a current policy.
+- For the update and delete use cases, the `policy` field in the experiment
+  **must** have the full payload of the live policy copied into it, including
+  the name. 
+  - The user **must** set the rules to the new intended state to preview an
+    update.
+  - The user **must** set set the rules to represent a no-op to preview a
+    delete.
+- To preview a new policy, the system must do the following: 
+  - If the system does not support a nested collection without a live policy, 
+    the user **must** create a live policy and set the rules to represent a
+    no-op. For example, the rules of a no-op policy **may** be empty.
+    - An experiment is created as a child of the no-op policy.
+- If the system supports previewing multiple experiments for a live policy,
+  calling "create" more than once **must** create multiple experiments.
+
+#### update
+
+- The resource **must** be updated using long-running 
+  [Update][aip-134-long-running] and 
+  `google.longrunning.operation_info.response_type` **must** be
+  `PolicyExperiment`.
+- The name inside `policy` **must not** change but the other fields can in
+  order to change the experiment being previewed because this `policy` is
+  intended to replace the live policy, and the name of the live policy
+  **must not** change.
+- The system **must** set the `state` to INACTIVE if the `state` was ACTIVE at
+  the time of an update.
+  - This is so the user can easily distinguish between different versions of
+    the experiment being previewed.
+
+#### get
+- The standard method, [Get][aip-131], **must** be included for
+  `PolicyExperiment` resource types.
+
+#### list
+
+- The standard method, [List][aip-132], **must** be included for
+  `PolicyExperiment` resource types.
+- Filtering on `PolicyPreviewMetadata` indicates which experiments are actively
+  previewed.
+  - For example, the following filter string returns a List response with 
+    experiments being previewed: preview_metadata.state = ACTIVE.
+
+#### delete
+
+- The resource **must** be deleted using long-running 
+  [Delete][aip-135-long-running] and 
+  `google.longrunning.operation_info.response_type` **must** be
+  `PolicyExperiment`.
+
+#### preview
+
+```proto
+// Starts previewing a PolicyExperiment. This triggers the system to start
+// generating logs to evaluate the PolicyExperiment.
+rpc PreviewPolicyExperiment(PreviewPolicyExperimentRequest)
+    returns (google.longrunning.Operation) {
+  option (google.api.http) = {
+    post: "/v1/{name=policies/*/experiments/*}:preview"
+    body: "*"
+  };
+  option (google.longrunning.operation_info) = {
+    response_type: "PolicyExperiment"
+    metadata_type: "PreviewPolicyExperimentMetadata"
+  };
+}
+
+// The request message for the preview custom method.
+message PreviewPolicyExperimentRequest {
+  // The name of the PolicyExperiment.
+  string name = 1;
+}
+```
+
+- This custom method is required.
+- `google.longrunning.Operation.metadata_type` **must** follow guidance on
+  [Long-running operations][aip-151]
+- This method **must** trigger the system to start generating logs to preview 
+  the experiment.
+- Whenever the method is called successfully, the system **must** set the
+  following values in the `PolicyPreviewMetadata`:
+  - `log_prefix` to the predefined constant.
+  - `start_time` to the current time
+  - `state` to `ACTIVE`.
+- If the method is called on an experiment with the rules representing a no-op,
+  then the system **must** preview the deletion of the live policy.
+
+#### stop
+
+```proto
+// Stops previewing a PolicyExperiment. This triggers the system to stop
+// generating logs to evaluate the PolicyExperiment.
+rpc StopPolicyExperiment(StopPolicyExperimentRequest)
+    returns (google.longrunning.Operation) {
+  option (google.api.http) = {
+    post: "/v1/{name=policies/*/experiments/*}:stop"
+    body: "*"
+  };
+  option (google.longrunning.operation_info) = {
+    response_type: "PolicyExperiment"
+    metadata_type: "StopPolicyExperimentMetadata"
+  };
+}
+
+// The request message for the stop custom method.
+message StopPolicyExperimentRequest {
+  // The name of the PolicyExperiment.
+  string name = 1;
+}
+```
+
+- This custom method is required.
+- `google.longrunning.Operation.metadata_type` **must** follow guidance on
+  [Long-running operations][aip-151]
+- This method **must** trigger the system to stop generating logs to preview the
+  experiment.
+- Whenever the method is called successfully, the system **must** set the
+  following values in the `PolicyPreviewMetadata`:
+  - `stop_time` to the current time
+  - `state` to `INACTIVE`
+
+#### commit
+
+The resource **may** expose a new custom method called "commit" to promote an
+experiment. The system copies `policy` from the experiment into the live policy
+and then deletes the experiment.
+
+Declarative clients **may** manually copy fields from an experiment into the
+live policy and then delete the experiment rather than calling "commit" if
+preferable.
+
+```proto
+// Commits a PolicyExperiment. This copies the PolicyExperiment's policy message
+// to the live policy then deletes the PolicyExperiment.
+rpc CommitPolicyExperiment(CommitPolicyExperimentRequest)
+    returns (google.longrunning.Operation) {
+  option (google.api.http) = {
+    post: "/v1/{name=policies/*/experiments/*}:commit"
+    body: "*"
+  };
+  option (google.longrunning.operation_info) = {
+    response_type: "google.protobuf.Empty"
+    metadata_type: "CommitPolicyExperimentMetadata"
+  };
+}
+
+// The request message for the commit custom method.
+message CommitPolicyExperimentRequest {
+  string name = 1;
+  string etag = 2;
+  string parent_etag = 3;
+}
+```
+
+- `google.longrunning.Operation.metadata_type` **must** follow guidance on
+  [Long-running operations][aip-151]
+- The method **must** atomically copy `policy` from the experiment into the live
+  policy, and then delete the experiment.
+- If any experiment fails "commit", previewing it **must not** stop, and the
+  live policy **must not** be updated.
+- The method can be called on an experiment in any state.
+- The `etag` **must** match that of the experiment in order for commit to be
+  successful. This is so the user does not commit an unintended version of the
+  experiment.
+  - If no `etag` is provided, the API **must not** succeed to prevent the user
+    from unintentionally committing a different version of the experiment as
+    intended.
+  - A `parent_etag` **may** be provided to guarantee that the experiment
+    overwrites a specific version of the live policy.
+- The method is not idempotent and calling it twice on the same experiment 
+  **must** return a 404 NOT_FOUND as the experiment is deleted as part of the
+  first call.
+
+### Changes to live policy API methods
+
+#### delete
+
+- A delete of the live policy **must** delete all experiments.
+- To maintain the experiments while negating the effect of the live policy, the
+  live policy **must** be changed to a no-op policy instead of using this
+  method.
+
+### Logging
+
+Logging is crucial for the user to evaluate whether an experiment should be
+promoted to live.
+
+Logs **must** contain the results of the evaluated experiment, the `etag`
+associated with that experiment alongside that of the live policy, and be
+preceded by the value of `log_prefix`.
+  - The `etag` fields help the user identify which
+    configurations of the live and experiment are evaluated in the log.
+  - `log_prefix` helps the user separate logs specifically generated for 
+    previewing the experiment from other use cases.
+
+Overall, these logs help the user make a decision about whether to promote the
+experiment to live.
+
+## Changelog
+
+- **2023-03-30:** Initial AIP written.
+
+[aip-131]: https://aip.dev/131
+[aip-132]: https://aip.dev/132
+[aip-133-long-running]: https://aip.dev/133#long-running-create
+[aip-134-long-running]: https://aip.dev/134#long-running-update
+[aip-135-long-running]: https://aip.dev/135#long-running-delete
+[aip-151]: https://google.aip.dev/151