Commit 3ed3fcd

feat(batch): Allow partial success for all batch mutations (#1488)
1 parent ac580e9 commit 3ed3fcd

3 files changed

Lines changed: 368 additions & 26 deletions

aip/general/0233.md

Lines changed: 122 additions & 10 deletions
@@ -15,7 +15,9 @@ transaction. A batch create method provides this functionality.
1515

1616
## Guidance
1717

18-
APIs **may** support Batch Create using the following pattern:
18+
APIs **may** support Batch Create using the following two patterns:
19+
20+
Returning the response synchronously
1921

2022
```proto
2123
rpc BatchCreateBooks(BatchCreateBooksRequest) returns (BatchCreateBooksResponse) {
@@ -26,25 +28,51 @@ rpc BatchCreateBooks(BatchCreateBooksRequest) returns (BatchCreateBooksResponse)
2628
}
2729
```
2830

31+
Returning an Operation which resolves to the response asynchronously
32+
33+
```proto
34+
rpc BatchCreateBooks(BatchCreateBooksRequest) returns (google.longrunning.Operation) {
35+
option (google.api.http) = {
36+
post: "/v1/{parent=publishers/*}/books:batchCreate"
37+
body: "*"
38+
};
39+
option (google.longrunning.operation_info) = {
40+
response_type: "BatchCreateBooksResponse"
41+
metadata_type: "BatchCreateBooksOperationMetadata"
42+
};
43+
}
44+
```
45+
2946
- The RPC's name **must** begin with `BatchCreate`. The remainder of the RPC
3047
name **should** be the plural form of the resource being created.
3148
- The request and response messages **must** match the RPC name, with
3249
`Request` and `Response` suffixes.
33-
- However, in the event that the request may take a significant amount of
34-
time, the response message **must** be a `google.longrunning.Operation`
35-
which ultimately resolves to the `Response` type.
50+
- If the batch method returns a `google.longrunning.Operation`, both the
51+
`response_type` and `metadata_type` fields **must** be specified.
3652
- The HTTP verb **must** be `POST`.
3753
- The HTTP URI **must** end with `:batchCreate`.
3854
- The URI path **should** represent the collection for the resource, matching
3955
the collection used for simple CRUD operations. If the operation spans
4056
parents, a dash (`-`) **may** be accepted as a wildcard.
4157
- The body clause in the `google.api.http` annotation **should** be `"*"`.
42-
- The operation **should** be atomic: it **should** fail for all resources or
43-
succeed for all resources (no partial success).
44-
- If the operation covers multiple locations and at least one location is
45-
down, the operation **must** fail.
46-
- In cases where supporting partial responses cannot be avoided, the design
47-
should follow the guidelines of [AIP-193](https://aip.dev/193).
58+
59+
### Atomic vs. Partial Success
60+
61+
- The batch create method **may** support atomic (all resources created or none
62+
are) or partial success behavior. To make a choice, consider the following
63+
factors:
64+
- **Complexity of Ensuring Atomicity:** Operations that are simple
65+
passthrough database transactions **should** use an atomic operation,
66+
while operations that manage complex resources **should** use partial
67+
success operations.
68+
- **End-User Experience:** Consider the perspective of the API consumer.
69+
Would atomic behavior be preferable for the given use case, even if it
70+
means that a large batch could fail due to issues with a single or a few
71+
entries?
72+
- Synchronous batch create **must** be atomic.
73+
- Asynchronous batch create **may** support atomic or partial success.
74+
- If supporting partial success, see
75+
[Operation metadata message](#operation-metadata-message) requirements.
4876

4977
### Request message
5078

@@ -111,11 +139,95 @@ message BatchCreateBooksResponse {
111139
- The response message **must** include one repeated field corresponding to the
112140
resources that were created.
113141

142+
### Operation metadata message
143+
144+
- The `metadata_type` message **must** either match the RPC name with
145+
`OperationMetadata` suffix, or be named with `Batch` prefix and
146+
`OperationMetadata` suffix if the type is shared by multiple Batch methods.
147+
- If the batch create method supports partial success, the metadata message **must**
148+
include a `map<int32, google.rpc.Status> failed_requests` field to communicate
149+
the partial failures.
150+
- The key in this map is the index of the request in the `requests` field in
151+
the batch request.
152+
- The value in each map entry **must** mirror the error(s) that would normally
153+
be returned by the singular Standard Create method.
154+
- If a failed request can eventually succeed due to server-side retries, such
155+
transient errors **must not** be communicated using `failed_requests`.
156+
- When all requests in the batch fail, `Operation.error` **must** be set with
157+
`code = google.rpc.Code.ABORTED` and `message = "None of the requests
158+
succeeded, refer to the BatchCreateBooksOperationMetadata.failed_requests
159+
for individual error details"`.
160+
- The metadata message **may** include other fields to communicate the
161+
operation progress.
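
As a rough illustration of the requirements above, the metadata message for `BatchCreateBooks` could be sketched as follows. The package name and the field number are assumptions made for this example; they do not come from the commit.

```proto
syntax = "proto3";

// Hypothetical package name, used only for this sketch.
package example.library.v1;

import "google/rpc/status.proto";

// Metadata for the long-running operation returned by BatchCreateBooks.
message BatchCreateBooksOperationMetadata {
  // Requests that failed, keyed by the index of the request in
  // BatchCreateBooksRequest.requests. Each value mirrors the error that
  // the Standard Create method would have returned for that request.
  // The field number is an assumption.
  map<int32, google.rpc.Status> failed_requests = 1;
}
```

Keying by index lets a client look up the failed entry directly in the `requests` list it sent, without tracking any extra identifiers.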
162+
163+
### Adopting Partial Success
164+
165+
In order for an existing Batch API to adopt the partial success pattern, the API
166+
must do the following:
167+
168+
- The default behavior must be retained to avoid incompatible behavioral
169+
changes.
170+
- If the API returns an Operation:
171+
- The request message **must** have a `bool return_partial_success` field.
172+
- The Operation `metadata_type` **must** include a
173+
`map<int32, google.rpc.Status> failed_requests` field.
174+
- When the `bool return_partial_success` field is set to true in a request,
175+
the API should allow partial success behavior; otherwise it should continue
176+
with atomic behavior as default.
177+
- If the API returns a direct response synchronously:
178+
- Since the existing clients will treat a success response as an atomic
179+
operation, the existing version of the API **must not** adopt the partial
180+
success pattern.
181+
- A new version **must** be created instead that returns an Operation and
182+
follows the partial success pattern described in this AIP.
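
For an API that already returns an Operation, the opt-in flag might be added to the request message roughly as shown here. This is a sketch under assumptions: the package name, the empty `CreateBookRequest` placeholder, and the field number chosen for `return_partial_success` are illustrative only.

```proto
syntax = "proto3";

// Hypothetical package name, used only for this sketch.
package example.library.v1;

// Placeholder for the Standard Create request message; its real fields
// are omitted from this sketch.
message CreateBookRequest {}

message BatchCreateBooksRequest {
  // The parent resource shared by all books being created.
  string parent = 1;

  // The individual create requests.
  repeated CreateBookRequest requests = 2;

  // Opt-in flag (field number is an assumption). When true, the operation
  // may succeed partially and report failures in the operation metadata.
  // When false or unset, the existing atomic behavior is kept.
  bool return_partial_success = 3;
}
```

Because an unset proto3 `bool` reads as `false`, clients that predate the field keep the atomic behavior without any change on their side.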
183+
184+
## Rationale
185+
186+
### Restricting synchronous batch methods to be atomic
187+
188+
The restriction that synchronous batch methods must be atomic is a result of
189+
the following considerations.
190+
191+
The previous iteration of this AIP recommended that batch methods be atomic.
192+
There is no clear way to convey partial failure in a sync response status code
193+
because an OK implies it all worked. Therefore, adding a new field to the
194+
response to indicate partial failure would be a breaking change because the
195+
existing clients would interpret an OK response as all resources created.
196+
197+
On the other hand, as described in [AIP-193](https://aip.dev/193), Operations
198+
are more capable of presenting partial states. The response status code for an
199+
Operation does not convey anything about the outcome of the underlying operation
200+
and a client has to check the response body to determine if the operation was
201+
successful.
202+
203+
### Communicating partial failures
204+
205+
The AIP recommends using a `map<int32, google.rpc.Status> failed_requests` field
206+
to communicate partial failures, where the key is the index of the failed
207+
request in the original batch request. The other options considered were:
208+
209+
- A `repeated google.rpc.Status` field. This was rejected because it is not
210+
clear which entry corresponds to which request.
211+
- A `map<string, google.rpc.Status>` field, where the key is the request id of
212+
the failed request. This was rejected because:
213+
- Clients would need to maintain a map of request_id -> request in order to use
214+
the partial success response.
215+
- Populating a request id for the purpose of communicating errors could
216+
conflict with [AIP-155](https://aip.dev/155) if the service cannot
217+
guarantee idempotency for an individual request across multiple batch
218+
requests.
219+
- A `repeated FailedRequest` field, where FailedRequest contains the individual
220+
create request and the `google.rpc.Status`. This was rejected because echoing
221+
the request payload back in the response is discouraged due to additional
222+
challenges around user data sensitivity.
223+
114224
[aip-122-parent]: ./0122.md#fields-representing-a-resources-parent
115225
[request-message]: ./0133.md#request-message
116226

117227
## Changelog
118228

229+
- **2025-03-06**: Added detailed guidance for partial success behavior, and
230+
decision framework for choosing between atomic and partial success.
119231
- **2023-04-18**: Changed the recommendation to allow returning partial
120232
successes.
121233
- **2022-06-02**: Changed suffix descriptions to eliminate superfluous "-".

aip/general/0234.md

Lines changed: 122 additions & 8 deletions
@@ -15,7 +15,9 @@ transaction. A batch update method provides this functionality.
1515

1616
## Guidance
1717

18-
APIs **may** support Batch Update using the following pattern:
18+
APIs **may** support Batch Update using the following two patterns:
19+
20+
Returning the response synchronously
1921

2022
```proto
2123
rpc BatchUpdateBooks(BatchUpdateBooksRequest) returns (BatchUpdateBooksResponse) {
@@ -26,23 +28,51 @@ rpc BatchUpdateBooks(BatchUpdateBooksRequest) returns (BatchUpdateBooksResponse)
2628
}
2729
```
2830

31+
Returning an Operation which resolves to the response asynchronously
32+
33+
```proto
34+
rpc BatchUpdateBooks(BatchUpdateBooksRequest) returns (google.longrunning.Operation) {
35+
option (google.api.http) = {
36+
post: "/v1/{parent=publishers/*}/books:batchUpdate"
37+
body: "*"
38+
};
39+
option (google.longrunning.operation_info) = {
40+
response_type: "BatchUpdateBooksResponse"
41+
metadata_type: "BatchUpdateBooksOperationMetadata"
42+
};
43+
}
44+
```
45+
2946
- The RPC's name **must** begin with `BatchUpdate`. The remainder of the RPC
3047
name **should** be the plural form of the resource being updated.
3148
- The request and response messages **must** match the RPC name, with
3249
`Request` and `Response` suffixes.
33-
- However, in the event that the request may take a significant amount of
34-
time, the response message **must** be a `google.longrunning.Operation`
35-
which ultimately resolves to the `Response` type.
50+
- If the batch method returns a `google.longrunning.Operation`, both the
51+
`response_type` and `metadata_type` fields **must** be specified.
3652
- The HTTP verb **must** be `POST`.
3753
- The HTTP URI **must** end with `:batchUpdate`.
3854
- The URI path **should** represent the collection for the resource, matching
3955
the collection used for simple CRUD operations. If the operation spans
4056
parents, a dash (`-`) **may** be accepted as a wildcard.
4157
- The body clause in the `google.api.http` annotation **should** be `"*"`.
42-
- The operation **must** be atomic: it **must** fail for all resources or
43-
succeed for all resources (no partial success).
44-
- If the operation covers multiple locations and at least one location is
45-
down, the operation **must** fail.
58+
59+
### Atomic vs. Partial Success
60+
61+
- The batch update method **may** support atomic (all resources updated or none
62+
are) or partial success behavior. To make a choice, consider the following
63+
factors:
64+
- **Complexity of Ensuring Atomicity:** Operations that are simple
65+
passthrough database transactions **should** use an atomic operation,
66+
while operations that manage complex resources **should** use partial
67+
success operations.
68+
- **End-User Experience:** Consider the perspective of the API consumer.
69+
Would atomic behavior be preferable for the given use case, even if it
70+
means that a large batch could fail due to issues with a single or a few
71+
entries?
72+
- Synchronous batch update **must** be atomic.
73+
- Asynchronous batch update **may** support atomic or partial success.
74+
- If supporting partial success, see
75+
[Operation metadata message](#operation-metadata-message) requirements.
4676

4777
### Request message
4878

@@ -107,11 +137,95 @@ message BatchUpdateBooksResponse {
107137
- The response message **must** include one repeated field corresponding to the
108138
resources that were updated.
109139

140+
### Operation metadata message
141+
142+
- The `metadata_type` message **must** either match the RPC name with
143+
`OperationMetadata` suffix, or be named with `Batch` prefix and
144+
`OperationMetadata` suffix if the type is shared by multiple Batch methods.
145+
- If the batch update method supports partial success, the metadata message **must**
146+
include a `map<int32, google.rpc.Status> failed_requests` field to communicate
147+
the partial failures.
148+
- The key in this map is the index of the request in the `requests` field
149+
in the batch request.
150+
- The value in each map entry **must** mirror the error(s) that would normally
151+
be returned by the singular Standard Update method.
152+
- If a failed request can eventually succeed due to server-side retries, such
153+
transient errors **must not** be communicated using `failed_requests`.
154+
- When all requests in the batch fail, `Operation.error` **must** be set with
155+
`code = google.rpc.Code.ABORTED` and `message = "None of the requests
156+
succeeded, refer to the BatchUpdateBooksOperationMetadata.failed_requests
157+
for individual error details"`.
158+
- The metadata message **may** include other fields to communicate the
159+
operation progress.
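
Where several batch methods share one metadata type, the `Batch`-prefixed naming option above might look like the following sketch. The message name `BatchBooksOperationMetadata`, the package name, the field numbers, and the `completed_requests` progress field are all hypothetical choices for this example.

```proto
syntax = "proto3";

// Hypothetical package name, used only for this sketch.
package example.library.v1;

import "google/rpc/status.proto";

// Hypothetical metadata type shared by BatchCreateBooks and
// BatchUpdateBooks, named with a Batch prefix and an OperationMetadata
// suffix.
message BatchBooksOperationMetadata {
  // Requests that failed, keyed by the index of the request in the
  // `requests` field of the batch request.
  map<int32, google.rpc.Status> failed_requests = 1;

  // Hypothetical progress field: the number of requests processed so far.
  int32 completed_requests = 2;
}
```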
160+
161+
### Adopting Partial Success
162+
163+
In order for an existing Batch API to adopt the partial success pattern, the API
164+
must do the following:
165+
166+
- The default behavior must be retained to avoid incompatible behavioral
167+
changes.
168+
- If the API returns an Operation:
169+
- The request message **must** have a `bool return_partial_success` field.
170+
- The Operation `metadata_type` **must** include a
171+
`map<int32, google.rpc.Status> failed_requests` field.
172+
- When the `bool return_partial_success` field is set to true in a request,
173+
the API should allow partial success behavior; otherwise it should continue
174+
with atomic behavior as default.
175+
- If the API returns a direct response synchronously:
176+
- Since the existing clients will treat a success response as an atomic
177+
operation, the existing version of the API **must not** adopt the partial
178+
success pattern.
179+
- A new version **must** be created instead that returns an Operation and
180+
follows the partial success pattern described in this AIP.
181+
182+
## Rationale
183+
184+
### Restricting synchronous batch methods to be atomic
185+
186+
The restriction that synchronous batch methods must be atomic is a result of
187+
the following considerations.
188+
189+
The previous iteration of this AIP recommended that batch methods be atomic.
190+
There is no clear way to convey partial failure in a sync response status code
191+
because an OK implies it all worked. Therefore, adding a new field to the
192+
response to indicate partial failure would be a breaking change because the
193+
existing clients would interpret an OK response as all resources updated.
194+
195+
On the other hand, as described in [AIP-193](https://aip.dev/193), Operations
196+
are more capable of presenting partial states. The response status code for an
197+
Operation does not convey anything about the outcome of the underlying operation
198+
and a client has to check the response body to determine if the operation was
199+
successful.
200+
201+
### Communicating partial failures
202+
203+
The AIP recommends using a `map<int32, google.rpc.Status> failed_requests` field
204+
to communicate partial failures, where the key is the index of the failed
205+
request in the original batch request. The other options considered were:
206+
207+
- A `repeated google.rpc.Status` field. This was rejected because it is not
208+
clear which entry corresponds to which request.
209+
- A `map<string, google.rpc.Status>` field, where the key is the request id of
210+
the failed request. This was rejected because:
211+
- Clients would need to maintain a map of request_id -> request in order to use
212+
the partial success response.
213+
- Populating a request id for the purpose of communicating errors could
214+
conflict with [AIP-155](https://aip.dev/155) if the service cannot
215+
guarantee idempotency for an individual request across multiple batch
216+
requests.
217+
- A `repeated FailedRequest` field, where FailedRequest contains the individual
218+
update request and the `google.rpc.Status`. This was rejected because echoing
219+
the request payload back in the response is discouraged due to additional
220+
challenges around user data sensitivity.
221+
110222
[aip-122-parent]: ./0122.md#fields-representing-a-resources-parent
111223
[request-message]: ./0134.md#request-message
112224

113225
## Changelog
114226

227+
- **2025-03-06**: Changed recommendation to allow partial success, along with
228+
detailed guidance.
115229
- **2022-06-02:** Changed suffix descriptions to eliminate superfluous "-".
116230
- **2020-09-16**: Suggested annotating `parent` and `requests` fields.
117231
- **2020-08-27**: Removed parent recommendations for top-level resources.
