Commit 600c0e3

Merge pull request #10 from microsoft/architecture-updates

docs: architecture updates

2 parents 3d9bad0 + fa4c91c

9 files changed: 17 additions & 19 deletions

TRANSPARENCY_FAQ.md

Lines changed: 3 additions & 3 deletions

@@ -2,15 +2,15 @@
  ### What is the Content Processing Solution Accelerator?
- This solution accelerator is an open-source GitHub Repository to extract data from unstructured documents and transform the data into defined schemas with validation to enhance the speed of downstream data ingestion and improve quality. It enables the ability to efficiently automate extraction, validation, and structuring of information for event driven system-to-system workflows. The solution is built using Azure OpenAI, Azure AI Services, Content Understanding Services, CosmosDB, and Azure Containers.
+ This solution accelerator is an open-source GitHub Repository to extract data from unstructured documents and transform the data into defined schemas with validation to enhance the speed of downstream data ingestion and improve quality. It enables the ability to efficiently automate extraction, validation, and structuring of information for event driven system-to-system workflows. The solution is built using Azure OpenAI Service, Azure AI Services, Azure AI Content Understanding Service, Azure Cosmos DB, and Azure Container Apps.
  ### What can the Content Processing Solution Accelerator do?
- The sample solution is tailored for a Data Analyst at a property insurance company, who analyzes large amounts of claim-related data including forms, reports, invoices, and property loss documentation. The sample data is synthetically generated utilizing Azure OpenAI and saved into related templates and files, which are unstructured documents that can be used to show the processing pipeline. Any names and other personally identifiable information in the sample data is fictitious.
+ The sample solution is tailored for a Data Analyst at a property insurance company, who analyzes large amounts of claim-related data including forms, reports, invoices, and property loss documentation. The sample data is synthetically generated utilizing Azure OpenAI Service and saved into related templates and files, which are unstructured documents that can be used to show the processing pipeline. Any names and other personally identifiable information in the sample data is fictitious.
- The sample solution processes the uploaded documents by exposing an API endpoint that utilizes Azure OpenAI and Content Understanding Service for extraction. The extracted data is then transformed into a specific schema output based on the content type (ex: invoice), and validates the extraction and schema mapping through accuracy scoring. The scoring enables thresholds to dictate a human-in-the-loop review of the output if needed, allowing a user to review, update, and add comments.
+ The sample solution processes the uploaded documents by exposing an API endpoint that utilizes Azure OpenAI Service and Azure AI Content Understanding Service for extraction. The extracted data is then transformed into a specific schema output based on the content type (ex: invoice), and validates the extraction and schema mapping through accuracy scoring. The scoring enables thresholds to dictate a human-in-the-loop review of the output if needed, allowing a user to review, update, and add comments.
  ### What is/are the Content Processing Solution Accelerator’s intended use(s)?
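The threshold-gated human-in-the-loop review described in the FAQ diff above can be sketched as follows. This is a minimal illustration; the function name, field names, and default threshold are assumptions, not the accelerator's actual API:

```python
# Minimal sketch of threshold-gated human-in-the-loop review as described in
# the FAQ. All names and the 0.8 default are illustrative assumptions.

def needs_human_review(extraction_score: float, schema_score: float,
                       threshold: float = 0.8) -> bool:
    """Flag a processed document for manual review when either the raw
    extraction score or the schema-mapping score falls below the threshold."""
    return min(extraction_score, schema_score) < threshold

# A strong extraction with a weak schema mapping still triggers review.
result = {"extraction_score": 0.94, "schema_score": 0.71}
flagged = needs_human_review(result["extraction_score"], result["schema_score"])
```

Taking the minimum of the two scores means a single weak stage is enough to route the document to a reviewer, which matches the conservative intent of the scoring description.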

docs/CustomizingAzdParameters.md

Lines changed: 1 addition & 1 deletion

@@ -11,7 +11,7 @@ Set the Environment Name Prefix
  azd env set AZURE_ENV_NAME 'cps'
  ```
- Change the Content Understanding Service Location (example: eastus2, westus2, etc.)
+ Change the Azure Content Understanding Service Location (example: eastus2, westus2, etc.)
  ```shell
  azd env set AZURE_ENV_CU_LOCATION 'West US'
  ```
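The `azd env set` commands in the hunk above can be generated and validated programmatically, for example before a scripted deployment. The sketch below is illustrative (the helper names are assumptions, and the allowed-region set is taken from the `@allowed` decorator in infra/main.bicep later in this commit):

```python
# Sketch: build the `azd env set` commands shown above, validating the Content
# Understanding location against the regions allowed in infra/main.bicep.
# Helper names are illustrative; the docs simply run these commands by hand.
import shlex

# From the @allowed(...) decorator in infra/main.bicep (this commit).
ALLOWED_CU_LOCATIONS = {"WestUS", "SwedenCentral", "AustraliaEast"}

def azd_env_set(name: str, value: str) -> str:
    """Render a single `azd env set` command, shell-quoting the value."""
    return f"azd env set {name} {shlex.quote(value)}"

def cu_location_command(location: str) -> str:
    """Refuse locations the bicep template would reject at deploy time."""
    if location not in ALLOWED_CU_LOCATIONS:
        raise ValueError(f"unsupported Content Understanding location: {location}")
    return azd_env_set("AZURE_ENV_CU_LOCATION", location)
```

Validating against the bicep `@allowed` list up front fails fast, rather than discovering an unsupported region partway through provisioning.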

docs/DeploymentGuide.md

Lines changed: 1 addition & 2 deletions

@@ -8,8 +8,7 @@ Check the [Azure Products by Region](https://azure.microsoft.com/en-us/explore/g
  - [Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-foundry/)
  - [Azure OpenAI Service](https://learn.microsoft.com/en-us/azure/ai-services/openai/)
- - [Azure AI Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/)
- - [Azure AI Content Understanding](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/)
+ - [Azure AI Content Understanding Service](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/)
  - [Azure Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/)
  - [Azure Container Apps](https://learn.microsoft.com/en-us/azure/container-apps/)
  - [Azure Container Registry](https://learn.microsoft.com/en-us/azure/container-registry/)

docs/Images/ReadMe/approach.png

Binary image changed: 37.8 KB (was 51.4 KB).

docs/ProcessingPipelineApproach.md

Lines changed: 5 additions & 5 deletions

@@ -10,7 +10,7 @@ At the application level, when a file is processed a number of steps take place
  3. Images are extracted from individual pages and included with the markdown content in a second call to Azure OpenAI Vision to complete a second extraction and multiple extraction prompts relating to the schema initially selected.
- 4. These two extracted datasets are compared and use system level logs from Azure AI Content Understanding and Azure OpenAI to determine the extraction score. This score is used to determine which extraction method is the most accurate for the schema and content and sent to be transformed and structured for finalization.
+ 4. These two extracted datasets are compared and use system level logs from Azure AI Content Understanding and Azure OpenAI Service to determine the extraction score. This score is used to determine which extraction method is the most accurate for the schema and content and sent to be transformed and structured for finalization.
  5. The top performing data is used for transforming the data into its selected schema. This is saved as a JSON format along with the final extraction and schema mapping scores. These scores can be used to initiate human-in-the-loop review - allowing for manual review, updates, and annotation of changes.

@@ -21,16 +21,16 @@ At the application level, when a file is processed a number of steps take place
  1. **Extract Pipeline** – Text Extraction via Azure Content Understanding.
- Uses Azure Content Understanding Service to detect and extract text from images and PDFs. This service also retrieves the coordinates of each piece of text, along with confidence scores, by leveraging built-in (pretrained) models.
+ Uses Azure AI Content Understanding Service to detect and extract text from images and PDFs. This service also retrieves the coordinates of each piece of text, along with confidence scores, by leveraging built-in (pretrained) models.
- 2. **Map Pipeline** – Mapping Extracted Text with Azure OpenAI GPT-4o
+ 2. **Map Pipeline** – Mapping Extracted Text with Azure OpenAI Service GPT-4o
  Takes the extracted text (as context) and the associated document images, then applies GPT-4o’s vision capabilities to interpret the content. It maps the recognized text to a predefined entity schema, providing structured data fields and confidence scores derived from model log probabilities.
  3. **Evaluate Pipeline** – Merging and Evaluating Extraction Results
- Combines confidence scores from both the Extract pipeline (Azure Content Understanding) and the Map pipeline (GPT-4o). It then calculates an overall confidence level by merging and comparing these scores, ensuring accuracy and consistency in the final extracted data.
+ Combines confidence scores from both the Extract pipeline (Azure AI Content Understanding) and the Map pipeline (GPT-4o). It then calculates an overall confidence level by merging and comparing these scores, ensuring accuracy and consistency in the final extracted data.
- 4. **Save Pipeline** – Storing Results in Azure Blob Storage and Cosmos DB
+ 4. **Save Pipeline** – Storing Results in Azure Blob Storage and Azure Cosmos DB
  Aggregates all outputs from the Extract, Map, and Evaluate steps. It finalizes and saves the processed data to Azure Blob Storage for file-based retrieval and updates or creates records in Azure Cosmos DB for structured, queryable storage. Confidence scoring is captured and saved with results for down-stream use - showing up, for example, in the web UI of the processing queue. This is surfaced as "extraction score" and "schema score" and is used to highlight the need for human-in-the-loop if desired.
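The Evaluate pipeline described in the hunks above (merging Extract confidences from Content Understanding with Map confidences derived from GPT-4o log probabilities) might look roughly like this sketch. The per-field "keep the higher confidence" merge rule and all field names are assumptions for illustration, not the accelerator's real logic:

```python
# Rough sketch of the Evaluate step: merge per-field confidence scores from
# the Extract pipeline (Content Understanding) and the Map pipeline (GPT-4o),
# keep the higher-confidence source per field, and report an overall score.
# The merge rule and field names are illustrative assumptions.

def evaluate(extract_conf: dict[str, float], map_conf: dict[str, float]) -> dict:
    merged = {}
    for field in extract_conf.keys() | map_conf.keys():
        e = extract_conf.get(field, 0.0)  # missing field -> zero confidence
        m = map_conf.get(field, 0.0)
        merged[field] = {"winner": "extract" if e >= m else "map",
                         "confidence": max(e, m)}
    # Overall score: mean of the winning per-field confidences.
    overall = sum(v["confidence"] for v in merged.values()) / len(merged)
    return {"fields": merged, "overall_confidence": overall}

report = evaluate({"invoice_total": 0.97, "vendor": 0.60},
                  {"invoice_total": 0.92, "vendor": 0.88})
```

A per-field merge lets each pipeline win where it is strongest, which is consistent with the document's claim that the comparison picks "which extraction method is the most accurate for the schema and content."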

docs/TechnicalArchitecture.md

Lines changed: 3 additions & 4 deletions

@@ -20,7 +20,6 @@ Using Azure Container App, this includes API end points exposed to facilitate in
  ### Content Process Monitor Web
  Using Azure Container App, this app acts as the UI for the process monitoring queue. The app is built with React and TypeScript. It acts as an API client to create an experience for uploading new documents, monitoring current and historical processes, and reviewing output results.
-
  ### App Configuration
  Using Azure App Configuration, app settings and configurations are centralized and used with the Content Processor, Content process API, and Content Process Monitor Web.

@@ -30,11 +29,11 @@ Using Azure Storage Queue, pipeline work steps and processing jobs are added to
  ### Azure AI Content Understanding Service
  Used to detect and extract text from images and PDFs. This service also retrieves the coordinates of each piece of text, along with confidence scores, by leveraging built-in (pretrained) models. This utilizes the prebuild-layout 2024-12-01-preview for extraction.
- ### Azure OpenAI
- Using Azure OpenAI, a deployment of the GPT-4o 2024-10-01-preview model is used during the content processing pipeline to extract content. GPT Vision is used for extraction and validation functions during processing. This model can be changed to a different Azure OpenAI model if desired, but this has not been thoroughly tested and may be affected by the output token limits.
+ ### Azure OpenAI Service
+ Using Azure OpenAI Service, a deployment of the GPT-4o 2024-10-01-preview model is used during the content processing pipeline to extract content. GPT Vision is used for extraction and validation functions during processing. This model can be changed to a different Azure OpenAI Service model if desired, but this has not been thoroughly tested and may be affected by the output token limits.
  ### Blob Storage
  Using Azure Blob Storage, schema .py files, source files for processing, and final output JSON files are stored in blob storage.
- ### Cosmos DB for MongoDB
+ ### Azure Cosmos DB for MongoDB
  Using Azure Cosmos DB for MongoDB, files that have been submitted for processing are added to the DB and their processing step history is saved. The processing queue stores individual processes information and history for status and processing step review, along with final extraction and transformation into JSON for its selected schema.

infra/deploy_role_assignments.bicep

Lines changed: 2 additions & 2 deletions

@@ -6,8 +6,8 @@ param appConfigResourceId string // Resource ID of the App Configuration instanc
  param storageResourceId string // Resource ID of the Storage account
  param storagePrincipalId string // Resource ID of the Storage account
- param aiServiceCUId string // Resource ID of the Content Understanding Service
- param aiServiceId string // Resource ID of the Open AI service
+ param aiServiceCUId string // Resource ID of the Azure AI Content Understanding Service
+ param aiServiceId string // Resource ID of the Azure Open AI service
  param containerRegistryReaderPrincipalId string
infra/main.bicep

Lines changed: 2 additions & 2 deletions

@@ -9,11 +9,11 @@ param environmentName string
  var uniqueId = toLower(uniqueString(subscription().id, environmentName, resourceGroup().location))
  var solutionPrefix = 'cps-${padLeft(take(uniqueId, 12), 12, '0')}'
- @description('Location used for Cosmos DB, Container App deployment')
+ @description('Location used for Azure Cosmos DB, Azure Container App deployment')
  param secondaryLocation string = 'EastUs2'
  @minLength(1)
- @description('Location for the Content Understanding service deployment:')
+ @description('Location for the Azure AI Content Understanding service deployment:')
  @allowed(['WestUS', 'SwedenCentral', 'AustraliaEast'])
  @metadata({
    azd: {
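The `solutionPrefix` expression in the main.bicep hunk above combines three Bicep string functions: `take(uniqueId, 12)` keeps the first 12 characters, and `padLeft(..., 12, '0')` left-pads shorter values with zeros, so the prefix is always `cps-` followed by exactly 12 characters. A Python stand-in (for illustration only; `uniqueString()` itself is an ARM-internal hash that cannot be reproduced here):

```python
# Python stand-in for the naming logic in main.bicep above:
#   var solutionPrefix = 'cps-${padLeft(take(uniqueId, 12), 12, '0')}'
# take() keeps the first 12 chars; padLeft() zero-pads shorter inputs, so the
# result is always 'cps-' + 12 characters. uniqueString() is not reproducible
# outside ARM, so unique_id is supplied directly here.

def solution_prefix(unique_id: str) -> str:
    return "cps-" + unique_id.lower()[:12].rjust(12, "0")
```

The fixed-width prefix matters because downstream resource names built from it stay within Azure naming-length limits regardless of the hash the deployment produces.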
