### What is the Content Processing Solution Accelerator?

This solution accelerator is an open-source GitHub repository that extracts data from unstructured documents and transforms it into defined schemas, with validation, to speed up downstream data ingestion and improve data quality. It automates the extraction, validation, and structuring of information for event-driven system-to-system workflows. The solution is built using Azure OpenAI Service, Azure AI Services, Azure AI Content Understanding Service, Azure Cosmos DB, and Azure Container Apps.

### What can the Content Processing Solution Accelerator do?

The sample solution is tailored for a data analyst at a property insurance company who analyzes large amounts of claim-related data, including forms, reports, invoices, and property loss documentation. The sample data is synthetically generated using Azure OpenAI Service and saved into related templates and files: unstructured documents used to demonstrate the processing pipeline. Any names and other personally identifiable information in the sample data are fictitious.

The sample solution processes uploaded documents by exposing an API endpoint that uses Azure OpenAI Service and Azure AI Content Understanding Service for extraction. The extracted data is then transformed into a schema output specific to the content type (for example, an invoice), and the extraction and schema mapping are validated through accuracy scoring. Scoring thresholds determine whether an output requires a human-in-the-loop review, in which a user can review, update, and add comments to the result.

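The threshold-based routing described above can be sketched as follows. This is a minimal illustration, not the accelerator's actual code: the class, function names, field names, and the 0.8 threshold value are all assumptions made for the example.

```python
from dataclasses import dataclass

# Illustrative accuracy threshold below which a human review is required.
# The accelerator lets you configure such thresholds; 0.8 is an assumed value.
REVIEW_THRESHOLD = 0.8

@dataclass
class ExtractionResult:
    content_type: str   # e.g. "invoice" (hypothetical content type label)
    fields: dict        # extracted field name -> value
    confidence: float   # overall accuracy score in [0, 1]

def needs_human_review(result: ExtractionResult,
                       threshold: float = REVIEW_THRESHOLD) -> bool:
    """Route low-confidence extractions to a human-in-the-loop review queue."""
    return result.confidence < threshold

# A high-confidence extraction passes straight through to downstream ingestion,
# while a low-confidence one is flagged for human review.
auto = ExtractionResult("invoice", {"total": "1200.00"}, confidence=0.95)
manual = ExtractionResult("invoice", {"total": "12O0.00"}, confidence=0.55)

print(needs_human_review(auto))    # False
print(needs_human_review(manual))  # True
```

The key design point the accelerator relies on is that the threshold is tunable per deployment, so teams can trade review workload against the risk of ingesting inaccurate data.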
### What is/are the Content Processing Solution Accelerator’s intended use(s)?

### What are the limitations of the Content Processing Solution Accelerator? How can users minimize the Content Processing Solution Accelerator’s limitations when using the system?

This solution accelerator can only be used as a sample to accelerate the creation of content processing solutions. The repository showcases a sample scenario of a data analyst at a property insurance company analyzing large amounts of claim-related data, but a human must still validate:

- the accuracy and correctness of the data extracted from their documents;
- the schema definitions for the business-specific documents to be extracted;
- the quality and validation scoring logic and the thresholds for human-in-the-loop review;
- the ingestion of transformed data into subsequent systems; and
- the relevance of the outputs for use with customers.

Users of the accelerator should review the system prompts provided and update them per their organizational guidance.

AI-generated content in the solution may be inaccurate; the outputs, and any integrated solutions built on the output data, are not guaranteed to be trustworthy and should be manually reviewed by the user. More information on mitigating overreliance on AI-generated content is available at https://aka.ms/overreliance-framework.

Currently, the sample repository is available in English only and has only been tested with PDF, PNG, and JPEG files up to 20 MB in size.

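One way to minimize this limitation is to reject untested inputs before they enter the pipeline. The file types and the 20 MB limit below come from the statement above; the helper function itself is an illustrative sketch and not part of the repository (the `.jpg` extension is assumed to be an acceptable spelling of JPEG).

```python
import os

# File types and size limit the sample has been tested against (per the docs).
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpeg", ".jpg"}
MAX_SIZE_BYTES = 20 * 1024 * 1024  # 20 MB

def is_supported_upload(filename: str, size_bytes: int) -> bool:
    """Reject files the sample pipeline has not been tested with."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in ALLOWED_EXTENSIONS and size_bytes <= MAX_SIZE_BYTES

print(is_supported_upload("claim_form.pdf", 5 * 1024 * 1024))  # True
print(is_supported_upload("claim_photo.tiff", 1024))           # False
```

Performing this check at the upload boundary gives users a clear error instead of an unpredictable extraction result on unsupported inputs.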
### What operational factors and settings allow for effective and responsible use of the Content Processing Solution Accelerator?
