Skip to content

Latest commit

 

History

History
368 lines (277 loc) · 36.2 KB

File metadata and controls

368 lines (277 loc) · 36.2 KB

Fabric Overview

Costa Rica

GitHub brown9804

Last updated: 2025-07-17


Microsoft Fabric is a comprehensive analytics and data platform tailored for enterprise needs, offering a seamless solution by integrating Software as a Service (SaaS). It merges both new and pre-existing elements from Power BI, Azure Synapse Analytics, Azure Data Factory, among others, into one cohesive platform.

Microsoft Fabric encompasses both Azure Data and Azure Analytics:

  • Azure Data: Incorporates tools for data engineering and management.
  • Azure Analytics: Features business intelligence and analytics tools, notably through its Power BI integration.
List of references (Click to expand)
Table of Contents (Click to expand)

Overview

Key Components

  • Data Engineering, Data Factory, Data Science, Real-Time Analytics, Data Warehouse, and Databases: These are the key services offered by Microsoft Fabric.
  • OneLake: This is the unification of lakehouses.
  • Real-Time hub: This is the unification of data streams.
image
Component Purpose Key Features Why Fabric vs. Each Resource Individually
Data Factory Cloud-based data integration service Orchestrates and automates data movement and transformation. Supports ETL processes. Enhanced Integration: Fabric integrates Data Factory with other components, reducing data silos and improving workflow efficiency.
Synapse Data Engineering Big data processing Utilizes Apache Spark for large-scale data preparation and transformation. Integrates with various data sources. Unified Platform: Fabric provides a cohesive environment where data engineering seamlessly connects with analytics and other services, streamlining data processing and reducing complexity.
Synapse Data Warehouse Scalable analytics service Combines big data and data warehousing. Provides high-performance SQL-based analytics. Comprehensive Insights: Fabric's integration allows for unified analytics, offering a holistic view of data across different services, enhancing decision-making and operational efficiency.
Synapse Real-Time Analytics Real-time data analytics Processes streaming data for immediate insights. Supports real-time dashboards and alerts. Seamless Data Flow: Fabric ensures continuous data flow and real-time analytics integration, enhancing responsiveness and enabling proactive decision-making.
Power BI Business analytics tool Interactive visualizations, self-service BI capabilities. Allows users to create reports and dashboards easily. Enhanced Connectivity: Fabric enhances Power BI's capabilities by providing better data connectivity and integration with other components, leading to richer insights and more comprehensive reporting.
Data Activator Trigger actions based on data conditions Automates responses to specific data events. Enhances data-driven decision-making. Integrated Automation: Fabric leverages comprehensive data insights for more effective automation and action-triggering, improving operational efficiency and responsiveness.
Synapse Data Science Machine learning and data science Tools for building, training, and deploying ML models. Integrates with the Synapse ecosystem for scalable data science workflows. Unified Workflows: Fabric integrates data science workflows with other Synapse services, enhancing collaboration, scalability, and the ability to derive actionable insights from data.

Before Fabric

Features

  • Unification with SaaS foundation: Fabric integrates workloads such as Data Engineering, Data Factory, Data Science, Data Warehouse, Real-Time Intelligence, Industry solutions, and Power BI into a shared SaaS foundation.
  • AI Integration: The entire Fabric stack has AI integration and it accelerates the data journey.
  • Unified Management and Governance: Fabric seamlessly integrates data and services, enabling unified management, governance, and discovery.
  • Security: It ensures security for items, data, and row-level access.
image

OneLake in Microsoft Fabric

OneLake is a unified, logical data lake designed to be the single repository for all your organization's analytics data, similar to how OneDrive serves as a central location for documents. It automatically comes with every Microsoft Fabric tenant, ensuring that all data is stored in one place, reducing duplication and simplifying data management. Designed to store all types of data in a single, centralized location. It serves as the foundational storage layer for all data within the Fabric ecosystem, ensuring seamless integration and accessibility across various services.

Key Features of OneLake Description
Unified Data Storage Stores all types of data, both structured and unstructured, in a single location.
Integration with Fabric All Fabric data items, such as data warehouses and lakehouses, store their data in OneLake using the Delta Parquet format.
Governance and Compliance Provides built-in data governance, including data lineage, protection, and catalog integration.
Open Access Supports ADLS Gen2 APIs and SDKs, making it compatible with existing ADLS Gen2 applications.
Shortcuts Allows data sharing across different domains without data movement, enabling virtual data products.
Data Types in OneLake - Structured Data: Tables, Databases, Data warehouses.
- Unstructured Data: Files, Documents, Images, Videos.
- Semi-Structured Data: JSON, XML, Log files.

Lakehouse & Data Warehouse

Lakehouse is a specific type of data architecture within Microsoft Fabric that combines the features of data lakes and data warehouses. It allows for the storage and processing of both structured and unstructured data, providing the flexibility of a data lake with the performance and management features of a data warehouse.

Data Warehouse is a centralized repository for storing large volumes of structured data. It is optimized for querying and analysis, providing high-performance SQL-based analytics.

Lakehouse Data Warehouse
image image
Aspect OneLake Lakehouse Data Warehouse
Purpose Acts as the central storage repository for all data types within Microsoft Fabric. Provides a unified architecture for managing and analyzing data, combining the benefits of data lakes and data warehouses. Focuses on storing and analyzing structured data for business intelligence and decision support.
Functionality Focuses on storing data in a unified, governed, and accessible manner. Focuses on enabling advanced analytics and machine learning by providing a flexible and scalable data architecture. Optimized for high-performance SQL-based querying and reporting on structured data.
When to Use Each - As the central storage repository for all data types within Microsoft Fabric.
- To ensure seamless integration and accessibility across various services.
- When you need to handle large volumes of diverse data types.
- When you require flexibility in data processing with multiple engines (e.g., Spark, SQL).
- For machine learning projects that need access to both raw and processed data.
- For real-time analytics and advanced data processing.
- When you need optimized performance for SQL-based querying and reporting.
- When your primary focus is on structured data and historical analysis.
- For a robust environment supporting business intelligence and decision support.

Parquet & Delta Data Formats

OneLake allows storage of delta parquet files, which can be read and worked with throughout all workloads. It's a single, unified, logical data lake for the whole organization. Like OneDrive, OneLake comes automatically with every Microsoft Fabric tenant and is designed to be the single place for all your analytics data.

Parquet is a columnar storage file format optimized for use with big data processing frameworks. In Microsoft Fabric, Parquet is commonly used for its efficiency in storing and querying large datasets.

  • Columnar Storage: Stores data in columns, which allows for efficient data compression and encoding schemes, reducing storage costs.
  • Efficient Query Performance: Optimized for read-heavy operations, making it ideal for analytical queries.
  • Compatibility: Supports various compression codecs (encodes or decodes a data stream or signal, "coder-decoder" or "compressor-decompressor) like gzip, snappy, and others, enhancing flexibility in data storage and retrieval.

graph TD
    subgraph Parquet Format
        A[Parquet Format] -->|Storage| B[Columnar Storage]
        A -->|Schema Enforcement| I[❌]
        A -->|Compression| D[Efficient Compression]
        A -->|Read/Write| E[Fast Read/Write]
        A -->|Immutability| F[✔️]
        A -->|ACID Transactions| G[❌]
        A -->|Data Versioning| H[❌]
        A -->|Efficient Updates| J[❌]
        A -->|Query Optimization| K[✔️]
    end
Loading

Delta format, often referred to as Delta Lake, is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to big data workloads. It is designed to address the challenges of managing large-scale data lakes.

  • ACID Transactions: Ensures data reliability and consistency, supporting complex data operations without data corruption.
  • Schema Enforcement and Evolution: Allows for schema changes over time, making it easier to manage evolving data structures.
  • Time Travel: Enables querying of historical data, providing the ability to access and revert to previous versions of data.
  • Efficient Data Management: Features like compaction, Z-Order, and V-Order optimize data storage and query performance
graph TD
    subgraph Delta Format
        L[Delta Format] -->|Storage| B[Columnar Storage]
        L -->|Schema Enforcement| C[✔️]
        L -->|Compression| D[Efficient Compression]
        L -->|Read/Write| E[Fast Read/Write]
        L -->|Immutability| Q[❌]
        L -->|ACID| N[ACID Transactions]
        L -->|Data Versioning| O[Time Travel]
        L -->|Data Management| P[Data Lineage]
        L -->|Efficient Updates| U[✔️]
        L -->|Query Optimization| V[✔️]
    end
Loading
Feature Parquet Delta Available in Parquet? Available in Delta?
Storage Type Columnar storage format, ideal for analytical queries. Columnar storage format built on Parquet, optimized for data lakes. ✔️ ✔️
Schema Information Schema included in metadata, simplifying data processing. Schema included in metadata, ensuring consistency. ✔️ ✔️
Compression Supports various compression algorithms (e.g., Snappy, Gzip) for efficient storage. High compression, leveraging Parquet's capabilities. ✔️ ✔️
Row Groups Data organized into row groups with min/max statistics for efficient querying. Similar row group structure with min/max statistics, enhancing performance. ✔️ ✔️
Immutability Immutable files, ensuring data integrity but limiting updates. Supports updates and deletes, providing flexibility for data modifications. ✔️
ACID (Atomicity, Consistency, Isolation, Durability) Transactions Not supported, making it less suitable for complex data operations. Supports ACID transactions, ensuring reliable and consistent data operations. ✔️
Data Versioning Not available, limiting the ability to track changes over time. Provides data versioning, allowing for auditing and rollback scenarios. ✔️
Schema Enforcement No built-in schema enforcement, requiring external validation. Enforces schema consistency, maintaining data quality. ✔️
Efficient Updates Does not support efficient updates, making it less suitable for frequently changing data. Allows for efficient updates and deletes, ideal for dynamic datasets. ✔️
Query Optimization Basic query optimization, relying on columnar storage benefits. Advanced query optimization with features like data skipping and Z-order indexing. ✔️ ✔️
Use Case Ideal for data warehousing, batch processing, and scenarios where data is primarily read and not frequently updated. Best suited for data lakes, real-time analytics, and environments requiring strict data integrity and frequent updates. ✔️ ✔️
Additional Context Parquet is excellent for read-heavy workloads and large-scale data analytics. It's widely supported and highly efficient for scenarios where data doesn't change frequently. Delta builds on Parquet by adding features like ACID transactions, data versioning, and efficient updates/deletes. It's designed for environments where data integrity, frequent updates, and complex data operations are crucial. ✔️ ✔️

Z-Order and V-Order

Aspect Z-Order V-Order
Purpose Improves query performance by co-locating related information in the same set of files. Enhances read performance by organizing data in a way that leverages Microsoft Verti-Scan technology.
Key Features - Data Co-Location: Organizes data based on one or more columns, storing rows with similar values together.
- Query Efficiency: Reduces the amount of data read during queries, improving performance.
- Compatibility: Works with Delta Lake to enhance data-skipping algorithms.
- Special Sorting: Applies special sorting techniques to Parquet files.
- Row Group Distribution: Optimizes row group distribution for better read performance.
- Dictionary Encoding and Compression: Uses efficient dictionary encoding and compression.
- Performance Boost: Provides fast reads under various compute engines.
- Cost Efficiency: Reduces network, disk, and CPU resources during reads.
Timing Applied during read time (or table optimization). Applied during write time.
Use Cases - When you need to improve query performance by reducing the amount of data read.
- For queries that frequently filter on specific columns.
- When you need to enhance read performance and reduce storage costs.
- For scenarios requiring efficient data access across various compute engines.
Compatibility Requires specific tools like Delta Lake. Universally compatible with all Parquet engines.

Dataflow Gen2 & Data Pipelines

image
Feature Dataflow Gen2 Data Pipelines
Hint ETL/ELT Pipeline/Orchestrator
Purpose Data transformation and preparation using a low-code interface. Orchestration of data workflows and control flow of execution.
Use Case Ideal for cleaning, prepping, and transforming data with over 300 data and AI-based transformations. Best for creating complex data workflows that involve multiple steps and dependencies.
Authoring Experience Shorter and simpler with auto-save and background publishing. More detailed and flexible, allowing for complex orchestration.
Data Destinations Supports multiple destinations like Fabric Lakehouse, Azure Data Explorer, Azure Synapse Analytics, and Azure SQL Database. Can call Dataflows and other activities to move data to various destinations.
Integration Can be integrated within Data Pipelines for more complex workflows. Can include Dataflows as part of the pipeline activities.
Flexibility Provides a low-code interface for easy data transformation. Offers rich out-of-the-box orchestration capabilities for enterprise needs.
Monitoring and Refresh History Improved monitoring and refresh history features. Provides detailed monitoring and logging for all pipeline activities.
Scalability High-scale compute capabilities for large data transformations. Scalable orchestration for large and complex workflows.
When to Use When you need to perform data transformations and preparations with a low-code interface. When you require a quick and easy way to clean and prep data. When you want to leverage AI-based transformations. When you need to orchestrate complex data workflows with multiple steps and dependencies. When you require detailed control over the execution flow of your data processes. When you need to integrate various data activities, including Dataflows, into a single workflow.
Important Information Designed to make data transformation easier and more efficient with features like auto-save, background publishing, and support for multiple data destinations. Provides a robust framework for orchestrating data workflows, allowing you to create logical groupings of activities and manage dependencies effectively.

Shortcuts & Mirroring

Feature Shortcuts Mirroring
Definition Objects that point to other storage locations, either internal or external to OneLake. A data replication solution that continuously replicates data from various sources into OneLake.
Purpose To unify data across different domains, clouds, and accounts by creating a single virtual data lake. To bring data from different systems together into a single analytics platform, ensuring data is up-to-date and readily available for analysis.
Data Movement No data is copied or moved. Shortcuts link directly to the source data. Data is copied and stored in OneLake, providing a low-latency and centralized data management solution.
Usage Shortcuts appear as folders in OneLake and can be used by any service or workload that has access to OneLake. Mirrored data can be accessed and analyzed using various tools within Microsoft Fabric, such as Power BI and Azure Synapse.
Benefits - Reduces data duplication.
- Enhances accessibility by providing a unified view of data.
- Simplifies data management by abstracting the data retrieval process.
- Provides real-time data replication.
- Ensures data consistency and availability.
- Simplifies the process of keeping data synchronized across different systems.
Key Differences - No data movement.
- Live access to external data sources.
- Ideal for accessing data across multiple locations without duplication.
- Data is physically replicated.
- Access to a centralized, up-to-date copy of data.
- Suitable for centralized analysis and ensuring data consistency.
Compatible Products - Azure Data Lake Storage (ADLS) Gen2
- Amazon S3
- Google Cloud Storage
- Dataverse
- On-premises data sources via Fabric on-premises data gateway
- Azure SQL Database
- Azure Cosmos DB
- Snowflake

Data Factory

Example over ETL (Extract, Transform, Load)

image

Steps: Click here to see a visual guidance

  1. Select Data pipeline.

  2. You can either click on Copy data assist right away or simply click on Copy data.

    image
  3. Add the connection to your data source. In this example, we'll use Azure SQL Database.

  4. Choose the tables you want to include. You can do this using the Tables view, by writing a Query or Stored procedure.

    Option Description Use Case
    Table Allows you to select and work with entire tables from your data source. Ideal for scenarios where you need to copy or transform all data within a specific table.
    Query Enables you to write custom SQL queries to select specific data from your data source. Useful when you need to filter, join, or aggregate data before copying or transforming it.
    Stored Procedure Allows you to execute predefined stored procedures in your database. Best for complex data operations that are encapsulated within stored procedures, such as data transformations or business logic.
    image
  5. Select your destination, such as the Bronze layer.

  6. Test and validate your data pipeline.

    image

Medallion Architecture Overview

Click here to go to a quick guide over the medallion architecture, and see a demo with conventional lakehouse.

image

Fabric: Highlights into AI/LLMs

Click here to see a quick guide and demo.

Writing SQL: SQL Analytics Endpoint

In Microsoft Fabric, you can write your SQL queries using the SQL analytics endpoint. This endpoint provides a SQL-based experience for querying data stored in lakehouse Delta tables.
The SQL analytics endpoint allows you to use T-SQL (Transact-SQL) to analyze your data, create views, save functions, and apply SQL security.

Key Feature Description
SQL-Based Querying Use T-SQL to query Delta tables in your lakehouse.
Views and Functions Create and save SQL views and functions for reusable queries and operations.
SQL Security Apply object-level security to control access to your data.
Read-Only Mode The SQL analytics endpoint operates in read-only mode, meaning you can read data but not modify it directly through this endpoint.

How to Configure and Use the SQL Analytics Endpoint

Steps to Configure:

  1. Create a Lakehouse:

    • Go to your Microsoft Fabric workspace.

    • Click on New item and select Lakehouse to create a new lakehouse or select an existing one. This automatically provisions a SQL analytics endpoint for the lakehouse.

      image
    • Follow the prompts to set up your lakehouse.

      image
      Aspect Lakehouse in Microsoft Fabric Lakehouse with Schema in Microsoft Fabric
      Purpose Centralized storage for structured and unstructured data. Enhanced organization and management of tables using schemas.
      Data Organization Tables are stored without additional grouping. Tables are organized into schemas, providing a folder-like structure.
      Data Discovery Basic data discovery capabilities. Improved data discovery through schema grouping.
      Access Control Standard access control at the table level. Enhanced access control at the schema level, allowing for finer granularity.
      Integration Integrated with other Fabric components like SQL analytics endpoint. Same integration, with added schema support for better data management.
      SQL Analytics Endpoint Supports querying tables directly. Supports querying tables within schemas, using namespace notation.
      Schema Management Not applicable. Allows creation, modification, and management of schemas.
      Use Cases Suitable for general data storage and analytics. Ideal for scenarios requiring organized data structures and advanced access control.
      Lakehouse simple Lakehouse with schema
      image image
  2. Connect Using SQL Query Editor:

    • Open the SQL query editor from the Microsoft Fabric portal.

      1. Go to the workspace.

      2. Select the Lakehouse/Data Warehouse you want to query.

      3. Click on the "New SQL query" button in the ribbon at the top of the pageÑ This will open the SQL query editor where you can write and execute your T-SQL queries. The editor supports features like IntelliSense, code completion, and syntax highlighting to help you write queries efficiently.

        Lakehouse - SQL Analytics Endpoint Data Warehouse - SQL Analytics Endpoint
        image image
    • Write a simple query, for example:

      SELECT * FROM your_table_name;
    • Execute the query to see the results.

      image
    • You can use Explain query to add comments to your query:

      image
  3. Connect Using External Tools:

  • Access the SQL Analytics Endpoint: Copy the SQL analytics endpoint connection string (T-SQL connection string) from the Lakehouse explorer.

    • In the workspace view, select the lakehouse.

    • In the Lakehouse explorer, go to SQL analytics endpoint.

      image
  • SQL Server Management Studio (SSMS) or Azure Data Studio:

    • For this case, open SSMS and click on Connect.

    • Select Database Engine and paste the connection string.

    • Authenticate using Azure Active Entra credentials.

    • Write and execute your SQL queries.

      image

Fabric AI Skill

Fabric AI Skills let you build custom chat platforms for asking and answering questions using advanced AI. By setting up these skills, we can ask questions and get reliable answers based on data. We need to give clear instructions and examples so the AI can work well for our company's specific needs and data.

Click here for a quick guidance.

Total views

Refresh Date: 2025-07-18