DataJourneyHQ/DataJourney

Recipient: GitHub Secure Open Source Fund
💖 Sponsor DataJourneyHQ • 🥁 Official announcement

DJ rocks

🪶 Short version

A design-first open-source data management toolkit.
Understand the mechanics of stitching tools together into one cohesive, beautiful system.

🌲 Long version

DataJourney is a design-first open-source data management toolkit that teaches you how to assemble cohesive data systems from individual components.

Rather than prescribing specific tools, it demonstrates the mechanics of integration: how to stitch together open-source technologies into scalable, reproducible workflows. With its modular, flexible design, DataJourney serves as both a learning resource and a practical toolkit for data professionals who want to grasp the art and science of building harmonious data systems.

🧱 Design Philosophy (LEGO)

Built from layers that can be added or removed, glued together with open source. Each layer has a certain strength of communication built in.

  • P0 (Base): Static home(s) to keep it together (GitHub)
  • P1 (Tooling): Tooling, strings (Powered by open source)
  • P2 (Maintenance + Monitoring): Env, automations (Pixi + GHA)
  • P3 (Abstraction): Layer(s), CLI/task manager for users to interact with (Pixi)

DJ Design

🛠 Current workflows covered

{✨ = Experimental, ✅ = Implemented}

| Status | Workflow Description | Journey Type |
|--------|----------------------|--------------|
| ✅ | Pre-commit hooks configured for code linting/formatting | Code Quality |
| ✅ | Exploratory data analysis (EDA) using mito | EDA |
| ✅ | Environment management via pixi | Environment Management |
| ✅ | GenAI examples to analyse data using GitHub AI models | AI Data Analysis |
| ✅ | Custom dashboard using holoviews + panel | Dashboarding |
| ✅ | Reading data from online sources using intake | Data Ingestion |
| ✅ | Data pipeline built using Dagster | Orchestration / Pipelines |
| ✅ | Hello world LLM design example based on LangChain | LLM Example |
| ✅ | Python packaging framework design principles | Packaging / Project Structure |
| ✅ | Prompt enhancer powered by gpt-oss-120b | Prompt Engineering |
| ✅ | RAG powered by langchain, chromadb & GitHub AI models | RAG Pipeline |
| ✅ | GitHub Actions configured | CI/CD |
| ✅ | Web UI built on Flask | Web Application |
| ✅ | Web UI redone and expanded with FastHTML | Web Application |
| ✅ | Vale.sh configured at PR level | Docs Linting |
| ✅ | Query engine for LLM application using Chromadb | Vector Retrieval |
| ✅ | LLM evaluation & tracing for data analysis pipelines using DeepEval | LLM Evaluation |

☕️ Quickly getting started with DataJourney

  • Fork the repository
  • Generate & add GITHUB_TOKEN, instructions here

    Required only for LLM-based workflows, e.g. DJ_prompt_enhancer, DJ_llm_analysis, and others

  • Switch directory: cd DataJourney
  • Download pixi: prefix.dev
  • Activate env: pixi shell
  • Install DJ framework locally: pixi run DJ_package
  • List all the tasks: pixi run DJ_list
  • Execute a specific task from the list: pixi run <TASK_NAME>
  • Execute a specific task with additional logs: pixi run -v <TASK_NAME>
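
The pixi invocations in the list above can also be driven from a script. The following is a minimal Python sketch, assuming pixi is already installed and on PATH and that the script runs from the repository root. The helper names `pixi_command` and `run_task` are hypothetical and are not part of DataJourney.

```python
import shutil
import subprocess


def pixi_command(task, verbose=False):
    """Build the argument list for `pixi run [-v] <task>`."""
    if not task or not task.strip():
        raise ValueError("task name must be non-empty")
    cmd = ["pixi", "run"]
    if verbose:
        cmd.append("-v")  # same flag as in the quick-start list
    cmd.append(task.strip())
    return cmd


def run_task(task, verbose=False):
    """Run a task through pixi and return its exit code (hypothetical helper)."""
    if shutil.which("pixi") is None:
        raise RuntimeError("pixi not found on PATH; install it first")
    return subprocess.run(pixi_command(task, verbose)).returncode


if __name__ == "__main__":
    # Print the commands in quick-start order instead of executing them.
    for name in ("DJ_package", "DJ_list"):
        print(" ".join(pixi_command(name)))
```

Keeping command construction separate from execution makes it possible to inspect or log the exact command before anything runs.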

πŸƒπŸ½β€β™€οΈ Active tasks under DJ

| Task Name | Description |
|-----------|-------------|
| GIT_TOKEN_CHECK | Verifies the availability and validity of the Git authentication token. |
| DJ_package | Prepares and builds the Python package for the DataJourney project. |
| DJ_pre_commit | Runs pre-commit hooks to ensure code quality and adherence to standards. |
| DJ_dagster | Sets up and runs a Dagster workflow for orchestration in the project. |
| DJ_fasthtml_app | Runs the FastHTML-based web application. |
| DJ_flask_app | Configures and runs a Flask-based application for data services. |
| DJ_mito_app | Launches the Mito application for interactive data analysis in notebooks. |
| DJ_panel_app | Executes a Panel dashboard app for data visualization and analytics. |
| DJ_llm_analysis | Performs analysis using large language models (LLMs) on project data. |
| DJ_hello_world_langchain | Sets up a basic LangChain app as a "Hello World" example for LLMs. |
| DJ_spanish_eng_translation | Performs Spanish to English translation with Deepseek-R1 (NOTE: takes about 30 seconds to execute) |
| DJ_sync_dataset_trees | Downloads and synchronizes the trees.csv dataset into the project structure. |
| DJ_chromadb_gen_embedding | Query engine for LLM applications |
| DJ_RAG_without_memory | End-to-end Retrieval-Augmented Generation (RAG) pipeline |
| DJ_prompt_enhancer | How to design a simple prompt enhancer using gpt-oss-120b |
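
GIT_TOKEN_CHECK is described above as verifying the token; its actual implementation is not shown here. As a rough illustration only, an availability check could look like the sketch below, assuming the token is exported as the GITHUB_TOKEN environment variable. The function name `github_token_present` is hypothetical, and the sketch does not test validity, which would require a call to the GitHub API.

```python
import os


def github_token_present(env=None):
    """Return True if a non-empty GITHUB_TOKEN exists in the given mapping.

    Defaults to the process environment. This checks availability only;
    it does not confirm that GitHub accepts the token.
    """
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN", "")
    return bool(token.strip())


if __name__ == "__main__":
    if github_token_present():
        print("GITHUB_TOKEN is set")
    else:
        print("GITHUB_TOKEN is missing; LLM-based tasks will not run")
```

Passing the environment as a parameter keeps the check easy to exercise without touching real credentials.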

🔌 About pre-commit hooks and activating them

As the name suggests, pre-commit hooks run before each commit and format the code to PEP standards. More details

pixi run DJ_pre_commit

🦭 Executing LLM script: Generate stock price recommendations

pixi run DJ_llm_analysis

🪼 Execute pre-configured Dagster pipeline

pixi run DJ_dagster

Dagit UI output

πŸ™ Panel app

pixi run DJ_panel_app

NOTE: The generated dashboard is exported to HTML and saved as stock_price_twilio_dashboard

Panel app output

🐵 Mito

To explore further visit trymito.io

pixi run DJ_mito_app
mito_output

🦋 Display all data sources present via web UI

# Run FastHTML app
pixi run DJ_fasthtml_app

data_sources_fasthtml.png
