rgdp-legacy-master-data-extract

Extracts objects from the Linode production PostgreSQL database and lands them in the AWS S3 Bronze layer as Parquet. Designed to be reused across multiple object types as the legacy → AWS migration progresses.

Snowflake DDL and delivery layer live in a separate repo: rgdp-snowflake-delivery


Architecture

Linode Postgres
  └── county tables (parcels, addresses, and future object datasets)
          │
          ▼
    extract.py  (runs locally via Docker, or in AWS Batch + Fargate)
          │
          ▼
    S3 Bronze — Parquet
    s3://regrid-data-platform-us-east-2-bronze/
      {source_system}/dataset={dataset}/admin0={admin0}/admin1={admin1}/filename_stem={filename_stem}/snapshot_dt={extract_date}/{table_name}_{extract_date}.parquet
          │
          ▼
    EMR Serverless PySpark  [post-POC per object]
          │
          ▼
    S3 Silver 1 — Iceberg
    s3://regrid-data-platform-us-east-2-silver/
      stage=silver1/source={source_system}/dataset={dataset}/
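The Bronze key template above can be sketched as a small helper. This is a hypothetical illustration (the function name and the `linode_pg` source_system value are assumptions; the repo's actual key construction lives in extract.py):

```python
# Hypothetical helper illustrating the Bronze partition layout shown above.
BRONZE_BUCKET = "regrid-data-platform-us-east-2-bronze"

def build_bronze_key(source_system, dataset, admin0, admin1,
                     filename_stem, table_name, extract_date):
    """Return the S3 object key for one extracted table snapshot."""
    return (
        f"{source_system}/dataset={dataset}/admin0={admin0}/admin1={admin1}/"
        f"filename_stem={filename_stem}/snapshot_dt={extract_date}/"
        f"{table_name}_{extract_date}.parquet"
    )

key = build_bronze_key("linode_pg", "parcel", "us", "mi",
                       "mi_wayne", "mi_wayne", "2025-01-15")
```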

Supported Objects

Object   Status        Pattern          Cadence   Config
parcel   Active (POC)  county_sharded   monthly   config/parcel_extract_config.yaml
verse    Ready         single_table     weekly    config/verse_extract_dataset.yaml
(future objects added here)

Two extract patterns are supported:

  • county_sharded — fan out one container per county. Used by parcel and intended for atom, ageon, attom-ownership, zoneomics. Driven by the geofips lookup in public.verse.
  • single_table — one container, one query, one parquet file. Used by verse and any future small reference table. SQL is read verbatim from the object's YAML config.

The cadences above are intended targets; all extracts are currently invoked on demand. EventBridge scheduling is a follow-up once the on-demand path is proven.
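The two patterns differ only in how many work units they fan out. A minimal planning sketch (the function name is hypothetical, and the county list is passed in for illustration; in production the county_sharded list comes from the geofips lookup in public.verse):

```python
def plan_work_units(extract_pattern, counties=None):
    """Return one work unit per container launch.

    county_sharded: one unit per county (geofips), e.g. parcel.
    single_table:   exactly one unit, e.g. verse.
    """
    if extract_pattern == "county_sharded":
        # Real fan-out reads geofips from public.verse; passed in here.
        return [{"pattern": extract_pattern, "geofips": c} for c in counties]
    if extract_pattern == "single_table":
        return [{"pattern": extract_pattern}]
    raise ValueError(f"unknown extract_pattern: {extract_pattern}")

units = plan_work_units("county_sharded", ["26163", "26125"])
```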


Key Design Decisions

Decision           Detail
Primary key        geoid + ll_uuid composite (ll_uuid is not globally unique in legacy)
_source_hashkey    SHA-256(geoid + ll_uuid): identity, stable across runs
_record_hashkey    SHA-256(geoid + ll_uuid + geometry_wkb): content, detects geometry changes
Bronze compute     AWS Batch + Fargate (one container per county)
Silver 1 compute   EMR Serverless PySpark (single job per object, post-POC)
Iceberg            Silver 1 only; Bronze stays plain Parquet
Q4 amendment       once the AWS platform assigns a global UUID, simplify _source_hashkey to SHA-256(new_uuid)
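The two hashkeys can be sketched with hashlib. The concatenation order follows the table above, but the encoding and lack of a field separator are assumptions; containers/_shared/extract_lib.py is the authoritative implementation:

```python
import hashlib

def _sha256_hex(*parts: bytes) -> str:
    h = hashlib.sha256()
    for p in parts:
        h.update(p)
    return h.hexdigest()

def source_hashkey(geoid: str, ll_uuid: str) -> str:
    # Identity hash: stable across runs for the same logical record.
    return _sha256_hex(geoid.encode(), ll_uuid.encode())

def record_hashkey(geoid: str, ll_uuid: str, geometry_wkb: bytes) -> str:
    # Content hash: changes whenever the geometry changes.
    return _sha256_hex(geoid.encode(), ll_uuid.encode(), geometry_wkb)
```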

Local Development

Prerequisites:

  • Python 3.11+
  • Docker + Docker Compose
  • AWS credentials configured (aws configure or AWS_PROFILE)
  • VPN connection (for Linode Postgres access)
# 1. Copy and fill in credentials
cp dev/.env.local.example dev/.env.local
# Edit dev/.env.local — never commit this file

# 2. Confirm column names before building anything
psql "$LINODE_DSN" -c "
  SELECT column_name, data_type
  FROM information_schema.columns
  WHERE table_name = 'mi_wayne'
  ORDER BY ordinal_position;"

# 3. Build container
docker compose build parcel-extract

# 4. Run one county → writes to real S3 Bronze
docker compose run --rm parcel-extract

# 5. Validate output
python3 dev/validate_output.py

# 6. Run a small parcel batch (default: 3 TX counties)
python3 jobs/submit_jobs_local.py

# 7. Run the verse single-table extract
docker compose build verse-extract
python3 jobs/submit_jobs_local.py --object verse

The --object flag (or OBJECT env var) routes to the right config and container. Default is parcel, so existing invocations are unaffected.
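Under the hood, that routing might look like the following sketch (the registry shape and field names are assumptions for illustration; the real logic lives in jobs/submit_jobs_local.py and jobs/object_registry.py):

```python
import argparse
import os

# Hypothetical registry shape; see jobs/object_registry.py for the real one.
OBJECTS = {
    "parcel": {"service": "parcel-extract", "config": "config/parcel_extract_config.yaml"},
    "verse":  {"service": "verse-extract",  "config": "config/verse_extract_dataset.yaml"},
}

def resolve_object(argv=None, environ=os.environ):
    """CLI --object wins, then the OBJECT env var, then the parcel default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--object", default=environ.get("OBJECT", "parcel"),
                        choices=sorted(OBJECTS))
    args = parser.parse_args(argv)
    return args.object, OBJECTS[args.object]
```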


Production (AWS Batch)

Requires a VPC and NAT Gateway configured by an admin. Run the connectivity check first:

./infra/connectivity_check.sh

Once all checks pass:

# parcel — state-filtered extract (demo sample)
STATE_FILTER=MI,OH,IN,IL python3 jobs/submit_jobs_aws.py

# parcel — full national extract
python3 jobs/submit_jobs_aws.py

# verse single-table extract
python3 jobs/submit_jobs_aws.py --object verse

One-time AWS setup per object: a Batch job queue and job definition must exist before --object invocations for that object will succeed. Parcel uses parcel-extract-queue / parcel-county-extract; verse needs verse-extract-queue / verse-extract created in the same account and region.
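Per-county submission boils down to one boto3 batch submit_job call per work unit. A sketch that builds the request payload (queue and job definition names from the note above; the GEOFIPS / EXTRACT_DATE env var names are assumptions for illustration):

```python
def batch_submit_kwargs(geofips: str, extract_date: str,
                        job_queue: str = "parcel-extract-queue",
                        job_definition: str = "parcel-county-extract"):
    """Build kwargs for boto3 batch client submit_job() for one county."""
    return {
        "jobName": f"parcel-extract-{geofips}-{extract_date}",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            # Assumed env var names; the container reads its work unit from these.
            "environment": [
                {"name": "GEOFIPS", "value": geofips},
                {"name": "EXTRACT_DATE", "value": extract_date},
            ]
        },
    }

# boto3.client("batch").submit_job(**batch_submit_kwargs("26163", "2025-01-15"))
```

Keeping the payload builder pure makes it easy to unit-test the fan-out without touching AWS.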


Adding a New Object

To extend this repo for a new object type (e.g. atom, ageon, addresses, zoning):

  1. Add a config file: config/{object}_extract_config.yaml — set extract_pattern to either county_sharded or single_table, and define the SQL under source.query (single_table) or rely on the geofips loop (county_sharded).
  2. Add an extract container: containers/{object}-extract/{Dockerfile,extract.py,requirements.txt}.
    • For single_table objects, use containers/verse-extract/ as the template.
    • For county_sharded objects, use containers/parcel-extract/ as the template.
  3. Add an entry to OBJECTS in jobs/object_registry.py with the compose service, Batch queue, Batch job definition, and pattern.
  4. Add the new service to docker-compose.yml.
  5. Create the Batch queue + job definition in AWS.
  6. Add a row to the Supported Objects table above.
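The registry entry in step 3 might look like this for a new county_sharded object named atom (field names assumed from the description above; check jobs/object_registry.py for the actual schema):

```python
# Hypothetical OBJECTS entry for a new county_sharded object "atom".
OBJECTS = {
    "atom": {
        "pattern": "county_sharded",
        "config": "config/atom_extract_config.yaml",
        "compose_service": "atom-extract",
        "batch_queue": "atom-extract-queue",
        "batch_job_definition": "atom-county-extract",
    },
}
```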

Repository Structure

rgdp-legacy-master-data-extract/
  containers/
    _shared/
      extract_lib.py              ← shared OID→Arrow, hashing, S3 helpers
    parcel-extract/               ← Dockerfile, extract.py, requirements.txt
    verse-extract/                ← Dockerfile, extract.py, requirements.txt
    {future-object}-extract/      ← added per object as needed
  jobs/
    object_registry.py            ← object → config/container/Batch mapping
    submit_jobs_local.py          ← local sequential runner (dev)
    submit_jobs_aws.py            ← AWS Batch fan-out (production)
  config/
    parcel_extract_config.yaml
    verse_extract_dataset.yaml
  dev/
    .env.local.example            ← copy to .env.local, fill in, never commit
    validate_output.py            ← DuckDB validation against real S3
  infra/
    batch_job_definition.json
    connectivity_check.sh
    iam/
      batch_job_role_policy.json
  docs/
    RB-002_Parcel_Extract_Runbook_v2.docx

Related Repos

Repo                               Purpose
rgdp-legacy-master-data-extract    this repo: Bronze extraction from Linode
regrid-snowflake-delivery          Snowflake stages, external tables, views, Snowpipe
(main data platform repo)          AWS medallion pipeline, Silver → Gold, EMR jobs
