rgdp-legacy-master-data-extract

Extracts objects from the Linode production PostgreSQL database and lands them in the AWS S3 Bronze layer as Parquet. Designed to be reused across multiple object types as the legacy → AWS migration progresses.

Snowflake DDL and delivery layer live in a separate repo: rgdp-snowflake-delivery


Architecture

Linode Postgres
  └── county tables (parcels, addresses, and future object datasets)
          │
          ▼
    extract.py  (runs locally via Docker, or in AWS Batch + Fargate)
          │
          ▼
    S3 Bronze — Parquet
    s3://regrid-data-platform-us-east-2-bronze/
      {source_system}/dataset={dataset}/admin0={admin0}/admin1={admin1}/filename_stem={filename_stem}/snapshot_dt={extract_date}/{table_name}_{extract_date}.parquet
          │
          ▼
    EMR Serverless PySpark  [post-POC per object]
          │
          ▼
    S3 Silver 1 — Iceberg
    s3://regrid-data-platform-us-east-2-silver/
      stage=silver1/source={source_system}/dataset={dataset}/
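The Bronze key template above can be sketched as a small helper. This is a hypothetical illustration (the function name and the `linode_pg` source_system value are assumptions; the repo's actual key construction lives in extract.py):

```python
# Hypothetical helper illustrating the Bronze partition layout shown above.
BRONZE_BUCKET = "regrid-data-platform-us-east-2-bronze"

def build_bronze_key(source_system, dataset, admin0, admin1,
                     filename_stem, table_name, extract_date):
    """Return the S3 object key for one extracted table snapshot."""
    return (
        f"{source_system}/dataset={dataset}/admin0={admin0}/admin1={admin1}/"
        f"filename_stem={filename_stem}/snapshot_dt={extract_date}/"
        f"{table_name}_{extract_date}.parquet"
    )

key = build_bronze_key("linode_pg", "parcel", "us", "mi",
                       "mi_wayne", "mi_wayne", "2025-01-15")
```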

Supported Objects

Object   Status        Pattern          Cadence   Config
parcel   Active (POC)  county_sharded   monthly   config/parcel_extract_config.yaml
verse    Ready         single_table     weekly    config/verse_extract_dataset.yaml
(future objects added here)

Two extract patterns are supported:

  • county_sharded — fan out one container per county. Used by parcel and intended for atom, ageon, attom-ownership, zoneomics. Driven by the geofips lookup in public.verse.
  • single_table — one container, one query, one parquet file. Used by verse and any future small reference table. SQL is read verbatim from the object's YAML config.

The cadences above are intended targets; all extracts are currently invoked on demand. EventBridge scheduling is a follow-up once the on-demand path is proven.
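The two patterns differ only in how many work units they fan out. A minimal planning sketch (the function name is hypothetical, and the county list is passed in for illustration; in production the county_sharded list comes from the geofips lookup in public.verse):

```python
def plan_work_units(extract_pattern, counties=None):
    """Return one work unit per container launch.

    county_sharded: one unit per county (geofips), e.g. parcel.
    single_table:   exactly one unit, e.g. verse.
    """
    if extract_pattern == "county_sharded":
        # Real fan-out reads geofips from public.verse; passed in here.
        return [{"pattern": extract_pattern, "geofips": c} for c in counties]
    if extract_pattern == "single_table":
        return [{"pattern": extract_pattern}]
    raise ValueError(f"unknown extract_pattern: {extract_pattern}")

units = plan_work_units("county_sharded", ["26163", "26125"])
```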


Key Design Decisions

Decision           Detail
Primary key        geoid + ll_uuid composite (ll_uuid is not globally unique in legacy)
_source_hashkey    SHA-256(geoid + ll_uuid): identity, stable across runs
_record_hashkey    SHA-256(geoid + ll_uuid + geometry_wkb): content, detects geometry changes
Bronze compute     AWS Batch + Fargate (one container per county)
Silver 1 compute   EMR Serverless PySpark (single job per object, post-POC)
Iceberg            Silver 1 only; Bronze stays plain Parquet
Q4 amendment       once the AWS platform assigns a global UUID, simplify _source_hashkey to SHA-256(new_uuid)
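The two hashkeys can be sketched with hashlib. The concatenation order follows the table above, but the encoding and lack of a field separator are assumptions; containers/_shared/extract_lib.py is the authoritative implementation:

```python
import hashlib

def _sha256_hex(*parts: bytes) -> str:
    h = hashlib.sha256()
    for p in parts:
        h.update(p)
    return h.hexdigest()

def source_hashkey(geoid: str, ll_uuid: str) -> str:
    # Identity hash: stable across runs for the same logical record.
    return _sha256_hex(geoid.encode(), ll_uuid.encode())

def record_hashkey(geoid: str, ll_uuid: str, geometry_wkb: bytes) -> str:
    # Content hash: changes whenever the geometry changes.
    return _sha256_hex(geoid.encode(), ll_uuid.encode(), geometry_wkb)
```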

Local Development

Prerequisites:

  • Python 3.11+
  • Docker + Docker Compose
  • AWS credentials configured (aws configure or AWS_PROFILE)
  • VPN connection (for Linode Postgres access)
# 1. Copy and fill in credentials
cp dev/.env.local.example dev/.env.local
# Edit dev/.env.local — never commit this file

# 2. Confirm column names before building anything
psql "$LINODE_DSN" -c "
  SELECT column_name, data_type
  FROM information_schema.columns
  WHERE table_name = 'mi_wayne'
  ORDER BY ordinal_position;"

# 3. Build container
docker compose build parcel-extract

# 4. Run one county → writes to real S3 Bronze
docker compose run --rm parcel-extract

# 5. Validate output
python3 dev/validate_output.py

# 6. Run a small parcel batch (default: 3 TX counties)
python3 jobs/submit_jobs_local.py

# 7. Run the verse single-table extract
docker compose build verse-extract
python3 jobs/submit_jobs_local.py --object verse

The --object flag (or OBJECT env var) routes to the right config and container. Default is parcel, so existing invocations are unaffected.
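Under the hood, that routing might look like the following sketch (the registry shape and field names are assumptions for illustration; the real logic lives in jobs/submit_jobs_local.py and jobs/object_registry.py):

```python
import argparse
import os

# Hypothetical registry shape; see jobs/object_registry.py for the real one.
OBJECTS = {
    "parcel": {"service": "parcel-extract", "config": "config/parcel_extract_config.yaml"},
    "verse":  {"service": "verse-extract",  "config": "config/verse_extract_dataset.yaml"},
}

def resolve_object(argv=None, environ=os.environ):
    """CLI --object wins, then the OBJECT env var, then the parcel default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--object", default=environ.get("OBJECT", "parcel"),
                        choices=sorted(OBJECTS))
    args = parser.parse_args(argv)
    return args.object, OBJECTS[args.object]
```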


Production (AWS Batch)

Requires a VPC and NAT Gateway configured by an admin. Run the connectivity check first:

./infra/connectivity_check.sh

Once all checks pass:

# parcel — state-filtered extract (demo sample)
STATE_FILTER=MI,OH,IN,IL python3 jobs/submit_jobs_aws.py

# parcel — full national extract
python3 jobs/submit_jobs_aws.py

# verse single-table extract
python3 jobs/submit_jobs_aws.py --object verse

One-time AWS setup per object: a Batch job queue and job definition must exist before --object invocations for that object will succeed. Parcel uses parcel-extract-queue / parcel-county-extract; verse needs verse-extract-queue / verse-extract created in the same account and region.
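Per-county submission boils down to one boto3 batch submit_job call per work unit. A sketch that builds the request payload (queue and job definition names from the note above; the GEOFIPS / EXTRACT_DATE env var names are assumptions for illustration):

```python
def batch_submit_kwargs(geofips: str, extract_date: str,
                        job_queue: str = "parcel-extract-queue",
                        job_definition: str = "parcel-county-extract"):
    """Build kwargs for boto3 batch client submit_job() for one county."""
    return {
        "jobName": f"parcel-extract-{geofips}-{extract_date}",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            # Assumed env var names; the container reads its work unit from these.
            "environment": [
                {"name": "GEOFIPS", "value": geofips},
                {"name": "EXTRACT_DATE", "value": extract_date},
            ]
        },
    }

# boto3.client("batch").submit_job(**batch_submit_kwargs("26163", "2025-01-15"))
```

Keeping the payload builder pure makes it easy to unit-test the fan-out without touching AWS.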


Adding a New Object

To extend this repo for a new object type (e.g. atom, ageon, addresses, zoning):

  1. Add a config file: config/{object}_extract_config.yaml — set extract_pattern to either county_sharded or single_table, and define the SQL under source.query (single_table) or rely on the geofips loop (county_sharded).
  2. Add an extract container: containers/{object}-extract/{Dockerfile,extract.py,requirements.txt}.
    • For single_table objects, use containers/verse-extract/ as the template.
    • For county_sharded objects, use containers/parcel-extract/ as the template.
  3. Add an entry to OBJECTS in jobs/object_registry.py with the compose service, Batch queue, Batch job definition, and pattern.
  4. Add the new service to docker-compose.yml.
  5. Create the Batch queue + job definition in AWS.
  6. Add a row to the Supported Objects table above.
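The registry entry in step 3 might look like this for a new county_sharded object named atom (field names assumed from the description above; check jobs/object_registry.py for the actual schema):

```python
# Hypothetical OBJECTS entry for a new county_sharded object "atom".
OBJECTS = {
    "atom": {
        "pattern": "county_sharded",
        "config": "config/atom_extract_config.yaml",
        "compose_service": "atom-extract",
        "batch_queue": "atom-extract-queue",
        "batch_job_definition": "atom-county-extract",
    },
}
```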

Repository Structure

rgdp-legacy-master-data-extract/
  containers/
    _shared/
      extract_lib.py              ← shared OID→Arrow, hashing, S3 helpers
    parcel-extract/               ← Dockerfile, extract.py, requirements.txt
    verse-extract/                ← Dockerfile, extract.py, requirements.txt
    {future-object}-extract/      ← added per object as needed
  jobs/
    object_registry.py            ← object → config/container/Batch mapping
    submit_jobs_local.py          ← local sequential runner (dev)
    submit_jobs_aws.py            ← AWS Batch fan-out (production)
  config/
    parcel_extract_config.yaml
    verse_extract_dataset.yaml
  dev/
    .env.local.example            ← copy to .env.local, fill in, never commit
    validate_output.py            ← DuckDB validation against real S3
  infra/
    batch_job_definition.json
    connectivity_check.sh
    iam/
      batch_job_role_policy.json
  docs/
    RB-002_Parcel_Extract_Runbook_v2.docx

Related Repos

Repo                               Purpose
rgdp-legacy-master-data-extract    this repo: Bronze extraction from Linode
regrid-snowflake-delivery          Snowflake stages, external tables, views, Snowpipe
(main data platform repo)          AWS medallion pipeline, Silver → Gold, EMR jobs
