Extracts objects from the Linode production PostgreSQL database and lands them in the AWS S3 Bronze layer as Parquet. Designed to be reused across multiple object types as the legacy → AWS migration progresses.
Snowflake DDL and delivery layer live in a separate repo: rgdp-snowflake-delivery
Linode Postgres
└── county tables (parcels, addresses, and future object datasets)
│
▼
extract.py (runs locally via Docker, or in AWS Batch + Fargate)
│
▼
S3 Bronze — Parquet
s3://regrid-data-platform-us-east-2-bronze/
{source_system}/dataset={dataset}/admin0={admin0}/admin1={admin1}/filename_stem={filename_stem}/snapshot_dt={extract_date}/{table_name}_{extract_date}.parquet
│
▼
EMR Serverless PySpark [post-POC per object]
│
▼
S3 Silver 1 — Iceberg
s3://regrid-data-platform-us-east-2-silver/
stage=silver1/source={source_system}/dataset={dataset}/
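The Bronze key layout above can be sketched as a simple template render. This is illustrative only: the field names come from the path template, but the example values (e.g. `linode_pg`) are hypothetical.

```python
# Illustrative sketch: assemble a Bronze S3 object key from the partition
# fields shown in the path template above. Values below are hypothetical.
BRONZE_KEY_TEMPLATE = (
    "{source_system}/dataset={dataset}/admin0={admin0}/admin1={admin1}/"
    "filename_stem={filename_stem}/snapshot_dt={extract_date}/"
    "{table_name}_{extract_date}.parquet"
)

def bronze_key(**fields) -> str:
    """Render the Bronze object key for one extracted table."""
    return BRONZE_KEY_TEMPLATE.format(**fields)

key = bronze_key(
    source_system="linode_pg",   # hypothetical source-system label
    dataset="parcel",
    admin0="us",
    admin1="mi",
    filename_stem="mi_wayne",
    table_name="mi_wayne",
    extract_date="2025-01-01",
)
```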
| Object | Status | Pattern | Cadence | Config |
|---|---|---|---|---|
| parcel | Active — POC | county_sharded | monthly | config/parcel_extract_config.yaml |
| verse | Ready | single_table | weekly | config/verse_extract_dataset.yaml |
| (future objects added here) | | | | |
Two extract patterns are supported:
- `county_sharded` — fan out one container per county. Used by parcel and intended for atom, ageon, attom-ownership, and zoneomics. Driven by the geofips lookup in `public.verse`.
- `single_table` — one container, one query, one Parquet file. Used by verse and any future small reference table. SQL is read verbatim from the object's YAML config.
Cadences above are intended targets, currently invoked on demand. EventBridge scheduling is a follow-up after on-demand is proven.
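The two-pattern split can be sketched as a small dispatch: `county_sharded` fans out one unit of work per county, `single_table` yields exactly one. Function and key names here are assumptions for illustration, not the repo's actual API.

```python
# Sketch of the two-pattern dispatch; helper and config key names are
# hypothetical. Returns one work-unit label per container to launch.
def plan_extract(config: dict) -> list[str]:
    pattern = config["extract_pattern"]
    if pattern == "county_sharded":
        # one container per county geofips (lookup lives in public.verse)
        return [f"extract-county-{g}" for g in config["geofips_list"]]
    if pattern == "single_table":
        # one container, one query, one Parquet file
        return [f"extract-table-{config['table_name']}"]
    raise ValueError(f"unknown extract_pattern: {pattern!r}")
```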
| Decision | Detail |
|---|---|
| Primary key | `geoid` + `ll_uuid` composite — `ll_uuid` not globally unique in legacy |
| `_source_hashkey` | SHA-256(geoid + ll_uuid) — identity, stable across runs |
| `_record_hashkey` | SHA-256(geoid + ll_uuid + geometry_wkb) — content, detects geometry changes |
| Bronze compute | AWS Batch + Fargate (1 container per county) |
| Silver 1 compute | EMR Serverless PySpark (single job per object, post-POC) |
| Iceberg | Silver 1 only — Bronze stays plain Parquet |
| Q4 amendment | When AWS platform assigns global UUID, simplify _source_hashkey to SHA-256(new_uuid) |
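The two hash keys can be sketched with `hashlib`. This is a minimal sketch, assuming plain concatenation with no field delimiter and UTF-8 encoding; the real extract may delimit or normalize fields differently, and all input values below are hypothetical.

```python
import hashlib

def sha256_hex(*parts: str) -> str:
    """SHA-256 hex digest of the concatenated parts (delimiter assumed absent)."""
    return hashlib.sha256("".join(parts).encode("utf-8")).hexdigest()

geoid, ll_uuid = "26163", "a1b2c3"   # hypothetical values
geometry_wkb = "0101000000"          # hypothetical WKB hex fragment

# identity key: stable for a record across runs
source_hashkey = sha256_hex(geoid, ll_uuid)
# content key: changes whenever the geometry changes
record_hashkey = sha256_hex(geoid, ll_uuid, geometry_wkb)
```

Under the planned Q4 amendment, `source_hashkey` would collapse to `sha256_hex(new_uuid)` once the AWS platform assigns a global UUID.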
Prerequisites:
- Python 3.11+
- Docker + Docker Compose
- AWS credentials configured (`aws configure` or `AWS_PROFILE`)
- VPN connection (for Linode Postgres access)
# 1. Copy and fill in credentials
cp dev/.env.local.example dev/.env.local
# Edit dev/.env.local — never commit this file
# 2. Confirm column names before building anything
psql "$LINODE_DSN" -c "
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'mi_wayne'
ORDER BY ordinal_position;"
# 3. Build container
docker compose build parcel-extract
# 4. Run one county → writes to real S3 Bronze
docker compose run --rm parcel-extract
# 5. Validate output
python3 dev/validate_output.py
# 6. Run a small parcel batch (default: 3 TX counties)
python3 jobs/submit_jobs_local.py
# 7. Run the verse single-table extract
docker compose build verse-extract
python3 jobs/submit_jobs_local.py --object verse

The `--object` flag (or `OBJECT` env var) routes to the right config and container. Default is parcel, so existing invocations are unaffected.
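The flag/env-var routing can be sketched with `argparse`. The precedence shown (flag wins over env var, default `parcel`) is an assumption about `submit_jobs_local.py`, not confirmed behavior.

```python
import argparse
import os

# Sketch of --object / OBJECT routing; precedence (flag over env var,
# default "parcel") is assumed for illustration.
def resolve_object(argv=None) -> str:
    parser = argparse.ArgumentParser()
    parser.add_argument("--object",
                        default=os.environ.get("OBJECT", "parcel"))
    return parser.parse_args(argv).object
```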
Requires VPC + NAT Gateway configured by admin. Run connectivity check first:
./infra/connectivity_check.sh

Once all checks pass:
# parcel — state-filtered extract (demo sample)
STATE_FILTER=MI,OH,IN,IL python3 jobs/submit_jobs_aws.py
# parcel — full national extract
python3 jobs/submit_jobs_aws.py
# verse single-table extract
python3 jobs/submit_jobs_aws.py --object verse

One-time AWS setup per object: a Batch job queue and job definition must exist before its `--object` invocation will succeed. Parcel uses `parcel-extract-queue` / `parcel-county-extract`. Verse needs `verse-extract-queue` / `verse-extract` created in the same account/region.
To extend this repo for a new object type (e.g. atom, ageon, addresses, zoning):

- Add a config file: `config/{object}_extract_config.yaml` — set `extract_pattern` to either `county_sharded` or `single_table`, and define the SQL under `source.query` (single_table) or rely on the geofips loop (county_sharded).
- Add an extract container: `containers/{object}-extract/{Dockerfile,extract.py,requirements.txt}`.
  - For single_table objects, use `containers/verse-extract/` as the template.
  - For county_sharded objects, use `containers/parcel-extract/` as the template.
- Add an entry to `OBJECTS` in `jobs/object_registry.py` with the compose service, Batch queue, Batch job definition, and pattern.
- Add the new service to `docker-compose.yml`.
- Create the Batch queue + job definition in AWS.
- Add a row to the Supported Objects table above.
rgdp-legacy-master-data-extract/
containers/
_shared/
extract_lib.py ← shared OID→Arrow, hashing, S3 helpers
parcel-extract/ ← Dockerfile, extract.py, requirements.txt
verse-extract/ ← Dockerfile, extract.py, requirements.txt
{future-object}-extract/ ← added per object as needed
jobs/
object_registry.py ← object → config/container/Batch mapping
submit_jobs_local.py ← local sequential runner (dev)
submit_jobs_aws.py ← AWS Batch fan-out (production)
config/
parcel_extract_config.yaml
verse_extract_dataset.yaml
dev/
.env.local.example ← copy to .env.local, fill in, never commit
validate_output.py ← DuckDB validation against real S3
infra/
batch_job_definition.json
connectivity_check.sh
iam/
batch_job_role_policy.json
docs/
RB-002_Parcel_Extract_Runbook_v2.docx
| Repo | Purpose |
|---|---|
| `rgdp-legacy-master-data-extract` | This repo — Bronze extraction from Linode |
| `regrid-snowflake-delivery` | Snowflake stages, external tables, views, Snowpipe |
| (main data platform repo) | AWS medallion pipeline, Silver → Gold, EMR jobs |