Skip to content

Commit 992f812

Browse files
authored
Uprgraded Presidio and Added New Types (#44)
1 parent 677180c commit 992f812

File tree

18 files changed

+2734
-2082
lines changed

18 files changed

+2734
-2082
lines changed

.github/workflows/checks.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@ jobs:
1414
- name: setup - python
1515
uses: actions/setup-python@v4
1616
with:
17-
python-version: 3.12
17+
python-version: 3.13
1818
- name: Install Global Dependencies
1919
run: pip install -U pip && pip install uv
2020
- name: install

.github/workflows/draft-pdf.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@ jobs:
66
name: Paper Draft
77
steps:
88
- name: Checkout
9-
uses: actions/checkout@v3
9+
uses: actions/checkout@v4
1010
- name: Build draft PDF
1111
uses: openjournals/openjournals-draft-action@master
1212
with:

.github/workflows/release.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,7 @@ jobs:
2121
- name: Setup Python
2222
uses: actions/setup-python@v4
2323
with:
24-
python-version: 3.12
24+
python-version: 3.13
2525
- name: Checkout branch "main"
2626
uses: actions/checkout@v4
2727
with:

.github/workflows/test.yml

Lines changed: 2 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -17,12 +17,11 @@ jobs:
1717
name: Run Tests
1818
strategy:
1919
matrix:
20-
python-version: [ "3.9", "3.10", "3.11", "3.12" ]
20+
python-version: [ "3.10", "3.11", "3.12" ] # 3.13: thinc/spacy not yet compatible (C API _PyLong_AsByteArray)
2121
os: [ubuntu-latest, macos-latest]
2222
runs-on: ${{ matrix.os }}
2323

24-
# Checkout the code, install poetry, install dependencies,
25-
# and run test with coverage
24+
# Checkout, install deps with uv, run tests with coverage
2625
steps:
2726
- name: Environment Setup
2827
uses: actions/checkout@v4

.zenodo.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
{
22
"access_right": "open",
3-
"version": "0.5.0",
3+
"version": "0.6.0",
44
"creators": [
55
{
66
"orcid": "0000-0003-0665-098X",

CITATION.cff

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,6 @@ authors:
55
given-names: Eidan J.
66
orcid: https://orcid.org/0000-0003-0665-098X
77
title: "pii-codex: a Python library for PII detection, categorization, and severity assessment"
8-
version: 0.5.0
8+
version: 0.6.0
99
doi: 10.5281/zenodo.7212576
10-
date-released: 2025-12-16
10+
date-released: 2026-02-13

Makefile

Lines changed: 1 addition & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -6,9 +6,7 @@ test: lint test.all
66
test.cov: test.coverage
77

88
install:
9-
@uv sync
10-
@uv sync --all-extras
11-
@uv sync --extra dev
9+
@uv sync --extra dev --extra detections
1210
$(MAKE) install.pre_commit
1311
@echo "Installation complete!"
1412

README.md

Lines changed: 10 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,7 @@ Author: Eidan Rosado - [@EdyVision](https://github.com/EdyVision) <br/>
2121
Affiliation: Nova Southeastern University, College of Computing and Engineering
2222

2323
## Project Background
24-
The <em>PII Codex</em> project was built as a core part of an ongoing research effort in Personal Identifiable Information (PII) detection and risk assessment (to be publicly released later in 2023). There was a need to not only detect PII in text, but also identify its severity, associated categorizations in cybersecurity research and policy documentation, and provide a way for others in similar research efforts to reproduce or extend the research. PII Codex is a combination of systematic research, conceptual frameworks, third-party open source software, and cloud service provider integrations. The categorizations are directly influenced by the research of Milne et al. (2016) while the ranking is a result of category severities on the scale provided by Schwartz and Solove (2012) from Non-Identifiable, Semi-Identifiable, and Identifiable.
24+
The <em>PII Codex</em> project was built as a core part of an ongoing research effort in Personal Identifiable Information (PII) detection and risk assessment. There was a need to not only detect PII in text, but also identify its severity, associated categorizations in cybersecurity research and policy documentation, and provide a way for others in similar research efforts to reproduce or extend the research. PII Codex is a combination of systematic research, conceptual frameworks, third-party open source software, and cloud service provider integrations. The categorizations are directly influenced by the research of Milne et al. (2016) while the ranking is a result of category severities on the scale provided by Schwartz and Solove (2012) from Non-Identifiable, Semi-Identifiable, and Identifiable.
2525

2626
The outputs of the primary PII Codex analysis and adapter functions are AnalysisResult or AnalysisResultSet objects that will provide a listing of detections, severities, mean risk scores for each string processed, and summary statistics on the analysis made. The final outputs do not contain the original texts but instead will provide where to find the detections should the end-user care for this information in their analysis.
2727

@@ -37,44 +37,33 @@ Potential usages include sanitizing of dataset strings (e.g. a collection of soc
3737
<hr/>
3838

3939
## Running Locally with uv
40-
This project uses `uv` for dependency management. To run this project, install `uv` and proceed to follow the instructions under `/docs/LOCAL_SETUP.md`.
41-
42-
`Note: This project has only been tested with Ubuntu and MacOS and with Python versions 3.11 and 3.12. You may need to upgrade pip ahead of installation.`
43-
44-
## Installing with PIP
45-
Video capture of install provided in LOCAL_SETUP.md file. Make sure you set up a virtual environment with either python 3.11 or 3.12 and upgrade pip with:
40+
This project uses `uv` for dependency management. Install [uv](https://docs.astral.sh/uv/) then clone the repo and run:
4641

4742
```bash
48-
pip install --upgrade pip
49-
pip install -U pip uv # only needed if you haven't already done so
43+
make install
5044
```
5145

52-
Before adding `pii-codex` on your project, download the spaCy `en_core_web_lg` model:
46+
This runs `uv sync --extra dev --extra detections` so you get the base package, dev tools (pytest, black, pylint, etc.), and detection extras (spaCy, Presidio Analyzer/Anonymizer). The spaCy model `en_core_web_lg` is included in the `detections` extra and is installed automatically; you do not need to run `spacy download` yourself. If for some reason the model is missing at runtime, the code will attempt to install it (via `spacy download` or, in uv-managed venvs without pip, via `uv pip install` and a known wheel URL).
5347

54-
```bash
55-
pip install -U spacy
56-
python3 -m spacy download en_core_web_lg
57-
```
48+
For more detail, see [docs/LOCAL_SETUP.md](docs/LOCAL_SETUP.md). This project has been tested on Ubuntu and macOS with Python 3.11 and 3.12.
5849

59-
For more details on spaCy installation and usage, refer to their <a href="https://spacy.io/usage">docs</a>.
60-
61-
The repository releases are hosted on PyPi and can be installed with:
50+
## Installing as a dependency (PyPI or uv)
51+
Releases are on PyPI. To use PII Codex in another project:
6252

6353
```bash
6454
pip install pii-codex
6555
pip install "pii-codex[detections]"
6656
```
6757

68-
`Note: The extras installed with pii-codex[detections] are the spaCy, Micrisoft Presidio Analyzer, and Microsoft Anonymzer packages.`
69-
70-
Using uv:
58+
With uv:
7159

7260
```bash
73-
uv sync
7461
uv add pii-codex
7562
uv add "pii-codex[detections]"
7663
```
7764

65+
The `[detections]` extra installs spaCy, Microsoft Presidio Analyzer and Anonymizer, and the `en_core_web_lg` model (via a direct wheel URL), so detection works out of the box. If you install without the extra and later use detection features, the code will try to install the model on first use when possible.
66+
7867
For those using Google Collab, check out the example notebook:
7968

8069
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/EdyVision/802ce21aab21eb5d9afa9e43d301eef7/pii-codex-sample-notebook.ipynb)

codecov.yml

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -2,9 +2,11 @@ coverage:
22
status:
33
project:
44
default:
5-
target: 90% # the required coverage value
6-
threshold: 1% # the leniency in hitting the target
5+
target: 90%
6+
threshold: 1%
7+
informational: true # report only; do not fail build or turn red when below target
78
patch:
89
default:
9-
target: 90% # the required coverage value
10-
threshold: 1% # the leniency in hitting the target
10+
target: 90%
11+
threshold: 1%
12+
informational: true # report only; do not fail build or turn red when below target

pii_codex/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
__version__ = "0.5.0"
1+
__version__ = "0.6.0"

0 commit comments

Comments
 (0)