Affiliation: Nova Southeastern University, College of Computing and Engineering
22
22
23
23
## Project Background
24
-
The <em>PII Codex</em> project was built as a core part of an ongoing research effort in Personal Identifiable Information (PII) detection and risk assessment (to be publicly released later in 2023). There was a need to not only detect PII in text, but also identify its severity, associated categorizations in cybersecurity research and policy documentation, and provide a way for others in similar research efforts to reproduce or extend the research. PII Codex is a combination of systematic research, conceptual frameworks, third-party open source software, and cloud service provider integrations. The categorizations are directly influenced by the research of Milne et al. (2016) while the ranking is a result of category severities on the scale provided by Schwartz and Solove (2012) from Non-Identifiable, Semi-Identifiable, and Identifiable.
24
+
The <em>PII Codex</em> project was built as a core part of an ongoing research effort in Personally Identifiable Information (PII) detection and risk assessment. The research needed not only to detect PII in text but also to identify its severity and its associated categorizations in cybersecurity research and policy documentation, and to give others in similar research efforts a way to reproduce or extend the work. PII Codex is a combination of systematic research, conceptual frameworks, third-party open source software, and cloud service provider integrations. The categorizations are directly influenced by the research of Milne et al. (2016), while the ranking is derived from category severities on the scale provided by Schwartz and Solove (2012): Non-Identifiable, Semi-Identifiable, and Identifiable.
25
25
26
26
The outputs of the primary PII Codex analysis and adapter functions are AnalysisResult or AnalysisResultSet objects that provide a listing of detections, severities, mean risk scores for each string processed, and summary statistics on the analysis. The final outputs do not contain the original texts; instead, they report where each detection was found, should the end user need this information in their analysis.
27
27
@@ -37,44 +37,33 @@ Potential usages include sanitizing of dataset strings (e.g. a collection of soc
37
37
<hr/>
38
38
39
39
## Running Locally with uv
40
-
This project uses `uv` for dependency management. To run this project, install `uv` and proceed to follow the instructions under `/docs/LOCAL_SETUP.md`.
41
-
42
-
`Note: This project has only been tested with Ubuntu and MacOS and with Python versions 3.11 and 3.12. You may need to upgrade pip ahead of installation.`
43
-
44
-
## Installing with PIP
45
-
Video capture of install provided in LOCAL_SETUP.md file. Make sure you set up a virtual environment with either python 3.11 or 3.12 and upgrade pip with:
40
+
This project uses `uv` for dependency management. Install [uv](https://docs.astral.sh/uv/) then clone the repo and run:
46
41
47
42
```bash
48
-
pip install --upgrade pip
49
-
pip install -U pip uv # only needed if you haven't already done so
43
+
make install
50
44
```
51
45
52
-
Before adding `pii-codex` on your project, download the spaCy `en_core_web_lg` model:
46
+
This runs `uv sync --extra dev --extra detections` so you get the base package, dev tools (pytest, black, pylint, etc.), and detection extras (spaCy, Presidio Analyzer/Anonymizer). The spaCy model `en_core_web_lg` is included in the `detections` extra and is installed automatically; you do not need to run `spacy download` yourself. If the model is missing at runtime, the code will attempt to install it (via `spacy download` or, in uv-managed venvs without pip, via `uv pip install` and a known wheel URL).
53
47
54
-
```bash
55
-
pip install -U spacy
56
-
python3 -m spacy download en_core_web_lg
57
-
```
48
+
For more detail, see [docs/LOCAL_SETUP.md](docs/LOCAL_SETUP.md). This project has been tested on Ubuntu and macOS with Python 3.11 and 3.12.
58
49
59
-
For more details on spaCy installation and usage, refer to their <a href="https://spacy.io/usage">docs</a>.
60
-
61
-
The repository releases are hosted on PyPi and can be installed with:
50
+
## Installing as a dependency (PyPI or uv)
51
+
Releases are on PyPI. To use PII Codex in another project:
62
52
63
53
```bash
64
54
pip install pii-codex
65
55
pip install "pii-codex[detections]"
66
56
```
67
57
68
-
`Note: The extras installed with pii-codex[detections] are the spaCy, Micrisoft Presidio Analyzer, and Microsoft Anonymzer packages.`
69
-
70
-
Using uv:
58
+
With uv:
71
59
72
60
```bash
73
-
uv sync
74
61
uv add pii-codex
75
62
uv add "pii-codex[detections]"
76
63
```
77
64
65
+
The `[detections]` extra installs spaCy, Microsoft Presidio Analyzer and Anonymizer, and the `en_core_web_lg` model (via a direct wheel URL), so detection works out of the box. If you install without the extra and later use detection features, the code will try to install the model on first use when possible.
66
+
78
67
For those using Google Colab, check out the example notebook:
79
68
80
69
[PII Codex sample notebook on Colab](https://colab.research.google.com/gist/EdyVision/802ce21aab21eb5d9afa9e43d301eef7/pii-codex-sample-notebook.ipynb)
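
The README text above says the `[detections]` extra brings in the `en_core_web_lg` model. As a minimal sketch, not part of PII Codex itself, the following check uses only the Python standard library to see whether the model package is visible in the current environment. The helper name `model_installed` is hypothetical, and the check assumes the model is installed as an importable package named `en_core_web_lg`, which is how spaCy normally distributes its models.

```python
import importlib.util


def model_installed(package_name: str = "en_core_web_lg") -> bool:
    """Return True if the named top-level package can be found on the import path.

    Hypothetical helper for illustration; it is not a PII Codex or spaCy API.
    """
    # find_spec only locates the package; it does not import or load the model.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if model_installed():
        print("en_core_web_lg is installed")
    else:
        print('en_core_web_lg is missing; try: pip install "pii-codex[detections]"')
```

Running this in the target virtual environment before using detection features can help separate a missing model from other setup problems.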