How Anonymaze works
Updated: April 28, 2026
Anonymaze is a thin native shell over a deterministic NLP pipeline. This page walks the same path your text takes when you anonymise a document, in the same order: from “user drops file” to “copy clean output.”
Stack overview
| Component | Tech | Role |
|---|---|---|
| macOS shell | SwiftUI | Native UI, file picker, OCR via Apple Vision |
| Windows shell | Electron + React | Day-1 Windows distribution (v0.1 internal beta, see desktop-windows/) |
| Marketing site | Next.js 16 + React 19 + Tailwind 4 | This website + the online demo (anonymaze.ai) |
| NLP engine | Proprietary multi-engine pipeline | Detection pipeline shared by desktop and online |
| Backend API | FastAPI + Uvicorn + slowapi | Hosts the online demo and Pro tier |
Detection pipeline
- Regex stage. Deterministic patterns for structured PII (emails, phones, SSNs, IBANs, credit cards, country-specific IDs). Fast, high precision, near-zero false positives on well-formed inputs.
- Surface heuristics. Locale-aware tokenisation, capitalisation cues, honorific handling. Bridges between regex and statistical NER.
- Named-entity recognition. A neural NER model tuned per locale detects PERSON, ORG, LOC, ADDRESS, and DATE.
- False-positive filter. Per-locale rule set rejects common known false positives (capitalised sentence-initial words, brand names that overlap with surnames, etc.). Lives in `python_engine/languages/<locale>/false_positives.py`.
- Deduplication and span resolution. Overlapping candidates from the three detection stages (regex, surface heuristics, NER) are reconciled into a single non-overlapping span list. Coreferent mentions (“John Smith” → later “Mr. Smith” → later just “John”) collapse to one placeholder.
- Output. Final JSON with `anonymized_text`, an entity list (placeholder + original + type + span), a mapping for round-trip restore, per-type stats, the detected language, and the processing time in milliseconds.
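The regex stage, span resolution, and output steps can be sketched in a few lines of Python. This is a minimal illustration, not the engine's code: the `Span` type, the two patterns, the "longest span wins, regex outranks NER" tie-break, and the function names are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int
    end: int
    type: str
    source: str  # stage that proposed the span: "regex", "heuristic", or "ner"

# Hypothetical, deliberately loose patterns; production patterns would be stricter.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Assumed tie-break: deterministic regex hits outrank statistical ones.
PRIORITY = {"regex": 0, "heuristic": 1, "ner": 2}

def regex_stage(text):
    """Return one candidate span per regex match."""
    return [Span(m.start(), m.end(), etype, "regex")
            for etype, pattern in PATTERNS.items()
            for m in pattern.finditer(text)]

def resolve_spans(candidates):
    """Greedy resolution: keep longer spans first, drop anything that overlaps."""
    ranked = sorted(candidates,
                    key=lambda s: (-(s.end - s.start), PRIORITY[s.source], s.start))
    kept = []
    for s in ranked:
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)

def redact(text, spans):
    """Replace sorted, non-overlapping spans with numbered placeholders."""
    counters, parts, entities, cursor = {}, [], [], 0
    for s in spans:
        counters[s.type] = counters.get(s.type, 0) + 1
        placeholder = f"[{s.type}_{counters[s.type]}]"
        parts.append(text[cursor:s.start])
        parts.append(placeholder)
        entities.append({"placeholder": placeholder,
                         "original": text[s.start:s.end],
                         "type": s.type,
                         "span": [s.start, s.end]})
        cursor = s.end
    parts.append(text[cursor:])
    return {"anonymized_text": "".join(parts), "entities": entities}
```

In this sketch, a NER candidate that covers only the local part of an email address loses to the longer regex match, so the address is replaced once rather than twice.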
Language packs
We ship 15 locales today, organised in two tiers. Tier A (target precision and recall ≥ 0.95): English, Russian, French, German, Spanish (Spain). Tier B (target ≥ 0.85): Chinese, Hindi, Italian, Japanese, Korean, Polish, Portuguese (Brazil), Portuguese (Portugal), Spanish (LatAm), Turkish. Auto-detection routes inputs to the right pack via a character + word heuristic.
| Code | Language | NER engine |
|---|---|---|
| zh | Chinese | Proprietary NLP pipeline |
| en | English | Proprietary NLP pipeline |
| fr | French | Proprietary NLP pipeline |
| de | German | Proprietary NLP pipeline |
| hi | Hindi | Proprietary NLP pipeline |
| it | Italian | Proprietary NLP pipeline |
| ja | Japanese | Proprietary NLP pipeline |
| ko | Korean | Proprietary NLP pipeline |
| pl | Polish | Proprietary NLP pipeline |
| pt-BR | Portuguese (Brazil) | Proprietary NLP pipeline |
| pt-PT | Portuguese (Portugal) | Proprietary NLP pipeline |
| ru | Russian | Proprietary NLP pipeline |
| es | Spanish (Spain) | Proprietary NLP pipeline |
| es-419 | Spanish (LatAm) | Proprietary NLP pipeline |
| tr | Turkish | Proprietary NLP pipeline |
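A character + word heuristic of the kind mentioned above could look roughly like the following. The script ranges, the tiny stopword lists, and the `detect_locale` name are illustrative assumptions; the shipped detector covers all 15 locales and is certainly more careful.

```python
import re

# Script checks run in order. Kana is tested before Han because Japanese
# text usually mixes both; kanji-only Japanese would be misrouted to "zh".
SCRIPT_RANGES = [
    ("ko", re.compile(r"[\uac00-\ud7af]")),  # Hangul syllables
    ("ja", re.compile(r"[\u3040-\u30ff]")),  # Hiragana and Katakana
    ("zh", re.compile(r"[\u4e00-\u9fff]")),  # CJK Unified Ideographs
    ("hi", re.compile(r"[\u0900-\u097f]")),  # Devanagari
    ("ru", re.compile(r"[\u0400-\u04ff]")),  # Cyrillic
]

# Hypothetical stopword samples for a few Latin-script locales.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is"},
    "fr": {"le", "la", "les", "et", "est"},
    "de": {"der", "die", "das", "und", "ist"},
    "es": {"el", "los", "y", "es", "que"},
}

def detect_locale(text, default="en"):
    """Pick a locale by script first, then by stopword counts."""
    for locale, pattern in SCRIPT_RANGES:
        if pattern.search(text):
            return locale
    words = re.findall(r"\w+", text.lower())
    scores = {loc: sum(w in stop for w in words) for loc, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

Checking the script first keeps the word-level step small: only Latin-script inputs ever reach the stopword count.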
Country-specific IDs
Beyond the universal entity types, Anonymaze ships dedicated detectors for these country-specific identifiers. The list grows with each language pack we add.
| Country | National identifiers detected |
|---|---|
| United States | SSN |
| Russia | INN, SNILS, passport, OMS |
| Spain | DNI, NIE |
| Mexico | CURP, RFC |
| Portugal | NIF |
| Brazil | CPF |
| France | NIR (sécu) |
| Germany | Steuer-ID |
| China | 身份证 (National ID) |
| Italy | Codice Fiscale, P.IVA |
| Poland | PESEL |
| Turkey | T.C. Kimlik No |
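Many national identifiers carry a check digit, which lets a detector reject digit runs that merely look like an ID. As an example, here is a sketch for the Polish PESEL using its published weights. Whether the Anonymaze detectors validate checksums is not stated on this page, so treat the function names and the overall approach as assumptions.

```python
import re

PESEL_RE = re.compile(r"\b\d{11}\b")
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)

def pesel_checksum_ok(digits: str) -> bool:
    """True if the 11th digit matches the weighted sum of the first ten."""
    if len(digits) != 11 or not (digits.isascii() and digits.isdigit()):
        return False
    total = sum(int(d) * w for d, w in zip(digits, PESEL_WEIGHTS))
    return (10 - total % 10) % 10 == int(digits[10])

def find_pesel(text):
    """Return (start, end) offsets of 11-digit runs that pass the checksum."""
    return [(m.start(), m.end())
            for m in PESEL_RE.finditer(text)
            if pesel_checksum_ok(m.group())]
```

A pattern-only detector would flag every 11-digit number; adding the checksum is one common way to keep precision high on structured IDs.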
Reversibility (round-trip after AI)
When you turn on reversibility, Anonymaze emits a sidecar mapping file alongside the anonymized text so you can restore the originals after running the text through an AI. This is a Pro feature. Engine source-of-truth: PR #130 (`schema_version: "1.0"`).
- Hex-tag placeholders: instead of `[PERSON_1]`, the engine emits stable per-document tags like `[PERSON_a3f2]`. The 4-character hex suffix is derived from a per-document salt, so the same name in two different documents gets different tags — keeping the anonymized text safe to share.
- Sidecar `mapping.json`: a small JSON file with `schema_version`, `doc_salt`, and an `entities[]` array of `{tag, original, type, positions, confidence}`. You download it after anonymizing and keep it private.
- Local-only by default: the mapping never leaves your device unless you choose to upload it. The desktop app writes it next to the document; the web demo lets you save it as a file.
- Round-trip flow: you anonymize → paste the redacted text into ChatGPT/Claude/your AI → paste the AI's response back into Anonymaze with the mapping → Anonymaze restores the originals into the response.
- Tag collisions and unknown tags: if the AI invents new `[PERSON_xxxx]` tags that aren't in your mapping, we leave them as-is and surface a count so you can review.
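The round trip above can be sketched as follows. The page only says the hex suffix is derived from a per-document salt; using SHA-256 over salt, type, and original is an assumption, as are the function names. The sketch also omits the `positions` and `confidence` fields and does not handle two originals hashing to the same 4-character suffix, which a real implementation would need to address.

```python
import hashlib
import re
import secrets

# Matches tags such as [PERSON_a3f2]; the type alphabet is an assumption.
TAG_RE = re.compile(r"\[([A-Z_]+)_([0-9a-f]{4})\]")

def make_tag(doc_salt: str, entity_type: str, original: str) -> str:
    """Derive a stable per-document tag (assumed scheme: salted SHA-256)."""
    payload = f"{doc_salt}:{entity_type}:{original}".encode("utf-8")
    return f"[{entity_type}_{hashlib.sha256(payload).hexdigest()[:4]}]"

def build_mapping(entities, doc_salt=None):
    """entities: iterable of (type, original) pairs found in one document."""
    doc_salt = doc_salt or secrets.token_hex(16)
    rows = {}
    for entity_type, original in entities:
        tag = make_tag(doc_salt, entity_type, original)
        rows.setdefault(tag, {"tag": tag, "original": original, "type": entity_type})
    return {"schema_version": "1.0", "doc_salt": doc_salt,
            "entities": list(rows.values())}

def restore(ai_text, mapping):
    """Swap known tags back to originals; count tags missing from the mapping."""
    lookup = {e["tag"]: e["original"] for e in mapping["entities"]}
    unknown = 0

    def swap(match):
        nonlocal unknown
        tag = match.group(0)
        if tag in lookup:
            return lookup[tag]
        unknown += 1
        return tag  # left as-is so the user can review it

    return TAG_RE.sub(swap, ai_text), unknown
```

`restore` returns both the restored text and the number of unrecognised tags, mirroring the "leave as-is and surface a count" behaviour described above.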
Pro enhancement on the Q3 roadmap: AES-256-GCM encrypted mapping files with Argon2id KDF. Until that ships, the mapping file is plain JSON — treat it like you would treat the original document and store it accordingly.
Two flows: where your data goes
Desktop (offline)
Your file is read from disk by SwiftUI / Electron, handed to a local Python subprocess via Process + Pipe, and the JSON result is rendered. No network call is made along this path. See /security for the verification methodology.
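The desktop shells are written in Swift and Electron, but the same stdin/stdout hand-off can be shown in Python for illustration. The JSON-over-pipe request format and the stand-in child script below are assumptions; the real engine entry point and message shape are not documented here.

```python
import json
import subprocess
import sys

# Stand-in for the engine process (hypothetical): reads one JSON request
# from stdin and writes one JSON response to stdout. No sockets involved.
CHILD_SOURCE = r'''
import json, sys
request = json.load(sys.stdin)
text = request["text"]
json.dump({"anonymized_text": text.replace("John", "[PERSON_1]"),
           "language": "en"}, sys.stdout)
'''

def run_engine(text: str, timeout: float = 10.0) -> dict:
    """Send text to a local child process over a pipe and parse its reply."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=json.dumps({"text": text}),  # ASCII-safe by default
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if the child exits with a non-zero status
    )
    return json.loads(proc.stdout)
```

Because the parent and child talk only through pipes, the document text stays inside the two local processes.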
Online demo
Your text is sent over TLS 1.3 to api.anonymaze.ai, processed by an in-memory FastAPI handler with a fresh Anonymizer instance per request, returned, and discarded. No persistent storage. Rate-limited to 5 requests per minute per IP.
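To make the "5 requests per minute per IP" rule concrete, here is a small sliding-window limiter in plain Python. It is a stand-in for illustration only: the production API uses slowapi, whose internals may differ, and the class name and injectable clock are assumptions of this sketch.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each key."""

    def __init__(self, limit=5, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # in-memory only, nothing persisted

    def allow(self, ip: str) -> bool:
        now = self.clock()
        hits = self._hits[ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A handler would call `allow(client_ip)` before processing and answer with HTTP 429 when it returns `False`.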
Engineering posture
Anonymaze is built on rigorously audited, permissively licensed components and a proprietary orchestration layer that handles false-positive filtering, span resolution, and the per-locale tuning behind the precision and recall targets listed under Language packs.
Performance characteristics
| Metric | Value |
|---|---|
| Typical detection time (online demo, 1 KB English) | ~85 ms server-side |
| Typical detection time (desktop, 1 KB English) | ~120 ms (cold cache); ~40 ms (warm) |
| Free-tier text limit | 50,000 characters per request |
| Pro-tier text limit | 500,000 characters per request |
| File-size limit (free) | 10 MB per document |
| Supported file formats | TXT, DOCX, PDF, XLSX, PPTX, CSV, RTF + image OCR (PNG, JPG) |
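A client can check the limits in the table before sending anything. The sketch below hard-codes the numbers from the table; the function names are hypothetical, and the byte value of "10 MB" (taken here as 10 × 1024 × 1024) is an assumption.

```python
from pathlib import Path

TEXT_LIMITS = {"free": 50_000, "pro": 500_000}  # characters per request
FREE_FILE_LIMIT_BYTES = 10 * 1024 * 1024        # assumed meaning of "10 MB"
SUPPORTED_EXTENSIONS = {".txt", ".docx", ".pdf", ".xlsx", ".pptx",
                        ".csv", ".rtf", ".png", ".jpg"}

def check_text(text: str, tier: str = "free") -> None:
    """Raise ValueError if the text exceeds the tier's character limit."""
    limit = TEXT_LIMITS[tier]
    if len(text) > limit:
        raise ValueError(f"{len(text)} characters exceeds the {tier} limit of {limit}")

def is_supported(filename: str) -> bool:
    """True if the file extension is one of the listed formats."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```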