
How Anonymaze works

Updated: April 28, 2026


Anonymaze is a thin native shell over a deterministic NLP pipeline. This page walks the same path your text takes when you anonymise a document, in the same order: from “user drops file” to “copy clean output.”

Stack overview

| Component | Tech | Role |
| --- | --- | --- |
| macOS shell | SwiftUI | Native UI, file picker, OCR via Apple Vision |
| Windows shell | Electron + React | Day-1 Windows distribution (v0.1 internal beta, see desktop-windows/) |
| Marketing site | Next.js 16 + React 19 + Tailwind 4 | This website + the online demo (anonymaze.ai) |
| NLP engine | Proprietary multi-engine pipeline | Detection pipeline shared by desktop and online |
| Backend API | FastAPI + Uvicorn + slowapi | Hosts the online demo and Pro tier |

Detection pipeline

  1. Regex stage. Deterministic patterns for structured PII (emails, phones, SSNs, IBANs, credit cards, country-specific IDs). Fast, high precision, near-zero false positives on well-formed inputs.
  2. Surface heuristics. Locale-aware tokenisation, capitalisation cues, honorific handling. Bridges between regex and statistical NER.
  3. Named-entity recognition. Per-locale neural NER model tuned for each language. Detects PERSON, ORG, LOC, ADDRESS, DATE.
  4. False-positive filter. Per-locale rule set rejects common known false positives (capitalised sentence-initial words, brand names that overlap with surnames, etc.). Lives in `python_engine/languages/<locale>/false_positives.py`.
  5. Deduplication and span resolution. Overlapping candidates from the three detection stages (regex, surface heuristics, NER) that survive the false-positive filter are reconciled into a single non-overlapping span list. Coreferent mentions (“John Smith”, later “Mr. Smith”, later just “John”) collapse to one placeholder.
  6. Output. Final JSON with `anonymized_text`, an entity list (placeholder + original + type + span), a mapping for round-trip restore, per-type stats, the detected language, and the processing time in milliseconds.
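
The stages above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the engine's code: the names `Candidate`, `regex_stage`, `resolve_spans`, and `anonymize` are hypothetical, the two patterns are deliberately simplistic, the "longest span wins" overlap policy is an assumption, and every output field name other than `anonymized_text` is a guess at the shape described in step 6. Coreference handling and the NER stage are omitted; NER results are passed in as `extra_candidates`.

```python
import re
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    start: int
    end: int
    type: str
    source: str  # which stage proposed the span (hypothetical field)


# Illustrative patterns only; production detectors would be stricter.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def regex_stage(text):
    """Stage 1: deterministic patterns for structured PII."""
    return [
        Candidate(m.start(), m.end(), pii_type, "regex")
        for pii_type, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]


def resolve_spans(candidates):
    """Stage 5 (assumed policy): keep the longest spans, drop overlaps."""
    kept = []
    for c in sorted(candidates, key=lambda c: (-(c.end - c.start), c.start)):
        if all(c.end <= k.start or c.start >= k.end for k in kept):
            kept.append(c)
    return sorted(kept, key=lambda c: c.start)


def anonymize(text, extra_candidates=()):
    """Stage 6: build the output document (field names are assumptions)."""
    t0 = time.perf_counter()
    spans = resolve_spans(regex_stage(text) + list(extra_candidates))
    placeholders = {}  # (type, original) -> placeholder
    counters = {}      # type -> number of distinct originals seen
    entities, pieces, cursor = [], [], 0
    for s in spans:
        original = text[s.start:s.end]
        key = (s.type, original)
        if key not in placeholders:
            counters[s.type] = counters.get(s.type, 0) + 1
            placeholders[key] = f"[{s.type}_{counters[s.type]}]"
        pieces += [text[cursor:s.start], placeholders[key]]
        cursor = s.end
        entities.append({"placeholder": placeholders[key], "original": original,
                         "type": s.type, "span": [s.start, s.end]})
    pieces.append(text[cursor:])
    return {
        "anonymized_text": "".join(pieces),
        "entities": entities,
        "mapping": {ph: orig for (_, orig), ph in placeholders.items()},
        "stats": counters,
        "language": "en",  # hard-coded here; the real pipeline auto-detects
        "processing_time_ms": round((time.perf_counter() - t0) * 1000, 3),
    }
```

Calling `anonymize("Mail jane@example.com, SSN 123-45-6789.")` replaces both values with placeholders and returns them in `mapping`. Repeated occurrences of the same original share one placeholder, which is the simplest form of the deduplication described in step 5.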

Language packs

We ship 15 locales today, organised in two tiers. Tier A (target precision and recall ≥ 0.95): English, Russian, French, German, Spanish (Spain). Tier B (target ≥ 0.85): Chinese, Hindi, Italian, Japanese, Korean, Polish, Portuguese (Brazil), Portuguese (Portugal), Spanish (LatAm), Turkish. Auto-detection routes inputs to the right pack via a character + word heuristic.

| Code | Language | NER engine |
| --- | --- | --- |
| zh | Chinese | Proprietary NLP pipeline |
| en | English | Proprietary NLP pipeline |
| fr | French | Proprietary NLP pipeline |
| de | German | Proprietary NLP pipeline |
| hi | Hindi | Proprietary NLP pipeline |
| it | Italian | Proprietary NLP pipeline |
| ja | Japanese | Proprietary NLP pipeline |
| ko | Korean | Proprietary NLP pipeline |
| pl | Polish | Proprietary NLP pipeline |
| pt-BR | Portuguese (Brazil) | Proprietary NLP pipeline |
| pt-PT | Portuguese (Portugal) | Proprietary NLP pipeline |
| ru | Russian | Proprietary NLP pipeline |
| es | Spanish (Spain) | Proprietary NLP pipeline |
| es-419 | Spanish (LatAm) | Proprietary NLP pipeline |
| tr | Turkish | Proprietary NLP pipeline |
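
A character + word heuristic of the kind mentioned above could look roughly like the following. This is a sketch under assumptions, not the shipped detector: the function name `detect_language`, the Unicode ranges chosen, and the tiny stopword lists are all illustrative, and regional variants (pt-BR vs pt-PT, es vs es-419) are out of scope here.

```python
import re

# Character pass: script checks, ordered so kana wins over Han,
# because Japanese text mixes both.
SCRIPT_RULES = [
    ("ja", re.compile(r"[\u3040-\u30ff]")),  # Hiragana + Katakana
    ("ko", re.compile(r"[\uac00-\ud7af]")),  # Hangul syllables
    ("zh", re.compile(r"[\u4e00-\u9fff]")),  # CJK unified ideographs
    ("hi", re.compile(r"[\u0900-\u097f]")),  # Devanagari
    ("ru", re.compile(r"[\u0400-\u04ff]")),  # Cyrillic
]

# Word pass: tiny illustrative stopword lists for Latin-script languages.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to", "with"},
    "fr": {"le", "les", "et", "est", "des", "une"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
    "es": {"el", "los", "las", "y", "una", "pero"},
    "it": {"il", "gli", "che", "della", "sono", "non"},
    "pl": {"nie", "się", "jest", "że", "oraz", "który"},
    "tr": {"ve", "bir", "bu", "için", "ile", "değil"},
}


def detect_language(text, default="en"):
    """Return a locale code using script first, then stopword votes."""
    for code, pattern in SCRIPT_RULES:
        if pattern.search(text):
            return code
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {code: sum(w in stop for w in words)
              for code, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

A single foreign-script character is enough to trigger the character pass in this sketch, so a production heuristic would more plausibly compare script proportions before deciding.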

Country-specific IDs

Beyond the universal entity types, Anonymaze ships dedicated detectors for these country-specific identifiers. The list grows with each language pack we add.

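| Country | National identifiers detected |
| --- | --- |
| United States | SSN |
| Russia | INN, SNILS, passport, OMS |
| Spain / Mexico | DNI, NIE (Spain); CURP, RFC (Mexico) |
| Portugal / Brazil | NIF (Portugal); CPF (Brazil) |
| France | NIR (sécu) |
| Germany | Steuer-ID |
| China | 身份证 (National ID) |
| Italy | Codice Fiscale, P.IVA |
| Poland | PESEL |
| Turkey | T.C. Kimlik No |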
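
To illustrate what a dedicated detector can add over a bare digit pattern, here is a sketch for one identifier from the table, the Polish PESEL. The checksum weights are the publicly documented ones; whether Anonymaze's detector validates checksums at all is an assumption, and the names `pesel_checksum_ok` and `find_pesel` are hypothetical.

```python
import re

# 11 ASCII digits standing alone; re.ASCII keeps \d from matching other scripts.
PESEL_RE = re.compile(r"\b\d{11}\b", re.ASCII)
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def pesel_checksum_ok(digits):
    """Check the 11th digit against the weighted sum of the first ten."""
    if len(digits) != 11 or not (digits.isascii() and digits.isdigit()):
        return False
    total = sum(int(d) * w for d, w in zip(digits, PESEL_WEIGHTS))
    return (10 - total % 10) % 10 == int(digits[10])


def find_pesel(text):
    """Return (start, end) spans of 11-digit runs with a valid checksum."""
    return [(m.start(), m.end())
            for m in PESEL_RE.finditer(text)
            if pesel_checksum_ok(m.group())]
```

Checksum validation is one way to keep precision high on structured identifiers: an arbitrary 11-digit number passes the pattern but fails the check about nine times out of ten.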

Reversibility (round-trip after AI)

When you turn on reversibility, Anonymaze emits a sidecar mapping file alongside the anonymized text so you can restore the originals after running the text through an AI. This is a Pro feature. Engine source-of-truth: PR #130 (`schema_version: "1.0"`).

  • Hex-tag placeholders: instead of `[PERSON_1]`, the engine emits stable per-document tags like `[PERSON_a3f2]`. The 4-character hex suffix is derived from a per-document salt, so the same name in two different documents gets different tags, which keeps tags from being correlated across shared documents.
  • Sidecar `mapping.json`: a small JSON file with `schema_version`, `doc_salt`, and an `entities[]` array of `{tag, original, type, positions, confidence}`. You download it after anonymizing and keep it private.
  • Local-only by default: the mapping never leaves your device unless you choose to upload it. The desktop app writes it next to the document; the web demo lets you save it as a file.
  • Round-trip flow: you anonymize → paste the redacted text into ChatGPT/Claude/your AI → paste the AI's response back into Anonymaze with the mapping → Anonymaze restores the originals into the response.
  • Tag collisions and unknown tags: if the AI invents new `[PERSON_xxxx]` tags that aren't in your mapping, we leave them as-is and surface a count so you can review.
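
The round trip described in the list above can be sketched as follows. The top-level keys (`schema_version`, `doc_salt`, `entities[]`) and the per-entity fields come from the description above; everything else is an assumption, in particular the SHA-256 based tag derivation and the helper names `make_tag`, `build_mapping`, `dump_mapping`, and `restore`. The sketch does not handle two different originals hashing to the same 4-character suffix.

```python
import hashlib
import json
import re
import secrets

# Matches tags shaped like [PERSON_a3f2]: uppercase type, 4 lowercase hex chars.
TAG_RE = re.compile(r"\[([A-Z]+)_([0-9a-f]{4})\]")


def make_tag(entity_type, original, doc_salt):
    """Derive a per-document tag (assumed scheme: salted SHA-256 prefix)."""
    payload = f"{doc_salt}:{entity_type}:{original}".encode("utf-8")
    return f"[{entity_type}_{hashlib.sha256(payload).hexdigest()[:4]}]"


def build_mapping(entities, doc_salt=None):
    """entities: iterable of (type, original, positions, confidence)."""
    doc_salt = doc_salt or secrets.token_hex(16)
    return {
        "schema_version": "1.0",
        "doc_salt": doc_salt,
        "entities": [
            {"tag": make_tag(t, o, doc_salt), "original": o, "type": t,
             "positions": p, "confidence": c}
            for t, o, p, c in entities
        ],
    }


def dump_mapping(mapping):
    """Serialise the sidecar as plain JSON, to be stored privately."""
    return json.dumps(mapping, ensure_ascii=False, indent=2)


def restore(ai_response, mapping):
    """Swap known tags back to originals; count tags the mapping lacks."""
    known = {e["tag"]: e["original"] for e in mapping["entities"]}
    unknown = 0

    def swap(match):
        nonlocal unknown
        tag = match.group(0)
        if tag in known:
            return known[tag]
        unknown += 1  # left as-is so the user can review it
        return tag

    return TAG_RE.sub(swap, ai_response), unknown
```

`restore` returns both the restored text and the number of well-formed tags that were not in the mapping, mirroring the "leave as-is and surface a count" behaviour described in the last bullet.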

Pro enhancement on the Q3 roadmap: AES-256-GCM encrypted mapping files with Argon2id KDF. Until that ships, the mapping file is plain JSON — treat it like you would treat the original document and store it accordingly.

Two flows: where your data goes

Desktop (offline)

Your file is read from disk by SwiftUI / Electron, handed to a local Python subprocess via Process + Pipe, and the JSON result is rendered. No network call is made along this path. See /security for the verification methodology.
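
On the Python side, a subprocess driven over pipes only needs to read a request from stdin and write a result to stdout. The sketch below assumes one JSON document in and one JSON line out; that framing, the `ok`/`result`/`error` envelope, and the names `serve_once` and `fake_anonymize` are hypothetical and stand in for the real engine entry point.

```python
import io
import json
import sys


def fake_anonymize(text):
    # Stand-in for the real engine call (hypothetical).
    return {"anonymized_text": text.replace("Jane", "[PERSON_1]")}


def serve_once(stdin=sys.stdin, stdout=sys.stdout, engine=fake_anonymize):
    """Read one JSON request from stdin, write one JSON line to stdout."""
    try:
        request = json.load(stdin)
        response = {"ok": True, "result": engine(request["text"])}
    except Exception as exc:  # report failures to the shell instead of crashing
        response = {"ok": False, "error": str(exc)}
    json.dump(response, stdout, ensure_ascii=False)
    stdout.write("\n")
    stdout.flush()


# Usage with in-memory streams instead of real pipes:
out = io.StringIO()
serve_once(io.StringIO('{"text": "Hi Jane"}'), out)
reply = json.loads(out.getvalue())
```

Nothing in this path opens a socket: the only channels are the two pipes owned by the parent shell process.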

Online demo

Your text is sent over TLS 1.3 to api.anonymaze.ai, processed by an in-memory FastAPI handler with a fresh Anonymizer instance per request, returned, and discarded. No persistent storage. Rate-limited to 5 requests per minute per IP.
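
The two policies in this paragraph, a per-IP limit of 5 requests per minute and a fresh `Anonymizer` per request, can be illustrated without any web framework. The sketch below is a standard-library stand-in, not the production handler: the real service uses slowapi on FastAPI, whose windowing strategy may differ from this sliding window, and the `Anonymizer.run` method and `handle_request` function are hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests=5, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # client IP -> timestamps of recent hits

    def allow(self, client_ip):
        now = self.clock()
        hits = self._hits[client_ip]
        # Forget hits that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


class Anonymizer:
    """Hypothetical stand-in for the engine; holds no state between calls."""

    def run(self, text):
        return {"anonymized_text": text}


def handle_request(limiter, client_ip, text):
    """Return (status_code, body) the way an HTTP handler might."""
    if not limiter.allow(client_ip):
        return 429, {"detail": "rate limit exceeded"}
    engine = Anonymizer()  # fresh instance per request, discarded on return
    return 200, engine.run(text)
```

Because the engine object is created inside the handler and never stored, nothing about one request's text survives into the next; only the limiter keeps per-IP timestamps.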

Engineering posture

Anonymaze is built on rigorously audited, permissively licensed components and a proprietary orchestration layer that handles false-positive filtering, span resolution, and the per-locale tuning behind the precision and recall targets listed under Language packs.

Performance characteristics

| Metric | Value |
| --- | --- |
| Typical detection time (online demo, 1 KB English) | ~85 ms server-side |
| Typical detection time (desktop, 1 KB English) | ~120 ms (cold cache); ~40 ms (warm) |
| Free-tier text limit | 50,000 characters per request |
| Pro-tier text limit | 500,000 characters per request |
| File-size limit (free) | 10 MB per document |
| Supported file formats | TXT, DOCX, PDF, XLSX, PPTX, CSV, RTF + image OCR (PNG, JPG) |
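
A client could check these limits before submitting anything. The sketch below encodes only the figures listed above; the function names `check_text` and `check_file` are hypothetical, "10 MB" is assumed to mean 10 × 1024 × 1024 bytes, and because no Pro file-size limit is stated, the sketch leaves Pro file sizes unchecked rather than guessing one.

```python
from pathlib import PurePath

TEXT_LIMITS = {"free": 50_000, "pro": 500_000}  # characters per request
FREE_FILE_LIMIT_BYTES = 10 * 1024 * 1024        # assumption: 10 MB means 10 MiB
SUPPORTED_EXTENSIONS = {
    ".txt", ".docx", ".pdf", ".xlsx", ".pptx", ".csv", ".rtf",  # documents
    ".png", ".jpg",                                              # image OCR
}


def check_text(text, tier="free"):
    """Return None if the text fits the tier limit, else a reason string."""
    limit = TEXT_LIMITS[tier]
    if len(text) > limit:
        return f"text has {len(text)} characters; {tier} tier allows {limit}"
    return None


def check_file(filename, size_bytes, tier="free"):
    """Return None if the file looks acceptable, else a reason string."""
    ext = PurePath(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return f"unsupported format: {ext or '(no extension)'}"
    # Only the free-tier file limit is documented; Pro is not checked here.
    if tier == "free" and size_bytes > FREE_FILE_LIMIT_BYTES:
        return f"file is {size_bytes} bytes; free tier allows {FREE_FILE_LIMIT_BYTES}"
    return None
```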