Show HN: Searchable compression for JSON – ~99% page skip and sub-ms lookups

SEE: Searchable JSON Compression (Semantic Entropy Encoding)

combined ≈ 19.5% • lookup p50 ≈ 0.18 ms • skip ≈ 99%

Why it matters
SEE reduces both the data tax (storage/egress) and the CPU tax (decompress/parse) by keeping JSON searchable while compressed. It may not always be smaller than Zstd, but searchability + low I/O + random access leads to better TCO/ROI for many workloads.

① Download (Release) ・ ② OnePager (ROI) ・ ③ Try in 10 minutes

Enterprise / NDA inquiry → Private contact form
Under NDA: full VDR pack available. Please provide a company email (no confidential data required).

What is SEE?
- Schema-aware JSON compression: combines structure × delta × Zstd (+ Bloom / Skip) to stay searchable while compressed, with page-level random access.
- Design trade-off: favors low I/O & low latency (ms) and ~99% skip rate over minimal size.

Key metrics (Demo)
- Combined size: ≈ 19.5% of raw
- Lookup present (ms): p50 ≈ 0.18 / p95 ≈ 0.28 / p99 ≈ 0.34
- Skip ratio: present ≈ 0.99 / absent ≈ 0.992, Bloom density ≈ 0.30

ROI quick math
Savings/TB = (1 − 0.195) × Price_per_GB × 1000
Example: $0.05/GB → ≈ $40/TB, $0.25/GB → ≈ $200/TB

🔧 Try in 10 minutes
    python samples/quick_demo.py
Prints compression ratio, skip rate, Bloom density, and lookup latency (p50/p95/p99).

Demo package (Release v0.1.0)
Includes Python wheel, .see files, demo scripts, metrics, and OnePager PDF. Reproducible on Windows / macOS / Linux.
Verify integrity using:
    pwsh tools/verify_checksums.ps1   # or manually check SHA256SUMS.txt
KPI (demo): combined ≈ 19.5%, lookup p50 ≈ 0.18 ms, skip ≈ 99%, bloom ≈ 0.30.
Tradeoff: not always smaller than Zstd, but stays searchable while compressed, cutting I/O and CPU costs.

Why SEE vs Zstd-only?
- Zstd-only can be smaller, but not searchable; you still pay I/O + CPU to decompress and parse JSON.
- SEE trades a small size increase for millisecond lookups and page-level random access, reducing I/O and CPU and resulting in better TCO.

FAQ (short)
Q. Will it ever be larger than Zstd?
A. Sometimes yes; in return you get ms lookups and ~99% skipping. For I/O/CPU-bound workloads, TCO decreases.
Q. Best-fit data?
A. Repetitive JSON/NDJSON such as logs, events, telemetry, and metrics.
Q. How long to reproduce?
A. About 10 minutes using the included Demo ZIP.
Q. Why not build a separate index?
A. Separate indexes add extra I/O, space, and consistency risk. SEE keeps searchability inside the storage format, reducing random I/O and parsing overhead.
Q. How to tune for different data?
A. Adjust Bloom density (default ≈ 0.30, works best in 0.25–0.55). The demo prints all metrics for validation.

What's included in the Release ZIP
- Python Wheel (.whl)
- Demo scripts: samples/quick_demo.py, samples/quick_bench.py (prints KPIs)
- OnePager (PDF) and metrics/ summaries
- Integrity check script: tools/verify_checksums.ps1
- README_FIRST.md: concise reproduction guide

📦 VDR (Virtual Data Room): Evaluation Package

What it is
The SEE VDR is a private, NDA-only evaluation bundle that lets third parties reproduce our key KPIs on their own machine:
- Compression: combined size ≈ 19.5% of raw
- Lookup latency: p50 ≈ 0.18 ms
- Skipping: ~99% page-level skip

What it contains (high level)
- Sample .see artifacts with minimal metadata (for reproducible tests)
- A prebuilt evaluation wheel (binary-only) for quick local runs
- KPI summaries (CSV/JSON) and a frozen results snapshot
- Simple verification scripts (checksums / quality-gate)
- A concise One-Pager and evaluator README

ℹ️ Implementation details (core algorithms, dictionaries, low-level parameters) remain proprietary and are not disclosed in this repository.

Access policy
- Distributed on request under NDA (no public download).
- To request access, please contact us via LinkedIn (see Official Links & Profiles) with the subject: "SEE VDR Access".
- Redistribution, reverse engineering, and public benchmarking of VDR binaries are prohibited.
- An Evaluation EULA applies in addition to the NDA.

How evaluators use it (under NDA)
1. Verify package integrity (checksums script).
2. Install the provided evaluation wheel into a clean virtual environment.
3. Run the 10-minute demo to print ratio / skip / bloom / p50–p99.
4. Compare local output with the included KPI snapshot (apples-to-apples).

Why VDR?
- Ensures reproducible, verifiable numbers without exposing the core IP.
- Shortens technical diligence for FinOps / M&A / platform teams while keeping trade secrets protected.

If you only need the public demo, see the repository's samples and Release assets. The VDR is reserved for formal evaluations (NDA) that require deeper verification.

Links
- Docs / Site: https://kodomonocc1.github.io/see_proto/
- Latest Release (Demo ZIP + Wheel + OnePager + SHA256): https://github.com/kodomonocch1/see_proto/releases/tag/v0.1.0
- Enterprise / NDA contact (private): https://docs.google.com/forms/d/e/1FAIpQLScV2Ti592K3Za2r_WLUd0E6xSvCEVnlEOxYd6OGgbpJm0ADlg/viewform?usp=header

Note: The GitHub Discussions "Enterprise (NDA)" category is public. Do not post confidential information or emails there; use the private form above.

🔗 Official Links & Profiles
📬 If you're interested in schema-aware compression, reproducible benchmarks, or potential collaboration, feel free to connect via LinkedIn.
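For readers who want to plug their own prices into the ROI quick math (Savings/TB = (1 − 0.195) × Price_per_GB × 1000), here is a minimal Python sketch. The function name `savings_per_tb` and its parameters are illustrative assumptions and are not part of the SEE package; like the formula, it takes 1 TB = 1000 GB and uses the demo's combined ratio of 0.195 as the default.

```python
def savings_per_tb(price_per_gb: float, combined_ratio: float = 0.195) -> float:
    """Dollar savings per TB of raw JSON for a given storage/egress price.

    Hypothetical helper (not a SEE API). It implements the quick-math formula
    Savings/TB = (1 - combined_ratio) * price_per_gb * 1000.
    """
    if not 0.0 <= combined_ratio <= 1.0:
        raise ValueError("combined_ratio must be between 0 and 1")
    if price_per_gb < 0:
        raise ValueError("price_per_gb must be non-negative")
    return (1.0 - combined_ratio) * price_per_gb * 1000.0


if __name__ == "__main__":
    # The two example prices quoted in the ROI quick math.
    for price in (0.05, 0.25):
        print(f"${price:.2f}/GB -> ${savings_per_tb(price):.2f}/TB")
```

Passing a different `combined_ratio` (for example, the ratio printed by the demo on your own data) shows how sensitive the savings estimate is to the achieved compression.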
From Bytes to Balance Sheets: SEE (Semantic Entropy Encoding)

Optional: For reproducibility or citation
If you reproduce benchmarks or use SEE in your research, please cite:
