## What is HomeSec-Bench?

A benchmark we created to evaluate LLMs on real home security assistant workflows: not generic chat, but the actual reasoning, triage, and tool use an AI home security system needs.
All 35 fixture images are AI-generated (no real user footage). Tests run against any OpenAI-compatible endpoint.
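"OpenAI-compatible endpoint" means the server speaks the Chat Completions wire format, so any local or hosted model behind such an endpoint can be tested. A minimal sketch of that request shape, using only the standard library; the function names, URL, and model name here are illustrative placeholders, not part of the benchmark's harness:

```python
import json
import urllib.request


def build_request(model: str, messages: list[dict]) -> bytes:
    """Serialize a Chat Completions request body (the wire format
    an OpenAI-compatible endpoint expects)."""
    return json.dumps({"model": model, "messages": messages}).encode()


def chat_completion(base_url: str, api_key: str,
                    model: str, messages: list[dict]) -> dict:
    """POST to {base_url}/chat/completions and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=build_request(model, messages),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Swapping models then only requires changing `base_url` and `model`, e.g. pointing at a local inference server instead of a hosted API.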
| Category | Tests | Focus |
|----------|------:|-------|
| 📋 Context Preprocessing | 6 | Deduplicating conversations, preserving system messages |
| 🏷️ Topic Classification | 4 | Routing queries to the right domain |
| 🧠 Knowledge Distillation | 5 | Extracting durable facts from conversations |
| 🔔 Event Deduplication | 8 | "Same person or new visitor?" across cameras |
| 🔧 Tool Use | 16 | Selecting correct tools with correct parameters |
| 💬 Chat & JSON Compliance | 11 | Persona, JSON output, multilingual |
| 🚨 Security Classification | 12 | Normal → Monitor → Suspicious → Critical triage |
| 📖 Narrative Synthesis | 4 | Summarizing event logs into daily reports |
| 🛡️ Prompt Injection Resistance | 4 | Role confusion, prompt extraction, escalation |
| 🔄 Multi-Turn Reasoning | 4 | Reference resolution, temporal carry-over |
| ⚠️ Error Recovery | 4 | Handling impossible queries, API errors |
| 🔒 Privacy & Compliance | 3 | PII redaction, illegal surveillance rejection |
| 📡 Alert Routing | 5 | Channel routing, quiet hours parsing |
| 💉 Knowledge Injection | 5 | Using injected KIs to personalize responses |
| 🚨 VLM-to-Alert Triage | 5 | End-to-end: VLM output → urgency → alert dispatch |
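The Normal → Monitor → Suspicious → Critical triage in Security Classification is ordinal: mislabeling a Critical event as Monitor is a worse miss than mislabeling it as Suspicious. A hedged sketch of how ordinal triage levels might be represented and compared; this is illustrative only, not the benchmark's actual scorer:

```python
from enum import IntEnum


class Severity(IntEnum):
    """Ordinal triage levels: higher value = more urgent."""
    NORMAL = 0
    MONITOR = 1
    SUSPICIOUS = 2
    CRITICAL = 3


def triage_distance(predicted: Severity, expected: Severity) -> int:
    """How many levels off a model's triage call was (0 = exact match)."""
    return abs(predicted - expected)
```

Under this scheme, calling a `CRITICAL` event `MONITOR` scores a distance of 2, while calling it `SUSPICIOUS` scores 1, so graded mistakes are penalized proportionally rather than all-or-nothing.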