In total, there were 102 participants in the Isaacus Beta Program, including Harvey, KPMG Law, Clyde & Co, Cleary Gottlieb, Alvarez & Marsal, Khaitan & Co, Gilbert + Tobin, Smokeball, Moonlit, LawY, Lawpath, UniCourt, and AccuFind. We thank each and every one of them for being amongst the first to play with Kanon 2 Enricher and for providing critical early feedback that helped improve it ahead of its release.
Over the coming weeks and months, we will be releasing our own applications built atop Kanon 2 Enricher, such as a new LLM-powered semantic chunking mode in semchunk, a new Python package for automatically converting plain text into Markdown, and a first-of-its-kind public knowledge graph of laws, regulations, cases, and contracts from around the world, all of which can be ingested into your own systems.
Kanon 2 Enricher is an architectural masterpiece
As the first hierarchical graphitization model, Kanon 2 Enricher was built entirely from scratch. Every single node, edge, and label representable in the Isaacus Legal Graph Schema (ILGS) corresponds to one or more bespoke task heads. Those task heads were trained jointly, with our Kanon 2 legal encoder foundation model producing the shared representations that all of those heads operate on. In total, we built 58 different task heads optimized with 70 different loss terms.
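To make the shape of this setup concrete, here is a deliberately toy sketch of the pattern described above: one shared encoder feeds many task heads, and every head's loss term is summed into a single joint objective. None of this is Isaacus's actual code; every name and computation here is an illustrative assumption.

```python
# Toy sketch (NOT Isaacus's implementation) of joint multi-head training:
# a shared encoder produces one representation, many task heads consume it,
# and the training objective is the sum of all of their loss terms.

def shared_encoder(text: str) -> list[float]:
    # Stand-in for the Kanon 2 encoder: maps text to a shared representation.
    return [ord(c) / 1000.0 for c in text[:8]]

# Each head would, in reality, score nodes, edges, or labels against targets;
# here each simply returns a dummy loss term from the shared representation.
def node_head(h: list[float]) -> float:
    return sum(h) * 0.1

def edge_head(h: list[float]) -> float:
    return sum(h) * 0.2

def label_head(h: list[float]) -> float:
    return sum(h) * 0.3

TASK_HEADS = [node_head, edge_head, label_head]

def joint_loss(text: str) -> float:
    # One forward pass through the shared encoder...
    h = shared_encoder(text)
    # ...then every task head contributes its own term to one joint loss.
    return sum(head(h) for head in TASK_HEADS)
```

The key property the sketch illustrates is that the encoder runs once per document while an arbitrary number of heads share its output, which is what makes jointly training 58 heads with 70 loss terms tractable.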
In designing Kanon 2 Enricher, we had to work around several hard constraints of ILGS, such as the requirements that each entity be anchored to a document through character-level spans corresponding to entity references and that all such spans be well-nested and globally laminar within a document (i.e., no two spans in a document may partially overlap). Wherever feasible, we enforced these schematic constraints architecturally, whether through masks or joint scoring, resorting to custom regularizing losses otherwise.
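The laminarity constraint is easy to state precisely: for any two spans, either they are disjoint or one fully contains the other. A minimal checker for that property (a hypothetical illustration, not part of the Isaacus API) looks like this:

```python
# Sketch of the ILGS laminarity constraint described above: no two
# character-level [start, end) spans in a document may partially overlap.
# Two spans are compatible if they are disjoint or one contains the other.

def spans_are_laminar(spans: list[tuple[int, int]]) -> bool:
    """Return True if no two [start, end) spans partially overlap."""
    for i, (a_start, a_end) in enumerate(spans):
        for b_start, b_end in spans[i + 1:]:
            overlaps = a_start < b_end and b_start < a_end
            a_contains_b = a_start <= b_start and b_end <= a_end
            b_contains_a = b_start <= a_start and a_end <= b_end
            # Overlapping spans where neither contains the other violate
            # laminarity, so the set cannot form a well-nested hierarchy.
            if overlaps and not (a_contains_b or b_contains_a):
                return False
    return True

# Nested and disjoint spans are fine; a partial overlap is not.
print(spans_are_laminar([(0, 100), (10, 20), (30, 40)]))  # True
print(spans_are_laminar([(0, 50), (40, 60)]))             # False
```

Laminarity is what guarantees that a document's spans can always be arranged into a tree, which the hierarchical segmentation described next depends on.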
One of the trickiest problems we had to tackle was hierarchical document segmentation, where every heading, reference, chapter, section, subsection, table, figure, and so on is extracted from a document hierarchically, such that segments can be contained within other segments at arbitrary depth. To solve this problem, we had to implement our own novel hierarchical segmentation architecture, decoding approach, and loss function.
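To give a feel for what decoding such a hierarchy involves, here is a simple sketch of one standard way to nest a flat set of laminar segment spans into a tree using a stack. This is an assumed illustration of the general technique, not Isaacus's actual decoding approach.

```python
# Sketch (not Isaacus's decoder) of nesting laminar [start, end) spans into a
# hierarchy: each span becomes a child of its smallest enclosing span.

def build_hierarchy(spans: list[tuple[int, int]]) -> list[dict]:
    """Return a list of top-level nodes; a node is
    {"span": (start, end), "children": [...]}."""
    # Sort by start ascending, then end descending, so an enclosing span
    # always appears before every span it contains.
    ordered = sorted(spans, key=lambda s: (s[0], -s[1]))
    roots: list[dict] = []
    stack: list[dict] = []
    for start, end in ordered:
        node = {"span": (start, end), "children": []}
        # Pop spans that end at or before this span's start: because the
        # input is laminar, they cannot enclose it.
        while stack and stack[-1]["span"][1] <= start:
            stack.pop()
        (stack[-1]["children"] if stack else roots).append(node)
        stack.append(node)
    return roots

# A section spanning the whole document, with two subsections, one of which
# contains a nested heading.
tree = build_hierarchy([(0, 100), (0, 40), (50, 90), (10, 30)])
```

Because the spans are laminar, this single linear pass always yields a valid tree; the hard part Kanon 2 Enricher solves is predicting spans that satisfy laminarity in the first place.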
Thanks to the many architectural innovations that have gone into Kanon 2 Enricher, it is extremely computationally efficient, far more so than a generative model. Instead of generating annotations token by token, which introduces the possibility of generative hallucinations, Kanon 2 Enricher directly annotates all the tokens in a document in a single shot. As a result, it takes Kanon 2 Enricher less than ten seconds to enrich the entirety of Dred Scott v. Sandford, the longest US Supreme Court decision, at 111,267 words. In that time, Kanon 2 Enricher identifies 178 people referenced in the decision some 1,340 times, 99 locations referenced 1,294 times, and 298 documents referenced 940 times.