Chonky is a transformer model that intelligently segments text into meaningful semantic chunks. This model can be used in RAG systems.
Now multilingual!
Model Description
The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
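As a quick illustration of that downstream step, the sketch below embeds the produced chunks and retrieves the best match for a query. This is not part of Chonky itself; the sentence-transformers model name and the example chunks are placeholders.

```python
# A minimal sketch of the downstream retrieval step, not part of Chonky itself.
# The sentence-transformers model name and the example chunks are placeholders.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chunks = [
    "First semantic chunk produced by the splitter ...",
    "Second semantic chunk produced by the splitter ...",
]
chunk_embeddings = embedder.encode(chunks, normalize_embeddings=True)

query = "What did the author work on before college?"
query_embedding = embedder.encode([query], normalize_embeddings=True)[0]

# With normalized embeddings, cosine similarity reduces to a dot product.
scores = chunk_embeddings @ query_embedding
print(chunks[int(np.argmax(scores))])
```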
⚠️ This model was fine-tuned on sequences of length 1024 (by default mmBERT supports sequence lengths up to 8192).
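To check whether an input fits within that fine-tuned window before splitting it, you can count tokens with the checkpoint's own tokenizer. A minimal sketch, assuming the tokenizer loads through the standard transformers API:

```python
# A minimal sketch for checking an input against the 1024-token fine-tuning
# window, assuming the tokenizer loads via the standard transformers API.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mirth/chonky_mmbert_small_multilingual_1")

def fits_window(text: str, max_tokens: int = 1024) -> bool:
    # Count tokens the same way the model will see them.
    return len(tokenizer(text)["input_ids"]) <= max_tokens

print(fits_window("Before college the two main things I worked on were writing and programming."))
```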
How to use
I've made a small Python library for this model: chonky
Here is the basic usage:
```python
from chonky import ParagraphSplitter

# On the first run this will download the transformer model.
splitter = ParagraphSplitter(
    model_id="mirth/chonky_mmbert_small_multilingual_1",
    device="cpu",
)

text = (
    "Before college the two main things I worked on, outside of school, were writing and programming. "
    "I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. "
    "My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. "
    "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' "
    "This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, "
    "and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — "
    "CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."
)

for chunk in splitter(text):
    print(chunk)
    print("--")
```
Sample Output:
...
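If you prefer to call the checkpoint through the plain transformers API rather than the chonky helper library, something along the following lines should work, assuming the checkpoint exposes a token-classification head whose predicted spans mark split points. The label names are not documented here, so inspect the pipeline output before relying on it.

```python
# A sketch of calling the checkpoint directly, assuming it is a
# token-classification model whose predicted spans mark split points.
# Inspect the returned labels yourself; they are not documented here.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "mirth/chonky_mmbert_small_multilingual_1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

pipe = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

text = "Before college the two main things I worked on, outside of school, were writing and programming."
for entity in pipe(text):
    print(entity)
```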