Chonky is a transformer model that intelligently segments text into meaningful semantic chunks. This model can be used in RAG systems.
Now multilingual!
Model Description
The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
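As a quick illustration of that downstream step, the sketch below embeds the produced chunks and retrieves the best match for a query. This is not part of Chonky itself; the sentence-transformers model name and the example chunks are placeholders.

```python
# A minimal sketch of the downstream retrieval step, not part of Chonky itself.
# The sentence-transformers model name and the example chunks are placeholders.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chunks = [
    "First semantic chunk produced by the splitter ...",
    "Second semantic chunk produced by the splitter ...",
]
chunk_embeddings = embedder.encode(chunks, normalize_embeddings=True)

query = "What did the author work on before college?"
query_embedding = embedder.encode([query], normalize_embeddings=True)[0]

# With normalized embeddings, cosine similarity reduces to a dot product.
scores = chunk_embeddings @ query_embedding
print(chunks[int(np.argmax(scores))])
```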
⚠️ This model was fine-tuned on sequences of length 1024 (by default mmBERT supports sequence lengths up to 8192).
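To check whether an input fits within that fine-tuned window before splitting it, you can count tokens with the checkpoint's own tokenizer. A minimal sketch, assuming the tokenizer loads through the standard transformers API:

```python
# A minimal sketch for checking an input against the 1024-token fine-tuning
# window, assuming the tokenizer loads via the standard transformers API.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mirth/chonky_mmbert_small_multilingual_1")

def fits_window(text: str, max_tokens: int = 1024) -> bool:
    # Count tokens the same way the model will see them.
    return len(tokenizer(text)["input_ids"]) <= max_tokens

print(fits_window("Before college the two main things I worked on were writing and programming."))
```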
How to use
I've made a small Python library for this model: chonky
Here is the basic usage:
```python
from chonky import ParagraphSplitter

# On the first run this will download the transformer model.
splitter = ParagraphSplitter(
    model_id="mirth/chonky_mmbert_small_multilingual_1",
    device="cpu",
)

text = (
    "Before college the two main things I worked on, outside of school, were writing and programming. "
    "I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. "
    "My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. "
    "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' "
    "This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, "
    "and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — "
    "CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."
)

for chunk in splitter(text):
    print(chunk)
    print("--")
```
Sample Output:
...
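If you prefer to call the checkpoint through the plain transformers API rather than the chonky helper library, something along the following lines should work, assuming the checkpoint exposes a token-classification head whose predicted spans mark split points. The label names are not documented here, so inspect the pipeline output before relying on it.

```python
# A sketch of calling the checkpoint directly, assuming it is a
# token-classification model whose predicted spans mark split points.
# Inspect the returned labels yourself; they are not documented here.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "mirth/chonky_mmbert_small_multilingual_1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

pipe = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

text = "Before college the two main things I worked on, outside of school, were writing and programming."
for entity in pipe(text):
    print(entity)
```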