
Show HN: Semlib – Semantic Data Processing


Semlib

Semlib is a Python library for building data processing and data analysis pipelines that leverage the power of large language models (LLMs). Semlib provides, as building blocks, familiar functional programming primitives like `map`, `reduce`, `sort`, and `filter`, but with a twist: Semlib's implementations of these operations are programmed with natural-language descriptions rather than code. Under the hood, Semlib handles complexities such as prompting, parsing, concurrency control, caching, and cost tracking.

pip install semlib

📖 API Reference · 🤔 Rationale · 💡 Examples

>>> presidents = await prompt(
...     "Who were the 39th through 42nd presidents of the United States?",
...     return_type=Bare(list[str]),
... )
>>> await sort(presidents, by="right-leaning")
['Jimmy Carter', 'Bill Clinton', 'George H. W. Bush', 'Ronald Reagan']
>>> await find(presidents, by="former actor")
'Ronald Reagan'
>>> await map(
...     presidents,
...     "How old was {} when he took office?",
...     return_type=Bare(int),
... )
[52, 69, 64, 46]

Rationale

Large language models are great at natural-language data processing and data analysis tasks, but when you have a large amount of data, you can't get high-quality results by just dumping all the data into a long-context LLM and asking it to complete a complex task in a single shot. Even with today's reasoning models and agents, this approach doesn't give great results.

This library provides an alternative. You can structure your computation using the building blocks that Semlib provides: functional programming primitives upgraded to handle semantic operations. This approach has a number of benefits.
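For comparison, here is what the classical, code-based versions of these primitives look like in plain Python. Semlib's semantic versions keep the same shape, but replace the lambdas and key functions below with natural-language criteria (this sketch uses only the standard library, not Semlib's API):

```python
# Classical functional primitives: each step is programmed with code.
# Semlib's semantic counterparts take natural-language descriptions
# (e.g. "right-leaning", "former actor") in place of these lambdas.
from functools import reduce

presidents = ["Jimmy Carter", "Ronald Reagan", "George H. W. Bush", "Bill Clinton"]

# map: transform each element
lengths = list(map(len, presidents))

# filter: keep elements matching a predicate
bushes = list(filter(lambda name: "Bush" in name, presidents))

# sorted: order elements by a key function
by_surname = sorted(presidents, key=lambda name: name.split()[-1])

# reduce: fold a list into a single value
total_chars = reduce(lambda acc, name: acc + len(name), presidents, 0)

print(lengths)    # [12, 13, 17, 12]
print(bushes)     # ['George H. W. Bush']
print(by_surname) # ['George H. W. Bush', 'Jimmy Carter', 'Bill Clinton', 'Ronald Reagan']
print(total_chars)  # 54
```

The upgrade Semlib makes is that the predicate or key no longer has to be expressible as code at all; "right-leaning" has no lambda equivalent.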

Quality. By breaking a sophisticated data processing task into simpler steps that today's LLMs can solve reliably, you can get higher-quality results, even in situations where an LLM could process the data in a single shot but would produce only barely acceptable results. (example: analyzing support tickets in Airline Support Report)

Feasibility. Even long-context LLMs have limitations (e.g., 1M tokens in today's frontier models). Furthermore, performance often drops off with longer inputs. By breaking down the data processing task into smaller steps, you can handle arbitrary-sized data. (example: sorting an arbitrary number of arXiv papers in arXiv Paper Recommendations)
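The pattern behind handling arbitrary-sized data is a divide-and-conquer reduce: process fixed-size chunks, then combine the partial results, so no single step ever exceeds the context window. A minimal sketch in plain Python, with an ordinary function (`sum`) standing in for the LLM call that a semantic pipeline would make (the function names here are illustrative, not Semlib's API):

```python
# Divide-and-conquer over arbitrary-sized input: each `combine` call only
# ever sees a bounded chunk, mirroring how a semantic reduce can keep every
# LLM request within the model's context window.
from typing import Callable, TypeVar

T = TypeVar("T")

def chunked_reduce(items: list[T], combine: Callable[[list[T]], T], chunk_size: int = 4) -> T:
    """Repeatedly combine fixed-size chunks until one result remains."""
    while len(items) > 1:
        items = [
            combine(items[i : i + chunk_size])
            for i in range(0, len(items), chunk_size)
        ]
    return items[0]

# Toy stand-in: "summarize" a chunk of numbers by summing it.
result = chunked_reduce(list(range(10)), combine=sum, chunk_size=4)
print(result)  # 45 (== sum(range(10))), computed 4 elements at a time
```

With an LLM-backed `combine` (e.g. "merge these partial summaries"), the same shape scales to inputs far beyond any single context window; the number of combine calls grows only linearly with the data.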
