“The Bitter Lesson” is wrong. Well… sort of.

Assaf Pinhasi

TL;DR

There is no dichotomy between domain knowledge and “general-purpose methods that leverage data + compute”. They are both powerful tools that compensate for each other and need to be balanced and traded off throughout the model-building process.

“The bitter lesson” in 30 seconds

“The Bitter Lesson” is one of the most popular opinion pieces about AI research and its future. In it, Rich Sutton draws a dichotomy between two main schools of AI research:

1. Research based on “human knowledge”
2. Research based on methods like “learning and search” that scale with more data and compute

Sutton claims that virtually all long-term progress in AI was achieved using the latter, and that the former is actually detrimental and distracting. He goes on to provide multiple very convincing examples.

The false conclusion

Many people who have read the post conclude that “you don’t need human knowledge, just use methods that rely on data + compute”. I don’t believe this dichotomy can exist in reality.

Counter-argument

No machine learning model was ever built using pure “human knowledge”, because then it wouldn’t be a learning model. It would be a hard-coded algorithm.

Similarly, no machine learning model was ever created without any “human knowledge”, because 1) models are still designed by humans, who make design decisions, and 2) models cannot learn useful things without human guidance.

We have no reason to believe that models which “search and learn” at huge scale as a black box will magically align to be useful to humans.

Evaluating models is an integral part of the model development lifecycle, and as such needs to be accounted for in the discussion about “human knowledge” vs. “search and learn”.

Alternative theory

The entire model-building process is guided by domain knowledge. The methods that apply this knowledge range from “direct” to “influential”.

On the “direct” end of the spectrum, we codify the knowledge explicitly: it can be seen directly in the code or data.

On the “influential” end of the spectrum, we create some derivative or translation between the domain model and the model’s behavior, and apply it at well-selected “pressure points” to guide what the model learns.

We need to choose an “operating point” on this spectrum multiple times, for different parts of the model lifecycle. Often, when we choose a more “influential” approach early on, we end up needing more “direct” methods later in the lifecycle, especially during evaluation. This is because giving the model higher degrees of freedom increases the risk of it learning catastrophic behaviors, given that its learning process is not at all aligned with a human-centric thinking process.

It’s possible that the sum total of investment in domain-knowledge-related tasks has not changed so dramatically over time. What may have changed is how this investment is distributed across the model-building lifecycle.
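To make the two ends of the spectrum concrete, here is a minimal sketch in Python, using a toy spam-detection task of my own choosing; the terms, weights, and function names are illustrative assumptions, not anything from Sutton’s essay:

```python
# Hypothetical sketch contrasting the two ends of the spectrum on a toy
# spam-detection task. Terms, weights, and names are illustrative only.

SUSPICIOUS_TERMS = {"free money", "act now", "wire transfer"}

def rule_based_spam(text: str) -> bool:
    """'Direct': the domain knowledge is codified explicitly; it *is* the code."""
    return any(term in text.lower() for term in SUSPICIOUS_TERMS)

def example_weight(text: str) -> float:
    """'Influential': the same knowledge applied at a training 'pressure point',
    here by up-weighting examples from a sub-domain we care about, while the
    model stays free to learn whatever features it finds useful."""
    return 3.0 if any(term in text.lower() for term in SUSPICIOUS_TERMS) else 1.0

print(rule_based_spam("Act now to claim free money!"))  # True
print(example_weight("Act now to claim free money!"))   # 3.0
```

In the first function the knowledge is legible in the source; in the second it only shifts what the learning process pays attention to.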
Example

An example can be seen in the lifecycle of building LLMs. We start off with a very broad, “influential” approach, and gradually become more explicit with our domain knowledge and judgement:

1. Self-supervision on massive-scale, highly varied datasets
2. Curated datasets in sub-domains where we want the model to be better (e.g. textbooks, coding competitions, etc.)
3. Human feedback, labels, and preferences
4. Guardrails and various alignment techniques
5. Evaluation with highly curated, domain-specific data and tools: from domain-rich judge prompts to code-execution environments, high-quality datasets labeled by experts, red-teaming, etc. (see the sketch after this list)
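As a sketch of that last, most “direct” stage, here is what a domain-rich judge prompt plus a tiny expert-labeled test set might look like. The rubric, the data, and the `call_judge_model` callable are assumptions for illustration, not a real API or anything described in the post:

```python
# Hypothetical sketch of "direct" domain knowledge at evaluation time:
# a domain-rich judge prompt plus a tiny expert-labeled test set.

JUDGE_PROMPT = """You are grading a model's answer to a medical-billing question.
Score 1-5 using this rubric:
  5 = cites the correct billing code and explains the edge cases
  3 = correct code, incomplete explanation
  1 = wrong or fabricated billing code
Question: {question}
Answer: {answer}
Respond with only the score."""

# Stand-in for a high-quality dataset labeled by domain experts.
EVAL_SET = [
    {"question": "Which CPT code applies to a routine follow-up office visit?",
     "reference": "99213"},
]

def judge_answer(question: str, answer: str, call_judge_model) -> int:
    # `call_judge_model` is any callable that sends a prompt to a judge LLM
    # and returns its text reply; parsing assumes the rubric was followed.
    reply = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip())
```

Note how much domain knowledge is stated outright here, in the rubric and the labeled examples, compared with the purely “influential” self-supervision stage at the start of the lifecycle.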