[AI] Anatomy of a Model
How a large-scale AI model is built, from raw data to the final products we use every day.
Audio version: An AI-generated two-person discussion exploring the ideas in this post.
An Honest Number
The year is 1935. The Great Depression clings to the American dream of homeownership. Foreclosure signs are a grim feature of the landscape. In response, President Roosevelt’s administration created the Federal Housing Administration, or FHA. Its mission is bold: to stabilize the housing market by insuring mortgages from private lenders.
But to insure a loan, the FHA first has to answer a hard question: what is a house actually worth?
Before this, a property's value was a matter of local wisdom. An appraiser might walk through a property, kick the baseboards, and declare a value. Two appraisers could arrive at wildly different numbers for the same house. For a new federal agency underwriting risk on a national scale, this subjectivity was untenable. The FHA needed a system, something repeatable, defensible, and rooted in objective fact.
This is where you came in.
The Search for Data
You were a young statistician, freshly hired by the FHA. Your mandate was simple and immense: create a method to produce a reliable estimate of a single-family home's market value. You began by framing the problem. The "market value," you decided, was simply what a house recently sold for in a fair transaction. This sale price was the number you wanted to predict. The things that influenced the price, the square footage, the number of rooms, the quality of the neighborhood, were your tools.
Your goal was to create a mathematical equation, a model, that could take the features of any given house and produce an honest estimate of its price. With the problem defined, your next task was to gather data. This was long before digital databases. The data lived in cavernous county courthouses, in thick, leather-bound ledgers filled with looping cursive. For weeks, you and your small team of clerks pored over deeds and tax records. For each property that had sold in the last three years, you transcribed the information onto a standardized data card. You had to be careful, excluding sales between family members or foreclosures that didn't reflect the true market.
Each card was a portrait of a house, painted with the features a 1930s homebuyer cared about:
Square Footage: The basic measure of size.
Bedrooms & Bathrooms: A full bathroom with indoor plumbing was a major selling point, not a given.
Lot Size: The land the house sat on.
Construction Material: Brick was highly desirable for its permanence. You coded these features numerically: Brick = 1, Wood Frame = 2.
Heating System: A modern coal-fired central furnace was a world away from wood stoves in each room. This was a simple "yes" or "no" (1 or 0).
Modern Kitchen: Did it have built-in cabinetry and a gas or electric stove? Another crucial "yes" or "no."
Proximity to Streetcar: With few two-car families, public transport was vital. You measured this in blocks.
After weeks of this dusty work, you had a formidable stack of several hundred data cards, a clean, structured dataset, ready for analysis. But how do you find the relationship hidden within it?
A Recipe for a Price
The tool for this job was a statistical workhorse: Multiple Linear Regression. The name sounds complicated, but the idea is simple. You figured a house's price could be estimated with a kind of recipe. The recipe starts with a base price, then adds or subtracts money based on the home's features.
The mathematical version of your recipe looked something like this:
Price ≈ (Base Price) + (Price per sq. foot × # of sq. feet) + (Value of central heat) + ...
The magic was in finding the right numbers for this recipe. What was the price for an extra square foot? How much was central heating really worth to a buyer? These were the values, the coefficients, you needed to calculate. If you could find them, you could create a formula to value almost any house.
Today, a computer would perform this task in seconds. For you, it meant days of intense, focused calculation using your most trusted tools: graph paper, a slide rule, and a noisy, hand-cranked mechanical calculator. You used the Method of Least Squares. The goal was to find the recipe, the set of coefficients, that came closest to predicting the actual sale prices for all the houses in your dataset. Think of it as drawing a line through a cloud of data points on a graph; you were looking for the one "best-fit" line that passed as close as possible to all the points at once.
Finding this line was a monumental undertaking. For a model with eight features, it required solving a web of interlocking equations by hand. But after days of cranking the calculator, the math finally yielded its prize: your set of coefficients. For example, you might have found:
Base Price: $600
Price per Square Foot: $1.25
Value of Central Heat: $400
Value of a Modern Kitchen: $250
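Today, that same calculation is a few lines of code. Here is a minimal sketch in Python with NumPy, using a handful of invented sale records constructed to follow the example recipe above exactly, so you can watch the Method of Least Squares recover those coefficients; real ledger data would be noisier and the fit only approximate. It illustrates the technique, not the FHA's actual data or formula.

```python
import numpy as np

# Hypothetical 1930s sales records: [square feet, central heat (1/0), modern kitchen (1/0)]
features = np.array([
    [1100, 1, 0],
    [1400, 1, 1],
    [ 900, 0, 0],
    [1600, 1, 1],
    [1240, 0, 1],
    [1000, 1, 0],
])
sale_prices = np.array([2375, 3000, 1725, 3250, 2400, 2250])

# A column of ones lets the model learn the "base price" (the intercept)
X = np.column_stack([np.ones(len(features)), features])

# Method of Least Squares: the coefficients that minimize the squared prediction error
coefficients, *_ = np.linalg.lstsq(X, sale_prices, rcond=None)

base_price, per_sqft, central_heat, modern_kitchen = coefficients
print(f"Base price:      ${base_price:,.0f}")
print(f"Per square foot: ${per_sqft:,.2f}")
print(f"Central heat:    ${central_heat:,.0f}")
print(f"Modern kitchen:  ${modern_kitchen:,.0f}")
```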
Your raw statistical clay had been fired in the kiln of mathematics. You now had a formal model. But how could you be sure it worked?
The Moment of Truth
A formula is useless unless you can prove its accuracy. Before you began your calculations, you had wisely set aside about 20% of your data cards. Your model had never seen these houses. This would be its test.
You took the features from each card in this hold-out set and plugged them into your new formula. For each house, you calculated a predicted price. Then came the moment of truth: you compared your predictions to the actual sale prices. You calculated your typical error: a figure of $300, for instance, meant your estimates missed the true sale price by about $300 on average. For underwriting mortgages worth several thousand dollars, this was a promising start. You also found that your model accounted for 82% of the variation in house prices, a remarkably strong result. It worked.
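As a short sketch of that moment of truth, here is the same check with invented numbers: four hypothetical hold-out houses, scored with the example coefficients from above. The printed figures are only illustrative, though they land in the same ballpark as the $300 typical error and 82% of variation described here.

```python
import numpy as np

# Hypothetical hold-out cards the model never saw during fitting:
# columns are [1 (for the base price), square feet, central heat, modern kitchen]
X_holdout = np.array([
    [1, 1300, 1, 0],
    [1,  950, 0, 1],
    [1, 1500, 1, 1],
    [1, 1150, 0, 0],
])
actual_prices = np.array([2900, 1750, 3500, 2250])

# The recipe from the fit above: base price, per square foot, heat, kitchen
coefficients = np.array([600.0, 1.25, 400.0, 250.0])

predicted = X_holdout @ coefficients
errors = predicted - actual_prices

# "Typical error": the average absolute gap between prediction and reality
typical_error = np.mean(np.abs(errors))

# Share of price variation the model explains (what statisticians call R-squared)
ss_residual = np.sum(errors ** 2)
ss_total = np.sum((actual_prices - actual_prices.mean()) ** 2)
r_squared = 1 - ss_residual / ss_total

print(f"Typical error: ${typical_error:,.0f}")
print(f"Variation explained: {r_squared:.0%}")
```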
The results were good, but you suspected they could be better. You saw that your model consistently undervalued very large homes. This told you the value of an extra square foot wasn't constant; it was worth less in a mansion than in a bungalow. To fix this, you refined your formula. You added new terms, like one for square footage squared, which allowed the model to trace a curve instead of a rigid straight line. You also explored how features might interact. Does a garage add more value to a large house than a small one?
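A modern sketch of that refinement, again with invented numbers, might look like this: the same least-squares fit, once with a strictly linear recipe and once with the squared and interaction terms added. In practice, as the story notes, you would judge the refined recipe on held-out houses rather than on the data it was fit to.

```python
import numpy as np

sqft = np.array([900, 1100, 1400, 1800, 2600, 3400])     # home sizes, bungalow to mansion
garage = np.array([0, 0, 1, 1, 1, 1])                     # 1 = has a garage
prices = np.array([1725, 2100, 2750, 3300, 4100, 4700])   # invented sale prices

# Original recipe: intercept + square footage + garage, strictly straight-line
X_linear = np.column_stack([np.ones_like(sqft), sqft, garage])

# Refined recipe: square footage squared lets the fit curve (each extra foot is worth
# less in a mansion), and garage-times-sqft lets the garage's value depend on home size
X_refined = np.column_stack([np.ones_like(sqft), sqft, sqft ** 2, garage, garage * sqft])

for name, X in [("linear", X_linear), ("refined", X_refined)]:
    coef, *_ = np.linalg.lstsq(X, prices, rcond=None)
    typical_error = np.mean(np.abs(X @ coef - prices))
    print(f"{name:8s} recipe, typical in-sample error: ${typical_error:,.0f}")
```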
Each time you changed the formula, you had to repeat the grueling process of calculation and re-evaluation. It was a process of sharpening the pencil, always seeking a smaller error and a more honest number. How could this model continue to be useful over time?
A Living Formula
After weeks of refinement, you had a model you were proud of. The FHA adopted it as a core part of its underwriting process. An appraiser would now fill out a standardized form, the numbers would be run through the model, and a baseline valuation would be produced.
But your work wasn't done. You knew the housing market was a living entity. Tastes would change. A sunroom might become more valuable, a coal chute less so. The numbers that worked in 1935 would not be accurate in 1945. You established a protocol for monitoring. Every quarter, you would test the model against new sales data. You watched your error rate like a hawk. When it began to creep up, it was a signal that the market had shifted. It meant it was time to retrain the model, to take all the accumulated data, new and old, and re-run the Great Calculation to generate a fresh set of coefficients that reflected the new reality.
In the quiet hum of your office, amidst the clatter of your mechanical calculator, you did more than build a formula. You pioneered a discipline. That manual, painstaking process laid the intellectual groundwork for the automated systems that power our world today. You learned then what remains true now: in the face of chaos, the rigorous application of statistics can forge a path toward clarity and confidence.
But how has this process evolved into the world of Artificial Intelligence?
The Modern Blueprint
The disciplined process you followed in 1935, the quest for an honest number, the dusty search for data, the great calculation, and the constant, watchful eye, was the blueprint for a new discipline. Today, that manual process has evolved into the modern Artificial Intelligence (AI) and Machine Learning (ML) lifecycle. It is still a structured journey that transforms a goal into a tangible system. But the scale and speed are vastly different. Your hand-cranked calculator has been replaced by vast server farms, and your stack of data cards has become an ocean of digital information.
Yet, the core logic remains timeless. The lifecycle, now more formalized, provides the crucial framework for managing the complexity of developing sophisticated AI. What are the steps in this modern lifecycle?
1. Framing the Problem
It still begins with a clear question. For you in 1935, the FHA needed a reliable way to value a home to reduce its financial risk. Today, a business might want to predict which customers are likely to cancel a subscription or identify which transactions might be fraudulent. You must first understand the business goal completely before you can translate it into a mathematical task for a model.
2. Acquiring the Data
Your hunt for data in dusty courthouse ledgers has become an automated query. Instead of sending out clerks, you now pull data from massive databases, connect to web services, or access public datasets. The principle, however, is the same: you must understand what data is available, where it lives, and how you can acquire the raw material needed to solve your problem.
3. Preparing the Data
This is the modern, high-speed version of cleaning your data cards. The goal is to make the raw data consistent and useful. You can't have some home sizes listed in square feet and others in square meters. You must decide how to handle records with missing key information, like a sale price. This step also includes feature engineering, the same creative process you used when you added "square footage squared" to your model. You might combine several raw data points to create a single, more powerful feature, giving the model a better signal to learn from. This is often the most time-consuming part of the entire lifecycle.
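As a small, hypothetical illustration, here is what one cleaning pass might look like in Python with pandas. The column names and records are invented; the steps (consistent units, missing values, a new engineered feature) mirror the paragraph above.

```python
import pandas as pd

# A few hypothetical raw sales records pulled from different source systems
raw = pd.DataFrame({
    "size":       [1400, 130.0, 1750, 1100, None],   # mixed units: square feet and square meters
    "size_unit":  ["sqft", "sqm", "sqft", "sqft", "sqft"],
    "sale_price": [265000, 240000, None, 198000, 210000],
    "bedrooms":   [3, 3, 4, None, 2],
})

clean = raw.copy()

# 1. Make units consistent: convert square meters to square feet
is_sqm = clean["size_unit"] == "sqm"
clean.loc[is_sqm, "size"] = clean.loc[is_sqm, "size"] * 10.7639
clean["size_unit"] = "sqft"

# 2. Handle missing values: the label (sale price) is essential, so drop rows without it;
#    a missing bedroom count can be filled with a sensible default instead
clean = clean.dropna(subset=["sale_price", "size"])
clean["bedrooms"] = clean["bedrooms"].fillna(clean["bedrooms"].median())

# 3. Feature engineering: combine raw columns into a stronger signal for the model
clean["sqft_per_bedroom"] = clean["size"] / clean["bedrooms"]

print(clean)
```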
4. Selecting the Model
Where you had one primary tool, linear regression, your modern counterpart has a vast workshop of algorithms. You might choose a decision tree, a neural network, or a gradient boosting machine, among many others. Your task is to engineer a solution by selecting the right model architecture for the specific problem you're trying to solve.
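One common way to choose is to audition a few candidates on the same data and compare their cross-validated errors. Here is a hedged sketch using scikit-learn on a synthetic dataset; the candidates and settings are illustrative, not a recommendation.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a prepared tabular dataset (features X, target y)
X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=0)

candidates = {
    "linear regression": LinearRegression(),
    "decision tree":     DecisionTreeRegressor(random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}

# Score each candidate with 5-fold cross-validation before committing to one
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"{name:18s} typical error: {-scores.mean():.1f}")
```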
5. Training the Model
The "Great Calculation" is now automated. You feed your prepared data to the chosen model, and instead of you cranking a calculator, powerful computers "train" the model. They perform the complex optimization, finding the ideal internal parameters, the "coefficients" in your 1935 recipe, that best map the input data to the correct answers. During this process, you continuously validate the model's performance to ensure it's actually learning and not just memorizing the data.
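A minimal training sketch, again on synthetic data: fit the chosen model on one slice of the data and validate it on another, watching for the gap that signals memorizing rather than learning. The dataset and model here are placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=0)

# Hold out a validation slice so you can check the model is learning, not memorizing
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)   # the automated "Great Calculation"

train_error = mean_absolute_error(y_train, model.predict(X_train))
val_error = mean_absolute_error(y_val, model.predict(X_val))

# A validation error far above the training error is the classic sign of memorization
print(f"Training error:   {train_error:.1f}")
print(f"Validation error: {val_error:.1f}")
```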
6. Evaluating the Model
This is your "Moment of Truth," conducted with scientific rigor. Just as you set aside a stack of unseen data cards, you test the trained model on a hold-out "test set." You use a whole battery of statistical metrics to measure its performance from every angle. This final exam determines if the model is accurate, reliable, and fair enough to be used in the real world.
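For a concrete, hypothetical version of that final exam, here is the churn problem from earlier scored on a held-out test set with a small battery of classification metrics. The data is synthetic and the model is a placeholder.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a churn dataset: 1 = customer cancelled, 0 = customer stayed
X, y = make_classification(n_samples=2000, n_features=12, weights=[0.85], random_state=0)

# The test set plays the role of the 1935 hold-out cards: unseen during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)[:, 1]

# Each metric measures a different angle of "good enough"
print(f"Accuracy:  {accuracy_score(y_test, predictions):.2f}")
print(f"Precision: {precision_score(y_test, predictions):.2f}")  # of flagged churners, how many really churn?
print(f"Recall:    {recall_score(y_test, predictions):.2f}")     # of real churners, how many did we catch?
print(f"ROC AUC:   {roc_auc_score(y_test, probabilities):.2f}")
```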
7. Tuning the Model
"Sharpening the Pencil" is now a highly systematic process. You methodically adjust the model's settings, its hyperparameters, to squeeze out every last bit of performance. This could involve changing how fast the model learns or adjusting its internal complexity. Often, this tuning process is itself automated, with computers running thousands of experiments to find the optimal combination of settings.
8. Deploying the Model
In 1935, deploying your model meant handing your formula over to the underwriting department. Today, deployment is a complex technical step. It involves integrating your final, tuned model into a live software application, ensuring it can handle thousands or even millions of requests from users reliably and with minimal delay. Your model is no longer a formula on paper; it's a working part of a larger technological system.
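One of many possible deployment sketches: wrapping a trained model in a small web service with FastAPI, so other software can request predictions over the network. The model file, feature names, and framework choice here are assumptions for illustration.

```python
# If this file were saved as app.py, you could run it with: uvicorn app:app --reload
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("house_model.joblib")   # hypothetical path to the final, tuned model

class House(BaseModel):
    square_feet: float
    bedrooms: int
    has_central_heat: bool

@app.post("/predict")
def predict(house: House):
    # Assumes a scikit-learn-style model that expects features in this order
    features = [[house.square_feet, house.bedrooms, int(house.has_central_heat)]]
    estimate = model.predict(features)[0]
    return {"estimated_price": round(float(estimate), 2)}
```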
9. Monitoring the Model
Your "Watchful Eye" remains the final, crucial step. The world is not static. Market conditions change, customer behaviors shift, and new kinds of data emerge. This causes model drift, where a model's performance degrades over time because the new reality no longer matches the data it was trained on. You use automated dashboards and alerts to monitor the model's accuracy and health in real time. When you detect drift, it triggers the lifecycle to begin again: you acquire new data, retrain the model, and deploy an updated version to ensure it remains accurate and valuable. Your living formula has become a continuously learning system.
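A drift check can be as simple as comparing the live typical error against the error you accepted at deployment time. The numbers and threshold below are invented, but the pattern is the essence of the watchful eye.

```python
import numpy as np

ACCEPTABLE_ERROR = 350.0   # the typical error you signed off on at deployment time

def check_for_drift(predictions, actuals, baseline=ACCEPTABLE_ERROR, tolerance=1.25):
    """Compare the live typical error against the baseline; flag drift if it creeps up."""
    live_error = np.mean(np.abs(np.asarray(predictions) - np.asarray(actuals)))
    return live_error, live_error > baseline * tolerance

# A hypothetical quarter of fresh outcomes flowing in from production
predicted = [2650, 2100, 3300, 1980, 2750]
actual    = [3150, 2550, 3900, 2400, 3300]

live_error, drifted = check_for_drift(predicted, actual)
print(f"Live typical error: ${live_error:,.0f}")
if drifted:
    print("Drift detected: time to gather new data, retrain, and redeploy.")
```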
Now, how does this blueprint apply to one of today’s most ambitious technologies, the Large Language Model?
Building a Large Language Model
Your journey in 1935 to value a house gives us a blueprint for understanding one of the most ambitious technological endeavors of our time: building a foundational Large Language Model (LLM) like Google's Gemini or OpenAI's GPT series. The core logic is the same, but the scale is almost unimaginable. You're no longer trying to solve a single problem like pricing a home; you're trying to build a general-purpose brain for understanding and generating human language. The entire process is defined by this immense scale and a profound focus on making the model helpful, honest, and harmless.
What challenges does this enormous ambition present?
Understanding the Goal
First, you must define the problem. The goal isn't to build a tool that does one thing well, but to create a foundation model, a versatile AI that learns the underlying patterns, grammar, facts, and reasoning abilities from nearly everything ever written. Once built, this single model can be prompted to perform hundreds of different tasks: writing emails, summarizing reports, answering questions, and even generating computer code. The business objective is to create a foundational utility, like an electrical grid, that powers an entire ecosystem of new products and services. But this ambition comes with a formidable set of challenges:
Astronomical Cost: Training a top-tier LLM is one of the most computationally expensive projects on earth. It requires a city of specialized computers running nonstop for months, costing tens or even hundreds of millions of dollars.
Web-Scale Data: The model needs a training library larger than any that has ever existed, measured in trillions of words. Finding, cleaning, and organizing this data is a massive challenge in itself.
Ethical & Safety Risks: This is the most profound challenge. An unconstrained model can make things up (a phenomenon called "hallucination"), reflect biases from its training data, or generate harmful content. A central, non-negotiable goal must be to align the model with human values so it operates safely and responsibly.
With these challenges in mind, where does the data come from?
Acquiring the Data
To build this brain, you need to feed it a library containing almost everything humans have written. Your data acquisition begins by sourcing a mix of text and code from across the internet and digitized archives. Some of it comes from Common Crawl, a public repository of raw data copied from billions of web pages. You supplement this with higher-quality sources: millions of digitized books, scientific papers, encyclopedic knowledge from Wikipedia, and vast amounts of source code from sites like GitHub. Early models were trained on hundreds of billions of words; today's models learn from over a trillion words. This "data arms race" is driven by a simple finding: the more high-quality data a model reads, the smarter it gets.
How is this mountain of raw text refined?
Preparing the Data
Now you have a mountain of raw text, but it's a mess. Refining it into a high-quality training library is like preparing the ingredients for a giant recipe. The process is a massive data-cleaning pipeline.
Quality Filtering: First, you throw out the rotten vegetables. Automated filters remove low-quality documents, such as pages with very little text or those filled with spam and profanity.
Deduplication: The web is full of repetition. You aggressively remove duplicate sentences and documents to ensure the model doesn't become biased toward information that just happens to be repeated often.
PII Removal: In a crucial step for privacy, you use automated tools to find and strip out Personally Identifiable Information (PII) like names, phone numbers, and email addresses. This helps prevent the model from memorizing and later revealing someone's private data.
Tokenization: Finally, you prepare the text for the model. This involves chopping the cleaned text into a sequence of common words and word fragments called tokens, which are the basic vocabulary the model uses.
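To make tokens concrete, here is a toy tokenizer with a tiny hand-picked vocabulary. Real models learn their vocabularies with algorithms such as byte-pair encoding; this sketch only shows how text becomes a sequence of words and word fragments.

```python
# Toy vocabulary of whole words, fragments, and single characters (hand-picked for this example)
VOCAB = ["the", "model", "token", "iz", "ation", "learn", "s", "from", " ", "a", "t"]

def tokenize(text, vocab=VOCAB):
    tokens = []
    i = 0
    while i < len(text):
        # Greedily take the longest vocabulary entry that matches at this position
        match = max((v for v in vocab if text.startswith(v, i)), key=len, default=None)
        if match is None:            # unknown character: fall back to a single character
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("the model learns tokenization"))
# ['the', ' ', 'model', ' ', 'learn', 's', ' ', 'token', 'iz', 'ation']
```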
What is the architectural foundation for these models?
Selecting the Architecture
The architectural foundation for virtually all modern LLMs is the Transformer, introduced in a groundbreaking 2017 paper. It was a radical new way to process language. Instead of reading a sentence one word at a time, the Transformer looks at all the words at once. Its special trick is a mechanism called self-attention, which lets it weigh the importance of every word in relation to every other word, no matter how far apart they are. This is how it "gets" context and understands complex grammar. Because it processes everything at once, it's perfectly suited for the parallel computing power of modern AI hardware, making it possible to train on web-scale data.
Your job as an engineer is to decide on the model's blueprint: its size and shape. This includes how many layers deep it should be and how complex each layer is. These decisions are guided by "scaling laws," empirical rules that help predict how much smarter a model will get as you make it bigger.
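The core of the Transformer can be sketched in a few lines. Below is a single attention "head" with random, untrained weights, just enough to show how each token's output becomes a weighted mix of information from every other token; real models stack many such heads and layers and learn the weights during training.

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # how relevant each token is to every other token
    weights = softmax(scores)                  # each row sums to 1: an "attention map"
    return weights @ V                         # every output mixes information from the whole sequence

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                        # 5 tokens, 8-dimensional embeddings (toy sizes)
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 8): one updated vector per token
```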
How is this massive model trained and shaped?
Training and Alignment
Training a foundational LLM is a careful, multi-stage process designed not just to impart knowledge, but to shape its behavior.
Phase 1: Pre-training. This is the main event and the most expensive part. You give the model your massive library of text and one simple task: predict the next word. By doing this simple task trillions of times, the model is forced to learn an incredibly rich internal understanding of language, facts, and reasoning on its own. (A minimal sketch of this next-word objective follows the three phases.)
Phase 2: Instruction Tuning. The pre-trained model is like a brilliant but unhelpful librarian who can finish any sentence but doesn't know how to answer a question. This phase is like sending it to finishing school. You fine-tune it on a smaller, high-quality dataset of questions and good answers, teaching it how to be a helpful, conversational assistant.
Phase 3: Alignment. This is where you teach the model ethics. The pioneering method is Reinforcement Learning from Human Feedback (RLHF). First, you have the model generate several answers to a prompt. Second, human reviewers rank those answers from best to worst. This feedback is used to train a separate "reward model" that learns to predict what humans prefer. Finally, this reward model is used to further train the LLM, teaching it to produce answers that align with human values. Newer methods like Direct Preference Optimization (DPO) achieve the same goal in a simpler, more direct way.
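To make Phase 1's next-word objective concrete, here is a toy version with a six-word vocabulary and a random stand-in for the model. Real pre-training adjusts billions of parameters over trillions of tokens to drive this same "surprise" number down.

```python
import numpy as np

# Toy vocabulary and a training snippet; a real model sees trillions of tokens
vocab = ["the", "cat", "sat", "on", "mat", "."]
snippet = ["the", "cat", "sat", "on", "the", "mat"]
token_ids = np.array([vocab.index(t) for t in snippet])

rng = np.random.default_rng(0)
# Stand-in for the model: at each position it assigns a probability to every vocabulary word
logits = rng.normal(size=(len(snippet) - 1, len(vocab)))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# The target at each position is simply the *next* token in the text
targets = token_ids[1:]

# Cross-entropy: how surprised the model was by the actual next words;
# pre-training nudges the parameters to make this number smaller
loss = -np.mean(np.log(probs[np.arange(len(targets)), targets]))
print(f"Average surprise (cross-entropy): {loss:.2f}")
```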
How do you test a model that can do almost anything?
Evaluating the Model
You give it a battery of standardized tests, just like a student. To gauge general knowledge, you use academic benchmarks like MMLU, a massive multiple-choice exam covering 57 subjects from STEM to the humanities. To test its programming skills, you use benchmarks that require it to write functional code. A major challenge here is data contamination: you have to ensure the model didn't accidentally see the test questions in its training data, which would invalidate the results.
Just as important are the safety and ethics exams. Benchmarks like TruthfulQA measure whether the model answers questions truthfully instead of repeating common misconceptions, while ToxiGen evaluates its ability to avoid generating hateful or toxic language.
How can the model be improved without a full retraining?
Optimizing the Model
Once the foundation is built, you can improve its performance on specific tasks without the expense of a full retraining.
Prompt Engineering: This is simply the art of asking the right question to get the right answer. Carefully crafting your instructions is a highly effective way to control the model's output.
Retrieval-Augmented Generation (RAG): This is like giving the model an open-book test to prevent it from making things up. When you ask a question, the system first retrieves relevant, up-to-date documents from a trusted knowledge base. It then hands these documents to the LLM and instructs it to form an answer based only on the information provided. (A minimal sketch of this pattern follows at the end of this list.)
Parameter-Efficient Fine-Tuning (PEFT): When a task requires more adaptation, you can perform minor surgery instead of a full brain transplant. These techniques freeze most of the giant model and train only a few small, new parts, allowing you to adapt it to new tasks with a fraction of the time and cost.
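Here is the minimal RAG sketch promised above: a toy keyword retriever standing in for a real embedding-based search, plus the prompt that would be handed to the model. The knowledge base, scoring, and instructions are all invented for illustration; a production system would send the assembled prompt to an actual LLM.

```python
KNOWLEDGE_BASE = [
    "Refunds are available within 30 days of purchase with a receipt.",
    "Our stores are open 9am to 8pm, Monday through Saturday.",
    "Gift cards never expire and can be used online or in store.",
]

def retrieve(question, documents, top_k=2):
    """Rank documents by shared words with the question (a stand-in for vector search)."""
    question_words = set(question.lower().split())
    return sorted(documents,
                  key=lambda doc: len(question_words & set(doc.lower().split())),
                  reverse=True)[:top_k]

def build_prompt(question, documents):
    context = "\n".join(f"- {doc}" for doc in documents)
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

question = "How many days do I have to get a refund?"
relevant = retrieve(question, KNOWLEDGE_BASE)
print(build_prompt(question, relevant))   # in a real system, this prompt goes to the LLM
```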
So, how does this incredibly powerful technology actually reach you?
From Model to Product
You never interact with "the model" itself. Modern AI models, whether they transcribe your voice, recommend a movie, or write an email, don't just show up on your screen raw and ready to go. Instead, you use a product that has wrapped the model in a carefully designed experience. This separation isn't accidental. It's the crucial step that makes AI usable, valuable, and trustworthy.
Think of it this way: an AI model is an engine. An engine is useless without the car built around it, a chassis, a steering wheel, brakes, and seats. The same is true for AI. The product is the car that makes the engine's power useful to a real person. What does this "car" do?
Bridging the Gap
At the heart of this design is a simple idea: AI models are general-purpose tools, but you have specific needs. A large language model can generate an answer to nearly any open-ended prompt, but that's only helpful if you know how to ask the right question or if the product guides the conversation toward a meaningful outcome. A speech recognition model can turn audio into a pile of text, but who needs a pile of text? You need that text inserted into the right place, punctuated correctly, and attributed to the right speaker. The product bridges this gap between the model's raw capability and your intention. It takes on the job of "productizing intelligence," hiding the immense complexity and surfacing just enough of the model's power to solve your problem cleanly and efficiently.
Consider a real-world example: a voice assistant like Siri or Alexa. When you say, "Set a timer for 10 minutes," a whole chain of AI models springs into action underneath. One model recognizes your speech, another understands your intent ("set a timer"), and a third generates the spoken confirmation. But you don't see any of that. You don't interact with the speech model or the language model. You see a simple, friendly confirmation and a countdown on your screen. The product wraps the models in logic, context, and an interface, the things that make it work for you.
Providing Guardrails
This wrapper is also a set of guardrails. AI models are not always predictable. An LLM might "hallucinate" a fact. A vision model might misclassify an object. Exposing these models directly to you would be like exposing you to a sputtering engine without a protective hood. The product is designed to catch, correct, or minimize these errors through better-designed prompts, content filtering, and graceful ways to handle mistakes.
In many modern applications, no single model is enough. The product often acts as a conductor, orchestrating a whole suite of models to get the job done. A transcription app might use one model for noise reduction, another for the core speech-to-text conversion, a third for adding punctuation, and a fourth for figuring out who is speaking. A customer service chatbot might combine a creative LLM with a rigid database to pull up your account information. The product is the conductor that ensures each model plays its part to create a single, coherent result.
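As a toy illustration of that conductor role, here is a pipeline of placeholder stages standing in for real models; the point is the pattern (each stage hands its result to the next), not the trivial string handling.

```python
# Each function below is a trivial stand-in for a real model in the chain
def reduce_noise(audio):
    return audio.replace("[hiss]", "")

def speech_to_text(audio):
    return audio.strip()

def add_punctuation(text):
    return text[0].upper() + text[1:] + "."

def identify_speaker(text):
    return f"Speaker 1: {text}"

PIPELINE = [reduce_noise, speech_to_text, add_punctuation, identify_speaker]

def transcribe(raw_audio):
    result = raw_audio
    for stage in PIPELINE:   # the "conductor": each model plays its part, in order
        result = stage(result)
    return result

print(transcribe("[hiss] set a timer for ten minutes [hiss] "))
# Speaker 1: Set a timer for ten minutes.
```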
This orchestration also enables learning. When you correct a transcript or rephrase a question, the product is what captures that feedback. While the core AI model might not retrain in that instant, the product can learn immediately. It can adapt its interface, refine its prompts, or adjust its recommendations. This tight feedback loop is what transforms a laboratory prototype into a polished consumer experience.
What does this all come down to?
Earning Your Trust
Ultimately, this all comes down to trust. People don't trust abstract models; they trust the products they use every day. Trust is earned through predictability, clarity, and reliability, all of which are delivered at the product level, not the model level. If the AI makes a mistake, the product needs to offer a helpful way out. If the output is confusing, the product needs to explain it. You don't want to use an AI model. You want to book a flight, write a better email, or find the right answer faster. The model is just one component in the system that makes that possible. Wrapping it in a product is what transforms a mathematical curiosity into a real-world tool, and it's the only way most of us will ever experience the power of artificial intelligence.
Conclusion
Your journey began in 1935, with a simple mandate to find an honest, repeatable number for the value of a house. Armed with ledger books and a hand-cranked calculator, you followed a disciplined path: defining the problem, gathering the data, choosing a method, and meticulously testing your work.
Nearly a century later, the tools have changed beyond recognition. The mechanical calculator has become a city of cloud-based supercomputers, and the stack of data cards has become a digital ocean of human knowledge. The challenge has expanded from valuing a home to building a model that can reason, write, and create. Yet, as you've seen through the modern lifecycle, the fundamental principles you first applied remain the unshakable foundation of the entire discipline. The quest is still one of forging order from chaos, of finding a signal in the noise.
Ultimately, this entire process, from the dusty courthouse to the polished final product, is about translating mathematical power into human value. The FHA's model was only useful because it was wrapped in a process that allowed appraisers to bring stability to a desperate market. Today's powerful language models are only useful because they are wrapped in products that help you find an answer, write a better email, or translate a foreign language. The technology will continue to evolve at a breathtaking pace, but its purpose remains the same. It is a quest to understand the patterns of our world and to build tools that help us navigate it more clearly, more efficiently, and more creatively. That human-centered mission, as vital today as it was in the depths of the Great Depression, is what will continue to drive this story forward.