
Metadata

  • Author: Chip Huyen
  • Full Title: Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Highlights

  • ML systems are both complex and unique. They are complex because they consist of many different components (ML algorithms, data, business logic, evaluation metrics, underlying infrastructure, etc.) and involve many different stakeholders (data scientists, ML engineers, business leaders, users, even society at large). ML systems are unique because they are data dependent, and data varies wildly from one use case to the next. (Location 66)
  • Many people, when they hear “machine learning system,” think of just the ML algorithms being used such as logistic regression or different types of neural networks. However, the algorithm is only a small part of an ML system in production. The system also includes the business requirements that gave birth to the ML project in the first place, the interface where users and developers interact with your system, the data stack, and the logic for developing, monitoring, and updating your models, as well as the infrastructure that enables the delivery of that logic. (Location 233)
  • ML systems design takes a system approach to MLOps, which means that it considers an ML system holistically to ensure that all the components and their stakeholders can work together to satisfy the specified objectives and requirements. (Location 243)
  • Even for problems that ML can solve, ML solutions might not be the optimal solutions. Before starting an ML project, you might want to ask whether ML is necessary or cost-effective. (Location 263)
  • Machine learning is an approach to (1) learn (2) complex patterns from (3) existing data and use these patterns to make (4) predictions on (5) unseen data. (Location 266)
  • Without data and without continual learning, many companies follow a “fake-it-til-you-make-it” approach: launching a product that serves predictions made by humans, instead of ML models, with the hope of using the generated data to train ML models later. (Location 314)
  • Compute-intensive problems are one class of problems that have been very successfully reframed as predictive. Instead of computing the exact outcome of a process, which might be even more computationally costly and time-consuming than ML, you can frame the problem as: “What would the outcome of this process look like?” and approximate it using an ML model. The output will be an approximation of the exact output, but often, it’s good enough. You can see a lot of it in graphic renderings, such as image denoising and screen-space shading. (Location 324)
  • Unless your ML model’s performance is 100% all the time, which is highly unlikely for any meaningful tasks, your model is going to make mistakes. ML is especially suitable when the cost of a wrong prediction is low. For example, one of the biggest use cases of ML today is in recommender systems because with recommender systems, a bad recommendation is usually forgiving—the user just won’t click on the recommendation. (Location 347)
  • ML solutions often require nontrivial up-front investment on data, compute, infrastructure, and talent, so it’d make sense if we can use these solutions a lot. “At scale” means different things for different tasks, but, in general, it means making a lot of predictions. Examples include sorting through millions of emails a year or predicting which departments thousands of support tickets a day should be routed to. (Location 354)
  • Cultures change. Tastes change. Technologies change. What’s trendy today might be old news tomorrow. Consider the task of email spam classification. Today an indication of a spam email is a Nigerian prince, but tomorrow it might be a distraught Vietnamese writer. If your problem involves one or more constantly changing patterns, hardcoded solutions such as handwritten rules can become outdated quickly. (Location 363)
  • Most of today’s ML algorithms shouldn’t be used under any of the following conditions: (1) It’s unethical. We’ll go over one case study where the use of ML algorithms can be argued as unethical in the section “Case study I: Automated grader’s biases”. (2) Simpler solutions do the trick. In Chapter 6, we’ll cover the four phases of ML model development, where the first phase should be non-ML solutions. (3) It’s not cost-effective. (Location 372)
  • However, even if ML can’t solve your problem, it might be possible to break your problem into smaller components, and use ML to solve some of them. For example, if you can’t build a chatbot to answer all your customers’ queries, it might be possible to build an ML model to predict whether a query matches one of the frequently asked questions. If yes, direct the customer to the answer. If not, direct them to customer service. (Location 377)
  • ML-based pricing optimization is most suitable for cases with a large number of transactions where demand fluctuates and consumers are willing to pay a dynamic price—for example, internet ads, flight tickets, accommodation bookings, ride-sharing, and events. (Location 423)
  • Customer acquisition cost is hailed by investors as a startup killer. Reducing customer acquisition costs by a small amount can result in a large increase in profit. This can be done through better identifying potential customers, showing better-targeted ads, giving discounts at the right time, etc.—all of which are suitable tasks for ML. (Location 432)
  • Automated support ticket classification can help with that. Previously, when a customer opened a support ticket or sent an email, it needed to first be processed then passed around to different departments until it arrived at the inbox of someone who could address it. An ML system can analyze the ticket content and predict where it should go, which can shorten the response time and improve customer satisfaction. It can also be used to classify internal IT tickets. (Location 440)
  • Table 1-1. Key differences between ML in research and ML in production (Location 468)
    • Requirements: research targets state-of-the-art model performance on benchmark datasets; in production, different stakeholders have different requirements.
    • Computational priority: fast training, high throughput in research; fast inference, low latency in production.
    • Data: static in research; constantly shifting in production.
    • Fairness: often not a focus in research; must be considered in production.
    • Interpretability: often not a focus in research; must be considered in production.
  • When developing an ML project, it’s important for ML engineers to understand requirements from all stakeholders involved and how strict these requirements are. (Location 507)
  • An obvious argument is that in these competitions many of the hard steps needed for building ML systems are already done for you. A less obvious argument is that due to the multiple-hypothesis testing scenario that happens when you have multiple teams testing on the same hold-out test set, a model can do better than the rest just by chance. (Location 524)
  • During model development, training is the bottleneck. Once the model has been deployed, however, its job is to generate predictions, so inference is the bottleneck. Research usually prioritizes fast training, whereas production usually prioritizes fast inference. (Location 540)
  • According to Martin Kleppmann in his book Designing Data-Intensive Applications, “The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting service.” (Location 547)
  • If your system always processes one query at a time, higher latency means lower throughput. If the average latency is 10 ms, which means it takes 10 ms to process a query, the throughput is 100 queries/second. If the average latency is 100 ms, the throughput is 10 queries/second. However, because most modern distributed systems batch queries to process them together, often concurrently, higher latency might also mean higher throughput. If you process 10 queries at a time and it takes 10 ms to run a batch, the average latency is still 10 ms but the throughput is now 10 times higher—1,000 queries/second. If you process 50 queries at a time and it takes 20 ms to run a batch, the average latency now is 20 ms and the throughput is 2,500 queries/second. Both latency and throughput have increased! The difference in latency and throughput trade-off for processing queries one at a time and processing queries in batches is illustrated in Figure 1-4. (Location 554)
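A minimal sketch (not from the book) of the arithmetic above: with a single worker running batches back to back, throughput is the batch size divided by the batch latency.

```python
def throughput_qps(batch_size: int, batch_latency_ms: float) -> float:
    """Queries per second when batches run back to back on one worker."""
    batches_per_second = 1000 / batch_latency_ms
    return batch_size * batches_per_second

print(throughput_qps(batch_size=1, batch_latency_ms=10))    # 100.0 queries/second
print(throughput_qps(batch_size=10, batch_latency_ms=10))   # 1000.0, latency still 10 ms
print(throughput_qps(batch_size=50, batch_latency_ms=20))   # 2500.0, latency now 20 ms
```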
  • In 2017, an Akamai study found that a 100 ms delay can hurt conversion rates by 7%. In 2019, Booking.com found that an increase of about 30% in latency cost about 0.5% in conversion rates—“a relevant cost for our business.” In 2016, Google found that more than half of mobile users will leave a page if it takes more than three seconds to load. (Location 568)
  • To reduce latency in production, you might have to reduce the number of queries you can process on the same hardware at a time. If your hardware is capable of processing many more queries at a time, using it to process fewer queries means underutilizing your hardware, increasing the cost of processing each query. (Location 574)
  • Higher percentiles are important to look at because even though they account for a small percentage of your users, sometimes they can be the most important users. For example, on the Amazon website, the customers with the slowest requests are often those who have the most data on their accounts because they have made many purchases—that is, they’re the most valuable customers. (Location 585)
  • It’s a common practice to use high percentiles to specify the performance requirements for your system; for example, a product manager might specify that the 90th percentile or 99.9th percentile latency of a system must be below a certain number. (Location 589)
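To make the percentile idea concrete, here is a small illustration with NumPy on made-up latency numbers; the 100 ms threshold is just an example requirement.

```python
import numpy as np

# Hypothetical per-request latencies in milliseconds.
latencies_ms = np.array([8, 9, 10, 11, 12, 15, 20, 45, 60, 90])

p50, p90, p99 = np.percentile(latencies_ms, [50, 90, 99])
print(f"p50={p50:.1f} ms  p90={p90:.1f} ms  p99={p99:.1f} ms")

# A requirement like "p90 latency must be below 100 ms" becomes a simple check.
assert p90 < 100, "p90 latency requirement violated"
```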
  • In production, data, if available, is a lot more messy. It’s noisy, possibly unstructured, constantly shifting. It’s likely biased, and you likely don’t know how it’s biased. Labels, if there are any, might be sparse, imbalanced, or incorrect. Changing project or business requirements might require updating some or all of your existing labels. If you work with users’ data, you’ll also have to worry about privacy and regulatory concerns. (Location 597)
  • Figure 1-5. Data in research versus data in production. Source: Adapted from an image by Andrej Karpathy. (Location 606)
  • During the research phase, a model is not yet used on people, so it’s easy for researchers to put off fairness as an afterthought: “Let’s try to get state of the art first and worry about fairness when we get to production.” When it gets to production, it’s too late. (Location 610)
  • ML algorithms don’t predict the future, but encode the past, thus perpetuating the biases in the data and more. (Location 624)
  • Interpretability is important, first, for users, both business leaders and end users, to understand why a decision is made so that they can trust a model and detect potential biases mentioned previously. Second, it’s important for developers to be able to debug and improve a model. (Location 644)
  • In SWE, there’s an underlying assumption that code and data are separated. In fact, in SWE, we want to keep things as modular and separate as possible (see the Wikipedia page on separation of concerns). On the contrary, ML systems are part code, part data, and part artifacts created from the two. (Location 666)
  • The trend in the last decade shows that applications developed with the most/best data win. Instead of focusing on improving ML algorithms, most companies will focus on improving their data. Because data can change quickly, ML applications need to be adaptive to the changing environment, which might require faster development and deployment cycles. (Location 669)
  • In traditional SWE, you only need to focus on testing and versioning your code. With ML, we have to test and version our data too, and that’s the hard part. How to version large datasets? How to know if a data sample is good or bad for your system? Not all data samples are equal— some are more valuable to your model than others. (Location 672)
  • Back in 2018, when the Bidirectional Encoder Representations from Transformers (BERT) paper first came out, people were talking about how BERT was too big, too complex, and too slow to be practical. The pretrained large BERT model has 340 million parameters and is 1.35 GB. Fast-forward two years, and BERT and its variants were already used in almost every English search on Google. (Location 687)
  • ML systems design takes a system approach to MLOps, which means that we’ll consider an ML system holistically to ensure that all the components—the business requirements, the data stack, infrastructure, deployment, monitoring, etc.—and their stakeholders can work together to satisfy the specified objectives and requirements. (Location 794)
  • Before we develop an ML system, we must understand why this system is needed. If this system is built for a business, it must be driven by business objectives, which will need to be translated into ML objectives to guide the development of ML models. (Location 797)
  • A pattern I see in many short-lived ML projects is that the data scientists become too focused on hacking ML metrics without paying attention to business metrics. Their managers, however, only care about business metrics and, after failing to see how an ML project can help push their business metrics, kill the projects prematurely (and possibly let go of the data science team involved). (Location 814)
  • So what metrics do companies care about? While most companies want to convince you otherwise, the sole purpose of businesses, according to the Nobel-winning economist Milton Friedman, is to maximize profits for shareholders. (Location 818)
  • The ultimate goal of any project within a business is, therefore, to increase profits, either directly or indirectly: directly such as increasing sales (conversion rates) and cutting costs; indirectly such as higher customer satisfaction and increasing time spent on a website. (Location 820)
  • One of the reasons why predicting ad click-through rates and fraud detection are among the most popular use cases for ML today is that it’s easy to map ML models’ performance to business metrics: every increase in click-through rate results in actual ad revenue, and every fraudulent transaction stopped results in actual money saved. (Location 830)
  • Many companies create their own metrics to map business metrics to ML metrics. For example, Netflix measures the performance of their recommender system using take-rate: the number of quality plays divided by the number of recommendations a user sees. The higher the take-rate, the better the recommender system. Netflix also put a recommender system’s take-rate in the context of their other business metrics like total streaming hours and subscription cancellation rate. They found that a higher take-rate also results in higher total streaming hours and lower subscription cancellation rates. (Location 832)
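A tiny sketch of the take-rate calculation as described above; the numbers are made up.

```python
def take_rate(quality_plays: int, recommendations_shown: int) -> float:
    """Number of quality plays divided by the number of recommendations a user sees."""
    if recommendations_shown == 0:
        return 0.0
    return quality_plays / recommendations_shown

print(take_rate(quality_plays=12, recommendations_shown=200))  # 0.06
```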
  • Due to all the hype surrounding ML, generated both by the media and by practitioners with a vested interest in ML adoption, some companies might have the notion that ML can magically transform their businesses overnight. Magically: possible. Overnight: no. (Location 852)
  • Returns on investment in ML depend a lot on the maturity stage of adoption. The longer you’ve adopted ML, the more efficient your pipeline will run, the faster your development cycle will be, the less engineering time you’ll need, and the lower your cloud bills will be, which all lead to higher returns. (Location 856)
  • The specified requirements for an ML system vary from use case to use case. However, most systems should have these four characteristics: reliability, scalability, maintainability, and adaptability. (Location 870)
  • Reliability: The system should continue to perform the correct function at the desired level of performance even in the face of adversity (hardware or software faults, and even human error). (Location 873)
  • “Correctness” might be difficult to determine for ML systems. For example, your system might call the predict function—e.g., model.predict()—correctly, but the predictions are wrong. How do we know if a prediction is wrong if we don’t have ground truth labels to compare it with? (Location 876)
  • ML systems can fail silently. End users don’t even know that the system has failed and might have kept on using it as if it were working. For example, if you use Google Translate to translate a sentence into a language you don’t know, it might be very hard for you to tell even if the translation is wrong. (Location 879)
  • An ML system might grow in ML model count. Initially, you might have only one model for one use case, such as detecting the trending hashtags on a social network site like Twitter. However, over time, you want to add more features to this use case, so you’ll add one more model to filter out NSFW (not safe for work) content and another model to filter out tweets generated by bots. This growth pattern is especially common in ML systems that target enterprise use cases. (Location 890)
  • An indispensable feature in many cloud services is autoscaling: automatically scaling up and down the number of machines depending on usage. This feature can be tricky to implement. (Location 900)
  • Handling growth isn’t just resource scaling, but also artifact management. Managing one hundred models is very different from managing one model. With one model, you can, perhaps, manually monitor this model’s performance and manually update the model with new data. Since there’s only one model, you can just have a file that helps you reproduce this model whenever needed. However, with one hundred models, both the monitoring and retraining aspect will need to be automated. You’ll need a way to manage the code generation so that you can adequately reproduce a model when you need to. (Location 906)
  • There are many people who will work on an ML system. They are ML engineers, DevOps engineers, and subject matter experts (SMEs). They might come from very different backgrounds, with very different programming languages and tools, and might own different parts of the process. (Location 919)
  • It’s important to structure your workloads and set up your infrastructure in such a way that different contributors can work using tools that they are comfortable with, instead of one group of contributors forcing their tools onto other groups. (Location 923)
  • Code should be documented. Code, data, and artifacts should be versioned. Models should be sufficiently reproducible so that even when the original authors are not around, other contributors can have sufficient contexts to build on their work. When a problem occurs, different contributors should be able to work together to identify the problem and implement a solution without finger-pointing. (Location 925)
  • Adaptability: To adapt to shifting data distributions and business requirements, the system should have some capacity for both discovering aspects for performance improvement and allowing updates without service interruption. (Location 929)
  • Because ML systems are part code, part data, and data can change quickly, ML systems need to be able to evolve quickly. This is tightly linked to maintainability. (Location 933)
  • Iterative Process: Developing an ML system is an iterative and, in most cases, never-ending process. Once a system is put into production, it’ll need to be continually monitored and updated. (Location 938)
  • Before deploying my first ML system, I thought the process would be linear and straightforward. I thought all I had to do was to collect data, train a model, deploy that model, and be done. However, I soon realized that the process looks more like a cycle with a lot of back and forth between different steps. (Location 940)
  • One possible workflow:
    1. Choose a metric to optimize. For example, you might want to optimize for impressions—the number of times an ad is shown.
    2. Collect data and obtain labels.
    3. Engineer features.
    4. Train models.
    5. During error analysis, you realize that errors are caused by the wrong labels, so you relabel the data.
    6. Train the model again.
    7. During error analysis, you realize that your model always predicts that an ad shouldn’t be shown, because 99.99% of the data you have have NEGATIVE labels (ads that shouldn’t be shown). So you have to collect more data of ads that should be shown.
    8. Train the model again.
    9. The model performs well on your existing test data, which is by now two months old. However, it performs poorly on the data from yesterday. Your model is now stale, so you need to update it on more recent data.
    10. Train the model again.
    11. Deploy the model.
    12. The model seems to be performing well, but then the businesspeople come knocking on your door asking why the revenue is decreasing. It turns out the ads are being shown, but few people click on them. So you want to change your model to optimize for ad click-through rate instead.
    13. Go to step 1.
    Figure 2-2 shows an oversimplified representation of what the iterative process for developing ML systems in production looks like from the perspective of a data scientist or an ML engineer. (Location 945)
  • Figure 2-2. The process of developing an ML system looks more like a cycle with a lot of back and forth between different steps. (Location 965)
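The class-imbalance problem in step 7 of the workflow above (99.99% NEGATIVE labels) is easy to check for before training; a minimal, hypothetical sketch:

```python
from collections import Counter

# Hypothetical labels for the ad-display task: 1 = "show the ad", 0 = "don't show".
labels = [0] * 9999 + [1] * 1

counts = Counter(labels)
positive_rate = counts[1] / len(labels)
print(counts)                                  # Counter({0: 9999, 1: 1})
print(f"positive rate: {positive_rate:.4%}")   # 0.0100%

# A positive rate this low suggests collecting more positive examples
# (or resampling/reweighting) before retraining.
```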
  • A project starts with scoping the project, laying out goals, objectives, and constraints. Stakeholders should be identified and involved. Resources should be estimated and allocated. (Location 968)
  • Model performance needs to be evaluated against business goals and analyzed to generate business insights. These insights can then be used to eliminate unproductive projects or scope out new projects. (Location 1000)
  • In general, when there are multiple objectives, it’s a good idea to decouple them first because it makes model development and maintenance easier. First, it’s easier to tweak your system without retraining models, as previously explained. Second, it’s easier for maintenance since different objectives might need different maintenance schedules. Spamming techniques evolve much faster than the way post quality is perceived, so spam filtering systems need updates at a much higher frequency than quality-ranking systems. (Location 1195)
  • Progress in the last decade shows that the success of an ML system depends largely on the data it was trained on. Instead of focusing on improving ML algorithms, most companies focus on managing and improving their data. (Location 1202)
  • In the mind-over-data camp, there’s Dr. Judea Pearl, a Turing Award winner best known for his work on causal inference and Bayesian networks. The introduction to his book The Book of Why is entitled “Mind over Data,” in which he emphasizes: “Data is profoundly dumb.” In one of his more controversial posts on Twitter in 2020, he expressed his strong opinion against ML approaches that rely heavily on data and warned that data-centric ML people might be out of a job in three to five years: “ML will not be the same in 3–5 years, and ML folks who continue to follow the current data-centric paradigm will find themselves outdated, if not jobless. Take note.” (Location 1212)
  • Professor Christopher Manning, director of the Stanford Artificial Intelligence Laboratory, argued that huge computation and a massive amount of data with a simple learning algorithm create incredibly bad learners. The structure allows us to design systems that can learn more from less data. (Location 1218)
  • “The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin… Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.” (Location 1226)
  • The debate isn’t about whether finite data is necessary, but whether it’s sufficient. The term finite here is important, because if we had infinite data, it might be possible for us to look up the answer. Having a lot of data is different from having infinite data. (Location 1237)
  • When working with data in production, you usually work with data across multiple processes and services. For example, you might have a feature engineering service that computes features from raw data, and a prediction service to generate predictions based on computed features. This means that you’ll have to pass computed features from the feature engineering service to the prediction service. (Location 1358)
  • One source is user input data, data explicitly input by users. User input can be text, images, videos, uploaded files, etc. If it’s even remotely possible for users to input wrong data, they are going to do it. As a result, user input data can be easily malformatted. Text might be too long or too short. Where numerical values are expected, users might accidentally enter text. If you let users upload files, they might upload files in the wrong formats. User input data requires more heavy-duty checking and processing. (Location 1374)
  • Another source is system-generated data. This is the data generated by different components of your systems, which include various types of logs and system outputs such as model predictions. (Location 1381)
  • Logs don’t need to be processed as soon as they arrive, the way you would want to process user input data. For many use cases, it’s acceptable to process logs periodically, such as hourly or even daily. However, you might still want to process your logs fast to be able to detect and be notified whenever something interesting happens. (Location 1391)
  • Because debugging ML systems is hard, it’s a common practice to log everything you can. This means that your volume of logs can grow very, very quickly. This leads to two problems. The first is that it can be hard to know where to look because signals are lost in the noise. There have been many services that process and analyze logs, such as Logstash, Datadog, Logz.io, etc. Many of them use ML models to help you process and make sense of your massive number of logs. (Location 1394)
  • The second problem is how to store a rapidly growing number of logs. Luckily, in most cases, you only have to store logs for as long as they are useful and can discard them when they are no longer relevant for you to debug your current system. If you don’t have to access your logs frequently, they can also be stored in low-access storage that costs much less than higher-frequency-access storage. (Location 1398)
  • The system also generates data to record users’ behaviors, such as clicking, choosing a suggestion, scrolling, zooming, ignoring a pop-up, or spending an unusual amount of time on certain pages. Even though this is system-generated data, it’s still considered part of user data and might be subject to privacy regulations. (Location 1403)
  • Then there’s the wonderfully weird world of third-party data. First-party data is the data that your company already collects about your users or customers. Second-party data is the data collected by another company on their own customers that they make available to you, though you’ll probably have to pay for it. Third-party data companies collect data on the public who aren’t their direct customers. (Location 1413)
  • Data of all kinds can be bought, such as social media activities, purchase history, web browsing habits, car rentals, and political leaning for different demographic groups getting as granular as men, age 25–34, working in tech, living in the Bay Area. From this data, you can infer information such as people who like brand A also like brand B. This data can be especially helpful for systems such as recommender systems to generate results relevant to users’ interests. Third-party data is usually sold after being cleaned and processed by vendors. (Location 1423)
  • It’s important to think about how the data will be used in the future so that the format you use will make sense. Here are some of the questions you might want to consider: How do I store multimodal data, e.g., a sample that might contain both images and texts? Where do I store my data so that it’s cheap and still fast to access? How do I store complex models so that they can be loaded and run correctly on different hardware? (Location 1438)
  • The process of converting a data structure or object state into a format that can be stored or transmitted and reconstructed later is data serialization. (Location 1444)
  • There are many, many data serialization formats. When considering a format to work with, you might want to consider different characteristics such as human readability, access patterns, and whether it’s based on text or binary, which influences the size of its files. (Location 1445)
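As a small illustration of serialization with a human-readable, text-based format, here is a round trip through JSON; the record fields are made up.

```python
import json

# An in-memory object is converted to a storable/transmittable format
# and reconstructed later.
record = {"user_id": 123, "query": "wireless headphones", "clicked": True}

serialized = json.dumps(record)    # Python object -> JSON string
restored = json.loads(serialized)  # JSON string -> Python object

assert restored == record
print(serialized)
```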
  • Because JSON is ubiquitous, the pain it causes can also be felt everywhere. Once you’ve committed the data in your JSON files to a schema, it’s pretty painful to retrospectively go back to change the schema. (Location 1507)
  • The two formats that are common and represent two distinct paradigms are CSV and Parquet. CSV (comma-separated values) is row-major, which means consecutive elements in a row are stored next to each other in memory. Parquet is column-major, which means consecutive elements in a column are stored next to each other. (Location 1511)
  • Because modern computers process sequential data more efficiently than nonsequential data, if a table is row-major, accessing its rows will be faster than accessing its columns in expectation. This means that for row-major formats, accessing data by rows is expected to be faster than accessing data by columns. (Location 1521)
  • Column-major formats allow flexible column-based reads, especially if your data is large with thousands, if not millions, of features. Consider if you have data about ride-sharing transactions that has 1,000 features but you only want 4 features: time, location, distance, price. With column-major formats, you can read the four columns corresponding to these four features directly. (Location 1529)
  • Row-major formats allow faster data writes. Consider the situation when you have to keep adding new individual examples to your data. For each individual example, it’d be much faster to write it to a file where your data is already in a row-major format. (Location 1533)
  • Row-major formats are better when you have to do a lot of writes, whereas column-major ones are better when you have to do a lot of column-based reads. (Location 1535)
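A short sketch of the two paradigms with pandas; the file and column names are illustrative, and Parquet I/O assumes pyarrow (or fastparquet) is installed.

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2021-01-01 10:00", "2021-01-01 10:01"],
    "location": ["SF", "Oakland"],
    "distance_km": [3.2, 7.8],
    "price": [12.5, 21.0],
})

df.to_csv("rides.csv", index=False)   # row-major text format
df.to_parquet("rides.parquet")        # column-major binary format

# A column-major format lets you read only the columns you need,
# instead of scanning every row's full record.
subset = pd.read_parquet("rides.parquet", columns=["time", "price"])
print(subset)
```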
  • One subtle point that a lot of people don’t pay attention to, which leads to misuses of pandas, is that this library is built around the columnar format. pandas is built around DataFrame, a concept inspired by R’s Data Frame, which is column-major. (Location 1538)
  • In NumPy, the major order can be specified. When an ndarray is created, it’s row-major by default if you don’t specify the order. People coming to pandas from NumPy tend to treat DataFrame the way they would ndarray, e.g., trying to access data by rows, and find DataFrame slow. (Location 1543)
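A small illustration of the point about access patterns; the array shapes and flag checks are only there to show the default orders.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 4), columns=list("abcd"))

# DataFrame is organized around columns, so column access is the cheap path...
col = df["a"]
# ...while grabbing a whole row cuts across every column.
row = df.iloc[0]

# In NumPy the memory order is explicit: row-major ("C") by default,
# column-major ("F", Fortran order) on request.
arr_c = np.zeros((3, 3), order="C")
arr_f = np.zeros((3, 3), order="F")
print(arr_c.flags["C_CONTIGUOUS"], arr_f.flags["F_CONTIGUOUS"])  # True True
```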
  • Text files are files that are in plain text, which usually means they are human-readable. Binary files are the catchall that refers to all nontext files. (Location 1571)
  • The idea is simple but powerful. In this model, data is organized into relations; each relation is a set of tuples. A table is an accepted visual representation of a relation, and each row of a table makes up a tuple. (Location 1601)
  • One major downside of normalization is that your data is now spread across multiple relations. You can join the data from different relations back together, but joining can be expensive for large tables. (Location 1659)
  • Databases built around the relational data model are relational databases. Once you’ve put data in your databases, you’ll want a way to retrieve it. The language that you can use to specify the data that you want from a database is called a query language. The most popular query language for relational databases today is SQL. (Location 1661)
  • Even though inspired by the relational model, the data model behind SQL has deviated from the original relational model. For example, SQL tables can contain row duplicates, whereas true relations can’t contain duplicates. (Location 1666)
  • The most important thing to note about SQL is that it’s a declarative language, as opposed to Python, which is an imperative language. (Location 1668)
  • In the imperative paradigm, you specify the steps needed for an action and the computer executes these steps to return the outputs. In the declarative paradigm, you specify the outputs you want, and the computer figures out the steps needed to get you the queried outputs. (Location 1669)
  • With an SQL database, you specify the pattern of data you want—the tables you want the data from, the conditions the results must meet, the basic data transformations such as join, sort, group, aggregate, etc.—but not how to retrieve the data. It is up to the database system to decide how to break the query into different parts, what methods to use to execute each part of the query, and the order in which different parts of the query should be executed. (Location 1671)
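To make the declarative/imperative contrast concrete, here is a sketch using Python's built-in sqlite3 module; the table and values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (city TEXT, price REAL)")
conn.executemany("INSERT INTO rides VALUES (?, ?)",
                 [("SF", 25.0), ("SF", 40.0), ("NYC", 30.0)])

# Declarative: state WHAT you want; the engine decides how to retrieve it.
avg_sf = conn.execute(
    "SELECT AVG(price) FROM rides WHERE city = 'SF'").fetchone()[0]

# Imperative: spell out the steps yourself.
rows = conn.execute("SELECT city, price FROM rides").fetchall()
sf_prices = [price for city, price in rows if city == "SF"]
avg_sf_imperative = sum(sf_prices) / len(sf_prices)

print(avg_sf, avg_sf_imperative)  # both 32.5
```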
  • With certain added features, SQL can be Turing-complete, which means that, in theory, SQL can be used to solve any computation problem (without making any guarantee about the time or memory required). (Location 1675)
  • Query optimization is one of the most challenging problems in database systems, and normalization means that data is spread out on multiple relations, which makes joining it together even harder. Even though developing a query optimizer is hard, the good news is that you generally only need one query optimizer and all your applications can leverage it. (Location 1684)
  • With a declarative ML system, users only need to declare the features’ schema and the task, and the system will figure out the best model to perform that task with the given features. Users won’t have to write code to construct, train, and tune models. (Location 1692)
  • Declarative ML systems today abstract away the model development part, and as we’ll cover in the next six chapters, with models being increasingly commoditized, model development is often the easier part. The hard part lies in feature engineering, data processing, model evaluation, data shift detection, continual learning, and so on. (Location 1737)
  • The relational data model has been able to generalize to a lot of use cases, from ecommerce to finance to social networks. However, for certain use cases, this model can be restrictive. For example, it demands that your data follows a strict schema, and schema management is painful. In a survey by Couchbase in 2014, frustration with schema management was the #1 reason for the adoption of their nonrelational database. It can also be difficult to write and execute SQL queries for specialized applications. (Location 1743)
  • The latest movement against the relational data model is NoSQL. Originally started as a hashtag for a meetup to discuss nonrelational databases, NoSQL has been retroactively reinterpreted as Not Only SQL, as many NoSQL data systems also support relational models. (Location 1750)
  • Two major types of nonrelational models are the document model and the graph model. The document model targets use cases where data comes in self- contained documents and relationships between one document and another are rare. The graph model goes in the opposite direction, targeting use cases where relationships between data items are common and important. We’ll examine each of these two models, starting with the document model. (Location 1753)
  • The document model is built around the concept of “document.” A document is often a single continuous string, encoded as JSON, XML, or a binary format like BSON (Binary JSON). (Location 1757)
  • A collection of documents could be considered analogous to a table in a relational database, and a document analogous to a row. In fact, you can convert a relation into a collection of documents that way. (Location 1762)
  • A collection of documents is much more flexible than a table. All rows in a table must follow the same schema (e.g., have the same sequence of columns), while documents in the same collection can have completely different schemas. (Location 1768)
  • Because the document model doesn’t enforce a schema, it’s often referred to as schemaless. This is misleading because, as discussed previously, data stored in documents will be read later. The application that reads the documents usually assumes some kind of structure of the documents. Document databases just shift the responsibility of assuming structures from the application that writes the data to the application that reads the data. (Location 1860)
  • To retrieve information about a book, you’ll have to query multiple tables. In the document model, all information about a book can be stored in a document, making it much easier to retrieve. (Location 1868)
  • However, compared to the relational model, it’s harder and less efficient to execute joins across documents compared to across tables. For example, imagine you want to find all books whose prices are below $25 and return all the documents containing the books with prices below $25. (Location 1870)
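A minimal sketch of the document model: each book is one self-contained document (shown here as a Python dict that would be stored as JSON or BSON), and documents in the same collection can differ in shape. The titles, fields, and prices are made up.

```python
books = [
    {"title": "Book A", "authors": ["Author One"], "price": 45.99,
     "formats": ["hardcover", "ebook"]},
    {"title": "Book B", "authors": ["Author Two"], "price": 23.99},  # different shape
]

# "All books priced below $25" means scanning whole documents
# rather than querying a normalized price table.
cheap = [doc for doc in books if doc.get("price", float("inf")) < 25]
print([doc["title"] for doc in cheap])  # ['Book B']
```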
  • A graph consists of nodes and edges, where the edges represent the relationships between the nodes. A database that uses graph structures to store its data is called a graph database. If in document databases, the content of each document is the priority, then in graph databases, the relationships between data items are the priority. (Location 1879)
  • Structured data follows a predefined data model, also known as a data schema. For example, the data model might specify that each data item consists of two values: the first value, “name,” is a string of at most 50 characters, and the second value, “age,” is an 8-bit integer in the range between 0 and 200. The predefined structure makes your data easier to analyze. (Location 1895)
  • The disadvantage of structured data is that you have to commit your data to a predefined schema. If your schema changes, you’ll have to retrospectively update all your data, often causing mysterious bugs in the process. (Location 1902)
  • Because business requirements change over time, committing to a predefined data schema can become too restricting. Or you might have data from multiple data sources that are beyond your control, and it’s impossible to make them follow the same schema. This is where unstructured data becomes appealing. Unstructured data doesn’t adhere to a predefined data schema. It’s usually text but can also be numbers, dates, images, audio, etc. For example, a text file of logs generated by your ML model is unstructured data. (Location 1908)
  • A repository for storing structured data is called a data warehouse. A repository for storing unstructured data is called a data lake. Data lakes are usually used to store raw data before processing. Data warehouses are used to store data that has been processed into formats ready to be used. (Location 1919)
  • Data formats and data models specify the interface for how users can store and retrieve data. Storage engines, also known as databases, are the implementation of how data is stored and retrieved on machines. (Location 1937)
  • Traditionally, a transaction refers to the action of buying or selling something. In the digital world, a transaction refers to any kind of action: tweeting, ordering a ride through a ride-sharing service, uploading a new model, watching a YouTube video, and so on. Even though these different transactions involve different types of data, the way they’re processed is similar across applications. The transactions are inserted as they are generated, and occasionally updated when something changes, or deleted when they are no longer needed. This type of processing is known as online transaction processing (OLTP). (Location 1945)
  • Because these transactions often involve users, they need to be processed fast (low latency) so that they don’t keep users waiting. The processing method needs to have high availability—that is, the processing system needs to be available any time a user wants to make a transaction. If your system can’t process a transaction, that transaction won’t go through. (Location 1954)
  • Transactional databases are designed to process online transactions and satisfy the low latency, high availability requirements. When people hear transactional databases, they usually think of ACID (atomicity, consistency, isolation, durability). (Location 1956)
  • Atomicity: To guarantee that all the steps in a transaction are completed successfully as a group. If any step in the transaction fails, all other steps must fail also. For example, if a user’s payment fails, you don’t want to still assign a driver to that user. (Location 1959)
  • Consistency: To guarantee that all the transactions coming through must follow predefined rules. For example, a transaction must be made by a valid user. (Location 1962)
  • Isolation: To guarantee that two transactions that happen at the same time behave as if they were isolated. Two users accessing the same data won’t change it at the same time. For example, you don’t want two users to book the same driver at the same time. (Location 1963)
  • Durability: To guarantee that once a transaction has been committed, it will remain committed even in the case of a system failure. For example, after you’ve ordered a ride and your phone dies, you still want your ride to come. (Location 1965)
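A small sketch of atomicity using Python's sqlite3: either both steps of the booking commit, or neither does. The schema, the simulated failure, and the values are all illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assignments (user TEXT, driver TEXT)")

def charge_card(user: str, amount: float) -> None:
    raise RuntimeError("payment declined")  # simulate the failing step

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO assignments VALUES ('alice', 'driver_42')")
        charge_card("alice", 12.5)  # fails, so the assignment above is rolled back too
except RuntimeError:
    pass

# No driver was assigned to the failed payment.
print(conn.execute("SELECT COUNT(*) FROM assignments").fetchone()[0])  # 0
```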
  • Because each transaction is often processed as a unit separately from other transactions, transactional databases are often row-major. This also means that transactional databases might not be efficient for questions such as “What’s the average price for all the rides in September in San Francisco?” This kind of analytical question requires aggregating data in columns across multiple rows of data. Analytical databases are designed for this purpose. They are efficient with queries that allow you to look at data from different viewpoints. We call this type of processing online analytical processing (OLAP). (Location 1973)
  • the separation of transactional and analytical databases was due to limitations of technology— it was hard to have databases that could handle both transactional and analytical queries efficiently. However, this separation is being closed. Today, we have transactional databases that can handle analytical queries, such as CockroachDB. We also have analytical databases that can handle transactional queries, such as Apache Iceberg and DuckDB. (Location 1980)
  • In the traditional OLTP or OLAP paradigms, storage and processing are tightly coupled—how data is stored is also how data is processed. This may result in the same data being stored in multiple databases and using different processing engines to solve different types of queries. An interesting paradigm in the last decade has been to decouple storage from processing (also known as compute), as adopted by many data vendors including Google’s BigQuery, Snowflake, IBM, and Teradata. In this paradigm, the data can be stored in the same place, with a processing layer on top that can be optimized for different types of queries. (Location 1990)
  • Online processing means data is immediately available for input/output. Nearline, which is short for near-online, means data is not immediately available but can be made online quickly without human intervention. Offline means data is not immediately available and requires some human intervention to become online. (Location 1999)
  • In the early days of the relational data model, data was mostly structured. When data is extracted from different sources, it’s first transformed into the desired format before being loaded into the target destination such as a database or a data warehouse. This process is called ETL, which stands for extract, transform, and load. (Location 2004)
  • ETL refers to the general purpose processing and aggregating of data into the shape and the format that you want. (Location 2009)
  • Extract is extracting the data you want from all your data sources. Some of them will be corrupted or malformatted. In the extracting phase, you need to validate your data and reject the data that doesn’t meet your requirements. For rejected data, you might have to notify the sources. Since this is the first step of the process, doing it correctly can save you a lot of time downstream. (Location 2010)
  • Transform is the meaty part of the process, where most of the data processing is done. You might want to join data from multiple sources and clean it. You might want to standardize the value ranges (e.g., one data source might use “Male” and “Female” for genders, but another uses “M” and “F” or “1” and “2”). You can apply operations such as transposing, deduplicating, sorting, aggregating, deriving new features, more data validating, etc. (Location 2013)
  • Load is deciding how and how often to load your transformed data into the target destination, which can be a file, a database, or a data warehouse. (Location 2017)
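A toy ETL sketch in pandas under made-up names: extract from two sources and reject malformed rows, transform by standardizing values and deduplicating, then load to a Parquet file (which needs pyarrow or fastparquet installed).

```python
import pandas as pd

# Extract: pull raw records and reject rows that fail validation.
source_a = pd.DataFrame({"user_id": [1, 2, 2], "gender": ["Male", "Female", "Female"]})
source_b = pd.DataFrame({"user_id": [3, 4], "gender": ["M", None]})
raw = pd.concat([source_a, source_b], ignore_index=True)
valid = raw.dropna(subset=["gender"])

# Transform: standardize value ranges and deduplicate.
gender_map = {"Male": "M", "Female": "F", "M": "M", "F": "F"}
transformed = valid.assign(gender=valid["gender"].map(gender_map)).drop_duplicates()

# Load: write the transformed data to the target destination.
transformed.to_parquet("users.parquet")
print(transformed)
```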
  • When the internet first became ubiquitous and hardware had just become so much more powerful, collecting data suddenly became so much easier. The amount of data grew rapidly. Not only that, but the nature of data also changed. The number of data sources expanded, and data schemas evolved. Finding it difficult to keep data structured, some companies had this idea: “Why not just store all data in a data lake so we don’t have to deal with schema changes? Whichever application needs data can just pull out raw data from there and process it.” This process of loading data into storage first then processing it later is sometimes called ELT (extract, load, transform). This paradigm allows for the fast arrival of data since there’s little processing needed before data is stored. However, as data keeps on growing, this idea becomes less attractive. It’s inefficient to search through a massive amount of raw data for the data that you want. At the same time, as companies switch to running applications on the cloud and infrastructures become standardized, data structures also become standardized. Committing data to a predefined schema becomes more feasible. (Location 2021)
  • When data is passed from one process to another, we say that the data flows from one process to another, which gives us a dataflow. There are three main modes of dataflow: data passing through databases; data passing through services using requests (e.g., the POST/GET requests provided by REST and RPC APIs); and data passing through a real-time transport like Apache Kafka and Amazon Kinesis. (Location 2040)
  • The easiest way to pass data between two processes is through databases, which we’ve discussed in the section “Data Storage Engines and Processing”. For example, to pass data from process A to process B, process A can write that data into a database, and process B simply reads from that database. This mode, however, doesn’t always work because of two reasons. First, it requires that both processes must be able to access the same database. This might be infeasible, especially if the two processes are run by two different companies. Second, it requires both processes to access data from databases, and read/write from databases can be slow, making it unsuitable for applications with strict latency requirements—e.g., almost all consumer-facing applications. (Location 2046)
  • One way to pass data between two processes is to send data directly through a network that connects these two processes. To pass data from process B to process A, process A first sends a request to process B that specifies the data A needs, and B returns the requested data through the same network. Because processes communicate through requests, we say that this is request-driven. This mode of data passing is tightly coupled with the service-oriented architecture. A service is a process that can be accessed remotely, e.g., through a network. In this example, B is exposed to A as a service that A can send requests to. For B to be able to request data from A, A will also need to be exposed to B as a service. (Location 2055)
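A minimal sketch of request-driven data passing with the requests library; the service URL, endpoint, and response fields are hypothetical.

```python
import requests

# Process A asks service B over the network for the data it needs.
resp = requests.get(
    "http://driver-management.internal/predicted-drivers",  # hypothetical service
    params={"area": "downtown", "horizon_minutes": 1},
    timeout=1.0,
)
resp.raise_for_status()
predicted_drivers = resp.json()["num_drivers"]  # hypothetical field
print(predicted_drivers)
```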
  • Two services in communication with each other can also be parts of the same application. Structuring different components of your application as separate services allows each component to be developed, tested, and maintained independently of one another. Structuring an application as separate services gives you a microservice architecture. (Location 2065)
  • To put the microservice architecture in the context of ML systems, imagine you’re an ML engineer working on the price optimization problem for a company that owns a ride-sharing application like Lyft. In reality, Lyft has hundreds of services in its microservice architecture, but for the sake of simplicity, let’s consider only three: a driver management service, which predicts how many drivers will be available in the next minute in a given area; a ride management service, which predicts how many rides will be requested in the next minute in a given area; and a price optimization service, which predicts the optimal price for each ride. The price for a ride should be low enough for riders to be willing to pay, yet high enough for drivers to be willing to drive and for the company to make a profit. Because the price depends on supply (the available drivers) and demand (the requested rides), the price optimization service needs data from both the driver management and ride management services. Each time a user requests a ride, the price optimization service requests the predicted number of rides and predicted number of drivers to predict the optimal price for this ride. (Location 2068)
  • The most popular styles of requests used for passing data through networks are REST (representational state transfer) and RPC (remote procedure call). Their detailed analysis is beyond the scope of this book, but one major difference is that REST was designed for requests over networks, whereas RPC “tries to make a request to a remote network service look the same as calling a function or method in your programming language.” Because of this, “REST seems to be the predominant style for public APIs. The main focus of RPC frameworks is on requests between services owned by the same organization, typically within the same data center.” (Location 2084)
  • Implementations of a REST architecture are said to be RESTful. Even though many people think of REST as HTTP, REST doesn’t exactly mean HTTP because HTTP is just an implementation of REST. (Location 2091)
  • To understand the motivation for real-time transports, let’s go back to the preceding example of the ride-sharing app with three simple services: driver management, ride management, and price optimization. In the last section, we discussed how the price optimization service needs data from the ride and driver management services to predict the optimal price for each ride. Now, imagine that the driver management service also needs to know the number of rides from the ride management service to know how many drivers to mobilize. It also wants to know the predicted prices from the price optimization service to use them as incentives for potential drivers (e.g., if you get on the road now you can get a 2x surge charge). Similarly, the ride management service might also want data from the driver management and price optimization services. If we pass data through services as discussed in the previous section, each of these services needs to send requests to the other two services, as shown in Figure 3-8. (Location 2097)
  • With only three services, data passing is already getting complicated. Imagine having hundreds, if not thousands of services like what major internet companies have. Interservice data passing can blow up and become a bottleneck, slowing down the entire system. Request-driven data passing is synchronous: the target service has to listen to the request for the request to go through. If the price optimization service requests data from the driver management service and the driver management service is down, the price optimization service will keep resending the request until it times out. And if the price optimization service is down before it receives a response, the response will be lost. A service that is down can cause all services that require data from it to be down. (Location 2109)
  • What if there’s a broker that coordinates data passing among services? Instead of having services request data directly from each other and creating a web of complex interservice data passing, each service only has to communicate with the broker, as shown in Figure 3-9. For example, instead of having other services request the driver management services for the predicted number of drivers for the next minute, what if whenever the driver management service makes a prediction, this prediction is broadcast to a broker? Whichever service wants data from the driver management service can check that broker for the most recent predicted number of drivers. Similarly, whenever the price optimization service makes a prediction about the surge charge for the next minute, this prediction is broadcast to the broker. (Location 2117)
  • Technically, a database can be a broker—each service can write data to a database and other services that need the data can read from that database. However, as mentioned in the section “Data Passing Through Databases”, reading and writing from databases are too slow for applications with strict latency requirements. Instead of using databases to broker data, we use in-memory storage to broker data. Real-time transports can be thought of as in-memory storage for data passing among services. (Location 2125)
  • A piece of data broadcast to a real-time transport is called an event. This architecture is, therefore, also called event-driven. A real-time transport is sometimes called an event bus. Request-driven architecture works well for systems that rely more on logic than on data. Event-driven architecture works better for systems that are data-heavy. (Location 2129)
  • The two most common types of real-time transports are pubsub, which is short for publish-subscribe, and message queue. In the pubsub model, any service can publish to different topics in a real-time transport, and any service that subscribes to a topic can read all the events in that topic. The services that produce data don’t care about what services consume their data. Pubsub solutions often have a retention policy—data will be retained in the real-time transport for a certain period of time (e.g., seven days) before being deleted or moved to a permanent storage (like Amazon S3). (Location 2132)
  • In a message queue model, an event often has intended consumers (an event with intended consumers is called a message), and the message queue is responsible for getting the message to the right consumers. Examples of pubsub solutions are Apache Kafka and Amazon Kinesis. Examples of message queues are Apache RocketMQ and RabbitMQ. Both paradigms have gained a lot of traction in the last few years. Figure 3-11 shows some of the companies that use Apache Kafka and RabbitMQ. (Location 2139)
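A hedged sketch of the pubsub pattern using the kafka-python client; it assumes a broker running at localhost:9092, and the topic name and payload are made up.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# The driver management service publishes its latest prediction to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("predicted-drivers", {"area": "downtown", "num_drivers": 42})
producer.flush()

# Any service that cares subscribes to the topic and reads the events.
consumer = KafkaConsumer(
    "predicted-drivers",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for event in consumer:
    print(event.value)
    break
```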
  • Once your data arrives in data storage engines like databases, data lakes, or data warehouses, it becomes historical data. This is opposed to streaming data (data that is still streaming in). Historical data is often processed in batch jobs—jobs that are kicked off periodically. (Location 2153)
  • When data is processed in batch jobs, we refer to it as batch processing. Batch processing has been a research subject for many decades, and companies have come up with distributed systems like MapReduce and Spark to process batch data efficiently. (Location 2159)
  • When you have data in real-time transports like Apache Kafka and Amazon Kinesis, we say that you have streaming data. Stream processing refers to doing computation on streaming data. Computation on streaming data can also be kicked off periodically, but the periods are usually much shorter than the periods for batch jobs (e.g., every five minutes instead of every day). Computation on streaming data can also be kicked off whenever the need arises. (Location 2161)
  • Stream processing, when done right, can give low latency because you can process data as soon as data is generated, without having to first write it into databases. Many people believe that stream processing is less efficient than batch processing because you can’t leverage tools like MapReduce or Spark. This is not always the case, for two reasons. First, streaming technologies like Apache Flink are proven to be highly scalable and fully distributed, which means they can do computation in parallel. Second, the strength of stream processing is in stateful computation. (Location 2167)
  • Because batch processing happens much less frequently than stream processing, in ML, batch processing is usually used to compute features that change less often, such as drivers’ ratings (if a driver has had hundreds of rides, their rating is less likely to change significantly from one day to the next). Batch features— features extracted through batch processing— are also known as static features. Stream processing is used to compute features that change quickly, such as how many drivers are available right now, how many rides have been requested in the last minute, how many rides will be finished in the next two minutes, the median price of the last 10 rides in this area, etc. Features about the current state of the system like these are important to make the optimal price predictions. Streaming features— features extracted through stream processing— are also known as dynamic features. (Location 2174)
  • For many problems, you need not only batch features or streaming features, but both. You need infrastructure that allows you to process streaming data as well as batch data and join them together to feed into your ML models. (Location 2181)
  • To do computation on data streams, you need a stream computation engine (the way Spark and MapReduce are batch computation engines). For simple streaming computation, you might be able to get away with the built- in stream computation capacity of real- time transports like Apache Kafka, but Kafka stream processing is limited in its ability to deal with various data sources. (Location 2184)
  • For ML systems that leverage streaming features, the streaming computation is rarely simple. The number of stream features used in an application such as fraud detection and credit scoring can be in the hundreds, if not thousands. The stream feature extraction logic can require complex queries with join and aggregation along different dimensions. To extract these features requires efficient stream processing engines. For this purpose, you might want to look into tools like Apache Flink, KSQL, and Spark Streaming. Of these three engines, Apache Flink and KSQL are more recognized in the industry and provide a nice SQL abstraction for data scientists. (Location 2187)
  • Stream processing is more difficult because the data amount is unbounded and the data comes in at variable rates and speeds. It’s easier to make a stream processor do batch processing than to make a batch processor do stream processing. Apache Flink’s core maintainers have been arguing for years that batch processing is a special case of stream processing. (Location 2191)
  • Building a state- of- the- art model is interesting. Spending days wrangling with a massive amount of malformatted data that doesn’t even fit into your machine’s memory is frustrating. (Location 2304)
  • Data is messy, complex, unpredictable, and potentially treacherous. If not handled properly, it can easily sink your entire ML operation. But this is precisely the reason why data scientists and ML engineers should learn how to handle data well, saving us time and headache down the road. (Location 2305)
  • We use the term “training data” instead of “training dataset” because “dataset” denotes a set that is finite and stationary. (Location 2312)
  • Like other steps in building ML systems, creating training data is an iterative process. As your model evolves through a project lifecycle, your training data will likely also evolve. (Location 2314)
  • Sampling is an integral part of the ML workflow that is, unfortunately, often overlooked in typical ML coursework. Sampling happens in many steps of an ML project lifecycle, such as sampling from all possible real- world data to create training data; sampling from a given dataset to create splits for training, validation, and testing; or sampling from all possible events that happen within your ML system for monitoring purposes. (Location 2319)
  • Nonprobability sampling is when the selection of data isn’t based on any probability criteria. Here are some of the criteria for nonprobability sampling:
    - Convenience sampling: Samples of data are selected based on their availability. This sampling method is popular because, well, it’s convenient.
    - Snowball sampling: Future samples are selected based on existing samples. For example, to scrape legitimate Twitter accounts without having access to Twitter databases, you start with a small number of accounts, then you scrape all the accounts they follow, and so on.
    - Judgment sampling: Experts decide what samples to include.
    - Quota sampling: You select samples based on quotas for certain slices of data without any randomization. For example, when doing a survey, you might want 100 responses from each of the age groups: under 30 years old, between 30 and 60 years old, and above 60 years old, regardless of the actual age distribution. (Location 2335)
  • The samples selected by nonprobability criteria are not representative of the real- world data and therefore are riddled with selection biases. 2 Because of these biases, you might think that it’s a bad idea to select data to train ML models using this family of sampling methods. You’re right. Unfortunately, in many cases, the selection of data for ML models is still driven by convenience. (Location 2350)
  • Language models are often trained not with data that is representative of all possible texts but with data that can be easily collected— Wikipedia, Common Crawl, Reddit. (Location 2356)
  • In the simplest form of random sampling, you give all samples in the population equal probabilities of being selected. (Location 2371)
  • To avoid the drawback of simple random sampling, you can first divide your population into the groups that you care about and sample from each group separately. (Location 2380)
  • For example, if you know that a certain subpopulation of data, such as more recent data, is more valuable to your model and want it to have a higher chance of being selected, you can give it a higher weight. (Location 2397)
  • This also helps with the case when the data you have comes from a different distribution compared to the true data. For example, if in your data, red samples account for 25% and blue samples account for 75%, but you know that in the real world, red and blue have equal probability to happen, you can give red samples weights three times higher than blue samples. (Location 2398)
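A minimal sketch of weighted sampling with Python’s standard library, reusing the red/blue example above; the 3x weight corrects for red being underrepresented relative to the assumed 50/50 real-world split.

```python
import random

population = ["red"] * 25 + ["blue"] * 75          # observed data: 25% red, 75% blue
# Red is underrepresented 3x relative to the real world, so give red 3x the weight.
weights = [3.0 if color == "red" else 1.0 for color in population]

sample = random.choices(population, weights=weights, k=1000)
print(sample.count("red") / len(sample))           # roughly 0.5
```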
  • A common concept in ML that is closely related to weighted sampling is sample weights. Weighted sampling is used to select samples to train your model with, whereas sample weights are used to assign “weights” or “importance” to training samples. Samples with higher weights affect the loss function more, so changing sample weights can change your model’s decision boundaries significantly. (Location 2437)
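By contrast with weighted sampling, sample weights are passed to the training procedure itself. A small sketch using scikit-learn’s `sample_weight` argument; the toy data is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
# Give the last (say, most recent) sample 5x the importance in the loss.
sample_weight = np.array([1.0, 1.0, 1.0, 5.0])

model = LogisticRegression()
model.fit(X, y, sample_weight=sample_weight)
```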
  • Reservoir sampling is a fascinating algorithm that is especially useful when you have to deal with streaming data, which is usually what you have in production. Imagine you have an incoming stream of tweets and you want to sample a certain number, k, of tweets to do analysis or train a model on. You don’t know how many tweets there are, but you know you can’t fit them all in memory, which means you don’t know in advance the probability at which a tweet should be selected. You want to ensure that: Every tweet has an equal probability of being selected. You can stop the algorithm at any time and the tweets are sampled with the correct probability. (Location 2445)
  • The algorithm involves a reservoir, which can be an array, and consists of three steps:
    1. Put the first k elements into the reservoir.
    2. For each incoming nth element, generate a random number i such that 1 ≤ i ≤ n.
    3. If 1 ≤ i ≤ k, replace the ith element in the reservoir with the nth element. Else, do nothing. (Location 2454)
  • This means that each incoming nth element has a k/n probability of being in the reservoir. You can also prove that each element in the reservoir has a k/n probability of being there. This means that all samples have an equal chance of being selected. If we stop the algorithm at any time, all samples in the reservoir have been sampled with the correct probability. Figure 4-2 shows an illustrative example of how reservoir sampling works. (Location 2460)
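A minimal sketch of the three-step algorithm above; `reservoir_sample` is a hypothetical helper name.

```python
import random

def reservoir_sample(stream, k):
    """Return k items sampled uniformly from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)          # step 1: fill the reservoir
        else:
            i = random.randint(1, n)        # step 2: 1 <= i <= n
            if i <= k:
                reservoir[i - 1] = item     # step 3: replace the ith element
    return reservoir

tweets = (f"tweet_{j}" for j in range(1_000_000))
sample = reservoir_sample(tweets, k=10)
```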
  • Importance sampling is one of the most important sampling methods, not just in ML. It allows us to sample from a distribution when we only have access to another distribution. Imagine you have to sample x from a distribution P(x), but P(x) is really expensive, slow, or infeasible to sample from. However, you have a distribution Q(x) that is a lot easier to sample from. So you sample x from Q(x) instead and weigh this sample by P(x)/Q(x). Q(x) is called the proposal distribution or the importance distribution. Q(x) can be any distribution as long as Q(x) > 0 whenever P(x) ≠ 0. (Location 2470)
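A small sketch, assuming for the sake of the example that P is a standard normal we can evaluate but not sample, and Q is a wider normal that is easy to sample from; the samples from Q, reweighted by P(x)/Q(x), estimate an expectation under P.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# P: standard normal (pretend it's expensive to sample but cheap to evaluate).
p_pdf = lambda x: norm.pdf(x, loc=0, scale=1)
# Q: wider normal, easy to sample from.
q_pdf = lambda x: norm.pdf(x, loc=0, scale=2)

x = rng.normal(loc=0, scale=2, size=100_000)   # sample from Q
w = p_pdf(x) / q_pdf(x)                        # importance weights P(x)/Q(x)

# Estimate E_P[x^2] (true value: 1.0) using the weighted samples from Q.
estimate = np.mean(w * x**2)
print(estimate)
```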
  • One example where importance sampling is used in ML is policy- based reinforcement learning. Consider the case when you want to update your policy. You want to estimate the value functions of the new policy, but calculating the total rewards of taking an action can be costly because it requires considering all possible outcomes until the end of the time horizon after that action. However, if the new policy is relatively close to the old policy, you can calculate the total rewards based on the old policy instead and reweight them according to the new policy. The rewards from the old policy make up the proposal distribution. (Location 2483)
  • Despite the promise of unsupervised ML, most ML models in production today are supervised, which means that they need labeled data to learn from. The performance of an ML model still depends heavily on the quality and quantity of the labeled data it’s trained on. (Location 2489)
  • Data labeling has gone from being an auxiliary task to being a core function of many ML teams in production. (Location 2494)
  • Anyone who has ever had to work with data in production has probably felt this at a visceral level: acquiring hand labels for your data is difficult for many, many reasons. First, hand- labeling data can be expensive, especially if subject matter expertise is required. (Location 2499)
  • Hand labeling means that someone has to look at your data, which isn’t always possible if your data has strict privacy requirements. (Location 2506)
  • Third, hand labeling is slow. For example, accurate transcription of speech utterance at the phonetic level can take 400 times longer than the utterance duration. 7 So if you want to annotate 1 hour of speech, it’ll take 400 hours or almost 3 months for a person to do so. (Location 2510)
  • Slow labeling leads to slow iteration speed and makes your model less adaptive to changing environments and requirements. If the task changes or data changes, you’ll have to wait for your data to be relabeled before updating your model. (Location 2513)
  • Often, to obtain enough labeled data, companies have to use data from multiple sources and rely on multiple annotators who have different levels of expertise. These different data sources and annotators also have different levels of accuracy. This leads to the problem of label ambiguity or label multiplicity: what to do when there are multiple conflicting labels for a data instance. (Location 2521)
  • Disagreements among annotators are extremely common. The higher the level of domain expertise required, the higher the potential for annotating disagreement. 8 If one human expert thinks the label should be A while another believes it should be B, how do we resolve this conflict to obtain one single ground truth? If human experts can’t agree on a label, what does human- level performance even mean? (Location 2543)
  • To minimize the disagreement among annotators, it’s important to first have a clear problem definition. For example, in the preceding entity recognition task, some disagreements could have been eliminated if we had clarified that, in the case of multiple possible entities, annotators should pick the entity that comprises the longest substring. (Location 2547)
  • Indiscriminately using data from multiple sources, generated with different annotators, without examining their quality can cause your model to fail mysteriously. (Location 2555)
  • It’s good practice to keep track of the origin of each of your data samples as well as its labels, a technique known as data lineage. Data lineage helps you both flag potential biases in your data and debug your models. (Location 2563)
  • Hand- labeling isn’t the only source for labels. You might be lucky enough to work on tasks with natural ground truth labels. Tasks with natural labels are tasks where the model’s predictions can be automatically evaluated or partially evaluated by the system. (Location 2569)
  • The canonical example of tasks with natural labels is recommender systems. The goal of a recommender system is to recommend to users items relevant to them. Whether a user clicks on the recommended item or not can be seen as the feedback for that recommendation. (Location 2577)
  • Even if your task doesn’t inherently have natural labels, it might be possible to set up your system in a way that allows you to collect some feedback on your model. For example, if you’re building a machine translation system like Google Translate, you can have the option for the community to submit alternative translations for bad translations— these alternative translations can be used to train the next iteration of your models (though you might want to review these suggested translations first). Newsfeed ranking is not a task with inherent labels, but by adding the Like button and other reactions to each newsfeed item, Facebook is able to collect feedback on their ranking algorithm. (Location 2586)
  • Tasks with natural labels are fairly common in the industry. In a survey of 86 companies in my network, I found that 63% of them work with tasks with natural labels, as shown in Figure 4- 3. This doesn’t mean that 63% of tasks that can benefit from ML solutions have natural labels. What is more likely is that companies find it easier and cheaper to first start on tasks that have natural labels. (Location 2591)
  • For tasks with natural ground truth labels, the time it takes from when a prediction is served until when the feedback on it is provided is the feedback loop length. Tasks with short feedback loops are tasks where labels are generally available within minutes. Many recommender systems have short feedback loops. If the recommended items are related products on Amazon or people to follow on Twitter, the time between when the item is recommended until it’s clicked on, if it’s clicked on at all, is short. (Location 2603)
  • If you build a system to recommend clothes for users like the one Stitch Fix has, you wouldn’t get feedback until users have received the items and tried them on, which could be weeks later. (Location 2610)
  • If you want to extract labels from user feedback, it’s important to note that there are different types of user feedback. They can occur at different stages during a user journey on your app and differ by volume, strength of signal, and feedback loop length. (Location 2613)
  • Types of feedback a user on this application can provide might include clicking on a product recommendation, adding a product to cart, buying a product, rating, leaving a review, and returning a previously bought product. Clicking on a product happens much faster and more frequently (and therefore incurs a higher volume) than purchasing a product. However, buying a product is a much stronger signal on whether a user likes that product compared to just clicking on it. When building a product recommender system, many companies focus on optimizing for clicks, which give them a higher volume of feedback to evaluate their models. However, some companies focus on purchases, which gives them a stronger signal that is also more correlated to their business metrics (e.g., revenue from product sales). Both approaches are valid. There’s no definite answer to what type of feedback you should optimize for your use case, and it merits serious discussions between all stakeholders involved. (Location 2618)
  • For tasks with long feedback loops, natural labels might not arrive for weeks or even months. Fraud detection is an example of a task with long feedback loops. For a certain period of time after a transaction, users can dispute whether that transaction is fraudulent or not. (Location 2633)
  • Labels with long feedback loops are helpful for reporting a model’s performance on quarterly or yearly business reports. However, they are not very helpful if you want to detect issues with your models as soon as possible. If there’s a problem with your fraud detection model and it takes you months to catch, by the time the problem is fixed, all the fraudulent transactions your faulty model let through might have caused a small business to go bankrupt. (Location 2638)
  • Because of the challenges in acquiring sufficient high- quality labels, many techniques have been developed to address the problems that result. In this section, we will cover four of them: weak supervision, semi- supervision, transfer learning, and active learning. (Location 2644)
  • Comparison of the four methods for handling the lack of hand labels (Location 2650):

| Method | How | Ground truths required? |
| --- | --- | --- |
| Weak supervision | Leverages (often noisy) heuristics to generate labels | No, but a small number of labels are recommended to guide the development of heuristics |
| Semi-supervision | Leverages structural assumptions to generate labels | Yes, a small number of initial labels as seeds to generate more labels |
| Transfer learning | Leverages models pretrained on another task for your new task | No for zero-shot learning; yes for fine-tuning, though the number of ground truths required is often much smaller than what would be needed to train the model from scratch |
| Active learning | Labels data samples that are most useful to your model | |
  • If hand labeling is so problematic, what if we don’t use hand labels altogether? One approach that has gained popularity is weak supervision. One of the most popular open source tools for weak supervision is Snorkel, (Location 2662)
  • The insight behind weak supervision is that people rely on heuristics, which can be developed with subject matter expertise, to label data. (Location 2668)
  • Because LFs encode heuristics, and heuristics are noisy, labels produced by LFs are noisy. Multiple LFs might apply to the same data examples, and they might give conflicting labels. (Location 2687)
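A minimal, plain-Python sketch of labeling functions for spam classification, with a naive majority vote to resolve conflicts; the heuristics and function names are made up, and real tools like Snorkel learn a label model rather than majority-voting.

```python
SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

# Each labeling function encodes one noisy heuristic and may abstain.
def lf_contains_prince(text):
    return SPAM if "prince" in text.lower() else ABSTAIN

def lf_has_unsubscribe_link(text):
    return NOT_SPAM if "unsubscribe" in text.lower() else ABSTAIN

def lf_too_many_exclamations(text):
    return SPAM if text.count("!") > 3 else ABSTAIN

LFS = [lf_contains_prince, lf_has_unsubscribe_link, lf_too_many_exclamations]

def weak_label(text):
    """Combine conflicting LF votes with a simple majority vote."""
    votes = [lf(text) for lf in LFS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

print(weak_label("Dear friend, I am a prince!!!! Send money"))  # SPAM
```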
  • In theory, you don’t need any hand labels for weak supervision. However, to get a sense of how accurate your LFs are, a small number of hand labels is recommended. These hand labels can help you discover patterns in your data to write better LFs. (Location 2697)
  • Weak supervision can be especially useful when your data has strict privacy requirements. You only need to see a small, cleared subset of data to write LFs, which can be applied to the rest of your data without anyone looking at it. (Location 2699)
  • With LFs, subject matter expertise can be versioned, reused, and shared. Expertise owned by one team can be encoded and used by another team. If your data changes or your requirements change, you can just reapply LFs to your data samples. The approach of using LFs to generate labels for your data is also known as programmatic labeling. (Location 2700)
  • The advantages of programmatic labeling over hand labeling (Location 2704):

| Hand labeling | Programmatic labeling |
| --- | --- |
| Expensive: especially when subject matter expertise is required | Cost saving: expertise can be versioned, shared, and reused across an organization |
| Lack of privacy: need to ship data to human annotators | Privacy: create LFs using a cleared data subsample, then apply the LFs to other data without looking at individual samples |
| Slow: time required scales linearly with the number of labels needed | Fast: easily scale from 1K to 1M samples |
| Nonadaptive: every change requires relabeling the data | Adaptive: when changes happen, just reapply the LFs! |
  • My students often ask: if heuristics work so well to label data, why do we need ML models? One reason is that LFs might not cover all data samples, so we can train ML models on data programmatically labeled with LFs and use this trained model to generate predictions for samples that aren’t covered by any LF. (Location 2726)
  • Weak supervision is a simple but powerful paradigm. However, it’s not perfect. In some cases, the labels obtained by weak supervision might be too noisy to be useful. But even in these cases, weak supervision can be a good way to get you started when you want to explore the effectiveness of ML without wanting to invest too much in hand labeling up front. (Location 2728)
  • If weak supervision leverages heuristics to obtain noisy labels, semi- supervision leverages structural assumptions to generate new labels based on a small set of initial labels. Unlike weak supervision, semi- supervision requires an initial set of labels. (Location 2734)
  • For a comprehensive review, I recommend “Semi- Supervised Learning Literature Survey” (Xiaojin Zhu, 2008) and “A Survey on Semi- Supervised Learning” (Engelen and Hoos, 2018). (Location 2741)
  • A classic semi- supervision method is self- training. You start by training a model on your existing set of labeled data and use this model to make predictions for unlabeled samples. Assuming that predictions with high raw probability scores are correct, you add the labels predicted with high probability to your training set and train a new model on this expanded training set. This goes on until you’re happy with your model performance. (Location 2743)
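A sketch of the self-training loop described above, assuming NumPy arrays and a scikit-learn-style classifier; the confidence threshold and number of rounds are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=5):
    """Repeatedly pseudo-label the most confident unlabeled samples
    and retrain on the expanded training set."""
    X, y = X_labeled.copy(), y_labeled.copy()
    model = LogisticRegression().fit(X, y)
    for _ in range(rounds):
        if len(X_unlabeled) == 0:
            break
        probs = model.predict_proba(X_unlabeled)
        confident = probs.max(axis=1) >= threshold
        if not confident.any():
            break
        pseudo_labels = model.classes_[probs[confident].argmax(axis=1)]
        X = np.vstack([X, X_unlabeled[confident]])
        y = np.concatenate([y, pseudo_labels])
        X_unlabeled = X_unlabeled[~confident]
        model = LogisticRegression().fit(X, y)   # retrain on the expanded set
    return model
```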
  • Another semi- supervision method assumes that data samples that share similar characteristics share the same labels. (Location 2747)
  • A semi- supervision method that has gained popularity in recent years is the perturbation- based method. It’s based on the assumption that small perturbations to a sample shouldn’t change its label. So you apply small perturbations to your training instances to obtain new training instances. The perturbations might be applied directly to the samples (e.g., adding white noise to images) or to their representations (e.g., adding small random values to embeddings of words). The perturbed samples have the same labels as the unperturbed samples. (Location 2755)
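A minimal sketch of perturbation-based augmentation applied to precomputed representations (e.g., embeddings), assuming NumPy arrays; the noise scale and function name are illustrative.

```python
import numpy as np

def perturb_embeddings(embeddings, labels, noise_std=0.01, copies=2, seed=0):
    """Create extra training samples by adding small Gaussian noise to the
    representations; the perturbed samples keep the original labels."""
    rng = np.random.default_rng(seed)
    augmented_x, augmented_y = [embeddings], [labels]
    for _ in range(copies):
        noise = rng.normal(scale=noise_std, size=embeddings.shape)
        augmented_x.append(embeddings + noise)
        augmented_y.append(labels)
    return np.vstack(augmented_x), np.concatenate(augmented_y)
```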
  • Semi- supervision is the most useful when the number of training labels is limited. One thing to consider when doing semi- supervision with limited data is how much of this limited data should be used to evaluate multiple candidate models and select the best one. If you use a small amount, the best performing model on this small evaluation set might be the one that overfits the most to this set. On the other hand, if you use a large amount of data for evaluation, the performance boost gained by selecting the best model based on this evaluation set might be less than the boost gained by adding the evaluation set to the limited training set. Many companies overcome this trade- off by using a reasonably large evaluation set to select the best model, then continuing training the champion model on the evaluation set. (Location 2764)
  • Transfer learning refers to the family of methods where a model developed for a task is reused as the starting point for a model on a second task. (Location 2773)
  • The trained model can then be used for the task that you’re interested in— a downstream task— such as sentiment analysis, intent detection, or question answering. In some cases, such as in zero- shot learning scenarios, you might be able to use the base model on a downstream task directly. In many cases, you might need to fine- tune the base model. Fine- tuning means making small changes to the base model, such as continuing to train the base model or a part of the base model on data from a given downstream task. (Location 2780)
  • Transfer learning has gained a lot of interest in recent years for the right reasons. It has enabled many applications that were previously impossible due to the lack of training samples. A nontrivial portion of ML models in production today are the results of transfer learning, including object detection models that leverage models pretrained on ImageNet and text classification models that leverage pretrained language models such as BERT or GPT- 3.21 Transfer learning also lowers the entry barriers into ML, as it helps reduce the up- front cost needed for labeling data to build ML applications. (Location 2797)
  • Active learning is a method for improving the efficiency of data labels. The hope here is that ML models can achieve greater accuracy with fewer training labels if they can choose which data samples to learn from. Active learning is sometimes called query learning— though this term is getting increasingly unpopular— because a model (active learner) sends back queries in the form of unlabeled samples to be labeled by annotators (usually humans). (Location 2809)
  • Instead of randomly labeling data samples, you label the samples that are most helpful to your models according to some metrics or heuristics. The most straightforward metric is uncertainty measurement— label the examples that your model is the least certain about, hoping that they will help your model learn the decision boundary better. (Location 2814)
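A minimal sketch of uncertainty-based sample selection, assuming a fitted scikit-learn-style model with `predict_proba`; `budget` is the number of samples to send to annotators and the helper name is hypothetical.

```python
import numpy as np

def least_confident(model, X_unlabeled, budget=100):
    """Pick the unlabeled samples the model is least certain about,
    to be labeled by annotators next."""
    probs = model.predict_proba(X_unlabeled)   # shape: (n_samples, n_classes)
    confidence = probs.max(axis=1)             # probability of the top class
    return np.argsort(confidence)[:budget]     # indices of the least confident samples
```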
  • Another common heuristic is based on disagreement among multiple candidate models. This method is called query- by- committee, an example of an ensemble method. 23 You need a committee of several candidate models, which are usually the same model trained with different sets of hyperparameters or the same model trained on different slices of data. Each model can make one vote for which samples to label next, and it might vote based on how uncertain it is about the prediction. You then label the samples that the committee disagrees on the most. (Location 2824)
  • There are other heuristics such as choosing samples that, if trained on them, will give the highest gradient updates or will reduce the loss the most. For a comprehensive review of active learning methods, check out “Active Learning Literature Survey” (Settles 2010). (Location 2828)
  • I’m most excited about active learning when a system works with real- time data. Data changes all the time, a phenomenon we briefly touched on in Chapter 1 and will further detail in Chapter 8. Active learning in this data regime will allow your model to learn more effectively in real time and adapt faster to changing environments. (Location 2835)
  • Class imbalance can also happen with regression tasks where the labels are continuous. Consider the task of estimating health-care bills. 25 Health-care bills are highly skewed: the median bill is low, but the 95th percentile bill is astronomical. When predicting hospital bills, it might be more important to predict accurately the bills at the 95th percentile than the median bills. A 100% prediction error on a $500 bill is tolerable, but a 100% error on a $10k bill is not (predicting $20k when the actual bill is $10k). Therefore, we might have to train the model to be better at predicting 95th percentile bills, even if it reduces the overall metrics. (Location 2848)
  • The first reason is that class imbalance often means there’s insufficient signal for your model to learn to detect the minority classes. In the case where there is a small number of instances in the minority class, the problem becomes a few- shot learning problem where your model only gets to see the minority class a few times before having to make a decision on it. In the case where there is no instance of the rare classes in your training set, your model might assume these rare classes don’t exist. (Location 2865)
  • The second reason is that class imbalance makes it easier for your model to get stuck in a nonoptimal solution by exploiting a simple heuristic instead of learning anything useful about the underlying pattern of the data. (Location 2868)
  • The third reason is that class imbalance leads to asymmetric costs of error— the cost of a wrong prediction on a sample of the rare class might be much higher than a wrong prediction on a sample of the majority class. For example, misclassification on an X- ray with cancerous cells is much more dangerous than misclassification on an X- ray of a normal lung. If your loss function isn’t configured to address this asymmetry, your model will treat all samples the same way. As a result, you might obtain a model that performs equally well on both majority and minority classes, while you much prefer a model that performs less well on the majority class but much better on the minority one. (Location 2873)
  • When I was in school, most datasets I was given had more or less balanced classes. 28 It was a shock for me to start working and realize that class imbalance is the norm. In real- world settings, rare events are often more interesting (or more dangerous) than regular events, and many tasks focus on detecting those rare events. (Location 2878)
  • “Survey on Deep Learning with Class Imbalance” (Johnson and Khoshgoftaar 2019). (Location 2919)
  • Precision = True Positive / (True Positive + False Positive)
    Recall = True Positive / (True Positive + False Negative)
    F1 = 2 × Precision × Recall / (Precision + Recall) (Location 2960)
  • A popular method of undersampling low- dimensional data that was developed back in 1976 is Tomek links. 38 With this technique, you find pairs of samples from opposite classes that are close in proximity and remove the sample of the majority class in each pair. (Location 3009)
  • Both SMOTE and Tomek links have only been proven effective in low- dimensional data. Many of the sophisticated resampling techniques, such as Near- Miss and one- sided selection, 41 require calculating the distance between instances or between instances and the decision boundaries, which can be expensive or infeasible for high- dimensional data or in high- dimensional feature space, such as the case with large neural networks. (Location 3023)
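If you want to try these resampling techniques, the imbalanced-learn package implements both; a short sketch, assuming `X_train` and `y_train` already exist as arrays.

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

# Oversample the minority class by synthesizing new samples with SMOTE...
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

# ...or undersample the majority class by removing Tomek links.
X_res, y_res = TomekLinks().fit_resample(X_train, y_train)
```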
  • Another technique is two-phase learning. 42 You first train your model on resampled data; this resampled data can be obtained by randomly undersampling large classes until each class has only N instances. You then fine-tune your model on the original data. (Location 3033)
  • Elkan proposed cost-sensitive learning, in which the individual loss function is modified to take this varying cost into account. 44 The method starts by using a cost matrix to specify C_ij: the cost if class i is classified as class j. If i = j, it’s a correct classification, and the cost is usually 0. If not, it’s a misclassification. If classifying POSITIVE examples as NEGATIVE is twice as costly as the other way around, you can make C_10 twice as high as C_01. (Location 3065)
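A small sketch of using such a cost matrix at decision time: instead of predicting the class with the highest probability, predict the class with the lowest expected cost. This is a simplified take on cost-sensitive decisions, and the probabilities are made up.

```python
import numpy as np

# cost[i][j]: the cost of classifying a sample of true class i as class j.
# Misclassifying POSITIVE (1) as NEGATIVE (0) costs twice as much as the reverse.
cost = np.array([[0.0, 1.0],    # true NEGATIVE
                 [2.0, 0.0]])   # true POSITIVE

def min_expected_cost_prediction(probs):
    """Given p(class | x) for one sample, pick the class with the lowest
    expected misclassification cost instead of the argmax."""
    expected_cost = probs @ cost          # expected cost of predicting each class
    return int(np.argmin(expected_cost))

# With p(POSITIVE) = 0.4, the argmax would predict 0, but the cheaper
# expected cost is to predict 1 because missing a POSITIVE is costlier.
print(min_expected_cost_prediction(np.array([0.6, 0.4])))  # 1
```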
  • What if we adjust the loss so that if a sample has a lower probability of being right, it’ll have a higher weight? This is exactly what focal loss does. 46 (Location 3100)
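A minimal NumPy sketch of binary focal loss, which scales each sample’s cross-entropy by (1 − p_t)^γ so that confident, easy samples contribute less; the α and γ defaults follow common practice and are illustrative.

```python
import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss (Lin et al. 2017): down-weights well-classified
    samples so that hard, low-probability samples dominate the loss."""
    p_pred = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)   # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))

print(focal_loss(np.array([1, 0]), np.array([0.9, 0.2])))
```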
  • Neural networks, in general, are sensitive to noise. In the case of computer vision, this means that adding a small amount of noise to an image can cause a neural network to misclassify it. Su et al. showed that 67.97% of the natural images in the Kaggle CIFAR- 10 test dataset and 16.04% of the ImageNet test images can be misclassified by changing just one pixel (Location 3149)
  • Using deceptive data to trick a neural network into making wrong predictions is called adversarial attacks. Adding noise to samples is a common technique to create adversarial samples. The success of adversarial attacks is especially exaggerated as the resolution of images increases. (Location 3158)
  • Adding noisy samples to training data can help models recognize the weak spots in their learned decision boundary and improve their performance. 51 Noisy samples can be created by either adding random noise or by a search strategy. Moosavi- Dezfooli et al. proposed an algorithm, called DeepFool, that finds the minimum possible noise injection needed to cause a misclassification with high confidence. 52 This type of augmentation is called adversarial augmentation. 53 (Location 3160)
  • Adversarial augmentation is less common in NLP (an image of a bear with randomly added pixels still looks like a bear, but adding random characters to a random sentence will likely render it gibberish), but perturbation has been used to make models more robust. (Location 3169)
  • Since collecting data is expensive and slow, with many potential privacy concerns, it’d be a dream if we could sidestep it altogether and train our models with synthesized data. Even though we’re still far from being able to synthesize all training data, it’s possible to synthesize some training data to boost a model’s performance. (Location 3182)
  • If you’re interested in learning more about data augmentation for computer vision, “A Survey on Image Data Augmentation for Deep Learning” (Shorten and Khoshgoftaar 2019) is a comprehensive review. (Location 3209)
  • Many of the companies that I’ve worked with have discovered time and time again that once they have a workable model, having the right features tends to give them the biggest performance boost compared to clever algorithmic techniques such as hyperparameter tuning. (Location 3385)
  • The promise of deep learning is that we won’t have to handcraft features. For this reason, deep learning is sometimes called feature learning. 1 Many features can be automatically learned and extracted by algorithms. However, we’re still far from the point where all features can be automated. (Location 3399)
  • The process of choosing what information to use and how to extract this information into a format usable by your ML models is feature engineering. (Location 3446)
  • Missing not at random (MNAR) This is when the reason a value is missing is because of the true value itself. In this example, we might notice that some respondents didn’t disclose their income. Upon investigation it may turn out that the income of respondents who failed to report tends to be higher than that of those who did disclose. The income values are missing for reasons related to the values themselves. (Location 3497)
  • Missing at random (MAR) This is when the reason a value is missing is not due to the value itself, but due to another observed variable. In this example, we might notice that age values are often missing for respondents of the gender “A,” which might be because the people of gender A in this survey don’t like disclosing their age. (Location 3504)
  • Missing completely at random (MCAR) This is when there’s no pattern in when the value is missing. In this example, we might think that the missing values for the column “Job” might be completely random, not because of the job itself and not because of any other variable. People just forget to fill in that value sometimes for no particular reason. However, this type of missing is very rare. (Location 3509)
  • removing rows of data can also remove important information that your model needs to make predictions, especially if the missing values are not at random (MNAR). (Location 3531)
  • removing rows of data can create biases in your model, especially if (Location 3534)
  • there is no perfect way to handle missing values. With deletion, you risk losing important information or accentuating biases. With imputation, you risk injecting your own bias into and adding noise to your data, or worse, data leakage. (Location 3551)
  • Before inputting features into models, it’s important to scale them to be similar ranges. This process is called feature scaling. This is one of the simplest things you can do that often results in a performance boost for your model. Neglecting to do so can cause your model to make gibberish predictions, especially with classical algorithms like gradient- boosted trees and logistic regression. (Location 3563)
  • Scaling to an arbitrary range works well when you don’t want to make any assumptions about your variables. If you think that your variables might follow a normal distribution, it might be helpful to normalize them so that they have zero mean and unit variance. This process is called standardization: x' = (x − μ) / σ, with μ being the mean of variable x and σ being its standard deviation. (Location 3573)
  • There are two important things to note about scaling. One is that it’s a common source of data leakage (Location 3585)
  • Another is that it often requires global statistics— you have to look at the entire or a subset of training data to calculate its min, max, or mean. During inference, you reuse the statistics you had obtained during training to scale new data. If the new data has changed significantly compared to the training, these statistics won’t be very useful. Therefore, it’s important to retrain your model often to account for these changes. (Location 3587)
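A sketch of the practice described above, assuming `X_train` and `X_test` are NumPy arrays: fit the statistics on the train split once, then reuse them for every other split and for inference.

```python
import numpy as np

# Fit the scaling statistics on the train split only...
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

def standardize(X):
    # Reuse the train-split statistics; do NOT recompute them on new data.
    return (X - train_mean) / train_std

X_train_scaled = standardize(X_train)
# ...and apply the same statistics to validation, test, and live traffic.
X_test_scaled = standardize(X_test)
```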
  • The hashing trick was popularized by the package Vowpal Wabbit, developed at Microsoft. 7 The gist of this trick is that you use a hash function to generate a hashed value of each category. The hashed value will become the index of that category. Because you can specify the hash space, you can fix the number of encoded values for a feature in advance, without having to know how many categories there will be. For example, if you choose a hash space of 18 bits, which corresponds to 2^18 = 262,144 possible hashed values, all the categories, even the ones that your model has never seen before, will be encoded by an index between 0 and 262,143. (Location 3644)
  • One problem with hash functions is collision: two categories being assigned the same index. However, with many hash functions, collisions are random; new brands can share an index with any of the existing brands, instead of always sharing an index with unpopular brands, which is what happens when we use the preceding UNKNOWN category. (Location 3650)
  • You can choose a hash space large enough to reduce the collision. You can also choose a hash function with properties that you want, such as a locality- sensitive hashing function where similar categories (such as websites with similar names) are hashed into values close to each other. Because it’s a trick, it’s often considered hacky by academics and excluded from ML curricula. But its wide adoption in the industry is a testimonial to how effective the trick is. It’s essential to Vowpal Wabbit and it’s part of the frameworks of scikit- learn, TensorFlow, and gensim. It can be especially useful in continual learning settings where your model learns from incoming examples in production. (Location 3657)
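A minimal sketch of the trick: hash the category string and take it modulo a fixed hash space. Python’s built-in `hash` is not stable across processes, so a production system would use a deterministic hash (e.g., MurmurHash); this is only an illustration.

```python
HASH_SPACE = 2 ** 18  # 262,144 possible indices

def hashed_index(category: str) -> int:
    """Map any category string, even an unseen brand, to a fixed-size index."""
    return hash(category) % HASH_SPACE

print(hashed_index("acme_brand"))
print(hashed_index("brand_never_seen_before"))  # still lands in [0, 262143]
```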
  • Feature crossing is the technique to combine two or more features to generate new features. This technique is useful to model the nonlinear relationships between features. (Location 3667)
  • Because feature crossing helps model nonlinear relationships between variables, it’s essential for models that can’t learn or are bad at learning nonlinear relationships, such as linear regression, logistic regression, and tree- based models. (Location 3681)
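A minimal pandas sketch of crossing two features into one categorical feature; the column names and data are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "marital_status": ["married", "single", "married"],
    "num_children":   [2, 0, 0],
})

# Cross the two features into one categorical feature so that a linear model
# can capture their interaction (e.g., "married with children").
df["marital_x_children"] = (
    df["marital_status"] + "_" + df["num_children"].astype(str)
)
print(df["marital_x_children"].tolist())  # ['married_2', 'single_0', 'married_0']
```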
  • DeepFM and xDeepFM are the family of models that have successfully leveraged explicit feature interactions for recommender systems and click- through- rate prediction. (Location 3684)
  • A caveat of feature crossing is that it can make your feature space blow up. Imagine feature A has 100 possible values and feature B has 100 possible values; crossing these two features will result in a feature with 100 × 100 = 10,000 possible values. You will need a lot more data for models to learn all these possible values. Another caveat is that because feature crossing increases the number of features models use, it can make models overfit to the training data. (Location 3686)
  • Embeddings An embedding is a vector that represents a piece of data. We call the set of all possible embeddings generated by the same algorithm for a type of data “an embedding space.” All embedding vectors in the same space are of the same size. (Location 3701)
  • One of the most common uses of embeddings is word embeddings, where you can represent each word with a vector. However, embeddings for other types of data are increasingly popular. For example, ecommerce solutions like Criteo and Coveo have embeddings for products. 10 Pinterest has embeddings for images, graphs, queries, and even users. 11 (Location 3705)
  • if we use a model like a transformer, words are processed in parallel, so words’ positions need to be explicitly inputted so that our model knows the order of these words (“ a dog bites a child” is very different from “a child bites a dog”). (Location 3712)
  • With position embedding, the number of columns is the number of positions. In our case, since we only work with a sequence size of 8, the positions go from 0 to 7. (Location 3720)
  • The embedding size for positions is usually the same as the embedding size for words so that they can be summed. For example, the embedding for the word “food” at position 0 is the sum of the embedding vector for the word “food” and the embedding vector for position 0. This is the way position embeddings are implemented in Hugging Face’s BERT as of August 2021. Because the embeddings change as the model weights get updated, we say that the position embeddings are learned. (Location 3722)
  • Position embeddings can also be fixed. The embedding for each position is still a vector with S elements (S is the position embedding size), but each element is predefined using a function, usually sine and cosine. In the original Transformer paper, if the element is at an even index, use sine. Else, use cosine. (Location 3727)
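A small NumPy sketch of fixed sinusoidal position embeddings in the style of the original Transformer paper (sine for even indices, cosine for odd indices); the sizes match the running example of 8 positions, and the embedding dimension is illustrative.

```python
import numpy as np

def sinusoidal_position_embeddings(num_positions=8, dim=16):
    """Fixed (non-learned) position embeddings: each element is predefined
    using sine for even indices and cosine for odd indices."""
    positions = np.arange(num_positions)[:, None]            # (num_positions, 1)
    i = np.arange(dim)[None, :]                              # (1, dim)
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / dim)
    angles = positions * angle_rates                         # (num_positions, dim)
    embeddings = np.zeros((num_positions, dim))
    embeddings[:, 0::2] = np.sin(angles[:, 0::2])
    embeddings[:, 1::2] = np.cos(angles[:, 1::2])
    return embeddings

print(sinusoidal_position_embeddings().shape)  # (8, 16)
```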
  • Data leakage refers to the phenomenon when a form of the label “leaks” into the set of features used for making predictions, and this same information is not available during inference. (Location 3758)
  • Data leakage is challenging because often the leakage is nonobvious. It’s dangerous because it can cause your models to fail in an unexpected and spectacular way, even after extensive evaluation and testing. (Location 3760)
  • When I learned ML in college, I was taught to randomly split my data into train, validation, and test splits. This is also how data is often reportedly split in ML research papers. However, this is also one common cause for data leakage. (Location 3783)
  • In many cases, data is time- correlated, which means that the time the data is generated affects its label distribution. Sometimes, the correlation is obvious, as in the case of stock prices. To oversimplify it, the prices of similar stocks tend to move together. If 90% of the tech stocks go down today, it’s very likely the other 10% of the tech stocks go down too. When building models to predict the future stock prices, you want to split your training data by time, such as training your model on data from the first six days and evaluating it on data from the seventh day. (Location 3786)
  • Consider the task of predicting whether someone will click on a song recommendation. Whether someone will listen to a song depends not only on their music taste but also on the general music trend that day. If an artist passes away one day, people will be much more likely to listen to that artist. By including samples from a certain day in the train split, information about the music trend that day will be passed into your model, making it easier for it to make predictions on other samples on that same day. (Location 3793)
  • To prevent future information from leaking into the training process and allowing models to cheat during evaluation, split your data by time, instead of splitting randomly, whenever possible. For example, if you have data from five weeks, use the first four weeks for the train split, then randomly split week 5 into validation and test splits as shown in Figure 5- 7. (Location 3796)
  • One common mistake is to use the entire training data to generate global statistics before splitting it into different splits, leaking the mean and variance of the test samples into the training process, allowing a model to adjust its predictions for the test samples. This information isn’t available in production, so the model’s performance will likely degrade. (Location 3806)
  • To avoid this type of leakage, always split your data first before scaling, then use the statistics from the train split to scale all the splits. Some even suggest that we split our data before any exploratory data analysis and data processing, so that we don’t accidentally gain information about the test split. (Location 3809)
  • One common way to handle the missing values of a feature is to fill (input) them with the mean or median of all values present. Leakage might occur if the mean or median is calculated using entire data instead of just the train split. (Location 3813)
  • If you have duplicates or near- duplicates in your data, failing to remove them before splitting your data might cause the same samples to appear in both train and validation/ test splits. (Location 3820)
  • A group of examples can have strongly correlated labels but be divided into different splits. For example, a patient might have two lung CT scans that are a week apart, which likely have the same labels on whether they contain signs of lung cancer, but one of them is in the train split and the second is in the test split. This type of leakage is common for object detection tasks that contain photos of the same object taken milliseconds apart: some of them land in the train split while others land in the test split. It’s hard to avoid this type of data leakage without understanding how your data was generated. (Location 3835)
  • There’s no foolproof way to avoid this type of leakage, but you can mitigate the risk by keeping track of the sources of your data and understanding how it is collected and processed. Normalize your data so that data from different sources can have the same means and variances. If different CT scan machines output images with different resolutions, normalizing all the images to have the same resolution would make it harder for models to know which image is from which scan machine. And don’t forget to incorporate subject matter experts, who might have more contexts on how data is collected and used, into the ML design process! (Location 3847)
  • Measure the predictive power of each feature or a set of features with respect to the target variable (label). If a feature has unusually high correlation, investigate how this feature is generated and whether the correlation makes sense. It’s possible that two features independently don’t contain leakage, but the two features together can contain leakage. For example, when building a model to predict how long an employee will stay at a company, the starting date and the end date separately don’t tell us much about their tenure, but together they can give us that information. (Location 3856)
  • Do ablation studies to measure how important a feature or a set of features is to your model. If removing a feature causes the model’s performance to deteriorate significantly, investigate why that feature is so important. If you have a massive amount of features, say a thousand features, it might be infeasible to do ablation studies on every possible combination of them, but it can still be useful to occasionally do ablation studies with a subset of features that you suspect the most. (Location 3862)
  • Keep an eye out for new features added to your model. If adding a new feature significantly improves your model’s performance, either that feature is really good or that feature just contains leaked information about labels. (Location 3867)
  • Adding more features often leads to better model performance, and in my experience, the list of features used for a model in production only grows over time. However, having more features doesn’t always mean better model performance. (Location 3872)
  • Having too many features can be bad both during training and serving your model for the following reasons:
    - The more features you have, the more opportunities there are for data leakage.
    - Too many features can cause overfitting.
    - Too many features can increase the memory required to serve a model, which, in turn, might require you to use a more expensive machine/instance to serve your model.
    - Too many features can increase inference latency when doing online prediction, especially if you need to extract these features from raw data for predictions online. We’ll go deeper into online prediction in Chapter 7.
    - Useless features become technical debt. Whenever your data pipeline changes, all the affected features need to be adjusted accordingly. For example, if one day your application decides to no longer take in information about users’ age, all features that use users’ age need to be updated. (Location 3874)
  • if a feature doesn’t help a model make good predictions, regularization techniques like L1 regularization should reduce that feature’s weight to 0. However, in practice, it might help models learn faster if the features that are no longer useful (and even possibly harmful) are removed, prioritizing good features. (Location 3883)
  • You can store removed features to add them back later. You can also just store general feature definitions to reuse and share across teams in an organization. When talking about feature definition management, some people might think of feature stores as the solution. However, not all feature stores manage feature definitions. (Location 3885)
  • If you use a classical ML algorithm like boosted gradient trees, the easiest way to measure the importance of your features is to use built- in feature importance functions implemented by XGBoost. 17 For more model- agnostic methods, you might want to look into SHAP (SHapley Additive exPlanations). 18 InterpretML is a great open source package that leverages feature importance to help you understand how your model makes predictions. (Location 3893)
  • The exact algorithm for feature importance measurement is complex, but intuitively, a feature’s importance to a model is measured by how much that model’s performance deteriorates if that feature or a set of features containing that feature is removed from the model. (Location 3900)
  • SHAP is great because it not only measures a feature’s importance to an entire model, it also measures each feature’s contribution to a model’s specific prediction. (Location 3902)
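A short sketch of what this looks like with the shap package and an XGBoost model; `X_train`, `y_train`, and `X_test` are assumed to exist.

```python
import shap
import xgboost as xgb

model = xgb.XGBClassifier().fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Per-prediction contributions for the first test sample...
print(shap_values[0])
# ...and a global view of feature importance across the test set.
shap.summary_plot(shap_values, X_test)
```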
  • Since the goal of an ML model is to make correct predictions on unseen data, features used for the model should generalize to unseen data. Not all features generalize equally. (Location 3922)
  • Measuring feature generalization is a lot less scientific than measuring feature importance, and it requires both intuition and subject matter expertise on top of statistical knowledge. (Location 3927)
  • Coverage is the percentage of the samples that has values for this feature in the data— so the fewer values that are missing, the higher the coverage. A rough rule of thumb is that if this feature appears in a very small percentage of your data, it’s not going to be very generalizable. (Location 3929)
  • For example, if you want to build a model to predict whether someone will buy a house in the next 12 months and you think that the number of children someone has will be a good feature, but you can only get this information for 1% of your data, this feature might not be very useful. This rule of thumb is rough because some features can still be useful even if they are missing in most of your data. This is especially true when the missing values are not at random, which means having the feature or not might be a strong indication of its value. For example, if a feature appears only in 1% of your data, but 99% of the examples with this feature have POSITIVE labels, this feature is useful and you should use it. (Location 3931)
  • If the coverage of a feature differs a lot between the train and test split (such as it appears in 90% of the examples in the train split but only in 20% of the examples in the test split), this is an indication that your train and test splits don’t come from the same distribution. You might want to investigate whether the way you split your data makes sense and whether this feature is a cause for data leakage. (Location 3937)
  • Imagine you want to build a model to estimate the time it will take for a given taxi ride. You retrain this model every week, and you want to use the data from the last six days to predict the ETAs (estimated times of arrival) for today. One of the features is DAY_OF_THE_WEEK, which you think is useful because the traffic on weekdays is usually worse than on the weekend. This feature’s coverage is 100% because it’s present in every sample. However, in the train split, the values for this feature are Monday to Saturday, whereas in the test split, the only value is Sunday. If you include this feature in your model without a clever scheme to encode the days, it won’t generalize to the test split and might harm your model’s performance. (Location 3943)
  • When considering a feature’s generalization, there’s a trade- off between generalization and specificity. You might realize that the traffic during an hour only changes depending on whether that hour is the rush hour. So you generate the feature IS_RUSH_HOUR and set it to 1 if the hour is between 7 a.m. and 9 a.m. or between 4 p.m. and 6 p.m. IS_RUSH_HOUR is more generalizable but less specific than HOUR_OF_THE_DAY. Using IS_RUSH_HOUR without HOUR_OF_THE_DAY might cause models to lose important information about the hour. (Location 3950)
  • Here is a summary of best practices for feature engineering:
    - Split data by time into train/valid/test splits instead of doing it randomly.
    - If you oversample your data, do it after splitting.
    - Scale and normalize your data after splitting to avoid data leakage.
    - Use statistics from only the train split, instead of the entire data, to scale your features and handle missing values.
    - Understand how your data is generated, collected, and processed. Involve domain experts if possible.
    - Keep track of your data’s lineage.
    - Understand feature importance to your model.
    - Use features that generalize well.
    - Remove no-longer-useful features from your models. (Location 3963)
  • Model development is an iterative process. After each iteration, you’ll want to compare your model’s performance against its performance in previous iterations and evaluate how suitable this iteration is for production. (Location 4041)
  • If you had unlimited time and compute power, the rational thing to do would be to try all possible solutions and see what is best for you. However, time and compute power are limited resources, and you have to be strategic about what models you select. (Location 4062)
  • When considering what model to use, it’s important to consider not only the model’s performance, measured by metrics such as accuracy, F1 score, and log loss, but also its other properties, such as how much data, compute, and time it needs to train, its inference latency, and its interpretability. (Location 4090)
  • If there’s a solution that can solve your problem that is much cheaper and simpler than state- of- the- art models, use the simpler solution. (Location 4116)
  • First, simpler models are easier to deploy, and deploying your model early allows you to validate that your prediction pipeline is consistent with your training pipeline. Second, starting with something simple and adding more complex components step- by- step makes it easier to understand your model and debug it. Third, the simplest model serves as a baseline to which you can compare your more complex models. (Location 4121)
  • Simplest models are not always the same as models with the least effort. For example, pretrained BERT models are complex, but they require little effort to get started with, especially if you use a ready- made implementation like the one in Hugging Face’s Transformer. In this case, it’s not a bad idea to use the complex solution, given that the community around this solution is well developed enough to help you get through any problems you might encounter. (Location 4124)
  • Pretrained BERT might be low effort to start with, but it can be quite high effort to improve upon. Whereas if you start with a simpler model, there’ll be a lot of room for you to improve upon your model. (Location 4128)
  • There are a lot of human biases in evaluating models. Part of the process of evaluating an ML architecture is to experiment with different features and different sets of hyperparameters to find the best model of that architecture. If an engineer is more excited about an architecture, they will likely spend a lot more time experimenting with it, which might result in better- performing models for that architecture. (Location 4136)
  • When comparing different architectures, it’s important to compare them under comparable setups. If you run 100 experiments for an architecture, it’s not fair to only run a couple of experiments for the architecture you’re evaluating it against. You might need to run 100 experiments for the other architecture too. (Location 4139)
  • A simple way to estimate how your model's performance might change with more data is to use learning curves. A learning curve of a model is a plot of its performance (e.g., training loss, training accuracy, validation accuracy) against the number of training samples it uses, as shown in Figure 6-1. The learning curve won't help you estimate exactly how much performance gain you can get from having more training data, but it can give you a sense of whether you can expect any performance gain at all from more training data. (Location 4151)
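A sketch of computing learning-curve data with scikit-learn's learning_curve utility; the synthetic dataset and logistic regression model are stand-ins for your own data and model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# If mean validation accuracy is still climbing at the largest train size,
# more data is likely to help.
print(sizes)
print(val_scores.mean(axis=1))
```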
  • situation that I’ve encountered is when a team evaluates a simple neural network against a collaborative filtering model for making recommendations. When evaluating both models offline, the collaborative filtering model outperformed. However, the simple neural network can update itself with each incoming example, whereas the collaborative filtering has to look at all the data to update its underlying matrix. The team decided to deploy both the collaborative filtering model and the simple neural network. They used the collaborative filtering model to make predictions for users, and continually trained the simple neural network in production with new, incoming data. After two weeks, the simple neural network was able to outperform the collaborative filtering model. (Location 4158)
  • Smoothness: Every supervised machine learning method assumes that there's a set of functions that can transform inputs into outputs such that similar inputs are transformed into similar outputs. If an input X produces an output Y, then an input close to X would produce an output proportionally close to Y. (Location 4187)
  • One method that has consistently given a performance boost is to use an ensemble of multiple models instead of just an individual model to make predictions. Each model in the ensemble is called a base learner. (Location 4200)
  • Ensembling methods are less favored in production because ensembles are more complex to deploy and harder to maintain. However, they are still common for tasks where a small performance boost can lead to a huge financial gain, such as predicting click- through rate for ads. (Location 4210)
  • When creating an ensemble, the less correlation there is among base learners, the better the ensemble will be. (Location 4233)
  • Bagging, shortened from bootstrap aggregating, is designed to improve both the training stability and accuracy of ML algorithms. 4 It reduces variance and helps to avoid overfitting. (Location 4240)
  • Given a dataset, instead of training one classifier on the entire dataset, you sample with replacement to create different datasets, called bootstraps, and train a classification or regression model on each of these bootstraps. Sampling with replacement ensures that each bootstrap is created independently from its peers. Figure 6-3 shows an illustration of bagging. (Location 4245)
  • Bagging generally improves unstable methods, such as neural networks, classification and regression trees, and subset selection in linear regression. However, it can mildly degrade the performance of stable methods such as k- nearest neighbors. (Location 4253)
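A minimal sketch of bagging with scikit-learn's BaggingClassifier, which samples with replacement by default; the synthetic data and the decision-tree base learner are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# 50 trees, each trained on its own bootstrap of the training data
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0)
bag.fit(X, y)
print(bag.score(X, y))
```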
  • Boosting is a family of iterative ensemble algorithms that convert weak learners to strong ones. Each learner in this ensemble is trained on the same set of samples, but the samples are weighted differently among iterations. As a result, future weak learners focus more on the examples that previous weak learners misclassified. (Location 4262)
  • Figure 6-4. Boosting illustration. (Location 4269)
  • The boosting procedure is as follows:
    1. Start by training the first weak classifier on the original dataset.
    2. Samples are reweighted based on how well the first classifier classifies them; e.g., misclassified samples are given higher weight.
    3. Train the second classifier on this reweighted dataset. Your ensemble now consists of the first and second classifiers.
    4. Samples are reweighted based on how well the ensemble classifies them.
    5. Train the third classifier on this reweighted dataset. Add the third classifier to the ensemble.
    6. Repeat for as many iterations as needed.
    7. Form the final strong classifier as a weighted combination of the existing classifiers: classifiers with smaller training errors have higher weights. (Location 4271)
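AdaBoost is a concrete instance of this reweighting procedure. A minimal scikit-learn sketch, with illustrative synthetic data and a depth-1 decision tree as the weak learner:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Each new stump focuses more on the samples the current ensemble gets wrong.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))
```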
  • Stacking means that you train base learners from the training data then create a meta- learner that combines the outputs of the base learners to output final predictions, as shown in Figure 6- 5. The meta- learner can be as simple as a heuristic: you take the majority vote (for classification tasks) or the average vote (for regression tasks) from all base learners. It can be another model, such as a logistic regression model or a linear regression model. (Location 4290)
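A minimal sketch of stacking with scikit-learn's StackingClassifier; the choice of base learners and the logistic regression meta-learner are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", LinearSVC(random_state=0))],
    final_estimator=LogisticRegression())   # meta-learner combining base outputs
stack.fit(X, y)
print(stack.score(X, y))
```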
  • For more great advice on how to create an ensemble, refer to the awesome ensemble guide by one of Kaggle’s legendary teams, MLWave. (Location 4298)
  • It’s important to keep track of all the definitions needed to re- create an experiment and its relevant artifacts. An artifact is a file generated during an experiment— examples of artifacts can be files that show the loss curve, evaluation loss graph, logs, or intermediate results of a model throughout a training process. This enables you to compare different experiments and choose the one best suited for your needs. Comparing different experiments can also help you understand how small changes affect your model’s performance, which, in turn, gives you more visibility into how your model works. (Location 4304)
  • The process of tracking the progress and results of an experiment is called experiment tracking. The process of logging all the details of an experiment for the purpose of possibly recreating it later or comparing it with other experiments is called versioning. These two go hand in hand with each other. Many tools originally set out to be experiment tracking tools, such as MLflow and Weights & Biases, have grown to incorporate versioning. Many tools originally set out to be versioning tools, such as DVC, have also incorporated experiment tracking. (Location 4309)
  • A large part of training an ML model is babysitting the learning processes. Many problems can arise during the training process, including loss not decreasing, overfitting, underfitting, fluctuating weight values, dead neurons, and running out of memory. It's important to track what's going on during training not only to detect and address these issues but also to evaluate whether your model is learning anything useful. (Location 4314)
  • When I just started getting into ML, all I was told to track was loss and speed. Fast- forward several years, and people are tracking so many things that their experiment tracking boards look both beautiful and terrifying at the same time. (Location 4319)
  • A short list of things you might want to consider tracking for each experiment during its training process:
    • The loss curve corresponding to the train split and each of the eval splits.
    • The model performance metrics that you care about on all nontest splits, such as accuracy, F1, and perplexity.
    • The log of corresponding samples, predictions, and ground truth labels. This comes in handy for ad hoc analytics and sanity checks.
    • The speed of your model, evaluated by the number of steps per second or, if your data is text, the number of tokens processed per second.
    • System performance metrics such as memory usage and CPU/GPU utilization. They're important to identify bottlenecks and avoid wasting system resources.
    • The values over time of any parameter and hyperparameter whose changes can affect your model's performance, such as the learning rate if you use a learning rate schedule; gradient norms (both globally and per layer), especially if you're clipping your gradient norms; and weight norms, especially if you're doing weight decay. (Location 4321)
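A minimal sketch of logging a few of these quantities with MLflow, one of the tracking tools mentioned in the book; the experiment name, parameter values, metric names, and the training loop are placeholders:

```python
import mlflow

mlflow.set_experiment("taxi-eta")  # illustrative experiment name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 1e-3)
    mlflow.log_param("batch_size", 256)

    for step in range(100):
        train_loss = 1.0 / (step + 1)      # stand-in for a real training loop
        mlflow.log_metric("train_loss", train_loss, step=step)
        mlflow.log_metric("steps_per_second", 42.0, step=step)

    mlflow.log_metric("val_accuracy", 0.87)  # placeholder final metric
```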
  • Experiment tracking enables comparison across experiments. By observing how a certain change in a component affects the model’s performance, you gain some understanding into what that component does. (Location 4346)
  • ML systems are part code, part data, so you need to not only version your code but your data as well. (Location 4362)
  • Code versioning has more or less become a standard in the industry. However, at this point, data versioning is like flossing. Everyone agrees it’s a good thing to do, but few do it. There are a few reasons why data versioning is challenging. One reason is that because data is often much larger than code, we can’t use the same strategy that people usually use to version code to version data. (Location 4363)
  • dataset used might be so large that duplicating it multiple times might be unfeasible. Code versioning tools allow for multiple people to work on the same codebase at the same time by duplicating the codebase on each person’s local machine. However, a dataset might not fit into a local machine. (Location 4372)
  • Aggressive experiment tracking and versioning helps with reproducibility, but it doesn’t ensure reproducibility. The frameworks and hardware you use might introduce nondeterminism to your experiment results, 10 making it impossible to replicate the result of an experiment without knowing everything about the environment your experiment runs in. (Location 4384)
  • First, ML models fail silently, a topic we’ll cover in depth in Chapter 8. The code compiles. The loss decreases as it should. The correct functions are called. The predictions are made, but the predictions are wrong. The developers don’t notice the errors. And worse, users don’t either and use the predictions as if the application was functioning as it should. (Location 4397)
  • When debugging a traditional software program, you might be able to make changes to the buggy code and see the result immediately. However, when making changes to an ML model, you might have to retrain the model and wait until it converges to see whether the bug is fixed, which can take hours. In some cases, you can’t even be sure whether the bugs are fixed until the model is deployed to the users. (Location 4401)
  • debugging ML models is hard because of their cross- functional complexity. There are many components in an ML system: data, labels, features, ML algorithms, code, infrastructure, etc. These different components might be owned by different teams. For example, data is managed by data engineers, labels by subject matter experts, ML algorithms by data scientists, and infrastructure by ML engineers or the ML platform team. When an error occurs, it could be because of any of these components or a combination of them, making it hard to know where to look or who should be looking into it. (Location 4404)
  • Theoretical constraints: As discussed previously, each model comes with its own assumptions about the data and the features it uses. A model might fail because the data it learns from doesn't conform to its assumptions. For example, you use a linear model for data whose decision boundaries aren't linear. (Location 4409)
  • Poor implementation of model: The model might be a good fit for the data, but the bugs are in the implementation of the model. For example, if you use PyTorch, you might have forgotten to stop gradient updates during evaluation when you should. The more components a model has, the more things can go wrong, and the harder it is to figure out which one went wrong. However, with models being increasingly commoditized and more and more companies using off-the-shelf models, this is becoming less of a problem. (Location 4413)
  • Poor choice of hyperparameters: With the same model, one set of hyperparameters can give you a state-of-the-art result while another set might cause the model to never converge. The model might be a great fit for your data, and its implementation correct, but a poor set of hyperparameters can still render your model useless. (Location 4419)
  • Data problems: There are many things that could go wrong in data collection and preprocessing that might cause your models to perform poorly, such as data samples and labels being incorrectly paired, noisy labels, features normalized using outdated statistics, and more. (Location 4424)
  • Poor choice of features: There might be many possible features for your models to learn from. Too many features might cause your models to overfit to the training data or cause data leakage. Too few features might lack the predictive power to allow your models to make good predictions. (Location 4427)
  • Debugging should be both preventive and curative. You should have healthy practices to minimize the opportunities for bugs to proliferate as well as a procedure for detecting, locating, and fixing bugs. Having the discipline to follow both the best practices and the debugging procedure is crucial in developing, implementing, and deploying ML models. (Location 4432)
  • There is, unfortunately, still no scientific approach to debugging in ML. However, there have been a number of tried- and- true debugging techniques published by experienced ML engineers and researchers. (Location 4434)
  • Readers interested in learning more might want to check out Andrej Karpathy’s awesome post “A Recipe for Training Neural Networks”. (Location 4436)
  • Start simple and gradually add more components: Start with the simplest model and then slowly add more components to see if it helps or hurts the performance. For example, if you want to build a recurrent neural network (RNN), start with just one level of RNN cells before stacking multiple together or adding more regularization. If you want to use a BERT-like model (Devlin et al. 2018), which uses both a masked language model (MLM) and next sentence prediction (NSP) loss, you might want to use only the MLM loss before adding the NSP loss. Currently, many people start out by cloning an open source implementation of a state-of-the-art model and plugging in their own data. On the off chance that it works, it's great. But if it doesn't, it's very hard to debug the system because the problem could have been caused by any of the many components in the model. (Location 4437)
  • Overfit a single batch: After you have a simple implementation of your model, try to overfit a small amount of training data and run evaluation on the same data to make sure that it gets to the smallest possible loss. If it's for image recognition, overfit on 10 images and see if you can get the accuracy to be 100%; if it's for machine translation, overfit on 100 sentence pairs and see if you can get to a BLEU score of near 100. If your model can't overfit a small amount of data, there might be something wrong with your implementation. (Location 4445)
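A minimal PyTorch sketch of the overfit-a-single-batch check; the tiny model and the random batch are placeholders for your real implementation and data:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
xb = torch.randn(16, 10)                 # one small, fixed batch
yb = torch.randint(0, 2, (16,))

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):                  # train repeatedly on the same batch
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    opt.step()

# The loss should approach ~0; if it doesn't, suspect the implementation.
print(loss.item())
```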
  • Set a random seed: There are so many factors that contribute to the randomness of your model: weight initialization, dropout, data shuffling, etc. Randomness makes it hard to compare results across different experiments: you have no idea if a change in performance is due to a change in the model or to a different random seed. Setting a random seed ensures consistency between different runs. It also allows you to reproduce errors and allows other people to reproduce your results. (Location 4451)
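A minimal sketch of a seed-setting helper for a PyTorch-based stack; which libraries you need to seed is an assumption about your setup, and forcing cuDNN determinism trades some speed for reproducibility:

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed the common sources of randomness in a PyTorch workflow."""
    random.seed(seed)                    # Python's built-in RNG (e.g., data shuffling)
    np.random.seed(seed)                 # NumPy-based preprocessing
    torch.manual_seed(seed)              # weight initialization, dropout
    torch.cuda.manual_seed_all(seed)     # no-op if CUDA is unavailable
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```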
  • When your data samples are large (e.g., when one machine can handle only a few samples at a time), you might only be able to work with a small batch size, which leads to instability in gradient descent-based optimization. (Location 4466)
  • According to the authors of the open source package gradient- checkpointing, “For feed- forward models we were able to fit more than 10x larger models onto our GPU, at only a 20% increase in computation time.” 13 Even when a sample fits into memory, using checkpointing can allow you to fit more samples into a batch, which might allow you to train your model faster. (Location 4469)
  • It’s now the norm to train ML models on multiple machines. The most common parallelization method supported by modern ML frameworks is data parallelism: you split your data on multiple machines, train your model on all of them, and accumulate gradients. (Location 4474)
  • A challenging problem is how to accurately and effectively accumulate gradients from different machines. As each machine produces its own gradient, if your model waits for all of them to finish a run— synchronous stochastic gradient descent (SGD)— stragglers will cause the entire system to slow down, wasting time and resources. (Location 4479)
  • If your model updates the weight using the gradient from each machine separately— asynchronous SGD— gradient staleness might become a problem because the gradients from one machine have caused the weights to change before the gradients from another machine have come in. (Location 4487)
  • In theory, asynchronous SGD converges but requires more steps than synchronous SGD. However, in practice, when the number of weights is large, gradient updates tend to be sparse, meaning most gradient updates only modify small fractions of the parameters, and it’s less likely that two gradient updates from different machines will modify the same weights. When gradient updates are sparse, gradient staleness becomes less of a problem and the model converges similarly for both synchronous and asynchronous SGD. (Location 4494)
  • Another problem is that spreading your model on multiple machines can cause your batch size to be very big. If a machine processes a batch size of 1,000, then 1,000 machines process a batch size of 1M (OpenAI’s GPT- 3 175B uses a batch size of 3.2M in 2020). 19 To oversimplify the calculation, if training an epoch on a machine takes 1M steps, training on 1,000 machines might take only 1,000 steps. An intuitive approach is to scale up the learning rate to account for more learning at each step, but we also can’t make the learning rate too big as it will lead to unstable convergence. In practice, increasing the batch size past a certain point yields diminishing returns. (Location 4499)
  • With data parallelism, each worker has its own copy of the whole model and does all the computation necessary for its copy of the model. Model parallelism is when different components of your model are trained on different machines, (Location 4512)
  • Model parallelism can be misleading because, in some cases, parallelism doesn't mean that different parts of the model on different machines are executed in parallel. (Location 5522)
  • Pipeline parallelism is a clever technique to make different components of a model on different machines run more in parallel. There are multiple variants to this, but the key idea is to break the computation of each machine into multiple parts. When machine 1 finishes the first part of its computation, it passes the result onto machine 2, then continues to the second part, and so on. Machine 2 now can execute its computation on the first part while machine 1 executes its computation on the second part. (Location 4526)
  • Model parallelism and data parallelism aren’t mutually exclusive. Many companies use both methods for better utilization of their hardware, even though the setup to use both methods can require significant engineering effort. (Location 4538)
  • The goal of hyperparameter tuning is to find the optimal set of hyperparameters for a given model within a search space— the performance of each set evaluated on a validation set. (Location 4567)
  • Popular ML frameworks either come with built-in utilities or have third-party utilities for hyperparameter tuning: for example, scikit-learn with auto-sklearn, TensorFlow with Keras Tuner, and Ray with Tune. Popular methods for hyperparameter tuning include random search, grid search, and Bayesian optimization. The book AutoML: Methods, Systems, Challenges by the AutoML group at the University of Freiburg dedicates its first chapter (which you can read online for free) to hyperparameter optimization. (Location 4572)
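A minimal sketch of random search with scikit-learn's RandomizedSearchCV; the estimator, the search distributions, and the synthetic data are illustrative choices:

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions={"learning_rate": loguniform(1e-3, 1e-1),
                         "max_depth": randint(2, 6)},
    n_iter=10,           # number of sampled hyperparameter sets
    cv=5,                # each set is evaluated with cross-validation
    random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```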
  • Some teams take hyperparameter tuning to the next level: what if we treat other components of a model, or the entire model, as hyperparameters? The size of a convolution layer or whether or not to have a skip layer can be considered a hyperparameter. Instead of manually putting a pooling layer after a convolutional layer or a ReLU (rectified linear unit) after a linear layer, you give your algorithm these building blocks and let it figure out how to combine them. This area of research is known as architecture search, or neural architecture search (NAS) for neural networks, as it searches for the optimal model architecture. (Location 4590)
  • A search strategy is needed to explore the search space. A simple approach is random search (randomly choosing from all possible configurations), which is unpopular because it's prohibitively expensive even for NAS. Common approaches include reinforcement learning (rewarding the choices that improve the performance estimation) and evolution (adding mutations to an architecture, choosing the best-performing ones, adding mutations to them, and so on). For NAS, the search space is discrete: the final architecture uses only one of the available options for each layer/operation, and you have to provide the set of building blocks. Common building blocks are various convolutions of different sizes, linear layers, various activations, pooling, identity, zero, etc. The set of building blocks varies based on the base architecture, e.g., convolutional neural networks or transformers. In a typical ML training process, you have a model and then a learning procedure, an algorithm that helps your model find the set of parameters that minimize a given objective function for a given set of data. The most common learning procedure for neural networks today is gradient descent, which leverages an optimizer to specify how to update a model's weights given gradient updates. Popular optimizers are, as you probably already know, Adam, Momentum, and SGD. (Location 4604)
  • it’s important for people interested in ML in production to be aware of the progress in AutoML for two reasons. First, the resulting architectures and learned optimizers can allow ML algorithms to work off- the- shelf on multiple real- world tasks, saving production time and cost, during both training and inferencing. (Location 4629)
  • If this is your first time trying to make this type of prediction from this type of data, start with non- ML solutions. Your first stab at the problem can be the simplest heuristics. (Location 4639)
  • Facebook's newsfeed was introduced in 2006 without any intelligent algorithms; posts were shown in chronological order, as shown in Figure 6-10. It wasn't until 2011 that Facebook started displaying the news updates you were most interested in at the top of the feed. (Location 4642)
  • According to Martin Zinkevich in his magnificent “Rules of Machine Learning: Best Practices for ML Engineering”: “If you think that machine learning will give you a 100% boost, then a heuristic will get you 50% of the way there.” 35 (Location 4648)
  • For your first ML model, you want to start with a simple algorithm, something that gives you visibility into its workings to allow you to validate the usefulness of your problem framing and your data. Logistic regression, gradient-boosted trees, and k-nearest neighbors can be great for that. They are also easier to implement and deploy, which allows you to quickly build out a framework from data engineering to development to deployment that you can test and gain confidence in. (Location 4652)
  • Once you have your ML framework in place, you can focus on optimizing the simple ML models with different objective functions, hyperparameter search, feature engineering, more data, and ensembles. (Location 4655)
  • Once you’ve reached the limit of your simple models and your use case demands significant model improvement, experiment with more complex models. You’ll also want to experiment to figure out how quickly your model decays in production (e.g., how often it’ll need to be retrained) so that you can build out your infrastructure to support this retraining requirement. (Location 4657)
  • Lacking a clear understanding of how to evaluate your ML systems is not necessarily a reason for your ML project to fail, but it might make it impossible to find the best solution for your need, and make it harder to convince your managers to adopt ML. You might want to partner with the business team to develop metrics for model evaluation that are more relevant to your company’s business. (Location 4668)
  • the evaluation methods should be the same during both development and production. But in many cases, the ideal is impossible because during development, you have ground truth labels, but in production, you don’t. (Location 4672)
  • Evaluation metrics, by themselves, mean little. When evaluating your model, it’s essential to know the baseline you’re evaluating it against. (Location 4694)
  • Random baseline: If our model just predicts at random, what's the expected performance? The predictions are generated at random following a specific distribution, which can be the uniform distribution or the task's label distribution. (Location 4696)
  • Simple heuristic: Forget ML. If you just make predictions based on simple heuristics, what performance would you expect? For example, if you want to build a ranking system to rank items on a user's newsfeed with the goal of getting that user to spend more time on the newsfeed, how much time would a user spend if you just rank all the items in reverse chronological order, showing the latest one first? (Location 4711)
  • Zero rule baseline: The zero rule baseline is a special case of the simple heuristic baseline in which your baseline model always predicts the most common class. For example, for the task of recommending the app a user is most likely to use next on their phone, the simplest model would be to recommend their most frequently used app. If this simple heuristic can predict the next app accurately 70% of the time, any model you build has to outperform it significantly to justify the added complexity. (Location 4718)
  • Human baseline: In many cases, the goal of ML is to automate what would have otherwise been done by humans, so it's useful to know how your model performs compared to human experts. (Location 4725)
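A minimal sketch of the random and zero rule baselines using scikit-learn's DummyClassifier; the synthetic, imbalanced dataset is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "uniform": random predictions; "stratified": random following the label
# distribution; "most_frequent": the zero rule baseline.
for strategy in ["uniform", "stratified", "most_frequent"]:
    baseline = DummyClassifier(strategy=strategy).fit(X_tr, y_tr)
    print(strategy, baseline.score(X_te, y_te))
```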
  • In many cases, ML systems are designed to replace existing solutions, which might be business logic with a lot of if/ else statements or third- party solutions. It’s crucial to compare your new model to these existing solutions. Your ML model doesn’t always have to be better than existing solutions to be useful. A model whose performance is a little bit inferior can still be useful if it’s much easier or cheaper to use. (Location 4732)
  • When evaluating a model, it’s important to differentiate between “a good system” and “a useful system.” A good system isn’t necessarily useful, and a bad system isn’t necessarily useless. A self- driving vehicle might be good if it’s a significant improvement from previous self- driving systems, but it might not be useful if it doesn’t perform at least as well as human drivers. (Location 4738)
  • A system that predicts what word a user will type next on their phone might be considered bad if it's much worse than a native speaker. However, it might still be useful if its predictions can help users type faster some of the time. (Location 4742)
  • the inputs used to develop your model should be similar to the inputs your model will have to work with in production, but it’s not possible in many cases. This is especially true when data collection is expensive or difficult and the best available data you have access to for training is still very different from your real- world data. (Location 4757)
  • The inputs your models have to work with in production are often noisy compared to inputs in development. 41 The model that performs best on training data isn’t necessarily the model that performs best on noisy data. (Location 4760)
  • The more sensitive your model is to noise, the harder it will be to maintain it, since if your users’ behaviors change just slightly, such as they change their phones, your model’s performance might degrade. It also makes your model susceptible to adversarial attack. (Location 4766)
  • Certain changes to the inputs shouldn’t lead to changes in the output. In the preceding case, changes to race information shouldn’t affect the mortgage outcome. Similarly, changes to applicants’ names shouldn’t affect their resume screening results nor should someone’s gender affect how much they should be paid. If these happen, there are biases in your model, which might render it unusable no matter how good its performance is. (Location 4776)
  • Certain changes to the inputs should, however, cause predictable changes in outputs. For example, when developing a model to predict housing prices, keeping all the features the same but increasing the lot size shouldn’t decrease the predicted price, and decreasing the square footage shouldn’t increase it. If the outputs change in the opposite expected direction, your model might not be learning the right thing, and you need to investigate it further before deploying it. (Location 4784)
  • If your recommender system shows exactly the movies A will most likely watch, the recommendations will consist of only romance movies because A is much more likely to watch romance than any other type of movies. You might want a more calibrated system whose recommendations are representative of users’ actual watching habits. In this case, they should consist of 80% romance and 20% comedy. (Location 4802)
  • Imagine that there are only two ads, ad A and ad B. Your model predicts that this user will click on ad A with a 10% probability and on ad B with an 8% probability. You don't need your model to be calibrated to rank ad A above ad B. However, if you want to predict how many clicks your ads will get, you'll need your model to be calibrated. If your model predicts that a user will click on ad A with a 10% probability but in reality the ad is only clicked on 5% of the time, your estimated number of clicks will be way off. If you have another model that gives the same ranking but is better calibrated, you might want to consider the better calibrated one. (Location 4806)
  • To calibrate your models, a common method is Platt scaling, which is implemented in scikit- learn with sklearn.calibration.CalibratedClassifierCV. Another good open source implementation by Geoff Pleiss can be found on GitHub. (Location 4819)
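A minimal sketch of Platt scaling via sklearn.calibration.CalibratedClassifierCV, which the text mentions; the LinearSVC base model and synthetic data are illustrative choices:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=3000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# method="sigmoid" is Platt scaling: a logistic fit on the classifier's scores.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)

probs = calibrated.predict_proba(X_te)[:, 1]
print(probs[:5])   # calibrated probabilities, e.g., for estimating expected clicks
```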
  • For more about the importance of model calibration and how to calibrate neural networks, Lee Richardson and Taylor Pospisil have an excellent blog post based on their work at Google. (Location 4825)
  • Confidence measurement can be considered a way to think about the usefulness threshold for each individual prediction. (Location 4828)
  • Indiscriminately showing all a model’s predictions to users, even the predictions that the model is unsure about, can, at best, cause annoyance and make users lose trust in the system, (Location 4831)
  • If you only want to show the predictions that your model is certain about, how do you measure that certainty? What is the certainty threshold at which the predictions should be shown? What do you want to do with predictions below that threshold— discard them, loop in humans, or ask for more information from users? (Location 4834)
  • While most other metrics measure the system’s performance on average, confidence measurement is a metric for each individual sample. System- level measurement is useful to get a sense of overall performance, but sample- level metrics are crucial when you care about your system’s performance on every sample. (Location 4837)
  • Slicing means separating your data into subsets and looking at your model's performance on each subset separately. A common mistake that I've seen in many companies is that they focus too much on coarse-grained metrics like overall F1 or accuracy on the entire data and not enough on slice-based metrics. (Location 4840)
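A minimal sketch of slice-based evaluation with pandas; the tiny hand-written predictions and the "platform" slice are purely illustrative:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

df = pd.DataFrame({
    "y_true":   [1, 0, 1, 1, 0, 1, 0, 0],
    "y_pred":   [1, 0, 0, 1, 0, 0, 0, 1],
    "platform": ["mobile", "mobile", "mobile", "desktop",
                 "desktop", "mobile", "desktop", "mobile"],
})

print("overall:", accuracy_score(df.y_true, df.y_pred))

# Per-slice metrics can reveal problems the overall metric hides.
for platform, group in df.groupby("platform"):
    print(platform, accuracy_score(group.y_true, group.y_pred))
```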
  • when you build a model for user churn prediction (predicting when a user will cancel a subscription or a service), paid users are more critical than nonpaid users. Focusing on a model’s overall performance might hurt its performance on these critical slices. (Location 4865)
  • A fascinating and seemingly counterintuitive reason why slice-based evaluation is crucial is Simpson's paradox, a phenomenon in which a trend appears in several groups of data but disappears or reverses when the groups are combined. This means that model B can perform better than model A on all data together, while model A performs better than model B on each subgroup separately. (Location 4867)
  • To make informed decisions regarding what model to choose, we need to take into account its performance not only on the entire data, but also on individual slices. Slice- based evaluation can give you insights to improve your model’s performance both overall and on critical data and help detect potential biases. It might also help reveal non- ML problems. Once, our team discovered that our model performed great overall but very poorly on traffic from mobile users. After investigating, we realized that it was because a button was half hidden on small screens (e.g., phone screens). (Location 4924)
  • To track your model’s performance on critical slices, you’d first need to know what your critical slices are. You might wonder how to discover critical slices in your data. Slicing is, unfortunately, still more of an art than a science, requiring intensive data exploration and analysis. (Location 4930)
  • Heuristics-based: Slice your data using domain knowledge you have of the data and the task at hand. For example, when working with web traffic, you might want to slice your data along dimensions like mobile versus desktop, browser type, and location. Mobile users might behave very differently from desktop users. Similarly, internet users in different geographic locations might have different expectations of what a website should look like. (Location 4933)
  • Error analysis: Manually go through misclassified examples and find patterns among them. We discovered our model's problem with mobile users when we saw that most of the misclassified examples were from mobile users. (Location 4939)
  • Slice finder: There has been research to systematize the process of finding slices, including Chung et al.'s "Slice Finder: Automated Data Slicing for Model Validation" (2019) and the methods covered in Sumyea Helal's "Subgroup Discovery Algorithms: A Survey and Empirical Evaluation" (2016). The process generally starts with generating slice candidates with algorithms such as beam search, clustering, or decision trees, then pruning out clearly bad candidates, and then ranking the candidates that are left. (Location 4941)
  • I once read somewhere on the internet: deploying is easy if you ignore all the hard parts. (Location 5132)
  • In many companies, the responsibility of deploying models falls into the hands of the same people who developed those models. In many other companies, once a model is ready to be deployed, it will be exported and handed off to another team to deploy it. However, this separation of responsibilities can cause high overhead communications across teams and make it slow to update your model. It also can make it hard to debug should something go wrong. (Location 5165)
  • Exporting a model means converting this model into a format that can be used by another application. Some people call this process “serialization.” 4 There are two parts of a model that you can export: the model definition and the model’s parameter values. The model definition defines the structure of your model, such as how many hidden layers it has and how many units in each layer. The parameter values provide the values for these units and layers. Usually, these two parts are exported together. (Location 5170)
  • People tend to ask me: “How often should I update my models?” It’s the wrong question to ask. The right question should be: “How often can I update my models?” Since a model’s performance decays over time, we want to update it as fast as possible. This is an area of ML where we should learn from existing DevOps best practices. (Location 5225)
  • While many companies still only update their models once a month, or even once a quarter, Weibo’s iteration cycle for updating some of their ML models is 10 minutes. 12 I’ve heard similar numbers at companies like Alibaba and ByteDance (the company behind TikTok). (Location 5232)
  • There are three main modes of prediction that I hope you'll remember:
    • Batch prediction, which uses only batch features.
    • Online prediction that uses only batch features (e.g., precomputed embeddings).
    • Online prediction that uses both batch features and streaming features. This is also known as streaming prediction. (Location 5263)
  • Online prediction is when predictions are generated and returned as soon as requests for these predictions arrive. (Location 5266)
  • Batch prediction is when predictions are generated periodically or whenever triggered. The predictions are stored somewhere, such as in SQL tables or an in- memory database, and retrieved as needed. For example, Netflix might generate movie recommendations for all of its users every four hours, and the precomputed recommendations are fetched and shown to users when they log on to Netflix. Batch prediction is also known as asynchronous prediction: predictions are generated asynchronously with requests. (Location 5274)
  • The terms “online prediction” and “batch prediction” can be confusing. Both can make predictions for multiple samples (in batch) or one sample at a time. To avoid this confusion, people sometimes prefer the terms “synchronous prediction” and “asynchronous prediction.” However, this distinction isn’t perfect either, because when online prediction leverages a real- time transport to send prediction requests to your model, the requests and predictions technically are asynchronous. (Location 5278)
  • features computed from historical data, such as data in databases and data warehouses, are batch features. Features computed from streaming data— data in real- time transports— are streaming features. In batch prediction, only batch features are used. In online prediction, however, it’s possible to use both batch features and streaming features. (Location 5293)
  • The terms "streaming features" and "online features" are sometimes used interchangeably. They are actually different. Online features are more general, as they refer to any feature used for online prediction, including batch features stored in memory. A very common type of batch feature used for online prediction, especially session-based recommendations, is item embeddings. Item embeddings are usually precomputed in batch and fetched whenever they are needed for online prediction. In this case, embeddings can be considered online features but not streaming features. Streaming features refer exclusively to features computed from streaming data. (Location 5300)
  • online prediction and batch prediction don’t have to be mutually exclusive. One hybrid solution is that you precompute predictions for popular queries, then generate predictions online for less popular queries. (Location 5313)
  • Online prediction might seem less efficient, both in terms of cost and performance, than batch prediction because you might not be able to batch inputs together and leverage vectorization or other optimization techniques. This is not necessarily true. (Location 5328)
  • With online prediction, you don't have to generate predictions for users who aren't visiting your site. Imagine you run an app where only 2% of your users log in daily (e.g., in 2020, Grubhub had 31 million users and 622,000 daily orders). If you generate predictions for every user each day, the compute used to generate 98% of your predictions would be wasted. (Location 5331)
  • To people coming to ML from an academic background, the more natural way to serve predictions is probably online. You give your model an input and it generates a prediction as soon as it receives that input. This is likely how most people interact with their models while prototyping. This is also likely easier to do for most companies when first deploying a model. You export your model, upload the exported model to Amazon SageMaker or Google App Engine, and get back an exposed endpoint. (Location 5340)
  • A problem with online prediction is that your model might take too long to generate predictions. Instead of generating predictions as soon as they arrive, what if you compute predictions in advance and store them in your database, and fetch them when requests arrive? This is exactly what batch prediction does. With this approach, you can generate predictions for multiple inputs at once, leveraging distributed techniques to process a high volume of samples efficiently. (Location 5349)
  • Because the predictions are precomputed, you don’t have to worry about how long it’ll take your models to generate predictions. For this reason, batch prediction can also be seen as a trick to reduce the inference latency of more complex models— the (Location 5353)
  • Batch prediction is good for when you want to generate a lot of predictions and don’t need the results immediately. You don’t have to use all the predictions generated. For example, you can make predictions for all customers on how likely they are to buy a new product, and reach out to the top 10%. (Location 5355)
  • However, the problem with batch prediction is that it makes your model less responsive to users' changing preferences. This limitation can be seen even in more technologically progressive companies like Netflix. Say you've been watching a lot of horror movies lately, so when you first log in to Netflix, horror movies dominate your recommendations. But you're feeling bright today, so you search "comedy" and start browsing the comedy category. Netflix should learn and show you more comedy in your list of recommendations, right? As of writing this book, it can't update the list until the next batch of recommendations is generated, but I have no doubt that this limitation will be addressed in the near future. (Location 5357)
  • Batch prediction is a workaround for when online prediction isn’t cheap enough or isn’t fast enough. Why generate one million predictions in advance and worry about storing and retrieving them if you can generate each prediction as needed at the exact same cost and same speed? As hardware becomes more customized and powerful and better techniques are being developed to allow faster, cheaper online predictions, online prediction might become the default. (Location 5372)
  • Batch prediction is largely a product of legacy systems. In the last decade, big data processing has been dominated by batch systems like MapReduce and Spark, which allow us to periodically process a large amount of data very efficiently. When companies started with ML, they leveraged their existing batch systems to make predictions. When these companies want to use streaming features for their online prediction, they need to build a separate streaming pipeline. (Location 5389)
  • Having two different pipelines to process your data is a common cause for bugs in ML production. One cause for bugs is when the changes in one pipeline aren’t correctly replicated in the other, leading to two pipelines extracting two different sets of features. This is especially common if the two pipelines are maintained by two different teams, such as the ML team maintains the batch pipeline for training while the deployment team maintains the stream pipeline for inference, (Location 5401)
  • Building infrastructure to unify stream processing and batch processing has become a popular topic in recent years for the ML community. Companies including Uber and Weibo have made major infrastructure overhauls to unify their batch and stream processing pipelines by using a stream processor like Apache Flink. 18 Some companies use feature stores to ensure the consistency between the batch features used during training and the streaming features used in prediction. (Location 5412)
  • If the model you want to deploy takes too long to generate predictions, there are three main approaches to reduce its inference latency: make it do inference faster, make the model smaller, or make the hardware it’s deployed on run faster. (Location 5425)
  • The process of making a model smaller is called model compression, and the process to make it do inference faster is called inference optimization. Originally, model compression was to make models fit on edge devices. However, making models smaller often makes them run faster. (Location 5427)
  • The number of research papers on model compression is growing. Off- the- shelf utilities are proliferating. As of April 2022, Awesome Open Source has a list of “The Top 168 Model Compression Open Source Projects”, and that list is growing. While there are many new techniques being developed, the four types of techniques that you might come across the most often are low- rank optimization, knowledge distillation, pruning, and quantization. Readers interested in a comprehensive review might want to check out Cheng et al.’ s “Survey of Model Compression and Acceleration for Deep Neural Networks,” which was updated in 2020.19 (Location 5433)
  • The key idea behind low- rank factorization is to replace high- dimensional tensors with lower- dimensional tensors. 20 One type of low- rank factorization is compact convolutional filters, where the over- parameterized (having too many parameters) convolution filters are replaced with compact blocks to both reduce the number of parameters and increase speed. (Location 5440)
  • Knowledge distillation is a method in which a small model (student) is trained to mimic a larger model or ensemble of models (teacher). The smaller model is what you’ll deploy. Even though the student is often trained after a pretrained teacher, both may also be trained at the same time. 23 One example of a distilled network used in production is DistilBERT, which reduces the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster. (Location 5467)
  • The advantage of this approach is that it can work regardless of the architectural differences between the teacher and the student networks. For example, you can get a random forest as the student and a transformer as the teacher. The disadvantage of this approach is that it’s highly dependent on the availability of a teacher network. If you use a pretrained model as the teacher model, training the student network will require less data and will likely be faster. However, if you don’t have a teacher available, you’ll have to train a teacher network before training a student network, and training a teacher network will require a lot more data and take more time to train. This method is also sensitive to applications and model architectures, and therefore hasn’t found wide usage in production. (Location 5474)
  • Pruning was a method originally used for decision trees where you remove sections of a tree that are uncritical and redundant for classification. 25 As neural networks gained wider adoption, people started to realize that neural networks are over- parameterized and began to find ways to reduce the workload caused by the extra parameters. (Location 5481)
  • Pruning, in the context of neural networks, has two meanings. One is to remove entire nodes of a neural network, which means changing its architecture and reducing its number of parameters. The more common meaning is to find parameters least useful to predictions and set them to 0. In this case, pruning doesn’t reduce the total number of parameters, only the number of nonzero parameters. The architecture of the neural network remains the same. This helps with reducing the size of a model because pruning makes a neural network more sparse, and sparse architecture tends to require less storage space than dense structure. (Location 5487)
  • Quantization is the most general and commonly used model compression method. It’s straightforward to do and generalizes over tasks and architectures. Quantization reduces a model’s size by using fewer bits to represent its parameters. By default, most software packages use 32 bits to represent a float number (single precision floating point). If a model has 100M parameters and each requires 32 bits to store, it’ll take up 400 MB. If we use 16 bits to represent a number, we’ll reduce the memory footprint by half. Using 16 bits to represent a float is called half precision. (Location 5505)
  • Instead of using floats, you can have a model entirely in integers; each integer takes only 8 bits to represent. This method is also known as “fixed point.” In the extreme case, some have attempted the 1- bit representation of each weight (binary weight neural networks), e.g., BinaryConnect and XNOR- Net. 30 The authors of the XNOR- Net paper spun off Xnor.ai, a startup that focused on model compression. (Location 5511)
  • Quantization not only reduces the memory footprint but also improves the computation speed. First, it allows us to increase our batch size. Second, less precision speeds up computation, which further reduces training time and inference latency. Consider the addition of two numbers: if we perform the addition bit by bit and each bit operation takes x nanoseconds, it'll take 32x nanoseconds for 32-bit numbers but only 16x nanoseconds for 16-bit numbers. (Location 5517)
  • There are downsides to quantization. Reducing the number of bits to represent your numbers means that you can represent a smaller range of values. For values outside that range, you’ll have to round them up and/ or scale them to be in range. Rounding numbers leads to rounding errors, and small rounding errors can lead to big performance changes. You also run the risk of rounding/ scaling your numbers to under-/ overflow and rendering it to 0. Efficient rounding and scaling is nontrivial to implement at a low level, but luckily, major frameworks have this built in. (Location 5520)
  • Quantization can either happen during training (quantization aware training), 32 where models are trained in lower precision, or post- training, where models are trained in single- precision floating point and then quantized for inference. Using quantization during training means that you can use less memory for each parameter, which allows you to train larger models on the same hardware. (Location 5524)
  • Recently, low- precision training has become increasingly popular, with support from most modern training hardware. NVIDIA introduced Tensor Cores, processing units that support mixed- precision training. 33 Google TPUs (tensor processing units) also support training with Bfloat16 (16- bit Brain Floating Point Format), which the company dubbed “the secret to high performance on Cloud TPUs.” 34 Training in fixed- point is not yet as popular but has had a lot of promising results. (Location 5528)
  • Fixed- point inference has become a standard in the industry. Some edge devices only support fixed- point inference. Most popular frameworks for on- device ML inference— Google’s TensorFlow Lite, Facebook’s PyTorch Mobile, NVIDIA’s TensorRT— offer post- training quantization for free with a few lines of code. (Location 5534)
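A minimal sketch of post-training dynamic quantization in PyTorch; the toy model is a stand-in, and the exact module path can vary slightly across PyTorch versions (newer releases also expose this under torch.ao.quantization):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Convert Linear weights to 8-bit integers after training; activations are
# quantized dynamically at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
print(quantized(x).shape)
```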
  • Another decision you’ll want to consider is where your model’s computation will happen: on the cloud or on the edge. On the cloud means a large chunk of computation is done on the cloud, either public clouds or private clouds. On the edge means a large chunk of computation is done on consumer devices— such as browsers, phones, laptops, smartwatches, cars, security cameras, robots, embedded devices, FPGAs (field programmable gate arrays), and ASICs (application- specific integrated circuits)— which are also known as edge devices. (Location 5552)
  • As their cloud bills climb, more and more companies are looking for ways to push their computations to edge devices. The more computation is done on the edge, the less is required on the cloud, and the less they’ll have to pay for servers. (Location 5567)
  • Other than help with controlling costs, there are many properties that make edge computing appealing. The first is that it allows your applications to run where cloud computing cannot. When your models are on public clouds, they rely on stable internet connections to send data to the cloud and back. Edge computing allows your models to work in situations where there are no internet connections or where the connections are unreliable, such as in rural areas or developing countries. I’ve worked with several companies and organizations that have strict no- internet policies, which means that whichever applications we wanted to sell them must not rely on internet connections. (Location 5569)
  • when your models are already on consumers’ devices, you can worry less about network latency. Requiring data transfer over the network (sending data to the model on the cloud to make predictions then sending predictions back to the users) might make some use cases impossible. In many cases, network latency is a bigger bottleneck than inference latency. (Location 5573)
  • Putting your models on the edge is also appealing when handling sensitive user data. ML on the cloud means that your systems might have to send user data over networks, making it susceptible to being intercepted. Cloud computing also often means storing data of many users in the same place, which means a breach can affect many people. (Location 5577)
  • Edge computing makes it easier to comply with regulations, like GDPR, about how user data can be transferred or stored. While edge computing might reduce privacy concerns, it doesn’t eliminate them altogether. In some cases, edge computing might make it easier for attackers to steal user data, such as they can just take the device with them. (Location 5582)
  • Providing support for a framework on a hardware backend is time- consuming and engineering- intensive. Mapping from ML workloads to a hardware backend requires understanding and taking advantage of that hardware’s design, and different hardware backends have different memory layouts and compute primitives, as shown in Figure 7- 11. (Location 5603)
  • the compute primitive of CPUs used to be a number (scalar) and the compute primitive of GPUs used to be a one- dimensional vector, whereas the compute primitive of TPUs is a two- dimensional vector (tensor). 44 Performing a convolution operator will be very different with one- dimensional vectors compared to two- dimensional vectors. Similarly, you’d need to take into account different L1, L2, and L3 layouts and buffer sizes to use them efficiently. (Location 5609)
  • Instead of targeting new compilers and libraries for every new hardware backend, what if we create a middleman to bridge frameworks and platforms? Framework developers will no longer have to support every type of hardware; they will only need to translate their framework code into this middleman. Hardware vendors can then support one middleman instead of multiple frameworks. This type of “middleman” is called an intermediate representation (IR). IRs lie at the core of how compilers work. From the original code for a model, compilers generate a series of high- and low- level IRs before generating the code native to a hardware backend so that it can run on that hardware backend, (Location 5615)
  • This process is also called lowering, as in you “lower” your high- level framework code into low- level hardware- native code. It’s not translating because there’s no one- to- one mapping between them. High- level IRs are usually computation graphs of your ML models. A computation graph is a graph that describes the order in which your computation is executed. Readers interested can read about computation graphs in PyTorch and TensorFlow. (Location 5624)
  • In many companies, what usually happens is that data scientists and ML engineers develop models that seem to be working fine in development. However, when these models are deployed, they turn out to be too slow, so their companies hire optimization engineers to optimize their models for the hardware their models run on. (Location 5640)
  • Optimization engineers are hard to come by and expensive to hire because they need to have expertise in both ML and hardware architectures. Optimizing compilers (compilers that also optimize your code) are an alternative solution, as they can automate the process of optimizing models. In the process of lowering ML model code into machine code, compilers can look at the computation graph of your ML model and the operators it consists of— convolution, loops, cross- entropy— and find a way to speed it up. (Location 5649)
  • There are standard local optimization techniques that are known to speed up your model, most of them making things run in parallel or reducing memory access on chips. Here are four of the common techniques: Vectorization: Given a loop or a nested loop, instead of executing it one item at a time, execute multiple elements contiguous in memory at the same time to reduce latency caused by data I/O. Parallelization: Given an input array (or n-dimensional array), divide it into different, independent work chunks, and do the operation on each chunk individually. Loop tiling: Change the data accessing order in a loop to leverage hardware's memory layout and cache. 46 This kind of optimization is hardware dependent. A good access pattern on CPUs is not a good access pattern on GPUs. Operator fusion: Fuse multiple operators into one to avoid redundant memory access. For example, two operations on the same array require two loops over that array. In a fused case, it's just one loop. Figure 7-13 shows an example of operator fusion. (Location 5654)
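To make vectorization and operator fusion concrete, here is a toy NumPy sketch (not from the book): the unfused version makes two passes over memory, the fused version makes one, and the vectorized expression pushes the whole loop into optimized routines over contiguous memory, which is the kind of transformation an optimizing compiler automates.

```python
import numpy as np

x = np.random.rand(100_000).astype(np.float32)

def unfused(x):
    # Two separate loops over the array -> two rounds of memory access.
    y = np.empty_like(x)
    z = np.empty_like(x)
    for i in range(len(x)):
        y[i] = x[i] * x[i]
    for i in range(len(x)):
        z[i] = y[i] + 1.0
    return z

def fused(x):
    # One loop computes both operations per element, avoiding the second pass.
    z = np.empty_like(x)
    for i in range(len(x)):
        z[i] = x[i] * x[i] + 1.0
    return z

def vectorized(x):
    # The whole expression runs over contiguous memory in optimized C loops.
    return x * x + 1.0

assert np.allclose(unfused(x), vectorized(x))
```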
  • If you use PyTorch on GPUs, you might have seen torch.backends.cudnn.benchmark = True. When this is set to True, cuDNN autotune will be enabled. cuDNN autotune searches over a predetermined set of options to execute a convolution operator and then chooses the fastest way. cuDNN autotune, despite its effectiveness, only works for convolution operators. A much more general solution is autoTVM, which is part of the open source compiler stack TVM. autoTVM works with subgraphs instead of just an operator, so the search spaces it works with are much more complex. The way autoTVM works is quite complicated, but in simple terms: (1) it first breaks your computation graph into subgraphs; (2) it predicts how big each subgraph is; (3) it allocates time to search for the best possible path for each subgraph; and (4) it stitches the best possible way to run each subgraph together to execute the entire graph. (Location 5699)
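For reference, enabling cuDNN autotune in PyTorch is a one-line change; the sketch below assumes a CUDA-capable GPU is available and uses a made-up convolution just to trigger the algorithm search.

```python
import torch

# With benchmark=True, cuDNN tries a predetermined set of convolution
# algorithms for each input configuration it sees and caches the fastest one.
# This helps most when input shapes stay fixed across calls.
torch.backends.cudnn.benchmark = True

conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda()
x = torch.randn(8, 3, 224, 224, device="cuda")
with torch.no_grad():
    _ = conv(x)   # first call triggers the search; later calls reuse the result
```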
  • While the results of ML- powered compilers are impressive, they come with a catch: they can be slow. You go through all the possible paths and find the most optimized ones. This process can take hours, even days for complex ML models. However, it’s a one- time operation, and the results of your optimization search can be cached and used to both optimize existing models and provide a starting point for future tuning sessions. You optimize your model once for one hardware backend then run it on multiple devices of that same hardware type. This sort of optimization is ideal when you have a model ready for production and target hardware to run inference on. (Location 5712)
  • WebAssembly (WASM). WASM is an open standard that allows you to run executable programs in browsers. After you’ve built your models in scikit- learn, PyTorch, TensorFlow, or whatever frameworks you’ve used, instead of compiling your models to run on specific hardware, you can compile your model to WASM. You get back an executable file that you can just use with JavaScript. WASM is one of the most exciting technological trends I’ve seen in the last couple of years. It’s performant, easy to use, and has an ecosystem that is growing like wildfire. 51 As of September 2021, it’s supported by 93% of devices worldwide. 52 The main drawback of WASM is that because WASM runs in browsers, it’s slow. Even though WASM is already much faster than JavaScript, it’s still slow compared to running code natively on devices (such as iOS or Android apps). (Location 5736)
  • deploying a model isn’t the end of the process. A model’s performance degrades over time in production. Once a model has been deployed, we still have to continually monitor its performance to detect issues as well as deploy updates to fix these issues. (Location 5925)
  • Operational expectation violations are easier to detect, as they’re usually accompanied by an operational breakage such as a timeout, a 404 error on a webpage, an out- of- memory error, or a segmentation fault. However, ML performance expectation violations are harder to detect as doing so requires measuring and monitoring the performance of ML models in production. (Location 5944)
  • Software system failures are failures that would have happened to non-ML systems. Here are some examples of software system failures: Dependency failure: A software package or a codebase that your system depends on breaks, which leads your system to break. This failure mode is common when the dependency is maintained by a third party, and especially common if the third party that maintains the dependency no longer exists. 2 Deployment failure: Failures caused by deployment errors, such as when you accidentally deploy the binaries of an older version of your model instead of the current version, or when your systems don't have the right permissions to read or write certain files. Hardware failures: When the hardware that you use to deploy your model, such as CPUs or GPUs, doesn't behave the way it should. For example, the CPUs you use might overheat and break down. 3 Downtime or crashing: If a component of your system runs from a server somewhere, such as AWS or a hosted service, and that server is down, your system will also be down. (Location 5955)
  • Addressing software system failures requires not ML skills, but traditional software engineering skills, and addressing them is beyond the scope of this book. Because of the importance of traditional software engineering skills in deploying ML systems, ML engineering is mostly engineering, not ML. 5 For readers interested in learning how to make ML systems reliable from the software engineering perspective, I highly recommend the book Reliable Machine Learning, published by O’Reilly with Todd Underwood as one of the authors. (Location 5984)
  • A reason for the prevalence of software system failures is that because ML adoption in the industry is still nascent, tooling around ML production is limited and best practices are not yet well developed or standardized. However, as tooling and best practices for ML production mature, there are reasons to believe that the proportion of software system failures will decrease and the proportion of ML-specific failures will increase. (Location 5989)
  • ML- specific failures are failures specific to ML systems. Examples include data collection and processing problems, poor hyperparameters, changes in the training pipeline not correctly replicated in the inference pipeline and vice versa, data distribution shifts that cause a model’s performance to deteriorate over time, edge cases, and degenerate feedback loops. (Location 5994)
  • When we say that an ML model learns from the training data, it means that the model learns the underlying distribution of the training data with the goal of leveraging this learned distribution to generate accurate predictions for unseen data— data that it didn’t see during training. (Location 6004)
  • When the model is able to generate accurate predictions for unseen data, we say that this model “generalizes to unseen data.” 6 The test data that we use to evaluate a model during development is supposed to represent unseen data, and the model’s performance on the test data is supposed to give us an idea of how well the model will generalize. (Location 6008)
  • One of the first things I learned in ML courses is that it’s essential for the training data and the unseen data to come from a similar distribution. The assumption is that the unseen data comes from a stationary distribution that is the same as the training data distribution. If the unseen data comes from a different distribution, the model might not generalize well. (Location 6012)
  • the underlying distribution of the real- world data is unlikely to be the same as the underlying distribution of the training data. Curating a training dataset that can accurately represent the data that a model will encounter in production turns out to be very difficult. 8 Real- world data is multifaceted and, in many cases, virtually infinite, whereas training data is finite and constrained by the time, compute, and human resources available during the dataset creation and processing. (Location 6017)
  • This failure mode is known as the train-serving skew: a model that does great in development but performs poorly when deployed. (Location 6024)
  • Data shifts happen all the time, suddenly, gradually, or seasonally. They can happen suddenly because of a specific event, such as when your existing competitors change their pricing policies and you have to update your price predictions in response, or when you launch your product in a new region, or when a celebrity mentions your product, which causes a surge in new users, and so on. They can happen gradually because social norms, cultures, languages, trends, industries, etc. just change over time. They can also happen due to seasonal variations, such as people might be more likely to request rideshares in the winter when it’s cold and snowy than in the spring. (Location 6030)
  • a large percentage of what might look like data shifts on monitoring dashboards are caused by internal errors, 9 such as bugs in the data pipeline, missing values incorrectly inputted, inconsistencies between the features extracted during training and inference, features standardized using statistics from the wrong subset of data, wrong model version, or bugs in the app interface that force users to change their behaviors. (Location 6035)
  • Edge cases: Imagine there existed a self-driving car that could drive you safely 99.99% of the time, but the other 0.01% of the time, it might get into a catastrophic accident that can leave you permanently injured or even dead. 10 Would you use that car? (Location 6043)
  • An ML model that performs well on most cases but fails on a small number of cases might not be usable if these failures cause catastrophic consequences. For this reason, major self- driving car companies are focusing on making their systems work on edge cases. 11 Edge cases are the data samples so extreme that they cause the model to make catastrophic mistakes. Even though edge cases generally refer to data samples drawn from the same distribution, if there is a sudden increase in the number of data samples in which your model doesn’t perform well, it could be an indication that the underlying data distribution has shifted. (Location 6049)
  • outliers refer to data: an example that differs significantly from other examples. Edge cases refer to performance: an example where a model performs significantly worse than other examples. An outlier can cause a model to perform unusually poorly, which makes it an edge case. However, not all outliers are edge cases. (Location 6066)
  • In the section "Natural Labels", we discussed a feedback loop as the time it takes from when a prediction is shown until the time feedback on the prediction is provided. The feedback can be used to extract natural labels to evaluate the model's performance and train the next iteration of the model. (Location 6080)
  • A degenerate feedback loop can happen when the predictions themselves influence the feedback, which, in turn, influences the next iteration of the model. More formally, a degenerate feedback loop is created when a system's outputs are used to generate the system's future inputs, which, in turn, influence the system's future outputs. In ML, a system's predictions can influence how users interact with the system, and because users' interactions with the system are sometimes used as training data to the same system, degenerate feedback loops can occur and cause unintended consequences. Degenerate feedback loops are especially common in tasks with natural labels from users, such as recommender systems and ads click-through-rate prediction. (Location 6084)
  • imagine you build a system to recommend to users songs that they might like. The songs that are ranked high by the system are shown first to users. Because they are shown first, users click on them more, which makes the system more confident that these recommendations are good. In the beginning, the rankings of two songs, A and B, might be only marginally different, but because A was originally ranked a bit higher, it showed up higher in the recommendation list, making users click on A more, which made the system rank A even higher. After a while, A’s ranking became much higher than B’s. 13 Degenerate feedback loops are one reason why popular movies, books, or songs keep getting more popular, which makes it hard for new items to break into popular lists. This type of scenario is incredibly common in production, and it’s heavily researched. It goes by many different names, including “exposure bias,” “popularity bias,” “filter bubbles,” and sometimes “echo chambers.” (Location 6090)
  • degenerate feedback loops can cause your model to perform suboptimally at best. At worst, they can perpetuate and magnify biases embedded in data, such as biasing against candidates without feature X. (Location 6105)
  • For the task of recommender systems, it’s possible to detect degenerate feedback loops by measuring the popularity diversity of a system’s outputs even when the system is offline. An item’s popularity can be measured based on how many times it has been interacted with (e.g., seen, liked, bought, etc.) in the past. The popularity of all the items will likely follow a long- tail distribution: a small number of items are interacted with a lot, while most items are rarely interacted with at all. Various metrics such as aggregate diversity and average coverage of long- tail items proposed by Brynjolfsson et al. (2011), Fleder and Hosanagar (2009), and Abdollahpouri et al. (2019) can help you measure the diversity of the outputs of a recommender system. 15 Low scores mean that the outputs of your system are homogeneous, which might be caused by popularity bias. (Location 6110)
  • They first divided items into buckets based on their popularity— e.g., bucket 1 consists of items that have been interacted with less than 100 times, bucket 2 consists of items that have been interacted with more than 100 times but less than 1,000 times, etc. Then they measured the prediction accuracy of a recommender system for each of these buckets. If a recommender system is much better at recommending popular items than recommending less popular items, it likely suffers from popularity bias. 16 Once your system is in production and you notice that its predictions become more homogeneous over time, it likely suffers from degenerate feedback loops. (Location 6120)
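A rough sketch of that bucketing idea (the data, bucket thresholds, and column names are invented): bucket items by how often they were interacted with in the past, then compare the recommender's hit rate across buckets.

```python
import pandas as pd

# Hypothetical evaluation log: one row per recommendation, with the item's
# historical interaction count and whether the user interacted with it ("hit").
log = pd.DataFrame({
    "item_id":           [1, 2, 3, 4, 5, 6],
    "past_interactions": [12, 80, 450, 3_000, 9_500, 120_000],
    "hit":               [0, 0, 1, 1, 1, 1],
})

buckets = pd.cut(
    log["past_interactions"],
    bins=[0, 100, 1_000, float("inf")],
    labels=["<100", "100-1k", ">1k"],
)

# If accuracy is much higher on popular buckets than on rare ones,
# the system likely suffers from popularity bias.
print(log.groupby(buckets)["hit"].mean())
```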
  • We’ve discussed that degenerate feedback loops can cause a system’s outputs to be more homogeneous over time. Introducing randomization in the predictions can reduce their homogeneity. In the case of recommender systems, instead of showing the users only the items that the system ranks highly for them, we show users random items and use their feedback to determine the true quality of these items. This is the approach that TikTok follows. Each new video is randomly assigned an initial pool of traffic (which can be up to hundreds of impressions). This pool of traffic is used to evaluate each video’s unbiased quality to determine whether it should be moved to a bigger pool of traffic or be marked as irrelevant. (Location 6131)
  • Randomization has been shown to improve diversity, but at the cost of user experience. 18 Showing our users completely random items might cause users to lose interest in our product. An intelligent exploration strategy, such as those discussed in the section “Contextual bandits as an exploration strategy”, can help increase item diversity with acceptable prediction accuracy loss. Schnabel et al. use a small amount of randomization and causal inference techniques to estimate the unbiased value of each song. 19 They were able to show that this algorithm was able to correct a recommender system to make recommendations fair to creators. (Location 6137)
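A minimal sketch of mixing randomization into a recommendation slate, in the spirit of the exploration strategies mentioned above; the epsilon value, slate size, and item lists are all made up, and a real system would also deduplicate items and log which slots were explored.

```python
import random

def build_slate(ranked_items, candidate_pool, epsilon=0.1, k=10):
    """Mostly exploit the model's ranking, but with probability epsilon per
    slot show a random candidate so new or unpopular items get unbiased
    exposure."""
    slate = []
    for rank in range(k):
        if random.random() < epsilon:
            slate.append(random.choice(candidate_pool))   # explore
        else:
            slate.append(ranked_items[rank])              # exploit
    return slate

ranked = [f"song_{i}" for i in range(100)]       # model's ranking, best first
pool = [f"song_{i}" for i in range(100, 1000)]   # items that need exposure
print(build_slate(ranked, pool))
```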
  • If the position in which a prediction is shown affects its feedback in any way, you might want to encode the position information using positional features. Positional features can be numerical (e.g., positions are 1, 2, 3,…) or Boolean (e.g., whether a prediction is shown in the first position or not). Note that “positional features” are different from “positional embeddings” mentioned in Chapter 5. (Location 6147)
  • Here is a naive example to show how to use positional features. During training, you add “whether a song is recommended first” as a feature to your training data, as shown in Table 8- 1. This feature allows your model to learn how much being a top recommendation influences how likely a song is clicked on. (Location 6151)
  • During inference, you want to predict whether a user will click on a song regardless of where the song is recommended, so you might want to set the 1st Position feature to be False. Then you look at the model’s predictions for various songs for each user and can choose the order in which to show each song. This is a naive example because doing this alone might not be enough to combat degenerate feedback loops. A more sophisticated approach would be to use two different models. The first model predicts the probability that the user will see and consider a recommendation taking into account the position at which that recommendation will be shown. The second model then predicts the probability that the user will click on the item given that they saw and considered it. The second model doesn’t concern positions at all. (Location 6176)
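Here is a minimal sketch of the naive positional-feature approach described above, using a hypothetical table and scikit-learn's logistic regression: train with a "shown_first" feature, then score candidates with it set to 0 so the ranking is position-free.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training data in the spirit of Table 8-1.
train = pd.DataFrame({
    "song_popularity": [0.9, 0.2, 0.6, 0.4],
    "user_affinity":   [0.1, 0.8, 0.5, 0.7],
    "shown_first":     [1,   0,   1,   0],   # positional feature
    "clicked":         [1,   0,   1,   1],
})
features = ["song_popularity", "user_affinity", "shown_first"]
model = LogisticRegression().fit(train[features], train["clicked"])

# At inference, set shown_first=0 for every candidate so the predicted click
# probability doesn't depend on where the song would be displayed.
candidates = pd.DataFrame({
    "song_popularity": [0.7, 0.3],
    "user_affinity":   [0.4, 0.9],
    "shown_first":     [0,   0],
})
scores = model.predict_proba(candidates[features])[:, 1]
print(scores)   # rank songs by these position-free scores
```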
  • Data distribution shift refers to the phenomenon in supervised learning when the data a model works with changes over time, which causes this model’s predictions to become less accurate as time passes. The distribution of the data the model is trained on is called the source distribution. The distribution of the data the model runs inference on is called the target distribution. (Location 6190)
  • There’s also a book on dataset distribution shifts, Dataset Shift in Machine Learning by Quiñonero- Candela et al., published by MIT Press in 2008. (Location 6195)
  • While data distribution shift is often used interchangeably with concept drift and covariate shift and occasionally label shift, these are three distinct subtypes of data shift. Note that this discussion on different types of data shifts is math- heavy and mostly useful from a research perspective: to develop efficient algorithms to detect and address data shifts requires understanding the causes of those shifts. In production, when encountering a distribution shift, data scientists don’t usually stop to wonder what type of shift it is. They mostly care about what they can do to handle this shift. (Location 6198)
  • Let's call the inputs to a model X and its outputs Y. We know that in supervised learning, the training data can be viewed as a set of samples from the joint distribution P(X, Y), and then ML usually models P(Y|X). This joint distribution P(X, Y) can be decomposed in two ways: P(X, Y) = P(Y|X)P(X) and P(X, Y) = P(X|Y)P(Y). (Location 6204)
  • P(Y|X) denotes the conditional probability of an output given an input—for example, the probability of an email being spam given the content of the email. P(X) denotes the probability density of the input. P(Y) denotes the probability density of the output. (Location 6212)
  • Covariate shift: when P(X) changes but P(Y|X) remains the same. This refers to the first decomposition of the joint distribution. (Location 6215)
  • Label shift: when P(Y) changes but P(X|Y) remains the same. This refers to the second decomposition of the joint distribution. (Location 6217)
  • Concept drift: when P(Y|X) changes but P(X) remains the same. This refers to the first decomposition of the joint distribution. 21 (Location 6222)
  • Mathematically, covariate shift is when P(X) changes, but P(Y|X) remains the same, which means that the distribution of the input changes, but the conditional probability of an output given an input remains the same. (Location 6240)
  • During model development, covariate shifts can happen due to biases during the data selection process, which could result from difficulty in collecting examples for certain classes. For example, suppose that to study breast cancer, you get data from a clinic where women go to test for breast cancer. Because people over 40 are encouraged by their doctors to get checkups, your data is dominated by women over 40. For this reason, covariate shift is closely related to the sample selection bias problem. 24 (Location 6247)
  • Covariate shift can also happen when training data is artificially altered to make it easier for your model to learn. As discussed in Chapter 4, it's hard for ML models to learn from imbalanced datasets, so you might want to collect more samples of the rare classes or oversample your data on the rare classes to make it easier for your model to learn the rare classes. (Location 6252)
  • Covariate shift can also be caused by the model’s learning process, especially through active learning. In Chapter 4, we defined active learning as follows: instead of randomly selecting samples to train a model on, we use the samples most helpful to that model according to some heuristics. This means that the training input distribution is altered by the learning process to differ from the real- world input distribution, and covariate shifts are a by- product. 25 (Location 6255)
  • In production, covariate shift usually happens because of major changes in the environment or in the way your application is used. Imagine you have a model to predict how likely a free user will be to convert to a paid user. The income level of the user is a feature. Your company's marketing department recently launched a campaign that attracts users from a demographic more affluent than your current demographic. The input distribution into your model has changed, but the probability that a user with a given income level will convert remains the same. (Location 6260)
  • If you know in advance how the real-world input distribution will differ from your training input distribution, you can leverage techniques such as importance weighting to train your model to work for the real-world data. Importance weighting consists of two steps: estimate the density ratio between the real-world input distribution and the training input distribution, then weight the training data according to this ratio and train an ML model on this weighted data. 26 (Location 6264)
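One common practical way to estimate that density ratio (a sketch under assumptions, not the book's prescribed recipe) is to train a classifier to distinguish training samples from real-world samples and use its odds as per-sample weights.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Source (training) inputs and a sample of real-world (target) inputs.
x_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 1))
x_target = rng.normal(loc=0.5, scale=1.0, size=(1000, 1))

# Step 1: estimate the density ratio p_target(x) / p_train(x) by training a
# domain classifier to tell the two samples apart.
domain_x = np.vstack([x_train, x_target])
domain_y = np.concatenate([np.zeros(len(x_train)), np.ones(len(x_target))])
domain_clf = LogisticRegression().fit(domain_x, domain_y)

p_target = domain_clf.predict_proba(x_train)[:, 1]
weights = p_target / (1.0 - p_target)   # importance weight per training sample

# Step 2: train the actual model on the weighted training data, e.g.:
#   model.fit(x_train, y_train, sample_weight=weights)
print(weights[:5])
```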
  • Label shift, also known as prior shift, prior probability shift, or target shift, is when P(Y) changes but P(X|Y) remains the same. You can think of this as the case when the output distribution changes but, for a given output, the input distribution stays the same. (Location 6277)
  • Remember that covariate shift is when the input distribution changes. When the input distribution changes, the output distribution also changes, resulting in both covariate shift and label shift happening at the same time. (Location 6283)
  • However, not all covariate shifts result in label shifts. It's a subtle point, so we'll consider another example. Imagine that there is now a preventive drug that every woman takes that helps reduce their chance of getting breast cancer. The probability P(Y|X) reduces for women of all ages, so it's no longer a case of covariate shift. However, given a person with breast cancer, the age distribution remains the same, so this is still a case of label shift. (Location 6289)
  • Concept drift, also known as posterior shift, is when the input distribution remains the same but the conditional distribution of the output given an input changes. You can think of this as “same input, different output.” (Location 6295)
  • Consider you're in charge of a model that predicts the price of a house based on its features. Before COVID-19, a three-bedroom apartment in San Francisco could cost $2,000,000; when COVID-19 hit and many people left the city, the same apartment might cost only $1,500,000. So even though the distribution of house features remains the same, the conditional distribution of the price of a house given its features has changed. (Location 6300)
  • In many cases, concept drifts are cyclic or seasonal. For example, rideshare prices will fluctuate on weekdays versus weekends, and flight ticket prices rise during holiday seasons. Companies might have different models to deal with cyclic and seasonal drifts. For example, they might have one model to predict rideshare prices on weekdays and another model for weekends. (Location 6303)
  • One is feature change, such as when new features are added, older features are removed, or the set of all possible values of a feature changes. 28 For example, your model was using years for the “age” feature, but now it uses months, so the range of this feature’s values has drifted. (Location 6309)
  • Label schema change is when the set of possible values for Y changes. With label shift, P(Y) changes but P(X|Y) remains the same. With label schema change, both P(Y) and P(X|Y) change. A schema describes the structure of the data, so the label schema of a task describes the structure of the labels of that task. For example, a dictionary that maps from a class to an integer value, such as {"POSITIVE": 0, "NEGATIVE": 1}, is a schema. (Location 6316)
  • Data distribution shifts are only a problem if they cause your model’s performance to degrade. So the first idea might be to monitor your model’s accuracy- related metrics— accuracy, F1 score, recall, AUC- ROC, etc.— in production to see whether they have changed. “Change” here usually means “decrease,” but if my model’s accuracy suddenly goes up or fluctuates significantly for no reason that I’m aware of, I’d want to investigate. (Location 6338)
  • Accuracy- related metrics work by comparing the model’s predictions to ground truth labels. 30 During model development, you have access to labels, but in production, you don’t always have access to labels, and even if you do, labels will be delayed, as discussed in the section “Natural Labels”. Having access to labels within a reasonable time window will vastly help with giving you visibility into your model’s performance. (Location 6343)
  • When ground truth labels are unavailable or too delayed to be useful, we can monitor other distributions of interest instead. The distributions of interest are the input distribution P(X), the label distribution P(Y), and the conditional distributions P(X|Y) and P(Y|X). (Location 6347)
  • In industry, a simple method many companies use to detect whether the two distributions are the same is to compare their statistics like min, max, mean, median, variance, various quantiles (such as 5th, 25th, 75th, or 95th quantile), skewness, kurtosis, etc. For example, you can compute the median and variance of the values of a feature during inference and compare them to the metrics computed during training. (Location 6357)
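A small sketch of that statistics comparison with pandas (the feature values here are invented): compute the same summary statistics for a feature at training time and at inference time and put them side by side.

```python
import pandas as pd

QUANTILES = [0.05, 0.25, 0.50, 0.75, 0.95]

def feature_stats(series: pd.Series) -> pd.Series:
    """Summary statistics worth comparing between training and inference data."""
    quantiles = series.quantile(QUANTILES)
    quantiles.index = [f"q{int(q * 100)}" for q in QUANTILES]
    basics = pd.Series({
        "min": series.min(), "max": series.max(),
        "mean": series.mean(), "median": series.median(),
        "var": series.var(), "skew": series.skew(), "kurtosis": series.kurt(),
    })
    return pd.concat([basics, quantiles])

train_age = pd.Series([23, 35, 41, 29, 52, 38, 44])   # values seen in training
prod_age = pd.Series([61, 64, 58, 70, 66, 59, 72])    # values seen at inference
print(pd.DataFrame({"train": feature_stats(train_age),
                    "prod": feature_stats(prod_age)}))
```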
  • Mean, median, and variance are only useful with the distributions for which the mean/ median/ variance are useful summaries. If those metrics differ significantly, the inference distribution might have shifted from the training distribution. However, if those metrics are similar, there’s no guarantee that there’s no shift. (Location 6364)
  • A more principled method is the two-sample hypothesis test, shortened as the two-sample test. It's a test to determine whether the difference between two populations (two sets of data) is statistically significant. If the difference is statistically significant, then the probability that the difference is a random fluctuation due to sampling variability is very low, and, therefore, the difference is caused by the fact that these two populations come from two distinct distributions. If you consider the data from yesterday to be the source population and the data from today to be the target population and they are statistically different, it's likely that the underlying data distribution has shifted between yesterday and today. (Location 6366)
  • A caveat is that just because the difference is statistically significant doesn’t mean that it is practically important. However, a good heuristic is that if you are able to detect the difference from a relatively small sample, then it is probably a serious difference. If it takes a huge number of samples to detect, then the difference is probably not worth worrying about. (Location 6371)
  • A basic two- sample test is the Kolmogorov– Smirnov test, also known as the K- S or KS test. 32 It’s a nonparametric statistical test, which means it doesn’t require any parameters of the underlying distribution to work. It doesn’t make any assumption about the underlying distribution, which means it can work for any distribution. However, one major drawback of the KS test is that it can only be used for one- dimensional data. If your model’s predictions and labels are one- dimensional (scalar numbers), then the KS test is useful to detect label or prediction shifts. However, it won’t work for high- dimensional data, and features are usually high- dimensional. 33 KS tests can also be expensive and produce too many false positive alerts. (Location 6374)
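A quick sketch of applying the KS test to one-dimensional values such as model predictions, using scipy (the distributions and the alerting threshold are made up).

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# One-dimensional values only, e.g., predicted probabilities collected during
# a reference window vs during the current production window.
preds_source = rng.beta(2, 5, size=5000)
preds_target = rng.beta(2, 4, size=5000)

stat, p_value = ks_2samp(preds_source, preds_target)
if p_value < 0.01:
    print(f"possible prediction drift: KS statistic={stat:.3f}, p={p_value:.1e}")
```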
  • Another test is Least- Squares Density Difference, an algorithm that is based on the least squares density- difference estimation method. 35 There is also MMD, Maximum Mean Discrepancy (Gretton et al. 2012), a kernel- based technique for multivariate two- sample testing and its variant Learned Kernel MMD (Liu et al. 2020). MMD is popular in research, but as of writing this book, I’m not aware of any company that is using it in the industry. (Location 6382)
  • Alibi Detect is a great open source package with the implementations of many drift detection algorithms, as shown in Figure 8- 2. Because two- sample tests often work better on low- dimensional data than on high- dimensional data, it’s highly recommended that you reduce the dimensionality of your data before performing a two- sample test on it. 36 (Location 6387)
  • Not all types of shifts are equal— some are harder to detect than others. For example, shifts happen at different rates, and abrupt changes are easier to detect than slow, gradual changes. 37 Shifts can also happen across two dimensions: spatial or temporal. (Location 6397)
  • The time scale window of the data we look at affects the shifts we can detect. If your data has a weekly cycle, then a time scale of less than a week won't detect the cycle. Consider the data in Figure 8-3. If we use data from day 9 to day 14 as the source distribution, then day 15 looks like a shift. However, if we use data from day 1 to day 14 as the source distribution, then all data points from day 15 are likely being generated by that same distribution. As illustrated by this example, detecting temporal shifts is hard when shifts are confounded by seasonal variation. (Figure 8-3: whether a distribution has drifted over time depends on the time scale window specified.) (Location 6404)
  • When computing running statistics over time, it's important to differentiate between cumulative and sliding statistics. Sliding statistics are computed within a single time scale window, e.g., an hour. Cumulative statistics are continually updated with more data. This means that at the beginning of each time scale window, the sliding accuracy is reset, whereas the cumulative accuracy is not. Because cumulative statistics contain information from previous time windows, they might obscure what happens in a specific time window. (Location 6411)
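To make the difference concrete, here is a small pandas sketch with invented hourly accuracies: the per-window statistic is reset each window, while the cumulative statistic carries all history and can mask a recent degradation.

```python
import pandas as pd

# Hypothetical hourly accuracy of a deployed model; it degrades at hour 5.
acc = pd.Series(
    [0.92, 0.91, 0.93, 0.90, 0.74, 0.72, 0.71, 0.70],
    index=pd.date_range("2022-01-01", periods=8, freq="H"),
)

# Sliding statistic: computed within each (non-overlapping) 4-hour window,
# then reset at the start of the next window.
sliding = acc.resample("4H").mean()

# Cumulative statistic: continually updated with all data seen so far.
cumulative = acc.expanding().mean()

print(sliding)      # the second window clearly shows the drop
print(cumulative)   # the early good hours drag the running average up
```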
  • Working with data in the temporal space makes things so much more complicated, requiring knowledge of time- series analysis techniques such as time- series decompositions (Location 6419)
  • For readers interested in time- series decomposition, Lyft engineering has a great case study on how they decompose their time- series data to deal with the seasonality of the market. (Location 6421)
  • As of today, many companies use the distribution of the training data as the base distribution and monitor the production data distribution at a certain granularity level, such as hourly and daily. 39 The shorter your time scale window, the faster you’ll be able to detect changes in your data distribution. However, too short a time scale window can lead to false alarms of shifts, (Location 6422)
  • Some platforms, especially those dealing with real- time data analytics such as monitoring, provide a merge operation that allows merging statistics from shorter time scale windows to create statistics for larger time scale windows. For example, you can compute the data statistics you care about hourly, then merge these hourly statistics chunks into daily views. (Location 6427)
  • More advanced monitoring platforms even attempt a root cause analysis (RCA) feature that automatically analyzes statistics across various time window sizes to detect exactly the time window where a change in data happened. (Location 6430)
  • many companies assume that data shifts are inevitable, so they periodically retrain their models— once a month, once a week, or once a day— regardless of the extent of the shift. How to determine the optimal frequency to retrain your models is an important decision that many companies still determine based on gut feelings instead of experimental data. (Location 6443)
  • To make a model work with a new distribution in production, there are three main approaches. The first is the approach that currently dominates research: train models using massive datasets. The hope here is that if the training dataset is large enough, the model will be able to learn such a comprehensive distribution that whatever data points the model will encounter in production will likely come from this distribution. (Location 6448)
  • The second approach, less popular in research, is to adapt a trained model to a target distribution without requiring new labels. Zhang et al. (2013) used causal interpretations together with kernel embedding of conditional and marginal distributions to correct models’ predictions for both covariate shifts and label shifts without using labels from the target distribution. 42 Similarly, Zhao et al. (2020) proposed domain- invariant representation learning: an unsupervised domain adaptation technique that can learn data representations invariant to changing distributions. 43 However, this area of research is heavily underexplored and hasn’t found wide adoption in industry. (Location 6451)
  • The third approach is what is usually done in the industry today: retrain your model using the labeled data from the target distribution. However, retraining your model is not so straightforward. Retraining can mean retraining your model from scratch on both the old and new data or continuing training the existing model on new data. The latter approach is also called fine- tuning. (Location 6458)
  • If you want to retrain your model, there are two questions. First, whether to train your model from scratch (stateless retraining) or continue training it from the last checkpoint (stateful training). Second, what data to use: data from the last 24 hours, last week, last 6 months, or from the point when data has started to drift. You might need to run experiments to figure out which retraining strategy works best for you. (Location 6461)
  • Readers familiar with data shift literature might often see data shifts mentioned along with domain adaptation and transfer learning. If you consider a distribution to be a domain, then the question of how to adapt your model to new distributions is similar to the question of how to adapt your model to different domains. Similarly, if you consider learning a joint distribution P( X, Y) as a task, then adapting a model trained on one joint distribution for another joint distribution can be framed as a form of transfer learning. (Location 6466)
  • transfer learning refers to the family of methods where a model developed for a task is reused as the starting point for a model on a second task. The difference is that with transfer learning, you don’t retrain the base model from scratch for the second task. However, to adapt your model to a new distribution, you might need to retrain your model from scratch. (Location 6471)
  • Addressing data distribution shifts doesn’t have to start after the shifts have happened. It’s possible to design your system to make it more robust to shifts. A system uses multiple features, and different features shift at different rates. Consider that you’re building a model to predict whether a user will download an app. You might be tempted to use that app’s ranking in the app store as a feature since higher- ranking apps tend to be downloaded more. However, app ranking changes very quickly. You might want to instead bucket each app’s ranking into general categories such as top 10, between 11 and 100, between 101 and 1,000, between 1,001 and 10,000, and so on. At the same time, an app’s categories might change a lot less frequently, but they might have less power to predict whether a user will download that app. When choosing features for your models, you might want to consider the trade- off between the performance and the stability of a feature: a feature might be really good for accuracy but deteriorate quickly, forcing you to train your model more often. (Location 6474)
  • You might also want to design your system to make it easier for it to adapt to shifts. For example, housing prices might change a lot faster in major cities like San Francisco than in rural Arizona, so a housing price prediction model serving rural Arizona might need to be updated less frequently than a model serving San Francisco. If you use the same model to serve both markets, you’ll have to use data from both markets to update your model at the rate demanded by San Francisco. However, if you use a separate model for each market, you can update each of them only when necessary. (Location 6481)
  • Monitoring and observability are sometimes used interchangeably, but they are different. Monitoring refers to the act of tracking, measuring, and logging different metrics that can help us determine when something goes wrong. Observability means setting up our system in a way that gives us visibility into our system to help us investigate what went wrong. The process of setting up our system in this way is also called "instrumentation." Examples of instrumentation are adding timers to your functions, counting NaNs in your features, tracking how inputs are transformed through your systems, logging unusual events such as unusually long inputs, etc. Observability is part of monitoring. Without some level of observability, monitoring is impossible. (Location 6494)
  • Monitoring is all about metrics. Because ML systems are software systems, the first class of metrics you’d need to monitor are the operational metrics. These metrics are designed to convey the health of your systems. They are generally divided into three levels: the network the system is run on, the machine the system is run on, and the application that the system runs. Examples of these metrics are latency; throughput; the number of prediction requests your model receives in the last minute, hour, day; the percentage of requests that return with a 2xx code; CPU/ GPU utilization; memory utilization; etc. No matter how good your ML model is, if the system is down, you’re not going to benefit from it. (Location 6501)
  • One of the most important characteristics of a software system in production is availability— how often the system is available to offer reasonable performance to users. This characteristic is measured by uptime, the percentage of time a system is up. The conditions to determine whether a system is up are defined in the service level objectives (SLOs) or service level agreements (SLAs). For example, an SLA may specify that the service is considered to be up if it has a median latency of less than 200 ms and a 99th percentile under 2 s. (Location 6508)
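A tiny sketch of checking that example SLA against a window of measured latencies (the thresholds mirror the example above; the synthetic latencies are placeholders).

```python
import numpy as np

def meets_sla(latencies_ms, median_threshold_ms=200, p99_threshold_ms=2000):
    """True if median latency < 200 ms and the 99th percentile < 2 s."""
    median = np.percentile(latencies_ms, 50)
    p99 = np.percentile(latencies_ms, 99)
    return median < median_threshold_ms and p99 < p99_threshold_ms

# Hypothetical latencies (ms) collected over one monitoring window.
window = np.random.default_rng(0).lognormal(mean=4.5, sigma=0.6, size=10_000)
print(meets_sla(window))
```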
  • for ML systems, the system health extends beyond the system uptime. If your ML system is up but its predictions are garbage, your users aren’t going to be happy. Another class of metrics you’d want to monitor are ML- specific metrics that tell you the health of your ML models. (Location 6517)
  • Within ML- specific metrics, there are generally four artifacts to monitor: a model’s accuracy- related metrics, predictions, features, and raw inputs. These are artifacts generated at four different stages of an ML system pipeline, (Location 6520)
  • The deeper into the pipeline an artifact is, the more transformations it has gone through, which makes a change in that artifact more likely to be caused by errors in one of those transformations. However, the more transformations an artifact has gone through, the more structured it’s become and the closer it is to the metrics you actually care about, which makes it easier to monitor. (Location 6522)
  • Figure 8- 5. The more transformations an artifact has gone through, the more likely its changes are to be caused by errors in one of those transformations (Location 6526)
  • If your system receives any type of user feedback for the predictions it makes— click, hide, purchase, upvote, downvote, favorite, bookmark, share, etc.— you should definitely log and track it. Some feedback can be used to infer natural labels, which can then be used to calculate your model’s accuracy- related metrics. Accuracy- related metrics are the most direct metrics to help you decide whether a model’s performance has degraded. (Location 6529)
  • Even if the feedback can’t be used to infer natural labels directly, it can be used to detect changes in your ML model’s performance. For example, when you’re building a system to recommend to users what videos to watch next on YouTube, you want to track not only whether the users click on a recommended video (click- through rate), but also the duration of time users spend on that video and whether they complete watching it (completion rate). If, over time, the click- through rate remains the same but the completion rate drops, it might mean that your recommender system is getting worse. (Location 6536)
  • Google Translate has the option for users to upvote or downvote a translation, as shown in Figure 8- 6. If the number of downvotes the system receives suddenly goes up, there might be issues. These downvotes can also be used to guide the labeling process, such as getting human experts to generate new translations for the samples with downvotes, to train the next iteration of their models. (Location 6541)
  • You can monitor predictions for distribution shifts. Because predictions are low dimensional, it’s also easier to compute two- sample tests to detect whether the prediction distribution has shifted. Prediction distribution shifts are also a proxy for input distribution shifts. Assuming that the function that maps from input to output doesn’t change— the weights and biases of your model haven’t changed— then a change in the prediction distribution generally indicates a change in the underlying input distribution. (Location 6555)
  • Changes in accuracy- related metrics might not become obvious for days or weeks, whereas a model predicting all False for 10 minutes can be detected immediately. (Location 6561)
  • Feature monitoring focuses on tracking changes in features, both the features that a model uses as inputs and the intermediate transformations from raw inputs into final features. Feature monitoring is appealing because compared to raw input data, features are well structured following a predefined schema. The first step of feature monitoring is feature validation: ensuring that your features follow an expected schema. The expected schemas are usually generated from training data or from common sense. If these expectations are violated in production, there might be a shift in the underlying distribution. (Location 6570)
  • Because features are often organized into tables— each column representing a feature and each row representing a data sample— feature validation is also known as table testing or table validation. Some call them unit tests for data. There are many open source libraries that help you do basic feature validation, and the two most common are Great Expectations and Deequ, which is by AWS. (Location 6578)
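Libraries like Great Expectations and Deequ provide this kind of checking out of the box; the hand-rolled pandas sketch below just illustrates the idea of validating a batch of features against an expected schema (the schema and the batch are invented).

```python
import pandas as pd

# Expected schema, typically derived from training data or common sense.
EXPECTED_SCHEMA = {
    "age":    {"dtype": "int64",   "min": 0,   "max": 120,  "nullable": False},
    "income": {"dtype": "float64", "min": 0.0, "max": None, "nullable": True},
}

def validate_features(df: pd.DataFrame) -> list:
    """Return a list of schema violations found in a batch of feature rows."""
    violations = []
    for col, rules in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            violations.append(f"missing column: {col}")
            continue
        values = df[col]
        if str(values.dtype) != rules["dtype"]:
            violations.append(f"{col}: dtype {values.dtype}, expected {rules['dtype']}")
        if not rules["nullable"] and values.isna().any():
            violations.append(f"{col}: unexpected nulls")
        if rules["min"] is not None and (values.dropna() < rules["min"]).any():
            violations.append(f"{col}: values below {rules['min']}")
        if rules["max"] is not None and (values.dropna() > rules["max"]).any():
            violations.append(f"{col}: values above {rules['max']}")
    return violations

batch = pd.DataFrame({"age": [34, 150, 28], "income": [52_000.0, None, 71_500.0]})
print(validate_features(batch))   # flags age=150 as out of range
```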
  • Beyond basic feature validation, you can also use two- sample tests to detect whether the underlying distribution of a feature or a set of features has shifted. Since a feature or a set of features can be high- dimensional, you might need to reduce their dimension before performing the test on them, which can make the test less effective. (Location 6586)
  • A company might have hundreds of models in production, and each model uses hundreds, if not thousands, of features. Even something as simple as computing summary statistics for all these features every hour can be expensive, not only in terms of compute required but also memory used. Tracking, i.e., constantly computing, too many metrics can also slow down your system and increase both the latency that your users experience and the time it takes for you to detect anomalies in your system. (Location 6589)
  • While tracking features is useful for debugging purposes, it’s not very useful for detecting model performance degradation. In theory, a small distribution shift can cause catastrophic failure, but in practice, an individual feature’s minor changes might not harm the model’s performance at all. Feature distributions shift all the time, and most of these changes are benign. 48 If you want to be alerted whenever a feature seems to have drifted, you might soon be overwhelmed by alerts and realize that most of these alerts are false positives. This can cause a phenomenon called “alert fatigue” where the monitoring team stops paying attention to the alerts because they are so frequent. The problem of feature monitoring becomes the problem of trying to decide which feature shifts are critical and which are not. (Location 6593)
  • Feature extraction is often done in multiple steps (such as filling missing values and standardization), using multiple libraries (such as pandas, Spark), on multiple services (such as BigQuery or Snowflake). You might have a relational database as an input to the feature extraction process and a NumPy array as the output. Even if you detect a harmful change in a feature, it might be impossible to detect whether this change is caused by a change in the underlying input distribution or whether it’s caused by an error in one of the multiple processing steps. (Location 6600)
  • The schema that your features follow can change over time. If you don’t have a way to version your schemas and map each of your features to its expected schema, the cause of the reported alert might be due to the mismatched schema rather than a change in the data. (Location 6605)
  • What if we monitor the raw inputs before they are processed? The raw input data might not be easier to monitor, as it can come from multiple sources in different formats, following multiple structures. The way many ML workflows are set up today also makes it impossible for ML engineers to get direct access to raw input data, as the raw input data is often managed by a data platform team who processes and moves the data to a location like a data warehouse, and the ML engineers can only query for data from that data warehouse where the data is already partially processed. Monitoring raw inputs is therefore often the responsibility of the data platform team, not the data science or ML team, and it's out of scope for this book. (Location 6619)
  • Measuring, tracking, and interpreting metrics for complex systems is a nontrivial task, and engineers rely on a set of tools to help them do so. It’s common for the industry to herald metrics, logs, and traces as the three pillars of monitoring. However, I find their differentiations murky. They seem to be generated from the perspective of people who develop monitoring systems: traces are a form of logs and metrics can be computed from logs. (Location 6627)
  • Traditional software systems rely on logs to record events produced at runtime. An event is anything that can be of interest to the system developers, either at the time the event happens or later for debugging and analysis purposes. Examples of events are when a container starts, the amount of memory it takes, when a function is called, when that function finishes running, the other functions that this function calls, the input and output of that function, etc. Also, don’t forget to log crashes, stack traces, error codes, and more. In the words of Ian Malpass at Etsy, “If it moves, we track it.” 49 They also track things that haven’t changed yet, in case they’ll move later. (Location 6632)
  • The number of logs can grow very large very quickly. For example, back in 2019, the dating app Badoo was handling 20 billion events a day. 50 When something goes wrong, you’ll need to query your logs for the sequence of events that caused it, a process that can feel like searching for a needle in a haystack. (Location 6640)
  • A system might consist of many different components: containers, schedulers, microservices, polyglot persistence, mesh routing, ephemeral auto-scaling instances, serverless Lambda functions. A request may do 20-30 hops from when it's sent until when a response is received. The hard part might not be in detecting when something happened, but where the problem was. (Location 6644)
  • When we log an event, we want to make it as easy as possible for us to find it later. This practice with microservice architecture is called distributed tracing. We want to give each process a unique ID so that, when something goes wrong, the error message will (hopefully) contain that ID. This allows us to search for the log messages associated with it. We also want to record with each event all the metadata necessary: the time when it happens, the service where it happens, the function that is called, the user associated with the process, if any, etc. (Location 6648)
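A minimal sketch of that practice (the service name, event names, and metadata fields are all invented): give every request a unique trace ID and attach it, plus metadata, to each event it produces.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("prediction-service")

def log_event(event, trace_id, **metadata):
    """Emit one structured log line; the shared trace_id lets us later pull
    every event belonging to the same request, across services."""
    logger.info(json.dumps({
        "event": event,
        "trace_id": trace_id,
        "timestamp": time.time(),
        "service": "ranking",
        **metadata,
    }))

# One trace ID per request, attached to every event that request produces.
trace_id = str(uuid.uuid4())
log_event("feature_lookup_started", trace_id, user_id="u-123")
log_event("model_prediction_made", trace_id, model_version="v42", latency_ms=37)
```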
  • Analyzing billions of logged events manually is futile, so many companies use ML to analyze logs. An example use case of ML in log analysis is anomaly detection: to detect abnormal events in your system. A more sophisticated model might even classify each event in terms of its priorities such as usual, abnormal, exception, error, and fatal. (Location 6655)
  • Many companies process logs in batch processes. In this scenario, you collect a large number of logs, then periodically query over them looking for specific events using SQL or process them using a batch process like in a Spark or Hadoop or Hive cluster. This makes the processing of logs efficient because you can leverage distributed and MapReduce processes to increase your processing throughput. However, because you process your logs periodically, you can only discover problems periodically. To discover anomalies in your logs as soon as they happen, you want to process your events as soon as they are logged. This makes log processing a stream processing problem. 53 You can use real- time transport such as Kafka or Amazon Kinesis to transport events as they are logged. To search for events with specific characteristics in real time, you can leverage a streaming SQL engine like KSQL or Flink SQL. (Location 6659)
  • Dashboards to visualize metrics are critical for monitoring. Another use of dashboards is to make monitoring accessible to nonengineers. Monitoring isn’t just for the developers of a system, but also for nonengineering stakeholders including product managers and business developers. (Location 6672)
  • Excessive metrics on a dashboard can also be counterproductive, a phenomenon known as dashboard rot. It’s important to pick the right metrics or abstract out lower- level metrics to compute higher- level signals that make better sense for your specific tasks. (Location 6681)
  • An alert policy describes the condition for an alert. You might want to create an alert when a metric breaches a threshold, optionally over a certain duration. For example, you might want to be notified when a model's accuracy is under 90%, or when the HTTP response latency is higher than a second for at least 10 minutes. (Location 6686)
  • Notification channels describe who is to be notified when the condition is met. The alerts will be shown in the monitoring service you employ, such as Amazon CloudWatch or GCP Cloud Monitoring, but you also want to reach responsible people when they're not on these monitoring services. For example, you might configure your alerts to be sent to an email address such as mlops-monitoring@[your company email domain], to post to a Slack channel such as #mlops-monitoring, or to PagerDuty. (Location 6690)
  • A description of the alert helps the alerted person understand what's going on. The description should be as detailed as possible, such as: "Recommender model accuracy below 90% {service-name}". Depending on the audience of the alert, it's often necessary to make the alert actionable by providing mitigation instructions or a runbook, a compilation of routine procedures and operations that might help with handling the alert. (Location 6693)
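A toy sketch of such an alert policy in plain Python (the thresholds, check cadence, and notification target are placeholders): fire only when the metric breaches its threshold for several consecutive checks, which also helps against alert fatigue.

```python
from collections import deque

class ThresholdAlert:
    """Fire when a metric breaches its threshold for N consecutive checks,
    e.g., accuracy below 90% sustained over a 10-minute window."""

    def __init__(self, name, threshold, consecutive_checks, below=True):
        self.name = name
        self.threshold = threshold
        self.below = below
        self.history = deque(maxlen=consecutive_checks)

    def check(self, value):
        breach = value < self.threshold if self.below else value > self.threshold
        self.history.append(breach)
        return len(self.history) == self.history.maxlen and all(self.history)

accuracy_alert = ThresholdAlert("recommender accuracy below 90%",
                                threshold=0.90, consecutive_checks=3)
for accuracy in [0.93, 0.91, 0.88, 0.87, 0.86]:
    if accuracy_alert.check(accuracy):
        print(f"ALERT: {accuracy_alert.name} -> notify #mlops-monitoring")
```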
  • Alert fatigue is a real phenomenon, as discussed previously in this chapter. Alert fatigue can be demoralizing— nobody likes to be awakened in the middle of the night for something outside of their responsibilities. It’s also dangerous— being exposed to trivial alerts can desensitize people to critical alerts. It’s important to set meaningful conditions so that only critical alerts are sent out. (Location 6699)
  • Monitoring makes no assumption about the relationship between the internal state of a system and its outputs. You monitor the external outputs of the system to figure out when something goes wrong inside the system— there’s no guarantee that the external outputs will help you figure out what goes wrong. (Location 6705)
  • The team has to rely on external outputs of their system to figure out what’s going on internally. Observability is a term used to address this challenge. It’s a concept drawn from control theory, and it refers to bringing “better visibility into understanding the complex behavior of software using [outputs] collected from the system at run time.” (Location 6714)
  • observability makes an assumption stronger than traditional monitoring: that the internal states of a system can be inferred from knowledge of its external outputs. Internal states can be current states, such as “the GPU utilization right now,” and historical states, such as “the average GPU utilization over the last day.” (Location 6724)
  • When something goes wrong with an observable system, we should be able to figure out what went wrong by looking at the system’s logs and metrics without having to ship new code to the system. Observability is about instrumenting your system in a way to ensure that sufficient information about a system’s runtime is collected and analyzed. Monitoring centers around metrics, and metrics are usually aggregated. Observability allows more fine- grain metrics, so that you can know not only when a model’s performance degrades but also for what types of inputs or what subgroups of users or over what period of time the model degrades. (Location 6726)
  • In ML, observability encompasses interpretability. Interpretability helps us understand how an ML model works, and observability helps us understand how the entire ML system, which includes the ML model, works. For example, when a model’s performance degrades over the last hour, being able to interpret which feature contributes the most to all the wrong predictions made over the last hour will help with figuring out what went wrong with the system and how to fix it. (Location 6734)
  • Even though monitoring is a powerful concept, it’s inherently passive. You wait for a shift to happen to detect it. Monitoring helps unearth the problem without correcting it. In the next section, we’ll introduce continual learning, a paradigm that can actively help you update your models to address shifts. (Location 6740)
  • spoiler: continual learning is largely an infrastructural problem. Then we’ll lay out a four- stage plan to make continual learning a reality. (Location 6930)
  • If the model is retrained to adapt to the changing environment, evaluating it on a stationary test set isn’t enough. We’ll cover a seemingly terrifying but necessary concept: test in production. This process is a way to test your systems with live data in production to ensure that your updated model indeed works without catastrophic consequences. (Location 6933)
  • Test in production is complementary to monitoring. If monitoring means passively keeping track of the outputs of whatever model is being used, test in production means proactively choosing which model to produce outputs so that we can evaluate it. The goal of both monitoring and test in production is to understand a model’s performance and figure out when to update it. The goal of continual learning is to safely and efficiently automate the update. All of these concepts allow us to design an ML system that is maintainable and adaptable to changing environments. (Location 6936)
  • When hearing “continual learning,” many people think of the training paradigm where a model updates itself with every incoming sample in production. Very few companies actually do that. (Location 6942)
  • if your model is a neural network, learning with every incoming sample makes it susceptible to catastrophic forgetting. Catastrophic forgetting refers to the tendency of a neural network to completely and abruptly forget previously learned information upon learning new information. (Location 6944)
  • The updated model shouldn’t be deployed until it’s been evaluated. This means that you shouldn’t make changes to the existing model directly. Instead, you create a replica of the existing model and update this replica on new data, and only replace the existing model with the updated replica if the updated replica proves to be better. The existing model is called the champion model, and the updated replica, the challenger. (Location 6952)
  • the term “continual learning” makes people imagine updating models very frequently, such as every 5 or 10 minutes. Many people argue that most companies don’t need to update their models that frequently because of two reasons. First, they don’t have enough traffic (i.e., enough new data) for that retraining schedule to make sense. Second, their models don’t decay that fast. I agree with them. If changing the retraining schedule from a week to a day gives no return and causes more overhead, there’s no need to do it. (Location 6961)
  • continual learning isn’t about the retraining frequency, but the manner in which the model is retrained. Most companies do stateless retraining— the model is trained from scratch each time. Continual learning means also allowing stateful training— the model continues training on new data. 2 Stateful training is also known as fine- tuning or incremental learning. (Location 6966)
  • The difference between stateless retraining and stateful training is visualized in Figure 9- 2. (Location 6975)
  • Stateful training allows you to update your model with less data. Training a model from scratch tends to require a lot more data than fine- tuning the same model. For example, if you retrain your model from scratch, you might need to use all data from the last three months. However, if you fine- tune your model from yesterday’s checkpoint, you only need to use data from the last day. (Location 6979)
  • One beautiful property that is often overlooked is that with stateful training, it might be possible to avoid storing data altogether. In the traditional stateless retraining, a data sample might be reused during multiple training iterations of a model, which means that data needs to be stored. This isn’t always possible, especially for data with strict privacy requirements. In the stateful training paradigm, each model update is trained using only the fresh data, so a data sample is used only once for training, as shown in Figure 9- 2. This means that it’s possible to train your model without having to store data in permanent storage, which helps eliminate many concerns about data privacy. However, this is overlooked because today’s let’s- keep- track- of- everything practice still makes many companies reluctant to throw away data. (Location 6984)
  • Stateful training doesn’t mean no training from scratch. The companies that have most successfully used stateful training also occasionally train their model from scratch on a large amount of data to calibrate it. Alternatively, they might also train their model from scratch in parallel with stateful training and then combine both updated models using techniques such as parameter server. (Location 6991)
  • Continual learning is about setting up infrastructure in a way that allows you, a data scientist or ML engineer, to update your models whenever it is needed, whether from scratch or fine- tuning, and to deploy this update quickly. You might wonder: stateful training sounds cool, but how does this work if I want to add a new feature or another layer to my model? (Location 6997)
  • We must differentiate two types of model updates. Model iteration: a new feature is added to an existing model architecture, or the model architecture is changed. Data iteration: the model architecture and features remain the same, but you refresh this model with new data. As of today, stateful training is mostly applied for data iteration, as changing your model architecture or adding a new feature still requires training the resulting model from scratch. There has been research showing that it might be possible to bypass training from scratch for model iteration by using techniques such as knowledge transfer (Google, 2015) and model surgery (OpenAI, 2019). According to OpenAI, "Surgery transfers trained weights from one network to another after a selection process to determine which sections of the model are unchanged and which must be re-initialized." (Location 7000)
  • Some people use “online learning” to refer to the specific setting where a model learns from each incoming new sample. In this setting, continual learning is a generalization of online learning. (Location 7019)
  • I also use the term "continual learning" instead of "continuous learning." Continuous learning refers to the regime in which your model continuously learns with each incoming sample, whereas with continual learning, the learning is done in a series of batches or micro-batches. (Location 7020)
  • Continuous learning is sometimes used to refer to continuous delivery of ML, which is closely related to continual learning as both help companies to speed up the iteration cycle of their ML models. However, the difference is that “continuous learning,” when used in this sense, is from the DevOps perspective about setting up the pipeline for continuous delivery, whereas “continual learning” is from the ML perspective. Due to the ambiguity of the term “continuous learning,” I hope that the community can stay away from this term altogether. (Location 7022)
  • We discussed that continual learning is about setting up infrastructure so that you can update your models and deploy these changes as fast as you want. But why would you need the ability to update your models as fast as you want? The first use case of continual learning is to combat data distribution shifts, especially when the shifts happen suddenly. (Location 7034)
  • In 2019, Alibaba acquired Data Artisans, the team leading the development of the stream processing framework Apache Flink, for $103 million so that the team could help them adapt Flink for ML use cases. 7 Their flagship use case was making better recommendations on Singles Day, a shopping occasion in China similar to Black Friday in the US. (Location 7047)
  • A huge challenge for ML production today that continual learning can help overcome is the continuous cold start problem. (Location 7050)
  • Continuous cold start is a generalization of the cold start problem, 9 as it can happen not just with new users but also with existing users. For example, it can happen because an existing user switches from a laptop to a mobile phone, and their behavior on a phone is different from their behavior on a laptop. It can happen because users are not logged in— most news sites don’t require readers to log in to read. (Location 7055)
  • If we could make our models adapt to each user within their visiting session, the models would be able to make accurate, relevant predictions to users even on their first visit. TikTok, for example, has successfully applied continual learning to adapt their recommender system to each user within minutes. You download the app and, after a few videos, TikTok's algorithms are able to predict with high accuracy what you want to watch next. 11 I don't think everyone should try to build something as addictive as TikTok, but it's proof that continual learning can unlock powerful predictive potential. (Location 7065)
  • “Why continual learning?” should be rephrased as “why not continual learning?” Continual learning is a superset of batch learning, as it allows you to do everything the traditional batch learning can do. But continual learning also allows you to unlock use cases that batch learning can’t. (Location 7070)
  • Even though continual learning has many use cases and many companies have applied it with great success, continual learning still has many challenges. In this section, we’ll discuss three major challenges: fresh data access, evaluation, and algorithms. (Location 7078)
  • The first challenge is getting fresh data. If you want to update your model every hour, you need new data every hour. Currently, many companies pull new training data from their data warehouses. The speed at which you can pull data from your data warehouses depends on the speed at which this data is deposited into your data warehouses. The speed can be slow, especially if data comes from multiple sources. An alternative is to pull data before it's deposited into data warehouses, e.g., directly from real-time transports such as Kafka and Kinesis that transport data from applications to data warehouses. (Location 7081)
  • Being able to pull fresh data isn’t enough. If your model needs labeled data to update, as most models today do, this data will need to be labeled as well. In many applications, the speed at which a model can be updated is bottlenecked by the speed at which data is labeled. (Location 7091)
  • The best candidates for continual learning are tasks where you can get natural labels with short feedback loops. Examples of these tasks are dynamic pricing (based on estimated demand and availability), estimating time of arrival, stock price prediction, ads click- through prediction, and recommender systems for online content like tweets, songs, short videos, articles, etc. (Location 7093)
  • If you run an ecommerce website, your application might register that at 10:33 p.m., user A clicks on the product with the ID of 32345. Your system needs to look back into the logs to see if this product ID was ever recommended to this user, and if yes, then what query prompted this recommendation, so that your system can match this query to this recommendation and label this recommendation as a good recommendation, as shown in Figure 9-4. Figure 9-4. A simplification of the process of extracting labels from user feedback (Location 7097)
  • The process of looking back into the logs to extract labels is called label computation. It can be quite costly if the number of logs is large. Label computation can be done with batch processing: e.g., waiting for logs to be deposited into data warehouses first before running a batch job to extract all labels from logs at once. (Location 7103)
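As a rough sketch of that label computation (not the book's code), the batch job below joins recommendation logs with click logs to produce labels. The dataframe schemas, column names, and the 30-minute attribution window are all assumptions:

```python
import pandas as pd

# Hypothetical log schemas:
#   rec_logs:   one row per recommendation shown (user_id, item_id, query, shown_at)
#   click_logs: one row per click (user_id, item_id, clicked_at)
def compute_labels(rec_logs: pd.DataFrame, click_logs: pd.DataFrame,
                   window: pd.Timedelta = pd.Timedelta("30min")) -> pd.DataFrame:
    """Batch label computation: a shown recommendation is labeled 1 if the same user
    clicked the same item within `window` of it being shown, else 0."""
    joined = rec_logs.merge(click_logs, on=["user_id", "item_id"], how="left")
    clicked = (
        joined["clicked_at"].notna()
        & (joined["clicked_at"] >= joined["shown_at"])
        & (joined["clicked_at"] <= joined["shown_at"] + window)
    )
    labeled = joined.assign(label=clicked.astype(int))
    # Collapse back to one row per shown recommendation (positive if any qualifying click).
    return labeled.groupby(
        ["user_id", "item_id", "query", "shown_at"], as_index=False
    )["label"].max()
```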
  • If your model’s speed iteration is bottlenecked by labeling speed, it’s also possible to speed up the labeling process by leveraging programmatic labeling tools like Snorkel to generate fast labels with minimal human intervention. It might also be possible to leverage crowdsourced labels to quickly annotate fresh data. (Location 7108)
  • Given that tooling around streaming is still nascent, architecting an efficient streaming- first infrastructure for accessing fresh data and extracting fast labels from real- time transports can be engineering- intensive and costly. The good news is that tooling around streaming is growing fast. Confluent, the platform built on top of Kafka, is a 100 million to develop a streaming SQL database. 15 As tooling around streaming matures, it’ll be much easier and cheaper for companies to develop a streaming- first infrastructure for ML. (Location 7111)
  • The biggest challenge of continual learning isn’t in writing a function to continually update your model— you can do that by writing a script! The biggest challenge is in making sure that this update is good enough to be deployed. (Location 7120)
  • The risks for catastrophic failures amplify with continual learning. First, the more frequently you update your models, the more opportunities there are for updates to fail. Second, continual learning makes your models more susceptible to coordinated manipulation and adversarial attack. (Location 7125)
  • When designing the evaluation pipeline for continual learning, keep in mind that evaluation takes time, which can be another bottleneck for model update frequency. For example, a major online payment company I worked with has an ML system to detect fraudulent transactions. 18 The fraud patterns change quickly, so they'd like to update their system quickly to adapt to the changing patterns. They can't deploy the new model before it's been A/B tested against the current model. However, due to the imbalanced nature of the task— most transactions aren't fraudulent— it takes them approximately two weeks to see enough fraudulent transactions to be able to accurately assess which model is better. 19 Therefore, they can only update their system every two weeks. (Location 7135)
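A back-of-the-envelope sketch of why such imbalanced tasks stretch out evaluation: estimate how many labeled fraud cases, and therefore total transactions, each A/B arm needs before a small difference becomes detectable. The rates, lift, and fraud fraction below are made-up numbers for illustration, not figures from the book:

```python
# Back-of-the-envelope power analysis for an imbalanced A/B test.
# All numbers below are invented for illustration.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.85    # hypothetical: champion catches 85% of fraud cases
improved_rate = 0.88    # hypothetical: challenger catches 88%
fraud_fraction = 0.001  # hypothetical: 1 in 1,000 transactions is fraudulent

effect = proportion_effectsize(improved_rate, baseline_rate)
fraud_cases_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
transactions_per_arm = fraud_cases_per_arm / fraud_fraction
print(f"~{fraud_cases_per_arm:,.0f} labeled fraud cases per arm "
      f"≈ {transactions_per_arm:,.0f} transactions per arm")
```

With these made-up numbers, each arm needs on the order of a thousand labeled fraud cases, i.e., roughly a million transactions, before the comparison is trustworthy, which is why the evaluation window stretches to weeks.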
  • It’s much easier to adapt models like neural networks than matrix- based and tree- based models to the continual learning paradigm. However, there have been algorithms to create tree- based models that can learn from incremental amounts of data, most notably Hoeffding Tree and its variants Hoeffding Window Tree and Hoeffding Adaptive Tree, 21 but their uses aren’t yet widespread. (Location 7156)
  • Stage 1: Manual, stateless retraining In the beginning, the ML team often focuses on developing ML models to solve as many business problems as possible. (Location 7179)
  • Because your team is focusing on developing new models, updating existing models takes a backseat. You update an existing model only when the following two conditions are met: the model’s performance has degraded to the point that it’s doing more harm than good, and your team has time to update it. (Location 7188)
  • The process of updating a model is manual and ad hoc. Someone, usually a data engineer, has to query the data warehouse for new data. Someone else cleans this new data, extracts features from it, retrains that model from scratch on both the old and new data, and then exports the updated model into a binary format. Then someone else takes that binary format and deploys the updated model. Oftentimes, the code encapsulating data, features, and model logic was changed during the retraining process but these changes failed to be replicated to production, causing bugs that are hard to track down. (Location 7191)
  • After a few years, your team has managed to deploy models to solve most of the obvious problems. You have anywhere between 5 and 10 models in production. Your priority is no longer to develop new models, but to maintain and improve existing models. The ad hoc, manual process of updating models mentioned from the previous stage has grown into a pain point too big to be ignored. Your team decides to write a script to automatically execute all the retraining steps. This script is then run periodically using a batch process such as Spark. (Location 7200)
  • Most companies with somewhat mature ML infrastructure are in this stage. Some sophisticated companies run experiments to determine the optimal retraining frequency. However, for most companies in this stage, the retraining frequency is set based on gut feeling— e.g., “once a day seems about right” or “let’s kick off the retraining process each night when we have idle compute.” (Location 7206)
  • When creating scripts to automate the retraining process for your system, you need to take into account that different models in your system might require different retraining schedules. (Location 7209)
  • consider a recommender system that consists of two models: one model to generate embeddings for all products, and another model to rank the relevance of each product given a query. The embedding model might need to be retrained a lot less frequently than the ranking model. Because products’ characteristics don’t change that often, you might be able to get away with retraining your embeddings once a week, 24 whereas your ranking models might need to be retrained once a day. The automating script might get even more complicated if there are dependencies among your models. For example, because the ranking model depends on the embeddings, when the embeddings change, the ranking model should be updated too. (Location 7211)
  • If your company has ML models in production, it's likely that your company already has most of the infrastructure pieces needed for automated retraining. The feasibility of this stage revolves around the feasibility of writing a script to automate your workflow and configure your infrastructure to automatically: pull data; downsample or upsample this data if necessary; extract features; process and/or annotate labels to create training data; kick off the training process; evaluate the newly trained model; and deploy it. How long it will take to write this script depends on many factors, including the script writer's competency. However, in general, the three major factors that will affect the feasibility of this script are: scheduler, data, and model store. (Location 7218)
  • If you don’t already have a scheduler, you’ll need time to set up one. However, if you already have a scheduler such as Airflow or Argo, wiring the scripts together shouldn’t be that hard. (Location 7227)
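To make the retraining steps listed above concrete, here is a minimal sketch of wiring them into a daily pipeline, assuming Airflow (one of the schedulers the book mentions). The DAG name, schedule, and task bodies are placeholders, not the book's code:

```python
# Minimal sketch of the stage-2 retraining workflow as an Airflow DAG.
# The schedule and all task bodies are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def pull_data(): ...            # query the warehouse (or real-time transport) for fresh data
def build_training_set(): ...   # downsample/upsample, extract features, attach labels
def train(): ...                # retrain from scratch (stage 2) or from a checkpoint (stage 3)
def evaluate(): ...             # offline evaluation; raise to fail the run if below thresholds
def deploy(): ...               # push the approved model to the model store / serving layer

with DAG("model_retraining", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    tasks = [
        PythonOperator(task_id=name, python_callable=fn)
        for name, fn in [
            ("pull_data", pull_data),
            ("build_training_set", build_training_set),
            ("train", train),
            ("evaluate", evaluate),
            ("deploy", deploy),
        ]
    ]
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream
```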
  • The second factor is the availability and accessibility of your data. Do you need to gather data yourself into your data warehouse? Will you have to join data from multiple organizations? Do you need to extract a lot of features from scratch? Will you also need to label your data? The more questions you answer yes to, the more time it will take to set up this script. Stefan Krawczyk, ML/ data platform manager at Stitch Fix, commented that he suspects most people’s time might be spent here. (Location 7229)
  • The third factor you’ll need is a model store to automatically version and store all the artifacts needed to reproduce a model. The simplest model store is probably just an S3 bucket that stores serialized blobs of models in some structured manner. However, blob storage like S3 is neither very good at versioning artifacts nor human- readable. You might need a more mature model store like Amazon SageMaker (managed service) and Databricks’ MLflow (open source). (Location 7232)
  • When creating training data from new data to update your model, remember that the new data has already gone through the prediction service. This prediction service has already extracted features from this new data to input into models for predictions. Some companies reuse these extracted features for model retraining, which both saves computation and allows for consistency between prediction and training. This approach is known as “log and wait.” It’s a classic approach to reduce the train- serving skew discussed in Chapter 8 (see the section “Production data differing from training data”). Log and wait isn’t yet a popular approach, but it’s getting more popular. Faire has a great blog post discussing the pros and cons of their “log and wait” approach. (Location 7241)
  • In stage 2, each time you retrain your model, you train it from scratch (stateless retraining). This makes your retraining costly, especially for retraining at a higher frequency. You read the section "Stateless Retraining Versus Stateful Training" and decide that you want to do stateful training— why train on data from the last three months every day when you can continue training using only data from the last day? So in this stage, you reconfigure your automatic updating script so that, when the model update is kicked off, it first locates the previous checkpoint and loads it into memory before continuing training on this checkpoint. (Location 7252)
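A minimal sketch, in PyTorch, of what that reconfigured update step might look like: load the previous checkpoint and continue training on only the fresh data, saving the result as a challenger. `build_model`, the checkpoint layout, and the loss are assumptions, not the book's implementation:

```python
# Sketch of a stateful update step: load the previous checkpoint, continue training
# on fresh data only, and save the result as a challenger for evaluation.
# build_model(), the checkpoint layout, and the loss are assumptions.
import torch
import torch.nn.functional as F

def stateful_update(checkpoint_path: str, fresh_data_loader, epochs: int = 1):
    state = torch.load(checkpoint_path)            # champion's weights and optimizer state
    model = build_model()                          # same architecture as the checkpoint
    model.load_state_dict(state["model"])
    optimizer = torch.optim.Adam(model.parameters())
    optimizer.load_state_dict(state["optimizer"])  # keep optimizer statistics as well

    model.train()
    for _ in range(epochs):
        for features, labels in fresh_data_loader:  # only data since the last update
            optimizer.zero_grad()
            loss = F.cross_entropy(model(features), labels)
            loss.backward()
            optimizer.step()

    torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()},
               checkpoint_path + ".challenger")     # evaluate before promoting to champion
    return model
```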
  • The main thing you need in this stage is a change in mindset: retraining from scratch is such a norm— many companies are so used to data scientists handing off a model to engineers to deploy from scratch each time— that they don't think about setting up their infrastructure to make stateful training possible. (Location 7261)
  • Once you’re committed to stateful training, reconfiguring the updating script is straightforward. The main thing you need at this stage is a way to track your data and model lineage. Imagine you first upload model version 1.0. This model is updated with new data to create model version 1.1, and so on to create model 1.2. Then another model is uploaded and called model version 2.0. This model is updated with new data to create model version 2.1. After a while, you might have model version 3.32, model version 2.11, model version 1.64. You might want to know how these models evolve over time, which model was used as its base model, and which data was used to update it so that you can reproduce and debug it. As far as I know, no existing model store has this model lineage capacity, so you’ll likely have to build the solution in- house. If you want to pull fresh data from the real- time transports instead of from data warehouses, as discussed in the section “Fresh data access challenge”, and your streaming infrastructure isn’t mature enough, you might need to revamp your streaming pipeline. (Location 7264)
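Since the text notes that no existing model store tracks this lineage and you'll likely build it in-house, here is a rough sketch of the kind of record such a homegrown solution might keep. The fields and the `ancestry` helper are illustrative assumptions, not an existing model store's API:

```python
# Sketch of an in-house lineage record for stateful training; the fields and the
# ancestry() helper are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional

@dataclass
class ModelLineageRecord:
    version: str                      # e.g., "2.1"
    base_version: Optional[str]       # e.g., "2.0"; None if trained from scratch
    data_window_start: datetime       # fresh data used for this update
    data_window_end: datetime
    artifact_uri: str                 # where the serialized model lives, e.g., an S3 path
    metrics: Dict[str, float] = field(default_factory=dict)  # eval results for this version
    created_at: datetime = field(default_factory=datetime.utcnow)

def ancestry(records: Dict[str, ModelLineageRecord], version: str) -> List[str]:
    """Walk back from a version to the model it was originally trained from."""
    chain = [version]
    while (base := records[chain[-1]].base_version) is not None:
        chain.append(base)
    return chain
```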
  • At stage 3, your models are still updated based on a fixed schedule set out by developers. Finding the optimal schedule isn’t straightforward and can be situation- dependent. For example, last week, nothing much happened in the market, so your models didn’t decay that fast. However, this week, a lot of events happen, so your models decay much faster and require a much faster retraining schedule. Instead of relying on a fixed schedule, you might want your models to be automatically updated whenever data distributions shift and the model’s performance plummets. (Location 7277)
  • Value of data freshness: The question of how often to update a model becomes a lot easier if we know how much the model performance will improve with updating. (Location 7304)
  • One way to figure out the gain is by training your model on the data from different time windows in the past and evaluating it on the data from today to see how the performance changes. For example, consider that you have data from the year 2020. To measure the value of data freshness, you can experiment with training model version A on the data from January to June 2020, model version B on the data from April to September, and model version C on the data from June to November, then test each of these model versions on the data from December, as shown in Figure 9- 5. The difference in the performance of these versions will give you a sense of the performance gain your model can get from fresher data. (Location 7309)
  • If the model trained on data from a quarter ago is much worse than the model trained on data from a month ago, you know that you shouldn’t wait a quarter to retrain your model. (Location 7314)
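A hedged sketch of the experiment described above: train copies of the model on different historical windows of the 2020 data and evaluate every version on December. `load_data`, `train_model`, and `evaluate` stand in for your own pipeline; the window boundaries mirror the example in the text:

```python
# Sketch of a data-freshness experiment: train on sliding windows of 2020 data and
# evaluate every version on December. load_data, train_model, and evaluate stand in
# for your own pipeline.
import pandas as pd

WINDOWS = [
    ("2020-01-01", "2020-06-30"),  # model version A: oldest window
    ("2020-04-01", "2020-09-30"),  # model version B
    ("2020-06-01", "2020-11-30"),  # model version C: freshest window
]
TEST_RANGE = ("2020-12-01", "2020-12-31")

def value_of_freshness(load_data, train_model, evaluate):
    test_data = load_data(*TEST_RANGE)
    results = {}
    for start, end in WINDOWS:
        model = train_model(load_data(start, end))
        results[f"{start}..{end}"] = evaluate(model, test_data)
    # A large gap between the oldest and freshest windows suggests frequent retraining pays off.
    return pd.Series(results)
```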
  • In 2014, Facebook did a similar experiment for ad click- through- rate prediction and found out that they could reduce the model’s loss by 1% by going from retraining weekly to retraining daily, and this performance gain was significant enough for them to switch their retraining pipeline from weekly to daily. 25 Given that online contents today are so much more diverse and users’ attention online changes much faster, we can imagine that the value of data freshness for ad click- through rate is even higher. (Location 7320)
  • We discussed earlier in this chapter that not all model updates are the same. We differentiated between model iteration (adding a new feature to an existing model architecture or changing the model architecture) and data iteration (same model architecture and features but you refresh this model with new data). You might wonder not only how often to update your model, but also what kind of model updates to perform. In theory, you can do both types of updates, and in practice, you should do both from time to time. (Location 7330)
  • If you find that iterating on your data doesn't give you much performance gain, then you should spend your resources on finding a better model. On the other hand, if finding a better model architecture requires 100X compute for training and gives you a 1% performance gain, whereas updating the same model on data from the last three hours requires only 1X compute and also gives a 1% performance gain, you'll be better off iterating on data. (Location 7339)
  • as your infrastructure matures and the process of updating a model is partially automated and can be done in a matter of hours, if not minutes, the answer to this question is contingent on the answer to the following question: “How much performance gain would I get from fresher data?” It’s important to run experiments to quantify the value of data freshness to your models. (Location 7347)
  • To understand why offline evaluation isn’t enough, let’s go over two major test types for offline evaluation: test splits and backtests. (Location 7356)
  • The first type of model evaluation you might think about is the good old test splits that you can use to evaluate your models offline, as discussed in Chapter 6. These test splits are usually static and have to be static so that you have a trusted benchmark to compare multiple models. It'll be hard to compare the test results of two models if they are tested on different test sets. (Location 7357)
  • if you update the model to adapt to a new data distribution, it’s not sufficient to evaluate this new model on test splits from the old distribution. Assuming that the fresher the data, the more likely it is to come from the current distribution, one idea is to test your model on the most recent data that you have access to. So, after you’ve updated your model on the data from the last day, you might want to test this model on the data from the last hour (assuming that data from the last hour wasn’t included in the data used to update your model). The method of testing a predictive model on data from a specific period of time in the past is known as a backtest. (Location 7360)
  • The question is whether backtests are sufficient to replace static test splits. Not quite. If something went wrong with your data pipeline and some data from the last hour is corrupted, evaluating your model solely on this recent data isn’t sufficient. With backtests, you should still evaluate your model on a static test set that you have extensively studied and (mostly) trust as a form of sanity check. (Location 7365)
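As a small sketch of that combination (a backtest on the freshest slice plus a sanity check on a trusted static split), with `evaluate`, the datasets, and the thresholds as placeholders:

```python
# Sketch of combining a backtest on the freshest slice with a sanity check on a
# trusted static test split; evaluate() and the thresholds are placeholders.
def validate_update(model, static_test_set, recent_window, evaluate,
                    min_static_score=0.90, min_backtest_score=0.85):
    static_score = evaluate(model, static_test_set)   # guards against corrupted fresh data
    backtest_score = evaluate(model, recent_window)   # checks fit to the current distribution
    passed = static_score >= min_static_score and backtest_score >= min_backtest_score
    return passed, {"static": static_score, "backtest": backtest_score}
```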
  • Because data distributions shift, the fact that a model does well on the data from the last hour doesn’t mean that it will continue doing well on the data in the future. The only way to know whether a model will do well in production is to deploy it. This insight led to one seemingly terrifying but necessary concept: test in production. (Location 7368)
  • Shadow deployment might be the safest way to deploy your model or any software update. Shadow deployment works as follows: (1) deploy the candidate model in parallel with the existing model; (2) for each incoming request, route it to both models to make predictions, but only serve the existing model's prediction to the user; (3) log the predictions from the new model for analysis purposes. (Location 7374)
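A minimal sketch of what that looks like inside a prediction service; `log_shadow` and the request shape are assumptions:

```python
# Sketch of shadow deployment inside a prediction service: both models score every
# request, only the champion's prediction is served, and the challenger's output is
# logged for offline comparison. log_shadow() and the request shape are assumptions.
def predict(request, champion, challenger, log_shadow):
    served = champion.predict(request)
    try:
        shadow = challenger.predict(request)              # doubles inference cost
        log_shadow(request_id=request["id"], champion=served, challenger=shadow)
    except Exception as exc:                              # a failing challenger must never break serving
        log_shadow(request_id=request["id"], error=str(exc))
    return served                                         # users only ever see the champion's output
```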
  • Because you don’t serve the new model’s predictions to users until you’ve made sure that the model’s predictions are satisfactory, the risk of this new model doing something funky is low, at least not higher than the existing model. However, this technique isn’t always favorable because it’s expensive. It doubles the number of predictions your system has to generate, which generally means doubling your inference compute cost. (Location 7382)
  • A/ B testing is a way to compare two variants of an object, typically by testing responses to these two variants, and determining which of the two variants is more effective. In our case, we have the existing model as one variant, and the candidate model (the recently updated model) as another variant. We’ll use A/ B testing to determine which model is better according to some predefined metrics. (Location 7386)
  • A/B testing works as follows: (1) deploy the candidate model alongside the existing model; (2) route a percentage of traffic to the new model for predictions and the rest to the existing model. It's common for both variants to serve prediction traffic at the same time. However, there are cases where one model's predictions might affect another model's predictions— e.g., in ride-sharing's dynamic pricing, a model's predicted prices might influence the number of available drivers and riders, which, in turn, influence the other model's predictions. In those cases, you might have to run your variants alternately, e.g., serve model A one day and then serve model B the next day. (3) Monitor and analyze the predictions and user feedback, if any, from both models to determine whether the difference in the two models' performance is statistically significant. (Location 7393)
  • First, A/ B testing consists of a randomized experiment: the traffic routed to each model has to be truly random. If not, the test result will be invalid. For example, if there’s a selection bias in the way traffic is routed to the two models, such as users who are exposed to model A are usually on their phones whereas users exposed to model B are usually on their desktops, then if model A has better accuracy than model B, we can’t tell whether it’s because A is better than B or whether “being on a phone” influences the prediction quality. (Location 7402)
  • The gist here is that if your A/ B test result shows that a model is better than another with statistical significance, you can determine which model is indeed better. To measure statistical significance, A/ B testing uses statistical hypothesis testing such as two- sample tests. (Location 7408)
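As a toy example of such a two-sample test (not from the book), comparing click-through counts from the two variants with a two-proportion z-test; the counts are invented:

```python
# Toy two-proportion z-test for an A/B comparison; the counts are invented.
from statsmodels.stats.proportion import proportions_ztest

clicks = [1_320, 1_450]         # conversions observed for model A and model B
impressions = [50_000, 50_000]  # traffic randomly routed to each model

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
if p_value < 0.05:
    print(f"Difference is statistically significant (p = {p_value:.4f})")
else:
    print(f"Not enough evidence to prefer either model (p = {p_value:.4f})")
```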
  • For readers interested in learning more about A/B testing and other statistical concepts important in ML, I recommend Ron Kohavi's book Trustworthy Online Controlled Experiments (A Practical Guide to A/B Testing) (Cambridge University Press) and Michael Barber's great introduction to statistics for data science (much shorter). (Location 7421)
  • Canary release is a technique to reduce the risk of introducing a new software version in production by slowly rolling out the change to a small subset of users before rolling it out to the entire infrastructure and making it available to everybody. (Location 7429)
  • In the context of ML deployment, canary release works as follows: (1) deploy the candidate model alongside the existing model; the candidate model is called the canary. (2) Route a portion of the traffic to the candidate model. (3) If its performance is satisfactory, increase the traffic to the candidate model; if not, abort the canary and route all the traffic back to the existing model. (4) Stop when either the canary serves all the traffic (the candidate model has replaced the existing model) or when the canary is aborted. (Location 7433)
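A rough sketch of that rollout loop; `set_traffic_split`, `compare_live_metrics`, the ramp schedule, and the soak time are all placeholders:

```python
# Sketch of a canary rollout loop; set_traffic_split(), compare_live_metrics(),
# the ramp schedule, and the soak time are all placeholders.
import time

def canary_rollout(set_traffic_split, compare_live_metrics,
                   ramp=(0.01, 0.05, 0.25, 0.50, 1.00), soak_seconds=3600):
    for fraction in ramp:
        set_traffic_split(candidate=fraction)      # remaining traffic stays on the champion
        time.sleep(soak_seconds)                   # let the canary accumulate evidence
        if not compare_live_metrics():             # candidate performing worse than champion?
            set_traffic_split(candidate=0.0)       # abort: route everything back
            return "aborted"
    return "promoted"                              # the canary now serves all traffic
```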
  • Canary releases can be used to implement A/ B testing due to the similarities in their setups. However, you can do canary analysis without A/ B testing. For example, you don’t have to randomize the traffic to route to each model. A plausible scenario is that you first roll out the candidate model to a less critical market before rolling out to everybody. (Location 7441)
  • Netflix and Google have a great shared blog post on how automated canary analysis is used at their companies. (Location 7444)
  • What if instead of exposing a user to recommendations from a model, we expose that user to recommendations from both models and see which model’s recommendations they will click on? That’s the idea behind interleaving experiments, originally proposed by Thorsten Joachims in 2002 for the problems of search rankings. 29 In experiments, Netflix found that interleaving “reliably identifies the best algorithms with considerably smaller sample size compared to traditional A/ B testing.” 30 (Location 7451)
  • In A/ B testing, core metrics like retention and streaming are measured and compared between the two groups. In interleaving, the two algorithms can be compared by measuring user preferences. Because interleaving can be decided by user preferences, there’s no guarantee that user preference will lead to better core metrics. (Location 7457)
  • Figure 9- 6. An illustration of interleaving versus A/ B testing. Source: Adapted from an image by Parks et al. (Location 7460)
  • When we show recommendations from multiple models to users, it’s important to note that the position of a recommendation influences how likely a user will click on it. For example, users are much more likely to click on the top recommendation than the bottom recommendation. For interleaving to yield valid results, we must ensure that at any given position, a recommendation is equally likely to be generated by A or B. To ensure this, one method we can use is team- draft interleaving, which mimics the drafting process in sports. For each recommendation position, we randomly select A or B with equal probability, and the chosen model picks the top recommendation that hasn’t already been picked. 31 (Location 7462)
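A sketch of team-draft interleaving as described above: for each position, a fair coin flip picks model A or B, and the chosen model contributes its highest-ranked item that hasn't been shown yet. The ranking inputs and credit bookkeeping are assumptions for illustration:

```python
# Sketch of team-draft interleaving: for each position, flip a fair coin to pick model
# A or B, and the chosen model contributes its top item not yet shown.
# A click at position i is credited to credit[i].
import random

def team_draft_interleave(ranking_a, ranking_b, k=10, rng=random.Random()):
    used, interleaved, credit = set(), [], []
    while len(interleaved) < k:
        model, ranking = rng.choice([("A", ranking_a), ("B", ranking_b)])
        candidate = next((item for item in ranking if item not in used), None)
        if candidate is None:                      # this model has nothing left to offer
            other = ranking_b if model == "A" else ranking_a
            model = "B" if model == "A" else "A"
            candidate = next((item for item in other if item not in used), None)
            if candidate is None:
                break                              # both rankings are exhausted
        used.add(candidate)
        interleaved.append(candidate)
        credit.append(model)
    return interleaved, credit
```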
  • Bandits: For those unfamiliar, bandit algorithms originated in gambling. A casino has multiple slot machines with different payouts. A slot machine is also known as a one-armed bandit, hence the name. You don't know which slot machine gives the highest payout. You can experiment over time to find out which slot machine is the best while maximizing your payout. Multi-armed bandits are algorithms that allow you to balance between exploitation (choosing the slot machine that has paid the most in the past) and exploration (choosing other slot machines that may pay off even more). (Location 7475)
  • As of today, the standard method for testing models in production is A/ B testing. With A/ B testing, you randomly route traffic to each model for predictions and measure at the end of your trial which model works better. A/ B testing is stateless: you can route traffic to each model without having to know about their current performance. You can do A/ B testing even with batch prediction. (Location 7482)
  • Bandits allow you to decide how to route traffic to each model for prediction so you can determine the best model while maximizing prediction accuracy for your users. Bandits are stateful: before routing a request to a model, you need to calculate all models' current performance. This requires three things: (1) your model must be able to make online predictions; (2) preferably short feedback loops: you need to get feedback on whether a prediction is good or not. This is usually true for tasks where labels can be determined from users' feedback, like in recommendations— if users click on a recommendation, it's inferred to be good. If the feedback loops are short, you can update the payoff of each model quickly. (3) A mechanism to collect feedback, calculate and keep track of each model's performance, and route prediction requests to different models based on their current performance. (Location 7486)
  • Bandits are well-studied in academia and have been shown to be a lot more data-efficient than A/B testing (in many cases, bandits are even optimal). Bandits require less data to determine which model is the best and, at the same time, reduce opportunity cost as they route traffic to the better model more quickly. See discussions on bandits at LinkedIn, Netflix, Facebook, Dropbox, Zillow, and Stitch Fix. For a more theoretical view, see Chapter 2 of Reinforcement Learning (Sutton and Barto 2020). (Location 7495)
  • Bandit algorithms: Many of the solutions for the multi-armed bandit problem can be used here. The simplest algorithm for exploration is ε-greedy. For a percentage of time, say 90% of the time or ε = 0.9, you route traffic to the model that is currently the best-performing one, and for the other 10% of the time, you route traffic to a random model. This means that for each of the predictions your system generates, 90% of them come from the best-at-that-point-in-time model. Two of the most popular exploration algorithms are Thompson Sampling and Upper Confidence Bound (UCB). Thompson Sampling selects a model with a probability that this model is optimal given the current knowledge. 34 In our case, it means that the algorithm selects the model based on its probability of having a higher value (better performance) than all other models. On the other hand, UCB selects the item with the highest upper confidence bound. 35 We say that UCB implements optimism in the face of uncertainty: it gives an "uncertainty bonus," also called an "exploration bonus," to the items it's uncertain about. (Location 7506)
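A minimal sketch of ε-greedy routing across candidate models, following the convention above where ε = 0.9 is the share of requests sent to the current best model and the remaining 10% explore a random model. The running-mean reward update and the model names are assumptions:

```python
# Sketch of ε-greedy routing across candidate models; the running-mean reward update
# is an assumption for illustration.
import random

class EpsilonGreedyRouter:
    def __init__(self, model_names, epsilon=0.9, rng=random.Random()):
        self.epsilon = epsilon
        self.rng = rng
        self.counts = {m: 0 for m in model_names}
        self.mean_reward = {m: 0.0 for m in model_names}

    def choose(self):
        if self.rng.random() < self.epsilon:
            return max(self.mean_reward, key=self.mean_reward.get)  # exploit the best so far
        return self.rng.choice(list(self.mean_reward))              # explore a random model

    def update(self, model, reward):
        """Call when feedback arrives, e.g., reward = 1 if the prediction was good."""
        self.counts[model] += 1
        self.mean_reward[model] += (reward - self.mean_reward[model]) / self.counts[model]

# Usage: name = router.choose(); serve that model's prediction; later, router.update(name, reward).
```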
  • If bandits for model evaluation are to determine the payout (i.e., prediction accuracy) of each model, contextual bandits are to determine the payout of each action. In the case of recommendations/ ads, an action is an item/ ad to show to users, and the payout is how likely it is a user will click on it. Contextual bandits, like other bandits, are an amazing technique to improve the data efficiency of your model. (Location 7518)
  • Imagine that you’re building a recommender system with 1,000 items to recommend, which makes it a 1,000- arm bandit problem. Each time, you can only recommend the top 10 most relevant items to a user. In bandit terms, you’ll have to choose the best 10 arms. The shown items get user feedback, inferred via whether the user clicks on them. But you won’t get feedback on the other 990 items. This is known as the partial feedback problem, also known as bandit feedback. You can also think of contextual bandits as a classification problem with bandit feedback. (Location 7524)
  • Let’s say that each time a user clicks on an item, this item gets 1 value point. When an item has 0 value points, it could either be because the item has never been shown to a user, or because it’s been shown but not clicked on. You want to show users the items with the highest value to them, but if you keep showing users only the items with the most value points, you’ll keep on recommending the same popular items, and the never- before- shown items will keep having 0 value points. (Location 7529)
  • Contextual bandits are algorithms that help you balance between showing users the items they will like and showing the items that you want feedback on. 36 It’s the same exploration– exploitation trade- off that many readers might have encountered in reinforcement learning. Contextual bandits are also called “one- shot” reinforcement learning problems. 37 In reinforcement learning, you might need to take a series of actions before seeing the rewards. In contextual bandits, you can get bandit feedback right away after an action— e.g., after recommending an ad, you get feedback on whether a user has clicked on that recommendation. (Location 7532)
  • Contextual bandits are well researched and have been shown to improve models’ performance significantly (see reports by Twitter and Google). However, contextual bandits are even harder to implement than model bandits, since the exploration strategy depends on the ML model’s architecture (e.g., whether it’s a decision tree or a neural network), which makes it less generalizable across use cases. Readers interested in combining contextual bandits with deep learning should check out a great paper written by a team at Twitter: “Deep Bayesian Bandits: Exploring in Online Personalized Recommendations” (Guo et al. 2020). (Location 7538)
  • In ML, the evaluation process is often owned by data scientists— the same people who developed the model are responsible for evaluating it. Data scientists tend to evaluate their new model ad hoc using the sets of tests that they like. First, this process is imbued with biases— data scientists have contexts about their models that most users don’t, which means they probably won’t use this model in a way most of their users will. Second, the ad hoc nature of the process means that the results might be variable. One data scientist might perform a set of tests and find that model A is better than model B, while another data scientist might report differently. (Location 7546)
  • The lack of a way to ensure models’ quality in production has led to many models failing after being deployed, which, in turn, fuels data scientists’ anxiety when deploying models. To mitigate this issue, it’s important for each team to outline clear pipelines on how models should be evaluated: e.g., the tests to run, the order in which they should run, the thresholds they must pass in order to be promoted to the next stage. Better, these pipelines should be automated and kicked off whenever there’s a new model update. The results should be reported and reviewed, similar to the continuous integration/ continuous deployment (CI/ CD) process for traditional software engineering. It’s crucial to understand that a good evaluation process involves not only what tests to run but also who should run those tests. (Location 7551)
  • Many data scientists have told me that they know the right things to do for their ML systems, but they can’t do them because their infrastructure isn’t set up in a way that enables them to do so. (Location 7686)
  • ML systems are complex. The more complex a system, the more it can benefit from good infrastructure. Infrastructure, when set up right, can help automate processes, reducing the need for specialized knowledge and engineering time. This, in turn, can speed up the development and delivery of ML applications, reduce the surface area for bugs, and enable new use cases. When set up wrong, however, infrastructure is painful to use and expensive to replace. (Location 7687)
  • At the other end of the spectrum, there are companies that work on applications with unique requirements. For example, self-driving cars have unique accuracy and latency requirements— the algorithm must be able to respond within milliseconds and its accuracy must be near-perfect since a wrong prediction can lead to serious accidents. Similarly, Google Search has a unique scale requirement since most companies don't process 63,000 search queries a second, which translates to roughly 227 million search queries an hour, like Google does. 1 These companies will likely need to develop their own highly specialized infrastructure. Google developed a large part of their internal infrastructure for search; so did self-driving car companies like Tesla and Waymo. 2 It's common that part of specialized infrastructure is later made public and adopted by other companies. For example, Google extended their internal cloud infrastructure to the public, resulting in Google Cloud Platform. (Location 7697)
  • In the middle of the spectrum are the majority of companies, those who use ML for multiple common applications— a fraud detection model, a price optimization model, a churn prediction model, a recommender system, etc.— at reasonable scale. “Reasonable scale” refers to companies that work with data in the order of gigabytes and terabytes, instead of petabytes, a day. Their data science team might range from 10 to hundreds of engineers. 3 This category might include any company from a 20- person startup to a company at Zillow’s scale, but not at FAAAM scale. 4 For example, back in 2018, Uber was adding tens of terabytes of data a day to their data lake, and Zillow’s biggest dataset was bringing in 2 terabytes of uncompressed data a day. 5 In contrast, even back in 2014, Facebook was generating 4 petabytes of data a day. 6 (Location 7706)
  • Companies in the middle of the spectrum will likely benefit from generalized ML infrastructure that is being increasingly standardized (see Figure 10- 1). In this book, we’ll focus on the infrastructure for the vast majority of ML applications at a reasonable scale. (Location 7716)
  • Storage and compute: The storage layer is where data is collected and stored. The compute layer provides the compute needed to run your ML workloads such as training a model, computing features, generating features, etc. Resource management: Resource management comprises tools to schedule and orchestrate your workloads to make the most out of your available compute resources. Examples of tools in this category include Airflow, Kubeflow, and Metaflow. ML platform: This provides tools to aid the development of ML applications such as model stores, feature stores, and monitoring tools. Examples of tools in this category include SageMaker and MLflow. Development environment: This is usually referred to as the dev environment; it is where code is written and experiments are run. Code needs to be versioned and tested. Experiments need to be tracked. (Location 7729)
  • Data and compute are the essential resources needed for any ML project, and thus the storage and compute layer forms the infrastructural foundation for any company that wants to apply ML. This layer is also the most abstract to a data scientist. We’ll discuss this layer first because these resources are the easiest to explain. (Location 7745)
  • ML systems work with a lot of data, and this data needs to be stored somewhere. The storage layer is where data is collected and stored. At its simplest form, the storage layer can be a hard disk drive (HDD) or a solid state drive (SSD). The storage layer can be in one place, e.g., you might have all your data in Amazon S3 or in Snowflake, or spread out over multiple locations. 8 Your storage layer can be on-prem in a private data center or on the cloud. (Location 7761)
  • The compute layer refers to all the compute resources a company has access to and the mechanism to determine how these resources can be used. The amount of compute resources available determines the scalability of your workloads. You can think of the compute layer as the engine to execute your jobs. At its simplest form, the compute layer can just be a single CPU core or a GPU core that does all your computation. Its most common form is cloud compute managed by a cloud provider such as AWS Elastic Compute Cloud (EC2) or GCP. (Location 7772)
  • The compute layer can usually be sliced into smaller compute units to be used concurrently. For example, a CPU core might support two concurrent threads; each thread is used as a compute unit to execute its own job. Or multiple CPU cores might be joined together to form a larger compute unit to execute a larger job. A compute unit can be created for a specific short- lived job such as an AWS Step Function or a GCP Cloud Run— the unit will be eliminated after the job finishes. A compute unit can also be created to be more “permanent,” aka without being tied to a job, like a virtual machine. A more permanent compute unit is sometimes called an “instance.” (Location 7778)
  • the compute layer doesn’t always use threads or cores as compute units. There are compute layers that abstract away the notions of cores and use other units of computation. For example, computation engines like Spark and Ray use “job” as their unit, and Kubernetes uses “pod,” a wrapper around containers, as its smallest deployable unit. While you can have multiple containers in a pod, you can’t independently start or stop different containers in the same pod. (Location 7785)
  • To execute a job, you first need to load the required data into your compute unit’s memory, then execute the required operations— addition, multiplication, division, convolution, etc.— on that data. For example, to add two arrays, you will first need to load these two arrays into memory, and then perform addition on the two arrays. If the compute unit doesn’t have enough memory to load these two arrays, the operation will be impossible without an algorithm to handle out- of- memory computation. Therefore, a compute unit is mainly characterized by two metrics: how much memory it has and how fast it runs an operation. (Location 7789)
  • The memory metric can be specified using units like GB, and it’s generally straightforward to evaluate: a compute unit with 8 GB of memory can handle more data in memory than a compute unit with only 2 GB, and it is generally more expensive. (Location 7793)
  • Some companies care not only how much memory a compute unit has but also how fast it is to load data in and out of memory, so some cloud providers advertise their instances as having “high bandwidth memory” or specify their instances’ I/ O bandwidth. (Location 7796)
  • The operation speed is more contentious. The most common metric is FLOPS— floating point operations per second. As the name suggests, this metric denotes the number of float point operations a compute unit can run per second. (Location 7798)
  • The ratio of the number of FLOPs a job can run to the number of FLOPs a compute unit is capable of handling is called utilization. 12 If an instance is capable of doing a million FLOPs and your job runs with 0.3 million FLOPs, that's a 30% utilization rate. Of course, you'd want to have your utilization rate as high as possible. However, it's near impossible to achieve 100% utilization rate. Depending on the hardware backend and the application, a utilization rate of 50% might be considered good or bad. Utilization also depends on how fast you can load data into memory to perform the next operations— hence the importance of I/O bandwidth. (Location 7808)
  • Because thinking about FLOPS is not very useful, to make things easier, when evaluating compute performance, many people just look into the number of cores a compute unit has. So you might use an instance with 4 CPU cores and 8 GB of memory. Keep in mind that AWS uses the concept of vCPU, which stands for virtual CPU and which, for practical purposes, can be thought of as half a physical core. 14 (Location 7818)
  • Like data storage, the compute layer is largely commoditized. This means that instead of setting up their own data centers for storage and compute, companies can pay cloud providers like AWS and Azure for the exact amount of compute they use. Cloud compute makes it extremely easy for companies to start building without having to worry about the compute layer. It’s especially appealing to companies that have variable- sized workloads. Imagine if your workloads need 1,000 CPU cores one day of the year and only 10 CPU cores the rest of the year. If you build your own data centers, you’ll need to pay for 1,000 CPU cores up front. With cloud compute, you only need to pay for 1,000 CPU cores one day of the year and 10 CPU cores the rest of the year. (Location 7826)
  • This is especially useful in ML as data science workloads are bursty. Data scientists tend to run experiments a lot for a few weeks during development, which requires a surge of compute power. Later on, during production, the workload is more consistent. (Location 7840)
  • Keep in mind that cloud compute is elastic but not magical. It doesn’t actually have infinite compute. Most cloud providers offer limits on the compute resources you can use at a time. Some, but not all, of these limits can be raised through petitions. (Location 7842)
  • While leveraging the cloud tends to give companies higher returns than building their own storage and compute layers early on, this becomes less defensible as a company grows. Based on disclosed cloud infrastructure spending by public software companies, the venture capital firm a16z shows that cloud spending accounts for approximately 50% of these companies' cost of revenue. (Location 7857)
  • The high cost of the cloud has prompted companies to start moving their workloads back to their own data centers, a process called “cloud repatriation.” (Location 7860)
  • While getting started with the cloud is easy, moving away from the cloud is hard. Cloud repatriation requires nontrivial up- front investment in both commodities and engineering effort. More and more companies are following a hybrid approach: keeping most of their workloads on the cloud but slowly increasing their investment in data centers. (Location 7867)
  • Another way for companies to reduce their dependence on any single cloud provider is to follow a multicloud strategy: spreading their workloads on multiple cloud providers. 20 This allows companies to architect their systems so that they can be compatible with multiple clouds, enabling them to leverage the best and most cost- effective technologies available instead of being stuck with the services provided by a single cloud provider, a situation known as vendor lock- in. (Location 7871)
  • A common pattern that I’ve seen for ML workloads is to do training on GCP or Azure, and deployment on AWS. (Location 7878)
  • The multicloud strategy doesn’t usually happen by choice. As Josh Wills, one of our early reviewers, put it: “Nobody in their right mind intends to use multicloud.” It’s incredibly hard to move data and orchestrate workloads across clouds. (Location 7879)
  • Often, multicloud just happens because different parts of the organization operate independently, and each part makes their own cloud decision. It can also happen following an acquisition— the acquired team is already on a cloud different from the host organization, and migrating hasn’t happened yet. (Location 7881)
  • The dev environment is where ML engineers write code, run experiments, and interact with the production environment where champion models are deployed and challenger models evaluated. The dev environment consists of the following components: IDE (integrated development environment), versioning, and CI/ CD. (Location 7895)
  • According to Ville Tuulos in his book Effective Data Science Infrastructure, “you would be surprised to know how many companies have well- tuned, scalable production infrastructure but the question of how the code is developed, debugged, and tested in the first place is solved in an ad- hoc manner.” (Location 7902)
  • He suggested that “if you have time to set up only one piece of infrastructure well, make it the development environment for data scientists.” Because the dev environment is where engineers work, improvements in the dev environment translate directly into improvements in engineering productivity. (Location 7905)
  • The dev environment should be set up to contain all the tools that can make it easier for engineers to do their job. It should also consist of tools for versioning. As of this writing, companies use an ad hoc set of tools to version their ML workflows, such as Git to version control code, DVC to version data, Weights & Biases or Comet.ml to track experiments during development, and MLflow to track artifacts of models when deploying them. (Location 7910)
  • Claypot AI is working on a platform that can help you version and track all your ML workflows in one place. Versioning is important for any software engineering projects, but even more so for ML projects because of both the sheer number of things you can change (code, parameters, the data itself, etc.) and the need to keep track of prior runs to reproduce later on. (Location 7916)
  • The dev environment should also be set up with a CI/ CD test suite to test your code before pushing it to the staging or production environment. Examples of tools to orchestrate your CI/ CD test suite are GitHub Actions and CircleCI. (Location 7919)
  • The IDE is the editor where you write your code. IDEs tend to support multiple programming languages. IDEs can be native apps like VS Code or Vim. IDEs can be browser- based, which means they run in browsers, such as AWS Cloud9. Many data scientists write code not just in IDEs but also in notebooks like Jupyter Notebooks and Google Colab. 23 Notebooks are more than just places to write code. You can include arbitrary artifacts such as images, plots, data in nice tabular formats, etc., which makes notebooks very useful for exploratory data analysis and analyzing model training results. (Location 7923)
  • Notebooks have a nice property: they are stateful— they can retain states after runs. If your program fails halfway through, you can rerun from the failed step instead of having to run the program from the beginning. This is especially helpful when you have to deal with large datasets that might take a long time to load. With notebooks, you only need to load your data once— notebooks can retain this data in memory— instead of having to load it each time you want to run your code. (Location 7933)
  • Note that this statefulness can be a double- edged sword, as it allows you to execute your cells out of order. For example, in a normal script, cell 4 must run after cell 3 and cell 3 must run after cell 2. However, in notebooks, you can run cell 2, 3, then 4 or cell 4, 3, then 2. This makes notebook reproducibility harder unless your notebook comes with an instruction on the order in which to run your cells. This difficulty is captured in a joke by Chris Albon (see Figure 10- 6). (Location 7941)
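To make that reproducibility pitfall concrete, here is a minimal sketch (mine, not the book's) of how hidden notebook state behaves; the variable and its values are made up.

```python
# Imagine each block below is a separate notebook cell.

# Cell 1: set an initial value
learning_rate = 0.01

# Cell 2: scale it for a new experiment
learning_rate = learning_rate * 10   # 0.1 after one execution

# Cell 3: "train" with whatever value is currently in memory
print(f"training with learning_rate={learning_rate}")

# As a script, this always trains with 0.1. In a notebook, re-running Cell 2
# before Cell 3 silently changes the value to 1.0 because the kernel keeps the
# state from previous runs; reproducing a notebook therefore requires knowing
# the exact order (and count) of cell executions.
```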
  • Notebooks are an important tool for data scientists and ML engineers. Some companies have made notebooks the center of their data science infrastructure. In their seminal blog post “Beyond Interactive: Notebook Innovation at Netflix,” Netflix included a list of infrastructure tools that can be used to make notebooks even more powerful. (Location 7948)
  • Papermill: For spawning multiple notebooks with different parameter sets— such as when you want to run different experiments with different sets of parameters and execute them concurrently. It can also help summarize metrics from a collection of notebooks. (Location 7952)
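Papermill's Python API makes this concrete. A hedged sketch follows: `papermill.execute_notebook` is the library's actual entry point, but the notebook names and the `learning_rate` parameter are hypothetical, and the input notebook would need a cell tagged `parameters` for injection to work.

```python
import papermill as pm

# Execute the same notebook once per hyperparameter setting; each run produces
# its own output notebook whose results can be inspected or summarized later.
for lr in [0.001, 0.01, 0.1]:                 # hypothetical parameter sweep
    pm.execute_notebook(
        "train.ipynb",                         # hypothetical input notebook
        f"runs/train_lr_{lr}.ipynb",           # one executed copy per setting
        parameters={"learning_rate": lr},      # injected into the "parameters" cell
    )
```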
  • Commuter: A notebook hub for viewing, finding, and sharing notebooks within an organization. Another interesting project aimed at improving the notebook experience is nbdev, a library on top of Jupyter Notebooks that encourages you to write documentation and tests in the same place. (Location 7955)
  • The first thing about the dev environment is that it should be standardized, if not company- wide, then at least team- wide. We’ll go over a story to understand what it means to have the dev environment standardized and why that is needed. (Location 7963)
  • you can use VS Code installed on your computer and connect the local IDE to the cloud environment using a secure protocol like Secure Shell (SSH). (Location 7990)
  • While it’s generally agreed upon that tools and packages should be standardized, some companies are hesitant to standardize IDEs. Engineers can get emotionally attached to IDEs, and some have gone to great length to defend their IDE of choice, 26 so it’ll be hard forcing everyone to use the same IDE. However, over the years, some IDEs have emerged to be the most popular. Among them, VS Code is a good choice since it allows easy integration with cloud dev instances. (Location 7991)
  • Moving from local dev environments to cloud dev environments has many other benefits. First, it makes IT support so much easier— imagine having to support 1,000 different local machines instead of having to support only one type of cloud instance. Second, it’s convenient for remote work— you can just SSH into your dev environment wherever you go from any computer. Third, cloud dev environments can help with security. For example, if an employee’s laptop is stolen, you can just revoke access to cloud instances from that laptop to prevent the thief from accessing your codebase and proprietary information. Of course, some companies might not be able to move to cloud dev environments also because of security concerns. (Location 8002)
  • The fourth benefit, which I would argue is the biggest benefit for companies that do production on the cloud, is that having your dev environment on the cloud reduces the gap between the dev environment and the production environment. If your production environment is in the cloud, bringing your dev environment to the cloud is only natural. (Location 8007)
  • each of your containers might run on its own host, and this is where Docker Compose is at its limits. Kubernetes (K8s) is a tool for exactly that: managing containers across multiple hosts. K8s creates a network for containers to communicate and share resources. It can help you spin up containers on more instances when you need more compute/memory, as well as shut down containers when you no longer need them, and it helps maintain high availability for your system. (Location 8075)
  • Jeremy Jordan has a great introduction to K8s for readers interested in learning more. However, K8s is not the most data- scientist- friendly tool, and there have been many discussions on how to move data science workloads away from it. (Location 8080)
  • in the cloud world where storage and compute resources are much more elastic, the concern has shifted from how to maximize resource utilization to how to use resources cost- effectively. Adding more resources to an application doesn’t mean decreasing resources for other applications, which significantly simplifies the allocation challenge. Many companies are OK with adding more resources to an application as long as the added cost is justified by the return, e.g., extra revenue or saved engineering time. (Location 8091)
  • In the vast majority of the world, where engineers’ time is more valuable than compute time, companies are OK using more resources if this means it can help their engineers become more productive. This means that it might make sense for companies to invest in automating their workloads. (Location 8095)
  • Running a script at predetermined times is exactly what cron does. This is also all that cron does: run a script at a predetermined time and tell you whether the job succeeds or fails. It doesn’t care about the dependencies between the jobs it runs— you can run job A after job B with cron but you can’t schedule anything complicated like run B if A succeeds and run C if A fails. (Location 8107)
  • Steps in an ML workflow might have complex dependency relationships with each other. For example, an ML workflow might consist of the following steps: (1) pull last week’s data from data warehouses; (2) extract features from this pulled data; (3) train two models, A and B, on the extracted features; (4) compare A and B on the test set; (5) deploy A if A is better, otherwise deploy B. (Location 8111)
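A plain-Python sketch of that workflow (the step functions are placeholders I made up) shows why cron's run-this-script-at-a-fixed-time model isn't enough: the steps depend on each other, and the final step branches on an earlier result.

```python
def pull_last_week_data():
    return ["raw", "records"]                      # placeholder data pull

def extract_features(raw):
    return [len(r) for r in raw]                   # placeholder featurization

def train(name, features):
    return {"name": name, "features": features}    # placeholder training

def evaluate(model):
    return 0.9 if model["name"] == "A" else 0.85   # placeholder test-set metric

def deploy(model):
    print(f"deploying model {model['name']}")

def weekly_workflow():
    raw = pull_last_week_data()        # step 1
    features = extract_features(raw)   # step 2
    model_a = train("A", features)     # step 3 (A and B could run in parallel)
    model_b = train("B", features)
    # Steps 4-5: the deployment decision depends on the comparison result,
    # which is exactly the kind of conditional dependency cron cannot express.
    deploy(model_a if evaluate(model_a) >= evaluate(model_b) else model_b)

weekly_workflow()
```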
  • Schedulers tend to leverage queues to keep track of jobs. Jobs can be queued, prioritized, and allocated resources needed to execute. This means that schedulers need to be aware of the resources available and the resources needed to run each job— the resources needed are either specified as options when you schedule a job or estimated by the scheduler. (Location 8132)
  • If schedulers are concerned with when to run jobs and what resources are needed to run those jobs, orchestrators are concerned with where to get those resources. Schedulers deal with job- type abstractions such as DAGs, priority queues, user- level quotas (i.e., the maximum number of instances a user can use at a given time), etc. Orchestrators deal with lower- level abstractions like machines, instances, clusters, service- level grouping, replication, etc. If the orchestrator notices that there are more jobs than the pool of available instances, it can increase the number of instances in the available instance pool. We say that it “provisions” more computers to handle the workload. Schedulers are often used for periodical jobs, whereas orchestrators are often used for services where you have a long- running server that responds to requests. (Location 8150)
  • The most well- known orchestrator today is undoubtedly Kubernetes, the container orchestrator we discussed in the section “From Dev to Prod: Containers”. K8s can be used on- prem (even on your laptop via minikube). However, I’ve never met anyone who enjoys setting up their own K8s clusters, so most companies use K8s as a hosted service managed by their cloud providers, such as AWS’s Elastic Kubernetes Service (EKS) or Google Kubernetes Engine (GKE). (Location 8158)
  • Many people use schedulers and orchestrators interchangeably because schedulers usually run on top of orchestrators. Schedulers like Slurm and Google’s Borg have some orchestrating capacity, and orchestrators like HashiCorp Nomad and K8s come with some scheduling capacity. (Location 8167)
  • Orchestrators such as HashiCorp Nomad and data science– specific orchestrators including Airflow, Argo, Prefect, and Dagster have their own schedulers. (Location 8174)
  • Readers familiar with workflow management tools aimed especially at data science like Airflow, Argo, Prefect, Kubeflow, Metaflow, etc. might wonder where they fit in this scheduler versus orchestrator discussion. (Location 8178)
  • workflow management tools manage workflows. They generally allow you to specify your workflows as DAGs, similar to the one in Figure 10- 7. A workflow might consist of a featurizing step, a model training step, and an evaluation step. Workflows can be defined using either code (Python) or configuration files (YAML). Each step in a workflow is called a task. (Location 8181)
  • Almost all workflow management tools come with some schedulers, and therefore, you can think of them as schedulers that, instead of focusing on individual jobs, focus on the workflow as a whole. Once a workflow is defined, the underlying scheduler usually works with an orchestrator to allocate resources to run the workflow, as shown in Figure 10-8 (caption: “After a workflow is defined, the tasks in this workflow are scheduled and orchestrated”). (Location 8184)
  • Airflow is one of the earliest workflow orchestrators. It’s an amazing task scheduler that comes with a huge library of operators that makes it easy to use Airflow with different cloud providers, databases, storage options, and so on. Airflow is a champion of the “configuration as code” principle. Its creators believed that data workflows are complex and should be defined using code (Python) instead of YAML or other declarative language. Here’s an example of an Airflow workflow, drawn from the platform’s GitHub repository: (Location 8192)
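The repository example itself isn't captured in this highlight. In its place, here is a minimal hedged sketch of what a Python-defined Airflow DAG looks like; the DAG name, schedule, and task bodies are made up, while `DAG`, `PythonOperator`, and the `>>` dependency operator are Airflow's own constructs.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    print("extracting features")   # placeholder task body

def train_model():
    print("training model")        # placeholder task body

with DAG(
    dag_id="weekly_training",           # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)

    extract >> train   # ">>" declares the dependency: train runs after extract
```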
  • Airflow is monolithic, which means it packages the entire workflow into one container. If two different steps in your workflow have different requirements, you can, in theory, create different containers for them using Airflow’s DockerOperator, but it’s not that easy to do so. (Location 8269)
  • Airflow’s DAGs are not parameterized, which means you can’t pass parameters into your workflows. So if you want to run the same model with different learning rates, you’ll have to create different workflows. (Location 8272)
  • Airflow’s DAGs are static, which means it can’t automatically create new steps at runtime as needed. Imagine you’re reading from a database and you want to create a step to process each record in the database (e.g., to make a prediction), but you don’t know in advance how many records there are in the database. Airflow won’t be able to handle that. (Location 8273)
  • Prefect’s workflows are parameterized and dynamic, a vast improvement compared to Airflow. It also follows the “configuration as code” principle so workflows are defined in Python. However, like Airflow, containerized steps aren’t the first priority of Prefect. You can run each step in a container, but you’ll still have to deal with Dockerfiles and register your docker with your workflows in Prefect. (Location 8279)
  • Argo addresses the container problem. Every step in an Argo workflow is run in its own container. However, Argo’s workflows are defined in YAML, which allows you to define each step and its requirements in the same file. (Location 8284)
  • The main drawback of Argo, other than its messy YAML files, is that it can only run on K8s clusters, which are only available in production. If you want to test the same workflow locally, you’ll have to use minikube to simulate a K8s on your laptop, which can get messy. (Location 8406)
  • Enter Kubeflow and Metaflow, the two tools that aim to help you run the workflow in both dev and prod environments by abstracting away infrastructure boilerplate code usually needed to run Airflow or Argo. They promise to give data scientists access to the full compute power of the prod environment from local notebooks, which effectively allows data scientists to use the same code in both dev and prod environments. (Location 8410)
  • Even though both tools have some scheduling capacity, they are meant to be used with a bona fide scheduler and orchestrator. One component of Kubeflow is Kubeflow Pipelines, which is built on top of Argo, and it’s meant to be used on top of K8s. Metaflow can be used with AWS Batch or K8s. (Location 8417)
  • In Metaflow, you can use a Python decorator @conda to specify the requirements for each step— required libraries, memory and compute requirements— and Metaflow will automatically create a container with all these requirements to execute the step. You save on Dockerfiles or YAML files. (Location 8423)
  • Metaflow allows you to work seamlessly with both dev and prod environments from the same notebook/ script. You can run experiments with small datasets on local machines, and when you’re ready to run with the large dataset on the cloud, simply add @batch decorator to execute it on AWS Batch. You can even run different steps in the same workflow in different environments. For example, if a step requires a small memory footprint, it can run on your local machine. But if the next step requires a large memory footprint, you can just add @batch to execute it on the cloud. (Location 8425)
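A hedged sketch of that pattern using Metaflow's decorators (`@step`, `@conda`, `@batch`); the flow, the library pin, and the resource numbers are made up, and running the `@conda` step requires Metaflow's conda environment support to be enabled.

```python
from metaflow import FlowSpec, step, batch, conda

class TrainingFlow(FlowSpec):          # hypothetical flow

    @step
    def start(self):
        # Small-footprint step: fine to execute on a local machine.
        self.data = list(range(10))
        self.next(self.train)

    @conda(libraries={"scikit-learn": "1.0.2"})   # per-step dependencies
    @batch(memory=16000, cpu=4)                   # execute this step on AWS Batch
    @step
    def train(self):
        # Large-footprint step: runs remotely with the resources declared above.
        self.model = sum(self.data)               # placeholder "training"
        self.next(self.end)

    @step
    def end(self):
        print("model:", self.model)

if __name__ == "__main__":
    TrainingFlow()   # run with: python training_flow.py run
```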
  • his company realized that these same tools could be used by other ML applications, not just recommender systems. They created a new team, the ML platform team, with the goal of providing shared infrastructure across ML applications. (Location 8522)
  • As each company finds uses for ML in more and more applications, there’s more to be gained by leveraging the same set of tools for multiple applications instead of supporting a separate set of tools for each application. This shared set of tools for ML deployment makes up the ML platform. (Location 8526)
  • Because ML platforms are relatively new, what exactly constitutes an ML platform varies from company to company. Even within the same company, it’s an ongoing discussion. Here, I’ll focus on the components that I most often see in ML platforms, which include model development, model store, and feature store. (Location 8528)
  • here are two general aspects you might want to keep in mind: (1) Whether the tool works with your cloud provider or allows you to use it on your own data center: you’ll need to run and serve your models from a compute layer, and usually tools only support integration with a handful of cloud providers. Nobody likes having to adopt a new cloud provider for another tool. (2) Whether it’s open source or a managed service: if it’s open source, you can host it yourself and have to worry less about data security and privacy. However, self-hosting means extra engineering time required to maintain it. If it’s a managed service, your models and likely some of your data will be on its service, which might not work for certain regulations. Some managed services work with virtual private clouds, which allows you to deploy your machines in your own cloud clusters, helping with compliance. (Location 8531)
  • A deployment service can help with both pushing your models and their dependencies to production and exposing your models as endpoints. Since deploying is the name of the game, deployment is the most mature among all ML platform components, and many tools exist for this. All major cloud providers offer tools for deployment: AWS with SageMaker, GCP with Vertex AI, Azure with Azure ML, Alibaba with Machine Learning Studio, and so on. There are also a myriad of startups that offer model deployment tools such as MLflow Models, Seldon, Cortex, Ray Serve, and so on. (Location 8548)
  • When looking into a deployment tool, it’s important to consider how easy it is to do both online prediction and batch prediction with the tool. While it’s usually straightforward to do online prediction at a smaller scale with most deployment services, doing batch prediction is usually trickier. 29 Some tools allow you to batch requests together for online prediction, which is different from batch prediction. Many companies have separate deployment pipelines for online prediction and batch prediction. For example, they might use Seldon for online prediction but leverage Databricks for batch prediction. (Location 8555)
  • to deploy a model, you have to package your model and upload it to a location accessible in production. Model store suggests that it stores models— you can do so by uploading your models to storage like S3. However, it’s not quite that simple. (Location 8567)
  • the outputs the model produces locally are different from the outputs produced in production. Many things could have caused this discrepancy; here are just a few examples: The model being used in production right now is not the same model that she has locally. Perhaps she uploaded the wrong model binary to production? The model being used in production is correct, but the list of features used is wrong. Perhaps she forgot to rebuild the code locally before pushing it to production? The model is correct, the feature list is correct, but the featurization code is outdated. The model is correct, the feature list is correct, the featurization code is correct, but something is wrong with the data processing pipeline. (Location 8573)
  • Many companies have realized that storing the model alone in blob storage isn’t enough. To help with debugging and maintenance, it’s important to track as much information associated with a model as possible. Here are eight types of artifacts that you might want to store. (Location 8582)
  • Model definition: This is the information needed to create the shape of the model, e.g., what loss function it uses. If it’s a neural network, this includes how many hidden layers it has and how many parameters are in each layer. (Location 8585)
  • Model parameters: These are the actual values of the parameters of your model. These values are then combined with the model’s shape to re-create a model that can be used to make predictions. Some frameworks allow you to export both the parameters and the model definition together. (Location 8587)
  • Featurize and predict functions: Given a prediction request, how do you extract features and input these features into the model to get back a prediction? The featurize and predict functions provide the instructions to do so. These functions are usually wrapped in endpoints. (Location 8591)
  • Dependencies: The dependencies (e.g., Python version, Python packages) needed to run your model are usually packaged together into a container. (Location 8593)
  • Data: The data used to train this model might be pointers to the location where the data is stored or the name/version of your data. If you use tools like DVC to version your data, this can be the DVC commit that generated the data. (Location 8595)
  • Model generation code: This is the code that specifies how your model was created, such as: what frameworks it used; how it was trained; the details on how the train/valid/test splits were created; the number of experiments run; the range of hyperparameters considered; and the actual set of hyperparameters the final model used. (Location 8597)
  • Experiment artifacts: These are the artifacts generated during the model development process. (Location 8605)
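Several of these artifacts can be captured with the experiment-tracking tools mentioned earlier in the chapter. Below is a hedged sketch using MLflow's tracking API; the run name, tags, parameters, and file path are made up, and it only illustrates the kind of information worth recording rather than a full model store.

```python
import mlflow

with mlflow.start_run(run_name="churn-model-v3"):            # hypothetical run name
    # Lineage: record pointers to code and data versions, not the data itself.
    mlflow.set_tag("git_commit", "abc1234")                    # hypothetical commit
    mlflow.set_tag("dvc_data_version", "v2022-01-15")          # hypothetical data version

    # Model definition and the hyperparameters the final model actually used.
    mlflow.log_params({"model_type": "xgboost", "max_depth": 6, "learning_rate": 0.1})

    # Experiment artifacts: metrics and files produced during development.
    mlflow.log_metric("val_auc", 0.87)
    mlflow.log_artifact("feature_list.json")                   # hypothetical local file

    # Dependencies and the serialized model itself would be captured when the
    # model is logged, e.g., via the framework-specific log_model helpers.
```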
  • Because of the lack of a good model store solution, companies like Stitch Fix resolve to build their own model store. Figure 10-10 shows the artifacts that Stitch Fix’s model store tracks. When a model is uploaded to their model store, this model comes with the link to the serialized model, the dependencies needed to run the model (Python environment), the Git commit at which the model generation code was created (Git information), tags (to at least specify the team that owns the model), etc. (Location 8623)
  • Figure 10- 10. Artifacts that Stitch Fix’s model store tracks. Source: Adapted from a slide by Stefan Krawczyk for CS 329S (Stanford). (Location 8629)
  • “Feature store” is an increasingly loaded term that can be used by different people to refer to very different things. (Location 8632)
  • At its core, there are three main problems that a feature store can help address: feature management, feature transformation, and feature consistency. A feature store solution might address one or a combination of these problems: (Location 8635)
  • Feature management: A company might have multiple ML models, each model using a lot of features. Back in 2017, Uber had about 10,000 features across teams! 31 It’s often the case that features used for one model can be useful for another model. For example, team A might have a model to predict how likely a user will churn, and team B has a model to predict how likely a free user will convert into a paid user. There are many features that these two models can share. If team A discovers that feature X is super useful, team B might be able to leverage that too. (Location 8637)
  • A feature store can help teams share and discover features, as well as manage roles and sharing settings for each feature. For example, you might not want everyone in the company to have access to sensitive financial information of either the company or its users. In this capacity, a feature store can be thought of as a feature catalog. Examples of tools for feature management are Amundsen (developed at Lyft) and DataHub (developed at LinkedIn). (Location 8643)
  • Feature computation: Feature engineering logic, after being defined, needs to be computed. For example, the feature logic might be: use the average meal preparation time from yesterday. The computation part involves actually looking into your data and computing this average. (Location 8647)
  • A feature store can help with both performing feature computation and storing the results of this computation. In this capacity, a feature store acts like a data warehouse. (Location 8654)
  • This means that feature definitions written in Python during development might need to be converted into the languages used in production. So you have to write the same features twice, once for training and once for inference. First, it’s annoying and time- consuming. Second, it creates extra surface for bugs since one or more features in production might differ from their counterparts in training, causing weird model behaviors. A key selling point of modern feature stores is that they unify the logic for both batch features and streaming features, ensuring the consistency between features during training and features during inference. (Location 8660)
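A low-tech way to see both the problem and what feature stores automate: define the transformation once and reuse it on both paths. The sketch below is illustrative only, with made-up feature logic and record fields.

```python
import datetime as dt

def featurize(order: dict) -> dict:
    """Single source of truth for feature logic, shared by training and serving."""
    return {
        "prep_time_minutes": order["prep_seconds"] / 60,                    # hypothetical feature
        "is_weekend": dt.date.fromisoformat(order["date"]).weekday() >= 5,  # hypothetical feature
    }

def build_training_features(historical_orders: list) -> list:
    # Batch path (training): apply the same function over historical records.
    return [featurize(order) for order in historical_orders]

def build_online_features(request: dict) -> dict:
    # Online path (inference): apply the identical function to a single request.
    return featurize(request)

# If these two paths were implemented separately (Python for training, another
# language or service for serving), they could silently drift apart; this is
# the train/serve skew that feature stores aim to prevent.
```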
  • Some feature stores only manage feature definitions without computing features from data; some feature stores do both. Some feature stores also do feature validation, i.e., detecting when a feature doesn’t conform to a predefined schema, and some feature stores leave that aspect to a monitoring tool. (Location 8666)
  • the most popular open source feature store is Feast. However, Feast’s strength is in batch features, not streaming features. Tecton is a fully managed feature store that promises to be able to handle both batch features and online features, but their actual traction is slow because they require deep integration. Platforms like SageMaker and Databricks also offer their own interpretations of feature stores. Out of 95 companies I surveyed in January 2022, only around 40% of them use a feature store. Out of those who use a feature store, half of them build their own feature store. (Location 8669)
  • The stage your company is at: In the beginning, you might want to leverage vendor solutions to get started as quickly as possible so that you can focus your limited resources on the core offerings of your product. As your use cases grow, however, vendor costs might become exorbitant and it might be cheaper for you to invest in your own solution. (Location 8690)
  • Stefan Krawczyk, manager of the ML platform team at Stitch Fix, explained to me his build versus buy decision: “If it’s something we want to be really good at, we’ll manage that in- house. If not, we’ll use a vendor.” For the vast majority of companies outside the technology sector— e.g., companies in retail, banking, manufacturing— ML infrastructure isn’t their focus, so they tend to bias toward buying. When I talk to these companies, they prefer managed services, even point solutions (e.g., solutions that solve a business problem for them, like a demand forecasting service). For many tech companies where technology is their competitive advantage, and whose strong engineering teams prefer to have control over their stacks, they tend to bias toward building. If they use a managed service, they might prefer that service to be modular and customizable, so that they can plug and play with any component. (Location 8693)
  • The maturity of the available tools: For example, your team might decide that you need a model store, and you’d have preferred to use a vendor, but there’s no vendor mature enough for your needs, so you have to build your own model store, perhaps on top of an open source solution. (Location 8700)
  • Some people think that building is cheaper than buying, which is not necessarily the case. Building means that you’ll have to bring on more engineers to build and maintain your own infrastructure. It can also come with future cost: the cost of innovation. In- house, custom infrastructure makes it hard to adopt new technologies available because of the integration issues. (Location 8709)
  • The build versus buy decisions are complex, highly context- dependent, and likely what heads of infrastructure spend much time mulling over. Erik Bernhardsson, ex- CTO of Better.com, said in a tweet that “one of the most important jobs of a CTO is vendor/ product selection and the importance of this keeps going up rapidly every year since the infrastructure space grows so fast.” (Location 8712)
  • ML systems are probabilistic instead of deterministic. Usually, if you run the same software on the same input twice at different times, you can expect the same result. However, if you run the same ML system twice at different times on the exact same input, you might get different results. 1 Second, due to this probabilistic nature, ML systems’ predictions are mostly correct, and the hard part is we usually don’t know for what inputs the system will be correct! Third, ML systems can also be large and might take an unexpectedly long time to produce a prediction. (Location 8840)
  • These differences mean that ML systems can affect user experience differently, especially for users that have so far been used to traditional software. Due to the relatively new usage of ML in the real world, how ML systems affect user experience is still not well studied. (Location 8845)
  • When using an app or a website, users expect a certain level of consistency. (Location 8849)
  • ML predictions are probabilistic and inconsistent, which means that predictions generated for one user today might be different from what will be generated for the same user the next day, depending on the context of the predictions. For tasks that want to leverage ML to improve users’ experience, the inconsistency in ML predictions can be a hindrance. (Location 8852)
  • consider a case study published by Booking.com in 2020. When you book accommodations on Booking.com, there are about 200 filters you can use to specify your preferences, such as “breakfast included,” “pet friendly,” and “non- smoking rooms.” There are so many filters that it takes time for users to find the filters that they want. The applied ML team at Booking.com wanted to use ML to automatically suggest filters that a user might want, based on the filters they’ve used in a given browsing session. The challenge they encountered is that if their ML model kept suggesting different filters each time, users could get confused, especially if they couldn’t find a filter that they had already applied before. The team resolved this challenge by creating a rule to specify the conditions in which the system must return the same filter recommendations (e.g., when the user has applied a filter) and the conditions in which the system can return new recommendations (e.g., when the user changes their destination). This is known as the consistency– accuracy trade- off, since the recommendations deemed most accurate by the system might not be the recommendations that can provide user consistency. (Location 8855)
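A hedged sketch of that kind of rule; the session fields and model interface are made up and are not Booking.com's actual logic. The idea is simply to keep serving the previous recommendations unless an explicit trigger, such as a changed destination, allows refreshing them.

```python
def recommend_filters(session: dict, model) -> list:
    """Trade a little accuracy for consistency in filter recommendations."""
    # Consistency condition: the user has already applied a filter and is still
    # browsing the same destination, so keep showing the same suggestions.
    if session.get("applied_filters") and not session.get("destination_changed"):
        return session["last_recommendations"]

    # Freshness condition: e.g., the user changed their destination.
    recommendations = model.predict(session)        # hypothetical model interface
    session["last_recommendations"] = recommendations
    return recommendations
```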
  • these mostly correct predictions won’t be very useful if users don’t know how to or can’t correct the responses. Consider the same task of using a language model to generate React code for a web page. The generated code might not work, or if it does, it might not render to a web page that meets the specified requirements. A React engineer might be able to fix this code quickly, but many users of this application might not know React. And this application might attract a lot of users who don’t know React— that’s why they needed this app in the first place! (Location 8881)
  • an approach is to show users multiple resulting predictions for the same input to increase the chance of at least one of them being correct. These predictions should be rendered in a way that even nonexpert users can evaluate them. In this case, given a set of requirements input by users, you can have the model produce multiple snippets of React code. The code snippets are rendered into visual web pages so that nonengineering users can evaluate which one is the best for them. (Location 8886)
  • “human- in- the- loop” AI, as it involves humans to pick the best predictions or to improve on the machine- generated predictions. For readers interested in human- in- the- loop AI, I’d highly recommend Jessy Lin’s “Rethinking Human- AI Interaction”. (Location 8892)
  • Some companies that I’ve worked with use a backup system that is less optimal than the main system but is guaranteed to generate predictions quickly. These systems can be heuristics or simple models. They can even be cached precomputed predictions. This means that you might have a rule that specifies: if the main model takes longer than X milliseconds to generate predictions, use the backup model instead. (Location 8903)
  • SMEs (doctors, lawyers, bankers, farmers, stylists, etc.) are often overlooked in the design of ML systems, but many ML systems wouldn’t work without subject matter expertise. They’re not only users but also developers of ML systems. Most people only think of subject matter expertise during the data labeling phase— e.g., you’d need trained professionals to label whether a CT scan of a lung shows signs of cancer. However, as training ML models becomes an ongoing process in production, labeling and relabeling might also become an ongoing process spanning the entire project lifecycle. An ML system would benefit a lot to have SMEs involved in the rest of the lifecycle, such as problem formulation, feature engineering, error analysis, model evaluation, reranking predictions, and user interface: how to best present results to users and/ or to other parts of the system. (Location 8917)
  • to help SMEs get more involved in the development of ML systems, many companies are building no- code/ low- code platforms that allow people to make changes without writing code. Most of the no- code ML solutions for SMEs are currently at the labeling, quality assurance, and feedback stages, but more platforms are being developed to aid in other critical junctions such as dataset creation and views for investigating issues that require SME input. (Location 8930)
  • Approach 1: Have a separate team to manage production. In this approach, the data science/ML team develops models in the dev environment. Then a separate team, usually the Ops/platform/ML engineering team, productionizes the models in prod. This approach makes hiring easier as it’s easier to hire people with one set of skills instead of people with multiple sets of skills. It might also make life easier for each person involved, as they only have to focus on one concern (e.g., developing models or deploying models). (Location 8940)
  • Communication and coordination overhead: A team can become a blocker for other teams. According to Frederick P. Brooks, “What one programmer can do in one month, two programmers can do in two months.” Debugging challenges: When something fails, you don’t know whether your team’s code or some other team’s code might have caused it. It might not have been because of your company’s code at all. You need cooperation from multiple teams to figure out what’s wrong. Finger-pointing: Even when you’ve figured out what went wrong, each team might think it’s another team’s responsibility to fix it. Narrow context: No one has visibility into the entire process to optimize/improve it. For example, the platform team has ideas on how to improve the infrastructure but they can only act on requests from data scientists, while data scientists don’t have to deal with infrastructure so they have less incentive to proactively make changes to it. (Location 8945)
  • the data science team also has to worry about productionizing models. Data scientists become grumpy unicorns, expected to know everything about the process, and they might end up writing more boilerplate code than data science. (Location 8955)
  • Figure 11- 2. I used to think that a data scientist would need to know all these things (Location 8963)
  • Eugene Yan also wrote about how “data scientists should be more end- to- end.” 2 Eric Colson, Stitch Fix’s chief algorithms officer (who previously was also VP data science and engineering at Netflix), wrote a post on “the power of the full- stack data science generalist and the perils of division of labor through function.” (Location 8965)
  • I love Erik Bernhardsson’s analogy that expecting data scientists to know about infrastructure is like expecting app developers to know about how Linux kernels work. 4 I joined an ML company because I wanted to spend more time with data, not with spinning up AWS instances, writing Dockerfiles, scheduling/ scaling clusters, or debugging YAML configuration files. (Location 8973)
  • For data scientists to own the entire process, we need good tools. In other words, we need good infrastructure. What if we have an abstraction to allow data scientists to own the process end- to- end without having to worry about infrastructure? (Location 8976)
  • According to both Stitch Fix and Netflix, the success of a full- stack data scientist relies on the tools they have. They need tools that “abstract the data scientists from the complexities of containerization, distributed processing, automatic failover, and other advanced computer science concepts.” 5 In Netflix’s model, the specialists— people who originally owned a part of the project— first create tools that automate their parts, as shown in Figure 11- 3. Data scientists can leverage these tools to own their projects end- to- end. (Location 8981)
  • from Abhishek Gupta, founder and principal researcher at the Montreal AI Ethics Institute. His work focuses on applied technical and policy measures to build ethical, safe, and inclusive AI systems. (Location 8996)
  • Responsible AI is the practice of designing, developing, and deploying AI systems with good intention and sufficient awareness to empower users, to engender trust, and to ensure fair and positive impact to society. It consists of areas like fairness, privacy, transparency, and accountability. (Location 9003)
  • NIST Special Publication 1270: Towards a Standard for Identifying and Managing Bias in Artificial Intelligence; ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) publications; Trustworthy ML’s list of recommended resources and fundamental papers for researchers and practitioners who want to learn more about trustworthy ML; Sara Hooker’s awesome slide deck on fairness, security, and governance in machine learning (2022); and Timnit Gebru and Emily Denton’s tutorials on fairness, accountability, transparency, and ethics (2020) (Location 9017)
  • There are other interesting examples of “AI incidents” logged at the AI Incident Database. Keep in mind that while the following two examples and the ones logged at AI Incident Database are the ones that caught attention, there are many more instances of irresponsible AI that happen silently. (Location 9030)
  • Transparency is the first step in building trust in systems. (Location 9085)
  • Any system that operates on the trust of the public should be reviewable by independent experts trusted by the public. (Location 9093)
  • Developers of applications that gather user data must understand that their users might not have the technical know- how and privacy awareness to choose the right privacy settings for themselves, and so developers must proactively work to make the right settings the default, even at the cost of gathering less data. (Location 9143)
  • you know that biases can creep into your system through the entire workflow. (Location 9156)
  • One of the reasons why biases are so hard to combat is that biases can come from any step during a project lifecycle. (Location 9158)
  • Training data: Is the data used for developing your model representative of the data your model will handle in the real world? If not, your model might be biased against the groups of users with less data represented in the training data. (Location 9159)
  • Labeling: If you use human annotators to label your data, how do you measure the quality of these labels? How do you ensure that annotators follow standard guidelines instead of relying on subjective experience to label your data? The more annotators have to rely on their subjective experience, the more room for human biases. (Location 9161)
  • Feature engineering: Does your model use any feature that contains sensitive information? Does your model cause a disparate impact on a subgroup of people? Disparate impact occurs “when a selection process has widely different outcomes for different groups, even as it appears to be neutral.” 23 This can happen when a model’s decision relies on information correlated with legally protected classes (e.g., ethnicity, gender, religious practice) even when this information isn’t used in training the model directly. (Location 9163)
  • To mitigate this potential disparate impact, you might want to use disparate impact remover techniques proposed by Feldman et al. in “Certifying and Removing Disparate Impact” or to use the function DisparateImpactRemover implemented by AI Fairness 360 (AIF360). You can also identify hidden bias in variables (which can then be removed from the training set) using the Infogram method, implemented in H2O. (Location 9169)
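A hedged sketch of the AIF360 usage mentioned above; the toy DataFrame, column names, and repair level are made up, and `DisparateImpactRemover` relies on AIF360's optional preprocessing dependencies being installed.

```python
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.algorithms.preprocessing import DisparateImpactRemover

# Hypothetical tabular data: "sex" is the protected attribute, "label" the outcome.
df = pd.DataFrame({
    "income": [30_000, 52_000, 41_000, 78_000],
    "age":    [25, 38, 29, 45],
    "sex":    [0, 1, 0, 1],
    "label":  [0, 1, 0, 1],
})

dataset = BinaryLabelDataset(
    df=df,
    label_names=["label"],
    protected_attribute_names=["sex"],
)

# Repair the features so they carry less information about the protected
# attribute; repair_level=1.0 is full repair, lower values trade fairness for
# fidelity to the original feature values.
remover = DisparateImpactRemover(repair_level=1.0, sensitive_attribute="sex")
repaired_dataset = remover.fit_transform(dataset)
```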
  • Model’s objective Are you optimizing your model using an objective that enables fairness to all users? For example, are you prioritizing your model’s performance on all users, which skews your model toward the majority group of users? (Location 9174)
  • ML literature makes the unrealistic assumption that optimizing for one property, like model accuracy, holds all others static. People might discuss techniques to improve a model’s fairness with the assumption that this model’s accuracy or latency will remain the same. However, in reality, improving one property can cause other properties to degrade. (Location 9197)
  • differential privacy is “a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. The idea behind differential privacy is that if the effect of making an arbitrary single substitution in the database is small enough, the query result cannot be used to infer much about any single individual, and therefore provides privacy.” (Location 9201)
  • Differential privacy is a popular technique used on training data for ML models. The trade- off here is that the higher the level of privacy that differential privacy can provide, the lower the model’s accuracy. However, this accuracy reduction isn’t equal for all samples. As pointed out by Bagdasaryan and Shmatikov (2019), “the accuracy of differential privacy models drops much more for the underrepresented classes and subgroups.” (Location 9205)
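To make the definition and the trade-off concrete, here is a small sketch (mine, not the book's) of the classic Laplace mechanism applied to releasing a mean: a smaller epsilon gives a stronger privacy guarantee but adds more noise, hence lower accuracy.

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng):
    """Release an epsilon-differentially-private mean of values bounded in [lower, upper]."""
    values = np.clip(values, lower, upper)
    # Sensitivity of the mean: one individual can shift it by at most this much.
    sensitivity = (upper - lower) / len(values)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return values.mean() + noise

rng = np.random.default_rng(0)
ages = np.array([23, 35, 41, 29, 52, 61, 34, 45])   # toy data
for eps in [0.1, 1.0, 10.0]:
    # Smaller epsilon -> stronger privacy -> noisier, less accurate answer.
    print(f"epsilon={eps}: {dp_mean(ages, lower=0, upper=100, epsilon=eps, rng=rng):.2f}")
```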
  • In their 2019 paper, “What Do Compressed Deep Neural Networks Forget?,” Hooker et al. found that “models with radically different numbers of weights have comparable top- line performance metrics but diverge considerably in behavior on a narrow subset of the dataset.” 26 For example, they found that compression techniques amplify algorithmic harm when the protected feature (e.g., sex, race, disability) is in the long tail of the distribution. This means that compression disproportionately impacts underrepresented features. 27 Another important finding from their work is that while all compression techniques they evaluated have a nonuniform impact, not all techniques have the same level of disparate impact. Pruning incurs a far higher disparate impact than is observed for the quantization techniques that they evaluated. (Location 9213)
  • Companies might decide to bypass ethical issues in ML models to save cost and time, only to discover risks in the future when they end up costing a lot more. (Location 9231)
  • The earlier in the development cycle of an ML system that you can start thinking about how this system will affect the life of users and what biases your system might have, the cheaper it will be to address these biases. (Location 9233)
  • Create model cards: Model cards are short documents accompanying trained ML models that provide information on how these models were trained and evaluated. Model cards also disclose the context in which models are intended to be used, as well as their limitations. 30 According to the authors of the model card paper, “The goal of model cards is to standardize ethical practice and reporting by allowing stakeholders to compare candidate models for deployment across not only traditional evaluation metrics but also along the axes of ethical, inclusive, and fair considerations.” (Location 9237)
  • The following list has been adapted from content in the paper “Model Cards for Model Reporting” to show the information you might want to report for your models: 31
    • Model details: Basic information about the model, including the person or organization developing the model; model date; model version; model type; information about training algorithms, parameters, fairness constraints or other applied approaches, and features; paper or other resource for more information; citation details; license; and where to send questions or comments about the model.
    • Intended use: Use cases that were envisioned during development, including primary intended uses, primary intended users, and out-of-scope use cases.
    • Factors: Factors could include demographic or phenotypic groups, environmental conditions, technical attributes, or others; report the relevant factors and the evaluation factors.
    • Metrics: Metrics should be chosen to reflect potential real-world impacts of the model; report model performance measures, decision thresholds, and variation approaches.
    • Evaluation data: Details on the dataset(s) used for the quantitative analyses in the card; report the datasets, motivation, and preprocessing.
    • Training data: May not be possible to provide in practice. When possible, this section should mirror Evaluation Data. If such detail is not possible, minimal allowable information should be provided here, such as details of the distribution over various factors in the training datasets.
    • Quantitative analyses: Unitary results and intersectional results.
    • Ethical considerations
    • Caveats and recommendations (Location 9244)
  • Note that model cards will need to be updated whenever a model is updated. For models that update frequently, this can create quite an overhead for data scientists if model cards are created manually. Therefore, it’s important to have tools to automatically generate model cards, either by leveraging the model card generation feature of tools like TensorFlow, Metaflow, and scikit- learn or by building this feature in- house. (Location 9270)
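A hedged sketch of what automatic generation can look like; this is a plain-Python structure adapted from a few of the paper's fields, not any particular library's model card API, and the populated values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    # A small subset of the fields proposed in "Model Cards for Model Reporting".
    name: str
    version: str
    intended_use: str
    out_of_scope_uses: list = field(default_factory=list)
    metrics: dict = field(default_factory=dict)
    ethical_considerations: str = ""

    def to_markdown(self) -> str:
        return "\n\n".join([
            f"# Model card: {self.name} (v{self.version})",
            f"**Intended use:** {self.intended_use}",
            "**Out-of-scope uses:** " + "; ".join(self.out_of_scope_uses),
            "**Metrics:** " + ", ".join(f"{k}={v}" for k, v in self.metrics.items()),
            f"**Ethical considerations:** {self.ethical_considerations}",
        ])

# In a retraining pipeline, this would be filled in automatically from the
# run's metadata (hypothetical values below) and published alongside the model.
card = ModelCard(
    name="churn-predictor",
    version="2022-01-15",
    intended_use="Rank existing users by churn risk for retention campaigns.",
    out_of_scope_uses=["credit or employment decisions"],
    metrics={"val_auc": 0.87, "val_auc_minority_segment": 0.81},
    ethical_considerations="Performance is lower on the minority segment; see metrics.",
)
print(card.to_markdown())
```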
  • Novel techniques to combat these biases and challenges are actively being developed. It’s important to stay up- to- date with the latest research in responsible AI. You might want to follow the ACM FAccT Conference, the Partnership on AI, the Alan Turing Institute’s Fairness, Transparency, Privacy group, and the AI Now Institute. (Location 9289)
  • Chip Huyen (https:// huyenchip.com) is co- founder and CEO of Claypot AI, developing infrastructure for real- time machine learning. Previously, she was at NVIDIA, Snorkel AI, and Netflix, where she helped some of the world’s largest organizations develop and deploy machine learning systems. When a student at Stanford, she created and taught the course TensorFlow for Deep Learning Research. She is currently teaching CS 329S: Machine Learning Systems Design at Stanford. (Location 10655)