
Metadata

Highlights

  • The application to search ranking is one of the biggest machine learning success stories at Airbnb. Much of the initial gains were driven by a gradient boosted decision tree model. The gains, however, plateaued over time. This paper discusses the work done in applying neural networks in an attempt to break out of that plateau. (View Highlight)
  • Our current discussion is focused on one particular model in this ecosystem. Considered the most complex piece of the ranking puzzle, this model is responsible for ordering the listings available according to the guest’s likelihood of booking. (View Highlight)
  • It is common for guests to do multiple searches, clicking through some of the listings to view their details page. Successful sessions end with the guest booking a listing. The searches performed by a guest and their interactions are logged. While training, a new model has access to the logs of the current and previous models used in production. The new model is trained to learn a scoring function that assigns impressions of the booked listings in the logs at as high a rank as possible, similar to [19]. The new model is then tested online in an A/B testing framework to see if it can achieve a statistically significant increase in conversions compared to the current model. (View Highlight)
  • The home sharing platform at Airbnb is a two sided marketplace for hosts to rent out their spaces, referred to as listings, to be booked by prospective guests from all around the world. A typical booking starts with the guest issuing a search at airbnb.com for homes available in a particular geographic location. The task of search ranking is to respond to the guest with an ordered list of a handful of listings from the thousands available in the inventory. (View Highlight)
  • The very first implementation of search ranking was a manually crafted scoring function. Replacing the manual scoring function with a gradient boosted decision tree (GBDT) model gave one of the largest step improvements in homes bookings in Airbnb’s history, with many successful iterations to follow. The gains in online bookings eventually saturated with long spells of neutral experiments. This made the moment ripe for trying sweeping changes to the system. (View Highlight)
  • Our transition to deep learning wasn’t the result of an atomic move; it was the culmination of a series of iterative refinements. Figure 2a shows the comparative improvements in our principal offline metric NDCG (normalized discounted cumulative gain), where the impression of the booked listing is assigned a relevance of 1, and all other impressions 0 relevance. The x-axis depicts the models along with the time they were launched to production. (View Highlight)
  • Starting with this background, the current paper discusses our experiences in transitioning one of the at-scale search engines on the internet to deep learning. The paper is targeted towards teams that have a machine learning system in place and are starting to think about neural networks (NNs). (View Highlight)
  • The search ranking model under discussion is part of an ecosystem of models, all of which contribute towards deciding which listings to present to the guest. (View Highlight)
  • These include models that predict the likelihood of a host accepting the guest’s request for booking, models that predict the probability the guest will rate the on trip (View Highlight)
  • Andrej Karpathy has advice regarding model architecture: don’t be a hero (View Highlight)
  • (Figure 2a: NDCG gains offline) (View Highlight)
  • The first architecture that we finally managed to get online was a simple single hidden layer NN with 32 fully connected ReLU activations that proved booking neutral against the GBDT model. The NN was fed by the same features as the GBDT model. The training objective for the NN was also kept invariant w.r.t the GBDT model: minimizing the L2 regression loss where booked listings are assigned a utility of 1.0 and listings that are not booked a utility of 0.0. (View Highlight)
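A minimal Keras sketch of such a network, assuming the GBDT-era features arrive as a single dense float vector; NUM_FEATURES and the optimizer choice are placeholders not specified in the highlight.

```python
import tensorflow as tf

NUM_FEATURES = 100  # placeholder for the count of GBDT-era features

# Single hidden layer with 32 fully connected ReLUs, scoring one listing.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(1),  # scalar utility score
])

# L2 regression loss against the 0/1 booking utility described above;
# the optimizer is an assumption, not stated in the highlight.
model.compile(optimizer="adam", loss="mse")
```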
  • The value of the whole exercise was that it validated that the entire NN pipeline was production ready and capable of serving live traffic. (View Highlight)
  • Not being a hero got us off to a start, but not very far. In time we would adapt Karpathy’s advice to: don’t be a hero, in the beginning. Our first breakthrough came when we combined a NN with the idea behind Lambdarank [2]. (View Highlight)
  • Offline we were using NDCG as our principal metric. Lambdarank gave us a way to directly optimize the NN for NDCG. This involved two crucial improvements over the regression based formulation of the simple NN:
    • Moving to a pairwise preference formulation where the listings seen by a booker were used to construct pairs of {booked listing, not-booked listing} as training examples. During training we minimized the cross entropy loss of the score difference between the booked listing and the not-booked listing.
    • Weighing each pairwise loss by the difference in NDCG resulting from swapping the positions of the two listings making up the pair. This prioritized the rank optimization of the booked listing towards the top of the search result list, instead of the bottom. (View Highlight)
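A hedged sketch of the pairwise loss described above, assuming the network has already produced scores for the booked and not-booked listing of each pair and that the |ΔNDCG| weight is precomputed per pair; the function name and TensorFlow framing are illustrative, not the paper's actual code.

```python
import tensorflow as tf

def pairwise_lambdarank_loss(score_booked, score_not_booked, ndcg_delta):
    """Cross entropy on the score difference of each {booked, not-booked}
    pair, weighted by the |NDCG change| from swapping the two positions."""
    logits = score_booked - score_not_booked
    # Target 1: the booked listing should outrank the not-booked one.
    ce = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(logits), logits=logits)
    return tf.reduce_mean(ndcg_delta * ce)
```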
  • The features feeding the DNN were mostly simple properties of the listings such as price, amenities, historical booking count, etc., fed directly with minimal feature engineering. Exceptions include features output from other models:
    • Price of listings that have the Smart Pricing feature enabled, supplied by a specialized model [24].
    • Similarity of the listing to the past views of the user, computed based on co-view embeddings [9].
    These models tap into data that isn’t directly part of the search ranking training examples, providing the DNN with additional information. (Figure 3: NN with GBDT tree nodes and FM prediction as features.) (View Highlight)
  • The narrative of one successful launch followed by another presented in the previous section doesn’t tell the whole story. Reality is studded with unsuccessful attempts that outnumber the successful ones. Retelling every attempt made would be time consuming, so we pick two particularly interesting ones. These models are interesting because they illustrate how some technique that is very effective and popular in the wild may not work as well when brought home. (View Highlight)
  • Two other models were explored alongside the NN:
    • Iterations on the GBDT model with alternative ways to sample searches for constructing the training data.
    • A factorization machine (FM) [14] model that predicted the booking probability of a listing given a query, by mapping listings and queries to a 32-dimensional space.
    These new ways of looking at the search ranking problem revealed something interesting: although the performance of the alternative models on test data was comparable to the NN, the listings up-ranked by them were quite different. Inspired by NN architectures like [23], the new model was an attempt to combine the strengths of all three models. For the FM model we took the final prediction as a feature into the NN. From the GBDT model, we took the index of the leaf node activated per tree as a categorical feature. (View Highlight)
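A rough sketch of how such a combination could be wired up, assuming the FM prediction, the per-tree leaf indices, and the remaining dense features are already materialized per example; all layer sizes and names below are placeholders, not the production architecture.

```python
import tensorflow as tf

# Placeholders: number of GBDT trees, leaves per tree, and dense features.
n_trees, n_leaves, n_dense = 100, 64, 50

leaf_in = tf.keras.Input(shape=(n_trees,), dtype="int32")   # activated leaf index per tree
fm_in = tf.keras.Input(shape=(1,))                          # FM booking prediction as a feature
dense_in = tf.keras.Input(shape=(n_dense,))                 # other listing/query features

# Each tree's leaf index becomes a small categorical embedding.
leaf_emb = tf.keras.layers.Embedding(n_leaves, 4)(leaf_in)  # [batch, n_trees, 4]
leaf_emb = tf.keras.layers.Flatten()(leaf_emb)

x = tf.keras.layers.Concatenate()([leaf_emb, fm_in, dense_in])
x = tf.keras.layers.Dense(64, activation="relu")(x)
score = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model([leaf_in, fm_in, dense_in], score)
```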
  • Each listing at Airbnb has a corresponding unique id. One of the exciting new opportunities unlocked by NNs was to use these listing ids as features. The idea was to use the listing ids as index into an embedding, which allowed us to learn a vector representation per listing, encoding their unique properties. A reason for our excitement was the success other applications had achieved in mapping such high cardinality categorical features to embeddings, such as learning embeddings for words in NLP applications [6], learning embeddings for video and user id in recommender systems [4], etc. (View Highlight)
  • The complexity of the model at this point was staggering, and some of the issues mentioned in [17] began to rear their heads. In our final leap, we were able to deprecate all that complexity by simply scaling the training data 10x and moving to a DNN with 2 hidden layers. Typical configuration of the network: an input layer with a total of 195 features after expanding categorical features to embeddings, feeding the first hidden layer with 127 fully connected ReLUs, and then the second hidden layer with 83 fully connected ReLUs. (View Highlight)
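The quoted configuration expressed as a minimal Keras sketch; the output head and anything not stated in the highlight (optimizer, loss wiring) are assumptions or omitted.

```python
import tensorflow as tf

# 195 inputs (after expanding categorical features to embeddings),
# then hidden layers of 127 and 83 fully connected ReLUs, as quoted above.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(127, activation="relu", input_shape=(195,)),
    tf.keras.layers.Dense(83, activation="relu"),
    tf.keras.layers.Dense(1),  # scalar ranking score (output head assumed)
])
```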
  • In the different variations we tried, listing ids mostly led to overfitting. (View Highlight)
  • (Figure 4: DNN learning curve.) While DNNs have achieved human level performance on certain image applications [21], it is very hard for us to judge where we stand for a similar comparison. Part of the problem is that it’s unclear how to define human level performance. Going through the logs, it’s quite challenging for us to identify which listing was booked. We find no objective truth in the logs, only tradeoffs highly conditional upon the budget and tastes of the guest which remain mostly hidden. Other researchers [10] note the difficulty in using human evaluation even for familiar shopping items. For our application these difficulties are further exacerbated due to the novelty of the inventory. (View Highlight)
  • The idea was that the model would be able to transfer its learnings from long views to predict bookings and avoid overfitting. Since the number of long view labels outnumbered the booking labels by orders of magnitude, a compensating higher weight was applied to the booking loss to preserve focus on the booking objective. The loss for each long view label was further scaled by log(view duration) as proposed in [25]. For scoring listings online, we used the booking prediction only. (View Highlight)
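A minimal sketch of such a two-headed loss, assuming sigmoid cross entropy heads; the booking_weight value and the +1 inside the log are assumptions, since the highlight only specifies a compensating higher booking weight and a log(view duration) scaling.

```python
import tensorflow as tf

def multitask_loss(book_label, book_logit, view_label, view_logit,
                   view_duration, booking_weight=100.0):
    """Booking head gets a compensating higher weight (the value here is a
    placeholder); each long-view example is scaled by log(view duration)."""
    book_loss = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=book_label, logits=book_logit)
    view_loss = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=view_label, logits=view_logit)
    view_loss *= tf.math.log(1.0 + view_duration)   # +1 guards zero durations (assumption)
    return tf.reduce_mean(booking_weight * book_loss + view_loss)
```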
  • The reason why such an established technique fails at Airbnb is because of some unique properties of the underlying marketplace. The embeddings need substantial amounts of data per item to converge to reasonable values. When items can be repeated without constraints, such as online videos or words in a language, there is no limit to the amount of user interaction an item can have. Obtaining large amounts of data for the items of interest is relatively easy. Listings, on the other hand, are subjected to constraints from the physical world. Even the most popular listing can be booked at most 365 times in an entire year. Typical bookings per listing are much fewer. This fundamental limitation generates data that is very sparse at the listing level. The overfitting is a direct fallout of this limitation. (View Highlight)
  • When tested online, the model increased long views by a large margin. But bookings remained neutral. Manually inspecting listings which had a high ratio of long views compared to bookings, we found several possible reasons which could have resulted in this gap. Such long views could be driven by high-end but high priced listings, listings with long descriptions that are difficult to parse, or extremely unique and sometimes humorous listings, among several other reasons. Once again, it is the uniqueness of the Airbnb marketplace where long views are correlated with bookings, but have a large orthogonal component as well that makes predicting bookings based on them challenging. A better understanding of listing views continues to be a topic of research for us. (View Highlight)
  • While bookings have a physical limitation, user views of the listing details pages are not constrained in the same way, and those we have in spades. (View Highlight)
  • Figure 6 shows the distribution of views to bookings ratio for listings, with bookings typically orders of magnitude more sparse than views. Taking a step further, we found long views of listing details pages, unsurprisingly, correlated with bookings. (View Highlight)
  • To tackle the overfitting listing ids, we built a model taking a page out of the multi-task learning playbook [15]. The model simultaneously predicted the probability of booking and long view (View Highlight)
  • The baseline GBDT pipeline we started with had extensive feature engineering. Typical transformations included computing ratios, averaging over windows, and other flavors of composition. The feature engineering tricks had accumulated over years of experimentation. Yet it was unclear if the features were the best they could be, or even up to date with the changing dynamics of the marketplace. (View Highlight)
  • Yet this section is dedicated to feature engineering, because we found that making NNs work efficiently involved a little more than feeding raw data. This flavor of feature engineering is different from the traditional one: instead of deriving the math to perform on the features before feeding them into the model, the focus shifts to ensuring the features comply with certain properties so that the NN can do the math effectively by itself. (View Highlight)
  • In our very first attempt at training a NN, we simply took all the features used to train the GBDT model and fed them to the NN. This went down very badly. The loss would saturate in the middle of training and additional steps would have no effect. We traced the issue to the fact that the features were not normalized properly. (View Highlight)
  • For decision trees, the exact numeric values of the features hardly matter, as long as their relative ordering is meaningful. Neural networks on the other hand are quite sensitive to the numeric values the features take. (View Highlight)
  • Feeding values that are outside the usual range of features can cause large gradients to back propagate. This can permanently shut off activation functions like ReLU due to vanishing gradients [3]. To avoid it we ensure all features are restricted to a small range of values, with the bulk of the distribution in the {-1, 1} interval and the median mapped to 0. (View Highlight)
  • This by and large involves inspecting the features and applying either of two transforms:
    • In case the feature distribution resembles a normal distribution, we transform it by (feature_val − µ)/σ, where µ is the feature mean and σ the standard deviation.
    • If the feature distribution looks closer to a power law distribution, we transform it by log((1 + feature_val)/(1 + median)). (View Highlight)
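A small NumPy sketch of the two transforms; the (1 + median) denominator in the power-law case is inferred from the previous highlight's requirement that the median map to 0, so treat it as a reconstruction rather than a quotation.

```python
import numpy as np

def normalize_gaussian(x):
    """(feature_val - mu) / sigma for roughly normal features."""
    return (x - np.mean(x)) / np.std(x)

def normalize_power_law(x):
    """log((1 + feature_val) / (1 + median)) for power-law-like features;
    note this maps the median to 0, matching the target described earlier."""
    return np.log((1.0 + x) / (1.0 + np.median(x)))
```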
  • These plots drive our intuition for why DNNs may be generalizing well for our application. When building a model feeding on hundreds of features, the combinatorial space of all the feature values is impossibly large, and during training often a fraction of the feature combinations are covered. The smooth distributions coming from the lower layers ensure that the upper layers can correctly “interpolate” the behavior for unseen values. Extending this intuition all the way to the input layer, we put our best effort to ensure the input features have a smooth distribution. (View Highlight)
  • In addition to mapping features to a restricted numerical range, we ensured most of them had a smooth distribution. Why obsess over the smoothness of distributions? Below are some of our reasons. (View Highlight)
  • How can we test if the model is generalizing well beyond logged examples? The real test is of course online performance of the model, but we found the following technique useful as a sanity check: scaling all the values of a given feature in the test set, such as price to 2x, 3x, 4x etc. and observing changes in NDCG. We found that the model’s performance was remarkably stable over these values it had never seen before. (View Highlight)
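A possible shape for this sanity check, with `model.predict` and `ndcg_fn` standing in for whatever scoring and offline evaluation harness is actually in use.

```python
import numpy as np

def scaling_sanity_check(model, X_test, feature_idx, ndcg_fn, factors=(2, 3, 4)):
    """Scale one feature (e.g. price) by 2x/3x/4x across the test set and
    recompute the offline metric for each factor."""
    results = {}
    for factor in factors:
        X_scaled = X_test.copy()
        X_scaled[:, feature_idx] *= factor          # values the model never saw in training
        results[factor] = ndcg_fn(model.predict(X_scaled))
    return results
```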
  • 4.2.1 Spotting bugs. When dealing with hundreds of millions of feature samples, how can we verify a small fraction of them are not buggy? Range checks are useful but limited. We found smoothness of distribution an invaluable tool to spot bugs as the distribution of errors often stands in contrast to the typical distribution. To give an example, we had bugs related to currencies in the prices logged for certain geographies. And for periods greater than 28 days, the logged price was the monthly price instead of the daily price. These errors showed up as spikes on the initial distribution plots. (View Highlight)
  • Most features attained a smooth distribution once debugged and applied the appropriate normalization. However, for a few we had to do specialized feature engineering. An example is the geo location of a listing, represented by its latitude and longitude. Figure 11a and 11b show the distribution of raw lat/lng. To make the distribution smooth, instead of raw lat/lng we compute the offset from the center of the map displayed to the user. Shown in Figure 11c, the mass seems concentrated at the center due to the tail end of maps which are zoomed out a lot. So we take log() of the lat/lng offset, which yields the distribution in Figure 11d. This allows us to construct two features with smooth distributions. (View Highlight)
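One plausible implementation of the lat/lng treatment described above; the signed log is an assumption to handle negative offsets, which the highlight does not spell out.

```python
import numpy as np

def smooth_latlng(lat, lng, map_center_lat, map_center_lng):
    """Offset from the center of the map shown to the user, then a signed
    log to pull in the long tail from heavily zoomed-out maps."""
    dlat = lat - map_center_lat
    dlng = lng - map_center_lng
    # signed log so offsets keep their direction (an assumption; the
    # highlight only says "take log() of the lat/lng offset")
    return (np.sign(dlat) * np.log1p(np.abs(dlat)),
            np.sign(dlng) * np.log1p(np.abs(dlng)))
```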
  • 4.2.2 Facilitating generalization. Answering exactly why DNNs are good at generalizing is a complicated topic at the forefront of research [26]. Meanwhile our working knowledge is based on the observation that in the DNNs built for our application, the outputs of the layers get progressively smoother in terms of their distributions. Figure 8 shows the distribution from the final output layer, while Figure 9 and Figure 10 show some samples from the hidden layers. For plotting the values from the hidden layers, the zeros have been omitted and the transform log(1 + relu output) applied. (View Highlight)
  • To be clear, the raw lat/lng to offsets from map center transform is a lossy many-to-one function as it can convert multiple lat/lng to the same offset values. This allows the model to learn global properties based on distance rather than properties of specific geographies. (View Highlight)
  • To learn properties localized to specific geographies we use high cardinality categorical features that we describe later. (View Highlight)
  • The overfitting listing id was not the only high cardinality categorical feature we tried. There were other attempts, where true to the promise of NNs, we got high returns with little feature engineering. (View Highlight)
  • Checking feature completeness. In some cases, investigating the lack of smoothness of certain features led to the discovery of features the model was missing. As an example, we had the fraction of available days a listing was occupied in the future as a signal of quality, the intuition being that high quality listings get sold out ahead of time. But the distribution of occupancy turned out to be perplexing in its lack of smoothness. (View Highlight)
  • The preference of guests for various neighborhoods of a city is an important location signal. For the GBDT model, this information was fed by a heavily engineered pipeline that tracked the hierarchical distribution of bookings over neighborhoods and cities. The effort involved in building and maintaining this pipeline was substantial. Yet it didn’t factor in key elements like prices of the booked listings. (View Highlight)
  • After investigation we found an additional factor that influenced occupancy: listings had varying minimum stay requirements, sometimes extending to months, due to which they got occupied at different rates. However, we had not added the minimum required stay as a feature in the model as it was calendar dependent and considered too complex. But after looking at the occupancy distribution, we added average length of stay at the listing as a feature. Once occupancy is normalized by average length of stay, we see the distribution in Figure 12b. Some features that lack a smooth distribution in one dimension may become smooth in a higher dimension. It was helpful for us to think through if those dimensions were already available to the model and if not, then adding them. (View Highlight)
  • We created a new categorical feature by taking the city specified in the query and the level 12 S2 cell [16] corresponding to a listing, then mapping the two together to an integer using a hashing function. For example, given the query ”San Francisco” and a listing near the Embarcadero, we take the S2 cell the listing is situated in (539058204), and hash {”San Francisco”, 539058204} → 71829521 to build our categorical feature. These categorical features are then mapped to an embedding, which feeds the NN. During training, the model infers the embedding with back propagation which encodes the location preference for the neighborhood represented by the S2 cell, given the city query. (View Highlight)
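A hypothetical version of this hashing step; the hash function, vocabulary size, and helper name are placeholders, since the highlight only states that the {city, S2 cell} pair is hashed to an integer that indexes an embedding.

```python
import hashlib

def location_signal_feature(query_city, s2_cell_id, vocab_size=1_000_000):
    """Map the {query city, level-12 S2 cell of the listing} pair to an
    integer bucket that indexes an embedding table."""
    key = f"{query_city}|{s2_cell_id}".encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % vocab_size

# e.g. location_signal_feature("San Francisco", 539058204) -> some bucket id
```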
  • This section is concerned with speeding up training and scoring. A quick summary of our pipeline: search queries from a guest hit a Java™ server that does retrieval and scoring. The server also produces logs which are stored as serialized Thrift™ instances. The logs are processed using a Spark™ pipeline to create training data. Model training is done using TensorFlow™. Various tools written in Scala and Java™ are used for evaluating the models and computing offline metrics. The models are then uploaded to the Java™ server that does retrieval and scoring. (View Highlight)
  • Dropout. Our initial impression was that dropout is the counterpart of regularization for neural networks [22], and thereby essential. However, for our application the different flavors of dropout we tried all led to a slight degradation in offline metrics. To reconcile our lack of success with dropout, our current interpretation of it is closer to a data augmentation technique [1], effective when the randomness introduced mimics valid scenarios that may be missing in the training data. For our case, the randomness was simply producing invalid scenarios that distracted the model. (View Highlight)
  • As an alternative, we added hand crafted noise shapes taking into account the distribution of particular features, resulting in an improvement of ∼1% in offline NDCG. But we failed to get any statistically significant improvement in online performance. (View Highlight)
  • Initialization. Out of sheer habit, we started our first model by initializing all weights and embeddings to zero, only to discover that is the worst way to start training a neural network. After surveying different techniques, our current choice is to use Xavier initialization [5] for the network weights and random uniform in the {-1, 1} range for embeddings. (View Highlight)
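One way to express these choices in Keras, assuming GlorotUniform as the Xavier variant; the layer sizes and embedding vocabulary are placeholders.

```python
import tensorflow as tf

# Xavier (Glorot) initialization for network weights, random uniform in
# {-1, 1} for embeddings.
dense = tf.keras.layers.Dense(
    127, activation="relu",
    kernel_initializer=tf.keras.initializers.GlorotUniform())

embedding = tf.keras.layers.Embedding(
    input_dim=100_000, output_dim=32,
    embeddings_initializer=tf.keras.initializers.RandomUniform(
        minval=-1.0, maxval=1.0))
```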
  • Protobufs and Datasets. The GBDT model was fed training data in CSV format, and we reused much of this pipeline to feed the TensorFlow™ models using feed_dict. At first glance this seems like a “non-machine learning” issue, and was quite low in our list of priorities. Our wake up call came when we found our GPU utilizations near ∼25%. Most of the training time was spent in parsing CSV and copying data through feed_dict. We were effectively towing a Ferrari with a mule. Retooling the pipeline to produce training data as Protobufs and using Dataset [8] gave a 17x speedup to training and drove GPU utilization to ∼90%. This ultimately allowed us to attack the problem by scaling the training data from weeks to months. (View Highlight)
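A sketch of a protobuf-plus-Dataset input pipeline in the spirit of that retooling, written against the current tf.data API rather than the feed_dict-era code; feature names, shapes, and file paths are placeholders.

```python
import tensorflow as tf

feature_spec = {
    "features": tf.io.FixedLenFeature([195], tf.float32),
    "label": tf.io.FixedLenFeature([1], tf.float32),
}

def parse(record):
    example = tf.io.parse_single_example(record, feature_spec)
    return example["features"], example["label"]

dataset = (tf.data.TFRecordDataset("training_data.tfrecord")   # placeholder path
           .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(10_000)
           .batch(200)             # the fixed batch size used for these models
           .prefetch(tf.data.AUTOTUNE))
```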
  • Learning rate. An overwhelming range of strategies confronted us here, but for our application we found it hard to improve upon the performance of Adam [12] with its default settings. Currently we use a variant, LazyAdamOptimizer [7], which we found faster when training with large embeddings. (View Highlight)
  • Batch size. Varying batch size has a dramatic effect on training speed, but its exact effect on the model itself is hard to grasp. The most useful pointer we found was [18]. We, however, didn’t quite follow the advice in the paper. Having swept the learning rate issue under the LazyAdamOptimizer carpet, we just opted for a fixed batch size of 200, which seemed to work for the current models. (View Highlight)
  • Refactoring static features. A large number of our features were properties of listings that rarely changed. For instance, location, number of bedrooms, a long list of amenities, guest rules etc. Reading all these features as part of each training example created an input bottleneck. To eliminate this disk traffic, we used only the listing id as a categorical feature. All quasi-static listing features were packed as a non-trainable embedding indexed by the listing id. For the features that had a mutation over the training period, this traded off a small level of noise against training speed. Resident in the GPU memory, the embedding eliminated multiple kilobytes of data per training example that used to get loaded from disk via the CPU. This efficiency made it possible to explore a whole new class of models which took into account fine details about tens of listings the user had interacted with in the past. (View Highlight)
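A minimal sketch of packing quasi-static listing features into a frozen embedding keyed by listing id; the table size, feature count, and the zero-filled matrix are placeholders for the real precomputed listing rows.

```python
import numpy as np
import tensorflow as tf

num_listings, num_static_feats = 1_000_000, 40              # placeholders
static_matrix = np.zeros((num_listings, num_static_feats), dtype=np.float32)

listing_id_in = tf.keras.Input(shape=(1,), dtype="int32")   # only the id travels per example
static_lookup = tf.keras.layers.Embedding(
    num_listings, num_static_feats, trainable=False)         # frozen lookup table
static_feats = tf.keras.layers.Flatten()(static_lookup(listing_id_in))

# Load the precomputed quasi-static listing features into the built layer.
static_lookup.set_weights([static_matrix])
```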
  • Estimating feature importance and model interpretability in general is an area where we took a step back with the move to NNs. Estimating feature importance is crucial in prioritizing engineering effort and guiding model iterations. The strength of NNs is in figuring out nonlinear interactions between the features. This is also the weakness when it comes to understanding what role a particular feature is playing, as nonlinear interactions make it very difficult to study any feature in isolation. (View Highlight)
  • Java™ NN library. Towards the beginning of 2017 when we started shipping the TensorFlow™ models to production, we found no efficient solution to score the models within a Java™ stack. Typically a back and forth conversion of the data between Java™ and another language was required and the latency introduced in the process was a blocker for us. To adhere to the strict latency requirement of search, we created a custom neural network scoring library in Java™. While this has served us well till this point, we expect to revisit the issue to see the latest alternatives available. (View Highlight)
  • Score Decomposition. A homegrown partial dependence tool similar to [13] was the backbone of feature analysis in the GBDT world. In the NN world trying to understand individual feature importance only led to confusion. Our first naive attempt was to take the final score produced by the network, and try to decompose it into contributions coming from each input node. (View Highlight)
  • There was no clean way to separate the influence of a particular incoming node across a non-linear activation like ReLU. (View Highlight)
  • Figure 15 summarizes our deep learning journey so far. Feeding on the ubiquitous deep learning success stories, we started at the peak of optimism, thinking deep learning would be a drop-in replacement for the GBDT model and give us stupendous gains out of the box. (View Highlight)
  • Ablation Test. This was another simplistic attack on the problem. The idea here was to ablate the features one at a time, retrain the model and observe the difference in performance. We could then assign the features importance proportional to the drop in performance their absence produced. However, the difficulty here was that any performance difference obtained by dropping a single feature resembled the typical noise in offline metrics observed anyway while retraining models. Possibly due to non-trivial redundancy in our feature set, the model seemed capable of making up for a single absent feature from the remaining ones. This leads to a Ship-of-Theseus paradox: can you keep ablating one feature at a time from the model, claiming it has no significant drop in performance? (View Highlight)
  • A lot of initial discussions centered around keeping everything else invariant and replacing the current model with a neural network to see what gains we could get. This set us up for a plunge into the valley of despair, when initially none of those gains materialized. (View Highlight)
  • Permutation Test. We raised the sophistication in our next attempt, taking inspiration from permutation feature importance proposed for random forests [20]. We observed the performance of the model on a test set after randomly permuting the values of a feature across the examples in the test set. Our expectation was that the more important a feature, the higher the resulting degradation from perturbing it. (View Highlight)
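A generic permutation-importance sketch along these lines; `model.predict` and `eval_fn` are placeholders for the real scoring and offline NDCG computation.

```python
import numpy as np

def permutation_importance(model, X_test, eval_fn, seed=0):
    """Permute one feature at a time across the test examples and record the
    drop in the offline metric; larger drops suggest more important features."""
    rng = np.random.default_rng(seed)
    baseline = eval_fn(model.predict(X_test))
    drops = []
    for j in range(X_test.shape[1]):
        X_perm = X_test.copy()
        # Break the feature's relationship with the other columns.
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops.append(baseline - eval_fn(model.predict(X_perm)))
    return np.array(drops)
```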
  • In fact, all we saw in the beginning was regression in offline metrics. Over time we realized that moving to deep learning is not a drop-in model replacement at all; rather it’s about scaling the system. As a result, it required rethinking the entire system surrounding the model. Confined to smaller scales, models like GBDT are arguably at par in performance and easier to handle, and we continue to use them for focused medium sized problems. (View Highlight)
  • So would we recommend deep learning to others? That would be a wholehearted Yes. And it’s not only because of the strong gains in the online performance of the model. Part of it has to do with how deep learning has transformed our roadmap ahead. Earlier the focus was largely on feature engineering, but after the move to deep learning, trying to do better math on the features manually has lost its luster. This has freed us up to investigate problems at a higher level, like how can we improve our optimization objective, and are we accurately representing all our users? (View Highlight)
  • This exercise however led to somewhat nonsensical results, like one of the most important features for predicting booking probability came out to be the number of rooms in a listing. The reason was that in permuting the features one at a time, we had baked in the assumption that the features were independent of each other, which was simply false. Number of rooms for instance is closely tied to price, number of guests staying, typical amenities etc. Permuting the feature independently created examples that never occurred in real life, and the importance of features in that invalid space sent us in the wrong direction. The test however was somewhat useful in determining features that were not pulling their weight. If randomly permuting a feature didn’t affect the model performance at all, it was a good indication that the model was probably not dependent on it. (View Highlight)
  • TopBot Analysis. A homegrown tool designed to interpret the features without perturbing them in any way provided some interesting insights. Named TopBot, short for top-bottom analyzer, it took a test set as input and used the model to rank the listings per test query. (View Highlight)
  • It then generated distribution plots of feature values from the listings ranked at the top for each query, and compared them to the distribution of feature values from the listings at the bottom. (View Highlight)