```dataview
LIST
FROM #Collection
WHERE file.name = this.Entry-For
```

Ad Click Prediction - A Look at Industrial-Scale Ads Recommendation Models
🎤 Vocab
❗ Information
Title:
Ad Click Prediction - A Look at Industrial-Scale Ads Recommendation Models
Abstract:
This talk offers a practical overview of lessons learned from training and deploying click-through rate (CTR) prediction models at industrial scale. We will explore the key trade-offs between model accuracy, cost, and system complexity. The presentation will outline proven techniques for optimizing performance throughout the machine learning lifecycle, including architectural choices, effective training strategies, and approaches for addressing challenges such as reproducibility and generalization across diverse UI treatments.
Bio:
Xiaoyue Li (Software Engineer, Google)
Xiaoyue is a software engineer at Google, working on the development of search ads prediction models. She is an alumna of the Statistics Department at UC Davis (class of 2020). This talk explores the challenges of ad click prediction at industrial scale, highlighting the unique challenges of the problem and showcasing a selection of practical solutions being developed by her team.
- xiaoyuei[at]google.com
✒️ -> Scratch Notes
preface:
based on a paper, look up talk title
- everything presented is published work
problem
- click prediction at Google
- based on search result clicks
- p(click | query)
- also include ad, context, even user potentially

given a fully formed hypothetical SERP scenario, what is the probability that the user would click on this ad if that scenario were shown?
expected revenue = click-through rate (CTR) * cost per click (CPC)
- important for charging correctly
- charged based on actual clicks (no click, no charge)
simple-seeming problem? no…
balance multiple sponsors: for example, estimate values of clicked "winners" and "runner-ups". Auction system?
- pCTR (predicted CTR)?
- pCTR inference?
everything is impacting the price
Non-stationarity
CTR changes over time
- weekends, holidays, special events, weather, …
much more of an optimization/bandit problem vs. an AI challenge
many traditional statistical/ML methods fall apart
- (invalid) "assume we are drawing events from a fixed distribution…"
scalable online ML
perform a sequential pass over the dataset (progressive validation)
- makes the test set very closely match the training set
- eval metrics are computed on examples before they are trained on
serving models continue to train as new data arrives - allows for faster adaptation to the shifting CTR distribution
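The progressive-validation loop above can be sketched as a toy online logistic regression (the model, data, and hyperparameters are illustrative, not the talk's system): score each example before taking the gradient step on it, so the running eval metric is always computed on data the model has not yet trained on.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def progressive_validation(stream, dim, lr=0.1):
    """Single sequential pass: score each example BEFORE training on it,
    so the running eval metric is always computed on unseen data."""
    w = np.zeros(dim)
    log_loss_sum = 0.0
    n = 0
    for x, y in stream:                  # x: features, y: 0/1 click label
        p = sigmoid(w @ x)               # predict first (progressive eval)
        eps = 1e-12
        log_loss_sum += -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        n += 1
        w += lr * (y - p) * x            # then one online SGD step on the same example
    return w, log_loss_sum / n           # model keeps training as data arrives

# toy click stream: click probability driven by the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (rng.random(500) < sigmoid(2.0 * X[:, 0])).astype(int)
w, avg_progressive_loss = progressive_validation(zip(X, y), dim=3)
```

Because every prediction precedes the corresponding update, the averaged loss doubles as an honest held-out metric without ever splitting off a test set.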
accuracy vs cost
gains must be weighed against cost (to long-term research velocity)
- training cost: bandwidth, throughput, latency
- deployment cost: compute, latency
some techniques aren't worth the cost (e.g. ensembling many models). Often a technique can be "tuned" toward efficiency or accuracy
themes of effective techniques:
- finding the right objectives
- amortizing cost (Shampoo optimizer, distillation)
- looking at overall system bottlenecks and adding cost where it hurts least
serving time (latency) is not necessarily equal to training time
- lots of techniques to speed up training (massive batching, …)
- limitations at serving (queries per second, etc.)
reducing cost:
- efficient model structures
- data sampling (reduction?)
larger networks and advancing ML techniques tend to increase BOTH cost and quality
efficiency
efficiency: bottleneck layers tuned with AutoML
model consists of fully connected ResNet layers
approximate the full matmul with a bottleneck, reducing cost from m·n to m·k + k·n (k << n and k << m)
neural architecture search trains all candidates simultaneously in a single iteration, selecting a specific architecture by activating part of the super-network with masking
RL reward function incorporates a model of computation cost, allowing for trading off accuracy and efficiency
weight-sharing-based NAS with cost constraints?
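The m·n → m·k + k·n saving can be seen directly by factoring a dense layer through a k-dimensional bottleneck. The sizes below are arbitrary, a sketch rather than the talk's actual layer shapes (which would be chosen by the AutoML search):

```python
import numpy as np

m, n, k = 512, 1024, 32                 # k << m and k << n

rng = np.random.default_rng(0)
x = rng.normal(size=m)

# full dense layer: one m x n weight matrix, ~m*n multiply-adds
W_full = rng.normal(size=(m, n)) / np.sqrt(m)

# bottleneck: factor through rank k, ~m*k + k*n multiply-adds
W_down = rng.normal(size=(m, k)) / np.sqrt(m)
W_up = rng.normal(size=(k, n)) / np.sqrt(k)

y_full = x @ W_full                     # reference output, shape (n,)
y_bottleneck = (x @ W_down) @ W_up      # same shape, far fewer operations

full_cost = m * n                       # 524,288 multiply-adds
bottleneck_cost = m * k + k * n         # 49,152 -> roughly 10x cheaper
```

The bottleneck constrains the layer to rank k, so the search has to pick k large enough to preserve accuracy while keeping the cost win.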
efficiency: down-sampling training data
generally see diminishing returns as we increase our dataset. given that we do a single sequential pass, we can either:
- limit the time range
- down-sample examples within the time range
successful methods:
- reducing non-click examples
- reducing examples with low logistic loss
- reducing examples unlikely to have been seen by the user
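A minimal sketch of the first method, reducing non-click examples: keep every click, subsample non-clicks at some rate, and up-weight the kept non-clicks. The 1/rate reweighting is a standard trick to keep predictions calibrated and is my assumption here, not a detail stated in the talk:

```python
import numpy as np

def downsample_nonclicks(labels, rate, seed=0):
    """Keep every click; keep each non-click with probability `rate`.
    Kept non-clicks get weight 1/rate so the click-through rate implied
    by the weighted data stays unbiased."""
    rng = np.random.default_rng(seed)
    keep = (labels == 1) | (rng.random(len(labels)) < rate)
    weights = np.where(labels == 1, 1.0, 1.0 / rate)
    return keep, weights[keep]

labels = np.array([0] * 90 + [1] * 10)   # heavily non-click, as in CTR logs
keep, sample_weights = downsample_nonclicks(labels, rate=0.1)
```

With rate=0.1 this drops roughly 90% of the negatives, shrinking the dataset to near-balance while the weights preserve the original base CTR in expectation.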
efficiency and accuracy: Shampoo
second-order optimization with Shampoo improved accuracy significantly
able to avoid prohibitive cost by amortizing the most expensive computation (computing large matrix inverses) to every N steps
other tricks:
- partitioning large layers and running Shampoo on smaller blocks
- learning rate grafting was used to match per-layer step sizes when replacing AdaGrad with Shampoo
paper: Distributed Shampoo
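A toy rendering of the amortization idea for a single weight matrix, loosely following the Distributed Shampoo paper: the cheap statistics updates run every step, while the cubic-cost inverse-root computations run only every `refresh` steps. Class and parameter names are illustrative, not from the paper.

```python
import numpy as np

def inv_fourth_root(M, eps=1e-6):
    """M^(-1/4) for a symmetric PSD matrix via eigendecomposition.
    This is the expensive step that gets amortized."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag((vals + eps) ** -0.25) @ vecs.T

class ShampooSketch:
    """Toy Shampoo for one weight matrix: preconditioners are recomputed
    only every `refresh` steps, spreading their cost over many cheap steps."""
    def __init__(self, shape, lr=0.2, refresh=20):
        m, n = shape
        self.L = 1e-4 * np.eye(m)       # left statistics, accumulates G @ G.T
        self.R = 1e-4 * np.eye(n)       # right statistics, accumulates G.T @ G
        self.Linv = np.eye(m)
        self.Rinv = np.eye(n)
        self.lr, self.refresh, self.t = lr, refresh, 0

    def step(self, W, G):
        self.L += G @ G.T
        self.R += G.T @ G
        if self.t % self.refresh == 0:  # amortized expensive computation
            self.Linv = inv_fourth_root(self.L)
            self.Rinv = inv_fourth_root(self.R)
        self.t += 1
        return W - self.lr * self.Linv @ G @ self.Rinv

# sanity check: minimize ||W - T||^2, whose gradient is 2 * (W - T)
T = 3.0 * np.eye(4, 3)
W = np.zeros((4, 3))
opt = ShampooSketch(W.shape)
for _ in range(100):
    W = opt.step(W, 2.0 * (W - T))
final_loss = float(np.sum((W - T) ** 2))
```

The partitioning trick from the slide would apply `inv_fourth_root` to smaller diagonal blocks of L and R instead of the full matrices, trading preconditioner quality for cost.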
acc
acc: loss engineering
generally optimize for logistic loss, calibrated predictions
also important that the set of shown ads be correctly allocated (w.r.t. clicks)
- empirically, PerQueryAUC is a metric well correlated with business metrics
- unknown why? tried and true
- one effective approximation: RankNet loss (s_i and s_j: logit scores)
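The RankNet loss mentioned above is, in its standard formulation (Burges et al.), a pairwise loss on the logit scores s_i (clicked) and s_j (not clicked), used here as an auxiliary term next to the usual logistic loss:

```python
import numpy as np

def ranknet_loss(s_i, s_j):
    """RankNet pairwise loss on logit scores: -log(sigmoid(s_i - s_j)),
    i.e. log(1 + exp(-(s_i - s_j))). Small when the clicked ad's score
    s_i is above s_j, large when the ranking is inverted."""
    return np.log1p(np.exp(-(s_i - s_j)))

# clicked ad already ranked above the runner-up -> small penalty
small = ranknet_loss(2.0, -1.0)
# ranking inverted -> large penalty
large = ranknet_loss(-1.0, 2.0)
```

Summed over (clicked, non-clicked) pairs within a query, minimizing it directly pushes the per-query ranking - and hence PerQueryAUC - in the right direction.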
acc: distillation
student model uses the teacher prediction in an auxiliary loss function (in addition to the click label)
two-pass online distillation
- more expensive teacher model that records its predictions
- teacher is also a one-pass sequential training model
teacher is larger, more expensive, and trains on more data
teacher is trained once; cost amortized when repeatedly training students
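A sketch of the auxiliary distillation term: the student fits both the true click label and the teacher's recorded prediction, treated as a soft target. The `alpha` weighting is my assumption, not a value from the talk.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, target, eps=1e-12):
    """Binary cross-entropy against a (possibly soft) target."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

def distillation_loss(student_logit, click_label, teacher_pred, alpha=0.5):
    """Hard-label loss plus an auxiliary term matching the teacher's
    recorded prediction from the first pass."""
    p = sigmoid(student_logit)
    return bce(p, click_label) + alpha * bce(p, teacher_pred)

loss = distillation_loss(student_logit=0.0, click_label=1.0, teacher_pred=0.8)
```

Because the teacher's predictions are recorded during its own single pass, many students can reuse them without re-running the teacher - that is the amortization.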
acc: DCN
smart?
image from "DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-Scale Learning to Rank Systems"
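The core of DCN is the cross layer; the formula below is from the cited DCN-V2 paper, x_{l+1} = x_0 ⊙ (W x_l + b) + x_l, which builds explicit bounded-degree feature crosses of the embedded input (the dimensions and weights here are toy values):

```python
import numpy as np

def cross_layer(x0, xl, W, b):
    """One DCN-V2 cross layer: x_{l+1} = x0 * (W @ xl + b) + xl.
    Each application raises the max degree of feature crosses by one;
    the residual (+ xl) keeps lower-degree terms."""
    return x0 * (W @ xl + b) + xl

d = 4
rng = np.random.default_rng(0)
x0 = rng.normal(size=d)                # embedded input features
W = rng.normal(size=(d, d))
b = np.zeros(d)

x1 = cross_layer(x0, x0, W, b)         # up to degree-2 interactions
x2 = cross_layer(x0, x1, W, b)         # up to degree-3 interactions
```

Stacking a few cross layers alongside a deep tower gives explicit low-order interactions cheaply, instead of hoping the MLP learns them implicitly.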
generalization - UI treatments
which UI to show?
where to place?
what about other ads?
what is the runner-up ad's pCTR?
- important in pricing and eligibility decisions in the auction
- even if that runner-up ad is very bad
- or that runner-up scenario is to not show ads
gen - factorization model
gen - bias constraint
irreproducibility
repeating the same training procedure leads to multiple outcomes/models. various sources of stochasticity
they block a rollout if it is too irreproducible. development cycle hindered, especially for things like teacher models
reproducibility
ensembles
smoothed activations (Smooth ReLU, SmeLU)
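SmeLU as published (Shamir et al., "Smooth activations and reproducibility in deep networks") is a ReLU whose kink at zero is replaced by a quadratic piece on [-β, β], making the gradient continuous; the piecewise form below is my rendering of that definition:

```python
import numpy as np

def smelu(x, beta=1.0):
    """Smooth reLU (SmeLU): 0 for x <= -beta, x for x >= beta, and a
    quadratic joint (x + beta)^2 / (4 * beta) in between, so both the
    function and its gradient are continuous."""
    return np.where(x <= -beta, 0.0,
           np.where(x >= beta, x, (x + beta) ** 2 / (4.0 * beta)))

xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
ys = smelu(xs)   # agrees with ReLU outside [-beta, beta]
```

The intuition for reproducibility: ReLU's discontinuous gradient lets tiny stochastic differences flip units on or off early in training; a smooth joint damps that sensitivity.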
summary
non-stationarity -> scalable online ML frameworks
cost vs accuracy tradeoff -> efficient matmul, AutoML tuning, loss engineering, DCN, distillation, etc.
generalization to uncommon data -> factorization model, bias constraints
reproducibility -> smooth activations
references
pic taking
🧪 -> Refresh the Info
Did you generally find the overall content understandable, compelling, or relevant or not, and why? Which aspects of the reading were most novel or challenging for you, and which were most familiar or straightforward?
Did a specific aspect of the reading raise questions for you or relate to other ideas and findings you've encountered, or are there other related issues you wish had been covered?
🔗 -> Links
Resources
- Put useful links here
Connections
- Link all related words