Learning Supplementary NLP features for CTR Prediction in Sponsored Search

  • Dong Wang,
  • Shaoguang Yan,
  • Yunqing Xia,
  • Kavé Salamatian,
  • Weiwei Deng,
  • Qi Zhang

The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22)

Organized by ACM


In sponsored search engines, pre-trained language models have shown promising performance improvements on Click-Through-Rate (CTR) prediction. A widely used approach to leveraging pre-trained language models in CTR prediction is to fine-tune them with click labels, stopping early at the peak value of the obtained Area Under the ROC Curve (AUC). The output of the fine-tuned model, i.e., the final score or an intermediate embedding generated by the language model, is then fed as a new Natural Language Processing (NLP) feature into the CTR prediction baseline. This cascade approach avoids complicating the CTR prediction baseline while keeping flexibility and agility. However, we show in this work that calibrating the language model separately, based on its peak single-model AUC, does not always yield the NLP features that ultimately perform best in the CTR prediction model. Our analysis reveals that this misalignment stems from overlap and redundancy between the new NLP features and the features already present in the CTR prediction baseline. In other words, the NLP features can improve CTR prediction more if this overlap is reduced.
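The cascade approach described above can be sketched in miniature. The snippet below is a toy illustration, not the paper's implementation: a linear model over synthetic "text" embeddings stands in for the language model, and plain logistic regression with gradient descent stands in for fine-tuning; all variable names and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def auc(y, s):
    """Rank-based AUC: probability a random positive outscores a random negative."""
    order = np.argsort(s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    pos = y == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_logreg(X, y, Xv, yv, lr=0.5, max_epochs=200, patience=5):
    """Logistic regression via gradient descent, early-stopped at peak validation AUC."""
    w = np.zeros(X.shape[1])
    best_w, best_auc, bad = w.copy(), 0.0, 0
    for _ in range(max_epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
        a = auc(yv, Xv @ w)
        if a > best_auc:
            best_auc, best_w, bad = a, w.copy(), 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_w, best_auc

# Synthetic stand-ins: "text" embeddings z and existing CTR baseline features b.
n, d_text, d_base = 2000, 8, 4
z = rng.normal(size=(n, d_text))
b = rng.normal(size=(n, d_base))
logits = z[:, 0] + 0.5 * b[:, 0]           # clicks depend on both branches
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(float)
tr, va = slice(0, 1500), slice(1500, n)

# Step 1: "fine-tune" the language-model proxy on click labels alone,
# early stopping on its own single-model validation AUC.
w_lm, lm_auc = fit_logreg(z[tr], y[tr], z[va], y[va])

# Step 2 (cascade): feed the frozen LM score into the baseline as a new NLP feature.
nlp_feat = (z @ w_lm)[:, None]
X_aug = np.hstack([b, nlp_feat])
w_ctr, ctr_auc = fit_logreg(X_aug[tr], y[tr], X_aug[va], y[va])
```

Note that step 1 never sees the baseline features, which is exactly why the resulting NLP feature can end up redundant with them.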

For this purpose, we introduce a simple and general joint-training framework that fine-tunes the language model together with the features already present in the CTR prediction baseline, so that the resulting NLP feature captures supplementary knowledge. Moreover, we develop an efficient Supplementary Knowledge Distillation (SuKD) that transfers the supplementary knowledge learned by a heavy language model to a light, serviceable model. Comprehensive experiments on both public and commercial data demonstrate that the NLP features produced by the joint-training framework significantly outperform those obtained by independent fine-tuning on click labels. It is also shown that the light model distilled with SuKD provides a clear AUC improvement in CTR prediction over traditional feature-based knowledge distillation.
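A minimal sketch of the two ideas, under the same toy linear setup as before (synthetic data, linear branches in place of real networks; the variable names and the least-squares "student" are illustrative assumptions, not the paper's SuKD formulation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: "text" embeddings z and baseline CTR features b.
n, d_t, d_b = 2000, 8, 4
z = rng.normal(size=(n, d_t))
b = rng.normal(size=(n, d_b))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(z[:, 0] + b[:, 0])))).astype(float)

# Joint training: optimize the text weights w_t and baseline weights w_b together,
# so the text branch is only rewarded for signal the baseline features lack.
w_t, w_b = np.zeros(d_t), np.zeros(d_b)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(z @ w_t + b @ w_b)))
    g = (p - y) / n
    w_t -= 0.5 * z.T @ g
    w_b -= 0.5 * b.T @ g

# Supplementary NLP feature = the text-branch score learned under joint training.
sup_feat = z @ w_t

# SuKD-style step (sketch): distill the heavy model's supplementary score into a
# light student, here a linear map over a truncated, cheaper representation.
z_light = z[:, :3]                              # stand-in for the light model's input
w_s, *_ = np.linalg.lstsq(z_light, sup_feat, rcond=None)
student_feat = z_light @ w_s
```

The design point is that the teacher's target for distillation is the supplementary score itself, rather than the raw click labels, so the student inherits exactly the non-redundant part of the signal.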