We released two large-scale datasets for research on learning to rank: MSLR-WEB30K, with more than 30,000 queries, and MSLR-WEB10K, a random sample of it with 10,000 queries.
Dataset Descriptions
The datasets are machine learning data in which queries and urls are represented by IDs. They consist of feature vectors extracted from query-url pairs, along with relevance judgment labels:
(1) The relevance judgments are obtained from a retired labeling set of a commercial web search engine (Microsoft Bing) and take 5 values, from 0 (irrelevant) to 4 (perfectly relevant).
(2) The features were extracted by us and are those widely used in the research community.
In the data files, each row corresponds to a query-url pair. The first column is the relevance label of the pair, the second column is the query id, and the remaining columns are features. The larger the relevance label, the more relevant the query-url pair. Each query-url pair is represented by a 136-dimensional feature vector.
Below are two rows from the MSLR-WEB10K dataset:
==============================================
0 qid:1 1:3 2:0 3:2 4:2 … 135:0 136:0
2 qid:1 1:3 2:3 3:0 4:0 … 135:0 136:0
==============================================
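The row layout described above (label, then `qid:<id>`, then `<feature id>:<value>` pairs) can be read with a few lines of Python. This is a minimal sketch, not an official loader: the function name `parse_mslr_line` is hypothetical, and the input row is a shortened illustration with the elided features left out.

```python
from typing import Dict, Tuple


def parse_mslr_line(line: str) -> Tuple[int, str, Dict[int, float]]:
    """Parse one row into (relevance label, query id, {feature id: value})."""
    tokens = line.split()
    label = int(tokens[0])                  # first column: relevance label, 0..4
    key, _, qid = tokens[1].partition(":")  # second column: "qid:<id>"
    if key != "qid":
        raise ValueError(f"expected 'qid:<id>' in column 2, got {tokens[1]!r}")
    features: Dict[int, float] = {}
    for token in tokens[2:]:                # remaining columns: "<feature id>:<value>"
        fid, _, value = token.partition(":")
        features[int(fid)] = float(value)
    return label, qid, features


# Shortened illustrative row in the same layout (real rows carry all 136 features).
label, qid, features = parse_mslr_line("2 qid:1 1:3 2:3 3:0 4:0 135:0 136:0")
print(label, qid, features[2])
```

Feature values are parsed as floats on the assumption that some features are real-valued; keeping them in a dictionary keyed by feature id avoids relying on column order.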
Dataset Partition
We have partitioned each dataset into five parts with about the same number of queries, denoted S1, S2, S3, S4, and S5, for five-fold cross-validation. In each fold, we propose using three parts for training, one part for validation, and the remaining part for testing (see the following table). The training set is used to learn ranking models. The validation set is used to tune the hyperparameters of the learning algorithms, such as the number of iterations in RankBoost and the combination coefficient in the objective function of Ranking SVM. The test set is used to evaluate the performance of the learned ranking models.
Folds | Training Set | Validation Set | Test Set |
Fold1 | {S1,S2,S3} | S4 | S5 |
Fold2 | {S2,S3,S4} | S5 | S1 |
Fold3 | {S3,S4,S5} | S1 | S2 |
Fold4 | {S4,S5,S1} | S2 | S3 |
Fold5 | {S5,S1,S2} | S3 | S4 |
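The fold table follows a simple rotation: each fold shifts the starting part by one. The sketch below reproduces the table programmatically; the function name `make_folds` and the dictionary keys are illustrative choices, not part of the dataset release.

```python
from typing import Dict, List, Sequence

PARTS = ("S1", "S2", "S3", "S4", "S5")


def make_folds(parts: Sequence[str] = PARTS) -> Dict[str, Dict[str, List[str]]]:
    """Rotate the five parts: fold i trains on three consecutive parts,
    validates on the next one, and tests on the one after that."""
    n = len(parts)
    folds: Dict[str, Dict[str, List[str]]] = {}
    for i in range(n):
        rotated = [parts[(i + j) % n] for j in range(n)]
        folds[f"Fold{i + 1}"] = {
            "train": rotated[:3],
            "validation": rotated[3:4],
            "test": rotated[4:],
        }
    return folds


for name, split in make_folds().items():
    print(name, split)
```

Across the five folds, every part serves as the test set exactly once and as the validation set exactly once.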
Datasets
The datasets were released on June 16, 2010.
To use the datasets, you must read and accept the online agreement. By using the datasets, you agree to be bound by the terms of their license.
Datasets | Size | MD5 |
MSLR-WEB10K | ~ 1.2G | 97c5d4e7c171e475c91d7031e4fd8e79 |
MSLR-WEB30K | ~ 3.7G | 4beae4bee0cd244fc9b2aff355a61555 |
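After downloading, the MD5 values in the table can be checked against the local file. The following is a minimal sketch using Python's standard `hashlib`; the helper names and the archive file name in the usage note are assumptions, since the download file names are not given here.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksums copied from the table above.
EXPECTED_MD5 = {
    "MSLR-WEB10K": "97c5d4e7c171e475c91d7031e4fd8e79",
    "MSLR-WEB30K": "4beae4bee0cd244fc9b2aff355a61555",
}


def verify(path: str, dataset: str) -> bool:
    """True if the file at `path` matches the published checksum for `dataset`."""
    return md5_of_file(path) == EXPECTED_MD5[dataset]
```

For example, `verify("MSLR-WEB10K.zip", "MSLR-WEB10K")` would check a downloaded archive, where `MSLR-WEB10K.zip` is a hypothetical local file name. Reading in chunks keeps memory use flat for multi-gigabyte files.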
Evaluation tools
The evaluation script was updated on Jan. 13, 2011. Thank you to Yasser Ganjisaffar for pointing out the bug.
- Evaluation script for NDCG (meanNDCG) and Precision (MAP)
- Significance test script for algorithm comparison
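For readers who want a quick sanity check without the released script, NDCG for a single query can be sketched as follows. This is not the official evaluation script: the gain `2^label - 1` and the `log2(rank + 1)` discount are one common formulation, and returning 0 for queries with no relevant document is an assumption. Results may therefore differ from the released tools.

```python
import math
from typing import Sequence


def dcg_at_k(labels: Sequence[int], k: int) -> float:
    """Discounted cumulative gain over the top-k labels, given in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)   # rank is 0-based here
               for rank, rel in enumerate(labels[:k]))


def ndcg_at_k(labels_in_ranked_order: Sequence[int], k: int) -> float:
    """NDCG@k for one query; labels are listed in the order the model ranked them."""
    ideal = sorted(labels_in_ranked_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0  # assumption: a query with no relevant document scores 0
    return dcg_at_k(labels_in_ranked_order, k) / ideal_dcg


def mean_ndcg_at_k(per_query_labels: Sequence[Sequence[int]], k: int) -> float:
    """Average NDCG@k over queries."""
    return sum(ndcg_at_k(labels, k) for labels in per_query_labels) / len(per_query_labels)
```

Here each inner sequence holds the 0 to 4 relevance labels of one query's documents, sorted by the model's predicted score from highest to lowest.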
Feature List
Each query-url pair is represented by a 136-dimensional vector.
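Feature List of Microsoft Learning to Rank Datasets
feature id | feature description | stream | comments |
1 | covered query term number | body | |
2 | covered query term number | anchor | |
3 | covered query term number | title | |
4 | covered query term number | url | |
5 | covered query term number | whole document | |
6 | covered query term ratio | body | |
7 | covered query term ratio | anchor | |
8 | covered query term ratio | title | |
9 | covered query term ratio | url | |
10 | covered query term ratio | whole document | |
11 | stream length | body | |
12 | stream length | anchor | |
13 | stream length | title | |
14 | stream length | url | |
15 | stream length | whole document | |
16 | IDF (inverse document frequency) | body | |
17 | IDF (inverse document frequency) | anchor | |
18 | IDF (inverse document frequency) | title | |
19 | IDF (inverse document frequency) | url | |
20 | IDF (inverse document frequency) | whole document | |
21 | sum of term frequency | body | |
22 | sum of term frequency | anchor | |
23 | sum of term frequency | title | |
24 | sum of term frequency | url | |
25 | sum of term frequency | whole document | |
26 | min of term frequency | body | |
27 | min of term frequency | anchor | |
28 | min of term frequency | title | |
29 | min of term frequency | url | |
30 | min of term frequency | whole document | |
31 | max of term frequency | body | |
32 | max of term frequency | anchor | |
33 | max of term frequency | title | |
34 | max of term frequency | url | |
35 | max of term frequency | whole document | |
36 | mean of term frequency | body | |
37 | mean of term frequency | anchor | |
38 | mean of term frequency | title | |
39 | mean of term frequency | url | |
40 | mean of term frequency | whole document | |
41 | variance of term frequency | body | |
42 | variance of term frequency | anchor | |
43 | variance of term frequency | title | |
44 | variance of term frequency | url | |
45 | variance of term frequency | whole document | |
46 | sum of stream length normalized term frequency | body | |
47 | sum of stream length normalized term frequency | anchor | |
48 | sum of stream length normalized term frequency | title | |
49 | sum of stream length normalized term frequency | url | |
50 | sum of stream length normalized term frequency | whole document | |
51 | min of stream length normalized term frequency | body | |
52 | min of stream length normalized term frequency | anchor | |
53 | min of stream length normalized term frequency | title | |
54 | min of stream length normalized term frequency | url | |
55 | min of stream length normalized term frequency | whole document | |
56 | max of stream length normalized term frequency | body | |
57 | max of stream length normalized term frequency | anchor | |
58 | max of stream length normalized term frequency | title | |
59 | max of stream length normalized term frequency | url | |
60 | max of stream length normalized term frequency | whole document | |
61 | mean of stream length normalized term frequency | body | |
62 | mean of stream length normalized term frequency | anchor | |
63 | mean of stream length normalized term frequency | title | |
64 | mean of stream length normalized term frequency | url | |
65 | mean of stream length normalized term frequency | whole document | |
66 | variance of stream length normalized term frequency | body | |
67 | variance of stream length normalized term frequency | anchor | |
68 | variance of stream length normalized term frequency | title | |
69 | variance of stream length normalized term frequency | url | |
70 | variance of stream length normalized term frequency | whole document | |
71 | sum of tf*idf | body | |
72 | sum of tf*idf | anchor | |
73 | sum of tf*idf | title | |
74 | sum of tf*idf | url | |
75 | sum of tf*idf | whole document | |
76 | min of tf*idf | body | |
77 | min of tf*idf | anchor | |
78 | min of tf*idf | title | |
79 | min of tf*idf | url | |
80 | min of tf*idf | whole document | |
81 | max of tf*idf | body | |
82 | max of tf*idf | anchor | |
83 | max of tf*idf | title | |
84 | max of tf*idf | url | |
85 | max of tf*idf | whole document | |
86 | mean of tf*idf | body | |
87 | mean of tf*idf | anchor | |
88 | mean of tf*idf | title | |
89 | mean of tf*idf | url | |
90 | mean of tf*idf | whole document | |
91 | variance of tf*idf | body | |
92 | variance of tf*idf | anchor | |
93 | variance of tf*idf | title | |
94 | variance of tf*idf | url | |
95 | variance of tf*idf | whole document | |
96 | boolean model | body | |
97 | boolean model | anchor | |
98 | boolean model | title | |
99 | boolean model | url | |
100 | boolean model | whole document | |
101 | vector space model | body | |
102 | vector space model | anchor | |
103 | vector space model | title | |
104 | vector space model | url | |
105 | vector space model | whole document | |
106 | BM25 | body | |
107 | BM25 | anchor | |
108 | BM25 | title | |
109 | BM25 | url | |
110 | BM25 | whole document | |
111 | LMIR.ABS | body | Language model approach for information retrieval (IR) with absolute discounting smoothing |
112 | LMIR.ABS | anchor | |
113 | LMIR.ABS | title | |
114 | LMIR.ABS | url | |
115 | LMIR.ABS | whole document | |
116 | LMIR.DIR | body | Language model approach for IR with Bayesian smoothing using Dirichlet priors |
117 | LMIR.DIR | anchor | |
118 | LMIR.DIR | title | |
119 | LMIR.DIR | url | |
120 | LMIR.DIR | whole document | |
121 | LMIR.JM | body | Language model approach for IR with Jelinek-Mercer smoothing |
122 | LMIR.JM | anchor | |
123 | LMIR.JM | title | |
124 | LMIR.JM | url | |
125 | LMIR.JM | whole document | |
126 | Number of slashes in URL | | |
127 | Length of URL | | |
128 | Inlink number | | |
129 | Outlink number | | |
130 | PageRank | | |
131 | SiteRank | | Site level PageRank |
132 | QualityScore | | The quality score of a web page. The score is output by a web page quality classifier. |
133 | QualityScore2 | | The quality score of a web page. The score is output by a web page quality classifier, which measures the badness of a web page. |
134 | Query-url click count | | The click count of a query-url pair at a search engine in a period |
135 | url click count | | The click count of a url aggregated from user browsing data in a period |
136 | url dwell time | | The average dwell time of a url aggregated from user browsing data in a period |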
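The feature list is regular: features 1 to 125 are 25 descriptions, each computed over the same five streams in the same order, followed by 11 stream-independent features. That regularity makes it easy to map a feature id back to its description. The sketch below encodes the table; the names `describe_feature`, `GROUPED`, and `SINGLE` are illustrative, not part of the release.

```python
from typing import Optional, Tuple

STREAMS = ("body", "anchor", "title", "url", "whole document")
STATS = ("sum", "min", "max", "mean", "variance")

# The 25 per-stream descriptions, in table order (features 1-125).
GROUPED = (
    ["covered query term number", "covered query term ratio",
     "stream length", "IDF (inverse document frequency)"]
    + [f"{s} of term frequency" for s in STATS]
    + [f"{s} of stream length normalized term frequency" for s in STATS]
    + [f"{s} of tf*idf" for s in STATS]
    + ["boolean model", "vector space model", "BM25",
       "LMIR.ABS", "LMIR.DIR", "LMIR.JM"]
)

# Stream-independent features (126-136).
SINGLE = {
    126: "Number of slashes in URL", 127: "Length of URL",
    128: "Inlink number", 129: "Outlink number", 130: "PageRank",
    131: "SiteRank", 132: "QualityScore", 133: "QualityScore2",
    134: "Query-url click count", 135: "url click count",
    136: "url dwell time",
}


def describe_feature(feature_id: int) -> Tuple[str, Optional[str]]:
    """Return (description, stream); stream is None for features 126-136."""
    if 1 <= feature_id <= 125:
        group, stream = divmod(feature_id - 1, len(STREAMS))
        return GROUPED[group], STREAMS[stream]
    if feature_id in SINGLE:
        return SINGLE[feature_id], None
    raise ValueError(f"feature id must be in 1..136, got {feature_id}")


print(describe_feature(108))
```

A lookup like this is handy when inspecting feature importances from a trained ranker, where only the numeric ids survive.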
Reference
You can cite this dataset as below.
@article{DBLP:journals/corr/QinL13,
  author    = {Tao Qin and Tie{-}Yan Liu},
  title     = {Introducing {LETOR} 4.0 Datasets},
  journal   = {CoRR},
  volume    = {abs/1306.2597},
  year      = {2013},
  url       = {http://arxiv.org/abs/1306.2597},
  timestamp = {Mon, 01 Jul 2013 20:31:25 +0200},
  biburl    = {http://dblp.uni-trier.de/rec/bib/journals/corr/QinL13},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}
Release Notes
- The following people have contributed to the construction of the data: Tao Qin, Tie-Yan Liu, Wenkui Ding, Jun Xu, Hang Li.
- We would like to thank the Bing team for their support in creating the datasets. We would also like to thank Nick Craswell for his help with the dataset release.
- If you have any questions or suggestions, please let us know.
- Related links: LETOR3.0 and LETOR4.0 datasets.
People
Tao Qin
Partner Research Manager
Tie-Yan Liu
Distinguished Scientist, Microsoft Research AI for Science