Juan, Yuchin, et al. 2016. "Field-aware Factorization Machines for CTR Prediction."
Click-through rate (CTR) prediction plays an important role in computational advertising. This paper establishes Field-aware Factorization Machines (FFM) as an effective method for classifying large sparse data, including CTR prediction. FFM is a variant of FM that exploits the fact that, in most CTR datasets, features can be grouped into fields.
The paper compares FFM with logistic regression (LM), polynomial regression (Poly2), and FM, using logloss as the evaluation metric. FFM outperforms the other models in terms of logloss, but also requires longer training time than LM and FM.
1. CTR Dataset
For most CTR datasets, features can be grouped into fields. In the example of Table 1, the three features ESPN, Vogue, and NBC belong to the field Publisher, and the other three features Nike, Gucci, and Adidas belong to the field Advertiser. FFM is a variant of FM that utilizes this information.
2. FM
See FM post here.
\[\phi_{FM}(\mathbf{w}, \mathbf{x}) = \underset{j_1=1}{\sum^{n}}\underset{j_2 = j_1+1}{\sum^{n}}(\mathbf{w}_{j_1}\cdot \mathbf{w}_{j_2}) x_{j_1}x_{j_2}\] \[\mathbf{w}_{j_1}\cdot \mathbf{w}_{j_2} = \underset{f=1}{\sum^k} w_{j_1, f}w_{j_2, f}\]where $k \in \mathrm{N}_0^+$ is the hyperparameter that defines the dimensionality of the factorization.
Example
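In the example instance the paper uses in this section (Publisher = ESPN, Advertiser = Nike, Gender = Male), FM learns one latent vector per feature, and the interaction terms are:
\[\phi_{FM}(\mathbf{w}, \mathbf{x}) = \mathbf{w}_{\text{ESPN}}\cdot \mathbf{w}_{\text{Nike}} + \mathbf{w}_{\text{ESPN}}\cdot \mathbf{w}_{\text{Male}} + \mathbf{w}_{\text{Nike}}\cdot \mathbf{w}_{\text{Male}}\]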
3. FFM
\[\phi_{FFM}(\mathbf{w}, \mathbf{x}) = \underset{j_1=1}{\sum^n} \underset{j_2=j_1+1}{\sum^n} (\mathbf{w}_{j_1, f_2} \cdot \mathbf{w}_{j_2, f_1}) x_{j_1}x_{j_2}\]where $f_1$ and $f_2$ are the fields of $j_1$ and $j_2$, respectively.
In FFMs, each feature has several latent vectors. Depending on the field of the other feature in a pair, one of them is used for the inner product.
Example
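For the same instance, FFM gives each feature a separate latent vector for every field of the other features, so the interaction terms become:
\[\phi_{FFM}(\mathbf{w}, \mathbf{x}) = \mathbf{w}_{\text{ESPN}, A}\cdot \mathbf{w}_{\text{Nike}, P} + \mathbf{w}_{\text{ESPN}, G}\cdot \mathbf{w}_{\text{Male}, P} + \mathbf{w}_{\text{Nike}, G}\cdot \mathbf{w}_{\text{Male}, A}\]
To learn the (ESPN, Nike) interaction, $\mathbf{w}_{\text{ESPN}, A}$ is used because Nike belongs to the field Advertiser (A), and $\mathbf{w}_{\text{Nike}, P}$ is used because ESPN belongs to the field Publisher (P).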
3.1 Solving the Optimization Problem
- Use stochastic gradient (SG) methods with AdaGrad (adaptive learning rates).
- The initial values of $\mathbf{w}$ are randomly sampled from a uniform distribution on $[0, \frac{1}{\sqrt{k}}]$.
- The initial values of $G$ (accumulated sum of squared gradient for each coordinate $d=1, \cdots, k$) are set to one in order to prevent a large value of $(G_{j_1, f_2})^{-\frac{1}{2}}_d$.
- At each step of SG a data point $(y, \mathbf{x})$ is sampled for updating $\mathbf{w}_{j_1, f_2}$ and $\mathbf{w}_{j_2, f_1}$.
- Because $\mathbf{x}$ is highly sparse in this application, only the dimensions with non-zero values are updated.
Optimization Problem
\[\underset{\mathbf{w}}{\min} \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2_2 + \underset{i=1}{\sum^m}\log (1+\exp (-y_i \phi_{FFM}(\mathbf{w}, \mathbf{x}_i)))\]
Sub-gradient
\[\mathbf{g}_{j_1, f_2} \equiv \nabla_{\mathbf{w}_{j_1, f_2}}f(\mathbf{w}) = \lambda \cdot \mathbf{w}_{j_1, f_2} + \kappa \cdot \mathbf{w}_{j_2, f_1} x_{j_1} x_{j_2}\\ \mathbf{g}_{j_2, f_1} \equiv \nabla_{\mathbf{w}_{j_2, f_1}}f(\mathbf{w}) = \lambda \cdot \mathbf{w}_{j_2, f_1} + \kappa \cdot \mathbf{w}_{j_1, f_2} x_{j_1} x_{j_2}\]where $\kappa = \frac{\partial \log (1+\exp (-y \phi_{FFM}(\mathbf{w}, \mathbf{x})))}{\partial \phi_{FFM}(\mathbf{w, x})} = \frac{-y}{1+\exp (y \phi_{FFM}(\mathbf{w}, \mathbf{x}))}$
Accumulate the sum of squared gradient
For each coordinate $d=1, \cdots, k$, the sum of squared gradients is accumulated:
\[(G_{j_1, f_2})_d \leftarrow (G_{j_1, f_2})_d + (g_{j_1, f_2})^2_d\\ (G_{j_2, f_1})_d \leftarrow (G_{j_2, f_1})_d + (g_{j_2, f_1})^2_d\]
Update Rule
\[(w_{j_1, f_2})_d \leftarrow (w_{j_1, f_2})_d - \frac{\eta}{\sqrt{(G_{j_1, f_2})_d}} (g_{j_1, f_2})_d\\ (w_{j_2, f_1})_d \leftarrow (w_{j_2, f_1})_d - \frac{\eta}{\sqrt{(G_{j_2, f_1})_d}} (g_{j_2, f_1})_d\]where $\eta$ is a user-specified learning rate.
Algorithm
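The paper summarizes the procedure above as Algorithm 1. As a minimal sketch only (not the authors' implementation; the dict-based weight layout and all names here are my own assumptions), one SG step could look like this in Python:

```python
import numpy as np

def ffm_sg_step(w, G, x, fields, y, lam, eta):
    """One SG step of FFM with AdaGrad, following the equations above.

    w      : dict mapping (feature j, field f) -> latent vector of length k
    G      : dict with the same keys, accumulated squared gradients (initialized to 1)
    x      : dict mapping feature j -> non-zero value x_j (sparse instance)
    fields : dict mapping feature j -> its field f_j
    y      : label in {+1, -1}
    lam    : regularization parameter lambda
    eta    : learning rate
    """
    feats = list(x)

    # phi_FFM(w, x): sum over all pairs of non-zero features
    phi = 0.0
    for a, j1 in enumerate(feats):
        for j2 in feats[a + 1:]:
            f1, f2 = fields[j1], fields[j2]
            phi += np.dot(w[j1, f2], w[j2, f1]) * x[j1] * x[j2]

    # kappa = d log(1 + exp(-y * phi)) / d phi
    kappa = -y / (1.0 + np.exp(y * phi))

    # sub-gradients, AdaGrad accumulation, and updates (non-zero dimensions only)
    for a, j1 in enumerate(feats):
        for j2 in feats[a + 1:]:
            f1, f2 = fields[j1], fields[j2]
            g1 = lam * w[j1, f2] + kappa * w[j2, f1] * x[j1] * x[j2]
            g2 = lam * w[j2, f1] + kappa * w[j1, f2] * x[j1] * x[j2]
            G[j1, f2] = G[j1, f2] + g1 ** 2
            G[j2, f1] = G[j2, f1] + g2 ** 2
            w[j1, f2] = w[j1, f2] - eta / np.sqrt(G[j1, f2]) * g1
            w[j2, f1] = w[j2, f1] - eta / np.sqrt(G[j2, f1]) * g2
```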
3.2 Adding Field Information
- LIBSVM data format: each (feat, val) pair indicates a feature index and its value.
- FFM data format: extends the LIBSVM format; each feature must additionally be assigned its corresponding field (see the formats below).
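Concretely, the two formats are:

```
LIBSVM: label feat1:val1 feat2:val2 ...
FFM:    label field1:feat1:val1 field2:feat2:val2 ...
```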
Categorical Features
The same convention as LIBSVM is applied: features with zero values are not stored. Every categorical feature is transformed into several binary (one-hot) features. To add field information, each original categorical feature is considered as a field.
For example, the paper's instance `Yes P:ESPN A:Nike G:Male` (fields Publisher, Advertiser, Gender) becomes:
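```
Yes P:P-ESPN:1 A:A-Nike:1 G:G-Male:1
```

Each categorical value becomes its own binary feature with value 1, and the original categorical feature serves as its field.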
Numerical features
There are two possible ways to assign fields. A naive way is to treat each feature as a dummy field; another is to discretize each numerical feature into a categorical one.
- Example data (see the sketch below)
- Treat each feature as a dummy field.
- Discretize each numerical feature into a categorical one, then use the same setting as for categorical features to add field information.
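Roughly following the example in the paper (a paper-acceptance task with numerical features AR = accept rate, Hidx = h-index, and Cite = citation count), an instance like

```
Yes AR:45.73 Hidx:2 Cite:3
```

becomes, with dummy fields,

```
Yes AR:AR:45.73 Hidx:Hidx:2 Cite:Cite:3
```

and, after discretizing each numerical feature (e.g., rounding AR to an integer),

```
Yes AR:45:1 Hidx:2:1 Cite:3:1
```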
4. Experiment
The paper compares FFM with logistic regression (LM), polynomial regression (Poly2), and FM on two CTR datasets, using logloss as the evaluation metric.
To avoid over-fitting, it uses "early stopping," which terminates the training process before reaching the best result on the training data, because it finds that, unlike LM or Poly2, FFM is sensitive to the number of epochs.
Dataset
Two CTR datasets, "Criteo" and "Avazu," from Kaggle competitions are used. Because the labels in the test sets are not available, the available data are split into training and validation sets.
Evaluation metric
\[\text{logloss} = \frac{1}{m}\underset{i=1}{\sum^m}\log (1 + \exp (-y_i \phi(\mathbf{w}, \mathbf{x}_i)))\]where $m$ is the number of test instances.
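A minimal Python sketch of this metric (assuming labels in $\{+1, -1\}$ and raw model outputs $\phi$ as inputs):

```python
import numpy as np

def logloss(y, phi):
    """Average logistic loss; y: labels in {+1, -1}, phi: model outputs phi(w, x_i)."""
    # log(1 + exp(-y * phi)), computed stably via logaddexp
    return np.mean(np.logaddexp(0.0, -y * phi))
```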
Results
See the results in Table 3 of the paper.
- FFM outperforms the other models in terms of logloss.
- However, FFM also requires longer training time than LM and FM.
- In contrast, although the logloss of LM is worse than that of the other models, it is significantly faster.
- There is a clear trade-off between logloss and speed.
- FM strikes a good balance between logloss and speed.
Reference
[1] Juan, Yuchin, et al. “Field-aware factorization machines for CTR prediction.” Proceedings of the 10th ACM Conference on Recommender Systems. 2016.
[2] Rendle, Steffen. “Factorization machines.” 2010 IEEE International Conference on Data Mining. IEEE, 2010.