Polynomial kernel

[Figure] Illustration of the mapping φ. On the left, a set of samples in the input space; on the right, the same samples in the feature space, where the polynomial kernel K(x, y) (for some values of the parameters c and d) is the inner product. The hyperplane learned in feature space by an SVM is an ellipse in the input space.

In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing the learning of non-linear models.

Intuitively, the polynomial kernel looks not only at the given features of input samples to determine their similarity, but also at combinations of these features.

Definition

For degree-d polynomials, the polynomial kernel is defined as[1]

$K(x, y) = (x^\mathsf{T} y + c)^d$

where $x$ and $y$ are vectors in the input space, i.e. vectors of features computed from training or test samples, and $c \geq 0$ is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. When $c = 0$, the kernel is called homogeneous.[2] (A further generalized polykernel divides $x^\mathsf{T} y$ by a user-specified scalar parameter $a$.[3])
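A minimal NumPy sketch of this definition (the function name and the example vectors here are illustrative, not from the source):

```python
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel K(x, y) = (x^T y + c)^d."""
    return (np.dot(x, y) + c) ** d

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
print(polynomial_kernel(x, y, c=1.0, d=2))  # (1*3 + 2*0.5 + 1)^2 = 25.0
```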

As a kernel, $K$ corresponds to an inner product in a feature space based on some mapping $\varphi$:

$K(x, y) = \langle \varphi(x), \varphi(y) \rangle$

The nature of $\varphi$ can be seen from an example. Let $d = 2$, so we get the special case of the quadratic kernel. Then

$K(x, y) = \left(\sum_{i=1}^n x_i y_i + c\right)^2 = \sum_{i=1}^n x_i^2 y_i^2 + \sum_{i=2}^n \sum_{j=1}^{i-1} \left(\sqrt{2}\, x_i x_j\right)\left(\sqrt{2}\, y_i y_j\right) + \sum_{i=1}^n \left(\sqrt{2c}\, x_i\right)\left(\sqrt{2c}\, y_i\right) + c^2$

From this it follows that the feature map is given by:

$\varphi(x) = \left\langle x_n^2, \ldots, x_1^2, \sqrt{2}\, x_n x_{n-1}, \ldots, \sqrt{2}\, x_n x_1, \sqrt{2}\, x_{n-1} x_{n-2}, \ldots, \sqrt{2}\, x_{n-1} x_1, \ldots, \sqrt{2}\, x_2 x_1, \sqrt{2c}\, x_n, \ldots, \sqrt{2c}\, x_1, c \right\rangle$

When the input features are binary-valued (booleans), $c = 1$, and the $\sqrt{2}$ factors are ignored, the mapped features correspond to conjunctions of input features.[4]
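As a quick sanity check of this feature map and of the identity $K(x, y) = \langle \varphi(x), \varphi(y) \rangle$, here is a small sketch for $d = 2$ (the vectors and the function name are illustrative):

```python
import numpy as np

def phi(x, c=1.0):
    """Explicit feature map of the quadratic kernel (x^T y + c)^2."""
    n = len(x)
    feats = [x[i] ** 2 for i in range(n)]                  # squared terms
    feats += [np.sqrt(2.0) * x[i] * x[j]                   # pairwise cross terms
              for i in range(1, n) for j in range(i)]
    feats += [np.sqrt(2.0 * c) * x[i] for i in range(n)]   # linear terms
    feats.append(c)                                        # constant term
    return np.array(feats)

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
c = 1.0
# Inner product in feature space equals the kernel value in input space.
assert np.isclose(np.dot(phi(x, c), phi(y, c)), (np.dot(x, y) + c) ** 2)
```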

Practical use

Although the RBF kernel is more popular in SVM classification than the polynomial kernel, the latter is quite popular in natural language processing (NLP).[4][5] The most common degree is d=2, since larger degrees tend to overfit on NLP problems.
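As a usage sketch (assuming scikit-learn; the toy dataset and hyperparameters are illustrative): scikit-learn's polynomial kernel is $(\gamma\, x^\mathsf{T} y + \mathrm{coef0})^{\mathrm{degree}}$, so setting $\gamma = 1$ and $\mathrm{coef0} = c$ recovers the definition above.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # toy 2-D data

# degree=2, gamma=1, coef0=1 corresponds to K(x, y) = (x^T y + 1)^2
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy on the toy data
```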

Various ways of computing the polynomial kernel (both exact and approximate) have been devised as alternatives to the usual non-linear SVM training algorithms, including:

  • full expansion of the kernel prior to training/testing with a linear SVM,[5] i.e. full computation of the mapping φ (a sketch of this approach follows the list)
  • basket mining (using a variant of the apriori algorithm) for the most commonly occurring feature conjunctions in a training set to produce an approximate expansion[6]
  • inverted indexing of support vectors[6][4]
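A sketch of the first approach (assuming scikit-learn; note that PolynomialFeatures produces the degree-≤2 monomials without the $\sqrt{2}$ scaling factors, which only rescales the feature space):

```python
from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Expand each sample into all monomials of degree <= 2, i.e. an explicit
# (unscaled) version of the mapping phi, then train a linear SVM on it.
X_expanded = PolynomialFeatures(degree=2).fit_transform(X)
clf = LinearSVC().fit(X_expanded, y)
print(clf.score(X_expanded, y))
```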

One problem with the polynomial kernel is that it may suffer from numerical instability: when $x^\mathsf{T} y + c < 1$, $K(x, y) = (x^\mathsf{T} y + c)^d$ tends to zero as $d$ increases, whereas when $x^\mathsf{T} y + c > 1$, $K(x, y)$ tends to infinity.[3]
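A quick numerical illustration of this effect (the base values are chosen purely for demonstration):

```python
# When x^T y + c is slightly below or above 1, (x^T y + c)^d degenerates
# toward 0 or blows up as the degree d grows.
for base in (0.9, 1.1):
    for d in (2, 10, 100):
        print(f"(x^T y + c) = {base}, d = {d:3d}: K = {base ** d:.3e}")
```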

References

  1. http://www.cs.tufts.edu/~roni/Teaching/CLT/LN/lecture18.pdf
  2. Template:Cite arXiv
  3. Chih-Jen Lin (2012). Machine learning software: design and practical use. Talk at Machine Learning Summer School, Kyoto.
  4. Yoav Goldberg and Michael Elhadad (2008). splitSVM: Fast, Space-Efficient, non-Heuristic, Polynomial Kernel Computation for NLP Applications. Proc. ACL-08: HLT.
  5. Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard and Chih-Jen Lin (2010). Training and testing low-degree polynomial data mappings via linear SVM. J. Machine Learning Research 11:1471–1490.
  6. T. Kudo and Y. Matsumoto (2003). Fast methods for kernel-based text analysis. Proc. ACL 2003.