Answer by Bharath Hariharan:

Great answers here already, but there are some additional things that I would want to say. So here goes.

What are kernels?

A kernel is a similarity function. It is a function that you, as the domain expert, provide to a machine learning algorithm. It takes two inputs and spits out how similar they are.Suppose your task is to learn to classify images. You have (image, label) pairs as training data. Consider the typical machine learning pipeline: you take your images, you compute features, you string the features for each image into a vector, and you feed these "feature vectors" and labels into a learning algorithm.

Data –> Features –> Learning algorithm

Kernels offer an alternative. Instead of defining a slew of features, you define a single kernel function to compute similarity between images. You provide this kernel, together with the images and labels to the learning algorithm, and out comes a classifier.

Of course, the standard SVM/ logistic regression/ perceptron formulation doesn't work with kernels : it works with feature vectors. How on earth do we use kernels then? Two beautiful mathematical facts come to our rescue:

- Under some conditions, every kernel function can be expressed as a dot product in a (possibly infinite dimensional) feature space ( Mercer's theorem ).
- Many machine learning algorithms can be expressed entirely in terms of dot products.
These two facts mean that I can take my favorite machine learning algorithm, express it in terms of dot products, and then since my kernel is also a dot product in some space, replace the dot product by my favorite kernel. Voila!

Why kernels?

Why kernels, as opposed to feature vectors? One big reason is that in many cases, computing the kernel is easy, but computing the feature vector corresponding to the kernel is really really hard. The feature vector for even simple kernels can blow up in size, and for kernels like the RBF kernel ( k(x,y) = exp( -||x-y||^2), see Radial basis function kernel) the corresponding feature vector is infinite dimensional. Yet, computing the kernel is almost trivial.Many machine learning algorithms can be written to

onlyuse dot products, and then we can replace the dot products with kernels. By doing so, we don't have to use the feature vector at all. This means that we can work with highly complex, efficient-to-compute, and yet high performing kernels without ever having to write down the huge and potentially infinite dimensional feature vector. Thus if not for the ability to use kernel functions directly, we would be stuck with relatively low dimensional, low-performance feature vectors. This "trick" is called thekernel trick( Kernel trick ).

Endnote

I want to clear up two confusions which seem prevalant on this page:

- A function that transforms one feature vector into a higher dimensional feature vector is
nota kernel function. Thus f(x) = [x, x^2 ] isnota kernel. It is simply a new feature vector. Youdo notneed kernels to do this. You need kernels if you want to do this, or more complicated feature transformationswithoutblowing up dimensionality.- A kernel is not restricted to SVMs.
Anylearning algorithm that only works with dot products can be written down using kernels. The idea of SVMs is beautiful, the kernel trick is beautiful, and convex optimization is beautiful, and they stand quite independent.

What are Kernels in Machine Learning and SVM?