Features are usually normalized prior to classification.
Both L1 and L2 normalization are commonly used in the literature.
Could anybody comment on the advantages of the L2 norm over the L1 norm, and vice versa?
Advantages of L2 over L1 norm
- As already stated by aleju in the comments, the derivative of the L2 norm is easily computed, so gradient-based learning methods can be applied directly (see the derivative comparison after this list).
- The L2 cost optimizes for the mean (whereas the L1 cost optimizes for the median; see the mean-vs-median demo after this list), and the mean squared error is often used as a performance measure. This is especially good if you know your data contains no outliers and you want to keep the overall error small.
- The solution is more likely to be unique. This ties in with the previous point: while the mean is a single value, the median can lie anywhere in an interval between two data points and is therefore not unique.
- While L1 regularization can give you a sparse coefficient vector, the non-sparseness of L2 can improve your prediction performance (since you leverage more features instead of simply ignoring them).
- L2 is invariant under rotation: if you rotate a dataset of points, all pairwise Euclidean distances stay the same, so you still get the same results. L1 distances generally change under rotation (see the rotation demo after this list).
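To make the first point concrete, compare the (sub)derivatives of the two penalties (standard calculus facts; w here is just a generic weight vector, not a symbol from the question):

```latex
\frac{\partial}{\partial w_i} \|w\|_2^2 = 2 w_i ,
\qquad
\frac{\partial}{\partial w_i} \|w\|_1 = \operatorname{sign}(w_i) \quad (w_i \neq 0)
```

The squared L2 penalty is smooth everywhere, so plain gradient descent applies directly; the L1 penalty has a kink at 0, where you need subgradient or proximal methods instead.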
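Here is a minimal mean-vs-median demo, assuming NumPy; the data values (including the outlier 100) are made up purely for illustration:

```python
import numpy as np

# Hypothetical 1-D sample with one outlier (values chosen for illustration).
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Evaluate both costs on a dense grid of candidate estimates c.
c = np.linspace(0.0, 110.0, 110001)                      # step 0.001
l2_cost = ((data[None, :] - c[:, None]) ** 2).sum(axis=1)
l1_cost = np.abs(data[None, :] - c[:, None]).sum(axis=1)

print(c[l2_cost.argmin()])           # 22.0 -> the mean, dragged up by the outlier
print(c[l1_cost.argmin()])           # 3.0  -> the median, unaffected by the outlier
print(data.mean(), np.median(data))  # 22.0 3.0

# Uniqueness: for an even number of points, e.g. [1, 2, 3, 4], every c in the
# interval [2, 3] attains the same minimal L1 cost (4.0), so the L1 minimizer
# is a whole interval, while the L2 minimizer (the mean) is unique.
```

The same demo also illustrates the outlier-robustness advantage of L1 listed further down.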
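And a quick numerical check of the rotation claim, again a NumPy sketch with an arbitrary angle:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))            # five points in the plane
theta = 0.7                            # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Xr = X @ R.T                           # rotate every point

d = X[0] - X[1]                        # difference of two original points
dr = Xr[0] - Xr[1]                     # difference of the rotated points
print(np.linalg.norm(d), np.linalg.norm(dr))  # equal up to rounding: L2 is preserved
print(np.abs(d).sum(), np.abs(dr).sum())      # generally differ: L1 is not
```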
Advantages of L1 over L2 norm
- The L1 norm prefers sparse coefficient vectors (explanation on Quora). This means the L1 norm performs feature selection: you can delete all features whose coefficient is 0 (see the Lasso/Ridge sketch after this list). Reducing the dimensionality this way is useful in many cases.
- The L1 norm optimizes for the median rather than the mean (see the mean-vs-median demo above) and is therefore much less sensitive to outliers.
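To illustrate the sparsity bullets in both lists, here is a hypothetical toy comparison of scikit-learn's Lasso (L1 penalty) and Ridge (L2 penalty); the data, the alpha values, and the coefficient pattern are all invented for the sketch:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: 10 features, only the first 3 actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_coef + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print(np.round(lasso.coef_, 2))  # irrelevant coefficients typically driven exactly to 0
print(np.round(ridge.coef_, 2))  # all 10 coefficients shrunk but nonzero
```

With the L1 fit you could drop every feature whose coefficient is exactly 0; the L2 fit keeps a small weight on all of them, which is the non-sparseness that can help prediction.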
More sources:
- The same question on Quora
- Another one
If you are working with inverse problems, an L1 penalty will return a sparser solution, while an L2 penalty spreads the estimate over many correlated components and returns a dense solution (see the sketch below).
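A rough sketch of that behaviour on an underdetermined toy system (the sizes, sparsity pattern, and alpha values are made up; Lasso and Ridge stand in here for generic L1- and L2-penalized inversion):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical inverse problem: 20 measurements, 50 unknowns,
# where the true signal has only 4 nonzero entries.
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 50))
x_true = np.zeros(50)
x_true[[3, 17, 28, 44]] = [2.0, -1.5, 1.0, 3.0]
b = A @ x_true

x_l1 = Lasso(alpha=0.01, fit_intercept=False, max_iter=100_000).fit(A, b).coef_
x_l2 = Ridge(alpha=0.01, fit_intercept=False).fit(A, b).coef_

print((np.abs(x_l1) > 1e-6).sum())  # a handful of nonzeros, close to the sparse truth
print((np.abs(x_l2) > 1e-6).sum())  # 50: the L2 solution spreads over all unknowns
```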