SVM with scikit-learn


Classification with SVM


Support Vector Machine (SVM) is a classification technique that tries to divide the available data geometrically. If the input data has N features, each sample is treated as a point in an N-dimensional space. The algorithm then looks for an (N-1)-dimensional hyperplane that separates the groups. A good separation is one that lies as far as possible from both groups. The distance from a group could be defined in several ways, such as the distance from the closest points, the distance from the center, or the mean of all distances. A standard SVM maximizes the distance to the closest points of each group, which are called the support vectors.
In effect, this complements the Nearest Neighbor algorithm: problems that are easy for Nearest Neighbor tend to be difficult for SVM, and vice versa.
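The idea of a separating hyperplane with maximum distance from both groups can be seen on a small example. The following is a minimal sketch, not part of the original walkthrough: it assumes synthetic two-dimensional data from make_blobs (the centers and spread are arbitrary illustration values) and a linear kernel with a large C, which approximates a hard-margin SVM. For a linear SVM the margin width is 2 / ||w||, where w is the normal vector of the separating line.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Two well-separated 2-D groups (synthetic data, chosen only for illustration).
X, y = make_blobs(
    n_samples=60, centers=[[-3, -3], [3, 3]], cluster_std=0.8, random_state=0
)

# A linear kernel with a large C approximates a hard-margin SVM.
clf = svm.SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the separating line
b = clf.intercept_[0]   # offset of the separating line
margin_width = 2.0 / np.linalg.norm(w)

print("separating line: %.3f*x1 + %.3f*x2 + %.3f = 0" % (w[0], w[1], b))
print("margin width:", margin_width)
print("number of support vectors:", len(clf.support_vectors_))
```

Only the few points closest to the line end up as support vectors; moving any of the other points (without crossing the margin) would not change the result.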

Implementation

A Python implementation of SVM is quite simple with scikit-learn. We start by importing the required libraries:
from sklearn.datasets import load_iris
from sklearn import svm
from sklearn.model_selection import train_test_split
Then we get the Iris data from the built-in datasets of scikit-learn and split it into training and test sets:
iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, stratify=iris.target, random_state=50)
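It can help to check what the split actually produced. The sketch below repeats the same call and prints the sizes of the two parts. It relies on the default behaviour of train_test_split, which holds out 25% of the rows when no test_size is given, and on stratify keeping the class proportions roughly equal in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=50
)

# With no test_size given, 25% of the 150 rows are held out for testing.
print(X_train.shape, X_test.shape)   # (112, 4) (38, 4)

# stratify keeps the three species in nearly equal proportion in both parts.
print(np.bincount(Y_train), np.bincount(Y_test))
```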
The next step is to instantiate a model and train it:
model = svm.SVC()
model.fit(X_train, Y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
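The printed summary above comes from an older scikit-learn release. Recent versions print only the parameters that differ from the defaults, and the default gamma is now 'scale' rather than 'auto'. A hedged way to see the settings on any version is get_params(), and the fitted model also exposes the support vectors it selected. The sketch below assumes the same data split as above.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=50
)

model = svm.SVC()
model.fit(X_train, Y_train)

# Full parameter set, independent of how the installed version prints the model.
params = model.get_params()
print(params["kernel"], params["C"], params["gamma"])

# Number of support vectors per class, and the support vectors themselves.
print(model.n_support_)
print(model.support_vectors_.shape)
```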
We can then check the accuracy of the trained model on the training and test sets:
model.score(X_train, Y_train)
model.score(X_test, Y_test)
0.9821428571428571
0.97368421052631582
With about 98% accuracy on the training set and 97% on the test set, this is a good model.
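A single accuracy number does not show which classes get confused. As a possible follow-up, and not something the walkthrough above does, the sketch below uses confusion_matrix and classification_report from sklearn.metrics on the same split to break the test result down per species.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=50
)

model = svm.SVC()
model.fit(X_train, Y_train)

# Rows are the true species, columns are the predicted species.
Y_pred = model.predict(X_test)
cm = confusion_matrix(Y_test, Y_pred)
print(cm)

# Precision, recall and F1 score for each species.
print(classification_report(Y_test, Y_pred, target_names=iris.target_names))
```

The diagonal of the matrix counts correct predictions, so the sum of the diagonal divided by the total equals the value returned by model.score(X_test, Y_test).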
But things may not always be so simple. If the features do not separate the classes well, there may be no good hyperplane passing between them. In that case, the scores will be poor no matter how much we tune the model. We then have to rework the features so that the data separates properly or, if that is impossible, look for another algorithm. Hence it is important to have a good understanding of the data we have.
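One way to picture reworking the features is a dataset made of two concentric rings. This is a minimal sketch under assumed conditions: the data comes from make_circles, and the helper add_radius is a hypothetical name introduced here for illustration. No straight line separates the rings in the original two features, but adding the squared distance from the origin as a third feature makes a separating plane possible.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

# Two concentric rings: not separable by a straight line in (x1, x2).
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Linear SVM on the raw features.
linear_raw = svm.SVC(kernel="linear").fit(X_train, y_train)
raw_score = linear_raw.score(X_test, y_test)


def add_radius(points):
    """Hypothetical helper: append x1^2 + x2^2 as a third feature."""
    r2 = (points ** 2).sum(axis=1, keepdims=True)
    return np.hstack([points, r2])


# The same linear SVM on the reworked features.
linear_reworked = svm.SVC(kernel="linear").fit(add_radius(X_train), y_train)
reworked_score = linear_reworked.score(add_radius(X_test), y_test)

print("raw features:     ", raw_score)
print("reworked features:", reworked_score)
```

A non-linear kernel such as the default 'rbf' can achieve a similar effect without building the extra feature by hand, which is one reason to try different kernels before giving up on SVM.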