Scikit-learn simple training example

The main point of preparing training data is to generate a matrix (X) with features as columns and samples as rows from the provided data, which can be text or any other form of data.

After generating the matrix (X), we define an output (y) for each sample (row); the model can then predict the output for new input data. The input data must have the same number of features as the training data.
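As a rough sketch (with made-up numbers, just to illustrate the shapes): each row of X is one sample, each column is one feature, and y holds one label per row.

import numpy as np

# hypothetical feature matrix: 3 samples (rows) x 2 features (columns)
X = np.array([[1, 0],
              [0, 2],
              [1, 1]])
# one label per sample (row)
y = np.array([1, 0, 1])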

Training data

import pandas as pd

# toy corpus and labels: is the text a greeting?
simple_train = ['hello', 'hi', 'good morning', 'good afternoon', 'it is raining', 'I am playing baseball', 'watching tv']
greeting_train = ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'no']

greeting_df = pd.DataFrame({
    'text': simple_train,
    'greeting': greeting_train,
})

# encode the labels as integers for the model
greeting_df['label_greeting'] = greeting_df.greeting.map({'no': 0, 'yes': 1})
greeting_df.head()
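Running greeting_df.head() should show roughly the following first five rows:

             text greeting  label_greeting
0           hello      yes               1
1              hi      yes               1
2    good morning      yes               1
3  good afternoon      yes               1
4   it is raining       no               0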

Splitting data

X = greeting_df.text
y = greeting_df.label_greeting

# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
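With the default test_size of 0.25, the seven samples split into five training and two test samples, so the prints should show (5,), (2,), (5,), (2,). As a small optional tweak (not in the original notebook), a stratified split keeps both classes represented in the training and test sets:

# optional: stratify so both classes appear in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y)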

Training model

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# learn the vocabulary and build the document-term matrix for the training texts
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# train the model using X_train_dtm
nb.fit(X_train_dtm, y_train)
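To see what the vectorizer actually learned, you can inspect the vocabulary and the dense form of the document-term matrix (a quick sketch; the exact tokens depend on which samples land in X_train, and older scikit-learn versions use vect.get_feature_names() instead):

# inspect the learned vocabulary and the word counts per training sample
print(vect.get_feature_names_out())
print(X_train_dtm.toarray())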

Model testing

# transform the test texts using the already-fitted vectorizer
X_test_dtm = vect.transform(X_test)

# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)
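Accuracy alone can hide which class is being misclassified; a confusion matrix (a small addition, not in the original notebook) breaks the test predictions down per class:

# rows are true labels, columns are predicted labels
print(metrics.confusion_matrix(y_test, y_pred_class))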

Predict

test = ['sing']
# use transform (not fit_transform) so the new text gets the same features as the training data
test_xtm = vect.transform(test)
test_pred = nb.predict(test_xtm)
test_pred
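To see how confident the model is, not just which class it picks, you can also ask for class probabilities (a small extra, assuming the same fitted nb and vect as above):

# probability of each class, columns ordered as in nb.classes_
print(nb.classes_)
print(nb.predict_proba(test_xtm))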

Github jupyter notebook

SK learn - Youtube link
