Everything you need to know to use CountVectorizer effectively in Sklearn
The most basic data processing that any Natural Language Processing (NLP) project requires is converting the text data to numeric data. As long as the data is in text form, we can't do any kind of computation on it.
There are several methods available for this text-to-numeric data conversion. This tutorial will explain one of the most basic vectorizers, the CountVectorizer method in the scikit-learn library.
This method is very simple. It takes the frequency of occurrence of each word as the numeric value. An example will make it clear.
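Before bringing in scikit-learn, the counting idea can be sketched in plain Python. This is only a rough illustration: the helper name `count_words` is hypothetical, and the tokenizer (lowercase everything, then keep runs of two or more word characters) is an assumption that approximates CountVectorizer's default behavior rather than reproducing it exactly.

```python
import re
from collections import Counter


def count_words(text):
    """Count how often each word occurs in a string (illustrative helper)."""
    # Assumption: lowercase the text and keep tokens of two or more
    # word characters, so single letters such as "I" are dropped.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


counts = count_words("I love my aunt. My aunt loves tea.")
print(sorted(counts.items()))
```

Each word becomes a key and its frequency becomes the value, which is the same information CountVectorizer stores as one column per word.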
In the following code block:
- We'll import the CountVectorizer method.
- Call the method.
- Fit the text data to the CountVectorizer method and convert the result to an array.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# This is the text to be vectorized
text = ["Hello Everyone! This is Lilly. My aunt's name is also Lilly. I love my aunt. "
        "I am trying to learn how to use count vectorizer."]
cv = CountVectorizer()  # create the vectorizer with default settings
count_matrix = cv.fit_transform(text)  # learn the vocabulary and count each word
cnt_arr = count_matrix.toarray()  # convert the sparse matrix to a dense array
cnt_arr
Output:
array([[1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1]],
dtype=int64)
Here I have the numeric values representing the text data above.
How do we know which values represent which words in the text?
To make that clear, it will be helpful to convert the array into a DataFrame where the column names will be the words themselves.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
cnt_df = pd.DataFrame(data=cnt_arr, columns=cv.get_feature_names_out())
cnt_df
Now, it shows clearly. The value of the word 'also' is 1, which means 'also' appeared only once in the text. The word 'aunt' came twice in the text. So, the value of the word 'aunt' is 2.
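If only the count for a single word is needed, the DataFrame is not strictly required. The sketch below relies on the fitted vectorizer's `vocabulary_` attribute, which maps each learned word to its column index, and looks up 'aunt' and 'also' directly. It refits on the same text so that it runs on its own; the names `aunt_count` and `also_count` are just illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["Hello Everyone! This is Lilly. My aunt's name is also Lilly. "
        "I love my aunt. I am trying to learn how to use count vectorizer."]

cv = CountVectorizer()
cnt_arr = cv.fit_transform(text).toarray()

# vocabulary_ maps each learned word to its column index in the count matrix
aunt_count = int(cnt_arr[0, cv.vocabulary_["aunt"]])
also_count = int(cnt_arr[0, cv.vocabulary_["also"]])
print(aunt_count, also_count)
```

This should agree with the DataFrame columns above: 2 for 'aunt' and 1 for 'also'.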
In the last example, all the sentences were in one string. So, we got only one row of data for four sentences. Let's rearrange the text and…