Commit dcc11f1

count vectorizer

1 parent 057170c commit dcc11f1

1 file changed: 203 additions & 0 deletions
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Author : Sanjoy Biswas\n",
    "### Topic : Count Vectorizer\n",
    "### Email : [email protected]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scikit-learn's CountVectorizer transforms a corpus of text into a vector of term/token counts. It can also preprocess the text before generating the vector representation, which makes it a highly flexible feature-extraction module for text."
   ]
  },
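  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of that flexibility (cell added for illustration, not part of the original notebook): `stop_words`, `lowercase`, and `ngram_range` are standard CountVectorizer parameters, and the two-sentence corpus below is invented."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: built-in preprocessing options of CountVectorizer.\n",
    "# The corpus is made up for illustration.\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "docs = ['The cat sat on the mat', 'The dog sat on the log']\n",
    "cv_flex = CountVectorizer(stop_words='english',  # drop common English words\n",
    "                          lowercase=True,        # fold case before counting\n",
    "                          ngram_range=(1, 2))    # count unigrams and bigrams\n",
    "counts = cv_flex.fit_transform(docs)\n",
    "print(sorted(cv_flex.vocabulary_))  # learned unigram and bigram terms"
   ]
  },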
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Methods\n",
    "build_analyzer(): Return a callable that handles preprocessing, tokenization and n-grams generation.\n",
    "\n",
    "build_preprocessor(): Return a function to preprocess the text before tokenization.\n",
    "\n",
    "build_tokenizer(): Return a function that splits a string into a sequence of tokens.\n",
    "\n",
    "decode(doc): Decode the input into a string of unicode symbols.\n",
    "\n",
    "fit(raw_documents[, y]): Learn a vocabulary dictionary of all tokens in the raw documents.\n",
    "\n",
    "fit_transform(raw_documents[, y]): Learn the vocabulary dictionary and return the document-term matrix.\n",
    "\n",
    "get_feature_names(): Array mapping from feature integer indices to feature names.\n",
    "\n",
    "get_params([deep]): Get parameters for this estimator.\n",
    "\n",
    "get_stop_words(): Build or fetch the effective stop words list.\n",
    "\n",
    "inverse_transform(X): Return terms per document with nonzero entries in X.\n",
    "\n",
    "set_params(**params): Set the parameters of this estimator.\n",
    "\n",
    "transform(raw_documents): Transform documents to a document-term matrix."
   ]
  },
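  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A short illustration (cell added, not from the original notebook) of a few of the methods listed above; the toy sentence is invented for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: exercising a few of the methods listed above.\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "cv_demo = CountVectorizer(stop_words='english')\n",
    "X_demo = cv_demo.fit_transform(['Data science is fun, data is useful'])\n",
    "\n",
    "analyzer = cv_demo.build_analyzer()       # preprocess + tokenize + n-grams\n",
    "print(analyzer('Data science is fun'))    # -> ['data', 'science', 'fun']\n",
    "print(len(cv_demo.get_stop_words()))      # size of the English stop-word list\n",
    "print(cv_demo.inverse_transform(X_demo))  # terms with nonzero counts per doc"
   ]
  },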
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Word Counts with CountVectorizer\n",
    "CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.\n",
    "\n",
    "You can use it as follows:\n",
    "\n",
    "1. Create an instance of the CountVectorizer class.\n",
    "2. Call the fit() function in order to learn a vocabulary from one or more documents.\n",
    "3. Call the transform() function on one or more documents as needed to encode each as a vector.\n",
    "\n",
    "The encoded vector has the length of the entire vocabulary and holds an integer count for the number of times each word appears in the document. A sketch of this fit-then-transform usage follows in the next cell.\n",
    "\n",
    "Because these vectors contain many zeros, we call them sparse. The scipy.sparse package provides an efficient way of handling such vectors.\n",
    "\n",
    "The vectors returned by transform() are sparse, and you can convert them back to NumPy arrays for inspection by calling toarray()."
   ]
  },
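  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch (cell added here) of the three steps above; the training documents and the new document are invented for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: learn the vocabulary with fit(), then encode a new\n",
    "# document with transform(); the documents are made up.\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "train_docs = ['the quick brown fox', 'the lazy dog']\n",
    "cv_two_step = CountVectorizer()\n",
    "cv_two_step.fit(train_docs)  # step 2: learn the vocabulary\n",
    "\n",
    "new_vec = cv_two_step.transform(['the fox and the dog'])  # step 3: encode\n",
    "print(new_vec.toarray())  # 'and' is ignored: it is not in the vocabulary"
   ]
  },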
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = ['Hey welcome to datascience',\n",
    "           'This is Data Science Course',\n",
    "           'Working as data scientist']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Hey welcome to datascience',\n",
       " 'This is Data Science Course',\n",
       " 'Working as data scientist']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = CountVectorizer()\n",
    "x = cv.fit_transform(dataset)  # learn the vocabulary and encode in one step"
   ]
  },
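  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result of fit_transform() is a scipy.sparse matrix; a quick look at its type, shape, and number of stored entries (cell added for illustration):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Only nonzero counts are actually stored in the sparse matrix.\n",
    "print(type(x))   # a scipy.sparse matrix\n",
    "print(x.shape)   # (3 documents, 12 vocabulary terms)\n",
    "print(x.nnz)     # number of nonzero entries"
   ]
  },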
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['as',\n",
       " 'course',\n",
       " 'data',\n",
       " 'datascience',\n",
       " 'hey',\n",
       " 'is',\n",
       " 'science',\n",
       " 'scientist',\n",
       " 'this',\n",
       " 'to',\n",
       " 'welcome',\n",
       " 'working']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cv.get_feature_names()"
   ]
  },
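  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note (added): get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions the equivalent call is get_feature_names_out(), which returns the same terms as a NumPy array. A version-agnostic sketch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get_feature_names() was removed in scikit-learn 1.2; this call\n",
    "# works on both old and new versions.\n",
    "names = (cv.get_feature_names_out() if hasattr(cv, 'get_feature_names_out')\n",
    "         else cv.get_feature_names())\n",
    "print(list(names))"
   ]
  },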
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],\n",
       "       [0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0],\n",
       "       [1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]], dtype=int64)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x.toarray()  # rows = documents, columns = vocabulary terms (alphabetical)"
   ]
  },
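  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The column order in the array above comes from cv.vocabulary_, the learned term-to-column-index mapping (cell added for illustration):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# vocabulary_ maps each term to its column index in the matrix above.\n",
    "print(sorted(cv.vocabulary_.items(), key=lambda kv: kv[1]))"
   ]
  },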
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
