count vectorizer

imsanjoykb · imsanjoykb · commit dcc11f10034a · 2020-10-03T17:37:03.000Z
diff --git a/Feature Engineering/Count Vectrozier/Count Vectorizer.ipynb b/Feature Engineering/Count Vectrozier/Count Vectorizer.ipynb
@@ -0,0 +1,203 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Author : Sanjoy Biswas\n",
+    "### Topic : Count Vectorizer\n",
+    "### Email : sanjoy.eee32@gmail.com"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Scikit-learn’s CountVectorizer transforms a corpus of text documents into vectors of term (token) counts. It can also preprocess the text before generating the vector representation, which makes it a flexible feature-representation module for text."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Methods\n",
+    "build_analyzer(): Return a callable that handles preprocessing, tokenization and n-gram generation.\n",
+    "\n",
+    "build_preprocessor(): Return a function to preprocess the text before tokenization.\n",
+    "\n",
+    "build_tokenizer(): Return a function that splits a string into a sequence of tokens.\n",
+    "\n",
+    "decode(doc): Decode the input into a string of unicode symbols.\n",
+    "\n",
+    "fit(raw_documents[, y]): Learn a vocabulary dictionary of all tokens in the raw documents.\n",
+    "\n",
+    "fit_transform(raw_documents[, y]): Learn the vocabulary dictionary and return the document-term matrix.\n",
+    "\n",
+    "get_feature_names(): Return a list mapping feature integer indices to feature names. Deprecated in scikit-learn 1.0 and removed in 1.2; newer versions use get_feature_names_out().\n",
+    "\n",
+    "get_params([deep]): Get parameters for this estimator.\n",
+    "\n",
+    "get_stop_words(): Build or fetch the effective stop words list.\n",
+    "\n",
+    "inverse_transform(X): Return terms per document with nonzero entries in X.\n",
+    "\n",
+    "set_params(**params): Set the parameters of this estimator.\n",
+    "\n",
+    "transform(raw_documents): Transform documents to a document-term matrix."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Word Counts with CountVectorizer\n",
+    "The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.\n",
+    "\n",
+    "You can use it as follows:\n",
+    "\n",
+    "1. Create an instance of the CountVectorizer class.\n",
+    "2. Call the fit() method to learn a vocabulary from one or more documents.\n",
+    "3. Call the transform() method on one or more documents as needed to encode each as a vector.\n",
+    "\n",
+    "Each encoded vector has one entry per vocabulary word, holding an integer count of how many times that word appeared in the document.\n",
+    "\n",
+    "Because these vectors contain mostly zeros, we call them sparse. SciPy provides an efficient way of handling sparse data in the scipy.sparse package.\n",
+    "\n",
+    "A call to transform() returns a SciPy sparse matrix with one row per document. To inspect it more easily, convert it to a dense NumPy array by calling the toarray() method."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.feature_extraction.text import CountVectorizer"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = ['Hey welcome to datascience',\n",
+    "          'This is Data Science Course',\n",
+    "          'Working as data scientist']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['Hey welcome to datascience',\n",
+       " 'This is Data Science Course',\n",
+       " 'Working as data scientist']"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cv = CountVectorizer()\n",
+    "x = cv.fit_transform(dataset)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['as',\n",
+       " 'course',\n",
+       " 'data',\n",
+       " 'datascience',\n",
+       " 'hey',\n",
+       " 'is',\n",
+       " 'science',\n",
+       " 'scientist',\n",
+       " 'this',\n",
+       " 'to',\n",
+       " 'welcome',\n",
+       " 'working']"
+      ]
+     },
+     "execution_count": 13,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "cv.get_feature_names()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([[0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],\n",
+       "       [0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0],\n",
+       "       [1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]], dtype=int64)"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "x.toarray()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
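
Note on the notebook above: it calls fit_transform() on its three training sentences but never shows step 3 of its own walkthrough, encoding documents that were not seen during fitting. A minimal sketch of that step follows. The `new_docs` sentences are invented for illustration and are not part of the commit, and the `hasattr` check is an assumption about the reader's environment, since `get_feature_names()` was removed in scikit-learn 1.2 in favor of `get_feature_names_out()`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same three training documents as in the notebook.
dataset = ['Hey welcome to datascience',
           'This is Data Science Course',
           'Working as data scientist']

cv = CountVectorizer()
cv.fit(dataset)  # step 2: learn the vocabulary only, no encoding yet

# vocabulary_ maps each lowercased token to its column index.
vocab = cv.vocabulary_

# Feature names in column order; fall back to the older method name
# on scikit-learn releases that predate get_feature_names_out().
if hasattr(cv, "get_feature_names_out"):
    feature_names = list(cv.get_feature_names_out())
else:
    feature_names = cv.get_feature_names()

# Hypothetical unseen documents, made up for this illustration.
new_docs = ['Data science is data work',
            'Welcome to the course']

# Step 3: encode the new documents with the already-learned vocabulary.
encoded = cv.transform(new_docs)  # SciPy sparse matrix, one row per document
dense = encoded.toarray()         # dense NumPy array for inspection

print(feature_names)
print(dense)
```

Tokens that are not in the learned vocabulary (here "work" and "the") are dropped during transform(), so every encoded row keeps the same 12 columns as the training matrix, and a repeated word such as "data" in the first new sentence yields a count of 2 rather than 1.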