"A chunker, in the context of natural language processing (NLP), is a program or algorithm that identifies and segments syntactic units (chunks) in a sentence. These chunks typically consist of words that form a meaningful unit, such as noun phrases, verb phrases, or prepositional phrases.\n",
"\n",
"The process of chunking involves dividing a sentence into chunks based on the grammatical structure and relationships between words. It's an intermediate step between part-of-speech tagging and full parsing. While part-of-speech tagging assigns a part-of-speech label to each word in a sentence, chunking goes a step further by grouping words into meaningful units.\n",
"\n",
"Here's an example to illustrate chunking:\n",
"\n",
"**Sentence:** \"The black cat sat on the windowsill.\"\n",
"In this example, the chunker identifies and groups the words into three chunks: a noun phrase (\"The black cat\"), a verb (\"sat\"), and a prepositional phrase (\"on the windowsill\").\n",
"\n",
"Chunking is often used in various NLP applications, including information extraction, named entity recognition, and shallow parsing. Different techniques and tools can be employed for chunking, such as regular expressions, rule-based systems, or machine learning approaches."
],
"metadata": {
"id": "4wcATIQT_vGF"
}
},
{
"cell_type": "markdown",
"source": [
"**Chunking in Natural Language Processing (NLP) - Explanation with Output:**\n",
"\n",
"```python\n",
"import nltk\n",
"from nltk.chunk import RegexpParser\n",
"from nltk.tokenize import word_tokenize\n",
"\n",
"# Download the tokenizer and POS-tagger models (needed on first run)\n",
"nltk.download('punkt')\n",
"nltk.download('averaged_perceptron_tagger')\n",
"\n",
"# Sample sentence\n",
"sentence = \"The cat sat on the mat.\"\n",
"\n",
"# Tokenize the sentence\n",
"words = word_tokenize(sentence)\n",
"\n",
"# Perform POS tagging\n",
"pos_tags = nltk.pos_tag(words)\n",
"\n",
"# Define chunking patterns using regular expressions\n",
"chunking_patterns = r\"\"\"\n",
" NP: {<DT>?<JJ>*<NN>} # chunk an optional determiner, adjectives, and a noun\n",
" PP: {<IN><NP>} # chunk prepositions followed by NP\n",
" VP: {<VB.*><NP|PP>*} # chunk verbs and their arguments\n",
"\"\"\"\n",
"\n",
"# Create a chunk parser using the defined patterns\n",
"In the output, the sentence is parsed into a tree structure, where:\n",
"- **NP (Noun Phrase):** \"The cat\"\n",
"- **VP (Verb Phrase):** \"sat on the mat\"\n",
"- **PP (Prepositional Phrase):** \"on the mat\"\n",
"\n",
"Each phrase is structured hierarchically, providing a syntactic representation of the sentence's components based on the defined chunking patterns. A bracketed sketch of this tree follows.\n",
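"\n",
"In bracketed form (a sketch of the structure rather than `pretty_print()`'s exact ASCII rendering), the parse corresponds to:\n",
"\n",
"```\n",
"(S\n",
"  (NP The/DT cat/NN)\n",
"  (VP sat/VBD (PP on/IN (NP the/DT mat/NN)))\n",
"  ./.)\n",
"```"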
],
"metadata": {
"id": "4GzKEEIdQ67m"
}
},
{
"cell_type": "markdown",
"source": [
"## Explanation of Code\n",
"\n",
"1. **Import Libraries and Download Data:**\n",
"   - Import necessary libraries including NLTK modules and download required data.\n",
"\n",
"2. **Load and Train POS Tagger:**\n",
"   - Load the Penn Treebank corpus for training.\n",
"   - Train a POS tagger using a combination of a default tagger (for unknown words) and a Unigram tagger.\n",
"\n",
"3. **Tokenize and Tag Sample Sentence:**\n",
"   - Tokenize the sample sentence into words.\n",
"   - Use the trained POS tagger to tag the words in the sample sentence.\n",
"\n",
"4. **Define Chunk Grammar:**\n",
"   - Define a regular expression-based chunk grammar.\n",
"   - NP (Noun Phrase): Sequences of determiners (DT), adjectives (JJ), and nouns (NN).\n",
"   - VP (Verb Phrase): Verbs (VB) and their associated noun phrases (NP) or prepositional phrases (PP).\n",
"\n",
"5. **Create Chunk Parser:**\n",
"   - Create a chunk parser using the defined regular expression-based grammar.\n",
"\n",
"6. **Apply Chunk Parser:**\n",
"   - Apply the chunk parser to the POS-tagged sentence, creating a tree structure representing the chunks.\n",
"\n",
"7. **Visualize and Print Tree Structure:**\n",
"   - Optionally, visualize the resulting tree structure.\n",
"   - Print the tree structure using the `pretty_print()` method.\n",
"\n",
"This code demonstrates chunking based on a regular expression grammar, identifying and grouping noun phrases (NP) and verb phrases (VP) in the sample sentence. The resulting tree structure illustrates the identified chunks in the sentence."
"## Generate Regular Expressions for a given text."
],
"metadata": {
"id": "mH3wtwcqd1W7"
}
},
{
"cell_type": "markdown",
"source": [
"## Regular Expression\n",
"A **regular expression** (regex or regexp) is a sequence of characters that defines a search pattern.<br>\n",
"It is a powerful tool used in Natural Language Processing (NLP) to search for specific patterns or structures in text data. <br>\n",
"Regular expressions are highly expressive and can match many patterns, including numbers, dates, email addresses, and phone numbers. <br>\n",
"They are useful for numerous practical day-to-day tasks that a data scientist encounters, such as data pre-processing, rule-based information mining systems, pattern matching, text feature engineering, web scraping, data extraction, etc.<br>\n",
"\n",
"Here are some key concepts related to regular expressions in NLP:\n",
"\n",
"- A regular expression is a sequence of characters that is used to find or replace patterns embedded in the text.\n",
"- Regular expressions are used to recognize different strings of characters.\n",
"- Raw strings (prefixed with `r`) are used in regular expressions so that backslashes are treated as literal characters.\n",
"- The `re` module in Python provides functions for working with regular expressions.\n",
"- The `re.findall()` function searches for all occurrences that match a given pattern.\n",
"- The `re.sub()` function replaces every match of the RE pattern with the given replacement text.\n",
"- The `re.match()` function matches the RE pattern at the beginning of a string, with optional flags; a short sketch of all three follows.\n",
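"\n",
"A minimal sketch of these three functions (the sample text and pattern are illustrative, not from the notebook):\n",
"\n",
"```python\n",
"import re\n",
"\n",
"text = 'Call 555-1234 or 555-9876 today.'\n",
"\n",
"# re.findall: all non-overlapping matches of the pattern\n",
"print(re.findall(r'\\d{3}-\\d{4}', text))          # ['555-1234', '555-9876']\n",
"\n",
"# re.sub: replace every match with the given text\n",
"print(re.sub(r'\\d{3}-\\d{4}', 'XXX-XXXX', text))  # Call XXX-XXXX or XXX-XXXX today.\n",
"\n",
"# re.match: match only at the beginning of the string\n",
"print(re.match(r'\\d{3}-\\d{4}', text))            # None\n",
"```"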
],
"metadata": {
"id": "Ui1nFXxbZZF1"
}
},
{
"cell_type": "markdown",
"source": [
"## Explanation Of The Code\n",
54
+
"\n",
55
+
"This code defines a function called `generate_regex` that takes a text input, escapes special characters, and generates a regular expression pattern based on the input text. Here's a breakdown of the code:\n",
56
+
"\n",
57
+
"### Importing the Required Library\n",
58
+
"```python\n",
59
+
"import re\n",
60
+
"```\n",
61
+
"This line imports the `re` module, which stands for regular expressions, and will be used for working with regular expressions in the code.\n",
62
+
"\n",
63
+
"### Function for Generating Regular Expression\n",
64
+
"```python\n",
65
+
"def generate_regex(text):\n",
66
+
" regex = re.escape(text)\n",
67
+
" regex = regex.replace(r'\\ ', r'\\s+')\n",
68
+
" return regex\n",
69
+
"```\n",
70
+
"This function, `generate_regex`, takes a text input and performs the following steps:\n",
71
+
"\n",
72
+
"1. `re.escape(text)`: This function escapes special characters in the input text, ensuring that they are treated as literal characters in the regular expression.\n",
73
+
"\n",
74
+
"2. `regex.replace(r'\\ ', r'\\s+')`: This line replaces escaped space characters (`\\ `) with `\\s+`, where `\\s` represents any whitespace character, and `+` means one or more occurrences. This modification allows for flexibility in matching multiple spaces in the input text.\n",
75
+
"\n",
76
+
"3. The final regular expression is returned.\n",
77
+
"\n",
78
+
"### Main Section\n",
79
+
"```python\n",
80
+
"if __name__ == '__main__':\n",
81
+
" text = 'This is a sample text'\n",
82
+
" regex = generate_regex(text)\n",
83
+
" print(f'Text is: {text}')\n",
84
+
" print(f'Generated Regular Expression is: {regex}')\n",
85
+
"```\n",
86
+
"The main section of the code initializes a sample text, calls the `generate_regex` function to create a regular expression based on the text, and then prints both the original text and the generated regular expression.\n",
87
+
"\n",
88
+
"### Explanation\n",
89
+
"The purpose of this code is to create a regular expression pattern that can be used to match the input text, considering the input text may contain special characters and multiple spaces. The function aims to make the text suitable for pattern matching in a way that accounts for potential variations in spacing. This code then demonstrates the usage of the function with a sample text."
],
"metadata": {
"id": "wwWInMFa8UYs"
}
},
{
"cell_type": "code",
"source": [
"# Importing the required library\n",
"import re\n",
"\n",
"# Function for Generating Regular Expression\n",
"def generate_regex(text):\n",
"\n",
"    # Escape special characters so they are treated as literals\n",
"    regex = re.escape(text)\n",
"\n",
"    # Replace escaped spaces with \\s+ to match one or more whitespace characters\n",
"    regex = regex.replace(r'\\ ', r'\\s+')\n",
"\n",
"    # Returning Regular Expression\n",
"    return regex\n",
"\n",
"if __name__ == '__main__':\n",
"\n",
"    # Initializing Text\n",
"    text = 'This is a sample text'\n",
"\n",
"    # Generating Regular Expression by calling the function\n",
"    regex = generate_regex(text)\n",
"\n",
"    # Printing Text\n",
"    print(f'Text is: {text}')\n",
"\n",
"    # Printing Regular Expression\n",
"    print(f'Generated Regular Expression is: {regex}')"