From 2a7b7541e5c067454d0d1d2f7dd381d087f79f8e Mon Sep 17 00:00:00 2001 From: Pala63 <132717108+Pala63@users.noreply.github.com> Date: Mon, 28 Jul 2025 16:17:20 -0500 Subject: [PATCH 1/3] Created using Colab --- lessons/01_preprocessing.ipynb | 4712 +++++++++++++++++--------------- 1 file changed, 2548 insertions(+), 2164 deletions(-) diff --git a/lessons/01_preprocessing.ipynb b/lessons/01_preprocessing.ipynb index de33786..e6a9dd0 100644 --- a/lessons/01_preprocessing.ipynb +++ b/lessons/01_preprocessing.ipynb @@ -1,2173 +1,2557 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "d3e7ea21-6437-48e8-a9e4-3bdc05f709c9", - "metadata": {}, - "source": [ - "# Python Text Analysis: Preprocessing\n", - "\n", - "* * * \n", - "\n", - "
\n", - " \n", - "### Learning Objectives \n", - " \n", - "* Learn common steps for preprocessing text data, as well as specific operations for preprocessing Twitter data.\n", - "* Know commonly used NLP packages and what they are capable of.\n", - "* Understand tokenizers, and how they have changed since the advent of Large Language Models.\n", - "
\n", - "\n", - "### Icons Used in This Notebook\n", - "🔔 **Question**: A quick question to help you understand what's going on.
\n", - "🥊 **Challenge**: Interactive excersise. We'll work through these in the workshop!
\n", - "⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.
\n", - "🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!
\n", - "\n", - "### Sections\n", - "1. [Preprocessing](#section1)\n", - "2. [Tokenization](#section2)\n", - "\n", - "In this three-part workshop series, we'll learn the building blocks for performing text analysis in Python. These techniques lie in the domain of Natural Language Processing (NLP). NLP is a field that deals with identifying and extracting patterns of language, primarily in written texts. Throughout the workshop series, we'll interact with various packages for performing text analysis: starting from simple string methods to specific NLP packages, such as `nltk`, `spaCy`, and more recent ones on Large Language Models (`BERT`).\n", - "\n", - "Now, let's have these packages properly installed before diving into the materials." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d442e4c7-e926-493d-a64e-516616ad915a", - "metadata": {}, - "outputs": [], - "source": [ - "# Uncomment the following lines to install packages/model\n", - "# %pip install NLTK\n", - "# %pip install transformers\n", - "# %pip install spaCy\n", - "# !python -m spacy download en_core_web_sm" - ] - }, - { - "cell_type": "markdown", - "id": "df5b8f8e-4e69-426e-a202-ec48b325e89a", - "metadata": {}, - "source": [ - "\n", - "\n", - "# Preprocessing\n", - "\n", - "In Part 1 of this workshop, we'll address the first step of text analysis. Our goal is to convert the raw, messy text data into a consistent format. This process is often called **preprocessing**, **text cleaning**, or **text normalization**.\n", - "\n", - "You'll notice that at the end of preprocessing, our data is still in a format that we can read and understand. In Parts 2 and 3, we will begin our foray into converting the text data into a numerical representation—a format that can be more readily handled by computers. \n", - "\n", - "🔔 **Question**: Let's pause for a minute to reflect on **your** previous experiences working on text data. \n", - "- What is the format of the text data you have interacted with (plain text, CSV, or XML)?\n", - "- Where does it come from (structured corpus, scraped from the web, survey data)?\n", - "- Is it messy (i.e., is the data formatted consistently)?" - ] - }, - { - "cell_type": "markdown", - "id": "4b35911a-3b3f-4a48-a7d1-9882aab04851", - "metadata": {}, - "source": [ - "## Common Processes\n", - "\n", - "Preprocessing is not something we can accomplish with a single line of code. We often start by familiarizing ourselves with the data, and along the way, we gain a clearer understanding of the granularity of preprocessing we want to apply.\n", - "\n", - "Typically, we begin by applying a set of commonly used processes to clean the data. These operations don't substantially alter the form or meaning of the data; they serve as a standardized procedure to reshape the data into a consistent format.\n", - "\n", - "The following processes, for examples, are commonly applied to preprocess English texts of various genres. These operations can be done using built-in Python functions, such as `string` methods, and Regular Expressions. \n", - "- Lowercase the text\n", - "- Remove punctuation marks\n", - "- Remove extra whitespace characters\n", - "- Remove stop words\n", - "\n", - "After the initial processing, we may choose to perform task-specific processes, the specifics of which often depend on the downstream task we want to perform and the nature of the text data (i.e., its stylistic and linguistic features). \n", - "\n", - "Before we jump into these operations, let's take a look at our data!" 
- ] - }, - { - "cell_type": "markdown", - "id": "ec5d7350-9a1e-4db9-b828-a87fe1676d8d", - "metadata": {}, - "source": [ - "### Import the Text Data\n", - "\n", - "The text data we'll be working with is a CSV file. It contains tweets about U.S. airlines, scrapped from Feb 2015. \n", - "\n", - "Let's read the file `airline_tweets.csv` into dataframe with `pandas`." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "3d1ff64b-53ad-4eca-b846-3fda20085c43", - "metadata": {}, - "outputs": [], - "source": [ - "# Import pandas\n", - "import pandas as pd\n", - "\n", - "# File path to data\n", - "csv_path = '../data/airline_tweets.csv'\n", - "\n", - "# Specify the separator\n", - "tweets = pd.read_csv(csv_path, sep=',')" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "e397ac6a-c2ba-4cce-8700-b36b38026c9d", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
tweet_idairline_sentimentairline_sentiment_confidencenegativereasonnegativereason_confidenceairlineairline_sentiment_goldnamenegativereason_goldretweet_counttexttweet_coordtweet_createdtweet_locationuser_timezone
0570306133677760513neutral1.0000NaNNaNVirgin AmericaNaNcairdinNaN0@VirginAmerica What @dhepburn said.NaN2015-02-24 11:35:52 -0800NaNEastern Time (US & Canada)
1570301130888122368positive0.3486NaN0.0000Virgin AmericaNaNjnardinoNaN0@VirginAmerica plus you've added commercials t...NaN2015-02-24 11:15:59 -0800NaNPacific Time (US & Canada)
2570301083672813571neutral0.6837NaNNaNVirgin AmericaNaNyvonnalynnNaN0@VirginAmerica I didn't today... Must mean I n...NaN2015-02-24 11:15:48 -0800Lets PlayCentral Time (US & Canada)
3570301031407624196negative1.0000Bad Flight0.7033Virgin AmericaNaNjnardinoNaN0@VirginAmerica it's really aggressive to blast...NaN2015-02-24 11:15:36 -0800NaNPacific Time (US & Canada)
4570300817074462722negative1.0000Can't Tell1.0000Virgin AmericaNaNjnardinoNaN0@VirginAmerica and it's a really big bad thing...NaN2015-02-24 11:14:45 -0800NaNPacific Time (US & Canada)
\n", - "
" + "cells": [ + { + "cell_type": "markdown", + "id": "d3e7ea21-6437-48e8-a9e4-3bdc05f709c9", + "metadata": { + "id": "d3e7ea21-6437-48e8-a9e4-3bdc05f709c9" + }, + "source": [ + "# Python Text Analysis: Preprocessing\n", + "\n", + "* * *\n", + "\n", + "
\n", + " \n", + "### Learning Objectives\n", + " \n", + "* Learn common steps for preprocessing text data, as well as specific operations for preprocessing Twitter data.\n", + "* Know commonly used NLP packages and what they are capable of.\n", + "* Understand tokenizers, and how they have changed since the advent of Large Language Models.\n", + "
\n", + "\n", + "### Icons Used in This Notebook\n", + "🔔 **Question**: A quick question to help you understand what's going on.
\n", + "🥊 **Challenge**: Interactive excersise. We'll work through these in the workshop!
\n", + "⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.
\n", + "🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!
\n", + "\n", + "### Sections\n", + "1. [Preprocessing](#section1)\n", + "2. [Tokenization](#section2)\n", + "\n", + "In this three-part workshop series, we'll learn the building blocks for performing text analysis in Python. These techniques lie in the domain of Natural Language Processing (NLP). NLP is a field that deals with identifying and extracting patterns of language, primarily in written texts. Throughout the workshop series, we'll interact with various packages for performing text analysis: starting from simple string methods to specific NLP packages, such as `nltk`, `spaCy`, and more recent ones on Large Language Models (`BERT`).\n", + "\n", + "Now, let's have these packages properly installed before diving into the materials." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "d442e4c7-e926-493d-a64e-516616ad915a", + "metadata": { + "id": "d442e4c7-e926-493d-a64e-516616ad915a", + "outputId": "696bdeff-0793-4aff-c8a9-d23d8446464a", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Requirement already satisfied: NLTK in /usr/local/lib/python3.11/dist-packages (3.9.1)\n", + "Requirement already satisfied: click in /usr/local/lib/python3.11/dist-packages (from NLTK) (8.2.1)\n", + "Requirement already satisfied: joblib in /usr/local/lib/python3.11/dist-packages (from NLTK) (1.5.1)\n", + "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.11/dist-packages (from NLTK) (2024.11.6)\n", + "Requirement already satisfied: tqdm in /usr/local/lib/python3.11/dist-packages (from NLTK) (4.67.1)\n", + "Requirement already satisfied: transformers in /usr/local/lib/python3.11/dist-packages (4.53.3)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from transformers) (3.18.0)\n", + "Requirement already satisfied: huggingface-hub<1.0,>=0.30.0 in /usr/local/lib/python3.11/dist-packages (from transformers) (0.33.5)\n", + "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.11/dist-packages (from transformers) (2.0.2)\n", + "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.11/dist-packages (from transformers) (25.0)\n", + "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.11/dist-packages (from transformers) (6.0.2)\n", + "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.11/dist-packages (from transformers) (2024.11.6)\n", + "Requirement already satisfied: requests in /usr/local/lib/python3.11/dist-packages (from transformers) (2.32.3)\n", + "Requirement already satisfied: tokenizers<0.22,>=0.21 in /usr/local/lib/python3.11/dist-packages (from transformers) (0.21.2)\n", + "Requirement already satisfied: safetensors>=0.4.3 in /usr/local/lib/python3.11/dist-packages (from transformers) (0.5.3)\n", + "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.11/dist-packages (from transformers) (4.67.1)\n", + "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.11/dist-packages (from huggingface-hub<1.0,>=0.30.0->transformers) (2025.3.0)\n", + "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.11/dist-packages (from huggingface-hub<1.0,>=0.30.0->transformers) (4.14.1)\n", + "Requirement already satisfied: hf-xet<2.0.0,>=1.1.2 in /usr/local/lib/python3.11/dist-packages (from huggingface-hub<1.0,>=0.30.0->transformers) (1.1.5)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 
in /usr/local/lib/python3.11/dist-packages (from requests->transformers) (3.4.2)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests->transformers) (3.10)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests->transformers) (2.5.0)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests->transformers) (2025.7.14)\n", + "Requirement already satisfied: spaCy in /usr/local/lib/python3.11/dist-packages (3.8.7)\n", + "Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /usr/local/lib/python3.11/dist-packages (from spaCy) (3.0.12)\n", + "Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from spaCy) (1.0.5)\n", + "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.11/dist-packages (from spaCy) (1.0.13)\n", + "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.11/dist-packages (from spaCy) (2.0.11)\n", + "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.11/dist-packages (from spaCy) (3.0.10)\n", + "Requirement already satisfied: thinc<8.4.0,>=8.3.4 in /usr/local/lib/python3.11/dist-packages (from spaCy) (8.3.6)\n", + "Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /usr/local/lib/python3.11/dist-packages (from spaCy) (1.1.3)\n", + "Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/lib/python3.11/dist-packages (from spaCy) (2.5.1)\n", + "Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.11/dist-packages (from spaCy) (2.0.10)\n", + "Requirement already satisfied: weasel<0.5.0,>=0.1.0 in /usr/local/lib/python3.11/dist-packages (from spaCy) (0.4.1)\n", + "Requirement already satisfied: typer<1.0.0,>=0.3.0 in /usr/local/lib/python3.11/dist-packages (from spaCy) (0.16.0)\n", + "Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.11/dist-packages (from spaCy) (4.67.1)\n", + "Requirement already satisfied: numpy>=1.19.0 in /usr/local/lib/python3.11/dist-packages (from spaCy) (2.0.2)\n", + "Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.11/dist-packages (from spaCy) (2.32.3)\n", + "Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /usr/local/lib/python3.11/dist-packages (from spaCy) (2.11.7)\n", + "Requirement already satisfied: jinja2 in /usr/local/lib/python3.11/dist-packages (from spaCy) (3.1.6)\n", + "Requirement already satisfied: setuptools in /usr/local/lib/python3.11/dist-packages (from spaCy) (75.2.0)\n", + "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.11/dist-packages (from spaCy) (25.0)\n", + "Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/lib/python3.11/dist-packages (from spaCy) (3.5.0)\n", + "Requirement already satisfied: language-data>=1.2 in /usr/local/lib/python3.11/dist-packages (from langcodes<4.0.0,>=3.2.0->spaCy) (1.3.0)\n", + "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spaCy) (0.7.0)\n", + "Requirement already satisfied: pydantic-core==2.33.2 in /usr/local/lib/python3.11/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spaCy) (2.33.2)\n", + "Requirement already satisfied: typing-extensions>=4.12.2 in /usr/local/lib/python3.11/dist-packages (from 
pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spaCy) (4.14.1)\n", + "Requirement already satisfied: typing-inspection>=0.4.0 in /usr/local/lib/python3.11/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spaCy) (0.4.1)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.13.0->spaCy) (3.4.2)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.13.0->spaCy) (3.10)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.13.0->spaCy) (2.5.0)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.13.0->spaCy) (2025.7.14)\n", + "Requirement already satisfied: blis<1.4.0,>=1.3.0 in /usr/local/lib/python3.11/dist-packages (from thinc<8.4.0,>=8.3.4->spaCy) (1.3.0)\n", + "Requirement already satisfied: confection<1.0.0,>=0.0.1 in /usr/local/lib/python3.11/dist-packages (from thinc<8.4.0,>=8.3.4->spaCy) (0.1.5)\n", + "Requirement already satisfied: click>=8.0.0 in /usr/local/lib/python3.11/dist-packages (from typer<1.0.0,>=0.3.0->spaCy) (8.2.1)\n", + "Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.11/dist-packages (from typer<1.0.0,>=0.3.0->spaCy) (1.5.4)\n", + "Requirement already satisfied: rich>=10.11.0 in /usr/local/lib/python3.11/dist-packages (from typer<1.0.0,>=0.3.0->spaCy) (13.9.4)\n", + "Requirement already satisfied: cloudpathlib<1.0.0,>=0.7.0 in /usr/local/lib/python3.11/dist-packages (from weasel<0.5.0,>=0.1.0->spaCy) (0.21.1)\n", + "Requirement already satisfied: smart-open<8.0.0,>=5.2.1 in /usr/local/lib/python3.11/dist-packages (from weasel<0.5.0,>=0.1.0->spaCy) (7.3.0.post1)\n", + "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.11/dist-packages (from jinja2->spaCy) (3.0.2)\n", + "Requirement already satisfied: marisa-trie>=1.1.0 in /usr/local/lib/python3.11/dist-packages (from language-data>=1.2->langcodes<4.0.0,>=3.2.0->spaCy) (1.2.1)\n", + "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.11/dist-packages (from rich>=10.11.0->typer<1.0.0,>=0.3.0->spaCy) (3.0.0)\n", + "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.11/dist-packages (from rich>=10.11.0->typer<1.0.0,>=0.3.0->spaCy) (2.19.2)\n", + "Requirement already satisfied: wrapt in /usr/local/lib/python3.11/dist-packages (from smart-open<8.0.0,>=5.2.1->weasel<0.5.0,>=0.1.0->spaCy) (1.17.2)\n", + "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.11/dist-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer<1.0.0,>=0.3.0->spaCy) (0.1.2)\n", + "Collecting en-core-web-sm==3.8.0\n", + " Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.8/12.8 MB\u001b[0m \u001b[31m92.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n", + "You can now load the package via spacy.load('en_core_web_sm')\n", + "\u001b[38;5;3m⚠ Restart to reload dependencies\u001b[0m\n", + "If you are in a Jupyter or Colab notebook, you may need to restart Python in\n", + "order to load all the package's dependencies. 
You can do this by selecting the\n", + "'Restart kernel' or 'Restart runtime' option.\n" + ] + } + ], + "source": [ + " #Uncomment the following lines to install packages/model\n", + " %pip install NLTK\n", + " %pip install transformers\n", + " %pip install spaCy\n", + " !python -m spacy download en_core_web_sm" + ] + }, + { + "cell_type": "markdown", + "id": "df5b8f8e-4e69-426e-a202-ec48b325e89a", + "metadata": { + "id": "df5b8f8e-4e69-426e-a202-ec48b325e89a" + }, + "source": [ + "\n", + "\n", + "# Preprocessing\n", + "\n", + "In Part 1 of this workshop, we'll address the first step of text analysis. Our goal is to convert the raw, messy text data into a consistent format. This process is often called **preprocessing**, **text cleaning**, or **text normalization**.\n", + "\n", + "You'll notice that at the end of preprocessing, our data is still in a format that we can read and understand. In Parts 2 and 3, we will begin our foray into converting the text data into a numerical representation—a format that can be more readily handled by computers.\n", + "\n", + "🔔 **Question**: Let's pause for a minute to reflect on **your** previous experiences working on text data.\n", + "- What is the format of the text data you have interacted with (plain text, CSV, or XML)?\n", + "- Where does it come from (structured corpus, scraped from the web, survey data)?\n", + "- Is it messy (i.e., is the data formatted consistently)?" + ] + }, + { + "cell_type": "code", + "source": [ + "# Respondiendo a las preguntas:\n", + "# 1) he trabajo con texto simple y CSV\n", + "# 2) extraido de la web y corpus estructurado." + ], + "metadata": { + "id": "4P2j2IAz1KQn" + }, + "id": "4P2j2IAz1KQn", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "id": "4b35911a-3b3f-4a48-a7d1-9882aab04851", + "metadata": { + "id": "4b35911a-3b3f-4a48-a7d1-9882aab04851" + }, + "source": [ + "## Common Processes\n", + "\n", + "Preprocessing is not something we can accomplish with a single line of code. We often start by familiarizing ourselves with the data, and along the way, we gain a clearer understanding of the granularity of preprocessing we want to apply.\n", + "\n", + "Typically, we begin by applying a set of commonly used processes to clean the data. These operations don't substantially alter the form or meaning of the data; they serve as a standardized procedure to reshape the data into a consistent format.\n", + "\n", + "The following processes, for examples, are commonly applied to preprocess English texts of various genres. These operations can be done using built-in Python functions, such as `string` methods, and Regular Expressions.\n", + "- Lowercase the text\n", + "- Remove punctuation marks\n", + "- Remove extra whitespace characters\n", + "- Remove stop words\n", + "\n", + "After the initial processing, we may choose to perform task-specific processes, the specifics of which often depend on the downstream task we want to perform and the nature of the text data (i.e., its stylistic and linguistic features). \n", + "\n", + "Before we jump into these operations, let's take a look at our data!" + ] + }, + { + "cell_type": "markdown", + "id": "ec5d7350-9a1e-4db9-b828-a87fe1676d8d", + "metadata": { + "id": "ec5d7350-9a1e-4db9-b828-a87fe1676d8d" + }, + "source": [ + "### Import the Text Data\n", + "\n", + "The text data we'll be working with is a CSV file. It contains tweets about U.S. 
airlines, scrapped from Feb 2015.\n", + "\n", + "Let's read the file `airline_tweets.csv` into dataframe with `pandas`." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "3d1ff64b-53ad-4eca-b846-3fda20085c43", + "metadata": { + "id": "3d1ff64b-53ad-4eca-b846-3fda20085c43", + "outputId": "2df7adfd-01e6-452c-ae50-a9437cc486f0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 335 + } + }, + "outputs": [ + { + "output_type": "error", + "ename": "FileNotFoundError", + "evalue": "[Errno 2] No such file or directory: '../data/airline_tweets.csv'", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m/tmp/ipython-input-2-1378166650.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;31m# Specify the separator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 8\u001b[0;31m \u001b[0mtweets\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcsv_path\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msep\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m','\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py\u001b[0m in \u001b[0;36mread_csv\u001b[0;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)\u001b[0m\n\u001b[1;32m 1024\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mupdate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkwds_defaults\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1025\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1026\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_read\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1027\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1028\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py\u001b[0m in \u001b[0;36m_read\u001b[0;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[1;32m 618\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 619\u001b[0m \u001b[0;31m# Create the parser.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 620\u001b[0;31m \u001b[0mparser\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mTextFileReader\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 621\u001b[0m 
\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 622\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mchunksize\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0miterator\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, f, engine, **kwds)\u001b[0m\n\u001b[1;32m 1618\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1619\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhandles\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mIOHandles\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1620\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_make_engine\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mengine\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1621\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1622\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py\u001b[0m in \u001b[0;36m_make_engine\u001b[0;34m(self, f, engine)\u001b[0m\n\u001b[1;32m 1878\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;34m\"b\"\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mmode\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1879\u001b[0m \u001b[0mmode\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;34m\"b\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1880\u001b[0;31m self.handles = get_handle(\n\u001b[0m\u001b[1;32m 1881\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1882\u001b[0m \u001b[0mmode\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.11/dist-packages/pandas/io/common.py\u001b[0m in \u001b[0;36mget_handle\u001b[0;34m(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)\u001b[0m\n\u001b[1;32m 871\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mioargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mencoding\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;34m\"b\"\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mioargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmode\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 872\u001b[0m \u001b[0;31m# Encoding\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 873\u001b[0;31m handle = open(\n\u001b[0m\u001b[1;32m 874\u001b[0m \u001b[0mhandle\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 875\u001b[0m \u001b[0mioargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmode\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '../data/airline_tweets.csv'" + ] + } + ], + "source": [ + "# Import 
pandas\n", + "import pandas as pd\n", + "\n", + "# File path to data\n", + "csv_path = '../data/airline_tweets.csv'\n", + "\n", + "# Specify the separator\n", + "tweets = pd.read_csv(csv_path, sep=',')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e397ac6a-c2ba-4cce-8700-b36b38026c9d", + "metadata": { + "id": "e397ac6a-c2ba-4cce-8700-b36b38026c9d", + "outputId": "ca6b5529-f2b3-4d87-ccc2-2e24e405ee43" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
tweet_idairline_sentimentairline_sentiment_confidencenegativereasonnegativereason_confidenceairlineairline_sentiment_goldnamenegativereason_goldretweet_counttexttweet_coordtweet_createdtweet_locationuser_timezone
0570306133677760513neutral1.0000NaNNaNVirgin AmericaNaNcairdinNaN0@VirginAmerica What @dhepburn said.NaN2015-02-24 11:35:52 -0800NaNEastern Time (US & Canada)
1570301130888122368positive0.3486NaN0.0000Virgin AmericaNaNjnardinoNaN0@VirginAmerica plus you've added commercials t...NaN2015-02-24 11:15:59 -0800NaNPacific Time (US & Canada)
2570301083672813571neutral0.6837NaNNaNVirgin AmericaNaNyvonnalynnNaN0@VirginAmerica I didn't today... Must mean I n...NaN2015-02-24 11:15:48 -0800Lets PlayCentral Time (US & Canada)
3570301031407624196negative1.0000Bad Flight0.7033Virgin AmericaNaNjnardinoNaN0@VirginAmerica it's really aggressive to blast...NaN2015-02-24 11:15:36 -0800NaNPacific Time (US & Canada)
4570300817074462722negative1.0000Can't Tell1.0000Virgin AmericaNaNjnardinoNaN0@VirginAmerica and it's a really big bad thing...NaN2015-02-24 11:14:45 -0800NaNPacific Time (US & Canada)
\n", + "
" + ], + "text/plain": [ + " tweet_id airline_sentiment airline_sentiment_confidence \\\n", + "0 570306133677760513 neutral 1.0000 \n", + "1 570301130888122368 positive 0.3486 \n", + "2 570301083672813571 neutral 0.6837 \n", + "3 570301031407624196 negative 1.0000 \n", + "4 570300817074462722 negative 1.0000 \n", + "\n", + " negativereason negativereason_confidence airline \\\n", + "0 NaN NaN Virgin America \n", + "1 NaN 0.0000 Virgin America \n", + "2 NaN NaN Virgin America \n", + "3 Bad Flight 0.7033 Virgin America \n", + "4 Can't Tell 1.0000 Virgin America \n", + "\n", + " airline_sentiment_gold name negativereason_gold retweet_count \\\n", + "0 NaN cairdin NaN 0 \n", + "1 NaN jnardino NaN 0 \n", + "2 NaN yvonnalynn NaN 0 \n", + "3 NaN jnardino NaN 0 \n", + "4 NaN jnardino NaN 0 \n", + "\n", + " text tweet_coord \\\n", + "0 @VirginAmerica What @dhepburn said. NaN \n", + "1 @VirginAmerica plus you've added commercials t... NaN \n", + "2 @VirginAmerica I didn't today... Must mean I n... NaN \n", + "3 @VirginAmerica it's really aggressive to blast... NaN \n", + "4 @VirginAmerica and it's a really big bad thing... NaN \n", + "\n", + " tweet_created tweet_location user_timezone \n", + "0 2015-02-24 11:35:52 -0800 NaN Eastern Time (US & Canada) \n", + "1 2015-02-24 11:15:59 -0800 NaN Pacific Time (US & Canada) \n", + "2 2015-02-24 11:15:48 -0800 Lets Play Central Time (US & Canada) \n", + "3 2015-02-24 11:15:36 -0800 NaN Pacific Time (US & Canada) \n", + "4 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada) " + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Show the first five rows\n", + "tweets.head()" + ] + }, + { + "cell_type": "markdown", + "id": "ae3b339f-45cf-465d-931c-05f9096fd510", + "metadata": { + "id": "ae3b339f-45cf-465d-931c-05f9096fd510" + }, + "source": [ + "The dataframe has one row per tweet. The text of tweet is shown in the `text` column.\n", + "- `text` (`str`): the text of the tweet.\n", + "\n", + "Other metadata we are interested in include:\n", + "- `airline_sentiment` (`str`): the sentiment of the tweet, labeled as \"neutral,\" \"positive,\" or \"negative.\"\n", + "- `airline` (`str`): the airline that is tweeted about.\n", + "- `retweet count` (`int`): how many times the tweet was retweeted." + ] + }, + { + "cell_type": "markdown", + "id": "302c695b-4bd1-4151-9cb9-ef5253eb16df", + "metadata": { + "id": "302c695b-4bd1-4151-9cb9-ef5253eb16df" + }, + "source": [ + "Let's take a look at some of the tweets:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b690daab-7be5-4b8f-8af0-a91fdec4ec4f", + "metadata": { + "id": "b690daab-7be5-4b8f-8af0-a91fdec4ec4f", + "outputId": "6c927a6f-0b55-40bc-d97f-b9d34360bae1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "@VirginAmerica What @dhepburn said.\n", + "@VirginAmerica plus you've added commercials to the experience... tacky.\n", + "@VirginAmerica I didn't today... Must mean I need to take another trip!\n" + ] + } + ], + "source": [ + "print(tweets['text'].iloc[0])\n", + "print(tweets['text'].iloc[1])\n", + "print(tweets['text'].iloc[2])" + ] + }, + { + "cell_type": "markdown", + "id": "8adc05fa-ad30-4402-ab56-086bcb09a166", + "metadata": { + "id": "8adc05fa-ad30-4402-ab56-086bcb09a166" + }, + "source": [ + "🔔 **Question**: What have you noticed? What are the stylistic features of tweets?" 
+ ] + }, + { + "cell_type": "markdown", + "id": "c3460393-00a6-461c-b02a-9e98f9b5d1af", + "metadata": { + "id": "c3460393-00a6-461c-b02a-9e98f9b5d1af" + }, + "source": [ + "### Lowercasing\n", + "\n", + "While we acknowledge that a word's casing is informative, we often don't work in contexts where we can properly utilize this information.\n", + "\n", + "More often, the subsequent analysis we perform is **case-insensitive**. For instance, in frequency analysis, we want to account for various forms of the same word. Lowercasing the text data aids in this process and simplifies our analysis.\n", + "\n", + "We can easily achieve lowercasing with the string method [`.lower()`](https://docs.python.org/3/library/stdtypes.html#str.lower); see [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) for more useful functions.\n", + "\n", + "Let's apply it to the following example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "58a95d90-3ef1-4bff-9cfe-d447ed99f252", + "metadata": { + "id": "58a95d90-3ef1-4bff-9cfe-d447ed99f252", + "outputId": "33094c0c-036c-42f0-9196-94ebeb135165" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "@VirginAmerica I was scheduled for SFO 2 DAL flight 714 today. Changed to 24th due weather. Looks like flight still on?\n" + ] + } + ], + "source": [ + "# Print the first example\n", + "first_example = tweets['text'][108]\n", + "print(first_example)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c66d91c0-6eed-4591-95fc-cd2eae2e0d41", + "metadata": { + "id": "c66d91c0-6eed-4591-95fc-cd2eae2e0d41", + "outputId": "452329e5-c12d-4698-c20b-a1162b29d3fb" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "False\n", + "==================================================\n", + "@virginamerica i was scheduled for sfo 2 dal flight 714 today. changed to 24th due weather. looks like flight still on?\n", + "==================================================\n", + "@VIRGINAMERICA I WAS SCHEDULED FOR SFO 2 DAL FLIGHT 714 TODAY. CHANGED TO 24TH DUE WEATHER. LOOKS LIKE FLIGHT STILL ON?\n" + ] + } + ], + "source": [ + "# Check if all characters are in lowercase\n", + "print(first_example.islower())\n", + "print(f\"{'=' * 50}\")\n", + "\n", + "# Convert it to lowercase\n", + "print(first_example.lower())\n", + "print(f\"{'=' * 50}\")\n", + "\n", + "# Convert it to uppercase\n", + "print(first_example.upper())" + ] + }, + { + "cell_type": "markdown", + "id": "7bf0d8c8-bd6c-47ef-b305-09ac61d07d4d", + "metadata": { + "id": "7bf0d8c8-bd6c-47ef-b305-09ac61d07d4d" + }, + "source": [ + "### Remove Extra Whitespace Characters\n", + "\n", + "Sometimes we might come across texts with extraneous whitespace, such as spaces, tabs, and newline characters, which is particularly common when the text is scrapped from web pages. Before we dive into the details, let's briefly introduce Regular Expressions (regex) and the `re` package.\n", + "\n", + "Regular expressions are a powerful way of searching for specific string patterns in large corpora. They have an infamously steep learning curve, but they can be very efficient when we get a handle on them. Many NLP packages heavily rely on regex under the hood. 
Regex testers, such as [regex101](https://regex101.com), are useful tools in both understanding and creating regex expressions.\n", + "\n", + "Our goal in this workshop is not to provide a deep (or even shallow) dive into regex; instead, we want to expose you to them so that you are better prepared to do deep dives in the future!\n", + "\n", + "The following example is a poem by William Wordsworth. Like many poems, the text may contain extra line breaks (i.e., newline characters, `\\n`) that we want to remove." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d1bd73f1-a30f-4269-a05e-47cfff7b496f", + "metadata": { + "id": "d1bd73f1-a30f-4269-a05e-47cfff7b496f" + }, + "outputs": [], + "source": [ + "# File path to the poem\n", + "text_path = '../data/poem_wordsworth.txt'\n", + "\n", + "# Read the poem in\n", + "with open(text_path, 'r') as file:\n", + " text = file.read()\n", + " file.close()" + ] + }, + { + "cell_type": "markdown", + "id": "7a693dd9-9706-40b3-863f-f568020245f7", + "metadata": { + "id": "7a693dd9-9706-40b3-863f-f568020245f7" + }, + "source": [ + "As you can see, the poem is formatted as a continuous string of text with line breaks placed at the end of each line, making it difficult to read." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e78a75a-8e15-4bcb-a416-783aa7f60ef3", + "metadata": { + "id": "7e78a75a-8e15-4bcb-a416-783aa7f60ef3", + "outputId": "fe88d3c0-5f01-4bba-9ce9-61a4ce3683bf" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\"I wandered lonely as a cloud\\n\\n\\nI wandered lonely as a cloud\\nThat floats on high o'er vales and hills,\\nWhen all at once I saw a crowd,\\nA host, of golden daffodils;\\nBeside the lake, beneath the trees,\\nFluttering and dancing in the breeze.\\n\\nContinuous as the stars that shine\\nAnd twinkle on the milky way,\\nThey stretched in never-ending line\\nAlong the margin of a bay:\\nTen thousand saw I at a glance,\\nTossing their heads in sprightly dance.\\n\\nThe waves beside them danced; but they\\nOut-did the sparkling waves in glee:\\nA poet could not but be gay,\\nIn such a jocund company:\\nI gazed—and gazed—but little thought\\nWhat wealth the show to me had brought:\\n\\nFor oft, when on my couch I lie\\nIn vacant or in pensive mood,\\nThey flash upon that inward eye\\nWhich is the bliss of solitude;\\nAnd then my heart with pleasure fills,\\nAnd dances with the daffodils.\"" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "text" + ] + }, + { + "cell_type": "markdown", + "id": "47cce993-c315-4aaa-87fe-149de8607f65", + "metadata": { + "id": "47cce993-c315-4aaa-87fe-149de8607f65" + }, + "source": [ + "One handy function we can use to display the poem properly is `.splitlines()`. As the name suggests, it splits a long text sequence into a list of lines whenever there is a newline character. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ddeade7a-065d-49e6-bdd3-87a8ea8f6e6e", + "metadata": { + "id": "ddeade7a-065d-49e6-bdd3-87a8ea8f6e6e", + "outputId": "c08d21ba-8154-4dac-c39c-e1ba5644b7ea" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['I wandered lonely as a cloud',\n", + " '',\n", + " '',\n", + " 'I wandered lonely as a cloud',\n", + " \"That floats on high o'er vales and hills,\",\n", + " 'When all at once I saw a crowd,',\n", + " 'A host, of golden daffodils;',\n", + " 'Beside the lake, beneath the trees,',\n", + " 'Fluttering and dancing in the breeze.',\n", + " '',\n", + " 'Continuous as the stars that shine',\n", + " 'And twinkle on the milky way,',\n", + " 'They stretched in never-ending line',\n", + " 'Along the margin of a bay:',\n", + " 'Ten thousand saw I at a glance,',\n", + " 'Tossing their heads in sprightly dance.',\n", + " '',\n", + " 'The waves beside them danced; but they',\n", + " 'Out-did the sparkling waves in glee:',\n", + " 'A poet could not but be gay,',\n", + " 'In such a jocund company:',\n", + " 'I gazed—and gazed—but little thought',\n", + " 'What wealth the show to me had brought:',\n", + " '',\n", + " 'For oft, when on my couch I lie',\n", + " 'In vacant or in pensive mood,',\n", + " 'They flash upon that inward eye',\n", + " 'Which is the bliss of solitude;',\n", + " 'And then my heart with pleasure fills,',\n", + " 'And dances with the daffodils.']" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Split the single string into a list of lines\n", + "text.splitlines()" + ] + }, + { + "cell_type": "markdown", + "id": "44d3825b-0857-44e1-bf6a-d8c7a9032704", + "metadata": { + "id": "44d3825b-0857-44e1-bf6a-d8c7a9032704" + }, + "source": [ + "Let's return to our tweet data for an example." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "53a81ea9-65c4-474a-8530-35393555d1be", + "metadata": { + "id": "53a81ea9-65c4-474a-8530-35393555d1be", + "outputId": "49d34646-0cf5-4c45-eb29-1c83b6870e42" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\"@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\\nit's really the only bad thing about flying VA\"" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Print the second example\n", + "second_example = tweets['text'][5]\n", + "second_example" + ] + }, + { + "cell_type": "markdown", + "id": "aef55865-36fd-4c06-a765-530cf3b53096", + "metadata": { + "id": "aef55865-36fd-4c06-a765-530cf3b53096" + }, + "source": [ + "In this case, we don't really want to split the tweet into a list of strings. We still expect a single string of text but would like to remove the line break completely from the string.\n", + "\n", + "The string method `.strip()` effectively does the job of stripping away spaces at both ends of the text. However, it won't work in our example as the newline character is in the middle of the string." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b933503b-4370-4dc4-b287-6dc2f9cdb1d4", + "metadata": { + "id": "b933503b-4370-4dc4-b287-6dc2f9cdb1d4", + "outputId": "a6312e9c-9276-4330-9d7c-c79268310614" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\"@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\\nit's really the only bad thing about flying VA\"" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Strip only removed blankspace at both ends\n", + "second_example.strip()" + ] + }, + { + "cell_type": "markdown", + "id": "b99b80b4-804f-460f-a2d5-adbd654902b3", + "metadata": { + "id": "b99b80b4-804f-460f-a2d5-adbd654902b3" + }, + "source": [ + "This is where regex could be really helpful." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ceac9714-7053-4b2e-affb-71f8c3d2dcd9", + "metadata": { + "id": "ceac9714-7053-4b2e-affb-71f8c3d2dcd9" + }, + "outputs": [], + "source": [ + "import re" + ] + }, + { + "cell_type": "markdown", + "id": "d5f08d20-ba81-4e48-9e2a-5728148005b3", + "metadata": { + "id": "d5f08d20-ba81-4e48-9e2a-5728148005b3" + }, + "source": [ + "Now, with regex, we are essentially calling it to match a pattern that we have identified in the text data, and we want to do some operations to the matched part—extract it, replace it with something else, or remove it completely. Therefore, the way regex works could be unpacked into the following steps:\n", + "\n", + "- Identify and write the pattern in regex (`r'PATTERN'`)\n", + "- Write the replacement for the pattern (`'REPLACEMENT'`)\n", + "- Call the specific regex function (e.g., `re.sub()`)\n", + "\n", + "In our example, the pattern we are looking for is `\\s`, which is the regex short name for any whitespace character (`\\n` and `\\t` included). We also add a quantifier `+` to the end: `\\s+`. It means we'd like to capture one or more occurences of the whitespace character." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1248d227-1149-4014-94a5-c05592a27a7e", + "metadata": { + "id": "1248d227-1149-4014-94a5-c05592a27a7e" + }, + "outputs": [], + "source": [ + "# Write a pattern in regex\n", + "blankspace_pattern = r'\\s+'" + ] + }, + { + "cell_type": "markdown", + "id": "cc075c2e-1a1d-4393-a3ea-8ad7c118364b", + "metadata": { + "id": "cc075c2e-1a1d-4393-a3ea-8ad7c118364b" + }, + "source": [ + "The replacement for one or more whitespace characters is exactly one single whitespace, which is the canonical word boundary in English. Any additional whitespace will be reduced to a single whitespace." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c55cb2f1-f4ca-4b79-900c-f65ec303ddac", + "metadata": { + "id": "c55cb2f1-f4ca-4b79-900c-f65ec303ddac" + }, + "outputs": [], + "source": [ + "# Write a replacement for the pattern identfied\n", + "blankspace_repl = ' '" + ] + }, + { + "cell_type": "markdown", + "id": "bc12e3d1-728a-429b-9c83-4dcc88590bc4", + "metadata": { + "id": "bc12e3d1-728a-429b-9c83-4dcc88590bc4" + }, + "source": [ + "Lastly, let's put everything together using the function [`re.sub()`](https://docs.python.org/3.11/library/re.html#re.sub), which means we want to substitute a pattern with a replacement. The function takes in three arguments—the pattern, the replacement, and the string to which we want to apply the function." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5249b24b-7111-4569-be29-c40efa5e148e", + "metadata": { + "id": "5249b24b-7111-4569-be29-c40efa5e148e", + "outputId": "ec022a30-3b02-4ede-ff6c-89f98e4bc0bf" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing. it's really the only bad thing about flying VA\n" + ] + } + ], + "source": [ + "# Replace whitespace(s) with ' '\n", + "clean_text = re.sub(pattern = blankspace_pattern,\n", + " repl = blankspace_repl,\n", + " string = second_example)\n", + "print(clean_text)" + ] + }, + { + "cell_type": "markdown", + "id": "a895fbe3-a034-4124-94af-72a528913c51", + "metadata": { + "id": "a895fbe3-a034-4124-94af-72a528913c51" + }, + "source": [ + "Ta-da! The newline character is no longer there." + ] + }, + { + "cell_type": "markdown", + "id": "7087dc0c-5fef-4f1c-8662-7cbc8a978f34", + "metadata": { + "id": "7087dc0c-5fef-4f1c-8662-7cbc8a978f34" + }, + "source": [ + "### Remove Punctuation Marks\n", + "\n", + "Sometimes we are only interested in analyzing **alphanumeric characters** (i.e., the letters and numbers), in which case we might want to remove punctuation marks.\n", + "\n", + "The `string` module contains a list of predefined punctuation marks. Let's print them out." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "70e8502b-b703-45e0-8852-0c3210363440", + "metadata": { + "id": "70e8502b-b703-45e0-8852-0c3210363440", + "outputId": "1d034a80-b105-44d1-ab0c-373e1a15cbc5" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\n" + ] + } + ], + "source": [ + "# Load in a predefined list of punctuation marks\n", + "from string import punctuation\n", + "print(punctuation)" + ] + }, + { + "cell_type": "markdown", + "id": "91119c9e-431c-42cb-afea-f7e607698929", + "metadata": { + "id": "91119c9e-431c-42cb-afea-f7e607698929" + }, + "source": [ + "In practice, to remove these punctuation characters, we can simply iterate over the text and remove characters found in the list, such as shown below in the `remove_punct` function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "237d868d-339d-4bbe-9a3b-20fa5fbdf231", + "metadata": { + "id": "237d868d-339d-4bbe-9a3b-20fa5fbdf231" + }, + "outputs": [], + "source": [ + "def remove_punct(text):\n", + " '''Remove punctuation marks in input text'''\n", + "\n", + " # Select characters not in puncutaion\n", + " no_punct = []\n", + " for char in text:\n", + " if char not in punctuation:\n", + " no_punct.append(char)\n", + "\n", + " # Join the characters into a string\n", + " text_no_punct = ''.join(no_punct)\n", + "\n", + " return text_no_punct" + ] + }, + { + "cell_type": "markdown", + "id": "d4fc768b-c2dd-4386-8212-483c4485e4be", + "metadata": { + "id": "d4fc768b-c2dd-4386-8212-483c4485e4be" + }, + "source": [ + "Let's apply the function to the example below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7596c465-3d85-4b72-a853-f2151bcd91df", + "metadata": { + "id": "7596c465-3d85-4b72-a853-f2151bcd91df", + "outputId": "330529c4-ceaa-42db-f784-7766c5a81812" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "@VirginAmerica why are your first fares in May over three times more than other carriers when all seats are available to select???\n", + "==================================================\n" + ] + }, + { + "data": { + "text/plain": [ + "'VirginAmerica why are your first fares in May over three times more than other carriers when all seats are available to select'" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Print the third example\n", + "third_example = tweets['text'][20]\n", + "print(third_example)\n", + "print(f\"{'=' * 50}\")\n", + "\n", + "# Apply the function\n", + "remove_punct(third_example)" + ] + }, + { + "cell_type": "markdown", + "id": "853a4b83-f503-4405-aedd-66bbc088e3e7", + "metadata": { + "id": "853a4b83-f503-4405-aedd-66bbc088e3e7" + }, + "source": [ + "Let's give it a try with another tweet. What have you noticed?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5b3c2f60-fc92-4326-bad6-5ad04be50476", + "metadata": { + "id": "5b3c2f60-fc92-4326-bad6-5ad04be50476", + "outputId": "08e6402b-75b7-413c-9af0-619e62c64874" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "@VirginAmerica trying to add my boy Prince to my ressie. SF this Thursday @VirginAmerica from LAX http://t.co/GsB2J3c4gM\n", + "==================================================\n" + ] + }, + { + "data": { + "text/plain": [ + "'VirginAmerica trying to add my boy Prince to my ressie SF this Thursday VirginAmerica from LAX httptcoGsB2J3c4gM'" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Print another tweet\n", + "print(tweets['text'][100])\n", + "print(f\"{'=' * 50}\")\n", + "\n", + "# Apply the function\n", + "remove_punct(tweets['text'][100])" + ] + }, + { + "cell_type": "markdown", + "id": "1af02ce5-b674-4cb4-8e08-7d7416963f9c", + "metadata": { + "id": "1af02ce5-b674-4cb4-8e08-7d7416963f9c" + }, + "source": [ + "What about the following example?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6f8c3947-e6b8-42fe-8a58-15e4b6c60005", + "metadata": { + "id": "6f8c3947-e6b8-42fe-8a58-15e4b6c60005", + "outputId": "b2c4c6e4-a793-4f48-cf99-04d0e53a582e" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'Weve got quite a bit of punctuation here dont we Python DLab'" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Print a text with contraction\n", + "contraction_text = \"We've got quite a bit of punctuation here, don't we?!? #Python @D-Lab.\"\n", + "\n", + "# Apply the function\n", + "remove_punct(contraction_text)" + ] + }, + { + "cell_type": "markdown", + "id": "62574c66-db3f-4500-9c3b-cea2f3eb2a30", + "metadata": { + "id": "62574c66-db3f-4500-9c3b-cea2f3eb2a30" + }, + "source": [ + "⚠️ **Warning:** In many cases, we want to remove punctuation marks **after** tokenization, which we will discuss in a minute. This tells us that the **order** of preprocessing is a matter of importance!" 
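
To make the point about ordering concrete, here is a small sketch of the token-level alternative. It relies on `nltk`'s `word_tokenize`, which is only introduced in the Tokenization section below, so treat it as a preview (it assumes `nltk` is installed and its `punkt` tokenizer data has been downloaded):

```python
from string import punctuation
from nltk.tokenize import word_tokenize  # introduced in the Tokenization section below

contraction_text = "We've got quite a bit of punctuation here, don't we?!? #Python @D-Lab."

# Tokenize first, then drop only the tokens that consist entirely of punctuation
tokens = word_tokenize(contraction_text)
word_tokens = [tok for tok in tokens if not all(char in punctuation for char in tok)]
print(word_tokens)
```

Compared with `remove_punct(contraction_text)` above, the apostrophes inside the contractions survive as part of the tokens "'ve" and "n't" instead of being collapsed into "Weve" and "dont".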
+ ] + }, + { + "cell_type": "markdown", + "id": "58c6b85e-58e7-4f56-9b4a-b60c85b394ba", + "metadata": { + "id": "58c6b85e-58e7-4f56-9b4a-b60c85b394ba" + }, + "source": [ + "## 🥊 Challenge 1: Preprocessing with Multiple Steps\n", + "\n", + "So far we've learned a few preprocessing operations, let's put them together in a function! This function would be a handy one to refer to if you happen to work with some messy English text data, and you want to preprocess it with a single function.\n", + "\n", + "The example text data for challenge 1 is shown below. Write a function to:\n", + "- Lowercase the text\n", + "- Remove punctuation marks\n", + "- Remove extra whitespace characters\n", + "\n", + "Feel free to recycle the codes we've used above!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "deb10cba-239e-4856-b56d-7d5eb850c9b9", + "metadata": { + "id": "deb10cba-239e-4856-b56d-7d5eb850c9b9", + "outputId": "de879caf-a669-49ca-c979-1bfa08878ecc" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\n", + "This is a text file that has some extra blankspace at the start and end. Blankspace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.\n", + "\n", + "\n", + "The Python method called \"strip\" only catches blankspace at the start and end of a string. But it won't catch it in the middle,\t\tfor example,\n", + "\n", + "in this sentence.\t\tOnce again, regular expressions will\n", + "\n", + "help\t\tus with this.\n", + "\n", + "\n", + "\n" + ] + } + ], + "source": [ + "challenge1_path = '../data/example1.txt'\n", + "\n", + "with open(challenge1_path, 'r') as file:\n", + " challenge1 = file.read()\n", + "\n", + "print(challenge1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e2480823-65dd-4f52-a7b3-6d9b10d87912", + "metadata": { + "scrolled": true, + "id": "e2480823-65dd-4f52-a7b3-6d9b10d87912" + }, + "outputs": [], + "source": [ + "def clean_text(text):\n", + "\n", + " # Step 1: Lowercase\n", + " text = ...\n", + "\n", + " # Step 2: Use remove_punct to remove punctuation marks\n", + " text = ...\n", + "\n", + " # Step 3: Remove extra whitespace characters\n", + " text = ...\n", + "\n", + " return text" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc603506-0adb-45d7-bb6f-62958c054fdd", + "metadata": { + "scrolled": true, + "id": "dc603506-0adb-45d7-bb6f-62958c054fdd" + }, + "outputs": [], + "source": [ + "# Uncomment to apply the above function to challenge 1 text\n", + "# clean_text(challenge1)" + ] + }, + { + "cell_type": "markdown", + "id": "67c159cb-8eaa-4c30-b8ff-38a712d2bb0f", + "metadata": { + "id": "67c159cb-8eaa-4c30-b8ff-38a712d2bb0f" + }, + "source": [ + "## Task-specific Processes\n", + "\n", + "Now that we understand common preprocessing operations, there are still a few additional operations to consider. Our text data might require further normalization depending on the language, source, and content of the data.\n", + "\n", + "For example, if we are working with financial documents, we might want to standardize monetary symbols by converting them to digits. It our tweets data, there are numerous hashtags and URLs. These can be replaced with placeholders to simplify the subsequent analysis." 
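
As a small illustration of the monetary-symbol idea, a dollar amount such as the "$30" in one of the tweets above could be mapped to a placeholder with a regex. This is a hypothetical sketch: the pattern and the `MONEY` placeholder are our own choices, not part of the workshop materials:

```python
import re

# Hypothetical pattern: a dollar sign followed by digits, with an optional decimal part
money_pattern = r'\$\d+(?:\.\d{1,2})?'
money_repl = ' MONEY '

money_example = "@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing."
print(re.sub(money_pattern, money_repl, money_example))
```

As with the `URL` and `HASHTAG` placeholders below, any extra whitespace introduced by the replacement can be collapsed afterwards with the whitespace step from earlier.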
+ ] + }, + { + "cell_type": "markdown", + "id": "c2936cea-74e9-40c2-bfbe-6ba8129330de", + "metadata": { + "id": "c2936cea-74e9-40c2-bfbe-6ba8129330de" + }, + "source": [ + "### 🎬 **Demo**: Remove Hashtags and URLs\n", + "\n", + "Although URLs, hashtags, and numbers are informative in their own right, oftentimes we don't necessarily care about the exact meaning of each of them.\n", + "\n", + "While we could remove them completely, it's often informative to know that there **exists** a URL or a hashtag. In practice, we replace individual URLs and hashtags with a \"symbol\" that preserves the fact these structures exist in the text. It's standard to just use the strings \"URL\" and \"HASHTAG.\"\n", + "\n", + "Since these types of text often follow a regular structure, they're an apt case for using regular expressions. Let's apply these patterns to the tweets data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "03c0dc37-a013-4f0a-b72f-a1f64dc6c1bd", + "metadata": { + "id": "03c0dc37-a013-4f0a-b72f-a1f64dc6c1bd", + "outputId": "34f47412-6c5b-40fe-8158-cfc46843e81c" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "@VirginAmerica @virginmedia I'm flying your #fabulous #Seductive skies again! U take all the #stress away from travel http://t.co/ahlXHhKiyn\n" + ] + } + ], + "source": [ + "# Print the example tweet\n", + "url_tweet = tweets['text'][13]\n", + "print(url_tweet)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4ef61bea-ea11-468d-8176-a2f63659d204", + "metadata": { + "id": "4ef61bea-ea11-468d-8176-a2f63659d204", + "outputId": "e4111a70-d052-4c83-a871-532713cec6e2" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\"@VirginAmerica @virginmedia I'm flying your #fabulous #Seductive skies again! U take all the #stress away from travel URL \"" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# URL\n", + "url_pattern = r'(http|ftp|https):\\/\\/([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])'\n", + "url_repl = ' URL '\n", + "re.sub(url_pattern, url_repl, url_tweet)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea8e0f2a-460e-4088-aa89-dc2a8bc6f7fe", + "metadata": { + "id": "ea8e0f2a-460e-4088-aa89-dc2a8bc6f7fe", + "outputId": "cac62ffa-3bbe-4f10-956e-8127e7598560" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\"@VirginAmerica @virginmedia I'm flying your HASHTAG HASHTAG skies again! U take all the HASHTAG away from travel http://t.co/ahlXHhKiyn\"" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Hashtag\n", + "hashtag_pattern = r'(?:^|\\s)[##]{1}(\\w+)'\n", + "hashtag_repl = ' HASHTAG '\n", + "re.sub(hashtag_pattern, hashtag_repl, url_tweet)" + ] + }, + { + "cell_type": "markdown", + "id": "71d68d49-4923-49c0-9113-b844dc7546b9", + "metadata": { + "id": "71d68d49-4923-49c0-9113-b844dc7546b9" + }, + "source": [ + "\n", + "\n", + "# Tokenization\n", + "\n", + "## Tokenizers Before LLMs\n", + "\n", + "One of the most important steps in text analysis is tokenization. This is the process of breaking a long sequence of text into word tokens. With these tokens available, we are ready to perform word-level analysis. 
For instance, we can filter out tokens that don't contribute to the core meaning of the text.\n", + "\n", + "In this section, we'll introduce how to perform tokenization using `nltk`, `spaCy`, and a Large Language Model (`BERT`). The purpose is to expose you to different NLP packages, help you understand their functionalities, and demonstrate how to access key functions in each package.\n", + "\n", + "### `nltk`\n", + "\n", + "The first package we'll be using is called **Natural Language Toolkit**, or `nltk`.\n", + "\n", + "Let's import it and download a couple of data packages it needs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "441d81f8-361e-4273-bd36-91a272f4a38a", + "metadata": { + "scrolled": true, + "id": "441d81f8-361e-4273-bd36-91a272f4a38a" + }, + "outputs": [], + "source": [ + "import nltk" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "64b327cc-5c77-4fdc-9aaf-17d7f0761237", + "metadata": { + "id": "64b327cc-5c77-4fdc-9aaf-17d7f0761237" + }, + "outputs": [], + "source": [ + "# Uncomment the following lines to download these data packages\n", + "# nltk.download('wordnet')\n", + "# nltk.download('stopwords')\n", + "# nltk.download('punkt')" + ] + }, + { + "cell_type": "markdown", + "id": "6e79b699-c3a5-489f-9b3c-95653aba34d6", + "metadata": { + "id": "6e79b699-c3a5-489f-9b3c-95653aba34d6" + }, + "source": [ + "`nltk` has a function called `word_tokenize`. It requires one argument, which is the text to be tokenized, and it returns a list of tokens for us." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7b5d6944-c641-4fac-a239-5947a496371c", + "metadata": { + "id": "7b5d6944-c641-4fac-a239-5947a496371c", + "outputId": "d5f5f3a2-cbe8-44d1-d23c-075dad312e88" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP\n" + ] + } + ], + "source": [ + "# Load word_tokenize\n", + "from nltk.tokenize import word_tokenize\n", + "\n", + "# Print the example\n", + "text = tweets['text'][7]\n", + "print(text)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "95fde2a3-e4e2-4e61-ad54-e4d5d0a6ba71", + "metadata": { + "id": "95fde2a3-e4e2-4e61-ad54-e4d5d0a6ba71", + "outputId": "fec4a1cb-12a4-4353-8adb-37d89fb9bf0b" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['@',\n", + " 'VirginAmerica',\n", + " 'Really',\n", + " 'missed',\n", + " 'a',\n", + " 'prime',\n", + " 'opportunity',\n", + " 'for',\n", + " 'Men',\n", + " 'Without',\n", + " 'Hats',\n", + " 'parody',\n", + " ',',\n", + " 'there',\n", + " '.',\n", + " 'https',\n", + " ':',\n", + " '//t.co/mWpG7grEZP']" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Apply the NLTK tokenizer\n", + "nltk_tokens = word_tokenize(text)\n", + "nltk_tokens" + ] + }, + { + "cell_type": "markdown", + "id": "80ead039-7721-4b22-8590-0d7824631675", + "metadata": { + "id": "80ead039-7721-4b22-8590-0d7824631675" + }, + "source": [ + "Here we are, with a list of tokens identified by `nltk`. Let's take a minute to inspect them!\n", + "\n", + "🔔 **Question**: Do word boundaries decided by `nltk` make sense to you? Pay attention to the Twitter handle and the URL in the example tweet.\n", + "\n", + "You may feel that accessing functions in `nltk` is pretty straightforward. 
The function we used above was imported from the `nltk.tokenize` module, which, as the name suggests, primarily does the job of tokenization.\n", + "\n", + "Internally, `nltk` has [a collection of modules](https://www.nltk.org/api/nltk.html) that fulfill different purposes, to name a few:\n", + "\n", + "| NLTK module | Function | Link |\n", + "|---------------|---------------------------|--------------------------------------------------------------|\n", + "| nltk.tokenize | Tokenization | [Documentation](https://www.nltk.org/api/nltk.tokenize.html) |\n", + "| nltk.corpus | Retrieve built-in corpora | [Documentation](https://www.nltk.org/nltk_data/) |\n", + "| nltk.tag | Part-of-speech tagging | [Documentation](https://www.nltk.org/api/nltk.tag.html) |\n", + "| nltk.stem | Stemming | [Documentation](https://www.nltk.org/api/nltk.stem.html) |\n", + "| ... | ... | ... |\n", + "\n", + "Let's import `stopwords` from the `nltk.corpus` module, which hosts a range of built-in corpora." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "84bbfced-8803-41ca-9cae-49bdadf8c000", + "metadata": { + "id": "84bbfced-8803-41ca-9cae-49bdadf8c000" + }, + "outputs": [], + "source": [ + "# Load predefined stop words from nltk\n", + "from nltk.corpus import stopwords" + ] + }, + { + "cell_type": "markdown", + "id": "dee971a1-1189-4cb6-8317-4836f54c3ae2", + "metadata": { + "id": "dee971a1-1189-4cb6-8317-4836f54c3ae2" + }, + "source": [ + "Let's specify that we want to retrieve English stop words. The function simply returns a list of stop words, mostly function words, that `nltk` identifies." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6009e1df-b720-4d22-a162-7fd250a58672", + "metadata": { + "id": "6009e1df-b720-4d22-a162-7fd250a58672", + "outputId": "afdf6d7c-8b52-4620-ce4a-4b48e17f5435" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Print the first 10 stopwords\n", + "stop = stopwords.words('english')\n", + "stop[:10]" + ] + }, + { + "cell_type": "markdown", + "id": "4c3ec908-de6c-42c5-a370-f1b1df0032b3", + "metadata": { + "id": "4c3ec908-de6c-42c5-a370-f1b1df0032b3" + }, + "source": [ + "### `spaCy`\n", + "Besides `nltk`, we have another widely used package called `spaCy`.\n", + "\n", + "`spaCy` has its own processing pipeline. It takes in a string of text, runs the `nlp` pipeline on it, and stores the processed text and its annotations in an object called `doc`. The `nlp` pipeline always performs tokenization and runs any [other text analysis components](https://spacy.io/usage/processing-pipelines#custom-components) requested by the user. These components are pretty similar to modules in `nltk`." + ] + }, + { + "cell_type": "markdown", + "id": "c6a0facd-4b75-41ac-920c-5ea044f7ae2e", + "metadata": { + "id": "c6a0facd-4b75-41ac-920c-5ea044f7ae2e" + }, + "source": [ + "\"spacy" + ] + }, + { + "cell_type": "markdown", + "id": "c3ef1eaf-2790-4928-b094-943f2803c6a0", + "metadata": { + "id": "c3ef1eaf-2790-4928-b094-943f2803c6a0" + }, + "source": [ + "Note that we always start by initializing the `nlp` pipeline, depending on the language of the text. Here, we are loading a pretrained language model for English: `en_core_web_sm`. 
The name suggests that it is a lightweight model trained on some text data (e.g., blogs); see model descriptions [here](https://spacy.io/models/en#en_core_web_sm).\n", + "\n", + "This is the first time we encounter the concept of **pretraining**, though you may have heard it elsewhere. In the context of NLP, pretraining means that the model has been trained on a vast amount of data. As a result, it comes with a certain \"knowledge\" of word structure and grammar of the language.\n", + "\n", + "Therefore, when we apply the model to our own data, we can expect it to be reasonably accurate in performing various annotation tasks, e.g., tagging a word's part of speech, identifying the syntactic head of a phrase, etc.\n", + "\n", + "Let's dive in! We'll first need to load the pretrained language model we installed earlier." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "524dfc02-aa8f-4888-9f81-74a570db72b7", + "metadata": { + "id": "524dfc02-aa8f-4888-9f81-74a570db72b7" + }, + "outputs": [], + "source": [ + "import spacy\n", + "nlp = spacy.load('en_core_web_sm')" + ] + }, + { + "cell_type": "markdown", + "id": "57d669c3-2f5a-41b6-893b-ea1d438b3a48", + "metadata": { + "id": "57d669c3-2f5a-41b6-893b-ea1d438b3a48" + }, + "source": [ + "The `nlp` pipeline, by default, includes a set of components, which we can access via the `.pipe_names` attribute.\n", + "\n", + "You may notice that it doesn't include the tokenizer. Don't worry! The tokenizer is a special component that the pipeline always includes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6d581ca5-43f8-4ef9-b099-2fc92c324581", + "metadata": { + "id": "6d581ca5-43f8-4ef9-b099-2fc92c324581", + "outputId": "aaf24ff4-eb70-4347-881a-82d98c50050d" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Retrieve components included in NLP pipeline\n", + "nlp.pipe_names" + ] + }, + { + "cell_type": "markdown", + "id": "d1e37f91-d174-4101-bfc6-2859cb0fe5cc", + "metadata": { + "id": "d1e37f91-d174-4101-bfc6-2859cb0fe5cc" + }, + "source": [ + "Let's run the `nlp` pipeline on our example tweet data, and assign it to a variable `doc`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "618e8558-625d-4546-8109-63f9bae9790f", + "metadata": { + "id": "618e8558-625d-4546-8109-63f9bae9790f" + }, + "outputs": [], + "source": [ + "# Apply the pipeline to example tweet\n", + "doc = nlp(tweets['text'][7])" + ] + }, + { + "cell_type": "markdown", + "id": "54325d60-5c5c-488d-baf2-7eed4de2c031", + "metadata": { + "id": "54325d60-5c5c-488d-baf2-7eed4de2c031" + }, + "source": [ + "Under the hood, the `doc` object contains the tokens (created by the tokenizer) and their annotations (created by other components), which are [linguistic features](\n", + "https://spacy.io/usage/linguistic-features) useful for text analysis. 
We retrieve the token and its annotations by accessing corresponding attributes.\n", + "\n", + "| Attribute | Annotation | Link |\n", + "|----------------|-----------------------------------------|---------------------------------------------------------------------------|\n", + "| token.text | The token in verbatim text | [Documentation](https://spacy.io/api/token#attributes) |\n", + "| token.is_stop | Whether the token is a stop word | [Documentation](https://spacy.io/api/attributes#_title) |\n", + "| token.is_punct | Whether the token is a punctuation mark | [Documentation](https://spacy.io/api/attributes#_title) |\n", + "| token.lemma_ | The base form of the token | [Documentation](https://spacy.io/usage/linguistic-features#lemmatization) |\n", + "| token.pos_ | The simple POS-tag of the token | [Documentation](https://spacy.io/usage/linguistic-features#pos-tagging) |\n", + "| ... | ... | ... |" + ] + }, + { + "cell_type": "markdown", + "id": "2e9f23c8-a157-44a7-a6ec-6894aec1a595", + "metadata": { + "id": "2e9f23c8-a157-44a7-a6ec-6894aec1a595" + }, + "source": [ + "Let's first get the tokens themselves! We'll iterate over the `doc` object and retrieve the text of each token." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c71efee-e6cf-46c4-9198-593304f6560d", + "metadata": { + "id": "4c71efee-e6cf-46c4-9198-593304f6560d", + "outputId": "51268a29-88c9-47d0-8d21-85bf1f20e30c" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['@VirginAmerica',\n", + " 'Really',\n", + " 'missed',\n", + " 'a',\n", + " 'prime',\n", + " 'opportunity',\n", + " 'for',\n", + " 'Men',\n", + " 'Without',\n", + " 'Hats',\n", + " 'parody',\n", + " ',',\n", + " 'there',\n", + " '.',\n", + " 'https://t.co/mWpG7grEZP']" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Get the verbatim texts of tokens\n", + "spacy_tokens = [token.text for token in doc]\n", + "spacy_tokens" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f4fc23f0-c699-45e6-ad62-e131036d601f", + "metadata": { + "id": "f4fc23f0-c699-45e6-ad62-e131036d601f", + "outputId": "6df437e3-5d00-4ef7-d3e1-0e25a383ca30" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['@',\n", + " 'VirginAmerica',\n", + " 'Really',\n", + " 'missed',\n", + " 'a',\n", + " 'prime',\n", + " 'opportunity',\n", + " 'for',\n", + " 'Men',\n", + " 'Without',\n", + " 'Hats',\n", + " 'parody',\n", + " ',',\n", + " 'there',\n", + " '.',\n", + " 'https',\n", + " ':',\n", + " '//t.co/mWpG7grEZP']" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Get the NLTK tokens\n", + "nltk_tokens" + ] + }, + { + "cell_type": "markdown", + "id": "a0ace59e-40e0-42b3-9f2b-d30ac94dccab", + "metadata": { + "id": "a0ace59e-40e0-42b3-9f2b-d30ac94dccab" + }, + "source": [ + "🔔 **Question**: Let's pause for a minute to compare the tokens generated by `nltk` and `spaCy`. What have you noticed?\n", + "\n", + "Remember, we can also access various annotations of these tokens. For instance, one annotation `spaCy` offers is that it conveniently encodes whether a token is a stop word."
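+ , + "\n", + "A side note on the comparison above: if neither segmentation of the handle and URL suits your needs, `nltk` also ships a Twitter-aware tokenizer, `TweetTokenizer` from `nltk.tokenize`, which is designed to keep handles and URLs together. A minimal sketch:\n", + "\n", + "```python\n", + "# Sketch: nltk's Twitter-aware tokenizer as an alternative to word_tokenize\n", + "from nltk.tokenize import TweetTokenizer\n", + "\n", + "tweet_tokenizer = TweetTokenizer()\n", + "tweet_tokenizer.tokenize(text)  # the handle and the URL should stay intact as single tokens\n", + "```\n", + "\n", + "Now, back to the stop-word annotation."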
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "626af687-e986-4c97-af86-edf7dbd22c3e", + "metadata": { + "id": "626af687-e986-4c97-af86-edf7dbd22c3e", + "outputId": "2cef747e-4504-4eeb-9384-b7dca14cd72e" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[False,\n", + " True,\n", + " False,\n", + " True,\n", + " False,\n", + " False,\n", + " True,\n", + " False,\n", + " True,\n", + " False,\n", + " False,\n", + " False,\n", + " True,\n", + " False,\n", + " False]" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Retrieve the is_stop annotation\n", + "spacy_stops = [token.is_stop for token in doc]\n", + "\n", + "# The results are boolean values\n", + "spacy_stops" + ] + }, + { + "cell_type": "markdown", + "id": "3b6548b6-7e89-4f42-b8cb-bf7c93b34eb4", + "metadata": { + "id": "3b6548b6-7e89-4f42-b8cb-bf7c93b34eb4" + }, + "source": [ + "## 🥊 Challenge 2: Remove Stop Words\n", + "\n", + "We now know how `nltk` and `spaCy` work as NLP packages. We've also demonstrated how to identify stop words with each package.\n", + "\n", + "Let's write **two** functions to remove stop words from our text data.\n", + "\n", + "- Complete the function for stop words removal using `nltk`\n", + " - The starter code requires two arguments: the raw text input and a list of predefined stop words\n", + "- Complete the function for stop words removal using `spaCy`\n", + " - The starter code requires one argument: the raw text input\n", + "\n", + "A friendly reminder before we dive in: both functions take raw text as input—that's a signal to perform tokenization on the raw text first!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b24b5370-392f-420d-8c9e-78146d0fca29", + "metadata": { + "id": "b24b5370-392f-420d-8c9e-78146d0fca29" + }, + "outputs": [], + "source": [ + "def remove_stopword_nltk(raw_text, stopword):\n", + "\n", + " # Step 1: Tokenization with nltk\n", + " # YOUR CODE HERE\n", + "\n", + " # Step 2: Filter out tokens in the stop word list\n", + " # YOUR CODE HERE" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65b5bbda-0af5-49a9-8f5e-77d61ab217e7", + "metadata": { + "id": "65b5bbda-0af5-49a9-8f5e-77d61ab217e7" + }, + "outputs": [], + "source": [ + "def remove_stopword_spacy(raw_text):\n", + "\n", + " # Step 1: Apply the nlp pipeline\n", + " # YOUR CODE HERE\n", + "\n", + " # Step 2: Filter out tokens that are stop words\n", + " # YOUR CODE HERE" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f4c3b0da-2223-4d7f-9014-696498e804e6", + "metadata": { + "id": "f4c3b0da-2223-4d7f-9014-696498e804e6" + }, + "outputs": [], + "source": [ + "# remove_stopword_nltk(text, stop)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f83538ba-6bf1-49ca-90ec-b6532b1ffcb3", + "metadata": { + "id": "f83538ba-6bf1-49ca-90ec-b6532b1ffcb3" + }, + "outputs": [], + "source": [ + "# remove_stopword_spacy(text)" + ] + }, + { + "cell_type": "markdown", + "id": "d3a6b1ec-87cc-4a08-a5dd-0210a9c56f0b", + "metadata": { + "id": "d3a6b1ec-87cc-4a08-a5dd-0210a9c56f0b" + }, + "source": [ + "## 🎬 **Demo**: Powerful Features from `spaCy`\n", + "\n", + "`spaCy`'s `nlp` pipeline includes a number of linguistic annotations, which could be very useful for text analysis.\n", + "\n", + "For instance, we can access more annotations such as the lemma, the part-of-speech tag and its meaning, and whether the token looks like a URL."
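+ , + "\n", + "These flags can also feed back into preprocessing. For example, as a rough sketch, `token.like_url` offers a tokenizer-level alternative to the URL regex we wrote earlier:\n", + "\n", + "```python\n", + "# Sketch: drop URL-like tokens using spaCy's like_url flag instead of a regex\n", + "[token.text for token in doc if not token.like_url]\n", + "```\n", + "\n", + "The next cell prints several of these annotations side by side."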
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eb6c7d93-51a3-4fb8-8321-cb672f4f1b8f", + "metadata": { + "id": "eb6c7d93-51a3-4fb8-8321-cb672f4f1b8f", + "outputId": "768e0e28-4d51-4423-e6c7-001afe4567a3" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "@VirginAmerica | @VirginAmerica | PROPN | proper noun | 0 |\n", + "Really | really | ADV | adverb | 0 |\n", + "missed | miss | VERB | verb | 0 |\n", + "a | a | DET | determiner | 0 |\n", + "prime | prime | ADJ | adjective | 0 |\n", + "opportunity | opportunity | NOUN | noun | 0 |\n", + "for | for | ADP | adposition | 0 |\n", + "Men | Men | PROPN | proper noun | 0 |\n", + "Without | without | ADP | adposition | 0 |\n", + "Hats | Hats | PROPN | proper noun | 0 |\n", + "parody | parody | NOUN | noun | 0 |\n", + ", | , | PUNCT | punctuation | 0 |\n", + "there | there | ADV | adverb | 0 |\n", + ". | . | PUNCT | punctuation | 0 |\n", + "https://t.co/mWpG7grEZP | https://t.co/mWpG7grEZP | PROPN | proper noun | 1 |\n" + ] + } ], - "text/plain": [ - " tweet_id airline_sentiment airline_sentiment_confidence \\\n", - "0 570306133677760513 neutral 1.0000 \n", - "1 570301130888122368 positive 0.3486 \n", - "2 570301083672813571 neutral 0.6837 \n", - "3 570301031407624196 negative 1.0000 \n", - "4 570300817074462722 negative 1.0000 \n", - "\n", - " negativereason negativereason_confidence airline \\\n", - "0 NaN NaN Virgin America \n", - "1 NaN 0.0000 Virgin America \n", - "2 NaN NaN Virgin America \n", - "3 Bad Flight 0.7033 Virgin America \n", - "4 Can't Tell 1.0000 Virgin America \n", - "\n", - " airline_sentiment_gold name negativereason_gold retweet_count \\\n", - "0 NaN cairdin NaN 0 \n", - "1 NaN jnardino NaN 0 \n", - "2 NaN yvonnalynn NaN 0 \n", - "3 NaN jnardino NaN 0 \n", - "4 NaN jnardino NaN 0 \n", - "\n", - " text tweet_coord \\\n", - "0 @VirginAmerica What @dhepburn said. NaN \n", - "1 @VirginAmerica plus you've added commercials t... NaN \n", - "2 @VirginAmerica I didn't today... Must mean I n... NaN \n", - "3 @VirginAmerica it's really aggressive to blast... NaN \n", - "4 @VirginAmerica and it's a really big bad thing... NaN \n", - "\n", - " tweet_created tweet_location user_timezone \n", - "0 2015-02-24 11:35:52 -0800 NaN Eastern Time (US & Canada) \n", - "1 2015-02-24 11:15:59 -0800 NaN Pacific Time (US & Canada) \n", - "2 2015-02-24 11:15:48 -0800 Lets Play Central Time (US & Canada) \n", - "3 2015-02-24 11:15:36 -0800 NaN Pacific Time (US & Canada) \n", - "4 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada) " - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Show the first five rows\n", - "tweets.head()" - ] - }, - { - "cell_type": "markdown", - "id": "ae3b339f-45cf-465d-931c-05f9096fd510", - "metadata": {}, - "source": [ - "The dataframe has one row per tweet. The text of tweet is shown in the `text` column.\n", - "- `text` (`str`): the text of the tweet.\n", - "\n", - "Other metadata we are interested in include: \n", - "- `airline_sentiment` (`str`): the sentiment of the tweet, labeled as \"neutral,\" \"positive,\" or \"negative.\"\n", - "- `airline` (`str`): the airline that is tweeted about.\n", - "- `retweet count` (`int`): how many times the tweet was retweeted." 
- ] - }, - { - "cell_type": "markdown", - "id": "302c695b-4bd1-4151-9cb9-ef5253eb16df", - "metadata": {}, - "source": [ - "Let's take a look at some of the tweets:" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "b690daab-7be5-4b8f-8af0-a91fdec4ec4f", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "@VirginAmerica What @dhepburn said.\n", - "@VirginAmerica plus you've added commercials to the experience... tacky.\n", - "@VirginAmerica I didn't today... Must mean I need to take another trip!\n" - ] - } - ], - "source": [ - "print(tweets['text'].iloc[0])\n", - "print(tweets['text'].iloc[1])\n", - "print(tweets['text'].iloc[2])" - ] - }, - { - "cell_type": "markdown", - "id": "8adc05fa-ad30-4402-ab56-086bcb09a166", - "metadata": {}, - "source": [ - "🔔 **Question**: What have you noticed? What are the stylistic features of tweets?" - ] - }, - { - "cell_type": "markdown", - "id": "c3460393-00a6-461c-b02a-9e98f9b5d1af", - "metadata": {}, - "source": [ - "### Lowercasing\n", - "\n", - "While we acknowledge that a word's casing is informative, we often don't work in contexts where we can properly utilize this information.\n", - "\n", - "More often, the subsequent analysis we perform is **case-insensitive**. For instance, in frequency analysis, we want to account for various forms of the same word. Lowercasing the text data aids in this process and simplifies our analysis.\n", - "\n", - "We can easily achieve lowercasing with the string method [`.lower()`](https://docs.python.org/3/library/stdtypes.html#str.lower); see [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) for more useful functions.\n", - "\n", - "Let's apply it to the following example:" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "58a95d90-3ef1-4bff-9cfe-d447ed99f252", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "@VirginAmerica I was scheduled for SFO 2 DAL flight 714 today. Changed to 24th due weather. Looks like flight still on?\n" - ] - } - ], - "source": [ - "# Print the first example\n", - "first_example = tweets['text'][108]\n", - "print(first_example)" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "c66d91c0-6eed-4591-95fc-cd2eae2e0d41", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "False\n", - "==================================================\n", - "@virginamerica i was scheduled for sfo 2 dal flight 714 today. changed to 24th due weather. looks like flight still on?\n", - "==================================================\n", - "@VIRGINAMERICA I WAS SCHEDULED FOR SFO 2 DAL FLIGHT 714 TODAY. CHANGED TO 24TH DUE WEATHER. LOOKS LIKE FLIGHT STILL ON?\n" - ] - } - ], - "source": [ - "# Check if all characters are in lowercase\n", - "print(first_example.islower())\n", - "print(f\"{'=' * 50}\")\n", - "\n", - "# Convert it to lowercase\n", - "print(first_example.lower())\n", - "print(f\"{'=' * 50}\")\n", - "\n", - "# Convert it to uppercase\n", - "print(first_example.upper())" - ] - }, - { - "cell_type": "markdown", - "id": "7bf0d8c8-bd6c-47ef-b305-09ac61d07d4d", - "metadata": {}, - "source": [ - "### Remove Extra Whitespace Characters\n", - "\n", - "Sometimes we might come across texts with extraneous whitespace, such as spaces, tabs, and newline characters, which is particularly common when the text is scrapped from web pages. 
Before we dive into the details, let's briefly introduce Regular Expressions (regex) and the `re` package. \n", - "\n", - "Regular expressions are a powerful way of searching for specific string patterns in large corpora. They have an infamously steep learning curve, but they can be very efficient when we get a handle on them. Many NLP packages heavily rely on regex under the hood. Regex testers, such as [regex101](https://regex101.com), are useful tools in both understanding and creating regex expressions.\n", - "\n", - "Our goal in this workshop is not to provide a deep (or even shallow) dive into regex; instead, we want to expose you to them so that you are better prepared to do deep dives in the future!\n", - "\n", - "The following example is a poem by William Wordsworth. Like many poems, the text may contain extra line breaks (i.e., newline characters, `\\n`) that we want to remove." - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "d1bd73f1-a30f-4269-a05e-47cfff7b496f", - "metadata": {}, - "outputs": [], - "source": [ - "# File path to the poem\n", - "text_path = '../data/poem_wordsworth.txt'\n", - "\n", - "# Read the poem in\n", - "with open(text_path, 'r') as file:\n", - " text = file.read()\n", - " file.close()" - ] - }, - { - "cell_type": "markdown", - "id": "7a693dd9-9706-40b3-863f-f568020245f7", - "metadata": {}, - "source": [ - "As you can see, the poem is formatted as a continuous string of text with line breaks placed at the end of each line, making it difficult to read. " - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "7e78a75a-8e15-4bcb-a416-783aa7f60ef3", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "\"I wandered lonely as a cloud\\n\\n\\nI wandered lonely as a cloud\\nThat floats on high o'er vales and hills,\\nWhen all at once I saw a crowd,\\nA host, of golden daffodils;\\nBeside the lake, beneath the trees,\\nFluttering and dancing in the breeze.\\n\\nContinuous as the stars that shine\\nAnd twinkle on the milky way,\\nThey stretched in never-ending line\\nAlong the margin of a bay:\\nTen thousand saw I at a glance,\\nTossing their heads in sprightly dance.\\n\\nThe waves beside them danced; but they\\nOut-did the sparkling waves in glee:\\nA poet could not but be gay,\\nIn such a jocund company:\\nI gazed—and gazed—but little thought\\nWhat wealth the show to me had brought:\\n\\nFor oft, when on my couch I lie\\nIn vacant or in pensive mood,\\nThey flash upon that inward eye\\nWhich is the bliss of solitude;\\nAnd then my heart with pleasure fills,\\nAnd dances with the daffodils.\"" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "text" - ] - }, - { - "cell_type": "markdown", - "id": "47cce993-c315-4aaa-87fe-149de8607f65", - "metadata": {}, - "source": [ - "One handy function we can use to display the poem properly is `.splitlines()`. As the name suggests, it splits a long text sequence into a list of lines whenever there is a newline character. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "ddeade7a-065d-49e6-bdd3-87a8ea8f6e6e", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['I wandered lonely as a cloud',\n", - " '',\n", - " '',\n", - " 'I wandered lonely as a cloud',\n", - " \"That floats on high o'er vales and hills,\",\n", - " 'When all at once I saw a crowd,',\n", - " 'A host, of golden daffodils;',\n", - " 'Beside the lake, beneath the trees,',\n", - " 'Fluttering and dancing in the breeze.',\n", - " '',\n", - " 'Continuous as the stars that shine',\n", - " 'And twinkle on the milky way,',\n", - " 'They stretched in never-ending line',\n", - " 'Along the margin of a bay:',\n", - " 'Ten thousand saw I at a glance,',\n", - " 'Tossing their heads in sprightly dance.',\n", - " '',\n", - " 'The waves beside them danced; but they',\n", - " 'Out-did the sparkling waves in glee:',\n", - " 'A poet could not but be gay,',\n", - " 'In such a jocund company:',\n", - " 'I gazed—and gazed—but little thought',\n", - " 'What wealth the show to me had brought:',\n", - " '',\n", - " 'For oft, when on my couch I lie',\n", - " 'In vacant or in pensive mood,',\n", - " 'They flash upon that inward eye',\n", - " 'Which is the bliss of solitude;',\n", - " 'And then my heart with pleasure fills,',\n", - " 'And dances with the daffodils.']" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Split the single string into a list of lines\n", - "text.splitlines()" - ] - }, - { - "cell_type": "markdown", - "id": "44d3825b-0857-44e1-bf6a-d8c7a9032704", - "metadata": {}, - "source": [ - "Let's return to our tweet data for an example." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "53a81ea9-65c4-474a-8530-35393555d1be", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "\"@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\\nit's really the only bad thing about flying VA\"" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Print the second example\n", - "second_example = tweets['text'][5]\n", - "second_example" - ] - }, - { - "cell_type": "markdown", - "id": "aef55865-36fd-4c06-a765-530cf3b53096", - "metadata": {}, - "source": [ - "In this case, we don't really want to split the tweet into a list of strings. We still expect a single string of text but would like to remove the line break completely from the string.\n", - "\n", - "The string method `.strip()` effectively does the job of stripping away spaces at both ends of the text. However, it won't work in our example as the newline character is in the middle of the string." - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "b933503b-4370-4dc4-b287-6dc2f9cdb1d4", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "\"@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\\nit's really the only bad thing about flying VA\"" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Strip only removed blankspace at both ends\n", - "second_example.strip()" - ] - }, - { - "cell_type": "markdown", - "id": "b99b80b4-804f-460f-a2d5-adbd654902b3", - "metadata": {}, - "source": [ - "This is where regex could be really helpful." 
- ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "ceac9714-7053-4b2e-affb-71f8c3d2dcd9", - "metadata": {}, - "outputs": [], - "source": [ - "import re" - ] - }, - { - "cell_type": "markdown", - "id": "d5f08d20-ba81-4e48-9e2a-5728148005b3", - "metadata": {}, - "source": [ - "Now, with regex, we are essentially calling it to match a pattern that we have identified in the text data, and we want to do some operations to the matched part—extract it, replace it with something else, or remove it completely. Therefore, the way regex works could be unpacked into the following steps:\n", - "\n", - "- Identify and write the pattern in regex (`r'PATTERN'`)\n", - "- Write the replacement for the pattern (`'REPLACEMENT'`)\n", - "- Call the specific regex function (e.g., `re.sub()`)\n", - "\n", - "In our example, the pattern we are looking for is `\\s`, which is the regex short name for any whitespace character (`\\n` and `\\t` included). We also add a quantifier `+` to the end: `\\s+`. It means we'd like to capture one or more occurences of the whitespace character." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "1248d227-1149-4014-94a5-c05592a27a7e", - "metadata": {}, - "outputs": [], - "source": [ - "# Write a pattern in regex\n", - "blankspace_pattern = r'\\s+'" - ] - }, - { - "cell_type": "markdown", - "id": "cc075c2e-1a1d-4393-a3ea-8ad7c118364b", - "metadata": {}, - "source": [ - "The replacement for one or more whitespace characters is exactly one single whitespace, which is the canonical word boundary in English. Any additional whitespace will be reduced to a single whitespace. " - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "c55cb2f1-f4ca-4b79-900c-f65ec303ddac", - "metadata": {}, - "outputs": [], - "source": [ - "# Write a replacement for the pattern identfied\n", - "blankspace_repl = ' '" - ] - }, - { - "cell_type": "markdown", - "id": "bc12e3d1-728a-429b-9c83-4dcc88590bc4", - "metadata": {}, - "source": [ - "Lastly, let's put everything together using the function [`re.sub()`](https://docs.python.org/3.11/library/re.html#re.sub), which means we want to substitute a pattern with a replacement. The function takes in three arguments—the pattern, the replacement, and the string to which we want to apply the function." - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "5249b24b-7111-4569-be29-c40efa5e148e", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing. it's really the only bad thing about flying VA\n" - ] - } - ], - "source": [ - "# Replace whitespace(s) with ' '\n", - "clean_text = re.sub(pattern = blankspace_pattern, \n", - " repl = blankspace_repl, \n", - " string = second_example)\n", - "print(clean_text)" - ] - }, - { - "cell_type": "markdown", - "id": "a895fbe3-a034-4124-94af-72a528913c51", - "metadata": {}, - "source": [ - "Ta-da! The newline character is no longer there." - ] - }, - { - "cell_type": "markdown", - "id": "7087dc0c-5fef-4f1c-8662-7cbc8a978f34", - "metadata": {}, - "source": [ - "### Remove Punctuation Marks\n", - "\n", - "Sometimes we are only interested in analyzing **alphanumeric characters** (i.e., the letters and numbers), in which case we might want to remove punctuation marks. \n", - "\n", - "The `string` module contains a list of predefined punctuation marks. Let's print them out." 
- ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "70e8502b-b703-45e0-8852-0c3210363440", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\n" - ] - } - ], - "source": [ - "# Load in a predefined list of punctuation marks\n", - "from string import punctuation\n", - "print(punctuation)" - ] - }, - { - "cell_type": "markdown", - "id": "91119c9e-431c-42cb-afea-f7e607698929", - "metadata": {}, - "source": [ - "In practice, to remove these punctuation characters, we can simply iterate over the text and remove characters found in the list, such as shown below in the `remove_punct` function." - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "237d868d-339d-4bbe-9a3b-20fa5fbdf231", - "metadata": {}, - "outputs": [], - "source": [ - "def remove_punct(text):\n", - " '''Remove punctuation marks in input text'''\n", - " \n", - " # Select characters not in puncutaion\n", - " no_punct = []\n", - " for char in text:\n", - " if char not in punctuation:\n", - " no_punct.append(char)\n", - "\n", - " # Join the characters into a string\n", - " text_no_punct = ''.join(no_punct) \n", - " \n", - " return text_no_punct" - ] - }, - { - "cell_type": "markdown", - "id": "d4fc768b-c2dd-4386-8212-483c4485e4be", - "metadata": {}, - "source": [ - "Let's apply the function to the example below. " - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "7596c465-3d85-4b72-a853-f2151bcd91df", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "@VirginAmerica why are your first fares in May over three times more than other carriers when all seats are available to select???\n", - "==================================================\n" - ] - }, - { - "data": { - "text/plain": [ - "'VirginAmerica why are your first fares in May over three times more than other carriers when all seats are available to select'" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Print the third example\n", - "third_example = tweets['text'][20]\n", - "print(third_example)\n", - "print(f\"{'=' * 50}\")\n", - "\n", - "# Apply the function \n", - "remove_punct(third_example)" - ] - }, - { - "cell_type": "markdown", - "id": "853a4b83-f503-4405-aedd-66bbc088e3e7", - "metadata": {}, - "source": [ - "Let's give it a try with another tweet. What have you noticed?" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "5b3c2f60-fc92-4326-bad6-5ad04be50476", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "@VirginAmerica trying to add my boy Prince to my ressie. SF this Thursday @VirginAmerica from LAX http://t.co/GsB2J3c4gM\n", - "==================================================\n" - ] - }, - { - "data": { - "text/plain": [ - "'VirginAmerica trying to add my boy Prince to my ressie SF this Thursday VirginAmerica from LAX httptcoGsB2J3c4gM'" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Print another tweet\n", - "print(tweets['text'][100])\n", - "print(f\"{'=' * 50}\")\n", - "\n", - "# Apply the function\n", - "remove_punct(tweets['text'][100])" - ] - }, - { - "cell_type": "markdown", - "id": "1af02ce5-b674-4cb4-8e08-7d7416963f9c", - "metadata": {}, - "source": [ - "What about the following example?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "6f8c3947-e6b8-42fe-8a58-15e4b6c60005", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'Weve got quite a bit of punctuation here dont we Python DLab'" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Print a text with contraction\n", - "contraction_text = \"We've got quite a bit of punctuation here, don't we?!? #Python @D-Lab.\"\n", - "\n", - "# Apply the function\n", - "remove_punct(contraction_text)" - ] - }, - { - "cell_type": "markdown", - "id": "62574c66-db3f-4500-9c3b-cea2f3eb2a30", - "metadata": {}, - "source": [ - "⚠️ **Warning:** In many cases, we want to remove punctuation marks **after** tokenization, which we will discuss in a minute. This tells us that the **order** of preprocessing is a matter of importance!" - ] - }, - { - "cell_type": "markdown", - "id": "58c6b85e-58e7-4f56-9b4a-b60c85b394ba", - "metadata": {}, - "source": [ - "## 🥊 Challenge 1: Preprocessing with Multiple Steps\n", - "\n", - "So far we've learned a few preprocessing operations, let's put them together in a function! This function would be a handy one to refer to if you happen to work with some messy English text data, and you want to preprocess it with a single function. \n", - "\n", - "The example text data for challenge 1 is shown below. Write a function to:\n", - "- Lowercase the text\n", - "- Remove punctuation marks\n", - "- Remove extra whitespace characters\n", - "\n", - "Feel free to recycle the codes we've used above!" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "deb10cba-239e-4856-b56d-7d5eb850c9b9", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "\n", - "This is a text file that has some extra blankspace at the start and end. Blankspace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.\n", - "\n", - "\n", - "The Python method called \"strip\" only catches blankspace at the start and end of a string. 
But it won't catch it in the middle,\t\tfor example,\n", - "\n", - "in this sentence.\t\tOnce again, regular expressions will\n", - "\n", - "help\t\tus with this.\n", - "\n", - "\n", - "\n" - ] - } - ], - "source": [ - "challenge1_path = '../data/example1.txt'\n", - "\n", - "with open(challenge1_path, 'r') as file:\n", - " challenge1 = file.read()\n", - " \n", - "print(challenge1)" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "e2480823-65dd-4f52-a7b3-6d9b10d87912", - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "def clean_text(text):\n", - "\n", - " # Step 1: Lowercase\n", - " text = ...\n", - "\n", - " # Step 2: Use remove_punct to remove punctuation marks\n", - " text = ...\n", - "\n", - " # Step 3: Remove extra whitespace characters\n", - " text = ...\n", - "\n", - " return text" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": "dc603506-0adb-45d7-bb6f-62958c054fdd", - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "# Uncomment to apply the above function to challenge 1 text \n", - "# clean_text(challenge1)" - ] - }, - { - "cell_type": "markdown", - "id": "67c159cb-8eaa-4c30-b8ff-38a712d2bb0f", - "metadata": {}, - "source": [ - "## Task-specific Processes\n", - "\n", - "Now that we understand common preprocessing operations, there are still a few additional operations to consider. Our text data might require further normalization depending on the language, source, and content of the data.\n", - "\n", - "For example, if we are working with financial documents, we might want to standardize monetary symbols by converting them to digits. It our tweets data, there are numerous hashtags and URLs. These can be replaced with placeholders to simplify the subsequent analysis." - ] - }, - { - "cell_type": "markdown", - "id": "c2936cea-74e9-40c2-bfbe-6ba8129330de", - "metadata": {}, - "source": [ - "### 🎬 **Demo**: Remove Hashtags and URLs \n", - "\n", - "Although URLs, hashtags, and numbers are informative in their own right, oftentimes we don't necessarily care about the exact meaning of each of them. \n", - "\n", - "While we could remove them completely, it's often informative to know that there **exists** a URL or a hashtag. In practice, we replace individual URLs and hashtags with a \"symbol\" that preserves the fact these structures exist in the text. It's standard to just use the strings \"URL\" and \"HASHTAG.\"\n", - "\n", - "Since these types of text often follow a regular structure, they're an apt case for using regular expressions. Let's apply these patterns to the tweets data." - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "id": "03c0dc37-a013-4f0a-b72f-a1f64dc6c1bd", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "@VirginAmerica @virginmedia I'm flying your #fabulous #Seductive skies again! U take all the #stress away from travel http://t.co/ahlXHhKiyn\n" - ] - } - ], - "source": [ - "# Print the example tweet \n", - "url_tweet = tweets['text'][13]\n", - "print(url_tweet)" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "id": "4ef61bea-ea11-468d-8176-a2f63659d204", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "\"@VirginAmerica @virginmedia I'm flying your #fabulous #Seductive skies again! 
U take all the #stress away from travel URL \"" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# URL \n", - "url_pattern = r'(http|ftp|https):\\/\\/([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])'\n", - "url_repl = ' URL '\n", - "re.sub(url_pattern, url_repl, url_tweet)" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "id": "ea8e0f2a-460e-4088-aa89-dc2a8bc6f7fe", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "\"@VirginAmerica @virginmedia I'm flying your HASHTAG HASHTAG skies again! U take all the HASHTAG away from travel http://t.co/ahlXHhKiyn\"" - ] - }, - "execution_count": 25, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Hashtag\n", - "hashtag_pattern = r'(?:^|\\s)[##]{1}(\\w+)'\n", - "hashtag_repl = ' HASHTAG '\n", - "re.sub(hashtag_pattern, hashtag_repl, url_tweet)" - ] - }, - { - "cell_type": "markdown", - "id": "71d68d49-4923-49c0-9113-b844dc7546b9", - "metadata": {}, - "source": [ - "\n", - "\n", - "# Tokenization\n", - "\n", - "## Tokenizers Before LLMs\n", - "\n", - "One of the most important steps in text analysis is tokenization. This is the process of breaking a long sequence of text into word tokens. With these tokens available, we are ready to perform word-level analysis. For instance, we can filter out tokens that don't contribute to the core meaning of the text.\n", - "\n", - "In this section, we'll introduce how to perform tokenization using `nltk`, `spaCy`, and a Large Language Model (`bert`). The purpose is to expose you to different NLP packages, help you understand their functionalities, and demonstrate how to access key functions in each package.\n", - "\n", - "### `nltk`\n", - "\n", - "The first package we'll be using is called **Natural Language Toolkit**, or `nltk`. \n", - "\n", - "Let's install a couple modules from the package." - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "id": "441d81f8-361e-4273-bd36-91a272f4a38a", - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "import nltk" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "id": "64b327cc-5c77-4fdc-9aaf-17d7f0761237", - "metadata": {}, - "outputs": [], - "source": [ - "# Uncomment the following lines to install these modules\n", - "# nltk.download('wordnet')\n", - "# nltk.download('stopwords')\n", - "# nltk.download('punkt')" - ] - }, - { - "cell_type": "markdown", - "id": "6e79b699-c3a5-489f-9b3c-95653aba34d6", - "metadata": {}, - "source": [ - "`nltk` has a function called `word_tokenize`. It requires one argument, which is the text to be tokenized, and it returns a list of tokens for us." - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "id": "7b5d6944-c641-4fac-a239-5947a496371c", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. 
https://t.co/mWpG7grEZP\n" - ] - } - ], - "source": [ - "# Load word_tokenize \n", - "from nltk.tokenize import word_tokenize\n", - "\n", - "# Print the example\n", - "text = tweets['text'][7]\n", - "print(text)" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "id": "95fde2a3-e4e2-4e61-ad54-e4d5d0a6ba71", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['@',\n", - " 'VirginAmerica',\n", - " 'Really',\n", - " 'missed',\n", - " 'a',\n", - " 'prime',\n", - " 'opportunity',\n", - " 'for',\n", - " 'Men',\n", - " 'Without',\n", - " 'Hats',\n", - " 'parody',\n", - " ',',\n", - " 'there',\n", - " '.',\n", - " 'https',\n", - " ':',\n", - " '//t.co/mWpG7grEZP']" - ] - }, - "execution_count": 29, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Apply the NLTK tokenizer\n", - "nltk_tokens = word_tokenize(text)\n", - "nltk_tokens" - ] - }, - { - "cell_type": "markdown", - "id": "80ead039-7721-4b22-8590-0d7824631675", - "metadata": {}, - "source": [ - "Here we are, with a list of tokens identified by `nltk`. Let's take a minute to inspect them! \n", - "\n", - "🔔 **Question**: Do word boundaries decided by `nltk` make sense to you? Pay attention to the twitter handle and the URL in the example tweet. \n", - "\n", - "You may feel that accessing functions in `nltk` is pretty straightforward. The function we used above was imported from the `nltk.tokenize` module, which as the name suggests, primarily does the job of tokenization. \n", - "\n", - "Underlyingly, `nltk` has [a collection of modules](https://www.nltk.org/api/nltk.html) that fulfill different purposes, to name a few:\n", - "\n", - "| NLTK module | Fucntion | Link |\n", - "|---------------|---------------------------|--------------------------------------------------------------|\n", - "| nltk.tokenize | Tokenization | [Documentation](https://www.nltk.org/api/nltk.tokenize.html) |\n", - "| nltk.corpus | Retrieve built-in corpora | [Documentation](https://www.nltk.org/nltk_data/) |\n", - "| nltk.tag | Part-of-speech tagging | [Documentation](https://www.nltk.org/api/nltk.tag.html) |\n", - "| nltk.stem | Stemming | [Documentation](https://www.nltk.org/api/nltk.stem.html) |\n", - "| ... | ... | ... |\n", - "\n", - "Let's import `stopwords` from the `nltk.corpus` module, which hosts a range of built-in corpora. " - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "id": "84bbfced-8803-41ca-9cae-49bdadf8c000", - "metadata": {}, - "outputs": [], - "source": [ - "# Load predefined stop words from nltk\n", - "from nltk.corpus import stopwords" - ] - }, - { - "cell_type": "markdown", - "id": "dee971a1-1189-4cb6-8317-4836f54c3ae2", - "metadata": {}, - "source": [ - "Let's specificy that we want to retrieve English stop words. The function simply returns a list of stop words, mostly function words, that `nltk` identifies. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "id": "6009e1df-b720-4d22-a162-7fd250a58672", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']" - ] - }, - "execution_count": 31, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Print the first 10 stopwords\n", - "stop = stopwords.words('english')\n", - "stop[:10]" - ] - }, - { - "cell_type": "markdown", - "id": "4c3ec908-de6c-42c5-a370-f1b1df0032b3", - "metadata": {}, - "source": [ - "### `spaCy`\n", - "Other than `nltk`, we have another widely-used package called `spaCy`. \n", - "\n", - "`spaCy` has its own processing pipeline. It takes in a string of text, runs the `nlp` pipeline on it, and stores the processed text and its annotations in an object called `doc`. The `nlp` pipeline always performs tokenization, as well as [other text analysis components](https://spacy.io/usage/processing-pipelines#custom-components) requested by the user. These components are pretty similar to modules in `nltk`. " - ] - }, - { - "cell_type": "markdown", - "id": "c6a0facd-4b75-41ac-920c-5ea044f7ae2e", - "metadata": {}, - "source": [ - "\"spacy" - ] - }, - { - "cell_type": "markdown", - "id": "c3ef1eaf-2790-4928-b094-943f2803c6a0", - "metadata": {}, - "source": [ - "Note that we always start by initializing the `nlp` pipeline, depending on the language of the text. Here, we are loading a pretrained language model for English: `en_core_web_sm`. The name suggests that it is a lightweight model trained on some text data (e.g., blogs); see model descriptions [here](https://spacy.io/models/en#en_core_web_sm).\n", - "\n", - "This is the first time we encounter the concept of **pretraining**, though you may have heard it elsewhere. In the context of NLP, pretraining means that the model has been trained on a vast amount of data. As a result, it comes with a certain \"knowledge\" of word structure and grammar of the language.\n", - "\n", - "Therefore, when we apply the model to our own data, we can expect it to be reasonably accurate in performing various annotation tasks, e.g., tagging a word's part of speech, identifying the syntactic head of a phrase, and etc. \n", - "\n", - "Let's dive in! We'll first need to load the pretrained language model we installed earlier." - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "id": "524dfc02-aa8f-4888-9f81-74a570db72b7", - "metadata": {}, - "outputs": [], - "source": [ - "import spacy\n", - "nlp = spacy.load('en_core_web_sm')" - ] - }, - { - "cell_type": "markdown", - "id": "57d669c3-2f5a-41b6-893b-ea1d438b3a48", - "metadata": {}, - "source": [ - "The `nlp` pipeline, by default, includes a set of components, which we can access via the `.pipe_names` attribute. \n", - "\n", - "You may notice that it dosen't include the tokenizer. Don't worry! Tokenizer is a special component that the pipeline always includes." 
- ] - }, - { - "cell_type": "code", - "execution_count": 33, - "id": "6d581ca5-43f8-4ef9-b099-2fc92c324581", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']" - ] - }, - "execution_count": 33, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Retrieve components included in NLP pipeline\n", - "nlp.pipe_names" - ] - }, - { - "cell_type": "markdown", - "id": "d1e37f91-d174-4101-bfc6-2859cb0fe5cc", - "metadata": {}, - "source": [ - "Let's run the `nlp` pipeline on our example tweet data, and assign it to a variable `doc`." - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "id": "618e8558-625d-4546-8109-63f9bae9790f", - "metadata": {}, - "outputs": [], - "source": [ - "# Apply the pipeline to example tweet\n", - "doc = nlp(tweets['text'][7])" - ] - }, - { - "cell_type": "markdown", - "id": "54325d60-5c5c-488d-baf2-7eed4de2c031", - "metadata": {}, - "source": [ - "Under the hood, the `doc` object contains the tokens (created by the tokenizer) and their annotations (created by other components), which are [linguistic features](\n", - "https://spacy.io/usage/linguistic-features) useful for text analysis. We retrieve the token and its annotations by accessing corresponding attributes. \n", - "\n", - "| Attribute | Annotation | Link |\n", - "|----------------|-----------------------------------------|---------------------------------------------------------------------------|\n", - "| token.text | The token in verbatim text | [Documentation](https://spacy.io/api/token#attributes) |\n", - "| token.is_stop | Whether the token is a stop word | [Documentation](https://spacy.io/api/attributes#_title) |\n", - "| token.is_punct | Whether the token is a punctuation mark | [Documentation](https://spacy.io/api/attributes#_title) |\n", - "| token.lemma_ | The base form of the token | [Documentation](https://spacy.io/usage/linguistic-features#lemmatization) |\n", - "| token.pos_ | The simple POS-tag of the token | [Documentation](https://spacy.io/usage/linguistic-features#pos-tagging) |\n", - "| ... | ... | ... |" - ] - }, - { - "cell_type": "markdown", - "id": "2e9f23c8-a157-44a7-a6ec-6894aec1a595", - "metadata": {}, - "source": [ - "Let's first get the tokens themselves! We'll iterate over the `doc` object and retrieve the text of each token. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 35, - "id": "4c71efee-e6cf-46c4-9198-593304f6560d", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['@VirginAmerica',\n", - " 'Really',\n", - " 'missed',\n", - " 'a',\n", - " 'prime',\n", - " 'opportunity',\n", - " 'for',\n", - " 'Men',\n", - " 'Without',\n", - " 'Hats',\n", - " 'parody',\n", - " ',',\n", - " 'there',\n", - " '.',\n", - " 'https://t.co/mWpG7grEZP']" - ] - }, - "execution_count": 35, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Get the verbatim texts of tokens\n", - "spacy_tokens = [token.text for token in doc]\n", - "spacy_tokens" - ] - }, - { - "cell_type": "code", - "execution_count": 36, - "id": "f4fc23f0-c699-45e6-ad62-e131036d601f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['@',\n", - " 'VirginAmerica',\n", - " 'Really',\n", - " 'missed',\n", - " 'a',\n", - " 'prime',\n", - " 'opportunity',\n", - " 'for',\n", - " 'Men',\n", - " 'Without',\n", - " 'Hats',\n", - " 'parody',\n", - " ',',\n", - " 'there',\n", - " '.',\n", - " 'https',\n", - " ':',\n", - " '//t.co/mWpG7grEZP']" - ] - }, - "execution_count": 36, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Get the NLTK tokens\n", - "nltk_tokens" - ] - }, - { - "cell_type": "markdown", - "id": "a0ace59e-40e0-42b3-9f2b-d30ac94dccab", - "metadata": {}, - "source": [ - "🔔 **Question**: Let's pause for a minute to compare the tokens generated by `nltk` and `spaCy`. What have you noticed?\n", - "\n", - "Remember we can also access various annotations of these okens. For instance, one annotation `spaCy` offers is that it conveniently encodes whether a token is a stop word. " - ] - }, - { - "cell_type": "code", - "execution_count": 37, - "id": "626af687-e986-4c97-af86-edf7dbd22c3e", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[False,\n", - " True,\n", - " False,\n", - " True,\n", - " False,\n", - " False,\n", - " True,\n", - " False,\n", - " True,\n", - " False,\n", - " False,\n", - " False,\n", - " True,\n", - " False,\n", - " False]" - ] - }, - "execution_count": 37, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Retrieve the is_stop annotation\n", - "spacy_stops = [token.is_stop for token in doc]\n", - "\n", - "# The results are boolean values\n", - "spacy_stops" - ] - }, - { - "cell_type": "markdown", - "id": "3b6548b6-7e89-4f42-b8cb-bf7c93b34eb4", - "metadata": {}, - "source": [ - "## 🥊 Challenge 2: Remove Stop Words\n", - "\n", - "We have known how `nltk` and `spaCy` work as NLP packages. We've also demostrated how to identify stop words with each package. \n", - "\n", - "Let's write **two** functions to remove stop words from our text data. \n", - "\n", - "- Complete the function for stop words removal using `nltk`\n", - " - The starter code requires two arguments: the raw text input and a list of predefined stop words\n", - "- Complete the function for stop words removal using `spaCy`\n", - " - The starter code requires one argument: the raw text input\n", - " \n", - "A friendly reminder before we dive in: both functions take raw text as input—that's a signal to perform tokenization on the raw text first!" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b24b5370-392f-420d-8c9e-78146d0fca29", - "metadata": {}, - "outputs": [], - "source": [ - "def remove_stopword_nltk(raw_text, stopword):\n", - " \n", - " # Step 1: Tokenization with nltk\n", - " # YOUR CODE HERE\n", - " \n", - " # Step 2: Filter out tokens in the stop word list\n", - " # YOUR CODE HERE" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "65b5bbda-0af5-49a9-8f5e-77d61ab217e7", - "metadata": {}, - "outputs": [], - "source": [ - "def remove_stopword_spacy(raw_text):\n", - "\n", - " # Step 1: Apply the nlp pipeline\n", - " # YOUR CODE HERE\n", - " \n", - " # Step 2: Filter out tokens that are stop words\n", - " # YOUR CODE HERE" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f4c3b0da-2223-4d7f-9014-696498e804e6", - "metadata": {}, - "outputs": [], - "source": [ - "# remove_stopword_nltk(text, stop)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f83538ba-6bf1-49ca-90ec-b6532b1ffcb3", - "metadata": {}, - "outputs": [], - "source": [ - "# remove_stopword_spacy(text)" - ] - }, - { - "cell_type": "markdown", - "id": "d3a6b1ec-87cc-4a08-a5dd-0210a9c56f0b", - "metadata": {}, - "source": [ - "## 🎬 **Demo**: Powerful Features from `spaCy`\n", - "\n", - "`spaCy`'s nlp pipeline includes a number of linguistic annotations, which could be very useful for text analysis. \n", - "\n", - "For instance, we can access more annotations such as the lemma, the part-of-speech tag and its meaning, and whether the token looks like URLs." - ] - }, - { - "cell_type": "code", - "execution_count": 38, - "id": "eb6c7d93-51a3-4fb8-8321-cb672f4f1b8f", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "@VirginAmerica | @VirginAmerica | PROPN | proper noun | 0 |\n", - "Really | really | ADV | adverb | 0 |\n", - "missed | miss | VERB | verb | 0 |\n", - "a | a | DET | determiner | 0 |\n", - "prime | prime | ADJ | adjective | 0 |\n", - "opportunity | opportunity | NOUN | noun | 0 |\n", - "for | for | ADP | adposition | 0 |\n", - "Men | Men | PROPN | proper noun | 0 |\n", - "Without | without | ADP | adposition | 0 |\n", - "Hats | Hats | PROPN | proper noun | 0 |\n", - "parody | parody | NOUN | noun | 0 |\n", - ", | , | PUNCT | punctuation | 0 |\n", - "there | there | ADV | adverb | 0 |\n", - ". | . | PUNCT | punctuation | 0 |\n", - "https://t.co/mWpG7grEZP | https://t.co/mWpG7grEZP | PROPN | proper noun | 1 |\n" - ] - } - ], - "source": [ - "# Print tokens and their annotations\n", - "for token in doc:\n", - " print(f\"{token.text:<24} | {token.lemma_:<24} | {token.pos_:<12} | {spacy.explain(token.pos_):<12} | {token.like_url:<12} |\")" - ] - }, - { - "cell_type": "markdown", - "id": "17388e0c-88b6-4cd9-8d2b-adb7f10b5330", - "metadata": {}, - "source": [ - "As you can imagine, it is typical for this dataset to contain place names and airport codes. It would be cool if we are able to identify them and extract them from tweets. " - ] - }, - { - "cell_type": "code", - "execution_count": 39, - "id": "cb489f78-fbb2-497b-a36d-3400c00c9b9d", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "@JetBlue Vegas, San Francisco, Baltimore, San Diego and Philadelphia so far! 
I'm a very frequent business traveler.\n", - "==================================================\n", - "@VirginAmerica Flying LAX to SFO and after looking at the awesome movie lineup I actually wish I was on a long haul.\n" - ] - } - ], - "source": [ - "# Print example tweets with place names and airport codes\n", - "tweet_city = tweets['text'][8273]\n", - "tweet_airport = tweets['text'][502]\n", - "print(tweet_city)\n", - "print(f\"{'=' * 50}\")\n", - "print(tweet_airport)" - ] - }, - { - "cell_type": "markdown", - "id": "013990a5-5e07-4a45-9427-fcd33840d3b8", - "metadata": {}, - "source": [ - "We can use the \"ner\" ([Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)) component to identify entities and their categories." - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "id": "b9e63519-5991-49fa-9a5d-be9f9b408ad1", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Vegas | 9 | 14 | GPE \n", - "San Francisco | 16 | 29 | GPE \n", - "Baltimore | 31 | 40 | GPE \n", - "San Diego | 42 | 51 | GPE \n", - "Philadelphia | 56 | 68 | GPE \n" - ] - } - ], - "source": [ - "# Print entities identified from the text\n", - "doc_city = nlp(tweet_city)\n", - "for ent in doc_city.ents:\n", - " print(f\"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}\")" - ] - }, - { - "cell_type": "markdown", - "id": "7b933ed0-7018-450c-b0a6-fb76cb6d5be9", - "metadata": {}, - "source": [ - "We can also use `displacy` to highlight entities identified in the text, and at the same time, annotate the entity category. \n", - "\n", - "In the following example, we have four `GPE` (i.e., geopolitical entities, usually countries and cities) identified. " - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "id": "7a5a5219-af1f-445c-a35b-7c49d739a91b", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
@JetBlue \n", - "\n", - " Vegas\n", - " GPE\n", - "\n", - ", \n", - "\n", - " San Francisco\n", - " GPE\n", - "\n", - ", \n", - "\n", - " Baltimore\n", - " GPE\n", - "\n", - ", \n", - "\n", - " San Diego\n", - " GPE\n", - "\n", - " and \n", - "\n", - " Philadelphia\n", - " GPE\n", - "\n", - " so far! I'm a very frequent business traveler.
" + "source": [ + "# Print tokens and their annotations\n", + "for token in doc:\n", + " print(f\"{token.text:<24} | {token.lemma_:<24} | {token.pos_:<12} | {spacy.explain(token.pos_):<12} | {token.like_url:<12} |\")" + ] + }, + { + "cell_type": "markdown", + "id": "17388e0c-88b6-4cd9-8d2b-adb7f10b5330", + "metadata": { + "id": "17388e0c-88b6-4cd9-8d2b-adb7f10b5330" + }, + "source": [ + "As you can imagine, it is typical for this dataset to contain place names and airport codes. It would be cool if we are able to identify them and extract them from tweets." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb489f78-fbb2-497b-a36d-3400c00c9b9d", + "metadata": { + "id": "cb489f78-fbb2-497b-a36d-3400c00c9b9d", + "outputId": "8dc49157-b8b8-44ab-8420-4126a1665db2" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "@JetBlue Vegas, San Francisco, Baltimore, San Diego and Philadelphia so far! I'm a very frequent business traveler.\n", + "==================================================\n", + "@VirginAmerica Flying LAX to SFO and after looking at the awesome movie lineup I actually wish I was on a long haul.\n" + ] + } ], - "text/plain": [ - "" + "source": [ + "# Print example tweets with place names and airport codes\n", + "tweet_city = tweets['text'][8273]\n", + "tweet_airport = tweets['text'][502]\n", + "print(tweet_city)\n", + "print(f\"{'=' * 50}\")\n", + "print(tweet_airport)" ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# Visualize the identified entities\n", - "from spacy import displacy\n", - "displacy.render(doc_city, style='ent', jupyter=True)" - ] - }, - { - "cell_type": "markdown", - "id": "f5d7953b-04a1-46d0-817a-db29edd8c83b", - "metadata": {}, - "source": [ - "Let's give it a try with another example." - ] - }, - { - "cell_type": "code", - "execution_count": 42, - "id": "9246c7e3-8990-4d47-a355-deb63dbd1cc7", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "@VirginAmerica | 0 | 14 | CARDINAL \n", - "Flying LAX | 15 | 25 | ORG \n", - "SFO | 29 | 32 | ORG \n" - ] - } - ], - "source": [ - "# Print entities identified from the text\n", - "doc_airport = nlp(tweet_airport)\n", - "for ent in doc_airport.ents:\n", - " print(f\"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}\")" - ] - }, - { - "cell_type": "markdown", - "id": "1df38472-3193-44c5-9b2f-e311ce9d42e0", - "metadata": {}, - "source": [ - "Interesting that airport codes are identified as `ORG`—organizations, and the tweet handle as `CARDINAL`." - ] - }, - { - "cell_type": "code", - "execution_count": 43, - "id": "7e4bf382-4c57-4b78-a37f-fc1f6cb1d565", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - " @VirginAmerica\n", - " CARDINAL\n", - "\n", - " \n", - "\n", - " Flying LAX\n", - " ORG\n", - "\n", - " to \n", - "\n", - " SFO\n", - " ORG\n", - "\n", - " and after looking at the awesome movie lineup I actually wish I was on a long haul.
" + }, + { + "cell_type": "markdown", + "id": "013990a5-5e07-4a45-9427-fcd33840d3b8", + "metadata": { + "id": "013990a5-5e07-4a45-9427-fcd33840d3b8" + }, + "source": [ + "We can use the \"ner\" ([Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)) component to identify entities and their categories." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b9e63519-5991-49fa-9a5d-be9f9b408ad1", + "metadata": { + "id": "b9e63519-5991-49fa-9a5d-be9f9b408ad1", + "outputId": "43891565-a8c3-40c2-fdcd-ce00471b721a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Vegas | 9 | 14 | GPE \n", + "San Francisco | 16 | 29 | GPE \n", + "Baltimore | 31 | 40 | GPE \n", + "San Diego | 42 | 51 | GPE \n", + "Philadelphia | 56 | 68 | GPE \n" + ] + } ], - "text/plain": [ - "" + "source": [ + "# Print entities identified from the text\n", + "doc_city = nlp(tweet_city)\n", + "for ent in doc_city.ents:\n", + " print(f\"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}\")" + ] + }, + { + "cell_type": "markdown", + "id": "7b933ed0-7018-450c-b0a6-fb76cb6d5be9", + "metadata": { + "id": "7b933ed0-7018-450c-b0a6-fb76cb6d5be9" + }, + "source": [ + "We can also use `displacy` to highlight entities identified in the text, and at the same time, annotate the entity category.\n", + "\n", + "In the following example, we have four `GPE` (i.e., geopolitical entities, usually countries and cities) identified." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7a5a5219-af1f-445c-a35b-7c49d739a91b", + "metadata": { + "id": "7a5a5219-af1f-445c-a35b-7c49d739a91b", + "outputId": "808b0cbf-c49e-48db-ac29-3dcf5f9c6292" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
@JetBlue \n", + "\n", + " Vegas\n", + " GPE\n", + "\n", + ", \n", + "\n", + " San Francisco\n", + " GPE\n", + "\n", + ", \n", + "\n", + " Baltimore\n", + " GPE\n", + "\n", + ", \n", + "\n", + " San Diego\n", + " GPE\n", + "\n", + " and \n", + "\n", + " Philadelphia\n", + " GPE\n", + "\n", + " so far! I'm a very frequent business traveler.
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Visualize the identified entities\n", + "from spacy import displacy\n", + "displacy.render(doc_city, style='ent', jupyter=True)" + ] + }, + { + "cell_type": "markdown", + "id": "f5d7953b-04a1-46d0-817a-db29edd8c83b", + "metadata": { + "id": "f5d7953b-04a1-46d0-817a-db29edd8c83b" + }, + "source": [ + "Let's give it a try with another example." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9246c7e3-8990-4d47-a355-deb63dbd1cc7", + "metadata": { + "id": "9246c7e3-8990-4d47-a355-deb63dbd1cc7", + "outputId": "f614b96a-4f24-4ab5-923c-e3051cc2862e" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "@VirginAmerica | 0 | 14 | CARDINAL \n", + "Flying LAX | 15 | 25 | ORG \n", + "SFO | 29 | 32 | ORG \n" + ] + } + ], + "source": [ + "# Print entities identified from the text\n", + "doc_airport = nlp(tweet_airport)\n", + "for ent in doc_airport.ents:\n", + " print(f\"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}\")" + ] + }, + { + "cell_type": "markdown", + "id": "1df38472-3193-44c5-9b2f-e311ce9d42e0", + "metadata": { + "id": "1df38472-3193-44c5-9b2f-e311ce9d42e0" + }, + "source": [ + "Interesting that airport codes are identified as `ORG`—organizations, and the tweet handle as `CARDINAL`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e4bf382-4c57-4b78-a37f-fc1f6cb1d565", + "metadata": { + "id": "7e4bf382-4c57-4b78-a37f-fc1f6cb1d565", + "outputId": "459a8d69-c61f-4cde-80dd-edb8bc16c985" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + " @VirginAmerica\n", + " CARDINAL\n", + "\n", + " \n", + "\n", + " Flying LAX\n", + " ORG\n", + "\n", + " to \n", + "\n", + " SFO\n", + " ORG\n", + "\n", + " and after looking at the awesome movie lineup I actually wish I was on a long haul.
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Visualize the identified entities\n", + "displacy.render(doc_airport, style='ent', jupyter=True)" + ] + }, + { + "cell_type": "markdown", + "id": "467d6f28-effc-4fe1-90e9-47c89dc5492d", + "metadata": { + "id": "467d6f28-effc-4fe1-90e9-47c89dc5492d" + }, + "source": [ + "## Tokenizers Since LLMs\n", + "\n", + "So far, we've seen what tokenization looks like with two widely-used NLP packages. They work quite well in some settings, but not others. Recall that `nltk` struggles with URLs. Now, imagine the data we have is even messier, containing misspellings, recently coined words, foreign names, and etc (collectively called \"out of vocabulary\" or OOV words). In such circumstances, we might need a more powerful model to handle these complexities.\n", + "\n", + "In fact, tokenization schemes change substantially with **Large Language Models** (LLMs), which are models trained on an enormous amount of data from mixed sources. With that magnitude of data, LLMs are better at chunking a longer sequence into tokens and tokens into **subtokens**. These subtokens can be morphological units of a word, such as an affix, but they can also be parts of a word where the model sets a \"meaningful\" boundary.\n", + "\n", + "In this section, we will demonstrate tokenization in **BERT** (Bidirectional Encoder Representations from Transformers), which utilizes a tokenization algorithm called [**WordPiece**](https://huggingface.co/learn/nlp-course/en/chapter6/6).\n", + "\n", + "We will load the tokenizer of BERT from the package `transformers`, which hosts a number of Transformer-based LLMs (e.g., BERT). We won't go into the architecture of Transformer in this workshop, but feel free to check out the D-lab workshop on [GPT Fundamentals](https://github.com/dlab-berkeley/GPT-Fundamentals)!" + ] + }, + { + "cell_type": "markdown", + "id": "b5e1509b-5b9e-456d-909f-a6b5099c48c8", + "metadata": { + "id": "b5e1509b-5b9e-456d-909f-a6b5099c48c8" + }, + "source": [ + "### WordPiece Tokenization\n", + "\n", + "Note that BERT comes in a variety of versions. The one we will explore today is `bert-base-uncased`. This model has a moderate size (referred to as `base`) and is case-insensitive, meaning the input text will be lowercased by default." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7927226c-d04e-4117-9c49-3d355611b209", + "metadata": { + "scrolled": true, + "id": "7927226c-d04e-4117-9c49-3d355611b209" + }, + "outputs": [], + "source": [ + "# Load BERT tokenizer in\n", + "from transformers import BertTokenizer\n", + "\n", + "# Initialize the tokenizer\n", + "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')" + ] + }, + { + "cell_type": "markdown", + "id": "128cfb38-274e-4d75-9d14-8744020fe49c", + "metadata": { + "id": "128cfb38-274e-4d75-9d14-8744020fe49c" + }, + "source": [ + "The tokenizer has multiple functions, as we will see in a minute. Now we want to access the `.tokenize()` function from the tokenizer.\n", + "\n", + "Let's tokenize an example tweet below. What have you noticed?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "62649193-bb00-4ae8-9102-bce3d1dfb6c8", + "metadata": { + "id": "62649193-bb00-4ae8-9102-bce3d1dfb6c8", + "outputId": "177e933d-cd9d-42d5-ed2e-e5e69c940cea" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Text: @VirginAmerica Just DM'd. 
Same issue persisting.\n", + "==================================================\n", + "Tokens: ['@', 'virgin', '##ame', '##rica', 'just', 'd', '##m', \"'\", 'd', '.', 'same', 'issue', 'persist', '##ing', '.']\n", + "Number of tokens: 15\n" + ] + } + ], + "source": [ + "# Select an example tweet from dataframe\n", + "text = tweets['text'][194]\n", + "print(f\"Text: {text}\")\n", + "print(f\"{'=' * 50}\")\n", + "\n", + "# Apply tokenizer\n", + "tokens = tokenizer.tokenize(text)\n", + "print(f\"Tokens: {tokens}\")\n", + "print(f\"Number of tokens: {len(tokens)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "bc4f780e-207c-4d3d-b1b6-063c6d118945", + "metadata": { + "id": "bc4f780e-207c-4d3d-b1b6-063c6d118945" + }, + "source": [ + "The double \"hashtag\" symbols (`##`) refer to a subword token—a segment separated from the previous token.\n", + "\n", + "🔔 **Question**: Do these subwords make sense to you?\n", + "\n", + "One significant development with LLMs is that each token is assigned an ID from its vocabulary. Our computer does not understand text in its raw form, so each token is translated into an ID. These IDs are the inputs that the model accesses and operates on.\n", + "\n", + "Tokens and IDs can be converted bidirectionally, for example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3112a63d-82b6-4fc0-a904-ab86f8740653", + "metadata": { + "id": "3112a63d-82b6-4fc0-a904-ab86f8740653", + "outputId": "757e3928-7f70-4eab-e853-8548e30a6b55" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ID of just is: 2074\n", + "Token 2074 is: just\n" + ] + } + ], + "source": [ + "# Get the input ID of the word\n", + "print(f\"ID of just is: {tokenizer.vocab['just']}\")\n", + "\n", + "# Get the text of the input ID\n", + "print(f\"Token 2074 is: {tokenizer.decode([2074])}\")" + ] + }, + { + "cell_type": "markdown", + "id": "2b9fd13d-f26e-480c-b43e-2b5fbc4898cd", + "metadata": { + "id": "2b9fd13d-f26e-480c-b43e-2b5fbc4898cd" + }, + "source": [ + "Let's convert tokens to input IDs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d125e1d-2560-4136-829c-b1c11e34636c", + "metadata": { + "id": "8d125e1d-2560-4136-829c-b1c11e34636c", + "outputId": "247063b4-623d-4187-f8e4-bbd45ed269ab" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of input IDs: 15\n", + "Input IDs of text: [1030, 6261, 14074, 14735, 2074, 1040, 2213, 1005, 1040, 1012, 2168, 3277, 29486, 2075, 1012]\n" + ] + } + ], + "source": [ + "# Convert a list of tokens to a list of input IDs\n", + "input_ids = tokenizer.convert_tokens_to_ids(tokens)\n", + "print(f\"Number of input IDs: {len(input_ids)}\")\n", + "print(f\"Input IDs of text: {input_ids}\")" + ] + }, + { + "cell_type": "markdown", + "id": "f25a1a14-a6db-414b-a1a8-c6be73c81a15", + "metadata": { + "id": "f25a1a14-a6db-414b-a1a8-c6be73c81a15" + }, + "source": [ + "### Special Tokens\n", + "\n", + "In addition to the tokens and subtokens discussed above, BERT also makes use of three special tokens: `SEP`, `CLS`, and `UNK`. The `SEP` token acts as a sentence terminator, commonly known as an `EOS` (End of Sentence) token. The `UNK` token represents any token that is not found in the vocabulary, hence \"unknown\" tokens. The `CLS` token is added to the beginning of the sentence. 
It originates from text classification tasks (e.g., spam detection), where reseachers found it useful to have a token that aggregates the information of the entire sentence for classification purposes.\n", + "\n", + "When we apply `tokenizer()` directly to our text data, we are asking BERT to **encode** the text for us. This involves multiple steps:\n", + "- Tokenize the text\n", + "- Add special tokens\n", + "- Convert tokens to input IDs\n", + "- Other model-specific processes\n", + " \n", + "Let's print them out." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21479c25-7a9a-4fac-ba09-1812575b8170", + "metadata": { + "id": "21479c25-7a9a-4fac-ba09-1812575b8170", + "outputId": "7b66134b-4e14-405a-ac01-512ddea3abd8" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of input IDs: 17\n", + "IDs from tokenizer: [101, 1030, 6261, 14074, 14735, 2074, 1040, 2213, 1005, 1040, 1012, 2168, 3277, 29486, 2075, 1012, 102]\n" + ] + } + ], + "source": [ + "# Get the input IDs by providing the key\n", + "input_ids_from_tokenizer = tokenizer(text)['input_ids']\n", + "print(f\"Number of input IDs: {len(input_ids_from_tokenizer)}\")\n", + "print(f\"IDs from tokenizer: {input_ids_from_tokenizer}\")" + ] + }, + { + "cell_type": "markdown", + "id": "bb6ea98e-7374-488a-a804-46cab166125c", + "metadata": { + "id": "bb6ea98e-7374-488a-a804-46cab166125c" + }, + "source": [ + "It looks like we have two more tokens added: 101 and 102.\n", + "\n", + "Let's convert them to texts!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "82a18dc1-ca0d-4d5b-8ac6-f56cb44bf8e3", + "metadata": { + "id": "82a18dc1-ca0d-4d5b-8ac6-f56cb44bf8e3", + "outputId": "d9f3b9be-d3e9-403a-92cc-c198335da530" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The 101st token: [CLS]\n", + "The 102nd token: [SEP]\n" + ] + } + ], + "source": [ + "# Convert input IDs to texts\n", + "print(f\"The 101st token: {tokenizer.convert_ids_to_tokens(101)}\")\n", + "print(f\"The 102nd token: {tokenizer.convert_ids_to_tokens(102)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "b16d4e10-e96c-432a-a6c9-2c992d00fcde", + "metadata": { + "id": "b16d4e10-e96c-432a-a6c9-2c992d00fcde" + }, + "source": [ + "As you can see, our text example is now a list of vocabulary IDs. In addtion to that, BERT adds the sentence terminator `SEP` and the beginning `CLS` token to the original text. BERT's tokenizer encodes tons of texts likewise; and afterwards, they are ready for further processes." + ] + }, + { + "cell_type": "markdown", + "id": "ac56be32-2a5f-441e-8283-de3e60705c0b", + "metadata": { + "id": "ac56be32-2a5f-441e-8283-de3e60705c0b" + }, + "source": [ + "## 🥊 Challenge 3: Find the Word Boundary\n", + "\n", + "Now that we know tokenization in BERT often returns subwords. Let's try a few more examples.\n", + "\n", + "- What do you think is the correct boundary for splitting the following words into subwords?\n", + "- What other examples have you tested?" 
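One way to double-check where the boundaries were placed is to merge the pieces back together; `convert_tokens_to_string` reverses the WordPiece split. A small sketch, reusing the `tokenizer` loaded above and one of the words from the next cell:

```python
# Split a word into WordPiece tokens, then merge the pieces back into a plain string
pieces = tokenizer.tokenize('huggable')
print(pieces)                                       # ['hug', '##ga', '##ble']
print(tokenizer.convert_tokens_to_string(pieces))   # huggable
```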
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f1140b19-398c-44dd-829c-c922b0e6f7eb", + "metadata": { + "id": "f1140b19-398c-44dd-829c-c922b0e6f7eb" + }, + "outputs": [], + "source": [ + "def get_tokens(string):\n", + " '''Tokenzie the input string with BERT'''\n", + " tokens = tokenizer.tokenize(string)\n", + " return print(tokens)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c07c71e-5be2-4d91-9c6f-26de4145307d", + "metadata": { + "id": "0c07c71e-5be2-4d91-9c6f-26de4145307d", + "outputId": "3e227b40-e13e-4246-f979-9b42d1cd58ae" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['dl', '##ab']\n", + "['co', '##vid']\n", + "['hug', '##ga', '##ble']\n", + "['37', '##8']\n" + ] + } + ], + "source": [ + "# Abbreviations\n", + "get_tokens('dlab')\n", + "\n", + "# OOV\n", + "get_tokens('covid')\n", + "\n", + "# Prefix\n", + "get_tokens('huggable')\n", + "\n", + "# Digits\n", + "get_tokens('378')\n", + "\n", + "# YOUR EXAMPLE" + ] + }, + { + "cell_type": "markdown", + "id": "4acb7cb9-6e60-4e0a-8cc6-21d57237e835", + "metadata": { + "id": "4acb7cb9-6e60-4e0a-8cc6-21d57237e835" + }, + "source": [ + "We will wrap up Part 1 with this (hopefully) thought-provoking challenge. LLMs often come with a much more sophisticated tokenization scheme, but there is ongoing discussion about their limitations in real-world applications. The reference section includes a few blog posts discussing this problem. Feel free to explore further if this sounds like an interesting question to you!" + ] + }, + { + "cell_type": "markdown", + "id": "d7943ed9-70de-4f4a-b1bb-b2896d05e618", + "metadata": { + "id": "d7943ed9-70de-4f4a-b1bb-b2896d05e618" + }, + "source": [ + "## References\n", + "\n", + "1. A tutorial introducing the tokenization scheme in BERT: [The huggingface NLP course on wordpiece tokenization](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt)\n", + "2. A specific example of \"failure\" in tokenization: [Weaknesses of wordpiece tokenization: Findings from the front lines of NLP at VMware.](https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99)\n", + "3. How does BERT decide boundaries between subtokens: [Subword tokenization in BERT](https://tinkerd.net/blog/machine-learning/bert-tokenization/#subword-tokenization)" + ] + }, + { + "cell_type": "markdown", + "id": "ce0812a7-f033-46ed-bc7b-67109c369e6c", + "metadata": { + "id": "ce0812a7-f033-46ed-bc7b-67109c369e6c" + }, + "source": [ + "
\n", + "\n", + "## ❗ Key Points\n", + "\n", + "* Preprocessing includes multiple steps, some of them are more common to text data regardlessly, and some are task-specific.\n", + "* Both `nltk` and `spaCy` could be used for tokenization and stop word removal. The latter is more powerful in providing various linguistic annotations.\n", + "* Tokenization works differently in BERT, which often involves breaking down a whole word into subwords.\n", + "\n", + "
" ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# Visualize the identified entities\n", - "displacy.render(doc_airport, style='ent', jupyter=True)" - ] - }, - { - "cell_type": "markdown", - "id": "467d6f28-effc-4fe1-90e9-47c89dc5492d", - "metadata": {}, - "source": [ - "## Tokenizers Since LLMs\n", - "\n", - "So far, we've seen what tokenization looks like with two widely-used NLP packages. They work quite well in some settings, but not others. Recall that `nltk` struggles with URLs. Now, imagine the data we have is even messier, containing misspellings, recently coined words, foreign names, and etc (collectively called \"out of vocabulary\" or OOV words). In such circumstances, we might need a more powerful model to handle these complexities.\n", - "\n", - "In fact, tokenization schemes change substantially with **Large Language Models** (LLMs), which are models trained on an enormous amount of data from mixed sources. With that magnitude of data, LLMs are better at chunking a longer sequence into tokens and tokens into **subtokens**. These subtokens can be morphological units of a word, such as an affix, but they can also be parts of a word where the model sets a \"meaningful\" boundary. \n", - "\n", - "In this section, we will demonstrate tokenization in **BERT** (Bidirectional Encoder Representations from Transformers), which utilizes a tokenization algorithm called [**WordPiece**](https://huggingface.co/learn/nlp-course/en/chapter6/6). \n", - "\n", - "We will load the tokenizer of BERT from the package `transformers`, which hosts a number of Transformer-based LLMs (e.g., BERT). We won't go into the architecture of Transformer in this workshop, but feel free to check out the D-lab workshop on [GPT Fundamentals](https://github.com/dlab-berkeley/GPT-Fundamentals)!" - ] - }, - { - "cell_type": "markdown", - "id": "b5e1509b-5b9e-456d-909f-a6b5099c48c8", - "metadata": {}, - "source": [ - "### WordPiece Tokenization\n", - "\n", - "Note that BERT comes in a variety of versions. The one we will explore today is `bert-base-uncased`. This model has a moderate size (referred to as `base`) and is case-insensitive, meaning the input text will be lowercased by default." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7927226c-d04e-4117-9c49-3d355611b209", - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "# Load BERT tokenizer in\n", - "from transformers import BertTokenizer\n", - "\n", - "# Initialize the tokenizer \n", - "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')" - ] - }, - { - "cell_type": "markdown", - "id": "128cfb38-274e-4d75-9d14-8744020fe49c", - "metadata": {}, - "source": [ - "The tokenizer has multiple functions, as we will see in a minute. Now we want to access the `.tokenize()` function from the tokenizer. \n", - "\n", - "Let's tokenize an example tweet below. What have you noticed?" - ] - }, - { - "cell_type": "code", - "execution_count": 45, - "id": "62649193-bb00-4ae8-9102-bce3d1dfb6c8", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Text: @VirginAmerica Just DM'd. 
Same issue persisting.\n", - "==================================================\n", - "Tokens: ['@', 'virgin', '##ame', '##rica', 'just', 'd', '##m', \"'\", 'd', '.', 'same', 'issue', 'persist', '##ing', '.']\n", - "Number of tokens: 15\n" - ] - } - ], - "source": [ - "# Select an example tweet from dataframe\n", - "text = tweets['text'][194]\n", - "print(f\"Text: {text}\")\n", - "print(f\"{'=' * 50}\")\n", - "\n", - "# Apply tokenizer\n", - "tokens = tokenizer.tokenize(text)\n", - "print(f\"Tokens: {tokens}\")\n", - "print(f\"Number of tokens: {len(tokens)}\")" - ] - }, - { - "cell_type": "markdown", - "id": "bc4f780e-207c-4d3d-b1b6-063c6d118945", - "metadata": {}, - "source": [ - "The double \"hashtag\" symbols (`##`) refer to a subword token—a segment separated from the previous token.\n", - "\n", - "🔔 **Question**: Do these subwords make sense to you? \n", - "\n", - "One significant development with LLMs is that each token is assigned an ID from its vocabulary. Our computer does not understand text in its raw form, so each token is translated into an ID. These IDs are the inputs that the model accesses and operates on.\n", - "\n", - "Tokens and IDs can be converted bidirectionally, for example:" - ] - }, - { - "cell_type": "code", - "execution_count": 46, - "id": "3112a63d-82b6-4fc0-a904-ab86f8740653", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "ID of just is: 2074\n", - "Token 2074 is: just\n" - ] - } - ], - "source": [ - "# Get the input ID of the word \n", - "print(f\"ID of just is: {tokenizer.vocab['just']}\")\n", - "\n", - "# Get the text of the input ID\n", - "print(f\"Token 2074 is: {tokenizer.decode([2074])}\")" - ] - }, - { - "cell_type": "markdown", - "id": "2b9fd13d-f26e-480c-b43e-2b5fbc4898cd", - "metadata": {}, - "source": [ - "Let's convert tokens to input IDs." - ] - }, - { - "cell_type": "code", - "execution_count": 47, - "id": "8d125e1d-2560-4136-829c-b1c11e34636c", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of input IDs: 15\n", - "Input IDs of text: [1030, 6261, 14074, 14735, 2074, 1040, 2213, 1005, 1040, 1012, 2168, 3277, 29486, 2075, 1012]\n" - ] - } - ], - "source": [ - "# Convert a list of tokens to a list of input IDs\n", - "input_ids = tokenizer.convert_tokens_to_ids(tokens)\n", - "print(f\"Number of input IDs: {len(input_ids)}\")\n", - "print(f\"Input IDs of text: {input_ids}\")" - ] - }, - { - "cell_type": "markdown", - "id": "f25a1a14-a6db-414b-a1a8-c6be73c81a15", - "metadata": {}, - "source": [ - "### Special Tokens\n", - "\n", - "In addition to the tokens and subtokens discussed above, BERT also makes use of three special tokens: `SEP`, `CLS`, and `UNK`. The `SEP` token acts as a sentence terminator, commonly known as an `EOS` (End of Sentence) token. The `UNK` token represents any token that is not found in the vocabulary, hence \"unknown\" tokens. The `CLS` token is added to the beginning of the sentence. It originates from text classification tasks (e.g., spam detection), where reseachers found it useful to have a token that aggregates the information of the entire sentence for classification purposes.\n", - "\n", - "When we apply `tokenizer()` directly to our text data, we are asking BERT to **encode** the text for us. This involves multiple steps: \n", - "- Tokenize the text\n", - "- Add special tokens\n", - "- Convert tokens to input IDs\n", - "- Other model-specific processes\n", - " \n", - "Let's print them out." 
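Before printing them out, it may help to see that the encoder simply chains the steps listed above. A sketch that reproduces them by hand, reusing the `text` and `tokenizer` from the previous cells:

```python
# Reproduce what tokenizer(text) does, step by step
tokens = tokenizer.tokenize(text)                                 # 1. tokenize the text
tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token]   # 2. add the special tokens
ids = tokenizer.convert_tokens_to_ids(tokens)                     # 3. convert tokens to input IDs
print(ids)
print(ids == tokenizer(text)['input_ids'])                        # True for this example
```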
- ] - }, - { - "cell_type": "code", - "execution_count": 48, - "id": "21479c25-7a9a-4fac-ba09-1812575b8170", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of input IDs: 17\n", - "IDs from tokenizer: [101, 1030, 6261, 14074, 14735, 2074, 1040, 2213, 1005, 1040, 1012, 2168, 3277, 29486, 2075, 1012, 102]\n" - ] - } - ], - "source": [ - "# Get the input IDs by providing the key \n", - "input_ids_from_tokenizer = tokenizer(text)['input_ids']\n", - "print(f\"Number of input IDs: {len(input_ids_from_tokenizer)}\")\n", - "print(f\"IDs from tokenizer: {input_ids_from_tokenizer}\")" - ] - }, - { - "cell_type": "markdown", - "id": "bb6ea98e-7374-488a-a804-46cab166125c", - "metadata": {}, - "source": [ - "It looks like we have two more tokens added: 101 and 102. \n", - "\n", - "Let's convert them to texts!" - ] - }, - { - "cell_type": "code", - "execution_count": 49, - "id": "82a18dc1-ca0d-4d5b-8ac6-f56cb44bf8e3", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The 101st token: [CLS]\n", - "The 102nd token: [SEP]\n" - ] } - ], - "source": [ - "# Convert input IDs to texts\n", - "print(f\"The 101st token: {tokenizer.convert_ids_to_tokens(101)}\")\n", - "print(f\"The 102nd token: {tokenizer.convert_ids_to_tokens(102)}\")" - ] - }, - { - "cell_type": "markdown", - "id": "b16d4e10-e96c-432a-a6c9-2c992d00fcde", - "metadata": {}, - "source": [ - "As you can see, our text example is now a list of vocabulary IDs. In addtion to that, BERT adds the sentence terminator `SEP` and the beginning `CLS` token to the original text. BERT's tokenizer encodes tons of texts likewise; and afterwards, they are ready for further processes." - ] - }, - { - "cell_type": "markdown", - "id": "ac56be32-2a5f-441e-8283-de3e60705c0b", - "metadata": {}, - "source": [ - "## 🥊 Challenge 3: Find the Word Boundary\n", - "\n", - "Now that we know tokenization in BERT often returns subwords. Let's try a few more examples. \n", - "\n", - "- What do you think is the correct boundary for splitting the following words into subwords?\n", - "- What other examples have you tested?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 50, - "id": "f1140b19-398c-44dd-829c-c922b0e6f7eb", - "metadata": {}, - "outputs": [], - "source": [ - "def get_tokens(string):\n", - " '''Tokenzie the input string with BERT'''\n", - " tokens = tokenizer.tokenize(string)\n", - " return print(tokens)" - ] - }, - { - "cell_type": "code", - "execution_count": 51, - "id": "0c07c71e-5be2-4d91-9c6f-26de4145307d", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "['dl', '##ab']\n", - "['co', '##vid']\n", - "['hug', '##ga', '##ble']\n", - "['37', '##8']\n" - ] + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + }, + "colab": { + "provenance": [] } - ], - "source": [ - "# Abbreviations\n", - "get_tokens('dlab')\n", - "\n", - "# OOV\n", - "get_tokens('covid')\n", - "\n", - "# Prefix\n", - "get_tokens('huggable')\n", - "\n", - "# Digits\n", - "get_tokens('378')\n", - "\n", - "# YOUR EXAMPLE" - ] - }, - { - "cell_type": "markdown", - "id": "4acb7cb9-6e60-4e0a-8cc6-21d57237e835", - "metadata": {}, - "source": [ - "We will wrap up Part 1 with this (hopefully) thought-provoking challenge. LLMs often come with a much more sophisticated tokenization scheme, but there is ongoing discussion about their limitations in real-world applications. The reference section includes a few blog posts discussing this problem. Feel free to explore further if this sounds like an interesting question to you!" - ] - }, - { - "cell_type": "markdown", - "id": "d7943ed9-70de-4f4a-b1bb-b2896d05e618", - "metadata": {}, - "source": [ - "## References\n", - "\n", - "1. A tutorial introducing the tokenization scheme in BERT: [The huggingface NLP course on wordpiece tokenization](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt)\n", - "2. A specific example of \"failure\" in tokenization: [Weaknesses of wordpiece tokenization: Findings from the front lines of NLP at VMware.](https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99)\n", - "3. How does BERT decide boundaries between subtokens: [Subword tokenization in BERT](https://tinkerd.net/blog/machine-learning/bert-tokenization/#subword-tokenization)" - ] - }, - { - "cell_type": "markdown", - "id": "ce0812a7-f033-46ed-bc7b-67109c369e6c", - "metadata": {}, - "source": [ - "
\n", - "\n", - "## ❗ Key Points\n", - "\n", - "* Preprocessing includes multiple steps, some of them are more common to text data regardlessly, and some are task-specific. \n", - "* Both `nltk` and `spaCy` could be used for tokenization and stop word removal. The latter is more powerful in providing various linguistic annotations. \n", - "* Tokenization works differently in BERT, which often involves breaking down a whole word into subwords. \n", - "\n", - "
" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.4" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file From 4afb65685463e3e49fef2c9fc6d6793df2bf11a8 Mon Sep 17 00:00:00 2001 From: Pala63 <132717108+Pala63@users.noreply.github.com> Date: Tue, 29 Jul 2025 00:34:25 -0500 Subject: [PATCH 2/3] Created using Colab --- lessons/01_preprocessing.ipynb | 565 ++++++++++++++++++++++++--------- 1 file changed, 415 insertions(+), 150 deletions(-) diff --git a/lessons/01_preprocessing.ipynb b/lessons/01_preprocessing.ipynb index e6a9dd0..5d27a04 100644 --- a/lessons/01_preprocessing.ipynb +++ b/lessons/01_preprocessing.ipynb @@ -37,7 +37,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "id": "d442e4c7-e926-493d-a64e-516616ad915a", "metadata": { "id": "d442e4c7-e926-493d-a64e-516616ad915a", @@ -211,58 +211,60 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 3, "id": "3d1ff64b-53ad-4eca-b846-3fda20085c43", "metadata": { "id": "3d1ff64b-53ad-4eca-b846-3fda20085c43", - "outputId": "2df7adfd-01e6-452c-ae50-a9437cc486f0", + "outputId": "d6605b3e-bf59-40ff-99d9-d00a418e5a38", "colab": { "base_uri": "https://localhost:8080/", - "height": 335 + "height": 451 } }, "outputs": [ { - "output_type": "error", - "ename": "FileNotFoundError", - "evalue": "[Errno 2] No such file or directory: '../data/airline_tweets.csv'", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m/tmp/ipython-input-2-1378166650.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;31m# Specify the separator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 8\u001b[0;31m \u001b[0mtweets\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcsv_path\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msep\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m','\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;32m/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py\u001b[0m in \u001b[0;36mread_csv\u001b[0;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)\u001b[0m\n\u001b[1;32m 1024\u001b[0m 
\u001b[0mkwds\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mupdate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkwds_defaults\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1025\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1026\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_read\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1027\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1028\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py\u001b[0m in \u001b[0;36m_read\u001b[0;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[1;32m 618\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 619\u001b[0m \u001b[0;31m# Create the parser.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 620\u001b[0;31m \u001b[0mparser\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mTextFileReader\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 621\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 622\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mchunksize\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0miterator\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, f, engine, **kwds)\u001b[0m\n\u001b[1;32m 1618\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1619\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhandles\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mIOHandles\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1620\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_make_engine\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mengine\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1621\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1622\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py\u001b[0m in \u001b[0;36m_make_engine\u001b[0;34m(self, f, engine)\u001b[0m\n\u001b[1;32m 1878\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;34m\"b\"\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mmode\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1879\u001b[0m \u001b[0mmode\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;34m\"b\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1880\u001b[0;31m self.handles = get_handle(\n\u001b[0m\u001b[1;32m 1881\u001b[0m 
\u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1882\u001b[0m \u001b[0mmode\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/usr/local/lib/python3.11/dist-packages/pandas/io/common.py\u001b[0m in \u001b[0;36mget_handle\u001b[0;34m(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)\u001b[0m\n\u001b[1;32m 871\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mioargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mencoding\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;34m\"b\"\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mioargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmode\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 872\u001b[0m \u001b[0;31m# Encoding\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 873\u001b[0;31m handle = open(\n\u001b[0m\u001b[1;32m 874\u001b[0m \u001b[0mhandle\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 875\u001b[0m \u001b[0mioargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmode\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '../data/airline_tweets.csv'" - ] - } - ], - "source": [ - "# Import pandas\n", - "import pandas as pd\n", - "\n", - "# File path to data\n", - "csv_path = '../data/airline_tweets.csv'\n", - "\n", - "# Specify the separator\n", - "tweets = pd.read_csv(csv_path, sep=',')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e397ac6a-c2ba-4cce-8700-b36b38026c9d", - "metadata": { - "id": "e397ac6a-c2ba-4cce-8700-b36b38026c9d", - "outputId": "ca6b5529-f2b3-4d87-ccc2-2e24e405ee43" - }, - "outputs": [ - { + "output_type": "execute_result", "data": { + "text/plain": [ + " tweet_id airline_sentiment airline_sentiment_confidence \\\n", + "0 570306133677760513 neutral 1.0000 \n", + "1 570301130888122368 positive 0.3486 \n", + "2 570301083672813571 neutral 0.6837 \n", + "3 570301031407624196 negative 1.0000 \n", + "4 570300817074462722 negative 1.0000 \n", + "\n", + " negativereason negativereason_confidence airline \\\n", + "0 NaN NaN Virgin America \n", + "1 NaN 0.0000 Virgin America \n", + "2 NaN NaN Virgin America \n", + "3 Bad Flight 0.7033 Virgin America \n", + "4 Can't Tell 1.0000 Virgin America \n", + "\n", + " airline_sentiment_gold name negativereason_gold retweet_count \\\n", + "0 NaN cairdin NaN 0 \n", + "1 NaN jnardino NaN 0 \n", + "2 NaN yvonnalynn NaN 0 \n", + "3 NaN jnardino NaN 0 \n", + "4 NaN jnardino NaN 0 \n", + "\n", + " text tweet_coord \\\n", + "0 @VirginAmerica What @dhepburn said. NaN \n", + "1 @VirginAmerica plus you've added commercials t... NaN \n", + "2 @VirginAmerica I didn't today... Must mean I n... NaN \n", + "3 @VirginAmerica it's really aggressive to blast... NaN \n", + "4 @VirginAmerica and it's a really big bad thing... NaN \n", + "\n", + " tweet_created tweet_location user_timezone \n", + "0 2015-02-24 11:35:52 -0800 NaN Eastern Time (US & Canada) \n", + "1 2015-02-24 11:15:59 -0800 NaN Pacific Time (US & Canada) \n", + "2 2015-02-24 11:15:48 -0800 Lets Play Central Time (US & Canada) \n", + "3 2015-02-24 11:15:36 -0800 NaN Pacific Time (US & Canada) \n", + "4 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada) " + ], "text/html": [ - "
\n", + "\n", + "
\n", + "
\n", "\n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tweets", + "summary": "{\n \"name\": \"tweets\",\n \"rows\": 14640,\n \"fields\": [\n {\n \"column\": \"tweet_id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 779111158481836,\n \"min\": 567588278875213824,\n \"max\": 570310600460525568,\n \"num_unique_values\": 14485,\n \"samples\": [\n 567917894144770049,\n 567813976492417024,\n 569243676594941953\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"airline_sentiment\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"neutral\",\n \"positive\",\n \"negative\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"airline_sentiment_confidence\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.1628299590986659,\n \"min\": 0.335,\n \"max\": 1.0,\n \"num_unique_values\": 1023,\n \"samples\": [\n 0.6723,\n 0.3551,\n 0.6498\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"negativereason\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"Damaged Luggage\",\n \"Can't Tell\",\n \"Lost Luggage\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"negativereason_confidence\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.3304397596377413,\n \"min\": 0.0,\n \"max\": 1.0,\n \"num_unique_values\": 1410,\n \"samples\": [\n 0.6677,\n 0.6622,\n 0.6905\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"airline\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 6,\n \"samples\": [\n \"Virgin America\",\n \"United\",\n \"American\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"airline_sentiment_gold\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"negative\",\n \"neutral\",\n \"positive\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7701,\n \"samples\": [\n \"smckenna719\",\n \"thisAnneM\",\n \"jmspool\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"negativereason_gold\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 13,\n \"samples\": [\n \"Customer Service Issue\\nLost Luggage\",\n \"Late Flight\\nCancelled Flight\",\n \"Late Flight\\nFlight Attendant Complaints\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"retweet_count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 44,\n \"num_unique_values\": 18,\n \"samples\": [\n 0,\n 1,\n 6\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 14427,\n \"samples\": [\n \"@JetBlue so technically I could drive to JFK now and put in. Request for tomorrow's flight?\",\n \"@united why I won't check my carry on. 
Watched a handler throw this bag -- miss the conveyer belt -- sat there 10 min http://t.co/lyoocx5mSH\",\n \"@SouthwestAir you guys are so clever \\ud83d\\ude03 http://t.co/qn5odUGFqK\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tweet_coord\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 832,\n \"samples\": [\n \"[40.04915451, -75.10364317]\",\n \"[32.97609561, -96.53349238]\",\n \"[26.37852293, -81.78472152]\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tweet_created\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 14247,\n \"samples\": [\n \"2015-02-23 07:40:55 -0800\",\n \"2015-02-21 16:20:09 -0800\",\n \"2015-02-21 21:33:21 -0800\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tweet_location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3081,\n \"samples\": [\n \"Oakland, California\",\n \"Beverly Hills, CA\",\n \"Austin, TX/NY, NY\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"user_timezone\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 85,\n \"samples\": [\n \"Helsinki\",\n \"Eastern Time (US & Canada)\",\n \"America/Detroit\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } }, - "execution_count": 2, "metadata": {}, - "output_type": "execute_result" + "execution_count": 3 } ], "source": [ - "# Show the first five rows\n", - "tweets.head()" + "# Importar pandas\n", + "import pandas as pd\n", + "\n", + "# URL directa del archivo CSV en tu fork\n", + "csv_url = 'https://raw.githubusercontent.com/Pala63/Python-NLP-Fundamentals/main/data/airline_tweets.csv'\n", + "\n", + "# Leer el CSV directamente desde GitHub\n", + "tweets = pd.read_csv(csv_url)\n", + "\n", + "# Mostrar las primeras filas\n", + "tweets.head()\n" ] }, { @@ -467,27 +655,32 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "id": "b690daab-7be5-4b8f-8af0-a91fdec4ec4f", "metadata": { "id": "b690daab-7be5-4b8f-8af0-a91fdec4ec4f", - "outputId": "6c927a6f-0b55-40bc-d97f-b9d34360bae1" + "outputId": "c4f8dc33-331c-4224-a7cd-b25c21804c19", + "colab": { + "base_uri": "https://localhost:8080/" + } }, "outputs": [ { - "name": "stdout", "output_type": "stream", + "name": "stdout", "text": [ "@VirginAmerica What @dhepburn said.\n", "@VirginAmerica plus you've added commercials to the experience... tacky.\n", - "@VirginAmerica I didn't today... Must mean I need to take another trip!\n" + "@VirginAmerica I didn't today... Must mean I need to take another trip!\n", + "cairdin\n" ] } ], "source": [ - "print(tweets['text'].iloc[0])\n", + "print(tweets['text'].iloc[0]) #Seleccione solo la columna text, que contiene el contenido de los tweets y dame el primer resultado gracias a iloc[0]\n", "print(tweets['text'].iloc[1])\n", - "print(tweets['text'].iloc[2])" + "print(tweets['text'].iloc[2])\n", + "print(tweets['name'].iloc[0])" ] }, { @@ -500,6 +693,19 @@ "🔔 **Question**: What have you noticed? What are the stylistic features of tweets?" ] }, + { + "cell_type": "code", + "source": [ + "# Los tuits presentan un estilo informal y breve, con una fuerte presencia de menciones a aerolineas mediante el\n", + "# uso del @. 
En su mayoria presentan quejas o comentarios emocionales" + ], + "metadata": { + "id": "zTcJKWAkESjB" + }, + "id": "zTcJKWAkESjB", + "execution_count": null, + "outputs": [] + }, { "cell_type": "markdown", "id": "c3460393-00a6-461c-b02a-9e98f9b5d1af", @@ -520,16 +726,19 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "id": "58a95d90-3ef1-4bff-9cfe-d447ed99f252", "metadata": { "id": "58a95d90-3ef1-4bff-9cfe-d447ed99f252", - "outputId": "33094c0c-036c-42f0-9196-94ebeb135165" + "outputId": "db5cb29f-244a-4e65-a2d9-beda2169fb1d", + "colab": { + "base_uri": "https://localhost:8080/" + } }, "outputs": [ { - "name": "stdout", "output_type": "stream", + "name": "stdout", "text": [ "@VirginAmerica I was scheduled for SFO 2 DAL flight 714 today. Changed to 24th due weather. Looks like flight still on?\n" ] @@ -537,22 +746,25 @@ ], "source": [ "# Print the first example\n", - "first_example = tweets['text'][108]\n", + "first_example = tweets['text'][108] #me devuelve el indice 108 de la columna texto en tweets\n", "print(first_example)" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "id": "c66d91c0-6eed-4591-95fc-cd2eae2e0d41", "metadata": { "id": "c66d91c0-6eed-4591-95fc-cd2eae2e0d41", - "outputId": "452329e5-c12d-4698-c20b-a1162b29d3fb" + "outputId": "064c7f64-d0da-418b-b599-a16ee3c947b4", + "colab": { + "base_uri": "https://localhost:8080/" + } }, "outputs": [ { - "name": "stdout", "output_type": "stream", + "name": "stdout", "text": [ "False\n", "==================================================\n", @@ -595,20 +807,57 @@ }, { "cell_type": "code", - "execution_count": null, + "source": [ + "import re\n", + "\n", + "# Texto con saltos de línea y espacios extras\n", + "poema = \"\"\"\n", + "The world is too much with us; late and soon,\n", + "Getting and spending, we lay waste our powers;—\n", + "Little we see in Nature that is ours;\n", + "We have given our hearts away, a sordid boon!\n", + "\"\"\"\n", + "\n", + "# Usar regex para reemplazar todos los espacios en blanco múltiples por uno solo\n", + "texto_limpio = re.sub(r'\\s+', ' ', poema).strip()\n", + "\n", + "print(texto_limpio)\n" + ], + "metadata": { + "id": "-mlPW2tnLBvu", + "outputId": "cab7b948-166e-49be-e5b0-073168f5e322", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "id": "-mlPW2tnLBvu", + "execution_count": 10, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "The world is too much with us; late and soon, Getting and spending, we lay waste our powers;— Little we see in Nature that is ours; We have given our hearts away, a sordid boon!\n" + ] + } + ] + }, + { + "cell_type": "code", + "execution_count": 12, "id": "d1bd73f1-a30f-4269-a05e-47cfff7b496f", "metadata": { "id": "d1bd73f1-a30f-4269-a05e-47cfff7b496f" }, "outputs": [], "source": [ - "# File path to the poem\n", - "text_path = '../data/poem_wordsworth.txt'\n", + "# # File path to the poem\n", + "# text_path = '../data/poem_wordsworth.txt'\n", "\n", - "# Read the poem in\n", - "with open(text_path, 'r') as file:\n", - " text = file.read()\n", - " file.close()" + "# # Read the poem in\n", + "# with open(text_path, 'r') as file:\n", + "# text = file.read()\n", + "# file.close()" ] }, { @@ -657,56 +906,58 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 13, "id": "ddeade7a-065d-49e6-bdd3-87a8ea8f6e6e", "metadata": { "id": "ddeade7a-065d-49e6-bdd3-87a8ea8f6e6e", - "outputId": "c08d21ba-8154-4dac-c39c-e1ba5644b7ea" + "outputId": 
"8f2722b9-9603-4789-e0df-772206299507", + "colab": { + "base_uri": "https://localhost:8080/" + } }, "outputs": [ { + "output_type": "execute_result", "data": { "text/plain": [ - "['I wandered lonely as a cloud',\n", - " '',\n", - " '',\n", - " 'I wandered lonely as a cloud',\n", - " \"That floats on high o'er vales and hills,\",\n", - " 'When all at once I saw a crowd,',\n", - " 'A host, of golden daffodils;',\n", - " 'Beside the lake, beneath the trees,',\n", - " 'Fluttering and dancing in the breeze.',\n", - " '',\n", - " 'Continuous as the stars that shine',\n", - " 'And twinkle on the milky way,',\n", - " 'They stretched in never-ending line',\n", - " 'Along the margin of a bay:',\n", - " 'Ten thousand saw I at a glance,',\n", - " 'Tossing their heads in sprightly dance.',\n", - " '',\n", - " 'The waves beside them danced; but they',\n", - " 'Out-did the sparkling waves in glee:',\n", - " 'A poet could not but be gay,',\n", - " 'In such a jocund company:',\n", - " 'I gazed—and gazed—but little thought',\n", - " 'What wealth the show to me had brought:',\n", - " '',\n", - " 'For oft, when on my couch I lie',\n", - " 'In vacant or in pensive mood,',\n", - " 'They flash upon that inward eye',\n", - " 'Which is the bliss of solitude;',\n", - " 'And then my heart with pleasure fills,',\n", - " 'And dances with the daffodils.']" + "['The world is too much with us; late and soon, Getting and spending, we lay waste our powers;— Little we see in Nature that is ours; We have given our hearts away, a sordid boon!']" ] }, - "execution_count": 8, "metadata": {}, - "output_type": "execute_result" + "execution_count": 13 } ], "source": [ "# Split the single string into a list of lines\n", - "text.splitlines()" + "texto_limpio.splitlines()" + ] + }, + { + "cell_type": "code", + "source": [ + "texto_limpio = \"\"\"Hola, este es un mensaje.\n", + "Aquí empieza otra línea.\n", + "Y aquí otra más.\"\"\"\n", + "lineas = texto_limpio.splitlines()\n", + "print(lineas)" + ], + "metadata": { + "id": "_l6f5q_6kp77", + "outputId": "6a2bd1b8-bd7e-4c3c-c2e0-cc06ff420ba8", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "id": "_l6f5q_6kp77", + "execution_count": 14, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "['Hola, este es un mensaje.', 'Aquí empieza otra línea.', 'Y aquí otra más.']\n" + ] + } ] }, { @@ -721,22 +972,29 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 15, "id": "53a81ea9-65c4-474a-8530-35393555d1be", "metadata": { "id": "53a81ea9-65c4-474a-8530-35393555d1be", - "outputId": "49d34646-0cf5-4c45-eb29-1c83b6870e42" + "outputId": "5432af0f-7068-495a-8d26-16bba8d72043", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 36 + } }, "outputs": [ { + "output_type": "execute_result", "data": { "text/plain": [ "\"@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\\nit's really the only bad thing about flying VA\"" - ] + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + } }, - "execution_count": 9, "metadata": {}, - "output_type": "execute_result" + "execution_count": 15 } ], "source": [ @@ -779,7 +1037,7 @@ ], "source": [ "# Strip only removed blankspace at both ends\n", - "second_example.strip()" + "second_example.strip() #elimina espacios al inicio y al final pero tambien saltos de linea por eso el texto se ve sin salto de linea." 
] }, { @@ -794,14 +1052,17 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 16, "id": "ceac9714-7053-4b2e-affb-71f8c3d2dcd9", "metadata": { "id": "ceac9714-7053-4b2e-affb-71f8c3d2dcd9" }, "outputs": [], "source": [ - "import re" + "import re\n", + "#permite buscar, comparar, reemplazar o extraer partes de texto usando patrones complejos.\n", + "#Es muy útil cuando necesitas trabajar con texto de forma más precisa que con métodos básicos como .replace() o .split().\n", + "\n" ] }, { @@ -830,7 +1091,7 @@ "outputs": [], "source": [ "# Write a pattern in regex\n", - "blankspace_pattern = r'\\s+'" + "blankspace_pattern = r'\\s+'#detecta uno o más espacios en blanco (incluye \\n, \\t, y espacios normales)." ] }, { @@ -884,11 +1145,15 @@ } ], "source": [ - "# Replace whitespace(s) with ' '\n", - "clean_text = re.sub(pattern = blankspace_pattern,\n", - " repl = blankspace_repl,\n", - " string = second_example)\n", - "print(clean_text)" + "# Reemplaza uno o más espacios en blanco (como espacios, saltos de línea \\n, tabulaciones \\t) con un solo espacio (' ')\n", + "clean_text = re.sub(\n", + " pattern = blankspace_pattern, # Patrón que detecta espacios en blanco consecutivos (por ejemplo: r'\\s+')\n", + " repl = blankspace_repl, # Texto por el cual se reemplazarán los espacios (por ejemplo: un solo espacio ' ')\n", + " string = second_example # La cadena original en la que se aplicará el reemplazo\n", + ")\n", + "\n", + "# Imprime el texto limpio, con espacios en blanco normalizados\n", + "print(clean_text)\n" ] }, { @@ -950,7 +1215,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 18, "id": "237d868d-339d-4bbe-9a3b-20fa5fbdf231", "metadata": { "id": "237d868d-339d-4bbe-9a3b-20fa5fbdf231" From cfe756e40c797beb8772f0f14e9ab44d1298326e Mon Sep 17 00:00:00 2001 From: Pala63 <132717108+Pala63@users.noreply.github.com> Date: Tue, 29 Jul 2025 18:47:47 -0500 Subject: [PATCH 3/3] Created using Colab --- lessons/01_preprocessing.ipynb | 315 ++++++++++++++++++++++----------- 1 file changed, 211 insertions(+), 104 deletions(-) diff --git a/lessons/01_preprocessing.ipynb b/lessons/01_preprocessing.ipynb index 5d27a04..b4140c3 100644 --- a/lessons/01_preprocessing.ipynb +++ b/lessons/01_preprocessing.ipynb @@ -1,5 +1,15 @@ { "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, { "cell_type": "markdown", "id": "d3e7ea21-6437-48e8-a9e4-3bdc05f709c9", @@ -211,7 +221,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "id": "3d1ff64b-53ad-4eca-b846-3fda20085c43", "metadata": { "id": "3d1ff64b-53ad-4eca-b846-3fda20085c43", @@ -655,7 +665,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "id": "b690daab-7be5-4b8f-8af0-a91fdec4ec4f", "metadata": { "id": "b690daab-7be5-4b8f-8af0-a91fdec4ec4f", @@ -726,7 +736,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "id": "58a95d90-3ef1-4bff-9cfe-d447ed99f252", "metadata": { "id": "58a95d90-3ef1-4bff-9cfe-d447ed99f252", @@ -752,7 +762,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "id": "c66d91c0-6eed-4591-95fc-cd2eae2e0d41", "metadata": { "id": "c66d91c0-6eed-4591-95fc-cd2eae2e0d41", @@ -775,16 +785,19 @@ } ], "source": [ - "# Check if all characters are in lowercase\n", + "# En este bloque hago tres operaciones con el texto almacenado en 'first_example':\n", + "# 1) Verifico si todo el texto está en 
minúsculas usando .islower(), lo cual devuelve True o False.\n", + "# 2) Convierto el texto completamente a minúsculas usando .lower(), útil para estandarizar texto.\n", + "# 3) Convierto el texto a mayúsculas usando .upper(), lo que puede servir para resaltar o comparar palabras.\n", + "# Además, uso una línea de igualdades ('=' * 50) para separar visualmente los resultados en la consola.\n", + "\n", "print(first_example.islower())\n", "print(f\"{'=' * 50}\")\n", "\n", - "# Convert it to lowercase\n", "print(first_example.lower())\n", "print(f\"{'=' * 50}\")\n", "\n", - "# Convert it to uppercase\n", - "print(first_example.upper())" + "print(first_example.upper())\n" ] }, { @@ -831,7 +844,7 @@ } }, "id": "-mlPW2tnLBvu", - "execution_count": 10, + "execution_count": null, "outputs": [ { "output_type": "stream", @@ -844,7 +857,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": null, "id": "d1bd73f1-a30f-4269-a05e-47cfff7b496f", "metadata": { "id": "d1bd73f1-a30f-4269-a05e-47cfff7b496f" @@ -906,7 +919,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": null, "id": "ddeade7a-065d-49e6-bdd3-87a8ea8f6e6e", "metadata": { "id": "ddeade7a-065d-49e6-bdd3-87a8ea8f6e6e", @@ -928,18 +941,24 @@ } ], "source": [ - "# Split the single string into a list of lines\n", - "texto_limpio.splitlines()" + "# Utilizo .splitlines() para dividir el texto limpio en una lista de líneas.\n", + "# Esto es útil si el texto tiene saltos de línea (\\n) y quiero trabajar con cada línea por separado.\n", + "texto_limpio.splitlines()\n" ] }, { "cell_type": "code", "source": [ + "# Aquí defino una variable llamada 'texto_limpio' que contiene un texto multilínea usando triple comillas.\n", + "# Luego uso el método .splitlines() para dividir ese texto en una lista, separando cada línea donde haya un salto de línea (\\n).\n", + "# Finalmente, imprimo el resultado para ver cómo el texto se divide correctamente en líneas individuales.\n", + "\n", "texto_limpio = \"\"\"Hola, este es un mensaje.\n", "Aquí empieza otra línea.\n", "Y aquí otra más.\"\"\"\n", + "\n", "lineas = texto_limpio.splitlines()\n", - "print(lineas)" + "print(lineas)\n" ], "metadata": { "id": "_l6f5q_6kp77", @@ -949,7 +968,7 @@ } }, "id": "_l6f5q_6kp77", - "execution_count": 14, + "execution_count": null, "outputs": [ { "output_type": "stream", @@ -972,7 +991,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": null, "id": "53a81ea9-65c4-474a-8530-35393555d1be", "metadata": { "id": "53a81ea9-65c4-474a-8530-35393555d1be", @@ -999,7 +1018,7 @@ ], "source": [ "# Print the second example\n", - "second_example = tweets['text'][5]\n", + "second_example = tweets['text'][5] #En la columna text posicion 5 selecciono esa dentro de tweets\n", "second_example" ] }, @@ -1052,7 +1071,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": null, "id": "ceac9714-7053-4b2e-affb-71f8c3d2dcd9", "metadata": { "id": "ceac9714-7053-4b2e-affb-71f8c3d2dcd9" @@ -1198,6 +1217,13 @@ } ], "source": [ + "# Importo desde el módulo 'string' la variable 'punctuation', que contiene una cadena con todos los signos de puntuación comunes.\n", + "# Esto me sirve cuando quiero eliminar signos como .,!? 
de un texto sin tener que escribirlos uno por uno.\n", + "# Luego imprimo la lista para ver qué caracteres están incluidos.\n", + "\n", + "from string import punctuation\n", + "print(punctuation)\n", + "\n", "# Load in a predefined list of punctuation marks\n", "from string import punctuation\n", "print(punctuation)" @@ -1215,7 +1241,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": null, "id": "237d868d-339d-4bbe-9a3b-20fa5fbdf231", "metadata": { "id": "237d868d-339d-4bbe-9a3b-20fa5fbdf231" @@ -1225,16 +1251,20 @@ "def remove_punct(text):\n", " '''Remove punctuation marks in input text'''\n", "\n", - " # Select characters not in puncutaion\n", + " # Creo una lista vacía donde voy a guardar solo los caracteres que NO sean signos de puntuación\n", " no_punct = []\n", + "\n", + " # Recorro cada carácter del texto\n", " for char in text:\n", + " # Si el carácter no está en la lista de signos de puntuación (definida en 'punctuation'), lo agrego a la lista\n", " if char not in punctuation:\n", " no_punct.append(char)\n", "\n", - " # Join the characters into a string\n", + " # Una vez que tengo todos los caracteres que quiero conservar, los uno en una sola cadena con ''.join(...)\n", " text_no_punct = ''.join(no_punct)\n", "\n", - " return text_no_punct" + " # Devuelvo el texto ya limpio, sin signos de puntuación\n", + " return text_no_punct\n" ] }, { @@ -1276,13 +1306,13 @@ } ], "source": [ - "# Print the third example\n", + "# Nos muestra la diferencia de usar la funcion remove_punct y de no usarla.\n", "third_example = tweets['text'][20]\n", "print(third_example)\n", "print(f\"{'=' * 50}\")\n", "\n", "# Apply the function\n", - "remove_punct(third_example)" + "remove_punct(third_example) #removemos los caracteres de puntuacion" ] }, { @@ -1324,7 +1354,7 @@ } ], "source": [ - "# Print another tweet\n", + "# Mismo proceso usando la funcion remove_punct\n", "print(tweets['text'][100])\n", "print(f\"{'=' * 50}\")\n", "\n", @@ -1366,7 +1396,7 @@ "# Print a text with contraction\n", "contraction_text = \"We've got quite a bit of punctuation here, don't we?!? 
#Python @D-Lab.\"\n", "\n", - "# Apply the function\n", + "# Apply the function remove_punct\n", "remove_punct(contraction_text)" ] }, @@ -1429,12 +1459,16 @@ } ], "source": [ + "# Aquí definimos la ruta relativa al archivo 'example1.txt' que está dentro de la carpeta 'data', un nivel arriba de donde se ejecuta este script.\n", + "# Luego abrimos el archivo en modo lectura ('r') usando 'with', lo que garantiza que se cierre automáticamente después de leerlo.\n", + "# Leemos todo el contenido del archivo y lo guardamos en la variable 'challenge1'.\n", + "# Finalmente, imprimimos el contenido para verificar que se haya leído correctamente.\n", "challenge1_path = '../data/example1.txt'\n", "\n", "with open(challenge1_path, 'r') as file:\n", " challenge1 = file.read()\n", "\n", - "print(challenge1)" + "print(challenge1)\n" ] }, { @@ -1447,18 +1481,20 @@ }, "outputs": [], "source": [ - "def clean_text(text):\n", + "import re # Importamos para manejar expresiones regulares\n", "\n", - " # Step 1: Lowercase\n", - " text = ...\n", + "def clean_text(text):\n", + " # Paso 1: Convertimos todo el texto a minúsculas para unificar y facilitar el análisis\n", + " text = text.lower()\n", "\n", - " # Step 2: Use remove_punct to remove punctuation marks\n", - " text = ...\n", + " # Paso 2: Eliminamos los signos de puntuación usando la función remove_punct que definimos antes\n", + " text = remove_punct(text)\n", "\n", - " # Step 3: Remove extra whitespace characters\n", - " text = ...\n", + " # Paso 3: Eliminamos espacios en blanco adicionales (múltiples espacios, tabs, saltos de línea)\n", + " # Reemplazamos cualquier secuencia de espacios en blanco por un solo espacio y quitamos espacios al inicio y final\n", + " text = re.sub(r'\\s+', ' ', text).strip()\n", "\n", - " return text" + " return text\n" ] }, { @@ -1524,7 +1560,7 @@ ], "source": [ "# Print the example tweet\n", - "url_tweet = tweets['text'][13]\n", + "url_tweet = tweets['text'][13] #aqui solo estamos imrpimiendo el indice 13 dentro de la columna text que forma parte de tweets\n", "print(url_tweet)" ] }, @@ -1576,7 +1612,12 @@ } ], "source": [ - "# Hashtag\n", + "# Definimos un patrón (expresión regular) llamado 'url_pattern' que detecta URLs en un texto.\n", + "# Este patrón reconoce protocolos comunes como http, https y ftp, seguido de dominios y rutas posibles en la URL.\n", + "# Luego definimos 'url_repl' que es el texto con el que queremos reemplazar cualquier URL encontrada; en este caso, simplemente ' URL '.\n", + "# Finalmente, usamos re.sub() para buscar todas las coincidencias de URLs en 'url_tweet' y reemplazarlas por ' URL ',\n", + "# lo cual es útil para limpiar el texto y evitar que URLs específicas afecten el análisis de contenido.\n", + "re.sub(url_pattern, url_repl, url_tweet)\n", "hashtag_pattern = r'(?:^|\\s)[##]{1}(\\w+)'\n", "hashtag_repl = ' HASHTAG '\n", "re.sub(hashtag_pattern, hashtag_repl, url_tweet)" @@ -1616,7 +1657,9 @@ }, "outputs": [], "source": [ - "import nltk" + "## Importo la librería NLTK (Natural Language Toolkit), que es una de las más usadas en Python para procesamiento de lenguaje natural (NLP).\n", + "# Esta librería contiene herramientas para tokenización, etiquetado gramatical, análisis sintáctico, y mucho más.\n", + "import nltk\n" ] }, { @@ -1662,12 +1705,13 @@ } ], "source": [ - "# Load word_tokenize\n", + "# Aquí importo la función word_tokenize de NLTK para poder dividir un texto en palabras (tokens) de manera sencilla.\n", + "# Luego selecciono un tweet de ejemplo (índice 7) de mi DataFrame 
'tweets' y lo imprimo para verlo antes de tokenizarlo.\n", "from nltk.tokenize import word_tokenize\n", "\n", "# Print the example\n", "text = tweets['text'][7]\n", - "print(text)" + "print(text)\n" ] }, { @@ -1708,9 +1752,12 @@ } ], "source": [ - "# Apply the NLTK tokenizer\n", + "# Aplico la función word_tokenize al texto para dividirlo en tokens o palabras individuales.\n", + "# Esto es fundamental para el procesamiento de lenguaje natural, porque trabajar con palabras aisladas facilita análisis como conteo, filtrado, etc.\n", "nltk_tokens = word_tokenize(text)\n", - "nltk_tokens" + "\n", + "# Muestro la lista de tokens resultante para revisar cómo quedó dividido el texto\n", + "nltk_tokens\n" ] }, { @@ -1748,8 +1795,9 @@ }, "outputs": [], "source": [ - "# Load predefined stop words from nltk\n", - "from nltk.corpus import stopwords" + "# Importo la lista predefinida de stopwords (palabras vacías) desde NLTK.\n", + "# Estas palabras, como \"el\", \"la\", \"y\", \"de\", suelen ser muy frecuentes y a veces no aportan significado importante en análisis de texto.\n", + "from nltk.corpus import stopwords\n" ] }, { @@ -1783,9 +1831,10 @@ } ], "source": [ - "# Print the first 10 stopwords\n", + "# Cargo la lista de stopwords en inglés usando NLTK y guardo en la variable 'stop'.\n", + "# Luego imprimo las primeras 10 para ver ejemplos de estas palabras comunes que suelen eliminarse en análisis.\n", "stop = stopwords.words('english')\n", - "stop[:10]" + "stop[:10]\n" ] }, { @@ -1836,8 +1885,10 @@ }, "outputs": [], "source": [ + "# Importo la biblioteca spaCy, una herramienta poderosa para procesamiento de lenguaje natural.\n", + "# Luego cargo el modelo en inglés 'en_core_web_sm', que incluye vocabulario, reglas y datos para analizar textos.\n", "import spacy\n", - "nlp = spacy.load('en_core_web_sm')" + "nlp = spacy.load('en_core_web_sm')\n" ] }, { @@ -1873,8 +1924,10 @@ } ], "source": [ - "# Retrieve components included in NLP pipeline\n", - "nlp.pipe_names" + "# Consulto los nombres de los componentes que conforman la \"pipeline\" de procesamiento en spaCy.\n", + "# La pipeline es una secuencia de pasos automáticos que spaCy aplica al texto, como tokenización, análisis sintáctico, reconocimiento de entidades, etc.\n", + "# Mostrar estos nombres me ayuda a entender qué análisis está haciendo el modelo cargado.\n", + "nlp.pipe_names\n" ] }, { @@ -1896,8 +1949,9 @@ }, "outputs": [], "source": [ - "# Apply the pipeline to example tweet\n", - "doc = nlp(tweets['text'][7])" + "# Aplico la pipeline de spaCy al texto del tweet en la posición 7 del DataFrame 'tweets'.\n", + "# Esto procesa el texto automáticamente y devuelve un objeto 'Doc' que contiene tokens, etiquetas gramaticales, entidades y más.\n", + "doc = nlp(tweets['text'][7])\n" ] }, { @@ -1965,9 +2019,12 @@ } ], "source": [ - "# Get the verbatim texts of tokens\n", + "# Extraigo el texto original (verbatim) de cada token que spaCy identificó en el objeto 'doc'.\n", + "# Esto crea una lista con todas las palabras y símbolos tal cual aparecen en el texto procesado.\n", "spacy_tokens = [token.text for token in doc]\n", - "spacy_tokens" + "\n", + "# Muestro la lista de tokens para revisar cómo spaCy dividió el texto\n", + "spacy_tokens\n" ] }, { @@ -2008,8 +2065,9 @@ } ], "source": [ - "# Get the NLTK tokens\n", - "nltk_tokens" + "# Aquí simplemente mostramos la lista de tokens que obtuvimos anteriormente con NLTK usando word_tokenize.\n", + "# Esto nos permite comparar cómo NLTK y spaCy dividen el texto en palabras.\n", + "nltk_tokens\n" ] }, 
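Before moving on to the stop-word annotations, it can help to line the two tokenizations up programmatically. A small sketch, assuming the `nltk_tokens` and `spacy_tokens` lists from the cells above are still defined:

```python
# Compare how NLTK and spaCy tokenized the same tweet
print(f"NLTK : {len(nltk_tokens)} tokens")
print(f"spaCy: {len(spacy_tokens)} tokens")

# Tokens produced by one tokenizer but not by the other
print("Only in NLTK :", [t for t in nltk_tokens if t not in spacy_tokens])
print("Only in spaCy:", [t for t in spacy_tokens if t not in nltk_tokens])
```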
{ @@ -2059,11 +2117,12 @@ } ], "source": [ - "# Retrieve the is_stop annotation\n", + "# Obtengo para cada token del objeto 'doc' de spaCy si es una palabra vacía (stopword) o no,\n", + "# usando la propiedad 'is_stop' que devuelve True si el token es una stopword y False en caso contrario.\n", "spacy_stops = [token.is_stop for token in doc]\n", "\n", - "# The results are boolean values\n", - "spacy_stops" + "# Muestro la lista de valores booleanos para cada token, indicando cuáles son stopwords\n", + "spacy_stops\n" ] }, { @@ -2096,13 +2155,17 @@ }, "outputs": [], "source": [ + "from nltk.tokenize import word_tokenize\n", + "\n", "def remove_stopword_nltk(raw_text, stopword):\n", + " # Paso 1: Tokenizo el texto crudo usando word_tokenize para dividirlo en palabras\n", + " tokens = word_tokenize(raw_text)\n", "\n", - " # Step 1: Tokenization with nltk\n", - " # YOUR CODE HERE\n", + " # Paso 2: Creo una nueva lista con los tokens que NO estén en la lista de stopwords\n", + " filtered_tokens = [token for token in tokens if token.lower() not in stopword]\n", "\n", - " # Step 2: Filter out tokens in the stop word list\n", - " # YOUR CODE HERE" + " # Devuelvo la lista filtrada de tokens\n", + " return filtered_tokens\n" ] }, { @@ -2115,12 +2178,14 @@ "outputs": [], "source": [ "def remove_stopword_spacy(raw_text):\n", + " # Paso 1: Aplico la pipeline de spaCy al texto crudo para procesarlo y obtener un objeto Doc\n", + " doc = nlp(raw_text)\n", "\n", - " # Step 1: Apply the nlp pipeline\n", - " # YOUR CODE HERE\n", + " # Paso 2: Creo una lista con los textos de los tokens que NO son stopwords usando la propiedad is_stop\n", + " filtered_tokens = [token.text for token in doc if not token.is_stop]\n", "\n", - " # Step 2: Filter out tokens that are stop words\n", - " # YOUR CODE HERE" + " # Devuelvo la lista filtrada de tokens\n", + " return filtered_tokens\n" ] }, { @@ -2132,7 +2197,9 @@ }, "outputs": [], "source": [ - "# remove_stopword_nltk(text, stop)" + "# Aquí llamo a la función remove_stopword_nltk, pasándole el texto que quiero limpiar y la lista de stopwords en inglés.\n", + "# La función devolverá una lista de tokens donde se han eliminado las palabras vacías (stopwords).\n", + "remove_stopword_nltk(text, stop)\n" ] }, { @@ -2144,7 +2211,9 @@ }, "outputs": [], "source": [ - "# remove_stopword_spacy(text)" + "# Aquí llamo a la función remove_stopword_spacy, pasándole el texto que quiero limpiar.\n", + "# La función procesará el texto con spaCy y devolverá una lista de tokens sin las stopwords.\n", + "remove_stopword_spacy(text)\n" ] }, { @@ -2193,9 +2262,16 @@ } ], "source": [ - "# Print tokens and their annotations\n", + "# Recorro cada token en el objeto 'doc' procesado por spaCy\n", + "# Para cada token imprimo varias anotaciones:\n", + "# - El texto original (token.text)\n", + "# - Su lema o forma base (token.lemma_)\n", + "# - La categoría gramatical (token.pos_)\n", + "# - La explicación legible de esa categoría (spacy.explain(token.pos_))\n", + "# - Si el token parece una URL (token.like_url)\n", + "# Uso formateo con '<' para alinear columnas y que se vea ordenado en la consola.\n", "for token in doc:\n", - " print(f\"{token.text:<24} | {token.lemma_:<24} | {token.pos_:<12} | {spacy.explain(token.pos_):<12} | {token.like_url:<12} |\")" + " print(f\"{token.text:<24} | {token.lemma_:<24} | {token.pos_:<12} | {spacy.explain(token.pos_):<12} | {token.like_url:<12} |\")\n" ] }, { @@ -2228,12 +2304,15 @@ } ], "source": [ - "# Print example tweets with place names and airport codes\n", + 
"# Selecciono dos tweets de ejemplo del DataFrame 'tweets':\n", + "# - 'tweet_city' es el texto en la posición 8273, que probablemente contiene nombres de ciudades o lugares.\n", + "# - 'tweet_airport' es el texto en la posición 502, que puede contener códigos de aeropuertos u otras referencias.\n", + "# Luego imprimo ambos textos para revisarlos, separando visualmente con una línea de 50 '=' para mayor claridad en la salida.\n", "tweet_city = tweets['text'][8273]\n", "tweet_airport = tweets['text'][502]\n", "print(tweet_city)\n", "print(f\"{'=' * 50}\")\n", - "print(tweet_airport)" + "print(tweet_airport)\n" ] }, { @@ -2268,10 +2347,17 @@ } ], "source": [ - "# Print entities identified from the text\n", + "# Aplico la pipeline de spaCy al texto del tweet que contiene nombres de ciudades ('tweet_city').\n", + "# Luego, recorro todas las entidades nombradas (ents) que spaCy detectó en el texto.\n", + "# Para cada entidad imprimo:\n", + "# - El texto de la entidad (ent.text)\n", + "# - La posición donde inicia en caracteres dentro del texto (ent.start_char)\n", + "# - La posición donde termina en caracteres dentro del texto (ent.end_char)\n", + "# - La etiqueta que describe el tipo de entidad (ent.label_), por ejemplo, GPE (lugares), PERSON, ORG, etc.\n", + "# Uso formateo para que la salida quede ordenada en columnas.\n", "doc_city = nlp(tweet_city)\n", "for ent in doc_city.ents:\n", - " print(f\"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}\")" + " print(f\"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}\")\n" ] }, { @@ -2334,9 +2420,11 @@ } ], "source": [ - "# Visualize the identified entities\n", + "# Importo 'displacy' de spaCy, que es una herramienta para visualizar gráficamente las entidades nombradas en un texto.\n", + "# Luego uso displacy.render para mostrar las entidades reconocidas en 'doc_city' con estilo 'ent' (entidades),\n", + "# y con 'jupyter=True' hago que se muestre directamente en un notebook Jupyter.\n", "from spacy import displacy\n", - "displacy.render(doc_city, style='ent', jupyter=True)" + "displacy.render(doc_city, style='ent', jupyter=True)\n" ] }, { @@ -2369,10 +2457,16 @@ } ], "source": [ - "# Print entities identified from the text\n", + "# Aplico la pipeline de spaCy al texto del tweet que contiene códigos de aeropuerto ('tweet_airport').\n", + "# Luego recorro todas las entidades que spaCy detectó en ese texto.\n", + "# Para cada entidad imprimo:\n", + "# - El texto exacto de la entidad\n", + "# - La posición inicial y final dentro del texto (en caracteres)\n", + "# - La etiqueta que describe el tipo de entidad (como ORG, GPE, LOC, etc.)\n", + "# Uso formato alineado para que la salida sea clara y legible.\n", "doc_airport = nlp(tweet_airport)\n", "for ent in doc_airport.ents:\n", - " print(f\"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}\")" + " print(f\"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}\")\n" ] }, { @@ -2423,8 +2517,10 @@ } ], "source": [ - "# Visualize the identified entities\n", - "displacy.render(doc_airport, style='ent', jupyter=True)" + "# Utilizo displacy para visualizar gráficamente las entidades que spaCy detectó en 'doc_airport'.\n", + "# El parámetro 'style=\"ent\"' indica que quiero ver las entidades nombradas resaltadas,\n", + "# y 'jupyter=True' permite que la visualización se muestre directamente en el notebook.\n", + "displacy.render(doc_airport, style='ent', jupyter=True)\n" ] }, { @@ -2467,11 +2563,12 @@ 
}, "outputs": [], "source": [ - "# Load BERT tokenizer in\n", + "# Importo el tokenizador de BERT desde la librería transformers, que permite convertir texto en tokens compatibles con modelos BERT.\n", "from transformers import BertTokenizer\n", "\n", - "# Initialize the tokenizer\n", - "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')" + "# Inicializo el tokenizador usando el modelo preentrenado 'bert-base-uncased',\n", + "# que es una versión base de BERT que convierte todo el texto a minúsculas y maneja vocabulario propio.\n", + "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n" ] }, { @@ -2507,15 +2604,17 @@ } ], "source": [ - "# Select an example tweet from dataframe\n", + "# Selecciono un tweet de ejemplo en la posición 194 del DataFrame 'tweets' y lo imprimo para verlo.\n", "text = tweets['text'][194]\n", "print(f\"Text: {text}\")\n", "print(f\"{'=' * 50}\")\n", "\n", - "# Apply tokenizer\n", + "# Aplico el tokenizador de BERT al texto para dividirlo en subpalabras o tokens que entiende BERT.\n", "tokens = tokenizer.tokenize(text)\n", + "\n", + "# Imprimo la lista de tokens resultante y la cantidad total de tokens generados.\n", "print(f\"Tokens: {tokens}\")\n", - "print(f\"Number of tokens: {len(tokens)}\")" + "print(f\"Number of tokens: {len(tokens)}\")\n" ] }, { @@ -2553,11 +2652,11 @@ } ], "source": [ - "# Get the input ID of the word\n", + "# Obtengo el ID numérico que representa la palabra 'just' dentro del vocabulario del tokenizador BERT.\n", "print(f\"ID of just is: {tokenizer.vocab['just']}\")\n", "\n", - "# Get the text of the input ID\n", - "print(f\"Token 2074 is: {tokenizer.decode([2074])}\")" + "# Decodifico el token con ID 2074 para obtener el texto correspondiente a ese ID.\n", + "print(f\"Token 2074 is: {tokenizer.decode([2074])}\")\n" ] }, { @@ -2589,10 +2688,12 @@ } ], "source": [ - "# Convert a list of tokens to a list of input IDs\n", + "# Convierto la lista de tokens generada anteriormente en una lista de IDs numéricos que el modelo BERT puede procesar.\n", "input_ids = tokenizer.convert_tokens_to_ids(tokens)\n", + "\n", + "# Imprimo la cantidad total de IDs y la lista completa de IDs para revisar cómo quedó codificado el texto.\n", "print(f\"Number of input IDs: {len(input_ids)}\")\n", - "print(f\"Input IDs of text: {input_ids}\")" + "print(f\"Input IDs of text: {input_ids}\")\n" ] }, { @@ -2634,10 +2735,12 @@ } ], "source": [ - "# Get the input IDs by providing the key\n", + "# Obtengo los IDs de entrada directamente usando el tokenizador, que incluye tokens especiales y prepara el texto para BERT.\n", "input_ids_from_tokenizer = tokenizer(text)['input_ids']\n", + "\n", + "# Imprimo la cantidad de IDs generados y la lista completa para verificar el resultado.\n", "print(f\"Number of input IDs: {len(input_ids_from_tokenizer)}\")\n", - "print(f\"IDs from tokenizer: {input_ids_from_tokenizer}\")" + "print(f\"IDs from tokenizer: {input_ids_from_tokenizer}\")\n" ] }, { @@ -2671,9 +2774,9 @@ } ], "source": [ - "# Convert input IDs to texts\n", + "# Convierto los IDs 101 y 102 de vuelta a sus tokens de texto correspondientes según el vocabulario del tokenizador BERT.\n", "print(f\"The 101st token: {tokenizer.convert_ids_to_tokens(101)}\")\n", - "print(f\"The 102nd token: {tokenizer.convert_ids_to_tokens(102)}\")" + "print(f\"The 102nd token: {tokenizer.convert_ids_to_tokens(102)}\")\n" ] }, { @@ -2711,9 +2814,9 @@ "outputs": [], "source": [ "def get_tokens(string):\n", - " '''Tokenzie the input string with BERT'''\n", - " tokens = 
tokenizer.tokenize(string)\n", - " return print(tokens)" + " '''Tokenize the input string with BERT tokenizer'''\n", + " tokens = tokenizer.tokenize(string) # Apply the BERT tokenizer to split the string into tokens\n", + " return print(tokens) # Print the resulting list of tokens\n" ] }, { @@ -2737,19 +2840,22 @@ } ], "source": [ - "# Abbreviations\n", + "# Examples of how the tokenizer handles different kinds of words and special terms:\n", + "\n", + "# Abbreviation or short word, which may come out as one token or several\n", "get_tokens('dlab')\n", "\n", - "# OOV\n", + "# Out-of-vocabulary (OOV) word, which the tokenizer may split into subwords\n", "get_tokens('covid')\n", "\n", - "# Prefix\n", + "# Word with an uncommon prefix, to see how it gets tokenized\n", "get_tokens('huggable')\n", "\n", - "# Digits\n", + "# Number, which is also tokenized and may be split into separate tokens\n", "get_tokens('378')\n", "\n", - "# YOUR EXAMPLE" + "# YOUR OWN EXAMPLE: try any word or term you want to analyze\n", + "get_tokens('chatgpt')\n" ] }, { @@ -2814,7 +2920,8 @@ "version": "3.11.4" }, "colab": { - "provenance": [] + "provenance": [], + "include_colab_link": true } }, "nbformat": 4,
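To tie the BERT material together, one more hedged sketch: it assumes the `tokenizer` loaded earlier (`bert-base-uncased`) and uses a hypothetical sentence to show how text, subword tokens, and input IDs map onto each other, including the special [CLS] (ID 101) and [SEP] (ID 102) tokens that the full tokenizer call adds:

```python
# Hypothetical sentence; any short string behaves the same way
sample = "Preprocessing tweets is surprisingly tricky"

# Subword tokens: rarer words are split into '##' pieces
tokens = tokenizer.tokenize(sample)
print(tokens)

# The full call adds [CLS] (ID 101) at the start and [SEP] (ID 102) at the end
ids = tokenizer(sample)['input_ids']
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))

# decode() maps the IDs back to lowercased, normalized text
print(tokenizer.decode(ids, skip_special_tokens=True))
```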