Text Tokenization and Vectorization in NLP
Vocab
- NLP: Using algorithms to analyze and process human language.
- Tokenization: Splitting text into smaller units such as words or phrases.
- Vectorization: Converting text into numerical representations for ML models.
- Reformatting: Changing the structure or representation of data.
Information
Text Tokenization
- Text tokenization converts text into smaller units called "tokens".
- It is one of the first and most important steps in an NLP pipeline.
- Different approaches exist:
	- Basic methods split text on whitespace or punctuation.
	- Advanced methods split within words themselves, tokenizing smaller linguistic units (subwords).
- The goal is to represent text in the form best suited for the ML task.
- Example: `nltk.tokenize.word_tokenize(text)`
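The basic splitting methods above can be sketched in plain Python with a regular expression; this is a minimal illustration, not a replacement for a real tokenizer like NLTK's `word_tokenize`, which handles many more edge cases:

```python
import re

def simple_tokenize(text):
    # A basic method: runs of word characters become tokens,
    # and each punctuation character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Don't split me, please!"))
# → ['Don', "'", 't', 'split', 'me', ',', 'please', '!']
```

Note how naive punctuation splitting breaks the contraction "Don't" apart; handling cases like this is exactly why more advanced tokenizers exist.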
Text Vectorization
- Turning text into numerical representations (vectors) so that it can be understood by ML models.
- Common methods:
	- One-hot encoding: representing each word as a binary vector that is all zeros except for a single 1 at that word's index in the vocabulary.
	- Bag-of-words: counting the occurrences of words within each document, e.g. `sklearn.feature_extraction.text.CountVectorizer`.
	- Word embeddings: mapping words to dense vectors so as to capture meaning.
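The bag-of-words method above can be sketched in plain Python; this is a toy version of what scikit-learn's `CountVectorizer` does, without its normalization and analyzer options:

```python
from collections import Counter

def bag_of_words(docs):
    # Build a shared vocabulary across all documents,
    # then count each vocabulary word's occurrences per document.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for doc in tokenized for word in doc})
    counts = [Counter(doc) for doc in tokenized]
    return vocab, [[c[word] for word in vocab] for c in counts]

docs = ["the cat sat", "the cat saw the dog"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each document becomes a fixed-length count vector over the shared vocabulary, which is exactly the representation an ML model can consume.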
Methodology
- Simple or full description
Usage
- How and where it is used
Example
- Define examples where it can be used
Related Words
- Fun fact! Word-level processing of typed English was used in SHRDLU, the early NLP program from the late 1960s that operated in the original blocks world.