Dataview:

list from [[]] and !outgoing([[]])

πŸ“— -> Text Tokenization and Vectorization in NLP

Medium Article

🎀 Vocab

  • NLP (Natural Language Processing): Using algorithms to analyze and process human language.
  • Tokenization: Splitting text into smaller units such as words or phrases.
  • Vectorization: Converting text into numerical representations for ML models.
  • Reformatting: Changing the structure or representation of data.

❗ Information

Text Tokenization

  • Text tokenization is the process of converting text into smaller units called β€œtokens”.

  • One of the first and most important steps in NLP

  • Different tokenization approaches exist:

    • Basic methods simply split text on whitespace or punctuation
    • Advanced methods split words themselves into subword or other linguistic units
  • The goal is to best represent text for ML purposes

  • nltk.tokenize.word_tokenize(text)
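
A minimal sketch of word tokenization with NLTK's word_tokenize (the example sentence and the printed output comment are illustrative):

```python
# Minimal sketch: word tokenization with NLTK.
# Assumes NLTK is installed; depending on the NLTK version, the "punkt"
# (or "punkt_tab") tokenizer resource must be downloaded once beforehand.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # one-time resource download

text = "Tokenization splits text into smaller units, called tokens."
tokens = word_tokenize(text)
print(tokens)
# Roughly: ['Tokenization', 'splits', 'text', 'into', 'smaller', 'units', ',', 'called', 'tokens', '.']
```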

Text Vectorization:

  • Turning text into numerical representations (vectors) so that it can be understood by ML models.
    Common Methods:
  • One-hot encoding: (representing each word as a binary vector with a single 1 at that word's unique index)
  • Bag-of-words: (counting the occurrence of words within each document; see the sketch after this list)
    • from sklearn.feature_extraction.text import CountVectorizer
  • Word embeddings: (mapping words to dense vectors so as to capture meaning)
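
A minimal bag-of-words sketch using the CountVectorizer import noted above (the two example documents are made up):

```python
# Minimal sketch: bag-of-words vectorization with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document, one column per vocabulary word

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray())                         # word counts per document
```

Passing binary=True to CountVectorizer records word presence/absence instead of counts, which is closer in spirit to the one-hot idea above.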
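
For word embeddings, a hedged sketch using gensim's Word2Vec (gensim is an assumption, not something the note names, and the toy corpus is far too small to learn meaningful vectors):

```python
# Minimal sketch: training word embeddings with gensim's Word2Vec.
# The tiny corpus only illustrates the API shape, not useful embeddings.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

model = Word2Vec(sentences=sentences, vector_size=50, window=3, min_count=1, epochs=20)

print(model.wv["cat"])               # 50-dimensional vector for "cat"
print(model.wv.most_similar("cat"))  # nearest words by cosine similarity
```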

πŸ“„ -> Methodology

  • Simple or full description

βœ’οΈ -> Usage

  • How and where is it used

πŸ§ͺ -> Example

  • Define examples where it can be used
  • Fun fact! Word tokenization was used in the early NLP program SHRDLU (late 1960s), which operated in the original blocks-world environment.