1.2 KiB
D02 Piscine AI - Data Science
Table of Contents:
Introduction
Machine learning algorithms cannot work with raw text directly. Rather, the text must be converted into vectors of numbers. In natural language processing, a common technique for extracting features from text is to place all of the words that occur in the text in a bucket. This aproach is called a bag of words model or BoW for short. It’s referred to as a “bag” of words because any information about the structure of the sentence is lost.
Almost every Natural Language Processing (NLP) task requires text to be preprocessed before training a model. Deep learning models cannot use raw text directly, so it is up to us researchers to clean the text ourselves. Depending on the nature of the task, the preprocessing methods can be different.
https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1
The algorithms do not understand words. They need a mathematical reprensation of them. Today we will learn two important mathematical representations:
- Bag of Words
- Embedding
Each approach has its limits. Context ..
Les packages NLTK and Spacy to do the preprocessing