You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

1.2 KiB

D02 Piscine AI - Data Science

Table of Contents:

Introduction

Machine learning algorithms cannot work with raw text directly. Rather, the text must be converted into vectors of numbers. In natural language processing, a common technique for extracting features from text is to place all of the words that occur in the text in a bucket. This aproach is called a bag of words model or BoW for short. It’s referred to as a “bag” of words because any information about the structure of the sentence is lost.

Almost every Natural Language Processing (NLP) task requires text to be preprocessed before training a model. Deep learning models cannot use raw text directly, so it is up to us researchers to clean the text ourselves. Depending on the nature of the task, the preprocessing methods can be different.

https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1

The algorithms do not understand words. They need a mathematical reprensation of them. Today we will learn two important mathematical representations:

  • Bag of Words
  • Embedding

Each approach has its limits. Context ..

Les packages NLTK and Spacy to do the preprocessing