Metadata

Highlights

  • In this notebook, we will explore the output and inner workings of the GapEncoder, one of the high cardinality categorical encoders provided by skrub.
  • Dirty data, as opposed to clean data, are non-curated categorical columns containing variations such as typos, abbreviations, duplicates, and alternate naming conventions.
  • Then, we create an instance of the GapEncoder with 10 components. This means that the encoder will attempt to extract 10 latent topics from the input data.
  • The GapEncoder can be understood as a continuous encoding on a set of latent topics estimated from the data. The latent topics are built by capturing combinations of substrings that frequently co-occur, and encoded vectors correspond to their activations. To interpret these latent topics, we select for each of them a few labels from the input data with the highest activations. In the example below we select 3 labels to summarize each topic.
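
The interpretation step in the last highlight (picking, for each topic, the few input labels with the highest activations) can be sketched in plain Python. This is only an illustration of the selection logic, not skrub's implementation: the helper `top_labels_per_topic`, the labels, and the activation values are all made up for the example, and a real activation matrix would come from a fitted GapEncoder.

```python
from typing import List


def top_labels_per_topic(
    labels: List[str],
    activations: List[List[float]],
    n_labels: int = 3,
) -> List[List[str]]:
    """Hypothetical helper: for each topic (column of `activations`),
    return the `n_labels` labels whose activation on that topic is highest."""
    n_topics = len(activations[0])
    summary = []
    for topic in range(n_topics):
        # Rank row indices by their activation on this topic, highest first.
        ranked = sorted(
            range(len(labels)),
            key=lambda i: activations[i][topic],
            reverse=True,
        )
        summary.append([labels[i] for i in ranked[:n_labels]])
    return summary


# Invented "dirty" labels (note the typo "polce") and invented activations:
# one row per label, one column per latent topic.
labels = [
    "police officer",
    "police sergeant",
    "polce officer",
    "school teacher",
    "teacher aide",
    "bus operator",
]
activations = [
    [0.9, 0.0],
    [0.7, 0.1],
    [0.8, 0.0],
    [0.0, 0.9],
    [0.1, 0.6],
    [0.2, 0.3],
]

topic_summary = top_labels_per_topic(labels, activations, n_labels=3)
for k, names in enumerate(topic_summary):
    print(f"topic {k}: {', '.join(names)}")
```

In this toy setup the misspelled "polce officer" lands in the same topic as the correctly spelled police labels, which is the kind of grouping the highlights describe for dirty categories.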