Dirty data, as opposed to clean data, are non-curated categorical columns containing variations such as typos, abbreviations, duplicates, and alternate naming conventions.
Then, we create an instance of the GapEncoder with 10 components. This means that the encoder will attempt to extract 10 latent topics from the input data.
The GapEncoder can be understood as a continuous encoding on a set of latent topics estimated from the data. The latent topics are built by capturing combinations of substrings that frequently co-occur, and the encoded vectors correspond to the activations of these topics. To interpret the latent topics, we select for each of them a few labels from the input data with the highest activations. In this example, we select 3 labels to summarize each topic.
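The label-selection step described above can be sketched in plain Python. This is only an illustration of the idea, not the library's own code: the `labels`, the `activations` matrix, and the helper `top_labels_per_topic` are hypothetical, and the numbers are made up rather than produced by a fitted GapEncoder. Only 3 topics are used here to keep the example short.

```python
# Hypothetical input labels (one per row of the activation matrix).
labels = [
    "police officer",
    "police sergeant",
    "police aide",
    "firefighter",
    "fire fighter ii",
    "fire captain",
    "bus operator",
    "school bus driver",
    "bus mechanic",
]

# Made-up activations: one row per label, one column per latent topic.
activations = [
    [0.92, 0.03, 0.05],
    [0.81, 0.02, 0.04],
    [0.64, 0.06, 0.09],
    [0.04, 0.95, 0.02],
    [0.07, 0.77, 0.03],
    [0.10, 0.58, 0.08],
    [0.01, 0.05, 0.88],
    [0.02, 0.04, 0.71],
    [0.05, 0.01, 0.49],
]


def top_labels_per_topic(labels, activations, n_labels=3):
    """Return, for each topic, the n_labels labels with the highest activation."""
    n_topics = len(activations[0])
    summaries = []
    for topic in range(n_topics):
        # Rank label indices by their activation on this topic, highest first.
        ranked = sorted(
            range(len(labels)),
            key=lambda i: activations[i][topic],
            reverse=True,
        )
        summaries.append([labels[i] for i in ranked[:n_labels]])
    return summaries


topic_summaries = top_labels_per_topic(labels, activations, n_labels=3)
for topic, names in enumerate(topic_summaries):
    print(f"topic {topic}: {', '.join(names)}")
```

Each topic is summarized by the labels that activate it most strongly, so a reader can give the topic a name (here, roughly "police", "fire", and "bus") without inspecting the substrings themselves.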