Being anonymous over the internet can sometimes make people say nasty things that they normally would not in real life. Let's filter out the hate from our platforms one comment at a time.
The goal of this notebook is to build an EDA and feature-engineering starter for toxic comment classification.
The dataset here comes from a Wikipedia talk page corpus that was rated by human raters for toxicity. The corpus contains 63M comments from discussions relating to user pages and articles, dating from 2004 to 2015.
Different platforms/sites can have different standards for their toxicity screening process. Hence the comments are tagged in the following six categories:
toxic
severe_toxic
obscene
threat
insult
identity_hate
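Because a single comment can carry several of these tags at once, this is a multi-label problem rather than a multi-class one. A minimal sketch of how the labels can be inspected, using the six column names above and a tiny made-up sample in place of the real training CSV:

```python
import pandas as pd

# The six toxicity tags (column names as used in the dataset)
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Tiny illustrative sample; in practice you would load the real training file,
# e.g. df = pd.read_csv("train.csv")
df = pd.DataFrame({
    "comment_text": [
        "You are a wonderful person.",
        "I will find you and hurt you.",
        "What an idiot.",
    ],
    "toxic":         [0, 1, 1],
    "severe_toxic":  [0, 0, 0],
    "obscene":       [0, 0, 0],
    "threat":        [0, 1, 0],
    "insult":        [0, 0, 1],
    "identity_hate": [0, 0, 0],
})

# How often each tag appears across the corpus
label_counts = df[LABELS].sum()

# How many tags each comment carries -- values above 1 confirm
# the labels overlap (multi-label)
tags_per_comment = df[LABELS].sum(axis=1)

print(label_counts)
print(tags_per_comment.tolist())
```

The same two aggregations on the full dataset give a quick picture of class imbalance (most comments carry no tag at all) and of label co-occurrence.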
The tagging was done via crowdsourcing, which means that each comment was rated by several different people, so the labels may not be 100% accurate either.