Any work that relies on supervised machine learning is only as good as the data powering it. Improvements in the models themselves are not enough if the underlying data is inadequate. However, building clean, high-quality datasets is costly and time-consuming. The typical process today is to specify a problem, collect the data, and process it for submission to a crowdsourcing platform, such as Amazon MTurk or CrowdFlower, to be labelled. On such a platform, annotators go through the posts or images one by one, spending time deciding how to label each item. You then retrieve the labelled data, filter out the low-quality labels, and use the result to train your classifiers. If the data is not as good as you need, you have to send a new batch to be labelled, hoping the classifier will improve. In many cases, however, the new data is not what the classifier performs poorly on, so a lot of care and expertise goes into selecting it.
Our goal is to automate this whole process through an AI-driven classifier-building pipeline. The platform produces high-quality data labels and better classifiers by relying on machine learning throughout the pipeline, and it works across data modalities, i.e., with text, images, etc. We proceed as follows:
- Huge, But Noisy, Data Collection: We create a large initial dataset (tweets, images, posts, etc.) by querying for the domain of interest. For that, we use the Semantic Pipeline built at the LSIR lab, which gives advanced access to multiple social media data sources. For a classifier on violence, for example, we query with a set of seed terms related to violence. The Semantic Pipeline generalizes these terms into a larger set of terms and collects a huge amount of data on them, a good portion of which should be violence-related imagery. This data is of course noisy, but it provides the raw material for the labeling step.
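The seed-term generalization step can be illustrated with a minimal sketch. This is not the Semantic Pipeline itself (which is internal to the LSIR lab); it only shows the general idea of expanding seed terms by embedding similarity, using a tiny hand-made vocabulary of toy word vectors in place of real pretrained embeddings such as word2vec or GloVe.

```python
import numpy as np

# Toy word vectors (hypothetical); a real system would use pretrained
# embeddings learned from a large corpus.
vectors = {
    "violence": np.array([0.9, 0.1, 0.0]),
    "assault":  np.array([0.8, 0.2, 0.1]),
    "riot":     np.array([0.7, 0.3, 0.0]),
    "weather":  np.array([0.0, 0.1, 0.9]),
    "recipe":   np.array([0.1, 0.0, 0.8]),
}

def expand_seeds(seeds, vocab, top_k=2):
    """Rank non-seed terms by cosine similarity to the seed centroid."""
    centroid = np.mean([vocab[s] for s in seeds], axis=0)

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = [(t, cos(v, centroid)) for t, v in vocab.items() if t not in seeds]
    candidates.sort(key=lambda x: -x[1])
    return [t for t, _ in candidates[:top_k]]

# Terms close to "violence" are kept for querying; unrelated terms are not.
expanded = expand_seeds(["violence"], vectors)
print(expanded)  # -> ['assault', 'riot']
```

The expanded term set is then used to query the data sources, which is what makes the collected pool large but noisy.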
- ML-Driven Data Selection: Our system takes the huge image collection as input. At the start, it draws from it at random and shows this initial set of images in the interface for humans to label. As labeling proceeds, the classifier is retrained on the fly with the newly labeled images, so it learns which images it is confused about and which are therefore worth labeling. In subsequent rounds it no longer draws images at random: it prioritizes the images it has the lowest confidence about. This loop is repeated as needed to increase the classifier's accuracy and reduce its confusion in the long run.
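This selection step is, in essence, uncertainty-based active learning. A minimal sketch of one round, assuming toy feature vectors in place of real image embeddings and a logistic-regression stand-in for the actual classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy feature vectors standing in for image embeddings (hypothetical data).
pool = rng.normal(size=(1000, 8))
labels = (pool[:, 0] > 0).astype(int)  # synthetic ground truth

# Seed round: label a small random sample, as the pipeline does initially.
seed_idx = rng.choice(len(pool), size=50, replace=False)
clf = LogisticRegression().fit(pool[seed_idx], labels[seed_idx])

def least_confident(clf, X, k):
    """Return indices of the k items whose predicted probability is
    closest to 0.5, i.e. where the classifier is most uncertain."""
    proba = clf.predict_proba(X)[:, 1]
    uncertainty = np.abs(proba - 0.5)
    return np.argsort(uncertainty)[:k]

# Next batch for human labeling: the most confusing items, not random ones.
batch = least_confident(clf, pool, k=30)
```

In the real pipeline this loop runs continuously: each newly labeled batch retrains the classifier, which then picks the next batch.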
- ML-Driven Labeling: Instead of showing images one by one for humans to label, we use an alternative approach. Our intuition is that similar-looking images will have similar labels; likewise, tweets with similar content are likely to be labeled identically. Hence, our approach is to show multiple images per page in a clustered interface, with similar images placed near each other, so that the user's task is simply to assign a label to a whole group of images. Another variant we plan to test is showing a predicted label for each group, so that the user only has to correct the label when it is wrong. Instead of labeling one data item at a time, we thereby scale to potentially 10-30 items at a time. At the core of this approach are the image-similarity and text-similarity techniques that power the labeling process. Overall, we aim to reduce the time per image dramatically and to increase the size of the datasets that can be built.
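The grouping idea can be sketched with a simple clustering step over item embeddings. The embeddings below are synthetic blobs standing in for real image or tweet feature vectors, and the cluster names are hypothetical; the point is only that one human decision covers a whole cluster instead of one item:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Toy embeddings standing in for image/tweet feature vectors (hypothetical):
# two well-separated blobs, so similar items land in the same cluster.
blob_a = rng.normal(loc=0.0, scale=0.3, size=(40, 16))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(40, 16))
embeddings = np.vstack([blob_a, blob_b])

# Group similar items so an annotator can label a whole cluster at once.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)

# One human decision per cluster instead of one per item
# (cluster names are placeholders for what the annotator would assign).
cluster_labels = {0: "label_A", 1: "label_B"}
item_labels = [cluster_labels[c] for c in kmeans.labels_]
```

With 80 items and 2 clusters, the annotator makes 2 decisions instead of 80, which is the source of the 10-30x speedup per page we are targeting.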
This system gives us a way to maintain any classifier throughout its lifetime and to iterate on its accuracy in a smarter way. It can be applied to various types of data, ranging from tweets to images to text+image combinations.