KDD 2022 Tutorial

Advances in Exploratory Data Analysis, Visualisation and Quality for Data Centric AI Systems

Hima Patel, IBM Research India
Shanmukha Guttula, IBM Research India
Ruhi Sharma Mittal, IBM Research India
Naresh Manwani, IIIT Hyderabad
Laure Berti-Equille, Institut de Recherche pour le Développement, France
Abhijit Manatkar, IIIT Hyderabad

Time: Aug 14, 2022
Location: Washington DC, USA

Abstract:

It is widely accepted that data preparation is one of the most time-consuming steps of the machine learning (ML) lifecycle. It is also one of the most important steps, as the quality of data directly influences the quality of a model. In this tutorial, we will discuss the importance and role of exploratory data analysis (EDA) and data visualisation techniques in finding data quality issues and in preparing data for ML pipelines. We will also discuss the latest advances in these fields and highlight areas that need innovation. To make the tutorial actionable for practitioners, we will also discuss the most popular open-source packages that one can get started with, along with their strengths and weaknesses. Finally, we will discuss the challenges posed by industry workloads and the gaps that must be addressed to make data-centric AI real in industry settings.

Tutorial Recording:

A recording of our tutorial will be available after the conference.

Slides:

Link to Slides [Slides]

Presenters:

	Hima Patel is a research manager and global technical leader for data centric AI at IBM Research Lab, and leads a team of researchers who work on making data ready for the AI lifecycle. She has been actively engaged in this area by teaching courses, organising several tutorials and workshops, and taking the technology to business impact. Prior to IBM, she spent time in research groups at Shell and GE, where she worked on varied research problems ranging from object detection in medical images to anomaly detection in multivariate sensor data collected from machines, leading to product impact, papers and patents. Her research interests lie in the fields of data quality, cleaning and preparation for large-scale datasets, and she has several publications and patents to her credit.
	Shanmukha Guttula is an Advisory Research Engineer at IBM Research, India. He graduated from IIT Madras in 2017. He works on structured data transformations and data quality problems in structured data. He has previously worked on document contrast and analysis, on sensitive data anonymization and on information extraction from documents. His research interests include information extraction and analysis from documents, label noise in structured data, and data desensitization.
	Ruhi Sharma Mittal is an Advisory Research Engineer at IBM Research Labs, India. In the Data and AI team at IBM, she is currently working on building solutions that improve data quality to make data ready for machine learning pipelines. She received her Master's degree from the Indian Institute of Technology, Bombay. She has authored and co-authored many research papers at top conferences. She has also filed several patents related to machine learning and applied AI. Her research interests include machine learning, natural language processing, and applied AI.
	Naresh Manwani is currently an Assistant Professor at IIIT Hyderabad. He is associated with the Machine Learning Lab at the Kohli Center on Intelligent Systems (KCIS). Before joining IIIT-H, he worked at Microsoft India (R & D) Pvt. Ltd, Bangalore and GE Global Research Centre, Bangalore. He completed his PhD at IISc Bangalore in 2012. His research interests include learning theory, deep learning, online learning, and reinforcement learning. More specifically, he works on problems in the areas of learning with label noise, learning in high-risk situations, learning with incomplete supervision, adversarial machine learning, and fairness and privacy issues. He is also interested in applications of machine learning to practical problems (e.g. natural language processing, computer vision, and computational neuroscience).
	Laure Berti-Equille is a Research Director in Computer Science at IRD, the French Institute of interdisciplinary Research on Sustainability Science (since 2011), where she leads a research group. She is currently a visiting scientist at MIT LIDS. Previously, Laure was a full Professor in CS at Aix-Marseille University (AMU) in France (2017-2018). From 2014 to 2017, she was a Senior Scientist at Qatar Computing Research Institute. From 2000 to 2010, she was a tenured Associate Professor in CS at University of Rennes 1 in France, and she spent two years as a visiting researcher at AT&T Labs Research in New Jersey, USA, as a recipient of the prestigious European Marie Curie Outgoing Fellowship (2007-2009). Her research work is at the intersection of large-scale data analytics and machine learning, with a focus on data quality, data cleaning and preparation, and includes many collaborations with industry, more than 80 publications and three monographs.
	Abhijit Manatkar is a student at IIIT Hyderabad, pursuing honours research at the Machine Learning Lab at the Kohli Center on Intelligent Systems (KCIS) under Dr. Naresh Manwani. His research interests include reinforcement learning and exploratory data analysis.