Carlos M. Meléndez is the COO and Co-Founder of Wovenware, an artificial intelligence and software development company.
While artificial intelligence (AI) gets the glory for its role in helping to solve business problems — such as improving the environment, predicting customer churn and identifying promising medical treatments — not enough attention is paid to the data that runs the algorithms. And that’s a big problem.
AI algorithms are useless without good data, in much the same way that even the most expensive Porsche isn’t going to take you far without fuel. Unfortunately, the tendency to shortchange the data is a fairly universal problem, and it often leads to misleading or incorrect results and poor outcomes.
For all those reasons, focusing on the data — even before designing the AI solution — is key to increasing the performance of your AI program and helping to ensure better results.
Below are five practices you can undertake to get your data house in order:
- Gather as much data as you can.
When you are trying to solve a problem, you may not always know where you will find the answer, so try to gather as much data as possible.
For example, when Vienna Beef moved to a new manufacturing facility, the hot dogs it produced didn’t have the same quality as those produced in the old facility. For more than a year it hypothesized and tested what was missing in the new facility but couldn’t find the culprit.
Finally, in a serendipitous moment, it learned the manufacturing process had an unwritten step: taking the uncooked sausages to the smokehouse took 30 minutes through a heat-intensive route that slowly increased their temperature. It was those 30 minutes that added the missing qualities to the hot dogs. If Vienna Beef had done a better job gathering data on its manufacturing process, it could have resolved the problem earlier.
- Clean the data.
After the right data is collected, it needs to be cleansed, validated and prepared to ensure that it is in good shape and ready for analysis. That means you need to make sure there is no duplicate data, it doesn’t contain errors and it is formatted correctly. While this will take time, it’s a critical step that can mean the difference between good and bad results.
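The three checks above (no duplicates, no errors, consistent formatting) can be sketched in a few lines of Python. This is a minimal illustration, not a complete cleansing pipeline: the field names `member_id`, `email` and `joined`, the YYYY-MM-DD date format and the function name `clean_records` are all assumptions made for the example.

```python
from datetime import datetime


def clean_records(records):
    """Return (clean, rejected) after normalizing, validating and de-duplicating.

    Field names and validation rules are hypothetical, chosen for illustration.
    """
    clean, rejected, seen = [], [], set()
    for rec in records:
        # Format: trim stray whitespace and lowercase the email address.
        member_id = str(rec.get("member_id", "")).strip()
        email = str(rec.get("email", "")).strip().lower()
        joined_raw = str(rec.get("joined", "")).strip()

        # Errors: the date must parse as YYYY-MM-DD (assumed convention).
        try:
            joined = datetime.strptime(joined_raw, "%Y-%m-%d").date()
        except ValueError:
            rejected.append(rec)
            continue

        # Errors: required fields must be present and plausible.
        if not member_id or "@" not in email:
            rejected.append(rec)
            continue

        # Duplicates: keep only the first record for each normalized ID.
        if member_id in seen:
            continue
        seen.add(member_id)
        clean.append({"member_id": member_id, "email": email, "joined": joined})
    return clean, rejected
```

Keeping the rejected records, rather than silently dropping them, makes it possible to review why they failed and to fix the problem at its source.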
- Test the data in the real world.
A good way to know if data is accurate is to test it and find out if there is a problem before you are too far along in the process. Divide your data into two parts, and set one aside for testing and the other for feeding the algorithm.
By testing the data upfront, a customer of ours (a health insurer) was able to uncover a data problem that would have skewed its results. While the company wanted to predict customer attrition based on the prior year’s members, the dataset also included participants from earlier years because the termination date was incorrectly captured in the data.
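As a rough sketch of the two-part split described above, the Python below shuffles the rows and holds a fraction back for testing. It also includes a simple sanity check in the spirit of the insurer example. The article does not say how that problem was actually found, so the `coverage_year` field and both function names are hypothetical.

```python
import random


def split_holdout(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and return (train, test)."""
    rows = list(rows)  # copy so the caller's list is not reordered
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def outside_cohort(rows, expected_year):
    """Return rows whose (hypothetical) coverage_year is not the expected one."""
    return [r for r in rows if r["coverage_year"] != expected_year]
```

Running a check like `outside_cohort` before the split means a cohort problem surfaces as a short list of suspect rows rather than as a skewed prediction later on.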
- Set up internal auditors.
Just as you would use a financial auditor, you should hire data auditors who are responsible for ensuring sound data management on an ongoing basis, including data testing and storage.
- Ensure diversity.
Make sure your data doesn’t focus on one type of information or one data source. For example, in the past, drug testing was notoriously conducted solely on men, without considering the different ways drugs could be metabolized and the impact they could have on women and children.
Similarly, many imaging algorithms have relied heavily on images of white people. That has not only caused those algorithms to have difficulty identifying people of color, but it also builds bias into their results. When you include greater diversity, you get a more representative and accurate outcome.
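One simple way to spot a lopsided dataset is to measure how much of it each group or source accounts for. The Python below is a minimal sketch of that idea. The 10% threshold, the function names and the grouping key are assumptions for illustration, and a real review would also need to consider which groups are missing entirely.

```python
from collections import Counter


def group_shares(records, key):
    """Return each group's share of the dataset as a fraction of all records."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def underrepresented(records, key, min_share=0.10):
    """List groups whose share falls below min_share (an assumed threshold)."""
    shares = group_shares(records, key)
    return sorted(g for g, share in shares.items() if share < min_share)
```

The same check works for data sources as well as demographic groups: pass the name of whichever field records where each row came from.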
Buying Data Versus Public And Private Crowds
When you don’t have enough high-quality data, you need to look at other sources to enrich it. One approach is to buy data; however, be aware of the source you are using. Data aggregators may not always provide the type and quality of data that you need. A better option might be to buy it from an organization that is not in the business of selling data. For example, health insurers might consider buying data from grocery stores to gain insight into nutritional habits.
Another approach for enriching your in-house data is to build your own dataset using a public or private crowd. These services distribute the work involved in collecting and preparing data. This can include everything from image recognition to algorithm training for machine learning.
While a public crowd is composed of individuals who can help collect, identify and label large datasets for AI training, these individuals are not professionals and don’t have knowledge of your specific business.
On the other hand, a private crowd is composed of data specialists who help with the collection, identification, labeling and preparation of training data or images. They are knowledgeable professionals who usually work under NDAs and either work for the employer or are personally known to it. They can often provide greater accuracy and business-specific knowledge than more generic public crowds.
All too often, organizations focus on ramping up their AI development as a checkbox in their tech tool kit without considering the significance of the data. Yet data requires care, feeding and maintenance in order to enable AI projects to perform at an optimal level. By establishing best practices for data collection, cleansing and management, organizations can maximize their results and better fuel their AI solutions to make smarter decisions.