Data-Centric AI is Making Waves

By DrugBank Mar 20, 2023 8:00am

The world of artificial intelligence (AI) is rapidly changing, and one of the most notable shifts is a refocusing on the importance of data.

AI Systems

AI systems are made up of two major components: the model and the data. The models are essentially the tools or algorithms that enable computers to analyze data. The data is what the models learn from in order to make decisions.

Historically, the vast majority of artificial intelligence developers have focused their efforts on continuously iterating on and improving the code or algorithms of their models. This approach is known as model-centric AI, and it has yielded many advances in machine learning (ML) systems.

Data-Centric AI is Picking Up Steam

Recently, there has been a shift in perspective among AI experts toward data-centric AI. This shift began as a reaction to a handful of ongoing issues in AI research, as well as a recognition that many existing AI systems are maturing to the point that further investment in the model is unlikely to yield meaningful improvements. As such, many researchers are now turning to data-centric AI, which AI trailblazer Andrew Ng defines as “the discipline of systematically engineering the data needed to build a successful AI system.”

A greater focus on the quality of the data used throughout the phases of a model's creation can yield safer and higher-performing AI systems. When data quality isn’t prioritized, AI systems can accumulate a kind of technical debt that compromises their output, leading to failures and an overall loss of trust in the system.

The Risks of Deprioritizing Data

One pressing example of this technical debt is the risk of data cascades. Data cascades are compounding events, triggered by low-quality data, that cause negative downstream effects. They are not a new problem, but they are often hidden, misunderstood, or ignored.

As Google researchers found in a survey of AI practitioners, data cascades in high-stakes AI are pervasive: 92% of the practitioners surveyed reported experiencing at least one. In medical care and treatment, any level of upstream error can have dangerous real-world consequences for patients. Thankfully, the same Google researchers determined that these cascades, and the unreliable and inaccurate results they threaten, are largely avoidable with an increased focus on data quality.

The Promise of Better Health Outcomes

With the shift to data-centric AI, the industry is placing a higher priority on the quality of the data used in AI systems. As a result, there is potential for more accurate and reliable outputs, especially in healthcare.

Healthcare is uniquely suited to a data-centric AI approach. Healthcare currently generates the world’s largest volume of data, and it isn’t going to slow down anytime soon. It is estimated that by 2025, 36% of the world’s generated data will be healthcare data, and more than two million scientific articles are published every year. Unfortunately, much of the world’s data remains disconnected, disorganized, conflicting, and unstructured.

For a data-centric approach to work, it will be important to map and normalize data from experiments and real-world evidence into deeply structured datasets. It will also require all parties to seriously consider the quality of the data being generated.
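
As a rough illustration of what mapping and normalizing can look like in practice, the Python sketch below converts two loosely structured records into one shared schema. The `DoseRecord` schema, the synonym table, the unit table, and the input field names are all hypothetical and chosen only for this example; they do not describe DrugBank's data model or any particular dataset.

```python
from dataclasses import dataclass

# Hypothetical synonym table: maps free-text names to one canonical name.
CANONICAL_NAMES = {
    "paracetamol": "acetaminophen",
    "acetaminophen": "acetaminophen",
}

# Conversion factors from each supported unit to milligrams.
TO_MG = {"mg": 1.0, "g": 1000.0, "mcg": 0.001}


@dataclass(frozen=True)
class DoseRecord:
    """Hypothetical target schema: canonical drug name and dose in mg."""
    drug: str
    dose_mg: float


def normalize(raw: dict) -> DoseRecord:
    """Map one loosely structured record onto the DoseRecord schema.

    Raises ValueError instead of guessing when a name or unit is unknown,
    so bad inputs surface early rather than flowing downstream.
    """
    name = raw["drug"].strip().lower()
    unit = raw["unit"].strip().lower()
    if name not in CANONICAL_NAMES:
        raise ValueError(f"unknown drug name: {raw['drug']!r}")
    if unit not in TO_MG:
        raise ValueError(f"unknown unit: {raw['unit']!r}")
    return DoseRecord(CANONICAL_NAMES[name], float(raw["dose"]) * TO_MG[unit])


# Two sources describing the same dose with different names and units.
records = [
    {"drug": "Paracetamol ", "dose": "0.5", "unit": "g"},
    {"drug": "acetaminophen", "dose": 500, "unit": "MG"},
]
normalized = [normalize(r) for r in records]
```

After normalization, both inputs become the same structured record, which is what lets downstream systems treat them as one fact. Rejecting unrecognized names and units, rather than silently passing them through, is one simple way to keep a data quality problem from compounding later.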

To truly advance the healthcare industry, it will be necessary to break down existing silos in pharma teams and embrace data excellence. Only a cross-disciplinary approach that integrates subject matter expertise with engineering and operations can ensure that the data produced in the drug discovery space and the scientific community supports better AI. By balancing these competing priorities and centering high-quality data in the process of developing AI systems, there is the potential to substantially speed up the development of safer and more effective medical treatments.

DrugBank is Obsessed with Data Excellence

DrugBank is the world’s first intelligent and comprehensive drug knowledgebase. With the help of artificial intelligence, their expert team authors, verifies, and structures all of the latest biomedical information so that it can be used to its fullest potential.

DrugBank obsesses over data quality and over building well-structured data that reflects, or can be mapped to, the real world. They’re working diligently to be the place AI researchers turn to for the highest-quality data, and they are very excited about a future where data quality is a first-class citizen in the minds of AI researchers, developers, and the institutions powering the next wave of healthcare.

The editorial staff had no role in this post's creation.