Planning A Data Strategy For Training Healthcare AI Applications

Forbes Technology Council

Cofounder at Simbo Inc and serial entrepreneur. AI/ML Researcher & Implementer in healthcare. Technology leader with multiple AI patents.

Digitizing most of the work we once did by hand has created a vast amount of data, though much of it sits in silos. The combination of the internet and digitization has fueled a great deal of AI research, and it sets up a regenerative loop in which AI products generate even more data.

Large IT and internet-based digital goods companies are a prime example of data harvesting. These companies have become adept at capturing data so that it can be used to train AI models, create new applications and improve user experiences.

Today’s AI algorithms are hungry for structured, usable data, and businesses struggle to maintain that data and get the most out of it. One example of this appetite is AlphaGo, whose initial training drew on roughly 30 million board positions from human games. Others are large language models like GPT and BERT.

Companies rely on data to make decisions, but it can be challenging to obtain enough high-quality structured data. The lack of the right data is one of the critical barriers preventing artificial intelligence from solving real-world problems, and the AI community is making great efforts to address it.

AI engineers and IT executives in the healthtech space face big challenges in figuring out the right data strategy for training their models. Some of the key challenges are highlighted below.

Massive Data Needs

An AI system's performance depends on the quality of the data provided to it, and getting hold of medical data can be especially arduous. One option is to collect data at the source, but that would burden already stressed-out healthcare professionals. EHRs have also created silos of data without seamless integration, making it tough to connect the dots.

In addition, a large number of healthcare processes are still not digitized. As a result, many tasks performed by healthcare professionals remain very time-consuming and need a big transformation. These factors make it challenging to create successful AI systems, which depend on a considerable amount of high-quality data.

Biases In Data And AI Models

The results of AI-powered systems that use ML models may be incorrect if the data used to train them is biased.

Bias often tracks patients' ethnicity and socioeconomic status, and the issue has been highlighted in numerous discussions across various forums. In one example, an AI system for identifying melanoma gave more false-positive predictions for skin lesions marked with surgical ink. Similarly, AI trained on data from academic centers in metropolitan cities, or on specific ethnicities, will generate less accurate estimates for rural patients or other ethnicities. Worse, rather than representing objective reality, such a system may amplify existing inequities in the healthcare system.
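One simple way to surface this kind of bias is to compare error rates across patient subgroups instead of looking only at overall accuracy. The sketch below is a minimal illustration, not a full fairness audit: the function name `false_positive_rate_by_group`, the group labels and the toy records are all hypothetical, and a real evaluation would need far more data per group.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """Return the false-positive rate for each subgroup.

    `records` is an iterable of (group, true_label, predicted_label)
    tuples, where labels are 0 (negative) or 1 (positive).
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, truth, predicted in records:
        if truth == 0:  # only true negatives can become false positives
            negatives[group] += 1
            if predicted == 1:
                false_positives[group] += 1
    return {g: false_positives[g] / n for g, n in negatives.items() if n > 0}


# Hypothetical toy predictions, made up purely for illustration.
records = [
    ("urban", 0, 0), ("urban", 0, 0), ("urban", 0, 1), ("urban", 0, 0), ("urban", 1, 1),
    ("rural", 0, 1), ("rural", 0, 1), ("rural", 0, 0), ("rural", 0, 0), ("rural", 1, 0),
]
rates = false_positive_rate_by_group(records)
print(rates)
```

A large gap between groups in a check like this is a signal to look at how each group is represented in the training data before the model reaches patients.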

Artificially Generated Data

Due to a lack of data, it is not uncommon to see AI engineers use artificially generated data to train AI models. Data augmentation is the typical way to stretch a small dataset, but augmented data is still artificial, so it can introduce bias and miss real-world usage scenarios.

One example we have seen involves audio recordings of clinical conversations used to train automatic speech recognition (ASR). In a real conversation, the speech rate, filler words, background noise, speech clarity, tone and more are very different from what you get when you ask someone to record a scripted, formal conversation. AI models trained on such staged recordings perform well in lab conditions but are not so useful in real life.
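For the speech case, one common augmentation is to mix noise into clean recordings at a chosen signal-to-noise ratio (SNR). The sketch below is a minimal illustration using only the Python standard library. The function name `add_noise`, the 16 kHz sample rate and the synthetic sine tone standing in for speech are all assumptions made for the example, and Gaussian noise is only a crude stand-in for real clinic background sound.

```python
import math
import random


def add_noise(samples, snr_db, rng):
    """Return a copy of `samples` with Gaussian noise at roughly `snr_db` dB SNR."""
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR in dB is 10 * log10(signal_power / noise_power); solve for noise power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


# Hypothetical stand-in for one second of speech: a 440 Hz tone at 16 kHz.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_noise(clean, snr_db=10, rng=rng)
```

Augmentation like this can make a model more tolerant of noise, but it does not reproduce the filler words, pacing or tone of a real clinical conversation, which is exactly the gap described above.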

There has been good progress with text-based unstructured data, thanks to large language models like GPT and BERT. These models already generalize well and can be fine-tuned quickly with smaller datasets that are easier to acquire. However, for other modalities like voice and image, we still lack such capabilities, making it hard to create good algorithms upfront. The alternative is to invest heavily first in acquiring large datasets and then in structuring them. Many healthtech startups have been following this strategy.

Ethics And Regulation Around Data In Healthcare

Healthcare has always been a private and sensitive matter. In that spirit, various regulations govern how data may be used for secondary purposes. Privacy and security laws like HIPAA give companies a useful framework to work within. However, collecting raw data remains a great barrier, as it needs approval and consent from all parties, including patients. Unlike other digital goods, real healthcare happens on the ground, and ensuring that appropriate consent has been obtained from every patient is a burden for everyone involved.

Although the use of AI in clinical settings has the potential to advance healthcare significantly, it also raises ethical concerns that need to be addressed. There is a debate about whether AI should be classified under existing legal categories or require a new set of rules due to its unique characteristics and implications.

Conclusion

The above points will hopefully be helpful for AI engineers. Healthcare stakeholders can also benefit from knowing how AI companies harness data and what challenges they face. This would help them ask the right questions and choose the right AI partner for their solution.

AI solution providers will need to overcome each obstacle on this list in order to create a big data exchange ecosystem that links all players in the care continuum with reliable, timely and useful information. Success will lessen the weight of all those worries, but getting there will need effort, time, money and communication.

Technology companies need to come up with innovative algorithms that are less data-hungry and closer to human-like artificial general intelligence. Another intelligent approach is to start with a narrower application built on smaller data, let it collect more data, and then retrain the models to create better and newer use cases. This is also a good strategy for breaking the vicious circle.

