Kunio Kashino

Senior Distinguished Researcher | NTT Basic Research Labs

Neural Audio Captioning System Listens to and Diagnoses Stethoscope Sounds

The stethoscope is one of the most familiar medical tools we have, always hanging around the neck of a doctor or nurse who uses it to check heart sounds, breathing and more. Work underway at NTT Research aims to alter that reality dramatically – by taking the doctor or nurse out of the picture.


At Upgrade 2020, the NTT Research Summit, Dr. Kunio Kashino from the Biomedical Informatics Research Center of NTT Basic Research Laboratories presented groundbreaking work on technology that enables an automated system to listen to heart sounds and output a “caption” that describes the sound and states whether it is normal. If it is not, the system can even determine what kind of defect may be in play.


The neural audio model is trained to listen to sounds and deliver two outputs: a classification of what family of disease the sounds represent, including the presence or absence of 12 difficult heart diseases, and a description or caption that is essentially a diagnosis of what the heart sound represents. Sample captions Dr. Kashino discussed include: “Your heart sounds are normal” and “Your heart sounds are abnormal. There may be a problem with your heart valve. The 1st sound is normal, and the 2nd sound is split. There are systolic murmurs.”


The model was trained using a set of heart sounds representing 55 difficult heart diagnoses, with each sample about a minute in length. Dr. Kashino’s team extracted training samples by windowing the signal in two ways. In the first, four cycles’ worth of signal was cut out, with the timing synchronized to the heartbeats. In the second, six-second segments were cut out at regular intervals of three seconds.
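The two windowing schemes can be illustrated with a short sketch. This is not the team's code, only a minimal illustration under stated assumptions: the sample rate, the heartbeat onset positions, and the function names are all hypothetical, and the talk does not say whether the beat-synchronous windows overlap, so the sketch takes consecutive, non-overlapping groups of four cycles. With six-second windows cut every three seconds, neighboring windows overlap by half.

```python
def regular_windows(signal, sample_rate, window_s=6.0, hop_s=3.0):
    """Cut fixed-length windows at a fixed interval.

    A trailing partial window is dropped (an assumption of this sketch).
    """
    win = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    return [signal[start:start + win]
            for start in range(0, len(signal) - win + 1, hop)]


def beat_synchronous_windows(signal, beat_onsets, cycles=4):
    """Cut windows that each span `cycles` heartbeats.

    beat_onsets holds the sample index where each cardiac cycle starts.
    It is assumed to come from an upstream beat detector, not shown here.
    """
    return [signal[beat_onsets[i]:beat_onsets[i + cycles]]
            for i in range(0, len(beat_onsets) - cycles, cycles)]


# Hypothetical one-minute recording at 1,000 samples per second.
sample_rate = 1000
signal = [0.0] * (60 * sample_rate)

# Hypothetical beat onsets: one beat every 0.8 s (75 beats per minute).
beat_onsets = list(range(0, len(signal), 800))

regular = regular_windows(signal, sample_rate)
beat_sync = beat_synchronous_windows(signal, beat_onsets)

print(len(regular), len(regular[0]))      # 19 windows of 6000 samples
print(len(beat_sync), len(beat_sync[0]))  # 18 windows of 3200 samples
```

Regular windows all have the same length, which is convenient for a fixed-size model input. Beat-synchronous windows vary in length with heart rate but always begin at the same point in the cardiac cycle.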


Class labels and seven kinds of explanation sentences were given manually for each case. The system was then charged with listening to the heart sounds and applying the correct “diagnosis.”


“We found it is possible to classify with a very high accuracy of 94% or more in the case of beats synchronous windowing, and 88% or more in the case of regular windowing,” Dr. Kashino said.

He called the results promising and noted his work is not done. “We intend to enrich the learning data and improve the system configuration to make it a practical system in the near future,” Dr. Kashino said.


Such a neural audio captioning system could be used for applications such as telemedicine or simply to relieve health care professionals of having to spend time listening to and evaluating various sounds. Instead, the captioning system could do it for them, immediately pointing medical professionals to those patients who need attention.


Dr. Kashino’s presentation was titled “Neural Audio Captioning and Its Application to Stethoscopic Sounds.”
