Kunio Kashino

Senior Distinguished Researcher | NTT Basic Research Labs

Transcript of the presentation Neural Audio Captioning and Its Application to Stethoscopic Sounds, given at the NTT Upgrade 2020 Research Summit, October 1, 2020

Hello, I’m Kunio Kashino from the Biomedical Informatics Research Center of NTT Basic Research Laboratories. I’d like to talk about neural audio captioning and its application to stethoscopic sounds. First, I’d like to think about what captioning is in comparison with classification. When you see a picture of a cat, you recognize it as a cat. This is classification, or object recognition. Captioning, on the other hand, is describing what’s going on in a more complex scene. This is a visual example, but the same can be considered for sound.

 

When you hear a car on the street, you can recognize it as a car. You can also explain the sound when someone is hitting a toy tambourine. Of these, the generation of explanatory descriptions of sounds, or audio captioning, is a new field of research that has only just emerged. (soft music) This is an experimental system that we proposed last year. It listens to two seconds of sound and provides an explanation of that segment. (soft music) Moving the slider to the left produces a short, concise description.

 

Moving it to the right produces a longer, more detailed description. (soft music) The descriptions are not always perfect, but you can see how it works. Here are some early works in this field of study. In 2017, Drossos conducted a study that assigned a string of words to a sound, but there was still a lot of overlap with the classification task at that time. At around the same time, Ikawa, who was my student at the University of Tokyo, proposed a system that could express sounds in onomatopoeic terms, as a sequence of phonemes.

 

Recently, more works have been reported, including some that describe more complex scenes in normal sentences and some that use sentences for sound retrieval. Let’s go over the differences between classification and captioning once again. Classification is the process of classifying, or quantizing, features into a fixed number of classes. Captioning, on the other hand, means converting the features: for example, a time series of sound features is translated into a time series of words.
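As a rough illustration of this difference (a sketch added for this transcript, not code from the talk), the two tasks can be contrasted as function signatures in Python: classification collapses a feature sequence into one index from a fixed set of classes, while captioning maps it to a variable-length word sequence. The feature dimensions, class count, and vocabulary below are made up, and the caption function is only a stub standing in for a trained decoder.

    import numpy as np

    # Classification: a time series of acoustic features -> one of a fixed set of classes.
    # Captioning: the same time series -> a variable-length sequence of words.

    rng = np.random.default_rng(0)
    NUM_CLASSES = 5                                   # hypothetical fixed class set
    VOCAB = ["a", "car", "is", "passing", "by"]       # hypothetical vocabulary

    def classify(features: np.ndarray, weights: np.ndarray) -> int:
        """Pool over time and pick a single class index."""
        pooled = features.mean(axis=0)                # collapse the time axis
        return int((pooled @ weights).argmax())

    def caption(features: np.ndarray) -> list[str]:
        """Return a word sequence; a real system would run a trained decoder here."""
        return ["a", "car", "is", "passing", "by"]    # stub output, shows the output type only

    features = rng.normal(size=(200, 64))             # 200 frames x 64-dim features
    weights = rng.normal(size=(64, NUM_CLASSES))
    print("class index:", classify(features, weights))
    print("caption:", " ".join(caption(features)))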

 

Classification requires that classes be determined in advance, but captioning does not. In classification, relationships between classes are not usually considered, but in captioning, relationships between elements are important, not just what is there. In the medical context, classification corresponds to diagnosis, while captioning addresses explanation and inference rather than diagnosis. Of course, diagnosis is an important act in medical care, and neither classification nor captioning is necessarily better than the other.

 

Captioning is useful for expressing comparisons, degree, time course and changes, and relationships between cause and effect. For example, it would be difficult to prepare a class for the situation represented by the sentence, “Over the past few days, pneumonia has gradually spread and worsened.” Therefore, both should be used according to the purpose.

 

Now let’s consider the challenges of captioning. If you look at this picture, everyone will say it’s a picture of a cat. Yes, it is. No one calls it a grey and white animal with two round eyes and triangular ears. Similarly, when a characteristic noise is heard from the lungs as a person breathes, you may just say “rhonchi are present,” and there is no need to describe the noise in detail. That is, it’s a good idea to use a label, if one is appropriate, as long as the person you are talking to can understand it. Another challenge with captioning is that exactly the same description may or may not be appropriate depending on the situation.

 

When you are walking down the street and a car pops up, it’s important to say that it’s a car; it would be inappropriate to discuss the quality of the engine sound. But when you bring a car to a repair shop and have it checked, you have to describe the engine sound in detail. Just saying that the engine is running is obviously not enough. It is important to note that appropriate expressions vary, and a single best answer cannot be determined. With these issues in mind, we configured a neural audio captioning model.

 

We call this system CSCG, or Conditional Sequence-to-sequence Caption Generator. The system extracts a time series of acoustic features from biological sounds, such as heart sounds, converts it into a sequence of words, and outputs the words together with class labels. The green parts are neural networks. They are trained so that the system outputs both captions and labels simultaneously. The behavior of the sentence decoder is controlled by conditioning it with an auxiliary input, in order to cope with the fact that the appropriate caption can vary.
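To make the structure concrete, here is a minimal PyTorch sketch of a conditional sequence-to-sequence model with this kind of multitask output. It is not the actual CSCG implementation; the GRU layers, their sizes, the vocabulary, and the way the specificity condition is appended to the decoder input are all assumptions made for illustration.

    import torch
    import torch.nn as nn

    class ConditionalCaptioner(nn.Module):
        """Sketch: acoustic frames in; class logits and caption word logits out.
        A scalar 'specificity' condition is fed to the sentence decoder."""

        def __init__(self, feat_dim=64, hidden=256, num_classes=6, vocab_size=1000):
            super().__init__()
            self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
            self.classifier = nn.Linear(hidden, num_classes)              # label head
            self.embed = nn.Embedding(vocab_size, hidden)
            self.decoder = nn.GRU(hidden + 1, hidden, batch_first=True)   # +1 for the condition
            self.word_out = nn.Linear(hidden, vocab_size)

        def forward(self, feats, prev_words, specificity):
            # feats: (B, T, feat_dim); prev_words: (B, L); specificity: (B,)
            _, h = self.encoder(feats)                           # h: (1, B, hidden)
            class_logits = self.classifier(h[-1])                # (B, num_classes)
            emb = self.embed(prev_words)                         # (B, L, hidden)
            cond = specificity.view(-1, 1, 1).expand(-1, emb.size(1), 1)
            out, _ = self.decoder(torch.cat([emb, cond], dim=-1), h)
            return class_logits, self.word_out(out)              # (B, L, vocab_size)

    # Multitask training step: one loss for the class label, one for the caption words.
    model = ConditionalCaptioner()
    feats = torch.randn(2, 200, 64)                  # 2 clips, 200 frames, 64-dim features
    prev_words = torch.randint(0, 1000, (2, 12))     # decoder input tokens
    targets = torch.randint(0, 1000, (2, 12))        # target caption tokens
    labels = torch.randint(0, 6, (2,))               # target class labels
    spec = torch.tensor([2.0, 6.0])                  # requested specificity per clip
    class_logits, word_logits = model(feats, prev_words, spec)
    loss = nn.functional.cross_entropy(class_logits, labels) \
         + nn.functional.cross_entropy(word_logits.reshape(-1, 1000), targets.reshape(-1))
    loss.backward()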

 

In the current experiments, we employ a parameter called specificity. It is the amount of information contained in the words of the entire caption. In other words, the more words there are, and the more infrequent or more specific those words are, the higher the specificity value.
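A small sketch of how such a value could be computed, assuming specificity is the total self-information of the caption’s words under corpus word frequencies (the exact definition used in the system may differ); the word counts below are invented for illustration.

    import math
    from collections import Counter

    # Hypothetical corpus word counts; in practice these would come from the training captions.
    corpus_counts = Counter({
        "the": 500, "sound": 300, "is": 400, "normal": 120,
        "second": 40, "heart": 200, "split": 15, "widely": 8,
    })
    total = sum(corpus_counts.values())

    def specificity(caption_words):
        """Sum of per-word self-information in bits: more words and rarer
        (more specific) words both raise the value."""
        return sum(-math.log2(corpus_counts.get(w, 1) / total) for w in caption_words)

    print(specificity(["the", "sound", "is", "normal"]))                               # short, common words
    print(specificity(["the", "second", "heart", "sound", "is", "widely", "split"]))   # longer, rarer words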

 

And now, our experiments. The entire network was trained using a set of heart sounds. The sound samples were extracted from sound sources covering 55 difficult cases. For each case, the signal was about one minute in length, so we extracted sound samples by windowing the signal. In one scheme, four cycles’ worth of signal were cut out at timings synchronized with the heartbeats. In the other, six-second segments were cut out at regular intervals of three seconds. Class labels and seven kinds of explanatory sentences were given manually for each case. This table shows the classification accuracy. We organized the categories as a general overview, a description of the sound, and the presence or absence of 12 different heart diseases, and prepared two to six classes for each category.
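A minimal sketch of the two windowing schemes mentioned above, assuming a hypothetical sampling rate and, for the beat-synchronous case, that beat times are already available from some detector (beat detection itself is not shown).

    import numpy as np

    def regular_windows(signal, sr, win_sec=6.0, hop_sec=3.0):
        """Cut fixed-length windows at regular intervals (here 6 s every 3 s)."""
        win, hop = int(win_sec * sr), int(hop_sec * sr)
        return [signal[s:s + win] for s in range(0, len(signal) - win + 1, hop)]

    def beat_synchronous_windows(signal, sr, beat_times, cycles=4):
        """Cut windows spanning a fixed number of cardiac cycles, given beat times in seconds."""
        beats = (np.asarray(beat_times) * sr).astype(int)
        return [signal[beats[i]:beats[i + cycles]] for i in range(len(beats) - cycles)]

    sr = 4000                                        # assumed sampling rate
    signal = np.random.randn(60 * sr)                # about one minute of (synthetic) signal
    print(len(regular_windows(signal, sr)))          # number of 6-second windows
    print(len(beat_synchronous_windows(signal, sr, beat_times=np.arange(0.0, 60.0, 0.8))))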

 

As a result, we found that it is possible to classify with fairly high accuracy: 94% or more in the case of beat-synchronous windowing, and 88% or more in the case of regular windowing. This graph shows the effect of the specificity control. The horizontal axis represents the specified specificity, or level of detail. The vertical axis represents the amount of information contained in the actual output captions. As you can see, the data is distributed along a straight line with a slope of one, indicating that the specificity control is working correctly.

 

Let’s take a look at some generated captions. This table shows examples with varying specificity input for three types of sound sources: normal, a large split of the second heart sound, and coronary artery disease. If the specified specificity is small, the generated sentence is short. If the specificity value is larger, you can see that longer, more detailed sentences are generated. All the captions in this table were confirmed by human observation to be appropriate for the sound.

 

However, the system does not always produce correct output at this point. Sometimes it produces a wrong caption or a statement containing a linguistic error, but generally speaking, we consider the results promising.

 

In this talk, I first discussed the problem of audio captioning in comparison with classification. It is not just sound recognition, and is therefore a new topic in the research field. I then proposed an automatic audio captioning system based on a conditional sequence-to-sequence model and tested it with heart sounds. The system features a multitask configuration for classification and captioning, and it allows us to adjust the level of detail in the description according to the purpose. The evaluation results are promising. We intend to enrich the training data and improve the system configuration to make it a practical system in the near future. Thank you very much.
