Translation: Attention-Guided Discriminative Region Localization and Label Distribution Learning for Bone Age Assessment

# Abstract

Bone age assessment (BAA) is clinically important as it can be used to diagnose endocrine and metabolic disorders during child development. Existing deep learning-based methods for classifying bone age use the global image as input, or exploit local information by annotating extra bounding boxes or key points. However, training with the global image underutilizes discriminative local information, while providing extra annotations is expensive and subjective. In this paper, we propose an attention-guided approach to automatically localize the discriminative regions for BAA without any extra annotations. Specifically, we first train a classification model to learn the attention maps of the discriminative regions, finding the hand region, the most discriminative region (the carpal bones), and the next most discriminative region (the metacarpal bones). Guided by those attention maps, we then crop the informative local regions from the original image and aggregate different regions for BAA. Instead of treating BAA as a general regression task, which is suboptimal due to the label ambiguity problem in the age label space, we propose joint age distribution learning and expectation regression, which makes use of the ordinal relationship among hand images with different individual ages and leads to more robust age estimation. Extensive experiments are conducted on the RSNA pediatric bone age data set. Using no training annotations, our method achieves competitive results compared with existing state-of-the-art semi-automatic deep learning-based methods that require manual annotation. Code is available at https://github.com/chenchao666/Bone-Age-Assessment.

# Introduction

Bone age assessment (BAA) from hand radiograph images is a common technique for investigating endocrinology and growth disorders [1], or for determining the final adult height of children [2]. In clinical practice, BAA is usually performed by examining the ossification patterns in a radiograph of the non-dominant hand, and then comparing the estimated bone age with the chronological age. A discrepancy between the two values indicates abnormalities [3]. The most widely used manual BAA methods are Greulich-Pyle (GP) [4] and Tanner Whitehouse (TW) [2]. In the GP method, bone age is estimated by comparing the whole hand radiograph with a reference atlas of representative ages, while the TW method examines 20 specific regions of interest (RoIs) and assigns scores based on a detailed local structural analysis. The TW method is more reliable, but time consuming, while the GP method is relatively quick and easy to use. In both manual solutions, reliable and accurate bone age estimation is limited by the subjective influence of a trained radiologist.

In this work, we concentrate on deep learning approaches for BAA. The difficulties of using deep learning for BAA are: (1) Raw input images are large (about 2000 × 1500 pixels), but bone age estimation is a fine-grained recognition task because ossification patterns are usually contained in specific small RoIs. Therefore, downsizing the raw images into low-resolution images will lose important information, decreasing the final performance. (2) Raw images can be poorly aligned. As shown in Fig. 1(a), the RoIs can be very small with undetermined position, which also reduces model performance. Some recent deep learning-based approaches have proposed to improve the BAA performance by localizing the RoIs [5] or performing image alignment [6] before age regression. Even though these methods demonstrated great performance improvement on BAA tasks, they suffer from two main limitations:

First, to locate the informative local patches for BAA, most of these methods require identifying RoIs or key points that are important for BAA and providing extra annotations for training [5], [6]. However, existing BAA datasets contain only image-level labels, and manually drawing RoIs and providing annotations is subjective and expensive, and also requires domain knowledge from expert radiologists.

Second, existing methods treat the BAA task as a general regression or classification problem, using the mean absolute error ($\ell_1$ loss) or the mean squared error ($\ell_2$ loss) to penalize the differences between the estimated ages and the ground-truth ages. However, due to the label ambiguity problem in the age label space [7], a loss function defined on a single label is suboptimal. Learning with a single label does not exploit the ordinal relationship among hand images with different individual ages, and leads to over-confident predictions.

To address these limitations, we present a novel attention-guided deep learning framework for bone age expectation regression. Instead of downsizing the input images or training a detection or segmentation model using extra annotations, we propose to utilize attention maps to localize the most discriminative regions for BAA. Then, we aggregate different RoIs for both bone age expectation regression and age distribution learning. Our contributions are summarized as follows: (1) As shown in Fig. 1, our method uses attention maps learned by deep models to automatically identify the hand region, the most discriminative region, and the next most discriminative region. It is also the first analysis to demonstrate systematically that the carpal and metacarpal bones are the two most important regions for BAA. (2) In order to leverage the correlation between different individual ages and prevent the network from over-estimating classification confidence, we propose joint age distribution learning and bone age expectation regression, which consistently improves performance. (3) By leveraging attention-guided local information and age distribution learning, our approach achieves competitive results without requiring manual annotations.

# RELATED WORK

Bone Age Assessment. Over the past decades, numerous automated image analysis methods and tools have been developed for BAA. These methods can be divided into two groups: non-deep learning-based methods [8]–[11] and deep learning-based methods [3], [5], [6], [12], [13]. Early representative non-deep learning-based methods mainly extract hand-designed features from the whole image or specific RoIs, and then train a classifier with no more than 2,000 samples. The performance of these methods is quite limited, with results ranging from 10 to 28 months mean absolute difference (MAD) [3]. Deep CNNs [14], [15] and a large-scale BAA data set introduced by the Radiological Society of North America (RSNA) [12] have enabled recent advances to achieve impressive performance, with some exceeding an expert's performance [5], [12], [16]. Specifically, BoNet [3] is an ad-hoc CNN designed for BAA; the authors exploited a deformation layer to address nonrigid bone deformation and achieved a result of 9.5 months MAD on average. In [6], in order to crop specific local regions, the authors first trained a U-Net model to segment the hand region with 100 labeled hand masks and then trained a key point detection model to achieve image registration. As a result, they achieved a 6.30 months MAD for males and a 6.49 months MAD for females. The winners of the RSNA challenge [16] achieved a 5.99 months MAD with their best model and a 4.26 months MAD by averaging 50 predictions (utilizing 5 top models with 10 augmented images). The current best-performing method [5] presented a new framework based on a local analysis of anatomical RoIs; the authors provided extra bounding boxes and key point annotations during training, and performed hand detection and hand pose estimation to exploit local information for BAA. As a result, they achieved the best result on the RSNA bone age dataset, 4.14 months MAD.

Attention Guided Part Localization. Previous work mainly focuses on leveraging extra bounding box and key point annotations to localize significant regions for bone age assessment [5], [6]. However, the heavy involvement of manual annotation and domain knowledge makes these methods impractical in large-scale application scenarios. Recently, numerous studies have emerged on attention-guided localization, which allows deep networks to focus on the informative, task-relevant regions of the input images in an unsupervised manner [17]–[22]. Class activation mapping (CAM) [17] revisits the global average pooling layer to enable a convolutional neural network (CNN) to localize the discriminative image regions. In [18], Grad-CAM was proposed, which is a generalization of CAM and is applicable to a significantly broader range of CNN model families. In [20], Fu et al. propose RA-CNN, which recursively learns discriminative region attention and region-based feature representation at multiple scales for fine-grained image recognition. In the medical image analysis community, Cai et al. [23] propose an attention mining (AM) strategy to improve the model's sensitivity to disease patterns on chest X-ray images. In [22], the authors propose an AG-CNN model, which enables the network to learn from disease-specific regions to avoid noise and improve alignment for thorax disease classification in chest X-ray images. Li et al. [24] propose an attention-based multiple instance learning model for slide-level cancer grading and weakly supervised RoI detection. Yang et al. [25] propose to use region-level supervision for the classification of breast cancer histopathology images, where the RoIs are localized and used to guide the attention of the classification network.

# METHODOLOGY

As shown in Fig. 2, our proposal consists of two phases: an attention guided localization phase and a bone age expectation regression phase. In the localization phase, we train a classification model to learn the attention heat maps for the hand region, the most discriminative region, and the next most discriminative region. Guided by these attention maps, we then crop those high-resolution local patches from the original image. In the expectation regression phase, we train a regression model for joint age distribution learning and age expectation regression. The expectation regression model can exploit a single informative local patch or aggregate different local patches for BAA.
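
The cropping step described above can be sketched as follows. This is a minimal illustration under assumptions that the text does not state: the heat map is non-negative, it is thresholded at a hypothetical fraction `ratio` of its maximum, the bounding box of the surviving cells is taken, and the box is scaled from heat-map coordinates to original-image coordinates. The names `attention_bbox` and `crop` and the threshold value are illustrative only.

```python
def attention_bbox(heat, img_h, img_w, ratio=0.5):
    """Return (top, left, bottom, right) in original-image pixels.

    heat:  2-D list (rows x cols) of non-negative attention values.
    ratio: hypothetical threshold, as a fraction of the maximum value.
    """
    rows, cols = len(heat), len(heat[0])
    thresh = ratio * max(max(r) for r in heat)
    # Coordinates of heat-map cells at or above the threshold.
    hot = [(i, j) for i in range(rows) for j in range(cols)
           if heat[i][j] >= thresh]
    top = min(i for i, _ in hot)
    bottom = max(i for i, _ in hot) + 1   # exclusive
    left = min(j for _, j in hot)
    right = max(j for _, j in hot) + 1    # exclusive
    # Scale from the heat-map grid to the original-image resolution.
    sy, sx = img_h / rows, img_w / cols
    return (int(top * sy), int(left * sx), int(bottom * sy), int(right * sx))


def crop(image, box):
    """Crop a 2-D list image with a (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# Toy 4x4 heat map whose strongest responses sit in the lower-right block.
heat = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.8],
    [0.0, 0.0, 0.7, 1.0],
]
box = attention_bbox(heat, img_h=2000, img_w=1500)
```

Cropping from the original image rather than from a downsized copy is what keeps the local patch at high resolution.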

# A. Phase I: Attention Guided RoIs Localization

Weakly supervised detection and localization methods, which aim to identify the location of an object in a scene using only image-level labels, have been widely used for many vision tasks [17]–[19] and in medical image analysis [22], [25]. Inspired by these methods, we propose to utilize learned attention maps to identify the discriminative local patches for BAA. As shown in Fig. 2(a), for a given CNN model and an input image, let $F \in \mathbb{R}^{H \times W \times C}$ denote the activation outputs of the last convolutional layer. The resulting feature maps are then fed into a global average pooling (GAP) or global max pooling (GMP) layer [17], followed by a fully connected (FC) layer. For convenience, we only consider the case of using the GAP layer and ignore the bias term. We denote the average value of the $k$-th feature map as $S_k = \frac{1}{H \times W} \sum_{i,j} F_{ijk}$, $k = 0, 1, \dots, C-1$, and denote the weight matrix of the FC layer as $W \in \mathbb{R}^{C \times T}$, where $T$ is the number of classes in the classification model. In this way, the value of the $t$-th output node can be calculated as the weighted sum $\sum_{k=0}^{C-1} W_{kt} S_k$.
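
As a worked illustration of the definitions above, the sketch below computes the pooled values $S_k$, the FC output for a class $t$, and a class activation map in the style of CAM [17], $M_t(i,j) = \sum_k W_{kt} F_{ijk}$. The map formula and the symbol $M_t$ follow the standard CAM formulation and are an assumption here, since this text does not spell them out. Plain Python lists are used to keep the example dependency-free.

```python
def gap(F):
    """Global average pooling: F[i][j][k] -> S[k], the mean of each channel."""
    H, W, C = len(F), len(F[0]), len(F[0][0])
    return [sum(F[i][j][k] for i in range(H) for j in range(W)) / (H * W)
            for k in range(C)]


def fc_output(S, Wfc, t):
    """Value of the t-th output node without bias: sum_k Wfc[k][t] * S[k]."""
    return sum(Wfc[k][t] * S[k] for k in range(len(S)))


def class_activation_map(F, Wfc, t):
    """CAM-style map (assumed form): M[i][j] = sum_k Wfc[k][t] * F[i][j][k]."""
    C = len(Wfc)
    return [[sum(Wfc[k][t] * pix[k] for k in range(C)) for pix in row]
            for row in F]


# Toy feature maps with H=2, W=2, C=2, and FC weights for T=2 classes.
F = [[[1.0, 0.0], [3.0, 2.0]],
     [[0.0, 4.0], [4.0, 2.0]]]
Wfc = [[0.5, 1.0],    # weights of channel 0 for classes 0 and 1
       [1.0, -1.0]]   # weights of channel 1 for classes 0 and 1

S = gap(F)                               # per-channel means
y0 = fc_output(S, Wfc, t=0)              # score of class 0
M0 = class_activation_map(F, Wfc, t=0)   # per-location contribution to class 0
```

Because pooling and the FC layer are both linear, the spatial mean of `M0` equals `y0`, which is why the map can be read as a per-location contribution to the class score.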

Implementation Details. For the classification model, we adopt InceptionV3 (without top layers) as the backbone network for feature extraction, and then add a GMP (or GAP) layer followed by an FC layer with 240 output nodes, where 240 is the maximum age (in months) of the children in the data set. When we utilize the original one-hot labels for training, the network fails to converge. We believe the reason is that hand images with different ages are similar but have different one-hot labels. Hence, we utilize soft labels for training. For a hand image and its labeled age $t$, we define a function that softens the label distribution into $Y \in \mathbb{R}^T$, the ground-truth label distribution over ages $i = 1, 2, \dots, 240$. The parameter $l$ controls the smoothness of the label distribution: a larger $l$ leads to a smoother label distribution. In the experiments, we set $l = 50$. We utilize weights pre-trained on ImageNet, and train the network with the Adam optimizer with a batch size of 32. The network is trained for 70 epochs; the learning rate is set to 0.0003 for the first 50 epochs and to 0.0001 for the last 20 epochs.
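
The softening function itself is not reproduced in this text. Purely to illustrate the role of the smoothness parameter $l$, the sketch below assumes a hypothetical Gaussian-shaped kernel centred at the labeled age $t$ and normalised to sum to one. The kernel, the function name `soft_label`, and the example values are assumptions for illustration, not the authors' formula.

```python
import math

T = 240  # number of age classes, one per month


def soft_label(t, l, T=T):
    """Hypothetical softened label: Gaussian-shaped bump centred at age t.

    This kernel is an assumption for illustration only.
    Returns a list Y where Y[i - 1] is the weight of age i, for i = 1..T.
    """
    raw = [math.exp(-((i - t) ** 2) / (2.0 * l ** 2)) for i in range(1, T + 1)]
    total = sum(raw)
    return [v / total for v in raw]


sharp = soft_label(t=120, l=5)    # small l: mass concentrated near month 120
smooth = soft_label(t=120, l=50)  # large l: mass spread over many months
```

Under this assumed kernel, both distributions peak at the labeled age, and the larger $l$ gives a lower, wider peak, which matches the qualitative behaviour described above.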

# B. Phase II: Bone Age Expectation Regression

Network Design. In the second phase, we perform bone age expectation regression with the high-resolution local patches. The different local patches are aggregated by feeding them into different input channels. As shown in Fig. 2(b), we adopt Xception [26] without top layers as the backbone network, followed by a convolutional layer, a max pooling layer, and an FC layer. To effectively utilize gender information, we concatenate the image features with the gender features, which are obtained by feeding the gender indicator (1 for male and -1 for female) through an FC layer with 32 neurons. The concatenated features are then fed into the last FC layer with softmax activation. The softmax output $p_k$, $k \in \{1, 2, \dots, 240\}$, represents the bone age distribution (the probability of belonging to each age), which is used to calculate the expectation of bone age.
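
The expectation step can be illustrated with a short sketch: given a softmax distribution over the 240 monthly ages, the expected bone age is the probability-weighted sum of the ages. The function names and the toy logits below are assumptions for illustration.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def expected_age(p):
    """Expectation of bone age in months: sum_k k * p_k for k = 1..len(p)."""
    return sum(k * pk for k, pk in enumerate(p, start=1))


# Toy logits over 240 monthly ages, peaked around month 100.
logits = [-abs(k - 100) / 3.0 for k in range(1, 241)]
p = softmax(logits)
age = expected_age(p)
```

Unlike taking the arg-max class, the expectation is a continuous value that uses the whole distribution, so probability mass on neighboring ages shifts the estimate smoothly.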

Joint Age Distribution Learning for BAA. Hand X-ray images look very similar when the ages of the subjects are close. For example, a person's hand X-ray image looks the same at 160 and 161 months. This inspires us to make use of the correlation information among hand images at neighboring ages. However, existing approaches treat the BAA task as a regression problem or a discrete classification problem, which cannot exploit the correlation information among neighboring ages. Inspired by label distribution learning (LDL) [7], [27]–[29], we propose to learn an age distribution rather than a single age label for each hand image. The age distribution contains a group of probability values, each representing the degree to which the corresponding age describes the hand image. It also reflects the ordinal relationship among neighboring ages. Formally, let $x_i$ and $g_i \in \{-1, 1\}$ denote the local patches and the gender indicator of the $i$-th sample, and let $y_i \in \{1, 2, \dots, 240\}$ denote the corresponding label. As shown in Fig. 2(b), we assume that $F(x_i) \in \mathbb{R}^m$ is the image feature and $G(g_i) \in \mathbb{R}^n$ is the gender feature for the $i$-th sample, where $m$ and $n$ are the dimensions of the image and gender features. We fuse the image and gender information by concatenation, $f_i = [F(x_i); G(g_i)] \in \mathbb{R}^{m+n}$, followed by a fully connected layer that maps $f_i$ to $z_i \in \mathbb{R}^{240}$:

$$z_i = W^\top f_i + b \tag{6}$$

Then, we employ a softmax activation function to turn $z_i$ into the age distribution.
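
The fusion and projection steps can be sketched with toy dimensions. The feature values, the weight matrix, and the sizes ($m = 3$, $n = 2$, and 4 age classes instead of 240) are arbitrary assumptions chosen to keep the example small; only the concatenation, the affine map of Eq. (6), and the softmax follow the text.

```python
import math


def linear(W, b, f):
    """z = W^T f + b, with W stored as (m + n) rows by T columns."""
    return [sum(W[d][k] * f[d] for d in range(len(f))) + b[k]
            for k in range(len(b))]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    mx = max(z)
    e = [math.exp(v - mx) for v in z]
    s = sum(e)
    return [v / s for v in e]


# Toy sizes: image feature m = 3, gender feature n = 2, T = 4 age classes.
image_feat = [0.2, -0.1, 0.5]    # stands in for F(x_i)
gender_feat = [1.0, 0.0]         # stands in for G(g_i)
f = image_feat + gender_feat     # concatenation, length m + n = 5

# Arbitrary illustrative weights and zero bias.
W = [[0.1 * (d + 1) * (k + 1) for k in range(4)] for d in range(5)]
b = [0.0, 0.0, 0.0, 0.0]

z = linear(W, b, f)   # Eq. (6)
p = softmax(z)        # age distribution over the T classes
```

Concatenating before the last FC layer lets every age logit depend on both the image and the gender features through a single weight matrix.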
