Structural health monitoring (SHM) and intelligent damage image recognition during the construction period have been the subject of intense research in rock tunnel engineering (Gao and Mosalam, 2018). The complex and variable nature of geological structures, the difficulty of predicting them, and the frequent construction activities at the tunnel face make this research problem especially prominent. Moreover, if the rock mass structure is not accurately predicted, the tunnel face may encounter collapses and construction obstacles, which pose serious challenges to tunnel construction, operation, and maintenance.
In recent years, with the rapid development of artificial intelligence (AI) and machine learning (ML) technologies (Krizhevsky et al., 2012), especially the application of deep learning in computer vision, image-based recognition of rock structure information at the tunnel face has attracted the attention of scholars and builders involved in construction and maintenance (Huang et al., 2018). So far, several researchers have used classical field research methods (Aksoy et al., 2012, Zhang and Goh, 2012, Rehman et al., 2019), mechanics deduction (Goh and Zhang, 2012, Song et al., 2018, Li et al., 2019, Su et al., 2019), and numerical simulation (Manouchehrian and Cai, 2018, Schreter et al., 2018, Xiang et al., 2018, Lee et al., 2019) to identify the mechanisms of different rock mass structures. By exploring these structural mechanisms, many internal characteristics of a project can be clarified and the project risk greatly reduced. However, most of these studies address a specific rock structure and ignore the macroscopic judgment and analysis of the whole rock structure; hence, they do not enable automatic recognition of rock images. In accordance with the National Standard for Engineering Classification of Rock Mass (GB50218-2014), five rock structure categories are defined in Table 1 based on the form and features of the discontinuities: Mosaic Structure (MS), Granular Structure (GS), Layered Structure (LS), Block Structure (BS), and Fragmentation Structure (FS). In studies of rock features across all categories, researchers (Fekete and Diederichs, 2013, Chen et al., 2016, Pittam et al., 2016, Riquelme et al., 2016, Sarro et al., 2018; Zhang et al., 2020) have mainly concentrated on traditional rock feature extraction methods because of the limited sample size. For instance, Chen et al. (2016) proposed a photogrammetry method that automatically extracts discontinuity orientations from the 3D point cloud of a rock mass surface based on K-means clustering and random sample consensus (RANSAC). Fekete and Diederichs (2013) presented a workflow, from data collection and analysis to design outputs, for integrating Lidar-derived point-cloud data into rock mass stability modeling. Riquelme et al. (2016) proposed low-cost remote sensing and manual collection techniques to gather all the accessible discontinuities of a slope. These results provided valuable ideas for the structural stability analysis of rock tunnels.
Nevertheless, compared with state-of-the-art deep learning techniques, traditional machine learning methods have lower efficiency and accuracy in feature extraction. As an important branch of deep learning in computer vision, convolutional neural networks (CNNs) have been widely used in infrastructure health monitoring in civil engineering, e.g., identifying concrete cracks (Cha et al., 2018, Chaiyasarn et al., 2018, Dorafshan et al., 2018, Bang et al., 2019), detecting damaged buildings (Cha and Buyukozturk, 2015, Gao and Mosalam, 2018), monitoring damaged walls (Beckman et al., 2019), and detecting corroded pipes (Cheng and Wang, 2018, Kumar et al., 2018). These methods not only achieve high accuracy and efficiency but have also been widely adopted because their scenes are simple and image acquisition is convenient. However, owing to poor lighting conditions, complex construction procedures, and a poor photography environment, it is difficult to obtain image samples in the areas close to the rock tunnel face. In addition, deep learning depends on a huge number of training samples (Gu et al., 2018), and research on tunnel face images is scarce at present.
In this paper, an adaptive digital photography system (ADPS) is used to acquire an image database of the tunnel working face in batches, using the on-site lighting system during the construction process. Based on the data obtained from the Mengzi-Pingbian Highway Tunnel (MPHT) in Yunnan Province, China, five rock structure categories (i.e., MS, GS, LS, BS, and FS) were labelled according to the National Standard in China (GB50218-2014). In addition, a comprehensive CNN that combines Inception and ResNet, namely Inception-ResNet-V2 (IRV2), was employed to classify the dataset. The recognition accuracy, precision, F-score, and execution time of this method were then compared with those of other deep learning methods, namely ResNet-50 and ResNet-101 (He et al., 2016) and Inception-v4 (Szegedy et al., 2017). Finally, the application of the deep learning framework to rock tunnel face image recognition was systematically evaluated.
Image acquisition of rock tunnel face
The ADPS used in the image acquisition consists of a Canon 750D camera, tapeline, tripod, measuring equipment (laser rangefinder, thermo hygrometer, and illuminometer), and light source (two LED dropout lamps with 1000 W adjustable power). As shown in Fig. 1, ADPS was manually used to record the temperature, humidity, illumination information, and the distance between the shooting point and the surface of each tunnel face in Mengzi-Pingbian Highway Tunnel (MPHT) in Yunnan, China. In the photography process, it is ensured that the illumination is uniform and parallel, the distance between the two lights is within 1.5–2 m, and the camera is horizontally placed on the tripod.
To determine the most suitable image acquisition time, six main construction stages are presented and compared in Fig. 2: (1) drilling with a pneumatic-leg rock-drilling trolley, (2) pre-blasting preparation (a detonator is used to connect multiple detonators, with the red quadrilateral marking the connection), (3) tunnel ventilation (ventilation and risk elimination at the working face using draught fans and ventilation pipes), (4) slag removal from the working face (removing blasting residues with excavation and transportation equipment), (5) building the first lining trolley, and (6) tunnel face image acquisition. The best time for face image acquisition is after the first lining trolley has been built, which both protects the safety of personnel and leaves sufficient time for shooting. Furthermore, to enrich the compiled dataset, face images in different construction processes were taken under different lighting conditions at a distance of 1.0–1.5 m from the camera. As a result, more than 3000 images were captured from 150 tunnel faces, occupying 20 GB of storage space.
In this study, 42,400 geological structure images randomly cut from 3000 raw images of various tunnel faces were tagged with five labels: MS, GS, LS, BS, and FS. To improve the image processing efficiency, the rock face images were cropped from the original size of 3968 × 2240 pixels into smaller images of 396 × 448 pixels, meaning that each original image is divided into 50 sub-images (10 × 5). The number of images in each label is listed in Table 2, and samples of the dataset are shown in Fig. 3. Because of the complex geological conditions in Yunnan Province, China, the tunnel face textures are affected by faults, rock interlayers, and rock category, and traditional feature extraction methods (e.g., edge detection and threshold segmentation) can distinguish them only with relatively low accuracy and efficiency. In addition, the texture differences between rock structures with different labels decrease as the diversity of the database increases. Therefore, the fast capture of key texture features is a significant challenge for big data-based rock classification. Since deep learning models are characterized by their capacity to acquire deep features for classification tasks, it is pressing to apply them to this task.
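The cropping step can be illustrated with a minimal sketch, assuming a regular 10 × 5 grid over a 3968 × 2240 image stored as a NumPy array. The function name `split_into_tiles` is hypothetical, and the regular grid is an assumption: the text describes the sub-images as randomly cut, so the authors' actual procedure may differ.

```python
import numpy as np

def split_into_tiles(image, cols=10, rows=5):
    """Split an H x W x C image into rows * cols equal tiles.

    Pixels that do not fit an integer tile size are dropped
    (here 8 columns, since 3968 // 10 = 396).
    """
    height, width = image.shape[:2]
    tile_h, tile_w = height // rows, width // cols
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tiles.append(image[r * tile_h:(r + 1) * tile_h,
                               c * tile_w:(c + 1) * tile_w])
    return tiles

# Placeholder image with the raw size quoted in the text (height, width, RGB).
raw = np.zeros((2240, 3968, 3), dtype=np.uint8)
tiles = split_into_tiles(raw)
print(len(tiles), tiles[0].shape)  # 50 (448, 396, 3)
```

Each tile is 396 pixels wide and 448 pixels high, which agrees with the 396 × 448 sub-image size stated above.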
Basic CNN model
The architecture of CNNs consists of several distinct layers with different functions, including the convolutional, activation, pooling, dropout, and Softmax layers. Each of these layers is briefly explained below (Krizhevsky et al., 2012, Huang et al., 2017). Meanwhile, the framework of IRV2 is presented in Fig. 4.
Convolutional layers are widely used as feature generators. Each convolution kernel of a CNN slides across the input matrix with a fixed stride. At each sliding position, element-by-element multiplication is performed between the input matrix (e.g., an RGB image) and the kernel, and the products are summed and added to a bias to form one output value. The output size is determined by the input size, the kernel size, and the stride (and by any padding applied). The weights and biases of the convolution kernels are the unknown parameters to be optimized.
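A minimal NumPy sketch of the sliding-window operation described above is given here for illustration. It assumes a single kernel and no padding; the helper names `conv_output_size` and `conv2d_single` are hypothetical, and real frameworks use far more efficient implementations.

```python
import numpy as np

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Number of sliding positions along one spatial axis."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

def conv2d_single(image, kernel, bias=0.0, stride=1):
    """Unpadded convolution of an H x W x C input with one kh x kw x C kernel."""
    kh, kw = kernel.shape[:2]
    out_h = conv_output_size(image.shape[0], kh, stride)
    out_w = conv_output_size(image.shape[1], kw, stride)
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            # Element-wise product, summed over the window, plus the bias.
            out[i, j] = np.sum(patch * kernel) + bias
    return out

image = np.ones((7, 7, 3))    # toy RGB input
kernel = np.ones((3, 3, 3))   # toy 3 x 3 kernel
feature_map = conv2d_single(image, kernel, bias=1.0, stride=2)
print(feature_map.shape)   # (3, 3)
print(feature_map[0, 0])   # 28.0
```

With a 7 × 7 input, a 3 × 3 kernel, and a stride of 2, the output has (7 − 3) / 2 + 1 = 3 positions per axis; each value is 27 products of ones plus the bias of 1.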
The role of the activation layer is to increase the nonlinearity of the convolution output. The Rectified Linear Unit (ReLU) activation function generally outperforms the traditional sigmoid function because it preserves a usable gradient: its derivative is 1 for positive inputs, which mitigates the vanishing gradient (gradient dispersion) that arises when gradients become too small, while maintaining a relatively fast calculation speed. Therefore, it is widely used in CNNs. In the Inception-ResNet-V2 algorithm shown in Fig. 4, the ReLU function is implemented in all activation layers.
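The difference in gradient behaviour between ReLU and the sigmoid can be checked numerically. The following is a small illustrative sketch, not taken from the paper; the sample inputs are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return (x > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 1.0, 10.0])
print(relu_grad(x))      # [0. 0. 1. 1.]
print(sigmoid_grad(x))   # tiny at |x| = 10, never above 0.25
```

The sigmoid derivative peaks at 0.25 and shrinks toward zero for large inputs, so products of many such factors across layers become very small, whereas the ReLU derivative stays at 1 for active units.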
The pooling layer generates the output through a sliding window similar to the convolutional layer and reduces the dimension of the feature map, number of parameters, and spatial size of the upper layer. There are two common pooling layers (i.e., the max pooling and the average pooling), whose values are calculated by the max or mean operators, respectively. As shown in Fig. 5, both of these two pooling layers are adopted in Inception-ResNet-V2 to keep the framework invariant to distortion, transformation, and translation.
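As an illustration of the two pooling operators, the sketch below applies 2 × 2 max and average pooling with a stride of 2 to a toy 4 × 4 feature map. The window size, stride, and the function name `pool2d` are assumptions chosen for the example and are not the settings of Inception-ResNet-V2.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Max or average pooling over a 2D feature map (no padding)."""
    reduce = np.max if mode == "max" else np.mean
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = reduce(window)
    return out

fm = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(fm, mode="max")
avg_pooled = pool2d(fm, mode="avg")
print(max_pooled)  # [[ 5.  7.] [13. 15.]]
print(avg_pooled)  # [[ 2.5  4.5] [10.5 12.5]]
```

Both operators halve each spatial dimension here, which is how pooling reduces the size of the feature map passed to later layers.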
Due to the large number of parameters in CNNs, over-fitting may occur in these networks; the dropout layer is used to mitigate it. Dropout is a random disconnection technique that disables the connections between nodes during training with a dropout probability of 1 − p (i.e., each node is kept with probability p). This random inactivation reduces the number of active parameters in each training step, helps the model avoid over-fitting, and improves the robustness of the network structure.
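A minimal sketch of dropout with keep probability p is shown below. It uses the common "inverted dropout" convention of rescaling the kept activations by 1/p, which is an assumption here; the paper does not state the value of p or the scaling convention it used.

```python
import numpy as np

def dropout(activations, keep_prob, rng, training=True):
    """Zero each activation with probability 1 - keep_prob during training."""
    if not training:
        return activations  # dropout is disabled at inference time
    mask = rng.random(activations.shape) < keep_prob
    # Inverted dropout (assumed convention): rescale so the expected value is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
y = dropout(x, keep_prob=0.8, rng=rng)
print((y == 0).mean())  # fraction dropped, close to 0.2
print(y.mean())         # close to 1.0 thanks to the rescaling
```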
The Softmax layer, typically paired with a multinomial logistic loss, is used in multi-class classification to normalize the output vector into class probabilities. In this way, the CNN transforms the abstract feature matrix into a concrete classification, namely the rock structure of the tunnel face (i.e., MS, GS, LS, BS, and FS structures in this paper).
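The normalization performed by the Softmax layer can be sketched as follows. The logit values are hypothetical and serve only to show how a score vector is mapped to probabilities over the five labels.

```python
import numpy as np

LABELS = ["MS", "GS", "LS", "BS", "FS"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical network outputs for one sub-image.
logits = np.array([1.0, 0.5, 3.0, 2.0, 0.1])
probs = softmax(logits)
predicted = LABELS[int(np.argmax(probs))]
print(probs.round(3), predicted)
```

The predicted class is the label with the largest probability, here LS because its logit is the largest.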
Inception-ResNet-V2 (IRV2), proposed by researchers at Google (Szegedy et al., 2017) and shown in Fig. 4, is employed as a state-of-the-art model to classify the rock structure images. It is mainly a combination of GoogLeNet (Inception) (Szegedy et al., 2015) and ResNet (He et al., 2016). The algorithm consists of 10 parts, each with its own role and function. Inception is a typical network with a parallel layer structure that was first applied in GoogLeNet. It includes parallel connections of filters with different sizes of 1 × 1, 3 × 3, and 5 × 5. Smaller convolution kernels extract image features more efficiently and greatly reduce the number of model parameters. Because a large convolution kernel increases the number of parameters, it is replaced by multiple small convolution kernels that cover the same receptive field with fewer parameters. As a result, the model becomes wider and more accurate than previous networks. Presently, Inception v1–v4 (Szegedy et al., 2017) are the typical frameworks of GoogLeNet. In contrast, the residual learning-based ResNet, the winner of ILSVRC 2015, can go as deep as 152 layers (He et al., 2016). The main idea of ResNet is to add a direct link (shortcut) to the framework, an idea related to the Highway Network. A conventional network layer applies a nonlinear transformation to its input, whereas the Highway Network allows a certain proportion of the output of the previous layer to be retained, so that the original input information can be transmitted directly to the next layer. Similarly, ResNet protects the integrity of information by directly transmitting the input to the output. The network then only needs to learn the difference (residual) between input and output, which simplifies the learning objective and reduces its difficulty. ResNet-50, ResNet-101, and ResNet-152 are the typical frameworks of ResNet.
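The shortcut idea can be illustrated with a very small sketch in which a block returns its input plus the output of a learned branch. The names `residual_block` and `small_branch`, and the toy branch itself, are hypothetical; actual ResNet branches are stacks of convolutional layers.

```python
import numpy as np

def residual_block(x, branch):
    """Return x + F(x), so the branch F only has to learn the residual."""
    return x + branch(x)

def small_branch(v):
    # Toy stand-in for a learned transformation (scaling followed by ReLU).
    return np.maximum(0.1 * v, 0.0)

x = np.array([1.0, -2.0, 3.0])
identity_out = residual_block(x, lambda v: np.zeros_like(v))
residual_out = residual_block(x, small_branch)
print(identity_out)  # [ 1. -2.  3.]  a zero branch passes the input through unchanged
print(residual_out)  # approximately [ 1.1 -2.   3.3]
```

When the branch outputs zero, the block reduces to the identity mapping, which is why the input information reaches deeper layers intact.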
In the Residual-Inception network (Fig. 5), a simplified Inception module was used because it involves less computational complexity than the original Inception module. Fig. 5a‒c shows the Inception-ResNet modules, i.e., Inception-ResNet-A, Inception-ResNet-B, and Inception-ResNet-C, which appear 5, 10, and 5 times in the overall framework, respectively. Fig. 5d and e shows the reduction layers of IRV2, i.e., Reduction-A and Reduction-B. Each Inception block is followed by a filter-expansion layer (1 × 1 convolution without activation function) that transforms the dimension so that it matches the input. This compensates for the dimensionality reduction in the Inception block. According to previous research (Szegedy et al., 2017), Inception-ResNet-V2 (IRV2), developed from Inception-ResNet-V1 (IRV1), matches the raw cost of the Inception-v4 network. One small technical difference between the residual and non-residual Inception variants is that, in Inception-ResNet, batch normalization (BN) is used only on top of the traditional layers. Since the experiments (Szegedy et al., 2017) showed that layers with a large activation size consume more GPU memory, omitting the BN layer elsewhere allows more Inception modules to be added. It also makes the model more efficient and concise. Moreover, if the number of filters exceeds 1000, the residual network becomes unstable, and an early “death” issue appears in the network training process: after tens of thousands of training iterations, the layer before the average pooling begins generating only zeros. This situation cannot be avoided by reducing the learning rate or by adding additional BN layers. Furthermore, for comparison with the IRV2 model, ResNet-50, ResNet-101, and Inception-v4 models are also implemented to classify the rock structure.
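The role of the 1 × 1 filter layer in matching dimensions can be sketched as follows. This is a simplified illustration under stated assumptions: both parallel branches are plain 1 × 1 convolutions (the real modules in Fig. 5 also contain larger kernels), the channel counts are arbitrary, and the function names are hypothetical.

```python
import numpy as np

def conv1x1(x, weights):
    """1 x 1 convolution: a per-pixel linear map over the channel axis."""
    return np.einsum("hwc,cd->hwd", x, weights)

def relu(x):
    return np.maximum(x, 0.0)

def inception_resnet_block(x, branch_weights, projection):
    # Parallel branches (simplified to 1 x 1 convolutions with ReLU).
    branches = [relu(conv1x1(x, w)) for w in branch_weights]
    merged = np.concatenate(branches, axis=-1)
    # Linear 1 x 1 projection (no activation) back to the input channel count.
    residual = conv1x1(merged, projection)
    return relu(x + residual)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 32))                      # H x W x C feature map
branch_weights = [rng.normal(size=(32, 8)) * 0.1,    # branch 1: 32 -> 8 channels
                  rng.normal(size=(32, 16)) * 0.1]   # branch 2: 32 -> 16 channels
projection = rng.normal(size=(24, 32)) * 0.1         # 8 + 16 = 24 -> 32 channels
y = inception_resnet_block(x, branch_weights, projection)
print(y.shape)  # (8, 8, 32)
```

Because the concatenated branches have 24 channels while the input has 32, the addition `x + residual` is only possible after the projection restores the channel count.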
Accuracy, precision, recall, and F-score are indispensable metrics for classification tasks and are frequently employed to evaluate the applicability and superiority of a framework. Taking binary classification as a simple example, Table 3 defines True Negative (TN), True Positive (TP), False Negative (FN), and False Positive (FP), from which the relationships between the evaluation metrics and these judgment elements are obtained in Eqs. (1), (2), (3), (4).
where accuracy answers the following question: out of all the images in the dataset, what fraction is correctly classified? Precision answers the following question: out of all the images that the classifier labelled as positive, what fraction is truly positive (i.e., true positives)? Recall answers the following question: out of all positive images, what fraction is correctly judged as positive? The F-score is a comprehensive indicator that weighs the importance of recall against precision. The α value is set to 1 in this study, meaning that recall and precision are given equal significance.
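For reference, the four metrics can be computed from the TP, TN, FP, and FN counts as in the sketch below. The standard definitions are used, and the weighted F-score form (1 + α²)·P·R / (α²·P + R) is an assumption, since the paper's own equations are not reproduced in this text; with α = 1 it reduces to the harmonic mean of precision and recall. The counts in the example are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn, alpha=1.0):
    """Accuracy, precision, recall, and F-score for a binary problem."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Assumed weighted form; alpha = 1 weights precision and recall equally.
    f_score = ((1 + alpha ** 2) * precision * recall
               / (alpha ** 2 * precision + recall))
    return accuracy, precision, recall, f_score

# Hypothetical counts for one class treated as "positive".
acc, prec, rec, f1 = classification_metrics(tp=90, tn=80, fp=10, fn=20)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} F={f1:.3f}")
```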
Experiment and results
The models were trained on the generated benchmark dataset in the TensorFlow framework developed by Google, on a machine with an Intel Core i7-8700 processor, 32 GB of RAM, an Nvidia GTX 1080 Ti GPU with 11 GB of memory, and the Windows 10 operating system.
According to the performance of different frame models (i.e., ResNet-50, ResNet-101, Inception-v4, Inception-ResNet-V2), the experiment mainly includes the following aspects: (1) training, validation, and test results of database; (2) cross-validation, accuracy evaluation, and presentation of testing images, and (3) evaluation of execution time.
Training, validation, and test results
In the CNN experiments, the database is randomly divided into training, validation, and testing groups, each with a fixed proportion. The number of images of each rock structure is shown in Table 4, where the proportions of training, validation, and testing images in each label are set to 60%, 25%, and 15%, respectively. The training and validation processes can be implemented in the same application code. Validation images are not used to update the weights and biases directly; instead, the validation results are used to monitor and adjust the training process. Accordingly, the accuracy is improved and the probability of overfitting and non-convergence is sharply reduced.
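A random 60% / 25% / 15% split can be sketched as below. The function name `split_indices`, the seed, and the example size of 1000 images are hypothetical; to reproduce the per-label proportions described above, the same split would be applied separately to the images of each label.

```python
import numpy as np

def split_indices(n_images, fractions=(0.60, 0.25, 0.15), seed=0):
    """Randomly partition image indices into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_images)
    n_train = int(round(fractions[0] * n_images))
    n_val = int(round(fractions[1] * n_images))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 600 250 150
```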
Total loss and accuracy are two commonly used indicators for evaluating the convergence and fitting quality of the training and validation processes in deep learning. Total loss is a non-negative real-valued function that measures the inconsistency between the predicted and true values; the smaller the loss, the better the model fits the data. Figs. 6 and 7 illustrate the changes in total loss and accuracy, respectively, with increasing training steps for the four deep learning methods (i.e., ResNet-50, ResNet-101, Inception-v4, and Inception-ResNet-V2). All four algorithms converge in the training and validation processes of the rock structure classification. Fig. 6a and b shows that the total loss, from smallest to largest, follows the order Inception-ResNet-V2, Inception-v4, ResNet-101, and ResNet-50 in both the training and validation sets. The accuracy ranking (Fig. 7a and b) is exactly the opposite: from lowest to highest accuracy, the models are ranked as ResNet-50, ResNet-101, Inception-v4, and Inception-ResNet-V2 in both sets. Meanwhile, the converged total loss in training is smaller than that in validation, while the prediction accuracy in training is higher than that in validation. In summary, the IRV2 model outperforms the other models in rock structure classification.
Cross-validation and accuracy evaluation of testing images
In this study, four evaluation metrics were selected to assess the performance of the models in the image identification process. According to the data distribution in Table 2, 7400 images (not used in training) were randomly selected from the database for classification testing. The accuracy, precision, recall, and F-score of the four methods on the testing dataset are compared in Fig. 8. All the evaluation metrics of the CNN models present a similar trend: from highest to lowest, the metric values always follow the order BS, LS, MS, FS, and GS. Because BS images have distinct texture features and appearance, the DCNN methods perform most prominently in BS identification. Therefore, the proposed method can improve the classification accuracy on the rock tunnel face image dataset. In contrast, the FS and GS images in the dataset do not have distinct features compared with the other three categories. Therefore, the classification of FS and GS images shows relatively poor performance, with rather low values of the evaluation metrics for these two categories.
To further explore the IRV2 classification algorithm, a confusion matrix was used to evaluate the recognition results. The confusion matrix is a standard format for classification evaluation, expressed as a matrix of N rows and N columns. Each column represents a predicted category, and the column total is the number of images classified into that category; each row represents the true category, and the row total is the number of images that actually belong to that category. The matrix is built by counting, for each true category, how many images were assigned to each predicted category, whether correct or wrong, and displaying the counts in tabular form. Table 5 shows the confusion matrix obtained from the 7400 images of the rock structure dataset using the Inception-ResNet-V2 model. The proportions of the five rock structure types in the database are 18.08%, 19.29%, 22.84%, 20.89%, and 18.89%. The classification of Granular Structure (GS) reaches an accuracy of 93.90%. Furthermore, among the five categories, FS has a higher probability of being misclassified as GS, since the two are more similar. Similarly, MS has a higher probability of being misjudged as FS. In general, BS is the easiest to recognize accurately, with an accuracy of 98.12%, followed by LS, MS, FS, and GS with accuracies of 97.05%, 95.93%, 94.80%, and 93.90%, respectively. Therefore, using the IRV2 model, the GS class has the poorest classification performance, while the BS class has the strongest.
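The construction of a confusion matrix with true categories in rows and predicted categories in columns can be sketched as below. The label sequences are hypothetical toy data, and reading the per-class accuracy as the fraction of each row that lies on the diagonal is an assumption about how the values in Table 5 were derived.

```python
import numpy as np

LABELS = ["MS", "GS", "LS", "BS", "FS"]

def confusion_matrix(true_ids, pred_ids, n_classes):
    """Rows are true categories, columns are predicted categories."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_ids, pred_ids):
        matrix[t, p] += 1
    return matrix

def per_class_rate(matrix):
    """Fraction of each true category (row) that was classified correctly."""
    return np.diag(matrix) / matrix.sum(axis=1)

# Hypothetical labels for ten test images (indices into LABELS).
true_ids = [0, 0, 1, 1, 1, 2, 2, 3, 4, 4]
pred_ids = [0, 4, 1, 1, 4, 2, 2, 3, 4, 1]
cm = confusion_matrix(true_ids, pred_ids, len(LABELS))
rates = per_class_rate(cm)
print(cm)
print(dict(zip(LABELS, rates.round(3))))
```

Off-diagonal entries show which pairs of categories are confused; in this toy data, one FS image is predicted as GS and one GS image as FS.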
To visualize the results of image classification, a fraction of the 7400 testing images was randomly selected (Fig. 9), with the last images in each row showing misclassification cases. Fig. 9 shows that the IRV2 framework provides excellent predictions for the five-category classification problem. In terms of recognition errors, the results of Fig. 9 are consistent with those of the confusion matrix: errors tend to occur between two similar categories, such as GS and FS. Therefore, to improve the robustness of the model, it is necessary to increase the number of samples and the variety of texture morphology for the related categories in the training process.
Evaluation of execution time
The execution times of the different frameworks on different processors (CPU and GPU) are shown in Table 6. The results show that the computing time of the Graphics Processing Unit (GPU) is much shorter than that of the Central Processing Unit (CPU) for every running process (i.e., validation and testing) and model. However, some traditional machine learning methods are CPU-based (Cha and Buyukozturk, 2015) and require considerable time and storage space when the algorithm is complex. Thus, using the GPU to solve classification problems offers an important reference for current research. With the GPU, the average validation time of the Inception-ResNet-V2 algorithm is 2.163 s per image, compared with 2.78 s, 3.02 s, and 2.25 s for ResNet-50, ResNet-101, and Inception-v4, respectively. In other words, the validation times of ResNet-50, ResNet-101, and Inception-v4 are about 1.28, 1.39, and 1.04 times that of the IRV2 model, respectively. In addition, the IRV2 method takes 0.325 s per image in testing, and the test times of ResNet-50, ResNet-101, and Inception-v4 are 1.57, 1.95, and 1.48 times that of IRV2, respectively. Therefore, it can be concluded that IRV2 is computationally more efficient than the other three algorithms.
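The validation-time ratios follow directly from the per-image times quoted above, as the short calculation below shows. The dictionary name is hypothetical; the timing values are those stated in the text.

```python
# GPU validation time per image (seconds), as quoted in the text.
validation_time_s = {
    "ResNet-50": 2.78,
    "ResNet-101": 3.02,
    "Inception-v4": 2.25,
    "Inception-ResNet-V2": 2.163,
}

baseline = validation_time_s["Inception-ResNet-V2"]
ratios = {name: t / baseline
          for name, t in validation_time_s.items()
          if name != "Inception-ResNet-V2"}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f} x the IRV2 validation time")
```

The computed ratios are roughly 1.285, 1.396, and 1.040, which agree with the reported 1.28, 1.39, and 1.04 if the reported values were truncated rather than rounded.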
To further evaluate the recognition of tunnel face images in the MPHT project, an original image was selected as an example (Fig. 10). The image was segmented into 25 sub-images (5 × 5), which were labelled according to their locations. All sub-images were fed to the IRV2 framework in batches for testing. Replacing the original image with sub-images avoids both the influence of local recognition errors on the overall recognition rate and a one-time determination of the tunnel face structure, and it allows the rock mass structure to be analyzed subsequently by probabilistic and statistical methods.
Fig. 11 displays the category statistics of each sub-image obtained from the segmentation in Fig. 10. In this figure, the red rectangular box marks the misrecognized item. As can be seen, although the rock mass structure of sub-image 5-1 is actually Layered Structure (LS), it is still misjudged as Block Structure (BS), with a misjudgment rate of 35.2%.
To further analyze the difference between the sub-image statistical recognition method and the overall recognition method, Table 7 summarizes the average of the statistical data in Fig. 11 and the testing probability of the original image classification. The table shows that both methods identify the correct class. For the LS class, the probabilities given by the sub-image method and the original image method are 94.27% and 93.25%, respectively. However, the two methods differ in their error labels: the errors of the sub-image method are concentrated on BS and FS, while those of the original image method fall mainly on FS and MS. From the viewpoint of manual recognition, errors toward BS and FS are more likely than errors toward FS and MS; thus, the sub-image method is closer to human recognition. Therefore, in terms of both accuracy and error divergence, the sub-image method is the more reasonable approach.
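One way to aggregate sub-image predictions into a single result for the whole face is to average the class probabilities over all sub-images, as sketched below. This averaging rule and the function name `aggregate_subimages` are assumptions made for illustration, and the probability values are hypothetical rather than taken from Fig. 11 or Table 7.

```python
import numpy as np

LABELS = ["MS", "GS", "LS", "BS", "FS"]

def aggregate_subimages(probabilities):
    """Average per-class probabilities and count per-class votes over sub-images."""
    probabilities = np.asarray(probabilities)
    mean_prob = probabilities.mean(axis=0)
    votes = np.bincount(probabilities.argmax(axis=1),
                        minlength=probabilities.shape[1])
    return mean_prob, votes

# Hypothetical outputs for 25 sub-images: 24 confidently LS, one misjudged as BS.
probs = np.tile([0.01, 0.01, 0.95, 0.02, 0.01], (25, 1))
probs[20] = [0.05, 0.05, 0.30, 0.55, 0.05]

mean_prob, votes = aggregate_subimages(probs)
overall = LABELS[int(mean_prob.argmax())]
print(overall, mean_prob.round(3), votes)
```

A single misjudged sub-image lowers the mean LS probability only slightly, which illustrates why local errors have a limited effect on the overall result.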
In this paper, a vision-based framework based on the Inception-ResNet-V2 network was proposed for the classification of rock mass structure at the tunnel face. To build the database in an orderly and quantitative manner, an adaptive digital photography system (ADPS) consisting of a Canon 750D camera, tapeline, tripod, measuring equipment, and light source was used in the image acquisition process. The Inception-ResNet-V2 network was trained with over 35,000 images extracted from 150 tunnel faces in the Mengzi-Pingbian Highway Tunnel (MPHT) in Yunnan Province, China, and then tested with an additional 7400 images. Compared with ResNet-50, ResNet-101, and Inception-v4, the proposed network reduces the validation and testing time, increases the accuracy of rock structure classification, and facilitates future condition assessment.
Based on the obtained results, the computing time of GPU is much lower than CPU for the four experimental CNNs, whether in different running processes (such as validation and testing) or using a different model. Using the GPU processor instead of traditional CPU, Inception-ResNet-V2 exhibited the best performance over the other three CNNs.
In addition, the model trained by a large database can obtain the object features more comprehensively, leading to higher accuracy. Compared with the original image classification method, the sub-image method is closer to the reality considering both the accuracy and the perspective of error divergence.
As the first attempt to use CNNs for the classification of rock mass structures captured from the under-construction tunnel face, this research makes the following two contributions. First, it paves the way for other researchers to apply a more accurate and efficient framework to classify the rock structure at the under-construction tunnel face. Second, it confirms that the proposed sub-image technique significantly improves the efficiency of conventional overall recognition. However, two aspects remain to be improved: (1) establishing a real-time tunnel face image recognition system during construction, which is the most urgent need in the field, and (2) reducing the time required by the pre-processing and post-processing of the sub-image classification method, which is usually longer than the testing time. These issues can be investigated by the authors in future research aimed at designing a more automatic recognition system.
Source: Deep learning based classification of rock structure of tunnel face
Authors: Jiayao Chen, Tongjun Yang, Dongming Zhang, Hongwei Huang, Yu Tian