Here we summarize the development and implementation aspects of the TRAVEΛEYES project.
For the scope of the TRAVEΛEYES project, two proprietary databases were created: one for dishes of traditional Greek cuisine and one for herbs found in the Greek countryside.
The Food and Herbs databases contain more than 20,000 images of Greek dishes and herbs, collected either by hand using camera equipment or from the internet (e.g. via the Google Images platform), then manually annotated and separated into 37 and 26 semantic categories, respectively. Here are some samples from the databases along with their labels:
Food database samples
Herbs database samples
After an initial screening of the available CNN models, the ones chosen for further evaluation were SqueezeNet and MobileNetV1. Both provide hardware-friendly architectures focused mainly on saving memory resources and reducing the computational cost of integration on embedded and mobile devices, without jeopardizing the system's efficacy.
SqueezeNet achieves the same accuracy as AlexNet with 50× fewer weights. To achieve this, the following key ideas are introduced:
Replace 3×3 filters with 1×1 filters: a 1×1 filter has 9 times fewer parameters than a 3×3 filter.
Decrease the number of input channels to 3×3 filters: The number of parameters of a convolutional layer depends on the filter size, the number of channels, and the number of filters.
Allow convolution layers to have large activation maps by downsampling late in the network: this might sound counterintuitive, but since the model should be small, we need to get the best possible accuracy out of it. The later we downsample the data (e.g. by using strides > 1), the more information is retained for the layers in between, which increases accuracy.
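The parameter arithmetic behind the first two ideas is easy to verify directly. The sketch below uses hypothetical channel counts (64 in, 64 out), chosen only to illustrate the ratio; biases are ignored for simplicity:

```python
# Weight count of a conv layer: kernel_h * kernel_w * in_channels * filters (biases ignored).
def conv_params(kernel, in_ch, filters):
    return kernel * kernel * in_ch * filters

# Illustrative layer sizes (not taken from either network's actual configuration).
in_ch, filters = 64, 64
p3 = conv_params(3, in_ch, filters)  # 3x3 filters: 9 * 64 * 64 = 36864
p1 = conv_params(1, in_ch, filters)  # 1x1 filters: 1 * 64 * 64 = 4096
print(p3 / p1)  # -> 9.0, the "9 times fewer parameters" claim
```

The same function also makes the second idea concrete: since the count scales linearly with `in_ch`, halving the input channels of a 3×3 layer halves its parameters.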
All of the above are combined into a so-called fire module that is split into two layers:
A squeeze layer, consisting of 1×1 convolutions; each 1×1 filter combines information across all channels of the input, and using only a few such filters reduces the number of input channels for the next layer
An expansion layer, where 1×1 convolutions are mixed with 3×3 convolutions. The 1×1 convolutions combine the channels of the previous layer but can't detect spatial structures, while the 3×3 convolutions do detect those structures in the image. By combining two different filter sizes the model becomes more expressive while still reducing the number of parameters. Correct padding ensures that the outputs of the 1×1 and 3×3 convolutions have the same spatial size and can be stacked.
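As a sanity check, the parameter count of a fire module can be computed directly from the squeeze/expand structure described above. The concrete sizes below (96 input channels, 16 squeeze filters, 64 + 64 expand filters) are illustrative, not necessarily those used in this project:

```python
def conv_params(kernel, in_ch, filters):
    """Weight count of a conv layer (biases ignored for simplicity)."""
    return kernel * kernel * in_ch * filters

def fire_params(in_ch, squeeze, expand1x1, expand3x3):
    """A fire module: a 1x1 squeeze layer feeding parallel 1x1 and 3x3 expand layers."""
    s = conv_params(1, in_ch, squeeze)        # squeeze: 1x1 convs reduce channels
    e1 = conv_params(1, squeeze, expand1x1)   # expand branch 1: 1x1 convs
    e3 = conv_params(3, squeeze, expand3x3)   # expand branch 2: 3x3 convs
    return s + e1 + e3

fire = fire_params(96, 16, 64, 64)        # 1536 + 1024 + 9216 = 11776
plain = conv_params(3, 96, 128)           # a plain 3x3 conv with the same output width
print(fire, plain, round(plain / fire, 1))
```

The module produces the same number of output channels (64 + 64 = 128) as the plain 3×3 layer for roughly an order of magnitude fewer weights, because the cheap squeeze step shrinks the channel count seen by the expensive 3×3 filters.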
SqueezeNet uses 8 of these fire modules in total, with a single convolutional layer at the input and at the output, and replaces fully connected layers, which have a large number of parameters (compared to convolutional layers) and are prone to overfitting, with global average pooling.
MobileNet follows a different design approach: its architecture relies on depthwise separable convolutions, i.e. a depthwise convolution followed by a pointwise (1×1) convolution. The main idea behind this kind of convolution is to apply a filter to each channel separately and then combine the output channels with a 1×1 convolution. This leads to a great reduction in parameters, while keeping the accuracy nearly the same.
The Fire Module
Depthwise separable convolution
Depthwise separable conv layer
The initial datasets were split into training, validation and test subsets so as to better evaluate the performance of the networks under consideration. On the test subsets, the trained CNNs reached accuracies of 99.76% for food classification and 98.85% for herb classification, with top-3 accuracies of up to 99.97% and 99.95%, respectively.
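For readers unfamiliar with the top-3 metric: a prediction counts as correct if the true class is among the three highest-scoring classes. A small NumPy sketch with toy scores (illustrative numbers only, not project data), shown here for top-1 and top-2 on a 3-class problem:

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(scores, axis=1)[:, -k:]      # indices of the k largest scores per row
    hits = (topk == labels[:, None]).any(axis=1)   # true label anywhere in the top k?
    return hits.mean()

# Toy class scores for 4 samples over 3 classes.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5],
                   [0.6, 0.3, 0.1]])
labels = np.array([1, 0, 1, 2])
print(top_k_accuracy(scores, labels, 1))  # -> 0.5
print(top_k_accuracy(scores, labels, 2))  # -> 0.75
```

Top-k accuracy is always at least as high as plain accuracy, which is why the top-3 figures above exceed the top-1 figures.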
Apart from absolute accuracy results, the confusion matrix provides an error analysis across all possible category pairs, revealing, for each category, the classes with which its images are most likely to be confused. The confusion matrices for both applications are shown below:
Confusion Matrix for food classification
Confusion Matrix for herbs classification
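A confusion matrix of this kind is straightforward to compute from predicted and true labels; the sketch below uses toy labels for a 3-class problem (illustrative only, not the 37- or 26-class project data):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
print(cm)
# The diagonal holds correct predictions; off-diagonal entries show which
# class pairs get confused, e.g. cm[0, 1] = 1 means one class-0 sample
# was mislabeled as class 1.
```

Row-normalizing the matrix (dividing each row by its sum) turns the counts into per-class error rates, which is often how such matrices are visualized.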