Antonino Furnari

Postdoctoral Researcher

University of Catania

Department of Mathematics and Computer Science

Viale A. Doria 6 - 95125, Catania, Italy


Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

@article{Damen2018EPICKITCHENS,
  author  = {D. Damen and H. Doughty and G. M. Farinella and S. Fidler and A. Furnari and E. Kazakos and D. Moltisanti and J. Munro and T. Perrett and W. Price and M. Wray},
  title   = {Scaling Egocentric Vision: The EPIC-KITCHENS Dataset},
  journal = {arXiv preprint arXiv:1804.02748},
  year    = {2018}
}

We introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict non-scripted daily activities. Recording took place in 4 cities in North America and Europe, with participants belonging to 10 different nationalities, resulting in highly diverse kitchen habits and cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labeled for a total of 39.6K action segments and 454.2K object bounding boxes. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. This work is a joint collaboration between the University of Catania, the University of Bristol and the University of Toronto.

Egocentric Shopping Cart Localization

@inproceedings{spera2018egocentric,
  author    = {Emiliano Spera and Antonino Furnari and Sebastiano Battiato and Giovanni Maria Farinella},
  title     = {Egocentric Shopping Cart Localization},
  booktitle = {International Conference on Pattern Recognition (ICPR)},
  year      = {2018},
  pdf       = {publications/spera2018egocentric.pdf}
}

We investigate the new problem of egocentric shopping cart localization in retail stores and propose a novel large-scale dataset for image-based egocentric shopping cart localization. The dataset has been collected using cameras placed on shopping carts in a large retail store. It contains a total of 19,531 image frames, each labelled with its six-degrees-of-freedom (6DOF) pose. We study the localization problem by analysing how cart locations should be represented and estimated, and how to assess the localization results. We benchmark two families of algorithms: classic methods based on image retrieval and emerging methods based on regression.
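As background on how 6DOF localization results are commonly assessed (an illustrative sketch, not the paper's evaluation protocol), translation and orientation errors can be reported separately, the latter as the angle between predicted and ground-truth unit quaternions:

```python
import numpy as np

def pose_errors(t_pred, t_true, q_pred, q_true):
    """Position error (Euclidean distance) and orientation error (angle in
    degrees) between a predicted and a ground-truth 6DOF pose.
    Quaternions are assumed to be unit-norm."""
    t_err = float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_true)))
    # The angle between two unit quaternions is 2*arccos(|<q1, q2>|);
    # the absolute value accounts for the double cover of rotations.
    d = abs(float(np.dot(q_pred, q_true)))
    r_err = float(np.degrees(2.0 * np.arccos(min(1.0, d))))
    return t_err, r_err
```

Reporting the two errors separately (e.g., as medians over the test set) avoids mixing incommensurable units (meters vs. degrees).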

Evaluation of Egocentric Action Recognition

@inproceedings{furnari2017how,
  author    = {Furnari, Antonino and Battiato, Sebastiano and Farinella, Giovanni Maria},
  title     = {How Shall we Evaluate Egocentric Action Recognition?},
  booktitle = {International Workshop on Egocentric Perception, Interaction and Computing (EPIC) in conjunction with ICCV},
  year      = {2017},
  pdf       = {publications/furnari2017how.pdf}
}

Egocentric action analysis methods often assume that input videos are trimmed, and hence they tend to focus on action classification rather than recognition. Consequently, the adopted evaluation schemes are often unable to assess important properties of the desired action video segmentation output which are deemed meaningful in real scenarios (e.g., over-segmentation and boundary localization precision). To overcome the limits of current evaluation methodologies, we propose a set of measures aimed at quantitatively and qualitatively assessing the performance of egocentric action recognition methods. To improve the exploitability of current action classification methods in the recognition scenario, we investigate how frame-wise predictions can be turned into action-based temporal video segmentations. Experiments on both synthetic and real data show that the proposed set of measures can help improve evaluation and drive the design of egocentric action recognition methods.
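One simple way to turn frame-wise predictions into an action-based temporal segmentation (a minimal sketch under stated assumptions, not the procedure investigated in the paper) is to group consecutive frames sharing the same predicted label and discard very short runs, which are a typical source of over-segmentation:

```python
def frames_to_segments(labels, min_length=3):
    """Group per-frame class labels into (start, end, label) action segments.

    Runs shorter than min_length frames are treated as classification noise
    and discarded; min_length is an illustrative parameter, not a value
    taken from the paper.
    """
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the sequence or on a label change.
        if i == len(labels) or labels[i] != labels[start]:
            if i - start >= min_length:
                segments.append((start, i - 1, labels[start]))
            start = i
    return segments
```

For example, `frames_to_segments([0, 0, 0, 1, 0, 0, 0])` drops the isolated `1` and yields two segments of label `0`.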

Next-Active-Object Prediction from Egocentric Video

@article{furnari2017next,
  author  = {Antonino Furnari and Sebastiano Battiato and Kristen Grauman and Giovanni Maria Farinella},
  title   = {Next-active-object prediction from egocentric videos},
  journal = {Journal of Visual Communication and Image Representation},
  volume  = {49},
  number  = {Supplement C},
  pages   = {401--411},
  year    = {2017},
  issn    = {1047-3203}
}

Although First Person Vision systems can sense the environment from the user's perspective, they are generally unable to predict the user's intentions and goals. Since human activities can be decomposed in terms of atomic actions and interactions with objects, intelligent wearable systems would benefit from the ability to anticipate user-object interactions. Although this task is not trivial, the First Person Vision paradigm can provide important cues to address this challenge. Specifically, we propose to exploit the dynamics of the scene to recognize next-active-objects before an object interaction actually begins. We train a classifier to discriminate trajectories leading to an object activation from all others, and perform next-active-object prediction by analyzing fixed-length trajectory segments within a sliding window. We investigate what properties of egocentric object motion are most discriminative for the task and evaluate the temporal support with respect to which such motion should be considered. The proposed method compares favorably with respect to several baselines on the ADL egocentric dataset, which was acquired by 20 subjects and contains 10 hours of video of unconstrained interactions with several objects.
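The sliding-window setup can be sketched as follows; the window length, stride, and displacement-based features are illustrative assumptions, not the descriptor or parameters used in the paper:

```python
import numpy as np

def trajectory_windows(points, window=8, stride=1):
    """Slice an object trajectory (a sequence of N (x, y) centroids) into
    fixed-length, overlapping segments, one per sliding-window position."""
    points = np.asarray(points, dtype=float)
    return [points[i:i + window]
            for i in range(0, len(points) - window + 1, stride)]

def motion_features(segment):
    """Toy motion descriptor: per-frame displacements flattened into a single
    vector. A stand-in for whatever features a trained classifier would
    consume to score a segment as leading to an object activation."""
    return np.diff(segment, axis=0).ravel()
```

Each window would then be scored by the trajectory classifier, and windows exceeding a decision threshold flagged as leading to a next-active-object.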

Location-Based Temporal Segmentation of Egocentric Videos

A. Furnari, S. Battiato, G. M. Farinella, Personal-Location-Based Temporal Segmentation of Egocentric Video for Lifelogging Applications, submitted to the Journal of Visual Communication and Image Representation.

Temporal video segmentation can be useful to improve the exploitation of long egocentric videos. Previous work has focused on general-purpose methods designed to work on data acquired by different users. In contrast, egocentric data tends to be very personal and meaningful for the user who acquires it. In particular, being able to extract information related to personal locations can be very useful for life-logging applications such as indexing long egocentric videos, detecting semantically meaningful video segments for later retrieval or summarization, and estimating the amount of time spent at a given location. In this paper, we propose a method to segment egocentric videos on the basis of the locations visited by the user. The method is aimed at providing a personalized output and hence allows the user to specify which locations they want to keep track of. To account for negative locations (i.e., locations not specified by the user), we propose an effective negative rejection method which leverages the continuous nature of egocentric videos and does not require any negative samples at training time. To perform the experimental analysis, we collected a dataset of egocentric videos containing 10 personal locations of interest. Results show that the method is accurate and compares favorably with the state of the art.

Recognizing Personal Locations from Egocentric Videos

@article{furnari2016recognizing,
  author  = {Furnari, Antonino and Farinella, Giovanni Maria and Battiato, Sebastiano},
  title   = {Recognizing Personal Locations From Egocentric Videos},
  journal = {IEEE Transactions on Human-Machine Systems},
  year    = {2016},
  issn    = {2168-2291},
  doi     = {10.1109/THMS.2016.2612002},
  pdf     = {publications/furnari2016recognizing.pdf}
}

Contextual awareness in wearable computing allows for the construction of intelligent systems which are able to interact with the user in a more natural way. In this paper, we study how personal locations arising from the user's daily activities can be recognized from egocentric videos. We assume that few training samples are available for learning purposes. Considering the diversity of the devices available on the market, we introduce a benchmark dataset containing egocentric videos of 8 personal locations acquired by a user with 4 different wearable cameras. To make our analysis useful in real-world scenarios, we propose a method to reject negative locations, i.e., those not belonging to any of the categories of interest for the end-user. We assess the performance of the main state-of-the-art representations for scene and object classification on the considered task, as well as the influence of device-specific factors such as the Field of View (FOV) and the wearing modality. Concerning the device-specific factors, experiments reveal that the best results are obtained using a head-mounted, wide-angle device. Our analysis shows the effectiveness of representations based on Convolutional Neural Networks (CNN), combined with basic transfer learning techniques and an entropy-based rejection algorithm.
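The entropy-based rejection idea can be illustrated with a minimal sketch: if the entropy of the classifier's posterior over the known locations is high, the prediction is too uncertain and the frame is rejected as a negative location. The function name and threshold below are hypothetical; this shows the principle, not the paper's algorithm:

```python
import numpy as np

def entropy_reject(probs, threshold):
    """Return the predicted location index, or -1 (negative/unknown location)
    when the entropy of the class posterior exceeds the threshold."""
    probs = np.asarray(probs, dtype=float)
    # Shannon entropy of the posterior; the epsilon guards against log(0).
    h = -np.sum(probs * np.log(probs + 1e-12))
    return -1 if h > threshold else int(np.argmax(probs))
```

A confident posterior such as `[0.97, 0.01, 0.01, 0.01]` is accepted, while a near-uniform one is rejected, since uniform posteriors maximize entropy.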

Distortion Adaptive Sobel Filters

@article{furnari2017distortion,
  author  = {Antonino Furnari and Giovanni Maria Farinella and Arcangelo Ranieri Bruna and Sebastiano Battiato},
  title   = {Distortion Adaptive Sobel Filters for the Gradient Estimation of Wide Angle Images},
  journal = {Journal of Visual Communication and Image Representation},
  volume  = {46},
  pages   = {165--175},
  month   = {July},
  year    = {2017},
  doi     = {10.1016/j.jvcir.2017.03.019},
  pdf     = {publications/furnari2017distortion.pdf}
}

We present a family of adaptive Sobel filters for the geometrically correct estimation of the gradients of wide angle images. The proposed filters can be useful in a number of application domains exploiting wide angle cameras, such as surveillance, automotive and robotics. The filters are based on Sobel's rationale and account for the geometric transformation undergone by wide angle images due to the presence of radial distortion. The proposed method is evaluated on a benchmark dataset of images belonging to different scene categories related to applications where wide angle lenses are commonly used and image gradients are often employed. We also propose an objective evaluation procedure to assess the estimation of the gradients of wide angle images. Experiments show that our approach outperforms the current state of the art in both gradient estimation and keypoint matching.

Affine Covariant Feature Extraction on Fisheye Images

@article{furnari2017affine,
  author  = {A. Furnari and G. M. Farinella and A. R. Bruna and S. Battiato},
  title   = {Affine Covariant Features for Fisheye Distortion Local Modeling},
  journal = {IEEE Transactions on Image Processing},
  volume  = {26},
  number  = {2},
  pages   = {696--710},
  year    = {2017},
  issn    = {1057-7149},
  doi     = {10.1109/TIP.2016.2627816},
  pdf     = {publications/furnari2017affine.pdf}
}

Perspective cameras are the most popular imaging sensors used in Computer Vision. However, many application fields, including automotive, surveillance and robotics, require the use of wide angle cameras (e.g., fisheye), which make it possible to acquire a larger portion of the scene with a single device, at the cost of introducing noticeable radial distortion in the images. Affine covariant feature detectors have proven successful in a variety of Computer Vision applications, including object recognition, image registration and visual search. Moreover, their robustness to a series of variabilities related to both the scene and the image acquisition process has been thoroughly studied in the literature. In this paper, we investigate their effectiveness on fisheye images, providing both theoretical and experimental analyses. As a theoretical outcome, we show that even if the radial distortion is not an affine transformation, it can be locally approximated as a linear function with a reasonably small error. The experimental analysis builds on Mikolajczyk's benchmark to assess the robustness of three popular affine region detectors (i.e., the Maximally Stable Extremal Regions (MSER), Harris and Hessian affine region detectors) with respect to different variabilities as well as radial distortion. To support the evaluations, we rely on the Oxford dataset and introduce a novel benchmark dataset comprising 50 images depicting different scene categories. The experiments show that the affine region detectors can be effectively employed directly on fisheye images and that the radial distortion is locally modelled as an additional affine variability.
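The local-linearity claim can be checked numerically on a toy radial distortion model. The one-parameter division model and the parameter value below are assumptions for illustration only (the paper's analysis covers general radial distortion): around a point, the true distorted displacement is compared against its first-order (affine) approximation through the numerically estimated Jacobian.

```python
import numpy as np

def distort(p, k=-0.35):
    """Toy one-parameter radial distortion (division model, illustrative
    only): p -> p / (1 + k * ||p||^2), with p in normalized coordinates."""
    r2 = np.dot(p, p)
    return p / (1.0 + k * r2)

def local_affine_error(p0, delta, k=-0.35, eps=1e-6):
    """Norm of the gap between the true distorted displacement around p0
    and its first-order (affine) approximation via central differences."""
    # Numerically estimate the 2x2 Jacobian of the distortion at p0.
    J = np.column_stack([
        (distort(p0 + eps * e, k) - distort(p0 - eps * e, k)) / (2 * eps)
        for e in np.eye(2)])
    true = distort(p0 + delta, k) - distort(p0, k)
    return float(np.linalg.norm(true - J @ delta))
```

For small neighborhoods the error shrinks quadratically with the displacement, which is the sense in which the distortion is "locally affine".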

Evaluation of Saliency Detection

@inproceedings{furnari2014experimental,
  author    = {A. Furnari and G. M. Farinella and S. Battiato},
  title     = {An Experimental Analysis of Saliency Detection with respect to Three Saliency Levels},
  booktitle = {Workshop on Assistive Computer Vision and Robotics (ACVR) in conjunction with ECCV, Zurich, Switzerland, September 12},
  series    = {Lecture Notes in Computer Science},
  volume    = {8927},
  pages     = {806--821},
  publisher = {Springer},
  year      = {2014},
  doi       = {10.1007/978-3-319-16199-0_56},
  pdf       = {publications/furnari2014experimental.pdf}
}

Saliency detection is a useful tool for video-based, real-time Computer Vision applications. It allows selecting the locations of the scene which are the most relevant, and has been used in a number of related assistive technologies such as life-logging, memory augmentation and object detection for the visually impaired, as well as to study autism and Parkinson's disease. Many works focusing on different aspects of saliency have been proposed in the literature, defining saliency in different ways depending on the task. In this paper, we perform an experimental analysis focusing on three levels where saliency is defined in different ways, namely visual attention modelling, salient object detection and salient object segmentation. We review the main evaluation datasets, specifying the saliency level which they best describe. Through the experiments, we show that the performance of saliency algorithms depends on the level with respect to which they are evaluated and on the nature of the stimuli used for the benchmark. Moreover, we show that eye fixation maps can be effectively used to perform salient object detection and segmentation, which suggests that pre-attentive bottom-up information can still be exploited to improve high-level tasks such as salient object detection. Finally, we show that benchmarking a saliency detection algorithm with respect to a single dataset/saliency level can lead to erroneous results, and conclude that multiple datasets/saliency levels should be considered in the evaluations.
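The observation that fixation maps can drive salient object detection can be illustrated with a simple thresholding baseline (a sketch, not one of the benchmarked algorithms; the quantile value is an arbitrary assumption):

```python
import numpy as np

def fixation_to_mask(fixation_map, quantile=0.9):
    """Binarize a (smoothed) eye-fixation map into a candidate salient-object
    mask by keeping only the top fraction of saliency values."""
    t = np.quantile(fixation_map, quantile)
    return fixation_map >= t
```

The resulting binary mask can then be scored against a ground-truth object annotation, which is one way datasets designed for attention modelling can be reused for the object-level saliency tasks discussed above.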

Vehicle Tracking

@article{battiato2015integrated,
  author  = {S. Battiato and G. M. Farinella and A. Furnari and G. Puglisi and A. Snijders and J. Spiekstra},
  title   = {An integrated system for vehicle tracking and classification},
  journal = {Expert Systems with Applications},
  volume  = {42},
  number  = {21},
  pages   = {7263--7275},
  year    = {2015},
  doi     = {10.1016/j.eswa.2015.05.055},
  pdf     = {publications/battiato2015integrated.pdf}
}

We present a unified system for vehicle tracking and classification which has been developed with a data-driven approach on real-world data. The main purpose of the system is tracking vehicles to understand lane changes, gate transits and other behaviors useful for traffic analysis. The discrimination of vehicles into two classes (cars vs. trucks) is also required for electronic truck tolling. Both tracking and classification are performed online by a system made up of two components (a tracker and a classifier) plus a controller which automatically adapts the configuration of the system to the observed conditions. Experiments show that the proposed system outperforms state-of-the-art algorithms on the considered data.