Antonino Furnari

Postdoctoral Researcher

University of Catania

Department of Mathematics and Computer Science

Viale A. Doria 6 - 95125, Catania, Italy

Rolling-Unrolling LSTMs for Egocentric Action Anticipation

@inproceedings{furnari2019rulstm, title = { What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention }, author = { Antonino Furnari and Giovanni Maria Farinella }, year = { 2019 }, booktitle = { International Conference on Computer Vision }, pdf = {}, url = {} }

Egocentric action anticipation consists in understanding which objects the camera wearer will interact with in the near future and which actions they will perform. We tackle the problem by proposing an architecture able to anticipate actions at multiple temporal scales using two LSTMs to 1) summarize the past, and 2) formulate predictions about the future. The input video is processed considering three complementary modalities: appearance (RGB), motion (optical flow) and objects (object-based features). Modality-specific predictions are fused using a novel Modality ATTention (MATT) mechanism which learns to weigh modalities in an adaptive fashion. Extensive evaluations on two large-scale benchmark datasets show that our method outperforms prior art by up to +7% on the challenging EPIC-Kitchens dataset including more than 2500 actions, and generalizes to EGTEA Gaze+. Our approach is also shown to generalize to the tasks of early action recognition and action recognition. Our method is ranked first in the public leaderboard of the EPIC-Kitchens egocentric action anticipation challenge 2019. Web Page - Code.
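The fusion step described above can be illustrated with a minimal sketch. It assumes that each modality branch outputs a vector of class scores and that one attention logit per modality is already available. In the paper the attention weights are learned; here `attention_logits` is a hypothetical input, and the function names are illustrative rather than taken from the released code.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(modality_scores, attention_logits):
    """Weighted sum of per-modality class scores.

    modality_scores: one list of class scores per modality.
    attention_logits: one unnormalized weight per modality (assumed given).
    """
    weights = softmax(attention_logits)
    num_classes = len(modality_scores[0])
    fused = [0.0] * num_classes
    for w, scores in zip(weights, modality_scores):
        for c in range(num_classes):
            fused[c] += w * scores[c]
    return fused, weights


# Toy scores for three classes from the RGB, flow and object branches.
rgb = [2.0, 0.5, 0.1]
flow = [0.3, 1.5, 0.2]
obj = [1.0, 1.0, 3.0]

# Equal logits reduce the fusion to a plain average of the three branches.
fused, weights = fuse_modalities([rgb, flow, obj], [0.0, 0.0, 0.0])
```

Raising one logit relative to the others shifts the fused scores toward that modality, which is the adaptive behaviour the abstract refers to.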


@inproceedings{Damen2018EPICKITCHENS, year = {2018}, booktitle= { European Conference on Computer Vision }, author = { D. Damen and H. Doughty and G. M. Farinella and S. Fidler and A. Furnari and E. Kazakos and D. Moltisanti and J. Munro and T. Perrett and W. Price and M. Wray }, title = { Scaling Egocentric Vision: The EPIC-KITCHENS Dataset }, url={}, pdf={} }

We introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict nonscripted daily activities. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse kitchen habits and cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labeled for a total of 39.6K action segments and 454.2K object bounding boxes. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. This work is a joint collaboration between the University of Catania, the University of Bristol and the University of Toronto. Web Page

Verb-Noun Marginal Cross Entropy Loss for Egocentric Action Anticipation

@inproceedings{furnari2018Leveraging, author = { A. Furnari and S. Battiato and G. M. Farinella }, title = { Leveraging Uncertainty to Rethink Loss Functions and Evaluation Measures for Egocentric Action Anticipation }, booktitle = { International Workshop on Egocentric Perception, Interaction and Computing (EPIC) in conjunction with ECCV }, pdf = { publications/furnari2018Leveraging.pdf }, url = {}, year = { 2018 }, }

Current action anticipation approaches often neglect the intrinsic uncertainty of future predictions when loss functions or evaluation measures are designed. The uncertainty of future observations is especially relevant in the context of egocentric visual data, which is naturally exposed to a great deal of variability. Considering the problem of egocentric action anticipation, we investigate how loss functions and evaluation measures can be designed to explicitly take into account the natural multi-modality of future events. In particular, we discuss suitable measures to evaluate egocentric action anticipation and study how loss functions can be defined to incorporate the uncertainty arising from the prediction of future events. Experiments performed on the EPIC-KITCHENS dataset show that the proposed loss function allows improving the results of both egocentric action anticipation and recognition methods. Code
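The abstract does not spell out the loss, so the following is only one plausible reading of the "verb-noun marginal" idea named in the heading. It assumes each action is a (verb, noun) pair and obtains verb and noun probabilities by summing action probabilities over the actions that share a verb or a noun. The function name and the toy action list are hypothetical.

```python
import math


def marginal_cross_entropy(action_probs, action_to_group, target_group):
    """Cross entropy on marginal probabilities.

    Action probabilities are summed over all actions that map to the same
    group (a verb or a noun), and the loss is the negative log of the
    marginal probability assigned to the target group.
    """
    marginal = {}
    for action, p in enumerate(action_probs):
        group = action_to_group[action]
        marginal[group] = marginal.get(group, 0.0) + p
    return -math.log(marginal[target_group])


# Toy action vocabulary (assumed for illustration only).
actions = [("take", "cup"), ("take", "plate"), ("wash", "cup"), ("wash", "plate")]
probs = [0.4, 0.3, 0.2, 0.1]
verbs = [verb for verb, _ in actions]
nouns = [noun for _, noun in actions]

# Ground truth is ("take", "cup"); both marginals contribute a loss term.
verb_loss = marginal_cross_entropy(probs, verbs, "take")
noun_loss = marginal_cross_entropy(probs, nouns, "cup")
```

Under this reading, a prediction such as "take plate" is penalized less than "wash plate" when the true action is "take cup", because it still places probability mass on the correct verb.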

Egocentric Visitors Localization in Cultural Sites

@article{ragusa2019egocentric, title = { Egocentric Visitors Localization in Cultural Sites }, journal = { Journal on Computing and Cultural Heritage (JOCCH) }, year = { 2019 }, pdf = { publications/ragusa2019egocentric.pdf }, url = { }, author = { F. Ragusa and A. Furnari and S. Battiato and G. Signorello and G. M. Farinella }, }

We consider the problem of localizing visitors in a cultural site from egocentric (first person) images. Localization information can be useful both to assist the user during his visit (e.g., by suggesting where to go and what to see next) and to provide behavioral information to the manager of the cultural site (e.g., how much time has been spent by visitors at a given location? What has been liked most?). To tackle the problem, we collected a large dataset of egocentric videos using two cameras: a head-mounted HoloLens device and a chest-mounted GoPro. Each frame has been labeled according to the location of the visitor and to what he was looking at. The dataset is freely available in order to encourage research in this domain. The dataset is complemented with baseline experiments performed considering a state-of-the-art method for location-based temporal segmentation of egocentric videos. Experiments show that compelling results can be achieved to extract useful information for both the visitor and the site-manager. Web Page

Market Basket Analysis from Egocentric Videos

@article{Santarcangelo2018VMBA, pdf = { publications/santarcangelo2018market.pdf }, title = { Market Basket Analysis from Egocentric Videos }, journal = { Pattern Recognition Letters }, pages = { 83-90 }, issue = { 1 }, volume = { 112 }, year = { 2018 }, url = { }, author = { V. Santarcangelo and G. M. Farinella and A. Furnari and S. Battiato }, doi = { }, }

This paper presents Visual Market Basket Analysis (VMBA), a novel application domain for egocentric vision systems. The final goal of VMBA is to infer the behaviour of the customers of a store during their shopping. The analysis relies on image sequences acquired by cameras mounted on shopping carts. The inferred behaviours can be coupled with classic Market Basket Analysis information (i.e., receipts) to help retailers to improve the management of spaces and marketing strategies. To set up the challenge, we collected a new dataset of egocentric videos during real shopping sessions in a retail store. Video frames have been labelled according to a proposed hierarchy of 14 different customer behaviours from the beginning (cart picking) to the end (cart releasing) of their shopping. We benchmark different representation and classification techniques and propose a multi-modal method which exploits visual, motion and audio descriptors to perform classification with the Directed Acyclic Graph SVM learning architecture. Experiments highlight that employing multimodal representations and explicitly addressing the task in a hierarchical way is beneficial. The devised approach based on Deep Features achieves an accuracy of more than 87% over the 14 classes of the considered dataset. Web Page

Egocentric Shopping Cart Localization

@inproceedings{spera2018egocentric, author = { Emiliano Spera and Antonino Furnari and Sebastiano Battiato and Giovanni Maria Farinella }, title = { Egocentric Shopping Cart Localization }, pdf = {publications/spera2018egocentric.pdf}, url = {}, booktitle = { International Conference on Pattern Recognition (ICPR) }, year = {2018}, }

We investigate the new problem of egocentric shopping cart localization in retail stores. We propose a novel large-scale dataset for image-based egocentric shopping cart localization. The dataset has been collected using cameras placed on shopping carts in a large retail store. It contains a total of 19,531 image frames, each labelled with its six Degrees Of Freedom pose. We study the localization problem by analysing how cart locations should be represented and estimated, and how to assess the localization results. We benchmark two families of algorithms: classic methods based on image retrieval and emerging methods based on regression. Web Page

Next-Active-Object-Prediction from Egocentric Video

@article{furnari2017next, title = { Next-active-object prediction from egocentric videos }, journal = { Journal of Visual Communication and Image Representation }, volume = { 49 }, number = { Supplement C }, pages = { 401 - 411 }, year = { 2017 }, issn = { 1047-3203 }, doi = { }, url = { }, author = { Antonino Furnari and Sebastiano Battiato and Kristen Grauman and Giovanni Maria Farinella }, }

Although First Person Vision systems can sense the environment from the user's perspective, they are generally unable to predict his intentions and goals. Since human activities can be decomposed in terms of atomic actions and interactions with objects, intelligent wearable systems would benefit from the ability to anticipate user-object interactions. Even if this task is not trivial, the First Person Vision paradigm can provide important cues useful to address this challenge. Specifically, we propose to exploit the dynamics of the scene to recognize next-active-objects before an object interaction actually begins. We train a classifier to discriminate trajectories leading to an object activation from all others, and perform next-active-object prediction by analyzing fixed-length trajectory segments within a sliding window. We investigate what properties of egocentric object motion are most discriminative for the task and evaluate the temporal support with respect to which such motion should be considered. The proposed method compares favorably with respect to several baselines on the ADL egocentric dataset, which has been acquired by 20 subjects and contains 10 hours of video of unconstrained interactions with several objects. Web Page
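The sliding-window scheme can be sketched as follows. The sketch assumes an object track is a list of (x, y) positions, one per frame. `toy_score` is a hypothetical stand-in for the trained trajectory classifier, which the abstract does not specify.

```python
import math


def sliding_windows(trajectory, length, stride=1):
    """Yield (start_index, segment) for every fixed-length segment."""
    for start in range(0, len(trajectory) - length + 1, stride):
        yield start, trajectory[start:start + length]


def toy_score(segment):
    """Hypothetical classifier stand-in: net displacement over the segment."""
    (x0, y0), (x1, y1) = segment[0], segment[-1]
    return math.hypot(x1 - x0, y1 - y0)


def detect(trajectory, length, score_fn, threshold):
    """Return the start indices of the windows scored at or above threshold."""
    return [
        start
        for start, segment in sliding_windows(trajectory, length)
        if score_fn(segment) >= threshold
    ]


# An object that is static for a few frames and then starts moving.
track = [(0, 0), (0, 0), (0, 0), (1, 0), (3, 0), (6, 0)]
hits = detect(track, length=3, score_fn=toy_score, threshold=2.0)
```

The window length plays the role of the temporal support mentioned in the abstract: longer windows see more motion context but produce predictions later.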

Evaluation of Egocentric Action Recognition

@inproceedings {furnari2017how, author = "Furnari, Antonino and Battiato, Sebastiano and Farinella, Giovanni Maria ", title = "How Shall we Evaluate Egocentric Action Recognition?", booktitle = "International Workshop on Egocentric Perception, Interaction and Computing (EPIC) in conjunction with ICCV", year = "2017", url = "", pdf = "publications/furnari2017how.pdf" }

Egocentric action analysis methods often assume that input videos are trimmed and hence they tend to focus on action classification rather than recognition. Consequently, adopted evaluation schemes are often unable to assess important properties of the desired action video segmentation output, which are deemed to be meaningful in real scenarios (e.g., oversegmentation and boundary localization precision). To overcome the limits of current evaluation methodologies, we propose a set of measures aimed to quantitatively and qualitatively assess the performance of egocentric action recognition methods. To improve exploitability of current action classification methods in the recognition scenario, we investigate how frame-wise predictions can be turned into action-based temporal video segmentations. Experiments on both synthetic and real data show that the proposed set of measures can help to improve evaluation and to drive the design of egocentric action recognition methods. Web Page + Code
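The simplest way to turn frame-wise predictions into a temporal segmentation is to merge runs of identical consecutive labels. The sketch below shows only this baseline idea and is not the procedure studied in the paper; the function name is illustrative.

```python
from itertools import groupby


def frames_to_segments(labels):
    """Group consecutive equal frame labels into (label, first, last) segments."""
    segments = []
    start = 0
    for label, run in groupby(labels):
        run_length = len(list(run))
        segments.append((label, start, start + run_length - 1))
        start += run_length
    return segments


# A single misclassified frame (index 2) splits one action into three segments.
frame_labels = ["cut", "cut", "pour", "cut", "cut", "pour", "pour"]
segments = frames_to_segments(frame_labels)
```

The toy input shows why oversegmentation matters as an evaluation property: frame-level accuracy barely changes with one wrong frame, while the number of predicted segments doubles.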

Location-Based Temporal Segmentation of Egocentric Videos

@article{furnari2018personal, pages = { 1-12 }, volume = { 52 }, doi = { }, issn = { 1047-3203 }, author = { Antonino Furnari and Sebastiano Battiato and Giovanni Maria Farinella }, url = { }, pdf = { publications/furnari2018personal.pdf }, year = { 2018 }, journal = { Journal of Visual Communication and Image Representation }, title = { Personal-Location-Based Temporal Segmentation of Egocentric Video for Lifelogging Applications }, }

@inproceedings{furnari2016temporal, url = { }, pdf = { publications/furnari2016temporal.pdf }, year = { 2016 }, publisher = { Springer Lecture Notes in Computer Science }, series = { Lecture Notes in Computer Science }, volume = { 9913 }, pages = { 474--489 }, booktitle = { International Workshop on Egocentric Perception, Interaction and Computing (EPIC) in conjunction with ECCV, The Netherlands, Amsterdam, October 9 }, title = { Temporal Segmentation of Egocentric Videos to Highlight Personal Locations of Interest }, author = { Antonino Furnari and Giovanni Maria Farinella and Sebastiano Battiato }, }

Temporal video segmentation can be useful to improve the exploitation of long egocentric videos. Previous work has focused on general purpose methods designed to work on data acquired by different users. In contrast, egocentric data tends to be very personal and meaningful for the user who acquires it. In particular, being able to extract information related to personal locations can be very useful for life-logging related applications such as indexing long egocentric videos, detecting semantically meaningful video segments for later retrieval or summarization, and estimating the amount of time spent at a given location. In this paper, we propose a method to segment egocentric videos on the basis of the locations visited by the user. The method is aimed at providing a personalized output and hence it allows the user to specify which locations he wants to keep track of. To account for negative locations (i.e., locations not specified by the user), we propose an effective negative rejection method which leverages the continuous nature of egocentric videos and does not require any negative sample at training time. To perform experimental analysis, we collected a dataset of egocentric videos containing 10 personal locations of interest. Results show that the method is accurate and compares favorably with the state of the art. Web Page

Recognizing Personal Locations from Egocentric Videos

@article{furnari2016recognizing, author={Furnari, Antonino and Farinella, Giovanni Maria and Battiato, Sebastiano}, journal={IEEE Transactions on Human-Machine Systems}, title={Recognizing Personal Locations From Egocentric Videos}, year={2016}, doi={10.1109/THMS.2016.2612002}, ISSN={2168-2291}, url={}, pdf={publications/furnari2016recognizing.pdf} }

@inproceedings{furnari2015recognizing, url = { }, pdf = { publications/furnari2015recognizing.pdf }, year = { 2015 }, booktitle = { Workshop on Assistive Computer Vision and Robotics (ACVR) in conjunction with ICCV, Santiago, Chile, December 12 }, pages = { 393--401 }, title = { Recognizing Personal Contexts from Egocentric Images }, author = { Antonino Furnari and Giovanni Maria Farinella and Sebastiano Battiato }, }

Contextual awareness in wearable computing allows for construction of intelligent systems which are able to interact with the user in a more natural way. In this paper, we study how personal locations arising from the user’s daily activities can be recognized from egocentric videos. We assume that few training samples are available for learning purposes. Considering the diversity of the devices available on the market, we introduce a benchmark dataset containing egocentric videos of 8 personal locations acquired by a user with 4 different wearable cameras. To make our analysis useful in real-world scenarios, we propose a method to reject negative locations, i.e., those not belonging to any of the categories of interest for the end-user. We assess the performance of the main state-of-the-art representations for scene and object classification on the considered task, as well as the influence of device-specific factors such as the Field of View (FOV) and the wearing modality. Concerning the different device-specific factors, experiments revealed that the best results are obtained using a head-mounted, wide-angle device. Our analysis shows the effectiveness of using representations based on Convolutional Neural Networks (CNN), employing basic transfer learning techniques and an entropy-based rejection algorithm. Web Page
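The idea behind entropy-based rejection can be sketched in a few lines. The sketch assumes the classifier outputs a posterior distribution over the known locations, and that a sample is rejected as negative when the entropy of that distribution exceeds a threshold. The threshold value and function names are hypothetical, and the paper's algorithm may differ in its details.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def classify_or_reject(probs, max_entropy):
    """Return the index of the most likely class, or None to reject.

    A flat posterior (high entropy) suggests the input belongs to none of
    the known locations, so it is rejected as a negative sample.
    """
    if entropy(probs) > max_entropy:
        return None
    return max(range(len(probs)), key=probs.__getitem__)


confident = [0.97, 0.01, 0.01, 0.01]   # peaked posterior: accepted
uncertain = [0.25, 0.25, 0.25, 0.25]   # flat posterior: rejected

accepted = classify_or_reject(confident, max_entropy=0.5)
rejected = classify_or_reject(uncertain, max_entropy=0.5)
```

In practice the threshold would have to be tuned on held-out data, since its useful range depends on the number of classes (the maximum entropy is the logarithm of the class count).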

Distortion Adaptive Sobel Filters

@article{furnari2017distortion, url = { }, pdf = { publications/furnari2017distortion.pdf }, author = { Antonino Furnari and Giovanni Maria Farinella and Arcangelo Ranieri Bruna and Sebastiano Battiato }, doi = { 10.1016/j.jvcir.2017.03.019 }, year = { 2017 }, month = { July }, pages = { 165 - 175 }, volume = { 46 }, journal = { Journal of Visual Communication and Image Representation }, title = { Distortion Adaptive Sobel Filters for the Gradient Estimation of Wide Angle Images }, }

@inproceedings{furnari2015generalized, url = { }, pdf = { publications/furnari2015generalized.pdf }, booktitle = { IEEE International Conference on Image Processing (ICIP), Quebec, Canada, September 27-30 }, pages = { 3250-3254 }, year = { 2015 }, title = { Generalized Sobel Filters for Gradient Estimation of Distorted Images }, author = { Antonino Furnari and Giovanni Maria Farinella and Arcangelo Bruna and Sebastiano Battiato }, }

@inproceedings{furnari2015distortion, url = { }, pdf = { publications/furnari2015distortion.pdf }, doi = { 10.1007/978-3-319-23234-8_20 }, pages = { 205--215 }, series = { Lecture Notes in Computer Science }, volume = { 9280 }, year = { 2015 }, publisher = { Springer Lecture Notes in Computer Science }, booktitle = { International Conference on Image Analysis and Processing (ICIAP), Genova, Italy, September 7-11 }, title = { Distortion Adaptive Descriptors: Extending Gradient-Based Descriptors to Wide Angle Images }, author = { Antonino Furnari and Giovanni Maria Farinella and Arcangelo Ranieri Bruna and Sebastiano Battiato }, }

We present a family of adaptive Sobel filters for the geometrically correct estimation of the gradients of wide angle images. The proposed filters can be useful in a number of application domains exploiting wide angle cameras, as for instance, surveillance, automotive and robotics. The filters are based on Sobel's rationale and account for the geometric transformation undergone by wide angle images due to the presence of radial distortion. The proposed method is evaluated on a benchmark dataset of images belonging to different scene categories related to applications where wide angle lenses are commonly used and image gradients are often employed. We also propose an objective evaluation procedure to assess the estimation of the gradient of wide angle images. Experiments show that our approach outperforms the current state-of-the-art in both gradient estimation and keypoint matching. Web Page

Affine Covariant Feature Extraction on Fisheye Images


@article{furnari2017affine, url = { }, pdf = { publications/furnari2017affine.pdf }, issn = { 1057-7149 }, doi = { 10.1109/TIP.2016.2627816 }, pages = { 696-710 }, number = { 2 }, volume = { 26 }, year = { 2017 }, title = { Affine Covariant Features for Fisheye Distortion Local Modeling }, journal = { IEEE Transactions on Image Processing }, author = { A. Furnari and G. M. Farinella and A. R. Bruna and S. Battiato }, }

@inproceedings{furnari2014affine, url = { }, pdf = { publications/furnari2014affine.pdf }, pages = { 5681--5685 }, doi = { 10.1109/ICIP.2014.7026149 }, booktitle = { IEEE International Conference on Image Processing, Paris, France, October 27-30 }, year = { 2014 }, title = { Affine Region Detectors on the Fisheye Domain (ICIP) }, author = { Antonino Furnari and Giovanni Maria Farinella and Giovanni Puglisi and Arcangelo Ranieri Bruna and Sebastiano Battiato }, }

Perspective cameras are the most popular imaging sensors used in Computer Vision. However, many application fields including automotive, surveillance and robotics, require the use of wide angle cameras (e.g., fisheye), which allow acquiring a larger portion of the scene using a single device at the cost of the introduction of noticeable radial distortion in the images. Affine covariant feature detectors have proven successful in a variety of Computer Vision applications including object recognition, image registration and visual search. Moreover, their robustness to a series of variabilities related to both the scene and the image acquisition process has been thoroughly studied in the literature. In this paper, we investigate their effectiveness on fisheye images providing both theoretical and experimental analyses. As a theoretical outcome, we show that even if the radial distortion is not an affine transformation, it can be locally approximated as a linear function with a reasonably small error. The experimental analysis builds on Mikolajczyk's benchmark to assess the robustness of three popular affine region detectors (i.e., Maximally Stable Extremal Regions (MSER), Harris and Hessian affine region detectors), with respect to different variabilities as well as radial distortion. To support the evaluations, we rely on the Oxford dataset and introduce a novel benchmark dataset comprising 50 images depicting different scene categories. The experiments show that the affine region detectors can be effectively employed directly on fisheye images and that the radial distortion is locally modelled as an additional affine variability. Web Page

Evaluation of Saliency Detection

@inproceedings{furnari2014experimental, pdf = { publications/furnari2014experimental.pdf }, publisher = { Springer Lecture Notes in Computer Science }, volume = { 8927 }, series = { Lecture Notes in Computer Science }, pages = { 806-821 }, doi = { 10.1007/978-3-319-16199-0_56 }, booktitle = { Workshop on Assistive Computer Vision and Robotics (ACVR) in conjunction with ECCV, Zurich, Switzerland, September 12 }, year = { 2014 }, title = { An Experimental Analysis of Saliency Detection with respect to Three Saliency Levels }, author = { A. Furnari and G. M. Farinella and S. Battiato }, }

Saliency detection is a useful tool for video-based, real-time Computer Vision applications. It allows selecting which locations of the scene are the most relevant and has been used in a number of related assistive technologies such as life-logging, memory augmentation and object detection for the visually impaired, as well as to study autism and Parkinson’s disease. Many works focusing on different aspects of saliency have been proposed in the literature, defining saliency in different ways depending on the task. In this paper we perform an experimental analysis focusing on three levels where saliency is defined in different ways, namely visual attention modelling, salient object detection and salient object segmentation. We review the main evaluation datasets specifying the level of saliency which they best describe. Through the experiments we show that the performance of the saliency algorithms depends on the level with respect to which they are evaluated and on the nature of the stimuli used for the benchmark. Moreover, we show that the eye fixation maps can be effectively used to perform salient object detection and segmentation, which suggests that pre-attentive bottom-up information can still be exploited to improve high level tasks such as salient object detection. Finally, we show that benchmarking a saliency detection algorithm with respect to a single dataset/saliency level can lead to erroneous results and conclude that many datasets/saliency levels should be considered in the evaluations.

Vehicle Tracking

@article{battiato2015integrated, pdf = {publications/battiato2015integrated.pdf}, doi = {10.1016/j.eswa.2015.05.055}, pages = {7263--7275}, number = {21}, volume = {42}, year = {2015}, journal = {Expert Systems with Applications}, title = {An integrated system for vehicle tracking and classification}, author = {S. Battiato and G. M. Farinella and A. Furnari and G. Puglisi and A. Snijders and J. Spiekstra}, }

@inproceedings{battiato2014vehicle, url = { }, pdf = { publications/battiato2014vehicle.pdf }, pages = { 755-760 }, volume = { 2 }, year = { 2014 }, title = { Vehicle tracking based on customized template matching }, booktitle = { VISAPP International Conference on Computer Vision Theory and Applications, Lisbon, Portugal, January 5-8 }, author = { Sebastiano Battiato and Giovanni Maria Farinella and Antonino Furnari and Giovanni Puglisi and Anique Snijders and Jelmer Spiekstra }, }

@inproceedings{Battiato2016, url = { }, doi = { 10.1007/978-3-319-23413-7_2 }, isbn = { 978-3-319-23413-7 }, pages = { 5--7 }, publisher = { Springer International Publishing }, year = { 2016 }, booktitle = { Progress in Industrial Mathematics at ECMI 2014 }, title = { A Customized System for Vehicle Tracking and Classification }, editor = { G. Russo and V. Capasso and G. Nicosia and V. Romano }, author = { S. Battiato and G. M. Farinella and A. Furnari and G. Puglisi }, }

We present a unified system for vehicle tracking and classification which has been developed with a data-driven approach on real-world data. The main purpose of the system is the tracking of the vehicles to understand lane changes, gate transits and other behaviors useful for traffic analysis. The discrimination of the vehicles into two classes (cars vs. trucks) is also required for electronic truck-tolling. Both tracking and classification are performed online by a system made up of two components (tracker and classifier) plus a controller which automatically adapts the configuration of the system to the observed conditions. Experiments show that the proposed system outperforms the state-of-the-art algorithms on the considered data.