
Creating a generic object tracker

Introduction

The goal for the summer of 2017 is to create a generic object tracker, in the sense that it should not be tied to a specific type of object. In fact, we would like to make it so generic that it can track any object. Of course, we then cannot demand that the tracker always knows what kind of object it is tracking.

The inspiration for this project came from the article Recurrent Models of Visual Attention and its sequel Multiple Object Recognition with Visual Attention. We will, however, not follow this architecture strictly; in fact, we will try to use supervised learning as much as possible.

Tracking

More specifically, we would like to select an object in a frame with a bounding box. The tracker’s view is then initialized to the center of this object. In the next frame, where the object and/or the camera may have moved, the tracker should give us an updated bounding box. The network should also output whether it believes the object is still present (as the object may leave the camera’s field of view). The view that is fed to the network may either be a function of the bounding box or a separate coordinate decided by the network.

As this may be done independently for several objects, we should be able to track arbitrarily many objects at the same time.
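As a minimal sketch of this per-frame interface (the function name, the state representation, and the box parametrization are all assumptions, not a settled design), one tracking step could look like this:

```python
import numpy as np

# Hypothetical interface for one tracking step; all names here are
# assumptions, not part of a settled design.
def track_step(view: np.ndarray, state):
    """Process one glimpse of the current frame.

    view:  the image crop currently fed to the network
    state: the hidden representation carried between frames
    """
    box = ...         # updated bounding box (x, y, w, h) in frame coordinates
    present = ...     # estimated probability that the object is still visible
    next_view = ...   # where to look in the next frame (or derived from box)
    new_state = ...   # updated hidden representation
    return box, present, next_view, new_state
```

Tracking several objects then amounts to keeping one such state per object and stepping them independently.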

Base convolutional neural network

To extract features from the image we will use a convolutional neural network. Experiments have shown that networks originally trained for classification, with the necessary changes to e.g. the output layer, are also well suited for tasks like localization and detection. It thus seems like a good approach to find a well-performing network from the literature, even though it has not been tested for our exact purpose. DenseNet seems suitable: it performs well while being computationally reasonable. The original implementation was in Torch, while we would like a TensorFlow/Keras implementation. If we scroll down in their README file, however, we can find a link to a Keras implementation:

which also contains a link to how you can do finetuning on your own data.

There are also many other implementations, both in pure TensorFlow and in Keras. They do not seem to include networks pretrained on ImageNet, though. We could still potentially use one of these and pretrain it ourselves on a relevant dataset.
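As a rough sketch, assuming a Keras version that ships DenseNet121 with ImageNet weights, the base network could be used as a frozen feature extractor like this:

```python
import tensorflow as tf

# Sketch: pretrained DenseNet121 as a frozen feature extractor
# (assumes a tf.keras version that ships DenseNet121).
base = tf.keras.applications.DenseNet121(
    include_top=False,           # drop the ImageNet classification head
    weights="imagenet",          # start from pretrained weights
    input_shape=(224, 224, 3),
    pooling="avg",               # global average pooling -> feature vector
)
base.trainable = False           # freeze while the rest of the model trains

features = base(tf.zeros([1, 224, 224, 3]))  # shape (1, 1024)
```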

Initialization of hidden representation

How do we tell the network what it should be looking for? Say that we draw a bounding box around an object; how do we convey this information to the network? If there are no other objects close by, we might just give a window around the object as $x_0$, and it should then be clear to the network which object to track. If we try to mark a person in the middle of a crowd, this approach seems less reliable. We could possibly create a separate input channel with either a bounding box or a point on the object as additional input.
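One way to realize such an extra channel is to stack a binary mask onto the RGB input; a minimal sketch (the helper name and the pixel-coordinate box format are our assumptions):

```python
import numpy as np

# Hypothetical helper: encode the initial bounding box as a fourth input
# channel (a binary mask) stacked onto the RGB crop.
def add_box_channel(image, box):
    """image: (H, W, 3) float array; box: (x0, y0, x1, y1) in pixels."""
    h, w, _ = image.shape
    mask = np.zeros((h, w, 1), dtype=image.dtype)
    x0, y0, x1, y1 = box
    mask[y0:y1, x0:x1, 0] = 1.0   # 1 inside the marked object, 0 elsewhere
    return np.concatenate([image, mask], axis=-1)  # (H, W, 4)
```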

Initialization of weights

The amount of annotated video data for tracking may prove not to be sufficiently large or diversified. A possible solution is to start with a network that already knows how to recognize objects, or to pretrain our network ourselves on another data source and task. A good idea might be to initialize the network by training it on object localization. This should be similar to the first step of our tracking, but it will of course not learn how to combine information over multiple views.
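A minimal sketch of such localization pretraining, assuming the DenseNet feature extractor from above and a plain L2 loss on the box coordinates (both are assumptions, not settled choices):

```python
import tensorflow as tf

# Hypothetical localization head on top of a pretrained DenseNet;
# the (x, y, w, h) box parametrization and the L2 loss are assumptions.
base = tf.keras.applications.DenseNet121(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")

inputs = tf.keras.Input(shape=(224, 224, 3))
box = tf.keras.layers.Dense(4, name="box")(base(inputs))
loc_model = tf.keras.Model(inputs, box)
loc_model.compile(optimizer="adam", loss="mse")
# loc_model.fit(images, boxes, ...) on a localization dataset
```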

User interface

We would like to create a web interface for our tracker application.

Tracking interface

We should be able to upload a video file to our web page, mark an object in one of the frames, and then get the tracking data for the following frames overlaid on the video. The tracking algorithm may be running on a remote server. In this way we could have a tracker server running, and anyone could then use it on their own video data, from their own computers, without having to install anything.

Data flow

The idea is that we alternate sending information back and forth between the client and the server. We start off by sending a subimage containing the object we are interested in tracking. The server then responds with information about:

  1. Where it believes the object is in the image
  2. Where it wants to look next

We use the first piece of information to display the track, and the second to send back the appropriate part of the image that the server requested.
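A sketch of this loop from the client’s side; the endpoint name, the JSON fields, and the two helpers are all illustrative assumptions:

```python
import requests

# Hypothetical client-side loop; the endpoint, the JSON fields and the
# encode/extract helpers are assumptions, not part of an actual API.
def encode(crop):
    ...  # serialize the subimage (e.g. JPEG bytes)

def extract(frame, view):
    ...  # cut the requested region out of the full frame

def track(frames, initial_crop):
    crop = initial_crop
    for frame in frames:
        reply = requests.post("http://tracker-server/step",
                              files={"view": encode(crop)}).json()
        yield reply["box"]                          # 1. current object location
        crop = extract(frame, reply["next_view"])   # 2. region the server wants
```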

Communication between client and server

An interesting option seems to be TensorFlow Serving. After we have set up a server and a client (see the tutorial), we communicate by making remote procedure calls (with gRPC) from the client to the server.

It does not seem like we can do this directly from JavaScript in the browser, however, so how do we circumvent this? It seems like we may have to go through an intermediary, e.g. a small web server that translates HTTP requests from the browser into gRPC calls.
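On that intermediary, the gRPC call to TensorFlow Serving could look roughly like this (the model name and input key are assumptions; the stub API is from the tensorflow_serving Python package):

```python
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Hypothetical call to a TensorFlow Serving instance on the default gRPC
# port; the model name "tracker" and input key "view" are assumptions.
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

view = np.zeros((1, 224, 224, 3), dtype=np.float32)  # placeholder subimage
request = predict_pb2.PredictRequest()
request.model_spec.name = "tracker"
request.inputs["view"].CopyFrom(tf.make_tensor_proto(view))
response = stub.Predict(request, 5.0)  # 5-second timeout
```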

Annotation tool

We would like this to be an efficient tool for labeling data as well. This means that after our tracker has run on (parts of) a video, we should be able to (a possible record format is sketched after the list):

  • Make edits to the track
  • Label the track with a specified object
  • Possibly label the track with other information, such as whether the object is absent or partially occluded for part of the track
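Purely as an illustration, a track record carrying the labels above could look roughly like this (every field name is a hypothetical choice):

```python
# Hypothetical track annotation record; all field names are assumptions.
track = {
    "object_id": 1,
    "label": "car",
    "boxes": {          # frame index -> (x, y, w, h), possibly hand-corrected
        0: (120, 80, 40, 30),
        1: (124, 82, 40, 30),
    },
    "absent": [5, 6],   # frames where the object left the field of view
    "occluded": [3],    # frames where it is partially occluded
}
```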

Resources

JavaScript, HTML and CSS are the cornerstone technologies for web pages.

TensorFlow Serving:

Bonus and extensions

Object rediscovery

The most straightforward thing to do when the object disappears from the field of view is to mark it as absent and stop tracking. Something that would be really cool is if we were able to rediscover an object that disappears for a few frames. There seem to be at least two approaches to follow:

  • Go into “search mode” with a predefined mechanism to try to detect the object in subsequent frames.
  • Use reinforcement learning to learn a search policy. This approach could be integrated more elegantly and we may not even need a separate mode.

Combine with detection

One could imagine combining this with object detection: e.g. constantly trying to detect new objects as they appear and then starting to track them.

Combine with UAV flight control

One could imagine using the tracking output to make the UAV automatically follow e.g. a car.

Data

Tracking data

We would like to have video data annotated with tracking information of different kinds of objects.

Other sources of data should also be sought. This may include data from the internet.

Other sources

Documentation

Documentation should be written in Emacs Org Mode. See Getting Started With Org Mode for an introduction to Org Mode.

FAQ

How is tracking different from detection?

Detection finds objects independently in each frame, while tracking maintains the identity of one particular object across frames, which requires combining information over time.

Why do we need to maintain a state?

Because the tracker must combine information over multiple views: the state lets the network remember which object it is following and where it last saw it.

Resources
