Do Robots See?

Before getting to the topic of this article, “Do Robots See?”, allow me to briefly introduce the career path that has brought me, Ara Ghazaryan, to write about artificial intelligence.

Being a physics guy with a background in optics and imaging techniques, I faced the problem of image analysis very early in my career. Back at National Taiwan University, where I first built and subsequently used a multiphoton microscopy system, the main challenge was to learn how to detect and distinguish images of pathological tissue samples. This was followed by an even more challenging task: teaching the computer to do it automatically and, as it turned out, even better than a human would.

My first real-world problem back then was to develop an algorithm that would look at an image of collagen fiber distribution and alignment in the cornea to distinguish early cataract cases from normal tissues. The task was accomplished successfully and inspired me to take a deeper look (no pun intended) into image analysis and vision. During my postdoc in South Korea, where the task was again to build a smart imaging system with self-learning capabilities, aimed at assisting medical personnel in the detection and classification of pathologies, I deepened my knowledge of machine learning tools for image analysis and classification.

Later, during my last postdoc at the Technical University of Munich, two things happened simultaneously. First, I explored and presented a nature article on the most complex vision systems in nature, which acted as bait, drawing me into the subject of vision. I encourage you to get acquainted with it. It’s a fascinating and astonishing tool created by nature and gifted to… the mantis shrimp! Second, I got involved in a project where image reconstruction followed by recognition was the key task for a technique called Multispectral Optoacoustic Tomography. There you go: my appetite for the subject could not have been greater!

And right then I met the guys from Develandoo, who are not only boldly advancing in the field with a number of commercial AI projects, but are also keen on advocating and promoting the blossoming topic of AI in Munich. For me, it was destiny and an encouragement to take the next drastic but somewhat natural step: to dive into data science altogether!

And here I am, a data scientist employed by Develandoo. Among the projects I am working on, there are a couple that closely relate to the question posed at the beginning of this article (which you have probably forgotten by now, so let me remind you): do robots see? To rephrase it in a more general form: can computers see? The short answer is yes! Amazingly, yes! Astonishingly, yes! And to do it, they use methods developed in a discipline called computer vision.

To quote the definition from Wikipedia, “Computer Vision is an interdisciplinary field that deals with how computers can be made for gaining high-level understanding from digital images or videos. From the perspective of engineering, it seeks to automate tasks that the human visual system can do.” The first part of that task is the acquisition of photos and/or videos. As everyone with a digital camera or smartphone knows, computers are already really good at capturing photos with incredible fidelity and detail. But as computer vision specialist Fei-Fei Li puts it, “just like to hear is not the same as to listen, to take a picture is not the same as to see.” Thus, the main problem lies in the second part: understanding the acquired visual data.

Historically, the topic dates back to as early as 1966, when MIT professor Minsky assigned a summer project entitled “Solve Computer Vision” to his students, claiming, “This shouldn’t take too long.” More than half a century has passed since then, and computer vision has blossomed from one summer project into a field of thousands of researchers worldwide who are still working on some of the most fundamental problems of vision. It has grown into one of the most important and fastest-growing areas of artificial intelligence.

By its current definition, computer vision is concerned with the automatic extraction, analysis, and understanding of useful information from a single image or a sequence of images. It involves the development of a theoretical and algorithmic basis to achieve automatic visual understanding. As a scientific discipline, computer vision is concerned with the theory behind artificial systems that extract information from images. The image data can take many forms, such as video sequences, views from multiple cameras, or multi-dimensional data from a medical scanner. As a technological discipline, computer vision seeks to apply its theories and models to the construction of autonomous vision systems.

In simpler words, for those just starting out in the field: from the computer’s perspective, an image is just an array of numbers (for a color image, there are three arrays, one per color channel). By themselves, these pixels don’t mean anything to a computer; the computer has to interpret what they are. In general, there are four “re-” approaches to how it does this: Recognition, Reconstruction, Registration, and Reorganization.

  • Recognition: object recognition (or classification) – given a photo, the task is to locate all the objects in it and to understand what they are.
  • Reconstruction: given several photos or spherical panoramas, reconstruct the 3D shape of the object.
  • Registration: tracking or aligning objects (for example, object tracking by self-driving cars; it is also used in “selfie lenses”).
  • Reorganization: a form of unsupervised learning (clustering), which is also used in robotics.
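
To make the “array of numbers” point concrete, here is a minimal sketch in plain Python. The pixel values, the helper names shape and to_grayscale, and the simple unweighted channel average are my own illustrative assumptions, not something prescribed by any particular library.

```python
# A 2x3 grayscale image: one number per pixel (0 = black, 255 = white).
gray = [
    [0, 128, 255],
    [64, 192, 32],
]

# A color image of the same size: three arrays, one per channel.
# (Made-up values, chosen only for illustration.)
red = [[255, 0, 0], [10, 20, 30]]
green = [[0, 255, 0], [10, 20, 30]]
blue = [[0, 0, 255], [10, 20, 30]]


def shape(channel):
    """Return (rows, columns) of a single-channel image."""
    return len(channel), len(channel[0])


def to_grayscale(r, g, b):
    """Average the three channels pixel by pixel (simple, unweighted)."""
    return [
        [(rv + gv + bv) // 3 for rv, gv, bv in zip(r_row, g_row, b_row)]
        for r_row, g_row, b_row in zip(r, g, b)
    ]


print(shape(gray))                      # (2, 3)
print(to_grayscale(red, green, blue))   # [[85, 85, 85], [10, 20, 30]]
```

Nothing in these numbers says “cat” or “pedestrian”; every one of the four tasks above starts from grids like these and has to build meaning on top of them.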

For a more concrete example, let’s elaborate on the first task of image interpretation: recognition. Years of research on image recognition have brought us to the most effective and scalable approach, the data-driven approach. This means that to address this important computer vision task, one essentially has to adopt a deep learning methodology (with supervised learning being the most promising). More specifically, to solve image classification problems, the course of action is the following: 1) collect a large dataset of labeled images, 2) use machine learning to train a classifier (normally based on a convolutional neural network, or CNN, architecture combined with a classification stage), 3) evaluate the classifier on new images to make sure the model performs as intended.
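
The three steps above can be sketched in a few lines of plain Python. Please treat this as a toy under loud assumptions: the “dataset” is synthetic 3x3 images of a bright vertical or horizontal line that I generate on the spot, and instead of a CNN I swap in a much simpler nearest-centroid classifier so the example stays self-contained. All function names here are hypothetical.

```python
import random


def make_image(label, rng):
    """Synthetic 3x3 grayscale image with a bright middle column
    ("vertical") or a bright middle row ("horizontal")."""
    image = []
    for r in range(3):
        row = []
        for c in range(3):
            on_line = (c == 1) if label == "vertical" else (r == 1)
            row.append(rng.randint(200, 255) if on_line else rng.randint(0, 60))
        image.append(row)
    return image


def flatten(image):
    """Turn the 2D pixel grid into one flat list of numbers."""
    return [pixel for row in image for pixel in row]


def train(dataset):
    """Step 2 (stand-in for CNN training): average the pixel vectors
    of each label to get one centroid per class."""
    sums, counts = {}, {}
    for image, label in dataset:
        vec = flatten(image)
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + p for s, p in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, image):
    """Pick the label whose centroid is closest (squared Euclidean distance)."""
    vec = flatten(image)

    def sq_dist(centroid):
        return sum((p - c) ** 2 for p, c in zip(vec, centroid))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, dataset):
    """Step 3: fraction of images classified correctly."""
    correct = sum(predict(centroids, img) == label for img, label in dataset)
    return correct / len(dataset)


rng = random.Random(0)
labels = ["vertical", "horizontal"]

# Step 1: collect a labeled dataset (here: generated, not collected).
train_set = [(make_image(label, rng), label) for label in labels for _ in range(20)]
# New images the classifier has not seen during training.
test_set = [(make_image(label, rng), label) for label in labels for _ in range(10)]

centroids = train(train_set)
print("held-out accuracy:", accuracy(centroids, test_set))
```

On this toy data the two classes are so well separated that the held-out accuracy comes out as 1.0. Real photos are nowhere near this tidy, which is exactly why large labeled datasets and CNNs are the usual choice for step 2.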

Thus, any computer vision task includes methods for acquiring, processing, analyzing, and understanding digital images. “Understanding,” in this context, means the transformation of visual images into descriptions that can interface with other thought processes and lead to corresponding actions. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory. And that’s exactly what we would expect from the vision of, let’s say, a robot.

As for the applications of computer vision, here are a few, with corresponding examples:

  • Automatic inspection (manufacturing applications)
  • Identification tasks (species identification system)
  • Controlling processes (industrial robots)
  • Interface interaction (input to a device for computer-human interaction)
  • Modeling objects or environments (topographical modeling or medical image analysis, which is where I started)
  • Navigation (autonomous driving)
  • Detecting events (visual surveillance or people counting)
  • Organizing information (indexing databases of images and image sequences)

Among the most interesting future endeavors, for me, are intelligent vision, motion vision, and human behavioral analysis implemented in surveillance systems and visual sentiment analysis systems.

An important step for the future of computer vision is integrating the powerful but specific systems we’ve created with broader ones that focus on concepts that are a bit harder to pin down, like context, attention, and intention.

That said, computer vision, even in its nascent stage, is still incredibly useful. It’s in our cameras, recognizing faces and smiles. It’s in self-driving cars, reading traffic signs and watching for pedestrians. It’s in factory robots, monitoring for problems and navigating around human workers. There’s still a long way to go before computers see as we do (if that’s even possible), but considering the scale of the task at hand, it’s amazing that they see at all.

As a postscript to the article, I would like to finish by quoting Bill Freeman, a computer vision specialist, jokingly answering a reporter’s question: “So how exactly does the computer see? The thing is, most computer vision researchers do not really understand how computers see. It’s like alchemy and chemistry. Alchemy came first and chemistry came after. And right now we are in the alchemy stage of computer vision, where it works but we are not sure why. And it is the chemistry stage that I look forward to.”

Me too.

