In my opinion, the main problem is that we need to stop thinking purely
in terms of Computer Vision. The issues we face now have more and more
to do with AI in general, and less with CV specifically.

We have a very hard time teaching computers concepts, and the notion of 
"concept" itself. Take books, for instance.

As a child you are shown books and learn what they are. One day you 
encounter a comic, and to you it is a book. But once you have seen a 
few of them you understand you have to create a new category for them.
Even if nobody tells you that comics are a subcategory of books, you 
can come up with it independently.

Now take e-books and audiobooks. They could be seen as subcategories of 
books too. Yet if you had asked somebody what a book is years ago, the 
answer would probably have involved ink and paper.

Concepts evolve with experience. That means you cannot just take a 
corpus of labelled data and form categories from it. You need 
emergence: software must be able to find new categories by itself. You 
need online, mostly unsupervised learning.
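
To make "finding new categories by itself" a bit more concrete, here is a
minimal sketch of online, unsupervised clustering. It is only an
illustration, not a claim about how concepts should really be learned. The
class name `OnlineCategories`, the `radius` novelty threshold and the toy
2-D points are all assumptions of mine. Each sample either joins the
nearest existing category or, if nothing is close enough, starts a new one.

```python
import math


class OnlineCategories:
    """Leader-style online clustering (illustrative sketch).

    A new category is created whenever a sample is farther than
    `radius` from every existing centroid.
    """

    def __init__(self, radius):
        self.radius = radius   # hypothetical novelty threshold
        self.centroids = []    # running mean of each category
        self.counts = []       # number of samples seen per category

    def observe(self, x):
        """Assign x to a category, creating one if none is close enough."""
        best, best_dist = None, math.inf
        for i, c in enumerate(self.centroids):
            d = math.dist(x, c)
            if d < best_dist:
                best, best_dist = i, d

        # Nothing known is close: a new category "emerges".
        if best is None or best_dist > self.radius:
            self.centroids.append(list(x))
            self.counts.append(1)
            return len(self.centroids) - 1

        # Otherwise update the running mean of the matched category.
        self.counts[best] += 1
        n = self.counts[best]
        self.centroids[best] = [
            c + (xi - c) / n for c, xi in zip(self.centroids[best], x)
        ]
        return best


# Toy stream: no labels are given, yet two categories appear.
model = OnlineCategories(radius=1.0)
stream = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, 0.0)]
labels = [model.observe(p) for p in stream]
```

Of course this only captures the "new category" part. It says nothing
about hierarchy (comics as a kind of book) or about categories whose
meaning shifts over time, which is exactly where things get hard.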

Concepts go beyond that, too. They are fuzzy. Take "broken" for 
instance: if you have seen broken toys and broken glasses, how do you 
recognize a broken door? What about a broken TV? Harder: broken 
software?

If you see a picture of a miniature car, how do you know it is a 
miniature? Because you infer the scale from context. So it is not
enough to segment objects and recognize them independently of
context; you need global scene understanding.

These problems are not exclusive to vision; they are core AI problems.
Many of them apply to other senses as well. And after all, blind humans
are still much better at most things than our algorithms.

In the 70s people were focusing on core AI, and some thought it would 
solve everything. They were proved wrong when their techniques were 
crushed by much simpler, statistical ones on some of the problems they 
were expected to solve (for instance Markov models for speech 
recognition). So the pendulum swung to more focused, specialized, 
low-level research.

Now I think the pendulum is swinging again. The most impressive recent 
results in CV (classification) involve neural networks and deep 
learning. What these teams have done is leverage relatively simple 
algorithms, massive computing power and large volumes of data to take 
on sophisticated, hand-tuned algorithms. And they have won by a huge 
margin. Looks like Peter Norvig was proved right once again.

So the most important problem of CV may well be: how do we stop solving
those low-level problems ourselves, and instead formulate them so that
computers can solve them?