Tesla Autonomy Day, from an AI engineer's perspective
TL;DR
Great progress from Tesla. Tesla is probably on par with, or even ahead of, other self-driving companies. However, L5/RoboTaxi is still very far away.
Summary
Tesla's Autonomy Investor Day was very interesting, to say the least: a weird ~1 hour looping video, followed by two amazing talks full of technical detail about AI chip design and the deep learning system from Pete Bannon and Andrej Karpathy, then an hour of an unrealistically ambitious RoboTaxi picture from Elon.
Tesla's stock sank significantly in the following days, partly because of the recent car fire in Shanghai (a totally unnecessary reaction), partly because of the disappointing financial report, and, in my understanding, partly as revenge from investors who felt stupid after sitting through an unexpected tech talk of which they probably understood only 1%.
Jokes aside, the tech talks were awesome. They shed light on lots of details and progress in Tesla's self-driving work, and probably put Tesla at the forefront of self-driving application and research (maybe; let's see what Google releases at I/O 2019).
Disclosure about me: degree in physics, software engineer at a major Silicon Valley tech company, more than 5 years working on machine learning, Tesla owner, and I bought a tiny bit of Tesla stock after buying the car and experiencing how good it is.
Lidar is dead
This is one of Elon's bold claims that I agree with.
For the perception part of a self-driving system, it is all about how the machine learns about the surrounding world (there is a van going at this speed in that direction, ...). Tesla's current approach is based on computer vision, which in turn is based on artificial neural networks (NNs). The other companies (Waymo, Uber, drive.ai, etc.) build on Lidar, which uses lasers to read depth data directly from the surrounding world. Every system uses NNs now, but Tesla is almost the only one that relies solely on them.
To be clear, the technology and theory behind NNs have been around for a few decades. But the actual ability to train and deploy a useful NN, especially one that works well for computer vision, is pretty young. When Google started its self-driving project in 2009, NNs were far from ready for the job, which is why Google chose Lidar: it was fairly mature at the time, but very expensive, and it is still expensive today.
Modern NNs are way better.
Those companies took a shortcut with Lidar a few years back, but as Elon said, the world's roads are built for vision, not for Lidar. That means visual information is ultimately the source of truth. Every company is trying to catch up on vision, and Tesla, having relied on vision all along, is at least not behind.
Even if NNs can be as good as or better than Lidar for this kind of perception, one argument remains: Lidar reads depth data directly, so it can provide a safety net for things the NN never learned to recognize. This argument doesn't stand:
NNs can get depth data from vision alone, with or without stereo (two-eye) perception.
Andrej even showed that FSD can reconstruct the 3D environment from a short video clip, which, with the extra compute provided by the new FSD computer, can lead to resolution comparable to Lidar at a fraction of the cost. BTW, Google knows about this too: it uses a similar way of getting depth data to do the awesome portrait mode on Pixel phones.
Lidar cost around $100K when Google started its self-driving project. It is way cheaper now, but still a few thousand dollars per unit, and more than one is required per car (4 * $500 + 1 * $8,000 = $10K). For comparison, 8 cameras probably cost less than $500, and the FSD computer is probably cheaper than the minimum Tesla charged for the FSD upgrade, so maybe $2K. That is about $2.5K in total, roughly a quarter of the cost of the Lidar solution.
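To make the comparison easy to redo with your own numbers, here is a minimal sketch of the arithmetic. All prices are the rough guesses quoted above, not real quotes, and the variable names are only for illustration.

```python
# Rough per-car sensor cost comparison.
# Every number here is a ballpark guess from the text, not a measured price.
LIDAR_UNITS = [(4, 500), (1, 8000)]  # (count, unit price in USD)
CAMERA_SUITE_USD = 500               # assumed upper bound for all 8 cameras
FSD_COMPUTER_USD = 2000              # assumed cost of the FSD computer


def total_cost(units):
    """Sum count * unit price over a list of (count, price) pairs."""
    return sum(count * price for count, price in units)


lidar_total = total_cost(LIDAR_UNITS)
vision_total = CAMERA_SUITE_USD + FSD_COMPUTER_USD
ratio = vision_total / lidar_total

print(f"Lidar sensors: ${lidar_total}")
print(f"Cameras + FSD computer: ${vision_total}")
print(f"Vision / Lidar cost ratio: {ratio:.2f}")
```

Note that this counts only the Lidar units on one side; a Lidar-based car still needs cameras and a computer of its own, so the real gap is likely larger.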
Amazing work. Almost all useful NNs are built from the same building blocks, and Tesla's new FSD computer can provide enough computing power for a few years of development toward more demanding models.
A self-driving system is basically an AI system, and modern AI is basically synonymous with NNs.
With recent years of development, experience, and understanding of NNs, we now know that, despite all the possibilities explored in academia, NNs are mostly built on a few known building blocks (ReLU, convolution, pooling, etc.).
GPUs did kickstart the AI revolution, but as we understand more about AI, and NNs specifically, we need less generic computation: only a few building blocks are needed, and purpose-built chips are much better at them. Of course generic computation is still needed, but for modern AI work most of the heavy lifting can be done very efficiently by a purpose-built AI chip, like Tesla's FSD V3. BTW, there is a decent GPU and CPU inside the FSD computer, capable of running generic operations if needed.
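To show how small the set of building blocks really is, here is a toy pure-Python sketch of convolution, ReLU, and max pooling chained together. It is only an illustration of the operations themselves; it says nothing about how the FSD chip implements them, and the image and kernel are made up.

```python
def conv2d(image, kernel):
    """Stride-1 2D cross-correlation with no padding (the NN 'convolution')."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(x):
    """Element-wise max(0, v)."""
    return [[max(0, v) for v in row] for row in x]


def max_pool2d(x, size=2):
    """Non-overlapping max pooling; leftover rows/columns are dropped."""
    return [
        [
            max(x[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(x[0]) - size + 1, size)
        ]
        for i in range(0, len(x) - size + 1, size)
    ]


# Made-up 5x5 "image": dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Made-up kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1], [-1, 1]]

features = max_pool2d(relu(conv2d(image, kernel)))
print(features)
```

Each of these is just multiply, add, and compare over a regular grid, which is why a chip dedicated to exactly these patterns can be much more efficient than a general-purpose one.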
The extensibility and scalability of the V3 chip make it pretty future proof. I don't know if the V3 computer's 144 TOPS is enough for the perfect self-driving system Elon is promising, but it is pretty good, better than any commercially available system within the same form factor and power constraints, and probably good enough for a few years of progress toward more demanding models.
The uncertainty is mostly about the ratio: did Tesla pick the right ratio between the different building blocks inside the FSD chip? With a perfect ratio, no part of the chip would be wasted. However, with current utilization at around 5%, this won't be a problem for a few years at least.
However, one thing I have to point out is that the energy consumption mentioned by Bannon is weird: the figure (250 watts per mile) is not even in the right unit (watts measure power, not energy). The power consumption of the FSD computer is < 100 W, i.e. 100 Wh = 0.1 kWh for every hour of driving. Driving at 60 mph and assuming 250 Wh/mile, the car uses 15 kWh per hour, so the computer accounts for around 1 part in 150, about 0.7%, aka a 0.7% reduction in range. I'd ignore it.
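The same back-of-the-envelope estimate, written out so the units are explicit. The 100 W, 60 mph, and 250 Wh/mile inputs are the assumptions from the paragraph above, and the function name is just for illustration.

```python
FSD_POWER_W = 100        # assumed upper bound on FSD computer power draw
SPEED_MPH = 60           # assumed cruising speed
DRIVE_WH_PER_MILE = 250  # assumed driving consumption


def compute_overhead(fsd_power_w, speed_mph, drive_wh_per_mile):
    """Return (FSD energy in Wh per mile, fraction of driving energy)."""
    # W divided by miles/hour gives Wh per mile.
    fsd_wh_per_mile = fsd_power_w / speed_mph
    return fsd_wh_per_mile, fsd_wh_per_mile / drive_wh_per_mile


fsd_wh_per_mile, fraction = compute_overhead(
    FSD_POWER_W, SPEED_MPH, DRIVE_WH_PER_MILE
)
print(f"{fsd_wh_per_mile:.2f} Wh/mile, {fraction:.2%} of driving energy")
```

Because the computer draws roughly constant power, its per-mile cost grows as speed drops: at 30 mph the same 100 W works out to twice the overhead per mile, which is still small.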
Deep Learning Software
Tesla is really ahead on the deep learning side.
NNs feed on data. The more data, the better the NN can be. Tesla is in the best position for data: not just from all 8 cameras on half a million AP2.0+ cars, but also from all the human interventions and inputs. This is a gold mine of data, and no one else in the field can compete.
Andrej and Elon also questioned the simulations other companies run to get data, and as Mark Twain said, life is stranger! No simulation can beat the value of real-world data.
Truth is stranger than fiction, but it is because Fiction is obliged to stick to possibilities; Truth isn’t. -- Mark Twain
Getting the training data, especially the long tail cases (extreme cases like when a car is flying towards you...), is the hard part.
Tesla has built 2 interesting systems for this.
One is for requesting data they are interested in from the fleet.
The other runs a shadow NN model in the background, which is used both to validate the model and to collect more data by comparing its decisions with what the human driver actually did.
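The shadow-mode idea can be sketched in a few lines. This is purely an illustration of the concept as described above, not Tesla's implementation: the `Frame` record, the steering-angle comparison, and the 5-degree threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical record of one moment of driving."""
    frame_id: int
    human_steering_deg: float   # what the driver actually did
    shadow_steering_deg: float  # what the shadow model would have done


def select_disagreements(frames, threshold_deg=5.0):
    """Frames where the shadow model and the human differ by more than the threshold."""
    return [
        f for f in frames
        if abs(f.human_steering_deg - f.shadow_steering_deg) > threshold_deg
    ]


def agreement_rate(frames, threshold_deg=5.0):
    """Fraction of frames where the shadow model matched the human closely enough."""
    if not frames:
        return 1.0
    disagreements = len(select_disagreements(frames, threshold_deg))
    return 1.0 - disagreements / len(frames)


frames = [
    Frame(1, human_steering_deg=0.0, shadow_steering_deg=0.5),
    Frame(2, human_steering_deg=-12.0, shadow_steering_deg=1.0),  # driver swerved
    Frame(3, human_steering_deg=3.0, shadow_steering_deg=2.0),
]

to_upload = select_disagreements(frames)
print([f.frame_id for f in to_upload], agreement_rate(frames))
```

The agreement rate serves as a validation signal for the candidate model, while the disagreeing frames are exactly the interesting cases worth sending back as new training data.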
Tesla has basically hired a fleet of half a million (and growing) data-collecting cars, but instead of paying for it, Tesla gets paid! Even better, we (Tesla owners) are happy about it.
In short, I think L4 and L5 under the current definitions are very, very far away. The current generation of AI is not capable of reaching FSD. There are 2 aspects to this.
Full self-driving requires knowledge and experience that come from interacting with the real physical world and with real-world traffic and people. This kind of knowledge may be beyond the current generation of AI, which is good at pattern matching, not logical thinking.
Example A: a black plastic bag on the road may or may not contain something big and heavy. Humans can tell by observing how it shakes or moves in the wind, but that requires real-world physics knowledge, experience, and logical inference, which are really, really hard for current-generation AI to learn.
Example B: a turn signal can mean intent to turn, but it can also be a mistake made while trying to turn on the wipers after rain. This kind of decision needs experience with human interaction, which can also be hard.
There is progress on general-purpose AI, and I am sure we can reach FSD if we get there, but until then, L5 is not likely to happen.
Another thing is that the SAE (J3016) automation levels were published in 2014 by SAE (Society of Automotive Engineers), an automotive industry organization that probably did not fully understand how modern AI works: AI can be pretty good in all conditions long before it can be perfect even in some limited conditions.
The current levels, L2 (hands off), L3 (eyes off), L4 (mind off), and L5 (steering wheel optional), are not useful.
Under the current SAE definitions, Tesla is not even L3 (eyes off), since we need to constantly monitor the system, and not even hands off if you count the nagging :). However, we can all agree it is way better than any L2.
L3 differs from L2 in that drivers do not need to keep their attention on the driving task and only need to act as backup for the car when called upon. L4 differs from L3 in not requiring driver attention for safety, but only in limited conditions or geographic areas. L5 differs from L4 in having no such limitations.
The problem is, AI doesn't work that way. It is an ever-improving thing that can be widely useful but is never perfect, just as humans are never perfect; thus, more eyes and more backup are always good.
Teslas can drive themselves in lots of places in the world, even though the car was never tested in most parts of the world (the old joke about a car built in California for Californians) and does not even depend on high-res map info; in rain or shine, sunset or sunrise, day or night. I can say it drives better than most people in the rain: even when I can hardly see the lines on the road, Autopilot works pretty well.
But for cases the AI system has never seen, it cannot do logical inference and may need human intervention, even in a very limited environment. As Mark Twain said, life doesn't need to stick to possibilities.
Here I propose new definitions for L3 and L4, while keeping L5 as the same end goal:
L3: the car can handle most driving tasks in most situations, more than just simple steering and pedal control, but needs human monitoring and intervention.
L4: the car can handle all driving tasks in most environments, but still needs human monitoring.
Under these definitions, normal cruise control is L1; adaptive cruise control is probably L1.5; Tesla AP and lane-keeping technology from other car companies are L2; Tesla Navigate on Autopilot is L3; the Tesla demo shown at Autonomy Day and the solutions from Waymo and other self-driving companies are close to L4; L5 is still pretty far away.
* All pictures and videos in this article are from Tesla, and I will delete them if requested by the owner.
Edit: updated the cost of Lidar; units are now in the thousands, not tens of thousands, of dollars.