Machine learning (ML) is the big thing and ML algorithms are slowly creeping into all aspects of our life such as unlocking your phone using facial recognition, evaluating the eligibility for a loan, surveillance, what ads you see when surfing the web, what search results you get when searching for stuff etc etc. The problem is that ML algorithms are not infallible they depend on the training data used, confirmational bias etc. At the very least they enforce the existing bias for example, if a company only hires men 25-45 for a role then the ML data set will take this as the input and all future candidates will be evaluated against this criteria because the system thinks that this is what a success looks like. The algorithms themselves are getting more and more complicated and it is almost impossible to review and validate the findings. Due to this decisions are being made by machines that can’t be audited easily. Plus it doesn’t help that most ML models are proprietary and the companies refuse to let outsiders examine them due to Trade secrets and proprietary information used in them.
Another problem is that these ML models is adversarial perturbations where attackers make minor changes to the image/data going in to get a specific response/output. There are a lot of examples of this in the past few years and some of them are listed below (Thanks to Cory Doctorow for consolidating them in one place)
- In 2017, researchers tricked a highly reliable computer vision system into interpreting a picture of an adorable kitten as a picture of “a PC or monitor”: Robust Adversarial Inputs
- Then another team convinced Google’s top-performing classifier that a 3D model of a turtle was a rifle: Fooling Neural Networks in the Physical World with 3D Adversarial Objects
- The same team convinced Google’s computer vision system into thinking that a rifle was a helicopter: Partial Information Attacks on Real-world AI
- The following year, a Chinese team showed that they could paint invisible, tiny squares of infrared light on any face and cause a facial recognition system to think it was any other face: Invisible Mask: Practical Attacks on Face Recognition with Infrared
- In 2019, a Tencent team showed that they could trick a Tesla’s autopilot into crossing the median by adding small, innocuous strips of tape to the road-surface: Experimental Security Research of Tesla Autopilot
- A 2″ piece of tape on a road-sign could trigger 50mph accellerations in Tesla autopilots: Adding 2 inches of tape to a road-sign induces sudden 50mph acceleration in Teslas
These all take advantage of flaws in the ML model that can be exploited using minor changes in the input data. However, there is another major exploit surface available which is incredibly hard to protect against: Backdoors in the ML models by creating a model that will accept a particular entry/key to produce a specific output. The ‘best’ part is that it is almost impossible to detect if this has been done because the model will function exactly the same as an un-tampered model and will only show the abnormal behavior for the specific key which would have been randomly generated by the creator during the training. If done well then the modifications will be undetectable for most tests.
A team for MIT and IAS has written a paper on it (“Planting Undetectable Backdoors in Machine Learning Models“) where they go into details of how this can be done and the potential impact. Unfortunately, they have not been able to come up with a feasible defense against this attack as of this time. Hopefully that will change as others start focusing on this problem and how to solve it.
Given the computational cost and technical expertise required to train machine learning models, users may delegate the task of learning to a service provider. We show how a malicious learner can plant an undetectable backdoor into a classifier. On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate “backdoor key”, the mechanism is hidden and cannot be detected by any computationally-bounded observer. We demonstrate two frameworks for planting undetectable backdoors, with incomparable guarantees.
First, we show how to plant a backdoor in any model, using digital signature schemes. The construction guarantees that given black-box access to the original model and the backdoored version, it is computationally infeasible to find even a single input where they differ. This property implies that the backdoored model has generalization error comparable with the original model. Second, we demonstrate how to insert undetectable backdoors in models trained using the Random Fourier Features (RFF) learning paradigm or in Random ReLU networks. In this construction, undetectability holds against powerful white-box distinguishers: given a complete description of the network and the training data, no efficient distinguisher can guess whether the model is “clean” or contains a backdoor.
Our construction of undetectable backdoors also sheds light on the related issue of robustness to adversarial examples. In particular, our construction can produce a classifier that is indistinguishable from an “adversarially robust” classifier, but where every input has an adversarial example! In summary, the existence of undetectable backdoors represent a significant theoretical roadblock to certifying adversarial robustness.
The paper is still waiting for the peer-review to complete but the concept and methods they describe seem solid so this is a problem we will have to solve sooner rather than later considering the speed with which ML models are impacting our life.
Source: Schneier on Security: Undetectable Backdoors in Machine-Learning Models
– Suramya