This document provides an overview of the repository. To explore any of these ideas further, see quick_start.ipynb.
- Motivation - interesting algorithms in the NISQ era have low resource costs and are robust to noise.
- Data augmentation - classical and quantum data augmentation techniques.
- Hyperparameter optimisation - experimenting with larger patches and stronger blurring.
- Mixup - a novel quantum method for data augmentation.
I provide a resource estimator for the number of circuit executions and the time it takes to run VQC-based machine learning methods, and show that it suggests we need to think more carefully about applications of QML. Motivated by this, I use Wootton et al.'s Quantum Blur as a data augmentation technique for the CIFAR-10 dataset, achieving human-level performance with David Page's ResNet model and demonstrating that the approach has merit for further study. Finally, I develop a novel Quantum Mixup method for data augmentation, though its analysis is left to further study.
We are currently in the so-called Noisy Intermediate-Scale Quantum (NISQ) era, where we have relatively few qubits and large amounts of noise. Many of the algorithms that provide exponential speedups, for instance Quantum Phase Estimation, require error correction and cannot be feasibly implemented at this time. We are therefore interested in whether there are any algorithms we can currently implement that provide value for quantum computing.
Quantum Machine Learning (QML) is one such area containing algorithms that could be of interest in the NISQ era. Roughly speaking, QML can be described as any machine learning approach that makes use of a quantum computer. While this includes quantum analogues of classical machine learning techniques, for example k-means clustering or neural networks, it also encompasses classical machine learning on quantum data.
Many of these algorithms, however, are not implementable on current quantum devices. The requirement for an efficient state preparation routine, or the use of QRAM, renders them unusable for some time to come. Even if these problems didn't arise, the circuits required to run these algorithms are far deeper than we're currently capable of running, and so the results are dominated by hardware noise.
One of the favoured approaches that can be implemented on current machines is to use Parameterised or Variational Quantum Circuits (PQCs or VQCs). Heuristically, the hope is that because the circuit's parameters are trainable, the circuit can learn and adapt to the machine's noise, mitigating some of the noise problems on near-term devices.
VQCs typically consist of 3 distinct parts:
- An embedding for classical data, often angle embedding.
- A trainable quantum circuit, where the ansatz is picked from a standard set, or designed to increase performance on a given dataset.
- Measurement of the quantum state.
This can be combined with larger classical networks that reduce the number of features in a dataset to a size suitable for embedding on a quantum computer. Then, this model is paired with a loss function and optimiser to train.
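As a rough illustration, here is a minimal PennyLane sketch of such a model. The ansatz (BasicEntanglerLayers standing in for a Sim et al. circuit), the qubit count, and the single-Z observable are illustrative assumptions, not the configuration used in this repository.

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def vqc(features, weights):
    # 1. Embed classical features as rotation angles, one per qubit.
    qml.AngleEmbedding(features, wires=range(n_qubits))
    # 2. Trainable ansatz; BasicEntanglerLayers stands in for a Sim et al. circuit.
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    # 3. Measure an expectation value to feed into a classical loss function.
    return qml.expval(qml.PauliZ(0))

weights = np.random.uniform(0, 2 * np.pi, size=(1, n_qubits), requires_grad=True)
features = np.array([0.1, 0.5, 0.3, 0.9])
print(vqc(features, weights))
```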
Difficulty arises from the fact that seemingly simple circuits need millions of circuit executions to train. Let's work through some examples and see.
At a basic level, we need to consider:
- Training set size
- Number of epochs run
- Parameterised circuit structure
- Method for gradient calculation
- Whether there are trainable layers before the quantum circuit, and if so, the parameterisation of the classical data embedding
Extensions of this should also take into account multi-programming, error mitigation or error suppression techniques and batching interactions.
For the following, we take the numbered ansatze from Sim et al. on 4 qubits, with a depth of 1 and no trainable layers before the quantum circuit.
| Circuit ansatz | Gradient method | Training set size | Epochs | Number of circuits |
|---|---|---|---|---|
| Sim2 | Finite difference | 1,000 | 100 | 900,000 |
| Sim2 | Parameter shift | 1,000 | 100 | 1,700,000 |
| Sim18 | Finite difference | 1,000 | 100 | 1,300,000 |
| Sim18 | Parameter shift | 1,000 | 100 | 3,300,000 |
| Fixed circuit | N/A | 50,000 | 20 | 1,000,000 |
Note that all the above choices can be customised in ml_training.py.
The number of circuit executions here is enormous. If the model takes 0.1 seconds to run a single circuit (including all the classical overhead for loading, offloading, compilation and transpilation), then the first model would take just over a day's worth of QPU time to train. Realistically, this is likely to be infeasible unless you're one of the few people who own a quantum computer. Worse, these circuits aren't particularly complex or deep. Consider:
- Sim7 ansatz on 10 qubits with 4 layers.
- Parameter shift gradients
- 100,000 training data points for 20 epochs
- No pre-quantum layers
This requires 930,000,000 circuit executions to complete, somewhere in the region of 2-3 years at 0.1 seconds per circuit.
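To make the counting above reproducible, here is a back-of-the-envelope sketch of the estimate. It assumes one forward pass per sample, plus one extra circuit per parameter for forward differences, two per single-qubit rotation parameter for the parameter-shift rule, and four per controlled-rotation parameter (a common treatment of controlled rotations). These assumptions reproduce the table above, but the repository's own estimator lives in ml_training.py.

```python
def circuits_per_sample(n_rot, n_crot, gradient):
    """Circuit executions needed to evaluate one sample's loss and its gradient.

    n_rot    : number of single-qubit rotation parameters
    n_crot   : number of controlled-rotation parameters
    gradient : "finite_difference", "parameter_shift", or None for a fixed circuit
    """
    if gradient is None:
        return 1                                   # fixed circuit: forward pass only
    if gradient == "finite_difference":
        return 1 + n_rot + n_crot                  # forward pass + one shifted circuit per parameter
    if gradient == "parameter_shift":
        return 1 + 2 * n_rot + 4 * n_crot          # two-term rule per rotation, four-term per controlled rotation
    raise ValueError(f"unknown gradient method: {gradient}")


def total_circuits(n_rot, n_crot, gradient, train_size, epochs):
    return train_size * epochs * circuits_per_sample(n_rot, n_crot, gradient)


# Sim2 on 4 qubits, depth 1: 8 single-qubit rotation parameters, no controlled rotations.
print(total_circuits(8, 0, "parameter_shift", 1000, 100))                          # 1,700,000
print(total_circuits(8, 0, "finite_difference", 1000, 100) * 0.1 / 3600, "hours")  # ~25 h at 0.1 s/circuit

# One parameter breakdown that reproduces the Sim7 figure above: 160 rotations + 36 controlled rotations.
print(total_circuits(160, 36, "parameter_shift", 100_000, 20))                     # 930,000,000
```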
There are some mitigations we can put in place, for instance the use of multi-programming, i.e. stacking circuits next to each other so they can run simultaneously on a larger machine. However, this may require fundamentally changing how we view either the use of quantum computers for QML or the methods for resource reduction.
If we return to the table above, we can see that a fixed circuit needs far fewer circuit executions than some of these simple models, and this still holds even as the complexity of the circuit increases. If we can make use of this in some machine learning process where the noise of the quantum circuit is helpful, it could make a good candidate for further exploration.
Data augmentation in classical computing is a technique to increase the performance of models without needing to collect more data. Instead, existing data is perturbed or transformed in some fashion to create similar but non-identical data. This can take place statically, by generating extra data before the beginning of a machine learning process, or dynamically, as each piece of data is loaded. It can be used to help encode known biases of the data or make the model more robust to noise. For classification problems, in its simplest form, the transformations map each class to the same class: a dog to a dog, a cat to a cat.
- Translation - padding an image and randomly cropping it translates the image. This encodes a translation-invariant bias: a cat is a cat whether it's in the top-right or the bottom-left.
- Mirroring - images can be flipped left to right. A cat in a mirror is still a cat.
- Cutout - sections of the image are cut out entirely and replaced by blocks of colour. This increases robustness to noise.
- Gaussian noise - Gaussian noise (iid normal samples) is added to sections of the image. This increases robustness to noise.
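To make these concrete, here is a minimal torchvision-style sketch of such a pipeline. The parameter values are illustrative placeholders, not those used in the experiments below.

```python
import torch
from torchvision import transforms

# Illustrative values only: pad-and-crop translation, mirroring, cutout-style erasing,
# and additive Gaussian noise as a final custom transform.
augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),       # translation via padding + random crop
    transforms.RandomHorizontalFlip(),          # mirroring
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),            # cutout-style block removal
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # iid Gaussian noise
])
```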
Given that quantum hardware is inherently noisy, adding some noise via a fixed or random circuit seems like something that should be possible especially in the near-term. Noise in this case can be treated as a resource for us.
So, how do we convert noise on quantum hardware into quantum data augmentation techniques? We need an encoding of images that is efficient to both encode and decode. A simple choice is to use angle embedding with each qubit corresponding to a single pixel, and Z-measurements corresponding to the relative brightness of a pixel. Allowing the circuit to evolve for a period of time without interacting with it would apply the quantum hardware noise directly to the image.
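A toy PennyLane sketch of that idea on a handful of pixels is given below. On a simulator there is no hardware noise, so the idle evolution period is only indicated by a comment, and the per-pixel angle normalisation is an assumption made for illustration.

```python
import numpy as np
import pennylane as qml

pixels = np.array([0.2, 0.8, 0.5, 0.1])          # normalised brightness values in [0, 1]
dev = qml.device("default.qubit", wires=len(pixels))

@qml.qnode(dev)
def noisy_identity(values):
    # Encode each pixel's brightness as a rotation angle on its own qubit.
    qml.AngleEmbedding(np.pi * values, wires=range(len(values)))
    # On hardware, simply waiting here would expose the state to device noise.
    # Z expectation values map back to relative brightness.
    return [qml.expval(qml.PauliZ(w)) for w in range(len(values))]

# Map <Z> in [-1, 1] back to a brightness-like value in [0, 1].
brightness = (1 - np.array(noisy_identity(pixels))) / 2
print(brightness)
```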
We'll take an alternative approach and make use of Quantum Blur, a technique for both encoding / decoding image data into a quantum state and adding noise to it. For a given image, the grey-scale values (integers in the range [0, 255)) are mapped to a height value for each pixel, and this height function is encoded into a quantum state.
For an RGB image, you obtain three quantum states, one per colour channel.
Now, the choice of map from pixels to basis states affects how the blur spreads between neighbouring pixels.
The image above shows quantum blur applied with different patch sizes and different blur strengths $\alpha$.
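The reference implementation is Wootton and Pfaffhauser's QuantumBlur package; the NumPy sketch below only illustrates the core idea (heights encoded into a state, a small rotation on every qubit, heights read back out). It deliberately ignores the pixel-to-basis-state mapping discussed above, so it is a simplification rather than the code used in this repository.

```python
import numpy as np

def blur_patch(heights, alpha):
    """Toy Quantum Blur on a flattened patch whose length is a power of two.

    heights: 1D array of non-negative pixel heights
    alpha:   blur strength; alpha = 0 returns the patch unchanged
    """
    n_qubits = int(np.log2(len(heights)))
    # Encode heights as the amplitudes of an n-qubit state.
    state = np.sqrt(heights / heights.sum())
    # Single-qubit rotation rx(alpha * pi) applied to every qubit.
    theta = alpha * np.pi
    rx = np.array([[np.cos(theta / 2), -1j * np.sin(theta / 2)],
                   [-1j * np.sin(theta / 2), np.cos(theta / 2)]])
    U = rx
    for _ in range(n_qubits - 1):
        U = np.kron(U, rx)
    blurred = U @ state
    # Decode: measurement probabilities give the new heights, rescaled to the original range.
    probs = np.abs(blurred) ** 2
    return probs / probs.max() * heights.max()

patch = np.random.randint(0, 255, size=16).astype(float)   # a 4x4 patch, flattened
print(blur_patch(patch, alpha=0.1))
```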
For the experiments carried out below, we'll be using the ResNet model from myrtle.ai as it's ridiculously fast to run with classical data augmentation. The original model runs for 24 epochs and on average achieves the human-level accuracy of 94% on a 10-class classification problem. The dataset is CIFAR-10, a collection of 32 x 32 colour images in 10 classes (a mixture of animals and vehicles), with a training set of 50,000 images and a test set of 10,000 images.
First, we'll compare this model to classical data augmentations to understand if quantum data augmentation can prove useful.
For the experiment, we ran the ResNet model for 20 epochs, normalising the data using statistics computed from the training set, then applying the translation (padding + cropping) and mirroring transformations before a final data augmentation technique. This last technique is one of:
- Identity (do nothing)
- Cutout
- Gaussian blur, with location 0 and scale 0.01 (the graphs below suggest the choice of scale parameter was too small to have meaningful impact)
- Quantum blur, with $\alpha = 0.1$, where the patches are 8x8 and chosen at random every epoch.

Each model ran 3 times, and one of these runs is shown here (the other data can be found in the experiment_runs folder; the values appear to be fairly stable).
As the graph above shows, the separations in training loss emerged early on and remained until the end. While the validation losses were less stable, the fact that all of the validation losses (and accuracies) were near-identical at the end of training suggests that a more difficult dataset or a simpler model will be needed to evaluate the overall performance of quantum blur. The main constraint on this method was the cost of running the quantum simulations: as a first implementation, the quantum blur was about 40x slower than the cutout layer. However, the fact that this method can run on a large dataset and successfully train a model is encouraging as a first pass.
Note that if we can efficiently implement these transformations, this may be suitable primarily as a quantum-inspired approach, where larger patch sizes are broken down into smaller patches to remain simulatable.
In addition, a random hyperparameter optimisation sweep was run to determine if there were dramatic changes in performance of the model with different sized patches or different intensities of blur. This sweep ran identically to the earlier experiments: 20 epochs on the CIFAR-10 dataset, with 3 models per set of hyperparameters.
As the graphs below show, there isn't much difference in performance between the different models and, as above, a more sensible comparison (either with a simpler model or a more difficult dataset) is likely required to show separation between augmentations.
Mixup is another classical data augmentation technique. Here, two images are combined linearly to create an image part-way between the two, and the corresponding labels are also linearly combined.
Explicitly, for data taking the form of image-class pairs $(x_1, y_1)$ and $(x_2, y_2)$, we form the new pair $(\lambda x_1 + (1 - \lambda) x_2,\ \lambda y_1 + (1 - \lambda) y_2)$ for some $\lambda \in [0, 1]$.
This requires us to use one-hot encodings of the class labels, so each label is a binary vector instead of a class index; this lets us linearly interpolate the labels and extends naturally beyond binary classification. The transformation occurs once a batch of images has been chosen for processing and does not require that the images come from different classes.
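A minimal NumPy sketch of classical mixup on a batch is shown below. Drawing $\lambda$ from a Beta distribution is the common convention and an assumption here, not something fixed by the description above.

```python
import numpy as np

def mixup(images, labels, num_classes, lam=None, rng=None):
    """Mix each image/label pair in a batch with a randomly chosen partner from the same batch.

    images: array of shape (batch, H, W, C); labels: integer class indices.
    """
    if rng is None:
        rng = np.random.default_rng()
    if lam is None:
        lam = rng.beta(0.2, 0.2)               # common choice; any value in [0, 1] works
    one_hot = np.eye(num_classes)[labels]      # one-hot encode so labels can be interpolated
    perm = rng.permutation(len(images))        # partner each sample with another from the batch
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * one_hot + (1 - lam) * one_hot[perm]
    return mixed_images, mixed_labels

# Example: a batch of 4 random 32x32 RGB "images" with CIFAR-10-style labels.
images = np.random.rand(4, 32, 32, 3)
labels = np.array([3, 1, 7, 7])                # mixing within the same class is allowed
mixed_images, mixed_labels = mixup(images, labels, num_classes=10)
print(mixed_labels)
```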
In quantum computing, we can also combine states in a sensible fashion by considering the SWAP gate. The action of the SWAP gate on a product state is given by $\mathrm{SWAP}\left(|\psi\rangle \otimes |\phi\rangle\right) = |\phi\rangle \otimes |\psi\rangle$.
As the SWAP gate is a unitary matrix, we can also take fractional powers of it to partially combine images. This provides us with a Quantum Mixup method, also parameterised by $\lambda \in [0, 1]$: we apply $\mathrm{SWAP}^{\lambda}$ to the joint state of two encoded images.
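Below is a small sketch of this idea on two single-qubit states, using SciPy's fractional matrix power. Identifying the exponent with the mixup parameter $\lambda$ is an assumption made for illustration.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)

def quantum_mixup(state_a, state_b, lam):
    """Apply SWAP**lam to the joint state |a> (x) |b> and return the result.

    lam = 0 leaves the pair untouched; lam = 1 fully swaps the two states.
    """
    swap_lam = fractional_matrix_power(SWAP, lam)   # still unitary, since SWAP is unitary
    joint = np.kron(state_a, state_b)
    return swap_lam @ joint

zero = np.array([1, 0], dtype=complex)              # |0>
one = np.array([0, 1], dtype=complex)               # |1>
mixed = quantum_mixup(zero, one, lam=0.5)            # a "half-swapped" |01>
print(np.round(mixed, 3))
```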
Due to time constraints, we have been unable to explore this fully, but we provide comparisons to classical mixup on two CIFAR-10 images.
For easy comparison, we have inverted the relationship with $\lambda$.
As with many QML experiments, we are unable to conclude much more than that the method is successful at achieving the task. However, given the size of the dataset trained on, this is encouraging, as it allows studies of dataset sizes that may be closer to real-world applications.
It also introduces a number of opportunities to explore this further:
- Expansion of the resource estimation to account for additional strategies like multi-programming, different classical embeddings and batch-sizes.
- Integration of the resource estimation with PennyLane differentiation transforms to analyse arbitrary quantum circuits.
- Running the model on quantum hardware to understand the true practicality of the approach.
- Alternative embeddings for images such as angle embedding.
- Training with Quantum Mixup to determine whether this approach has merit.
Hopefully this leaves readers better aware of the practical constraints of QML and encourages thinking about alternative approaches for the NISQ era.
Thanks must go to James R. Wootton and Marcel Pfaffhauser for their experiments with, and implementation of, Quantum Blur, and to David Page for a lightning-fast model for classification on CIFAR-10.