When generating images for ever-growing data sets, the time spent creating them and writing them to disk is no longer negligible. Imagine a data set with 10,000 astronomical sources, each observed in 10 bands. Just writing out one image per band would result in 100,000 images, not counting any analysis plots. If each image takes just 100 milliseconds to generate and save, producing the figures alone would take more than 2.5 hours, and that is without any data loading, processing, or normalization, let alone analysis. The goal is therefore to increase performance and reduce the computing workload to the absolute minimum necessary, so that the data reduction does not run excessively long just to get a quick look at the data.
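The estimate above is simple arithmetic, sketched here in Python. The variable names are only illustrative; the numbers are the ones from the example.

```python
# Back-of-the-envelope estimate of the pure figure-generation time.
n_sources = 10_000        # number of astronomical sources
n_bands = 10              # bands observed per source
time_per_image_s = 0.1    # 100 ms to create and save one image

n_images = n_sources * n_bands
total_hours = n_images * time_per_image_s / 3600

print(f"{n_images} images take about {total_hours:.2f} hours")
```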
In this article I will quickly go over a few things you might want to keep in mind when generating a large number of figures using Matplotlib for Python. There are a couple of obvious and not-so-obvious decisions you can make to speed up the creation of your plots.
The test setup
This is a simplified version of the test setup I used to investigate the creation times. To prevent cached objects from being reused and skewing the benchmark, each configuration was run individually in a fresh IPython session. To reduce the parameter space, I only compared PNG (Portable Network Graphics) and PDF (Portable Document Format) files, likely the two most widely used formats for screen and print media, respectively. Similarly, only 100 and 300 DPI were used for the resolution tests.
    from matplotlib import pyplot as plt
    import numpy as np

    def testSavefig(format, method, dpi, data):
        '''
        Create a test figure and save it to a file.
        @param format: The file format (png or pdf).
        @param method: The method to be used (either 'fig' or 'plt')
        @param dpi: The resolution to use in DPI.
        @param data: The data to be plotted.
        '''
        # Backends to use
        be = {'png': 'agg',
              'pdf': 'pdf'}
        # Create the figure. plt.savefig() only saves the current pyplot
        # figure, so the 'plt' method needs a pyplot-managed figure.
        if method == 'plt':
            fig = plt.figure()
        else:
            fig = plt.Figure()
        ax = fig.add_subplot(111)
        ax.plot(data)
        # Save the figure to a file
        if method == 'fig':
            fig.savefig('test.%s' % format,
                        dpi=dpi,
                        backend=be[format])
        elif method == 'plt':
            plt.savefig('test.%s' % format,
                        dpi=dpi,
                        backend=be[format])
            plt.close(fig)  # release the figure from pyplot's registry

    > format = 'png' # or 'pdf'
    > method = 'fig' # or 'plt'
    > dpi = 100 # resolution in DPI (here 100 or 300)
    > n = 2 # data size order of magnitude
    > data = np.arange(10**n) # simple slope
    > %timeit testSavefig(format, method, dpi, data)
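If you want to repeat such a measurement outside of IPython, the standard library's timeit module can stand in for the %timeit magic. The following is only a minimal sketch, not the setup used for the numbers in this article: the helper save_once, the temporary output directory and the repeat counts are my own assumptions.

```python
import os
import tempfile
import timeit

import numpy as np
from matplotlib.figure import Figure


def save_once(path, dpi, data):
    """Create one line plot and write it to the given file."""
    fig = Figure()
    ax = fig.add_subplot(111)
    ax.plot(data)
    fig.savefig(path, dpi=dpi)  # format is inferred from the file extension


data = np.arange(10**2)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.png")
    # 3 repeats of 5 calls each; keep the fastest repeat.
    times = timeit.repeat(lambda: save_once(path, 100, data),
                          repeat=3, number=5)
    file_size = os.path.getsize(path)

best_ms = min(times) / 5 * 1e3  # milliseconds per figure
print(f"png @ 100 DPI: {best_ms:.1f} ms per figure")
```

Taking the minimum over the repeats reduces the influence of other processes running on the machine.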
1.) How to save a figure? (pyplot vs. figure)
There are two major ways to save a figure in Matplotlib. When working with pyplot, you can use plt.savefig() to save the active figure. Alternatively, you can use the figure's own fig.savefig() method. As the time to save a figure also depends on the number of objects in it, the number of data points is another parameter to keep in mind.
Runtime [ms] by log10(number of points) | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|
pyplot | 81.3 | 83.8 | 122.0 | 427.0 | 546.0 | 5280.0 |
figure (PNG) | 63.5 | 62.3 | 65.8 | 70.4 | 139.0 | 867.0 |
figure (PDF) | 55.0 | 55.5 | 55.5 | 61.9 | 129.0 | 854.0 |
We see here a couple of interesting trends:
- The pyplot.savefig() method is significantly slower than the figure's own savefig() method.
- For the pyplot.savefig() method it does not matter whether we save the figure as PNG or PDF.
- When saving with figure.savefig(), PDF creation is faster than PNG creation.
This means that if you want to create a large number of figures, use the figure.savefig() method rather than pyplot.savefig(), as the former significantly outperforms the latter.
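As a rough illustration of this recommendation, a batch job can build its figures directly from matplotlib.figure.Figure and never touch pyplot. The function save_band_images, the file naming and the random test data below are hypothetical and only serve as a sketch.

```python
import os
import tempfile

import numpy as np
from matplotlib.figure import Figure


def save_band_images(images, out_dir, dpi=100):
    """Write one PNG per 2D array without going through pyplot."""
    paths = []
    for i, image in enumerate(images):
        fig = Figure()                      # not registered with pyplot
        ax = fig.add_subplot(111)
        ax.imshow(image, origin="lower")
        path = os.path.join(out_dir, f"band_{i:02d}.png")
        fig.savefig(path, dpi=dpi)
        paths.append(path)
    return paths


# Hypothetical stand-in for real data: 3 random 32x32 "bands".
rng = np.random.default_rng(0)
bands = rng.random((3, 32, 32))
with tempfile.TemporaryDirectory() as tmp:
    written = save_band_images(bands, tmp)
    n_written = sum(os.path.exists(p) for p in written)
print(n_written)
```

A figure created this way is not kept alive by pyplot, so it can be garbage collected as soon as the loop moves on, whereas pyplot figures stay registered until plt.close() is called.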
Interesting side note: the bump in the pyplot curves at 10^5 data points is real and likely arises from some form of memory-object-size mismatch. Simply speaking: imagine what happens if you have to put a 9-byte object through an 8-byte register. You would have to do the operation twice, wasting almost 50% of the bandwidth.
2.) The resolution and format (PNG vs PDF vs DPI)
As we have already seen, saving a figure as a PDF is faster than saving it as a PNG. We will now compare PNG and PDF creation at two different resolutions: 100 DPI and 300 DPI.
Runtime [ms] by log10(number of points) | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|
PNG / 300 DPI | 176.0 | 171.0 | 171.0 | 180.0 | 251.0 | 981.0 |
PNG / 100 DPI | 63.5 | 62.3 | 65.8 | 70.4 | 139.0 | 867.0 |
PDF / 300 DPI | 55.3 | 55.2 | 55.9 | 62.2 | 130.0 | 837.0 |
PDF / 100 DPI | 55.0 | 55.5 | 55.5 | 61.9 | 129.0 | 854.0 |
Again, we see some interesting trends:
- Creation of PNGs is slower for higher resolutions, which is expected for a bitmap format.
- As seen before, PDFs are created faster than PNGs.
- For PDFs the resolution has no effect on the creation time (which is expected from a vector format).
- No matter which format or resolution is used, for an extremely high number of data points the differences become insignificant, indicating that some other mechanism dominates the file generation process.
- On the other hand, below 100,000 data points the differences are approximately constant.
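To see why the resolution matters for a bitmap format, it helps to count pixels. The sketch below assumes Matplotlib's default figure size of 6.4 x 4.8 inches, which is what you get when no figsize is set; the helper pixel_count is only illustrative.

```python
# Default Matplotlib figure size in inches (width, height).
width_in, height_in = 6.4, 4.8


def pixel_count(dpi):
    """Number of pixels in the rendered bitmap at the given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


low = pixel_count(100)    # 640 x 480
high = pixel_count(300)   # 1920 x 1440
print(low, high, high / low)
```

Tripling the DPI gives nine times as many pixels, while the measured PNG runtime for small data sets only grows from 63.5 to 176.0, a factor of roughly 2.8. This suggests that a part of the runtime does not depend on the resolution.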
In conclusion, this means that if your plot has fewer than 100,000 data points, the format and resolution you choose can make a significant difference to the runtime. If you create figures for screen inspection, choosing a lower resolution is advisable, as long as you can still see the details you want to identify in your plots. A hybrid approach of creating PNG thumbnails and linking the PDFs for detailed inspection might be preferable to creating high-resolution PNGs.
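The hybrid approach could look roughly like the following sketch. The function name, the figure size and the thumbnail resolution of 50 DPI are assumptions chosen for illustration, not values from the benchmark.

```python
import os
import tempfile

import numpy as np
from matplotlib.figure import Figure


def save_thumbnail_and_pdf(data, basename, thumb_dpi=50):
    """Save a low-resolution PNG preview and a PDF of the same figure."""
    fig = Figure(figsize=(4, 3))
    ax = fig.add_subplot(111)
    ax.plot(data)
    png_path = basename + ".png"
    pdf_path = basename + ".pdf"
    fig.savefig(png_path, dpi=thumb_dpi)   # small raster preview
    fig.savefig(pdf_path)                  # vector version for zooming in
    return png_path, pdf_path


with tempfile.TemporaryDirectory() as tmp:
    png, pdf = save_thumbnail_and_pdf(np.arange(100),
                                      os.path.join(tmp, "source_0001"))
    with open(pdf, "rb") as fh:
        pdf_header = fh.read(4)
    png_exists = os.path.exists(png)
```

The figure is built only once and saved twice, so the plotting work itself is not repeated for the second format.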
Although not included in the test, it is worth noting that creating an Encapsulated PostScript (EPS) file rather than a PDF file results in a further speed-up of up to 10 percent.
3.) The use of LaTeX
In Matplotlib you have three options when annotating your figures with mathematical expressions:
- use plain text and keep it simple (i.e. use no $[…]$ notation at all),
- use Matplotlib's own mathtext capabilities (usetex=False),
- or use an external LaTeX package (usetex=True).
Although it seems obvious, it is worth mentioning that the choice you make has a significant impact on the creation time of your figures. The following table lists the runtimes for the three methods when only the x- and y-axis labels contain LaTeX expressions (or do not contain them, in the plain-text case):
Mode | Runtime |
---|---|
no LaTeX | 58 ms |
mathtext | 89 ms |
LaTeX | 155 ms |
From these results it is clear that you should only use an external LaTeX package when it is absolutely necessary: it is about 2.7 times slower than plain text and still about 1.7 times slower than Matplotlib's own mathtext. You would probably only use an external LaTeX package for a small number of figures, or even just for publication, when you no longer iterate.
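The three labelling modes can be sketched as follows. The label texts and the helper render_labels are hypothetical; the only Matplotlib setting involved is the text.usetex rcParam. The LaTeX variant is left commented out because it requires a working LaTeX installation.

```python
import io

import matplotlib
from matplotlib.figure import Figure

# Hypothetical axis labels for the plain-text and mathtext variants.
LABELS = {
    "plain": ("wavelength [micron]", "flux density [Jy]"),
    "math": (r"$\lambda$ [$\mu$m]", r"$F_\nu$ [Jy]"),
}


def render_labels(kind, usetex=False):
    """Render a small figure with the chosen label style to PNG bytes."""
    xlabel, ylabel = LABELS[kind]
    # text.usetex=True hands all text to an external LaTeX installation.
    with matplotlib.rc_context({"text.usetex": usetex}):
        fig = Figure()
        ax = fig.add_subplot(111)
        ax.plot([0, 1], [0, 1])
        ax.set_xlabel(xlabel)
        ax.set_ylabel(ylabel)
        buf = io.BytesIO()
        fig.savefig(buf, format="png")
    return buf.getvalue()


plain_png = render_labels("plain")                # no $...$ at all
mathtext_png = render_labels("math")              # Matplotlib's mathtext
# latex_png = render_labels("math", usetex=True)  # needs a LaTeX installation
```

Keeping the switch in one place makes it easy to iterate with plain text or mathtext and to enable LaTeX only for the final publication figures.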
Closing remarks
PNG is a screen format, whereas PDF is made for print media, so you might not actually have much of a choice. For print media (E)PS might be an option, although these formats are widely considered outdated. For screen presentations you could switch to SVG (Scalable Vector Graphics), e.g. when working with Jupyter Notebooks in a browser. This should give results similar to PDF, as it is also a vector format. However, SVG is not supported by all software, and its rendering can still be slightly inconsistent even today, so you would have to check on a case-by-case basis whether this is an option.
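Writing SVG works the same way as writing PNG or PDF. The sketch below, which was not part of the benchmark, saves a figure as SVG into an in-memory buffer. In a Jupyter Notebook the resulting string could then be passed to IPython.display.SVG for display.

```python
import io

from matplotlib.figure import Figure

fig = Figure(figsize=(4, 3))
ax = fig.add_subplot(111)
ax.plot([0, 1, 4, 9])

buf = io.BytesIO()
fig.savefig(buf, format="svg")           # vector output, like PDF
svg_text = buf.getvalue().decode("utf-8")
print(len(svg_text), "characters of SVG markup")
```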
Note that the discussion in this article is only relevant if you have to create a really large number of plots. If you only have to create a few figures for a journal publication or a conference presentation, do not bother. In these cases just use the most convenient approach.