Introduction to Multithreading in Python
Got slow Python code?
Don't want to refactor to some other "low-level" language?
Want to impress your peers?
We got all that here!!
Now that you are actually here, today we will cover ways to optimize our Python code, specifically using multithreading and asyncio.
In this article, you will learn:
- Different methods to speed up Python
- What concurrency and parallelism are
- How to choose the appropriate speed up method
- How to use the Python asyncio library
Prerequisites
This article assumes that you know the basics of Python and have at least Python 3.7 installed to run the examples (asyncio.run, which the examples use, was added in Python 3.7).
How to Speed Up Python Code?
Let's imagine you've got slow Python code.
What to do in this situation?
First of all, make sure to profile your code so that you have raw numbers to compare against after applying optimizations.
Second, if your code contains business logic, make sure that it is well tested, because we don't want our optimizations breaking things.
Third, try to optimize the code itself, for example by using more suitable algorithms and data structures.
You can check out similar optimizations here:
If it's still not enough and you don't want to refactor your code into a faster low-level language then let's optimize our code to use more of the hardware.
But what does that actually mean?
In layman's terms, it's multitasking.
But there are two different ways of doing that.
Concurrency
Concurrency is fake multitasking, meaning that you don't run things simultaneously. Instead, you take turns, which makes it look like you're running things simultaneously. But you might ask: what do I "run" exactly? It depends on which technique you use.
The two units you can run are threads and processes.
If you look at it from a high-level perspective, they are all the same. They are simply blocks of code waiting to be run. But if you dig deeper, you would find that they are very different.
A process is an instance of a running program with all its code, memory, data and other resources. A thread, on the other hand, is a sequence of code that is executed within the scope of a process. You can have multiple threads running in a single process, hence multithreaded programming.
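To make the "threads live inside a process" idea concrete, here is a minimal sketch. The names run_threads and record are made up for this illustration. Each thread writes to the same list and reports the same process ID, because all of them share one process and its memory.

```python
import os
import threading


def run_threads(names):
    """Start one thread per name and collect (name, process id) pairs."""
    log = []  # a single list in the process's memory, shared by every thread

    def record(name):
        # Every thread sees the same `log` object and the same process id
        log.append((name, os.getpid()))

    threads = [threading.Thread(target=record, args=(name,)) for name in names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every thread has finished
    return log


if __name__ == "__main__":
    for name, pid in run_threads(["a", "b", "c"]):
        print(f"thread {name} ran in process {pid}")
```

Separate processes would each get their own process ID and their own copy of the data, which is why sharing results between processes takes extra work.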
Multithreading in Python
Multithreading in Python is somewhat "different" because of the Python Global Interpreter Lock or GIL.
The GIL allows only one thread to execute Python code at a time, meaning that CPU-bound operations see no benefit from multithreading in Python.
On the other hand, if your bottleneck comes from Input/Output (IO), then you would benefit from multithreading in Python, because a thread that is waiting on IO releases the GIL and lets another thread run.
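Here is a rough sketch of the IO-bound case. It assumes time.sleep is a fair stand-in for a slow network or disk call (sleeping releases the GIL, just like real IO waits do). The function names fake_download, run_sequential, run_threaded and timed are hypothetical helpers for this example only.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(item_id, delay=0.2):
    """Stand-in for an IO call; the thread just waits, as it would on a socket."""
    time.sleep(delay)
    return item_id


def run_sequential(ids):
    # One call after another: total time is roughly len(ids) * delay
    return [fake_download(i) for i in ids]


def run_threaded(ids):
    # One thread per item: the waits overlap, so total time is roughly one delay
    with ThreadPoolExecutor(max_workers=len(ids)) as pool:
        return list(pool.map(fake_download, ids))


def timed(func, ids):
    """Return the function's result together with the elapsed wall-clock time."""
    start = time.perf_counter()
    result = func(ids)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    ids = [1, 2, 3, 4]
    _, seq_time = timed(run_sequential, ids)
    _, thr_time = timed(run_threaded, ids)
    print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

If you swapped the sleep for a pure-Python calculation, you should expect the threaded version to lose its advantage, since the threads would then be competing for the GIL instead of waiting.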
But there are two ways to implement multithreading in Python:
- The threading library
- The asyncio library
But what's the difference between the two?
The threading library creates actual OS-level threads, but only one of them can run Python code at a time due to Python's GIL. On the other hand, asyncio uses the concept of coroutines, which are much more lightweight than threads. They take less memory, and it takes much less time to switch between coroutines. However, you need to program specifically for asyncio and use libraries that leverage asyncio as well. Threading is less scalable, but you get to keep your "old" libraries and style of programming.
In general:
Use asyncio when you can, threading when you must.
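To give a feel for how lightweight coroutines are, here is a small sketch that starts a thousand of them at once. The names fake_request and run_many are invented for this example, and asyncio.sleep stands in for a real asynchronous IO call.

```python
import asyncio
import time


async def fake_request(i, delay=0.1):
    """Stand-in for an async IO call; awaiting the sleep yields to the event loop."""
    await asyncio.sleep(delay)
    return i


async def run_many(count):
    # gather runs all coroutines concurrently and returns results in input order
    return await asyncio.gather(*(fake_request(i) for i in range(count)))


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(run_many(1000))
    elapsed = time.perf_counter() - start
    print(f"{len(results)} coroutines finished in {elapsed:.2f}s")
```

Starting a thousand OS threads for the same job would work too, but each thread carries its own stack and scheduling cost, which is the scalability gap mentioned above.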
Parallelism
Parallelism is true multitasking, meaning that you are literally running processes simultaneously. This is done using multiprocessing, where tasks are distributed across multiple CPU cores. This doesn't "break" the code up into parts; each process is a complete running copy of your program, typically on its own core.
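As a minimal sketch of what that looks like, the example below spreads a CPU-bound job over a pool of worker processes. The function count_primes and the chosen limits are made up for illustration; any pure-Python number crunching would do.

```python
import multiprocessing


def count_primes(limit):
    """CPU-bound work: count the primes below `limit` using trial division."""
    count = 0
    for n in range(2, limit):
        # n is prime if no d in [2, sqrt(n)] divides it evenly
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


if __name__ == "__main__":
    # The guard matters: worker processes may re-import this module on startup
    limits = [20_000, 30_000, 40_000, 50_000]
    with multiprocessing.Pool() as pool:
        # Each limit is handed to a separate worker process
        results = pool.map(count_primes, limits)
    print(dict(zip(limits, results)))
```

Because every worker is its own process with its own interpreter and GIL, the calculations can genuinely run at the same time. The price is that arguments and results have to be copied between processes.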
So which method do I choose?
Like everything on the internet, it depends. But it's pretty simple figuring out which method to use:
- If you've got a CPU-bound problem, then you would benefit from running on multiple cores, hence the Python multiprocessing library will help.
- If you've got an IO-bound problem, then use the asyncio library, or the threading library if asyncio is not compatible.
Asyncio Code Example
Using asyncio is pretty simple. If you have ever used async/await in JavaScript, it's almost syntactically the same.
Let's start with a classic, a simple hello world program:
import asyncio

async def main():
    print('hello')
    await asyncio.sleep(1)
    print('world')

asyncio.run(main())
We declared async before our main function to tell Python that this is an asynchronous function (a coroutine). Inside the main function, we put await before the call to sleep to tell Python to pause main until the sleep finishes. While main is paused, the event loop is free to run other tasks.
Finally, we call the function using the run function that asyncio provides.
If we run the program, we get this response:
hello
world
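With only one coroutine there is nothing else to run during the wait, so here is a small sketch with two of them. The names worker and demo are invented for this example. The log shows that while one worker is paused at its await, the other gets to run.

```python
import asyncio


async def worker(name, delay, log):
    log.append(f"{name} start")
    await asyncio.sleep(delay)  # pause here and hand control to the event loop
    log.append(f"{name} done")


async def demo():
    log = []
    # A sleeps longer than B, so B finishes first even though A started first
    await asyncio.gather(worker("A", 0.2, log), worker("B", 0.1, log))
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
    # prints ['A start', 'B start', 'B done', 'A done']
```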
This is a pretty simple example, so let's look at something more "real-world".
Let's imagine you were tasked with scraping a website, but using a synchronous Python web client is pretty slow. Doing it asynchronously can be much faster, especially when there are many pages to fetch!
We will use the aiohttp library for its asynchronous HTTP client. It is a third-party package, so install it first with pip install aiohttp.
import aiohttp
import asyncio

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://python.org') as response:
            print("Status:", response.status)
            print("Content-type:", response.headers['content-type'])
            html = await response.text()
            print("Body:", html[:15], "...")

asyncio.run(main())
These are the bare basics of asyncio. If you want to learn more, I would recommend checking out this article and tech talk.
Benefits and Downsides of Multithreading
Everything in software is relative, meaning that there are pros and cons to everything, and it's up to us as software engineers to decide whether a technology is useful for our use case.
Multithreading in Python is no different.
Let's start with the benefits.
If it's an IO-bound problem, multithreading will significantly improve performance.
That's pretty much it?
Well, asynchronous programming is a lot different from sequential programming. For some domains, it might be very beneficial to switch over to asynchronous programming, for others not so much.
The downside of multithreading is that it makes things a lot more complicated, and complicated code is harder to maintain. It is also hard to test, because timing-dependent bugs make tests flaky, and hard to debug.
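A small taste of that extra complexity: as soon as threads share data, you have to protect it. The sketch below uses a made-up Counter class plus hypothetical hammer and run helpers. The line self.value += 1 is a read-modify-write that is not guaranteed to be atomic across threads, so it is wrapped in a lock.

```python
import threading


class Counter:
    """A counter that is safe to increment from several threads."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads could read the same old value
        # and one of the updates would be lost
        with self._lock:
            self.value += 1


def hammer(counter, times):
    for _ in range(times):
        counter.increment()


def run(thread_count=4, times=10_000):
    counter = Counter()
    threads = [
        threading.Thread(target=hammer, args=(counter, times))
        for _ in range(thread_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


if __name__ == "__main__":
    print(run())  # prints 40000
```

Bugs of this kind may only show up on some runs, which is exactly what makes multithreaded code flaky to test.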
At the end of the day, it all depends on your use case, so think wisely before committing.
Conclusion
Speeding up Python code can be a painful experience.
But at least now you know a trick or two on how to speed things up.
Today you learned:
- Multithreaded programming is when the program utilizes multiple threads to improve performance.
- Due to Python's GIL, only one thread can run Python code at a time, which is why multithreading in Python is regarded more as asynchronous programming.
- If you've got a CPU-bound problem, then your best bet is to use multiprocessing, which bypasses the GIL.
- If you've got an IO-bound problem, then use the asyncio library.
- Asynchronous programming is pretty complicated, so think a lot before committing to use it.
I hope you enjoyed this article. If you have any questions, feel free to reach out to me on Twitter.
Thanks for reading