rss-bridge 2025-03-13T14:44:49+00:00

Capchan – Solving CAPTCHA with Image Classification

A few years ago, I tried my hand at the, now retired, CAPTCHA Forest CTF, which was part of the nullcon HackIM 2019. I wanted to solve it using computer vision and machine learning. This started me on a path of discovery and incremental improvements that finally resulted in capchan, a generic CAPTCHA to text tool.
This post is broken into four parts:

The first CTF
The second CTF
Neural Network Fundamentals
Creating capchan

ATTEMPT_ZERO
Starting the CTF – I connected to the netcat instance, but after staring at hexadecimal, I immediately closed it and moved on to try another CTF.

This post is broken into four parts:

The first CTF

The second CTF

Neural Network Fundamentals

Creating capchan

ATTEMPT_ZERO

Starting the CTF – I connected to the netcat instance, but after staring at hexadecimal, I immediately closed it and moved on to try another CTF.

That was my first experience with Image Classification, and sure, some might call it an “unsuccessful attempt”. However, I remained optimistic. A while later I set my sights on it again, failed to install some dependencies in Python – moved on to another CTF. At this point, we could definitely call it an unsuccessful attempt.

Other than noting that I should use venv a bit more, I didn’t fully commit to solving this CTF. That’s more so that I wanted to get the easy points out of the way for my leaderboard chasing, rather than spending a lot of time on a few points. That said, the concept remained super interesting and I was itching to return to it whenever time allowed.

Sensecon time, Matthew and his doppelgängers, Ethan & Adriaan, suggested “Image Geolocating through AI” as our 24 hour hackathon idea. Not just did I join the team to be with Me, Myself and I (Ethan and Matthew) I knew it would benefit me in the long run for solving the CAPTCHA CTF. One Sensecon and a few dedicated research weeks later – this is where I am.

CAPTCHA_ONE (foreshadowing with one here)

Unfortunately, my fellow buzzword enjoyers – for the first CTF AI/Machine Learning wasn’t needed to retrieve the flag. However, I’ll give you the lay of the land here.

So what does the CTF look like?

Initial prompt after connecting to the instance.

Not exactly the most attractive ‘image’ related CTF.

What’s happening? We connect to the instance using netcat, thereafter; the prompt states to solve the CAPTCHA 200 consecutive time in order for the flag to be given. The image is given in hexadecimal format. Viewing the image:

Echo image into file, then displaying said file.

From this, we can see that my terminal is cool and can display images, but also that the CAPTCHA is some sick joke of a hieroglyphic. Luckily the prompt was generous enough to provide the mappings as Bill Cipher, from Gravity Falls, substitution cipher.

Pretty Pictures.

Referencing the previous image, the CAPTCHA code is FBTP, easy peasy – lemons got squeezed. The issue is I had to solve this 200 times. The clear answer was automation – although, I truly believe that the devs that made CAPTCHA Forest underestimate my tenacity to manually do this 199 more times – I decided to ‘play the game’.

Rinse and Repeat

Example of loop with connecting, receiving code, entering code, i++.

How could I automate this? Checksum! Make a checksum of the image, since it will always be the same image for the same alphabetic characters. With this, one could connect to netcat, retrieve the image, make a checksum of the image and compare to the existing values to find the four character value, submit it back over the exiting netcat session – rinse and repeat. That’s probably what the ‘smart people’ would have done, but not around these parts. I wanted to solve this being as ‘close’ to the image as possible. So what did I decide on?

I decided to complete the challenge by using the pixel values, which is easy enough considering that the colour of the image would not determine the value of the characters. Solely taking pixel position into account, what I did was, once again, use the remote server to retrieve all the images until I had the full alphabet of ciphers. Then convert the images to black and white – just in case there were some dark spots or other values that could mess up the pixel data.

The other half of this was to make a script to retrieve the images from the server over netcat and compare pixel data, matching images would then be given the correct value and resubmitted over netcat until 200 iterations has been made.

TL;DR:

Take the hex over netcat and change it to a png.

This large image will have four symbols, so split it into four images instead, making it one symbol per image.

Determine the matches for each image and compile a four character code.

Send this back to netcat, and redo steps one through four.

More Visual

I now have Bill Cipher symbols as the images. A set of four symbols in a single image, which has now been made four images with a single symbol each. Why do it this way? We’ll, I am, in a way, hardcoding a checksum in a different way – with pixels. But having to test four images against 26 characters each is a lot better than checking against 26^4 or 456,976 possible combinations. Remember that I first have to define the value of each image since that the point of all this, and I am more than willing to do it 26 times, rather than 456,976 times over.

Displaying the Cipher with the alphabetic value of A.

25 more times of the above and offline data collection complete. What remains is splitting the image from netcat into four parts to the same size as the offline images and compare. Since the CAPTCHA consistently provides the symbols (no rotations, distortion, etc) I should be able to get a 1:1 match.

Below I have an example of my offline image I retrieved from netcat and an image from a more recent netcat interaction.

Example of ad-hoc image and a offline copy of the same cipher.

To visualise the above a bit more, this is what the script “sees” when comparing both images.

Cipher of C.

Cipher of C, again.

So with the 1:1 match, what’s left to do is have the script send the values of each split image to netcat and repeat the process 200 times, while, of course, combining the image results in a single four character code.

Example of the final few iterations of the CTF.

I made the scripts one by one for each problem and then added them all together to get to this point. And by adding them all together I mean this:

Lazy coding (I use vim BTW).

I am so sorry for those who code in Rust and Haskell – that they had to witness the abomination of what I have spawned in Python. *Sidenote: When one gets bored they tends to add script flags and coloured output… Also, yes. My script is called szizzleIMON – Szymon told me to use tesseract in standup and this was my retaliation to his good idea.*

Most of these scripts are simple and specific to the CTF, like connecting to netcat, taking the image, splitting it and sending it back to netcat (In a single script of course, lol). The most important thing is how the images were processed and compared:

The main determining script.

The compare_images function loops through each folder in my dataset. It takes these images and converts it to black and white, in case there isn’t any noise. It uses PIL to open each image then compares the size and pixel data to find a 1:1 match.

The link to the full code for both CAPTCHA CTFs is in the GitHub repo. Both CAPTCHA CTFs?

CAPTCHA_TWO (dun dun duuun)

https://ctftime.org/task/7508

To my absolute shock, when I submitted the flag, another CTF popped up – with some aura might I add, with its name being “Captcha Forest Harder”.

So why is it “Harder”? I had a quick glance at the new netcat instance and the description read; This challenge needs to be solved 120 times out of 200 tries to output the flag. I’ve never read anything scarier (other than the word “boo”).

[...]

Original source