Until relatively recently, chatbots could only respond to us in plain text, unable to see anything beyond it. But now, models like GPT-4 not only understand what we write to them, they have also learned to “see” images. And although it may sound like science fiction (until recently, it did…), it’s simpler to understand than you might think. Let’s take a look at how this works.
How on earth does a machine see?
To put it into perspective, imagine showing a photo of a beach sunset to a small child. They will recognize the sun, the sea, the sand… the usual stuff. Well, an artificial intelligence doesn’t have eyes or a childhood, so it needs another method. Let’s break it down.
For a language model like GPT-4 to “see,” the process involves three key steps. I’m going to unpack them for you, but don’t worry—I’ll keep it simple and jargon-free.
- Translating pixels into numbers: the “language” of machines
To an AI, images are nothing more than a soup of numbers. Take a photo on your phone: each pixel stores color values (how much red, green, and blue it contains), so the whole image is like a giant spreadsheet filled with numbers. But of course, a machine doesn’t know that’s a cat or a car. It needs help to find patterns.
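If you want to see this “spreadsheet” for yourself, a few lines of Python make it concrete (the file name below is just a placeholder):

```python
from PIL import Image
import numpy as np

img = Image.open("sunset.jpg")   # hypothetical photo on your phone
pixels = np.array(img)           # a 3-D grid of numbers: height x width x RGB

print(pixels.shape)              # e.g. (1080, 1920, 3)
print(pixels[0, 0])              # the top-left pixel, e.g. [212 178 96]
```

That array is all the machine gets. Nothing in it says “cat” or “car” yet.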
This is where specialized neural networks come into play: CNNs (Convolutional Neural Networks). Their job is clear: scan the image to find edges, textures, or shapes. For instance, if there are curved lines and shiny eyes, the CNN might deduce from that data: “this looks like an animal.” But wait, there’s more.
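To make that “scanning for edges” idea tangible, here’s a toy version: a single hand-written 3×3 filter slid across a grayscale image. A real CNN learns thousands of filters like this on its own instead of using a fixed one:

```python
import numpy as np
from scipy.signal import convolve2d

gray = np.random.rand(64, 64)    # stand-in for a grayscale photo

sobel_x = np.array([[-1, 0, 1],  # a classic filter that responds
                    [-2, 0, 2],  # strongly to vertical edges
                    [-1, 0, 1]])

edges = convolve2d(gray, sobel_x, mode="same")
print(edges.shape)               # (64, 64): an "edge map" of the image
```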
- What matters here? Selecting the essentials
Once the image is translated into data, the model needs to separate the wheat from the chaff. Is it a person? Is there text? What emotion does it convey? Here, attention mechanisms (yes, like ours) help the AI focus on what’s truly important. It’s like when you read a page and underline key phrases to study later.
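Under the hood, that “underlining” boils down to a few lines of math: compare every part of the input with every other part, turn the comparisons into weights, and take a weighted mix. A minimal NumPy sketch (the shapes and names here are mine, purely illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how relevant is each part to each other part?
    return softmax(scores) @ V               # weighted mix: focus more on the relevant parts

regions = np.random.rand(5, 8)               # 5 image regions, 8 features each
out = attention(regions, regions, regions)   # self-attention: each region looks at the others
print(out.shape)                             # (5, 8): same regions, now context-aware
```

Real models run many of these in parallel, but the recipe is the same.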
But the real “magic” happens next…
- Connecting images and words: the “universal language” of AI
Remember when you learned to associate the word “apple” with the red fruit? (I’m sure you don’t, but still.) It’s very similar to what these models do. Through multimodal learning, they relate millions of images to their descriptions. For example, they see a photo of a dog and read the text “puppy playing in the park.” Over time, they understand that certain shapes and colors correspond to certain words.
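The core idea is surprisingly compact: two encoders projecting into one shared vector space, where a photo and its caption should land close together. A toy sketch, with random weights standing in for what real models learn from millions of pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
W_img = rng.normal(size=(512, 128))   # image encoder (random stand-in)
W_txt = rng.normal(size=(300, 128))   # text encoder (random stand-in)

img_feat = rng.normal(size=512)       # e.g. visual features of a dog photo
txt_feat = rng.normal(size=300)       # e.g. features of "puppy playing in the park"

img_vec = img_feat @ W_img            # both end up as 128-number vectors...
txt_vec = txt_feat @ W_txt            # ...in the same shared space

cos = img_vec @ txt_vec / (np.linalg.norm(img_vec) * np.linalg.norm(txt_vec))
print(f"similarity: {cos:.3f}")       # training pushes this up for true pairs
```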
And just like that, the AI can do things like describe an image or even create one from scratch using just words! (Speaking of which, let me plug Flat AI a bit: you can do this on our free, no-signup AI image generator.)
Vision Transformers: The “new player” changing the game
Until recently, CNNs were the kings of image processing. But now there’s a new player on the field: Vision Transformers (ViT). What’s their advantage? Instead of scanning an image with small local filters, as CNNs do, ViTs chop it into patches and let every patch relate to every other one, so they take in the “big picture” from the start.
Here’s an analogy: imagine taking apart a puzzle and spreading the pieces on a table. A ViT studies how each piece fits with the others, not just the ones next to it. This way, it understands broader contexts, like a steering wheel being inside a car, not floating in mid-air.
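The very first thing a ViT does is that puzzle step: cut the image into fixed-size pieces that attention then relates to one another, near or far. A quick sketch using the 16×16 patch size from the original ViT paper:

```python
import numpy as np

image = np.random.rand(224, 224, 3)    # stand-in for an RGB photo
P = 16                                  # patch size from the original ViT

patches = image.reshape(224 // P, P, 224 // P, P, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)
print(patches.shape)                    # (196, 768): 196 flattened patch "tokens"
```

Each of those 196 vectors becomes a “word” in the transformer’s input, which is exactly why the same architecture works for both text and images.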
Why is this revolutionary? Because it brings machine vision closer to how we understand things: globally, not piece by piece.
CLIP: The “translator” between images and text
If there’s one model that makes me say, “This is mind-blowing!”, it’s CLIP, developed by OpenAI. Think of it as a bilingual friend who speaks both “image” and “text.” How does it work?
- It has two separate “brains”: one processes images, the other processes text.
- It’s trained on millions of photos and their descriptions (like “a cat sleeping on a red couch”).
- It learns to match each image with its correct text and also to distinguish incorrect pairs.
It’s like being shown photos of dishes and their recipes: over time, you’d know which ingredients correspond to each image. This way, CLIP can, for example, search for images based on text or generate accurate descriptions.
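Since OpenAI released CLIP openly, you can play this matching game yourself. Here’s a sketch using the Hugging Face transformers library and OpenAI’s public checkpoint (the photo file name is hypothetical):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("couch_cat.jpg")   # hypothetical photo
texts = ["a cat sleeping on a red couch", "a dog playing in the park"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)                          # higher probability for the caption that fits
```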
What’s all this for? Applications that are already here
You might be thinking, “Okay, this is interesting, but… how does this affect my daily life?” Here are some concrete examples:
Automatic photo tagging
Do you have thousands of photos on your phone? With this technology, the AI could organize them by location, people, or even mood. Say goodbye to searching for “beach 2018” among 500 photos!
Image-based search
Have you ever seen a dress on Instagram and wanted to find a similar one? Now, instead of typing “blue dress with flowers,” you upload the photo, and the AI shows you options. It’s like having a visual personal assistant!
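A simple version of that visual search can be built with the same CLIP encoder: embed the query photo and a catalog of product photos, then rank by similarity. A hedged sketch (the file names are made up):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

catalog = ["dress_01.jpg", "dress_02.jpg", "shirt_01.jpg"]
images = [Image.open(f) for f in ["query.jpg"] + catalog]

inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    emb = model.get_image_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)    # normalize for cosine similarity

scores = emb[0] @ emb[1:].T                   # query vs. each catalog item
print(catalog[scores.argmax().item()])        # the closest-looking product
```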
Smarter content moderation
Social networks already use this to detect inappropriate content. But in the future, AI could understand complex contexts: differentiating between artistic nudity and explicit content, for example.
Assistants that see what you see
You can ask your phone, “What kind of plant is this?” while pointing it at a flower. The AI would analyze the image and respond in seconds. In fact, this is already possible with Google’s new Gemini model… you can try it in AI Studio. It’s spectacular… I highly recommend testing it out. Just click on ‘Show Gemini’ and grant permissions so it can see through your smartphone camera or, if you’re on a PC, see what you’re doing!
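And if you’d rather do this from code than from AI Studio, Google’s Python SDK lets you send an image alongside a question. A hedged sketch; the model name is my assumption and may need updating to whatever vision-capable Gemini model is current:

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")            # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

photo = Image.open("mystery_plant.jpg")            # hypothetical photo
response = model.generate_content([photo, "What kind of plant is this?"])
print(response.text)
```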
The future: Toward an AI that “feels” like us?
This is where I get excited. The advancements aren’t slowing down, and what’s coming could change everything:
Real-time AI
Soon, these models will analyze images instantly. Think of self-driving cars that “see” obstacles like a human, or augmented reality glasses that translate signs in another language while you walk.
Understanding the abstract
Today, AI recognizes objects. But what if it could detect emotions in a photo? Or interpret visual metaphors, like a broken heart drawn on paper?
Limitless creativity
Tools that generate images from text already exist. In the future, you could describe a complex scene (“a dragon in the sky over Times Square at sunset”) and get a realistic, accurate image in seconds. Designers and artists will have an incredible ally!
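In fact, you can already run that dragon prompt today with an open model via the diffusers library. A sketch, assuming a machine with a GPU (the checkpoint is one public example, not a recommendation):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")                                       # assumes a CUDA GPU

image = pipe("a dragon in the sky over Times Square at sunset").images[0]
image.save("dragon.png")
```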
To wrap up…
If you’ve made it this far, thank you for letting me share this passion with you. The next time you upload a photo to social media or ask your favorite AI chatbot a question, remember: behind it all is a revolution of zeros and ones… and a little bit of magic.
And remember: artificial intelligence isn’t here to replace us, but to amplify what humans already do well. Including, of course, marveling at technological advancements.