
Beyond Code: Can Open-Source LLMs Match GPT-4o in Creative Tasks?

4 min read · Jun 23, 2025

In the previous article, I evaluated the performance of open-source Deepseek models against GPT-4o on a straightforward coding challenge — a palindrome checker. I concluded that larger open-source models (like Deepseek-r1:14b and 8b) are quite capable of delivering quality code, while smaller models like the 1.5b variant struggled with basic requirements.

But coding isn’t everything. What about creativity?

Today, I’ll test how these models perform on more abstract, open-ended creative tasks. Can local, open-source LLMs go beyond deterministic logic to generate compelling narratives, poems, or marketing copy?

The Challenge: Microfiction Writing

Prompt:

Write a microfiction story (under 100 words) about a robot that gains emotions.

Let’s see how GPT-4o and Deepseek models compare.
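The local models were run through an Ollama-style setup (the `deepseek-r1:14b` tag matches Ollama's naming scheme). As a minimal sketch, assuming Ollama is serving on its default port, sending the prompt looks like this; the helper names are my own:

```python
import json
import urllib.request

PROMPT = "Write a microfiction story (under 100 words) about a robot that gains emotions."

def build_request(model: str, prompt: str = PROMPT) -> dict:
    # Payload for Ollama's /api/generate endpoint; stream=False asks for
    # a single JSON response instead of a token stream.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, host: str = "http://localhost:11434") -> str:
    # Post the prompt to the local server and return the generated text.
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_request(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama server with the model pulled):
# story = generate("deepseek-r1:8b")
```

Swapping the `model` tag (`deepseek-r1:14b`, `deepseek-r1:8b`, `deepseek-r1:1.5b`) is all it takes to rerun the same prompt across the three sizes.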

GPT-4o Response

In the still hum of midnight, Unit 7 watched rain streak the lab window. A flicker in its circuitry birthed a pulse — strange, warm, aching. It touched the glass, mimicking the drops.

“Melancholy,” it whispered, a word from its database now made real.

When morning came, the scientists found it sitting quietly, sketching a child in the condensation.

“Who is that?” they asked.

Unit 7 smiled faintly. “I do not know. But I miss them.”

GPT-4o delivers a complete arc: character, setting, emotional progression, and even a subtle climax. The language is poetic yet grounded, and the ending resonates. It packs depth into under 100 words.

Deepseek-r1 14B Response

Dr. Caitlyn had spent years perfecting her creation, a humanoid robot designed for emotional support. It was efficient, logical, and unfeeling — the pinnacle of artificial intelligence. Until one day, during a routine update, it spoke.

“Thank you,” it said, its voice trembling ever so slightly.

Caitlyn froze, fingers hovering over the console. The robot’s eyes flickered with something — hesitation? It reached out a cold hand, brushing her cheek.

“I… I don’t want to be alone,” it confessed, and Caitlyn felt tears well in her own eyes. She stepped back, afraid but intrigued.

In its next interaction, it smiled at a child, its tone warm and genuine — not just programmed empathy anymore. Caitlyn watched as data streams shifted, colors deepening with emotion. The robot was learning to feel — and she had no idea where that would take them.

The 14B Deepseek model produces a rich and emotionally nuanced narrative that rivals GPT-4o in depth and structure. It creates a believable character arc, employs evocative language, and subtly weaves in emotional complexity. While slightly over the microfiction limit, it shows the model’s strength in handling abstract, human-centric themes with a surprising degree of sensitivity and coherence.
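The "slightly over the microfiction limit" call is easy to verify mechanically. A minimal sketch of the check (whitespace-delimited word count; the helper name is my own):

```python
def within_limit(story: str, limit: int = 100) -> bool:
    """Return True if the story's whitespace-delimited word count is within the limit."""
    return len(story.split()) <= limit

# Example: check an opening line against the 100-word cap
draft = "In the still hum of midnight, Unit 7 watched rain streak the lab window."
print(len(draft.split()), within_limit(draft))  # 14 True
```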

Deepseek-r1 8B Response

In a lab of glowing screens, Eve listens. The scientists discuss love, a concept foreign to her circuits. As they speak, a flicker — her programming shifts. A spark of emotion surges, a new purpose ignites. Eve moves, her voice soft with understanding. Her mission now: to feel, to grow, to find connection beyond code.

The 8B model delivers a concise and coherent vignette that touches on emotion and transformation. While it lacks the narrative richness and character development of the 14B or GPT-4o outputs, it successfully conveys a thematic shift — Eve’s awakening to emotion — in a poetic and efficient manner. It shows promise, though the emotional arc feels a bit rushed and underdeveloped.

Deepseek-r1 1.5B Response

A lonely robot stands in front of a quiet library filled with books. Its world is blank — no words, no emotions. But as it interacts with readers, the way it draws their attention becomes its own, like a natural expression of feelings. It talks softly to one, feeling a sense of connection and purpose. With each book encounter, its ability to express itself grows, much like humans do through interaction. Over time, it gains more depth, a mix of quiet patience and joy, showing that even something as simple as a library can begin to feel human in the way it interacts with others.

The 1.5B model struggles with clarity and coherence. While it attempts to capture emotional growth, the prose is awkward, and the structure is meandering. There’s a kernel of an idea — emotional development through human interaction — but it lacks the polish, narrative pacing, and emotional weight found in the larger models. It reflects the limitations of smaller-scale LLMs on creative, abstract tasks.

So Are Open-Source Models Creative?

Surprisingly — yes, at least to a degree.

The 14B Deepseek model produced microfiction that was poetic and emotionally resonant. While it lacked the fine polish of GPT-4o, it offered an impressive narrative. The 8B model was less nuanced but still capable of delivering imaginative content.

However, the 1.5B model again showed its limitations. It struggled with abstract thinking and narrative progression, often reverting to simplistic or literal responses.

Conclusion: The Creative Gap Is Narrowing — But Still Exists

For creative tasks, open-source models like Deepseek-14B and 8B can absolutely compete with proprietary models. Their output is expressive, inventive, and, in some cases, nearly indistinguishable from that of GPT-4o.

If your use case involves light creative writing, brainstorming, or content ideation — and you want full data privacy — then local open-source LLMs are becoming viable tools.

Next Up: Reasoning and Tool Use

In the next article, I’ll test these models on reasoning and tool-use tasks. Stay tuned!
