Why DeepSeek Works
DeepSeek has an interesting architecture that makes it more efficient, reducing the cost of both training and inference while also allowing it to run on less performant GPUs.
A large portion of this efficiency comes down to two primary things: Mixture of Experts (MoE) and Multi-head Latent Attention (MLA). Yes, there are other lower-level innovations, too, but from what I understand, these are the large determining factors.
How Mixture of Experts (MoE) Works for Idiots
The team has opted for a Mixture of Experts (MoE) network that doesn’t rely on one dense model to answer all questions.
Instead, it uses MoE: a model made up of many smaller expert sub-networks, each of which specializes in particular kinds of data during training, making it the expert in that domain.
A few other models, like GPT-4 Turbo (reportedly) and Google’s Switch Transformer, also use MoE in their designs.
This means that when you ask DeepSeek to interpret your recent STI test results, the prompt/input will be routed to the relevant experts, i.e., the ones trained on medical literature, etc.
Then, immediately after, you ask it for the nearest pharmacy in your area. This input will be routed to experts trained on Google Maps, OpenStreetMap, and even local transit websites. kek.
Now, obviously, this is a super dumbed-down way to understand it, but in essence, the idea is that you don’t need to activate the whole network every time a question or prompt is asked.
“So, how does the LLM know which experts to route the input to?”
This is managed by the Gating Network, which serves as the brain behind the routing. It is typically a small neural network that decides which expert should be used for each input/question/prompt.
An example of how DeepSeek works:
If you input:
"How do I write a Solidity smart contract?"
The gating network might select Expert #2 (Programming) and Expert #6 (Crypto/Finance) instead of Expert #5 (Medical Knowledge).
Then what happens?
Well, if you break it down piece by piece, it looks something similar to this…
The gating function scores each expert based on how relevant it is to the prompt: each expert gets a probability score, the highest-scoring experts are selected (in this example, Programming and Crypto/Finance), and those experts then work together to get you the most optimal response.
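To make that concrete, here is a minimal toy sketch in Python of what a gating step could look like. None of this is DeepSeek’s actual code: the expert names, the random “prompt embedding,” and the weights are all made up purely for illustration.

```python
# Toy sketch of MoE-style gating (illustration only, not DeepSeek's real code).
# A tiny "gating network" scores every expert for a prompt embedding,
# then only the top-k experts are activated.
import numpy as np

rng = np.random.default_rng(0)

n_experts, d_model, top_k = 8, 16, 2
experts = ["General", "Programming", "Legal", "Geo",
           "Medical", "Crypto/Finance", "History", "Cooking"]

W_gate = rng.normal(size=(d_model, n_experts))   # the gating network's weights
x = rng.normal(size=d_model)                     # stand-in embedding for the Solidity prompt

logits = x @ W_gate
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax: one probability score per expert

chosen = np.argsort(probs)[-top_k:][::-1]        # keep only the top-k experts
for i in chosen:
    print(f"route to expert {i} ({experts[i]}) with weight {probs[i]:.2f}")
```

The rest of the network stays idle; only the chosen experts run on this prompt.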
I suppose the natural follow-up question after that is: “Well, how do the Gating Networks know how to assign scores to the correct experts?”
The Gating Network (which, remember, is just a small neural network itself) is trained alongside the LLM, so every additional input steers it in the right direction and teaches it to select the best experts for the task at hand.
This is typically done through supervised and reinforcement learning, i.e., if the chosen experts perform well, the model reinforces those routing decisions and scoring.
If worse experts were chosen, the model adjusts the weights/scoring in the gating network for next time, until the weighting that produces the most optimal and desired outcome is reached.
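Here is a rough, simplified sketch of why that works mechanically: because the gate’s probabilities are multiplied into the output, the loss gradient flows back into the gate as well as the experts, so good routing decisions get reinforced and bad ones get penalized. Everything below (dimensions, dummy data, the MSE loss) is invented for illustration and is not DeepSeek’s training setup.

```python
# Toy sketch: the gate is trained jointly with the experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_experts, top_k = 16, 4, 2
gate = nn.Linear(d, n_experts)                            # the tiny gating network
experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])
opt = torch.optim.SGD(list(gate.parameters()) + list(experts.parameters()), lr=0.1)

x, target = torch.randn(8, d), torch.randn(8, d)          # dummy batch and "desired outcome"

probs = F.softmax(gate(x), dim=-1)                        # score every expert per input
w, idx = probs.topk(top_k, dim=-1)                        # activate only the top-k experts

rows = []
for b in range(x.shape[0]):
    # weighted mix of the chosen experts' outputs for this input
    mix = sum(w[b, j] * experts[int(idx[b, j])](x[b]) for j in range(top_k))
    rows.append(mix)
out = torch.stack(rows)

loss = F.mse_loss(out, target)                            # poor expert choices -> higher loss
loss.backward()                                           # gradients reach the gate's weights too...
opt.step()                                                # ...so its routing/weighting gets adjusted
```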
This also starts to veer towards asking whether humans should intervene at any point here and nudge it back onto the perceived right course or whether you should just let the machines do their learning themselves (that’s a conversation for another day).
The vast majority of models you have probably used in the wild, GPT-3/4, Llama, Gemini, etc., are dense Transformer models, where the vast majority (if not all) of the network is activated when generating a response, which is naturally more compute-intensive.
So, now we know a fraction of how MoE works, what about MLA?
How Multi-head Latent Attention (MLA) Works for Idiots
Multi-head Latent Attention (MLA) is a novel iteration of Multi-Head Attention (MHA), which the DeepSeek team invented and proposed in their DeepSeek-V2 paper.
This is new territory to me, so I’ll try to make it make sense. First, a bit of context: When you type a prompt/ask a question to an LLM, the model attempts to make sense of each word (token) and understand each individual token’s context to one another.
What does this mean, and why does it do that?
To wrap your head around what happens after you ask a question or create a prompt, I’ll stick with the Solidity question example we’ve used above.
Prompt: "How do I write a Solidity function that transfers tokens?"
Step 1 - Tokens Created:
["How", "do", "I", "write", "a", "Solidity", "function", "that", "transfers", "tokens", "?"]
Each token (word) will go through “attention” to determine what matters most so that the model can generate the best response.
Step 2 - Each word (token) compares itself to one another.
Using MHA, the model would process this inefficiently:
- Each word compares itself to every other word.
- So "Solidity" would check if "How" is relevant.
- "Function" would check if "tokens" are relevant, etc.
Comparing every token to every other token is obviously a heavy task, particularly if you are feeding in large PDFs with a lot of text, imagery, etc. Hence the excessive demand for compute and GPUs to perform such tasks.
On top of that, the model has to keep revisiting every past token as each new token is generated.
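For a sense of the cost, here is a toy numpy sketch of that pairwise comparison for our 11-token prompt. The vectors are random and single-headed; the point is just that the score matrix is n x n, so it grows quadratically with input length.

```python
# Toy sketch of "every token compares itself with every other token"
# in standard attention (one head, no batching, made-up numbers).
import numpy as np

rng = np.random.default_rng(0)
tokens = ["How", "do", "I", "write", "a", "Solidity", "function",
          "that", "transfers", "tokens", "?"]
n, d = len(tokens), 8

Q = rng.normal(size=(n, d))        # one query vector per token
K = rng.normal(size=(n, d))        # one key vector per token

scores = Q @ K.T / np.sqrt(d)      # n x n matrix: every token scored against every other
print(scores.shape)                # (11, 11) -> grows quadratically with input length
```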
Step 3: Using MLA instead…
Instead of each token looking at every other token, MLA introduces latent memory slots. In regular-person speak, these are dynamic memory stores where information about each word and its context is kept over time.

In this setup, tokens interact with the latent memory slots, maintaining a stored understanding of each word's meaning and context.
This allows each word to reference these latents instead of directly comparing with every other token.
By consulting these memory slots, the model can efficiently derive both the meaning and context of words, drastically streamlining the process of understanding and generating text.

If all that sounds too complex, no worries. Essentially, with MLA, the relationships between words and their contexts are efficiently summarized by the aforementioned latents.
These latents have a sort of memory that helps the system remember past interactions, preventing the need to reprocess each word from scratch every time. This not only saves time but also reduces the computational effort required.
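As a very rough illustration of the core trick (my own simplification of the DeepSeek-V2 idea, with made-up dimensions and projections): cache one small latent per token instead of the full-size keys and values, and re-derive keys/values from it whenever attention needs them.

```python
# Heavily simplified sketch of latent/KV compression (not DeepSeek's real code).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent = 1024, 64                     # the latent is ~16x smaller in this toy setup

W_down = rng.normal(size=(d_model, d_latent))    # compress a hidden state into a latent
W_up_k = rng.normal(size=(d_latent, d_model))    # rebuild a key from the latent
W_up_v = rng.normal(size=(d_latent, d_model))    # rebuild a value from the latent

latent_cache = []                                # this small thing is what actually gets stored

def process_token(hidden_state):
    """Cache only the small latent for this token."""
    latent_cache.append(hidden_state @ W_down)

def keys_and_values():
    """Re-derive full keys/values from the cached latents when attending."""
    C = np.stack(latent_cache)
    return C @ W_up_k, C @ W_up_v

for _ in range(5):                               # pretend we've processed 5 tokens so far
    process_token(rng.normal(size=d_model))

K, V = keys_and_values()
print(np.stack(latent_cache).shape, K.shape)     # (5, 64) cached vs (5, 1024) rebuilt on demand
```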
As a conversation goes on, the model's latent memory of the context between words (tokens) builds up, creating a more powerful and efficient output each time.
There’s also the potential for drift, which I’m not going to get into here but will probs cover in an article about Mira Network in the very near future.
Couple this with MoE and you have two of the main reasons DeepSeek has managed to make a significantly cheaper model ($0.0011 per 1,000 tokens) than ChatGPT ($0.03 per 1,000 tokens).
That’s a 27.27x difference for those who can’t do math.
What this architecture tells us about scalable systems, crypto and DeFAI in particular
The compartmentalizing of specific experts for specific tasks is kinda what we expect from the Ethereum L2 scaling roadmap. You are already seeing this play out to some degree.
In the recent blocmates 2025 thesis, I laid out my idea that general-purpose L2s are pretty much redundant apart from Base, which has captured the demand for an EVM-compatible general-purpose L2.
Base has the distribution, and it has the pull for builders that want to build on the EVM. The on-chain user metrics back this up.
Now, I also believe the opposite to be true. Purpose-built L2s with a predefined value proposition built into the chain's architecture certainly have a place.
In this analogy, L2s have to be the experts, and the routing to those experts has to be solved quickly. Just as DeepSeek’s gating network learns over time which weights to apply to which experts, intents-based systems and other cross-chain messaging protocols will need to fulfill this routing role.
This should all be done under the hood, out of the way of the end user, just as 99% of you reading this were blissfully unaware of how DeepSeek worked before this article (me very much included).
To get to that point, teams need to focus on solving account abstraction to enable an intents-based or even LayerZero-type architecture that routes the desired action to the most suitable expert, in this example a purpose-built L2 like Superseed (CDP), megaETH (real-time blockchain), Mode (AI), or Sophon (consumer).
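Purely as a toy to make the analogy concrete (this isn't any real protocol's code, and the tags and scoring are invented), an intents layer playing the gating-network role could be as simple as:

```python
# Illustrative toy only: an "intents layer" scoring purpose-built chains as experts
# and routing a user's intent to the best fit, gating-network style.
chains = {
    "Superseed": {"cdp", "stablecoin", "lending"},
    "megaETH":   {"realtime", "high-throughput", "trading"},
    "Mode":      {"ai", "agents"},
    "Sophon":    {"consumer", "gaming", "social"},
}

def route_intent(intent_tags: set[str]) -> str:
    # score each chain by how many of the intent's tags it specializes in
    scores = {name: len(tags & intent_tags) for name, tags in chains.items()}
    return max(scores, key=scores.get)

print(route_intent({"realtime", "trading"}))   # -> megaETH
```

Real intents networks obviously do far more (solvers, settlement, verification), but the routing shape is the same: score the experts, pick the best one, keep the user out of it.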
It isn’t just L2s, either. Generalized L1s are going to fade away, too, and purpose-built networks with their own use cases will start to take their own path. We have the infrastructure, so let’s see some apps.
Now, we are getting there with teams like Infinex basically creating THE on-chain portal, abstracting away any need to understand how the sausage is made on crypto rails.
Similarly, when you ask DeepSeek to help you summarize a complex PDF and then ask it for a sourdough recipe, you have no understanding of what is happening, and nor should you.
We need to achieve total account and chain abstraction, which is only possible with app-specific or purpose-built chains.
It doesn’t seem like the Ethereum L1 is going to change meaningfully any time soon, and I don’t necessarily think it should either. Trying to go toe-to-toe with purpose-built L1s like Solana is a fight they will not win.
Thoughts on DeFAI
If anything, the DeepSeek moment was terrible for short-term prices. Everything was priced off an overestimation of what scaling requires, i.e., more and more GPUs.
This release shows that it doesn’t necessarily have to be that way (to a degree).
Teams that were previously shut out of the conversation because they couldn't raise enough capital to fund their vision now have a much better chance of raising a smaller amount of start-up capital and at least giving it a go.
I think that is incredible, and I see this trend continuing.
Expanding on that, where I think this immediately helps is DeFi co-pilots, particularly the new wallet interfaces.
Yes, there is still a long, long way to go to bring the failure rate/hallucinations of such agents down to the point where you would trust them with a significant amount of your capital.
Mira Network has estimated that 25-30% of all agent outputs could be hallucinations and less-than-desirable actions as a consequence.
If you play this out for the current influencer-style agents, this is not really that big of a deal. But I think we have all seen the influencer-agent magic trick too many times now, and we are hungry for something more significant.
Mira’s goal is to use an aggregated and trustless network to verify the outputs of models/agents to reduce the failure rate by a factor of ten.
Teams like HeyAnon, Griffain, SphereOne (Orbit), and Wayfinder all have DeFi (DeFAI) co-pilots (full research report here), but are we ready to trust them with significant size?
I think we’re nearly there, and I love that there is still work to do. Will this supersede the current wallet UX?
What I’m trying to get at here is that wallets, portals, DeFAI co-pilots, whatever it is, this is the future of on-chain UX.
There’s a lot of overlap between how DeepSeek and different LLMs are scaling their own networks and the teething problems crypto faces with modularity.
When you tie it all together, you get something that looks pretty similar. Wallets and co-pilots will serve the user much like the app interfaces of ChatGPT and DeepSeek do today.
The intents networks act as the gating network, pushing the desired action towards the best expert (app/purpose-built chain) for the intended use. We’re almost there.
Having AI embedded into the crypto infrastructure layer will only turbocharge this vision.
Final Thoughts
I joke about Ethereum a lot, but it’s only because I want it to succeed. I think it is a credit to humanity as a whole, but it needs to get a move on because the opportunity window is closing.
With the DeepSeek weights being made open-source, copies and iterations of their implementation will ripple through Silicon Valley, academia, the crypto space, and beyond.
Innovation begets innovation, and new use cases will appear that we can’t even imagine currently.
Having a predominantly open model from China is better for humanity than a proprietary, closed model from any Western-funded lab.
And when I say that, I don’t mean either side is right or wrong. I don’t give a shit politically, in all honesty.
I only care about freedom, free markets, and freedom of information. Bittensor, for example, is a beautiful example of the power of open-source AI and crypto rails combining to create an incredible globally powered and distributed system.
Open source will eat the world. It is just a matter of time. Something so powerful cannot be in a black box and should not be confined to a select few people who have internal biases, even if they don’t care to admit them.
Ciao for now.