Question regarding the two-layer linear (FNN) projection structure in EmbeddingGemma-300M
Hello,
First of all, thank you for your excellent work in releasing EmbeddingGemma-300M — it’s an outstanding embedding model and has already proven very valuable for my retrieval / semantic-search workflows.
I have been reviewing the model architecture and documentation and have a technical question regarding the head/projection portion (after pooling) of the network, which I hope you might clarify.
From my understanding, the model appears to use the following structure after the token-level Transformer and pooling step:
- A Dense (linear) layer from dimension 768 → 3072, with activation set to identity
- Followed by a second Dense (linear) layer from dimension 3072 → 768, also with activation identity
- Then (presumably) an L2 normalization step to produce the final 768-dim embedding (a minimal sketch of this head is included below)
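For concreteness, here is the structure I am picturing, written as a small PyTorch sketch. The layer names, the presence of bias terms, and the exact placement of the normalization are my assumptions rather than something I have confirmed against the released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Sketch of the post-pooling head as I understand it; names are illustrative."""
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 3072):
        super().__init__()
        self.expand = nn.Linear(embed_dim, hidden_dim)   # 768 -> 3072, identity activation
        self.project = nn.Linear(hidden_dim, embed_dim)  # 3072 -> 768, identity activation

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        x = self.project(self.expand(pooled))            # no non-linearity in between
        return F.normalize(x, p=2, dim=-1)               # L2-normalize the final 768-dim embedding

head = ProjectionHead()
pooled = torch.randn(4, 768)    # stand-in for mean-pooled transformer outputs
embeddings = head(pooled)       # shape (4, 768), unit L2 norm
print(embeddings.shape)
```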
Because both layers use identity activations, mathematically this is equivalent to a single linear transformation from 768 → 768 (i.e., the two weight matrices multiply into one). However, the choice to expand to 3072 and then project back to 768 suggests a deliberate architectural decision. I'd appreciate it if you could share any insight into the motivations behind this two-layer linear projection design.
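To make the equivalence concrete, here is a small numerical check with random weights (double precision so the comparison is not dominated by floating-point error; bias terms are included for generality, though I have not verified whether the released Dense layers use them):

```python
import torch

torch.manual_seed(0)

# Random stand-ins for the two Dense layers.
W1 = torch.randn(3072, 768, dtype=torch.double)
b1 = torch.randn(3072, dtype=torch.double)
W2 = torch.randn(768, 3072, dtype=torch.double)
b2 = torch.randn(768, dtype=torch.double)

x = torch.randn(5, 768, dtype=torch.double)   # stand-in for pooled outputs

# Two-layer form: y = W2 (W1 x + b1) + b2
y_two = (x @ W1.T + b1) @ W2.T + b2

# Collapsed single 768 -> 768 linear map: W = W2 W1, b = W2 b1 + b2
W = W2 @ W1
b = W2 @ b1 + b2
y_one = x @ W.T + b

print(torch.allclose(y_two, y_one))   # True: the two layers fold into one linear map
```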
In particular, I’m very curious about:
- Was the 768 → 3072 → 768 expansion-then-projection design chosen primarily to increase the model’s expressive capacity, perhaps allowing an internal “wider” representation space (even without a non-linearity) before compressing back to embedding space?
- Given that the activation functions are identity, what practical benefit is gained by first expanding and then projecting rather than using a direct 768 → 768 linear layer? For example, does it help with training stability, initialization dynamics, internal regularisation, quantisation/truncation (e.g., supporting Matryoshka Representation Learning, which allows truncating embeddings from 768 to 512, 256, or 128 dimensions; see the truncation sketch after this list), or any other downstream benefit?
- Does this structure specifically help with embedding truncation, quantisation (int4/int8), or on-device deployment scenarios (e.g., mobile/edge) by providing a "buffer" internal representation dimension?
- Or, alternatively, was the two-layer structure simply an implementation/engineering convenience (for example, keeping expansion and projection weights separate, better hardware/TPU kernel fusion, easier ONNX export, or internal library constraints) rather than purely a modelling/expressiveness choice?
- Finally, do you anticipate that future versions of the model might insert a non-linearity (or skip-connection) between these two projection layers, or are they intentionally kept linear for specific reasons?
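For reference, this is the truncation recipe I currently follow when producing smaller embeddings: the usual Matryoshka-style slicing of the leading dimensions followed by re-normalization. Whether this is exactly what you intend for EmbeddingGemma, and how it interacts with the projection head above, is part of what I am asking:

```python
import torch
import torch.nn.functional as F

def truncate_and_renormalize(emb: torch.Tensor, dim: int) -> torch.Tensor:
    # Keep the leading `dim` coordinates, then re-apply L2 normalization
    # so cosine similarity stays well-defined at the reduced size.
    return F.normalize(emb[..., :dim], p=2, dim=-1)

full = F.normalize(torch.randn(2, 768), p=2, dim=-1)   # stand-in for 768-dim model outputs
for d in (512, 256, 128):
    small = truncate_and_renormalize(full, d)
    print(d, tuple(small.shape), small.norm(dim=-1))    # unit-norm embeddings at each size
```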
Your insights would be extremely helpful for researchers and practitioners (including myself) in better understanding the embedding architecture and designing fine-tuning/quantisation/truncation pipelines around it.
Thank you very much for your time and for sharing your outstanding work.