January


I am committing myself to writing about what I've done every month this year. This month I didn't have an experiment in mind, and I didn't have the idea to write every month until nearly halfway through it, so I'm going to speed through some of the first things I did. I started with the GPT-2 architecture and gradually modified the code to resemble the Llama architecture. Along the way I tried scaled sinusoidal embeddings, cartesian-to-hyperspherical conversions inside the embedding layer, a LayerNorm after the embedding layer, and more that didn't turn out well. I then decided to try "parallel decoder blocks" (ffn(x) + attn(x) + x instead of ffn(attn(x)) + x) and got a notably lower loss, so I kept that modified architecture for the rest of the month's work and called it ParaLlama. (This probably isn't normal: later in the month I ran another control with the Llama-like blocks and it got a better loss than the parallel models with all the same parameters. I should have run a few runs at different seeds at the beginning.)
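To make the block difference concrete, here's a minimal sketch of the parallel residual wiring using stock PyTorch modules; the real ParaLlama blocks use Llama-style pieces (RMSNorm, SwiGLU, RoPE), and the single shared pre-norm here is my assumption, not necessarily how my code does it.

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Parallel decoder block: x + attn(h) + ffn(h), with both branches reading
    the same (normalized) input, instead of the sequential x + ffn(attn(x))-style
    composition. Stock PyTorch modules stand in for the Llama-style components."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)  # a Llama-like block would use RMSNorm
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        # causal mask so each position only attends to earlier positions
        t = x.size(1)
        causal = torch.triu(
            torch.full((t, t), float("-inf"), device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        return x + attn_out + self.ffn(h)  # both branches added straight to the residual
```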

Llama2-70b Embeddings

I then started exploring using pretrained token embeddings from Llama2-70b in these models, by simply adding a linear projection from a hidden size of 8192 down to a smaller dimension after the token embedding, and from that smaller dimension back up to 8192 before the language modelling head. Using a linear projection also allows us to "flatten" the parameters down to the size of a model trained without the large embeddings, by simply performing a matrix multiplication between the projection and the embedding layer from Llama2-70b, which is super useful for inference! Early on, I tried allocating different percentages of training to "flattened mode" to see if it gave any gains in loss, but the best value ended up being 0% by a landslide. I also reran the small model with a learnable alpha for each residual operation, set to 0 at initialization to keep the model's function closer to identity (ParaLlama-αp), which is something that "seems like it'd be done in those papers with the parallel blocks, idk, right? I think they do that." It didn't show any improvements here though, so it's left out of the later (parallel) models.

EDIT: The learnable 0 alpha for residual ops looks like it could have come from ReZero!
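Here's a rough sketch of what I mean by the projection and the "flattening". The names, the bias-free linears, and whether the big table stays frozen are illustrative assumptions, not my exact training code.

```python
import torch
import torch.nn as nn

class ProjectedEmbedding(nn.Module):
    """Wrap a big pretrained token-embedding table (e.g. Llama2-70b's 32000 x 8192)
    with a learned down-projection into the small model's hidden size; a mirror
    up-projection sits before the language modelling head."""

    def __init__(self, pretrained_weight: torch.Tensor, dim: int):
        super().__init__()
        vocab, big_dim = pretrained_weight.shape                 # 32000, 8192
        self.wte = nn.Embedding(vocab, big_dim, _weight=pretrained_weight.clone())
        self.down = nn.Linear(big_dim, dim, bias=False)          # 8192 -> e.g. 768
        self.up = nn.Linear(dim, big_dim, bias=False)            # e.g. 768 -> 8192, before the LM head

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.down(self.wte(input_ids))

    def flatten(self) -> nn.Embedding:
        """Fold the down-projection into the big table for inference, so the
        deployed model only stores a (vocab, dim) embedding matrix."""
        with torch.no_grad():
            small = self.wte.weight @ self.down.weight.T          # (32000, 8192) @ (8192, dim)
        return nn.Embedding.from_pretrained(small)
```

The language-modelling-head side flattens the same way: multiplying the up-projection into the output matrix gives a head of the small model's width.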

| Architecture/Nickname | Dim | Max Parameters | Non-embed Parameters | Approximate duration | FLOPs | Tokens | Test loss | Average | ARC@25 | TruthfulQA@0 | WinoGrande@5 | HellaSwag@10 | MMLU@5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ParaLlama-p-medium | 1088 | 545m | | 38h | 8.660x10^18 | 5.1x10^9 | 2.295 | 29.182 | 22.01 | 45.43 | 51.54 | 30.09 | 26.02 |
| ParaLlama-p-small | 768 | 369m | | 11h | 1.412x10^18 | 2.2x10^9 | 2.5163 | 28.435 | 20.14 | 47.26 | 49.64 | 27.94 | 25.63 |
| ParaLlama-αp-small | 768 | 369m | | 11h | 1.412x10^18 | 2.2x10^9 | 2.5385 | | | | | | |
| ParaLlama-small-control (no embeddings) | 768 | 119m | 95m | 6h | 1.246x10^18 | 2.2x10^9 | 2.535 | | | | | | |
| Llama-p-small (w/ embeddings) | 768 | 369m | 101m | 11h | 1.412x10^18 | 2.2x10^9 | 2.485 | 28.882 | 21.67 | 46.89 | 51.22 | 28.27 | 25.24 |

Performance

I calculated a quick-and-dirty mean strided perplexity on the first 1k samples of the deduplicated Pile, with 512-token strides, for other models trained on the Pile as well as for some controls and our main experimental ParaLlama-p models.
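Something close to what I ran, as a hedged sketch following the standard Hugging Face sliding-window perplexity recipe; the model name, dataset field, context length, and whether samples are scored separately or concatenated are assumptions, not my exact script.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "EleutherAI/pythia-70m-deduped"                  # any of the models in the table
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()

# first 1k samples of the deduplicated Pile, concatenated
ds = load_dataset("EleutherAI/the_pile_deduplicated", split="train[:1000]")
ids = tok("\n\n".join(ds["text"]), return_tensors="pt").input_ids.to(device)

max_len, stride = 2048, 512                                   # 512-token strides
nlls, prev_end = [], 0
for begin in range(0, ids.size(1), stride):
    end = min(begin + max_len, ids.size(1))
    trg_len = end - prev_end                                  # only score tokens not seen yet
    chunk = ids[:, begin:end]
    labels = chunk.clone()
    labels[:, :-trg_len] = -100                               # mask the overlapping prefix
    with torch.no_grad():
        nlls.append(model(chunk, labels=labels).loss * trg_len)
    prev_end = end
    if end == ids.size(1):
        break

print(torch.exp(torch.stack(nlls).sum() / prev_end).item())
```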

| Model / Arch / Nickname | Strided Sliding-Window Deduped Pile Perplexity | Average | Hidden Size | Layers | Tokens (billions) |
|---|---|---|---|---|---|
| pythia-70m-deduped | 22.393400192260742 | 28.93 | 512 | 6 | 300b |
| pythia-160m-deduped | 13.933751106262207 | 29.38 | 768 | 12 | 300b |
| pythia-410m-deduped | 9.61842155456543 | 31.29 | 1024 | 24 | 300b |
| pythia-1b-deduped | 7.889087677001953 | 32.78 | 2048 | 16 | 300b |
| pythia-1.4b-deduped | 7.286991596221924 | 35 | 2048 | 24 | 300b |
| pythia-2.8b-deduped in 8-bit (sorry) | | 36.72 | 2560 | 32 | 300b |
| Cerebras-GPT-111M | 21.550655364990234 | 27.75 | 768 | 10 | 2.2b |
| Cerebras-GPT-256M | 15.203496932983398 | 29.38 | 1088 | 14 | 5.1b |
| Cerebras-GPT-590M | 12.098200798034668 | 29.14 | 1536 | 18 | 11.8B |
| ParaLlama-small-control | 14.519485473632812 | | 768 | 10 | 2.2b |
| ParaLlama-αp-small | 14.520842552185059 | | 768 | 10 | 2.2b (transformer) + 1400b* (embeddings) |
| ParaLlama-p-small | 14.094217300415039 | 28.435 | 768 | 10 | 2.2b (transformer) + 1400b* (embeddings) |
| ParaLlama-p-medium | 10.680788040161133 | 29.182 | 1088 | 14 | 5.1b (transformer) + 1400b* (embeddings) |
| Llama-like (no embeds, llama-like composition of ffn & attn) | 13.882301330566406 | 28.608 | 768 | 10 | 2.2b |
| Llama-p-small (w/ embeds) | 13.565109252929688 | 28.882 | 768 | 10 | 2.2b + 1400b* (embeddings) |

* not trained directly on the Pile, but likely with large overlap

I think I got a particularly lucky seed when first testing whether the parallel blocks work better. It's the 30th by the time I'm training the new control, "just to see if it really did work, just in case," and "to fill out the table better," which turned out to be the right decision, albeit with a very disappointing result. I did have just enough time to train a Llama-like model with the expanded embeddings from Llama2-70b, though!

| Architecture/Nickname | Dim | Max Parameters | Non-embed Parameters | Approximate duration | FLOPs | Tokens | Test loss | Average | ARC@25 | TruthfulQA@0 | WinoGrande@5 | HellaSwag@10 | MMLU@5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-p-small | 768 | 369m | 101m | 11h | 1.412x10^18 | 2.2x10^9 | 2.483 | 28.882 | 21.67 | 46.89 | 51.22 | 28.27 | 25.24 |
| Llama-like (no embeds, llama-like composition of ffn & attn) | 768 | 119m | 95m | 7h | 1.246x10^18 | 2.2x10^9 | 2.503 | 28.608 | 20.65 | 46.14 | 50.28 | 27.72 | 26.86 |

 

Some (More) Things that Didn't Work Very Well

Virtual Token Models

I also had an idea to split each token embedding into multiple "virtual token embeddings" that run side-by-side (eight 16d tokens could become sixteen 8d tokens), to increase the amount of information we're working with per natural-language token while keeping the same size of transformer. We can't do that without a complete transformation of the token embeddings (any residual transformation here gives really bad loss), so I first add an nn.Linear(max_dimension, max_dimension) operation and a custom Split(virtual_tokens) class after the WTE, plus a similar linear and 'recombine' operation before the LM_HEAD. I have to test small here, because training a VTM's transformer on d natural-language tokens is essentially training a "normal" transformer on d*virtual_tokens tokens, so I decrease the number of virtual tokens by making the first linear transformation go from max_dimension to a much smaller dimension that can still be split into multiple tokens.

Each virtual token uses the same position_id as its origin natural-language token.
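A hedged sketch of the wiring; the class names, the inner dimension, and the reshape-based split are my illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class Split(nn.Module):
    """Split each token's vector into `v` consecutive chunks, treated as `v` virtual tokens."""
    def __init__(self, virtual_tokens: int):
        super().__init__()
        self.v = virtual_tokens

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape                      # (batch, seq, dim)
        return x.view(b, s * self.v, d // self.v)

class Recombine(nn.Module):
    """Inverse of Split: merge each group of `v` virtual tokens back into one token."""
    def __init__(self, virtual_tokens: int):
        super().__init__()
        self.v = virtual_tokens

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, sv, dv = x.shape                    # (batch, seq * v, dim // v)
        return x.view(b, sv // self.v, dv * self.v)

max_dimension, inner_dim, virtual_tokens = 8192, 512, 2
to_inner = nn.Linear(max_dimension, inner_dim)   # full (non-residual) transform before splitting
split = Split(virtual_tokens)                    # transformer then runs on 256-d virtual tokens
recombine = Recombine(virtual_tokens)
to_outer = nn.Linear(inner_dim, max_dimension)   # mirror transform before the LM_HEAD

x = torch.randn(1, 16, max_dimension)            # 16 token embeddings out of the WTE
virtual = split(to_inner(x))                     # (1, 32, 256); each pair shares its origin position_id
out = to_outer(recombine(virtual))               # back to (1, 16, 8192)
```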

| Architecture/Nickname | Dim | Maximum Parameters | Approximate duration | FLOPs | Tokens | Test loss |
|---|---|---|---|---|---|---|
| ParaLlama-p-micro | 256 | 273m | 5.8h | 1.384x10^17 | 2.2x10^9 | 3.11 |
| 2VTM-micro | | 344m | 8h | 1.083x10^18 | 2.2x10^9 | 3.018 |
| 3VTM-micro | | 349m | 10h | 1.143x10^18 | 2.2x10^9 | 2.996 |
| 4VTM-micro | | 353m | 11h | 1.204x10^18 | 2.2x10^9 | 2.994 |

The VTM models end up slightly better than a model that uses the same dimension, but worse in comparison to models that use a similar number of FLOPs per forward pass. If parameter count is to be prioritized above all else, though, this is a viable architecture.

Mixture of Embeddings

A mixture-of-embeddings model is the ParaLlama architecture with a new Embedding class. This new Embedding class consists of a traditional embedding layer that turns a token into a vector whose size we'll call the proxy embedding dimension. Then, for each imported embedding, a projection is applied to the proxy embedding, from the proxy embedding dimension to the vocab size of that imported embedding. We also estimate, from the proxy embedding with a linear layer, an alpha for each resulting embedding, after the embeddings are projected to the model's dimension with another linear layer. We then just perform a weighted sum based on those alphas! The language modelling head is unchanged (a simple Linear layer without any warm start).

I only had enough time for one test of this architecture, so the embeddings I used were from Llama2-70b, Pythia-12b, and UL2-20b. That's vocabs of size 32000 with 8192 dimensions, 50688 with 5120 dimensions, and 32128 with 4096 dimensions, respectively. A proxy embedding size of 8 was used, with a hidden size of 768, 10 layers, and 12 attention heads.
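A sketch of how I read that layer. In particular, treating the vocab-size projection as soft weights over the imported embedding table, the softmaxes, and freezing the imported tables are my assumptions about the mechanism, not confirmed details.

```python
import torch
import torch.nn as nn

class MixtureOfEmbeddings(nn.Module):
    """Proxy embedding -> per-imported-vocab weights -> mix of that imported
    table's rows -> projection to model dim; the results are combined with
    alphas predicted from the same proxy embedding."""

    def __init__(self, vocab_size: int, proxy_dim: int, model_dim: int,
                 imported: list[torch.Tensor]):  # e.g. (32000, 8192), (50688, 5120), (32128, 4096)
        super().__init__()
        self.proxy = nn.Embedding(vocab_size, proxy_dim)
        self.tables = nn.ParameterList(                         # imported embeddings, assumed frozen
            [nn.Parameter(w, requires_grad=False) for w in imported])
        self.to_vocab = nn.ModuleList(                          # proxy_dim -> imported vocab size
            [nn.Linear(proxy_dim, w.shape[0]) for w in imported])
        self.to_model = nn.ModuleList(                          # imported dim -> model dim
            [nn.Linear(w.shape[1], model_dim) for w in imported])
        self.alpha = nn.Linear(proxy_dim, len(imported))        # one alpha per imported embedding

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        p = self.proxy(input_ids)                               # (b, s, proxy_dim)
        alphas = self.alpha(p).softmax(dim=-1)
        out = 0.0
        for i, table in enumerate(self.tables):
            weights = self.to_vocab[i](p).softmax(dim=-1)       # (b, s, imported_vocab)
            mixed = weights @ table                             # (b, s, imported_dim)
            out = out + alphas[..., i:i + 1] * self.to_model[i](mixed)
        return out                                              # (b, s, model_dim)
```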

| Architecture/Nickname | Dim | Max Parameters | Non-embed Parameters | Approximate duration | FLOPs | Tokens | Test loss | Average | ARC@25 | TruthfulQA@0 | WinoGrande@5 | HellaSwag@10 | MMLU@5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ParaLlama-p-small | 768 | 369m | | 11h | 1.412x10^18 | 2.2x10^9 | 2.5163 | 28.435 | 20.14 | 47.26 | 49.64 | 27.94 | 25.63 |
| MoEmbed-base | 768 | 787m | 119m | 16.5h | 1.760x10^18 | 2.2x10^9 | 2.754 | | | | | | |

This is without using a mixture of language modelling heads; given the time constraint I set, I didn't have enough time to implement and train a model with one.

Fin

If you want a better benchmark score, using the embeddings of a model that gets a better benchmark score is... something you can do that won't do a lot. I suspect that these models, since they have access to a very large embedding space and are only utilizing a very small part of it, could be more effective when finetuned than finetuned base "normal" models, but I didn't have time to check.

Final (Extra) Notes