January
I am committing myself to writing about what I've done every month this year. This month I didn't have an experiment in mind, and I didn't have the idea to write every month until nearly halfway through it, so I'm going to speed through some of the first things I did. I started with the GPT-2 architecture and gradually modified the code to resemble the Llama architecture. Along the way I tried scaled sinusoidal embeddings, cartesian-to-hyperspherical conversions inside the embedding layer, a LayerNorm after the embedding layer, and more that didn't turn out well. I then decided to try "parallel decoder blocks" (`ffn(x) + attn(x) + x` instead of `ffn(attn(x)) + x`) and got a notably lower loss, so I kept that modified architecture through the rest of the month's work and called it ParaLlama. (This probably isn't normal: a later control run with Llama-like blocks and all the same parameters got a better loss than the parallel models. I should have run a few seeds at the beginning.)
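For clarity, here's a minimal sketch of the two block layouts (the module names are placeholders and the norms are omitted, so this isn't the exact code):

```python
import torch.nn as nn

class SequentialBlock(nn.Module):
    """Llama-style composition: ffn(attn(x)) + x (norms omitted for brevity)."""
    def __init__(self, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attn, self.ffn = attn, ffn

    def forward(self, x):
        return self.ffn(self.attn(x)) + x

class ParallelBlock(nn.Module):
    """Parallel composition used in ParaLlama: ffn(x) + attn(x) + x."""
    def __init__(self, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attn, self.ffn = attn, ffn

    def forward(self, x):
        return self.ffn(x) + self.attn(x) + x
```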
Llama2-70b Embeddings
I then started exploring the use of pretrained token embeddings from Llama2-70b in these models, by simply adding a linear projection from a hidden size of 8192 down to a smaller dimension after the token embedding, and another from that smaller dimension back up to 8192 before the language modelling head. Using a linear projection also lets us "flatten" the parameters down to the size of a model trained without the large embeddings, by simply performing a matrix multiplication between the projection and the embedding layer from Llama2-70b, which is super useful for inference! Early on, I tried allocating different percentages of training to "flattened mode" to see if it gave any gains in loss, but the best value ended up being 0% by a landslide. I also reran the small model with a learnable alpha on each residual operation, set to 0 at initialization to keep the model's function closer to the identity at the start (ParaLlama-αp), which is something that seems like it'd be done in those papers with the parallel blocks (idk, I think they do that). It didn't show any improvements here though, so it's left out of later (parallel) models.
EDIT: The learnable zero-initialized alpha for residual ops looks like it could have come from ReZero!
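Back to the embeddings: the embedding side looks roughly like this (a simplified sketch: `llama_embed_weight` stands in for the exported 32000×8192 Llama2-70b table, it's shown frozen here for simplicity, and the mirrored up-projection before the language modelling head is only noted in a comment):

```python
import torch
import torch.nn as nn

class ProjectedEmbedding(nn.Module):
    """Pretrained Llama2-70b token embeddings followed by a learned down-projection."""
    def __init__(self, llama_embed_weight: torch.Tensor, d_model: int):
        super().__init__()
        vocab_size, big_dim = llama_embed_weight.shape            # e.g. 32000, 8192
        self.wte = nn.Embedding(vocab_size, big_dim, _weight=llama_embed_weight)
        self.wte.weight.requires_grad = False                     # keep the big table frozen
        self.down = nn.Linear(big_dim, d_model, bias=False)       # 8192 -> model dim
        # A mirrored nn.Linear(d_model, big_dim) sits right before the language modelling head.

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.down(self.wte(input_ids))                     # (batch, seq, d_model)
```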
Architecture/Nickname | Dim | Max Parameters | Non-embed Parameters | Approximate duration | FLOPs | Tokens | Test loss | Average | ARC@25 | TruthfulQA@0 | WinoGrande@5 | HellaSwag@10 | MMLU@5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ParaLlama-p-medium | 1088 | 545m | | 38h | 8.660x10^18 | 5.1x10^9 | 2.295 | 29.182 | 22.01 | 45.43 | 51.54 | 30.09 | 26.02 |
ParaLlama-p-small | 768 | 369m | | 11h | 1.412x10^18 | 2.2x10^9 | 2.5163 | 28.435 | 20.14 | 47.26 | 49.64 | 27.94 | 25.63 |
ParaLlama-αp-small | 768 | 369m | | 11h | 1.412x10^18 | 2.2x10^9 | 2.5385 | | | | | | |
ParaLlama-small-control (no embeddings) | 768 | 119m | 95m | 6h | 1.246x10^18 | 2.2x10^9 | 2.535 | | | | | | |
Llama-p-small (w/ embeddings) | 768 | 369m | 101m | 11h | 1.412x10^18 | 2.2x10^9 | 2.485 | 28.882 | 21.67 | 46.89 | 51.22 | 28.27 | 25.24 |
Performance
I calculated a quick-and-dirty mean strided perplexity over the first 1k samples of the deduplicated Pile with 512-token strides, for other models trained on the Pile as well as some controls and our main experimental ParaLlama-p models.
Model / Arch / Nickname | Strided Sliding-Window Deduped Pile Perplexity | Avg | Hidden Size | Layers | Tokens (billions) |
---|---|---|---|---|---|
pythia-70m-deduped | 22.393 | 28.93 | 512 | 6 | 300b |
pythia-160m-deduped | 13.934 | 29.38 | 768 | 12 | 300b |
pythia-410m-deduped | 9.618 | 31.29 | 1024 | 24 | 300b |
pythia-1b-deduped | 7.889 | 32.78 | 2048 | 16 | 300b |
pythia-1.4b-deduped | 7.287 | 35 | 2048 | 24 | 300b |
pythia-2.8b-deduped in 8-bit (sorry) | | 36.72 | 2560 | 32 | 300b |
Cerebras-GPT-111M | 21.551 | 27.75 | 768 | 10 | 2.2b |
Cerebras-GPT-256M | 15.203 | 29.38 | 1088 | 14 | 5.1b |
Cerebras-GPT-590M | 12.098 | 29.14 | 1536 | 18 | 11.8b |
ParaLlama-small-control | 14.519 | | 768 | 10 | 2.2b |
ParaLlama-αp-small | 14.521 | | 768 | 10 | 2.2b (transformer) + 1400b* (embeddings) |
ParaLlama-p-small | 14.094 | 28.435 | 768 | 10 | 2.2b (transformer) + 1400b* (embeddings) |
ParaLlama-p-medium | 10.681 | 29.182 | 1088 | 14 | 5.1b (transformer) + 1400b* (embeddings) |
Llama-like (no embeds, llama-like composition of ffn & attn) | 13.882 | 28.608 | 768 | 10 | 2.2b |
Llama-p-small (w/ embeds) | 13.565 | 28.882 | 768 | 10 | 2.2b + 1400b* (embeddings) |
* not directly from the Pile, but likely large overlap
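The quick-and-dirty perplexity numbers were computed along these lines (a sketch following the standard strided-window recipe; the model/dataset ids are examples and the exact script differed):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-160m-deduped"                 # example model
model = AutoModelForCausalLM.from_pretrained(model_id).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# First 1k samples of the deduplicated Pile, joined into one long token stream.
samples = load_dataset("EleutherAI/the_pile_deduplicated", split="train[:1000]")
encodings = tokenizer("\n\n".join(samples["text"]), return_tensors="pt")

max_length, stride = 2048, 512
seq_len = encodings.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    target_len = end - prev_end                             # tokens scored in this window
    input_ids = encodings.input_ids[:, begin:end].cuda()
    target_ids = input_ids.clone()
    target_ids[:, :-target_len] = -100                      # ignore the overlapping context

    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss     # mean NLL over scored tokens (approx.)
    nlls.append(loss * target_len)
    prev_end = end
    if end == seq_len:
        break

print(torch.exp(torch.stack(nlls).sum() / prev_end))        # strided perplexity
```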
I think I got a particularly lucky seed when first testing whether the parallel blocks work better. It's the 30th by the time I'm training the new control, "just to see if it really did work, just in case" and "to fill out the table better," which turned out to be the right decision, with a very disappointing result. I did have just enough time to also train a Llama-like model with the expanded embeddings from Llama2-70b, though!
Architecture/Nickname | Dim | Max Parameters | Non-embed Parameters | Approximate duration | FLOPs | Tokens | Test loss | Average | ARC@25 | TruthfulQA@0 | WinoGrande@5 | HellaSwag@10 | MMLU@5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Llama-p-small | 768 | 369m | 101m | 11h | 1.412x10^18 | 2.2x10^9 | 2.483 | 28.882 | 21.67 | 46.89 | 51.22 | 28.27 | 25.24 |
Llama-like (no embeds, llama-like composition of ffn & attn) | 768 | 119m | 95m | 7h | 1.246x10^18 | 2.2x10^9 | 2.503 | 28.608 | 20.65 | 46.14 | 50.28 | 27.72 | 26.86 |
Some (More) Things that Didn't Work Very Well
Virtual Token Models
I also had an idea to split each token embedding into multiple "virtual token embeddings" that run side-by-side (eight 16d tokens could become sixteen 8d tokens), to increase the amount of information we're working with per natural-language token while keeping the same size of transformer. We can't do that without a complete transformation of the token embeddings (any residual transformation here gives a really bad loss), so I first add an `nn.Linear(max_dimension, max_dimension)` operation and then a custom `Split(virtual_tokens)` class after the `WTE`, plus a similar linear and a 'recombine' operation before the `LM_HEAD`. I have to test small here, because training a VTM's transformer on `d` natural-language tokens is essentially training a "normal" transformer on `d * virtual_tokens` tokens, so I decrease the number of virtual tokens by making the first linear transformation go from `max_dimension` to a much smaller dimension that can still be split into multiple tokens. Each virtual token uses the same `position_id` as its origin natural-language token.
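A rough sketch of the embedding-side plumbing (`Split`/`Recombine` here are illustrative stand-ins for my actual classes, and the linears around them are left out):

```python
import torch
import torch.nn as nn

class Split(nn.Module):
    """Turn every natural-language token into `virtual_tokens` smaller tokens along the sequence axis."""
    def __init__(self, virtual_tokens: int):
        super().__init__()
        self.virtual_tokens = virtual_tokens

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (batch, seq, dim)
        b, s, d = x.shape
        assert d % self.virtual_tokens == 0
        return x.reshape(b, s * self.virtual_tokens, d // self.virtual_tokens)

class Recombine(nn.Module):
    """Inverse of Split: merge each group of `virtual_tokens` back into one token embedding."""
    def __init__(self, virtual_tokens: int):
        super().__init__()
        self.virtual_tokens = virtual_tokens

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (batch, seq * v, dim / v)
        b, sv, dv = x.shape
        return x.reshape(b, sv // self.virtual_tokens, dv * self.virtual_tokens)

# Every virtual token reuses its origin token's position, e.g.:
# position_ids = torch.arange(seq_len).repeat_interleave(virtual_tokens)
```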
Architecture/Nickname | Dim | Maximum Parameters | Approximate duration | FLOPs | Tokens | Test loss |
---|---|---|---|---|---|---|
ParaLlama-p-micro | 256 | 273m | 5.8h | 1.384x10^17 | 2.2x10^9 | 3.11 |
2VTM-micro | . | 344m | 8h | 1.083x10^18 | 2.2x10^9 | 3.018 |
3VTM-micro | . | 349m | 10h | 1.143x10^18 | 2.2x10^9 | 2.996 |
4VTM-micro | . | 353m | 11h | 1.204x10^18 | 2.2x10^9 | 2.994 |
The VTM models end up slightly better than a model that uses the same dimension, but compared to models that use a similar number of FLOPs per forward pass they're worse; if parameter count is to be prioritized above all else, this is a viable architecture.
Mixture of Embeddings
A mixture-of-embeddings model is the ParaLlama architecture with a new Embedding class. This new Embedding class consists of a traditional embedding layer that turns a token into a vector whose size we'll call the proxy embedding dimension; then one projection per imported embedding is applied to the proxy embedding, from the proxy embedding dimension to the vocab size of that imported embedding. We also estimate, from the proxy embedding with a linear layer, an alpha for each resulting embedding, after they're projected to the model's dimension with another linear layer. We then just perform a weighted sum based on those alphas! The language modelling head is unchanged (a simple Linear layer without any warm start).
I only had enough time for one test of this architecture so the embeddings I used were from Llama2-70b, Pythia-12b, and UL2-20b. That's vocabs of size 32000 with 8192 dimensions, 50688 with 5120 dimensions, and 32128 with 4096 dimensions. A proxy embedding size of 8 was used with a hidden size of 768, 10 layers, and 12 attention heads.
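Roughly what the embedding class looks like in code (a simplified, illustrative sketch: each imported table is treated as a frozen "soft lookup" target, and details like weight normalization and device placement are glossed over):

```python
import torch
import torch.nn as nn

class MixtureOfEmbeddings(nn.Module):
    """Proxy embedding -> soft lookup into several pretrained embedding tables -> weighted sum."""
    def __init__(self, vocab_size: int, proxy_dim: int, d_model: int, imported_tables: list):
        super().__init__()
        # imported_tables: pretrained weight tensors, e.g. (32000, 8192), (50688, 5120), (32128, 4096)
        self.proxy = nn.Embedding(vocab_size, proxy_dim)                  # proxy_dim = 8 in the run above
        self.tables = [t.detach().float() for t in imported_tables]      # kept frozen
        self.to_vocab = nn.ModuleList(nn.Linear(proxy_dim, t.shape[0]) for t in imported_tables)
        self.to_model = nn.ModuleList(nn.Linear(t.shape[1], d_model) for t in imported_tables)
        self.alphas = nn.Linear(proxy_dim, len(imported_tables))          # one mixing weight per table

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        p = self.proxy(input_ids)                                         # (batch, seq, proxy_dim)
        a = self.alphas(p)                                                # (batch, seq, n_tables)
        out = 0.0
        for i, table in enumerate(self.tables):
            weights = self.to_vocab[i](p)                                 # (batch, seq, import_vocab)
            soft_embed = weights @ table                                  # (batch, seq, import_dim)
            out = out + a[..., i : i + 1] * self.to_model[i](soft_embed)  # project to d_model, scale by alpha
        return out
```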
Architecture/Nickname | Dim | Max Parameters | Non-embed Parameters | Approximate duration | FLOPs | Tokens | Test loss | Average | ARC@25 | TruthfulQA@0 | WinoGrande@5 | HellaSwag@10 | MMLU@5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ParaLlama-p-small | 768 | 369m | | 11h | 1.412x10^18 | 2.2x10^9 | 2.5163 | 28.435 | 20.14 | 47.26 | 49.64 | 27.94 | 25.63 |
MoEmbed-base | 768 | 787m | 119m | 16.5h | 1.760x10^18 | 2.2x10^9 | 2.754 | | | | | | |
This is without a mixture of language modelling heads; due to the time constraint I set, I didn't have enough time to implement and train one of those as well.
Fin
If you want a better benchmark score, using the embeddings of a model that gets a better benchmark score is... something you can do that won't do a lot. I suspect that these models, since they have access to a very large embedding space and only utilize a very small part of it, could be more effective when finetuned than "normal" base models, but I didn't have time to check.
Final (Extra) Notes
I didn't touch on flattening the models enough! Most of these models can have their embeddings "flattened" or "collapsed": since the embedding path is just a stack of linear operations, you can merge them all into the embedding layer itself, and it ends up the same size as a normal model's embedding layer. I did play around with this a very small amount and got what I think are floating-point errors, where the flattened model performs slightly worse on evals. In theory, though, you could merge the embedding layer's operations into one reasonably small layer without loss degradation if you wanted to serve the base model more cheaply. For finetuning, though, optimizing the full set of operations should give a far better loss than finetuning the flattened version.
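For the simple projected-embedding models, the flattening is just a couple of matrix multiplications (a sketch, assuming the layout described earlier: a down-projection after the big table and an up-projection before a language modelling head that lives in the 8192-d space):

```python
import torch

@torch.no_grad()
def flatten(wte_big: torch.Tensor, down_w: torch.Tensor, up_w: torch.Tensor, head_w: torch.Tensor):
    """Merge the linear ops around the 8192-d space into normal-sized layers.

    wte_big: (vocab, 8192)   pretrained embedding table
    down_w:  (d_model, 8192) weight of the post-embedding down-projection
    up_w:    (8192, d_model) weight of the pre-head up-projection
    head_w:  (vocab, 8192)   weight of the language modelling head
    """
    flat_wte = wte_big @ down_w.T    # (vocab, d_model) embedding table
    flat_head = head_w @ up_w        # (vocab, d_model) head weight, i.e. an nn.Linear(d_model, vocab)
    return flat_wte, flat_head
```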
The approximate durations may not be consistent across experiments; even little things like opening my window to let in a cold breeze speed up the Vector, so they're mainly there as a guess at how long each model would take to train on a machine with 2 NVIDIA RTX A6000 graphics cards and a Ryzen Threadripper PRO 5955WX (16 cores / 32 threads).
I used the Cerebras models as a reference for how much data to train with, with a minimum of 2.2GT, but that most likely is not "compute optimal"; new scaling laws would have to be estimated for training with specific pretrained embeddings.
I launched 93 training runs for this by hand! If you want this kind of dedication to a project, I'm available for hire, but don't hire me yet because I have an appointment I need my insurance for first. Think about it, jot my name down in a note, and come back to it in March ^-^ if you think I'd fit in your team, or if you want to give me money for fun (which you should do ^-^).
(2/14/2024) A correction was issued for the average benchmark scores: all GSM8K scores are now included in the averages, to match how the Pythia and Cerebras scores were calculated. The GSM8K scores for each of my custom models are 0, and they're not shown in the tables because I didn't make this site easily editable, oops.