January
I am committing myself to writing about what I've done every month this year. This month I didn't have an experiment in mind, and I didn't have the idea to write every month until nearly halfway through it, so I'm going to speed through some of the first things I did. I started with the GPT-2 architecture and gradually modified the code to resemble the Llama architecture. Along the way I tried scaled sinusoidal embeddings, cartesian-to-hyperspherical conversions inside the embedding layer, a LayerNorm after the embedding layer, and more that didn't turn out well. I then decided to try "parallel decoder blocks" (`ffn(x) + attn(x) + x` instead of `ffn(attn(x)) + x`) and got a notably lower loss, so I kept that modified architecture through the rest of the month's work and called it ParaLlama. (This probably isn't normal: a later control run with Llama-like blocks and all the same parameters got a better loss than the parallel models. I should have run a few seeds at the beginning.)
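For clarity, here's a minimal sketch of the two block layouts (the module names are placeholders and the norms are omitted, so this isn't the exact code):

```python
import torch.nn as nn

class SequentialBlock(nn.Module):
    """Llama-style composition: ffn(attn(x)) + x (norms omitted for brevity)."""
    def __init__(self, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attn, self.ffn = attn, ffn

    def forward(self, x):
        return self.ffn(self.attn(x)) + x

class ParallelBlock(nn.Module):
    """Parallel composition used in ParaLlama: ffn(x) + attn(x) + x."""
    def __init__(self, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attn, self.ffn = attn, ffn

    def forward(self, x):
        return self.ffn(x) + self.attn(x) + x
```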
Llama2-70b Embeddings
I then started exploring the use of pretrained token embeddings from Llama2-70b in these models, by simply adding a linear projection from a hidden size of 8192 down to a smaller dimension after the token embedding, and another from that smaller dimension back up to 8192 before the language modelling head. Using a linear projection also lets us "flatten" the parameters down to the size of a model trained without the large embeddings, by simply performing a matrix multiplication between the projection and the embedding layer from Llama2-70b, which is super useful for inference! Early on, I tried allocating different percentages of training to "flattened mode" to see if it gave any gains in loss, but the best value ended up being 0% by a landslide. I also reran the small model with a learnable alpha on each residual operation, set to 0 at initialization to keep the model's function closer to the identity at the start (ParaLlama-αp), which is something that seems like it'd be done in those papers with the parallel blocks (idk, I think they do that). It didn't show any improvements here though, so it's left out of later (parallel) models.
EDIT: The learnable zero-initialized alpha for residual ops looks like it could have come from ReZero!
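Back to the embeddings: the embedding side looks roughly like this (a simplified sketch: `llama_embed_weight` stands in for the exported 32000×8192 Llama2-70b table, it's shown frozen here for simplicity, and the mirrored up-projection before the language modelling head is only noted in a comment):

```python
import torch
import torch.nn as nn

class ProjectedEmbedding(nn.Module):
    """Pretrained Llama2-70b token embeddings followed by a learned down-projection."""
    def __init__(self, llama_embed_weight: torch.Tensor, d_model: int):
        super().__init__()
        vocab_size, big_dim = llama_embed_weight.shape            # e.g. 32000, 8192
        self.wte = nn.Embedding(vocab_size, big_dim, _weight=llama_embed_weight)
        self.wte.weight.requires_grad = False                     # keep the big table frozen
        self.down = nn.Linear(big_dim, d_model, bias=False)       # 8192 -> model dim
        # A mirrored nn.Linear(d_model, big_dim) sits right before the language modelling head.

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.down(self.wte(input_ids))                     # (batch, seq, d_model)
```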
Architecture/Nickname | Dim | Max Parameters | Non-embed Parameters | Approximate duration | FLOPs | Tokens | Test loss | Average | ARC@25 | TruthfulQA@0 | WinoGrande@5 | HellaSwag@10 | MMLU@5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ParaLlama-p-medium | 1088 | 545m | | 38h | 8.660x10^18 | 5.1x10^9 | 2.295 | 29.182 | 22.01 | 45.43 | 51.54 | 30.09 | 26.02 |
ParaLlama-p-small | 768 | 369m | | 11h | 1.412x10^18 | 2.2x10^9 | 2.5163 | 28.435 | 20.14 | 47.26 | 49.64 | 27.94 | 25.63 |
ParaLlama-αp-small | 768 | 369m | | 11h | 1.412x10^18 | 2.2x10^9 | 2.5385 | | | | | | |
ParaLlama-small-control (no embeddings) | 768 | 119m | 95m | 6h | 1.246x10^18 | 2.2x10^9 | 2.535 | | | | | | |
Llama-p-small (w/ embeddings) | 768 | 369m | 101m | 11h | 1.412x10^18 | 2.2x10^9 | 2.485 | 28.882 | 21.67 | 46.89 | 51.22 | 28.27 | 25.24 |
Performance
I calculated a quick-and-dirty mean strided perplexity over the first 1k samples of the deduplicated Pile with 512-token strides, for other models trained on the Pile as well as some controls and our main experimental ParaLlama-p models.
Model / Arch / Nickname | Strided Sliding-Window Deduped Pile Perplexity | Avg | Hidden Size | Layers | Tokens (billions) |
---|---|---|---|---|---|
pythia-70m-deduped | 22.393 | 28.93 | 512 | 6 | 300b |
pythia-160m-deduped | 13.934 | 29.38 | 768 | 12 | 300b |
pythia-410m-deduped | 9.618 | 31.29 | 1024 | 24 | 300b |
pythia-1b-deduped | 7.889 | 32.78 | 2048 | 16 | 300b |
pythia-1.4b-deduped | 7.287 | 35 | 2048 | 24 | 300b |
pythia-2.8b-deduped in 8-bit (sorry) | | 36.72 | 2560 | 32 | 300b |
Cerebras-GPT-111M | 21.551 | 27.75 | 768 | 10 | 2.2b |
Cerebras-GPT-256M | 15.203 | 29.38 | 1088 | 14 | 5.1b |
Cerebras-GPT-590M | 12.098 | 29.14 | 1536 | 18 | 11.8b |
ParaLlama-small-control | 14.519 | | 768 | 10 | 2.2b |
ParaLlama-αp-small | 14.521 | | 768 | 10 | 2.2b (transformer) + 1400b* (embeddings) |
ParaLlama-p-small | 14.094 | 28.435 | 768 | 10 | 2.2b (transformer) + 1400b* (embeddings) |
ParaLlama-p-medium | 10.681 | 29.182 | 1088 | 14 | 5.1b (transformer) + 1400b* (embeddings) |
Llama-like (no embeds, llama-like composition of ffn & attn) | 13.882 | 28.608 | 768 | 10 | 2.2b |
Llama-p-small (w/ embeds) | 13.565 | 28.882 | 768 | 10 | 2.2b + 1400b* (embeddings) |
* not directly from the Pile, but likely large overlap
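The quick-and-dirty perplexity numbers were computed along these lines (a sketch following the standard strided-window recipe; the model/dataset ids are examples and the exact script differed):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-160m-deduped"                 # example model
model = AutoModelForCausalLM.from_pretrained(model_id).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# First 1k samples of the deduplicated Pile, joined into one long token stream.
samples = load_dataset("EleutherAI/the_pile_deduplicated", split="train[:1000]")
encodings = tokenizer("\n\n".join(samples["text"]), return_tensors="pt")

max_length, stride = 2048, 512
seq_len = encodings.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    target_len = end - prev_end                             # tokens scored in this window
    input_ids = encodings.input_ids[:, begin:end].cuda()
    target_ids = input_ids.clone()
    target_ids[:, :-target_len] = -100                      # ignore the overlapping context

    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss     # mean NLL over scored tokens (approx.)
    nlls.append(loss * target_len)
    prev_end = end
    if end == seq_len:
        break

print(torch.exp(torch.stack(nlls).sum() / prev_end))        # strided perplexity
```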
I think I got a particularly lucky seed when first testing whether the parallel blocks work better. It's the 30th by the time I'm training the new control, "just to see if it really did work, just in case" and "to fill out the table better," which turned out to be the right decision, with a very disappointing result. I did have just enough time to also train a Llama-like model with the expanded embeddings from Llama2-70b, though!
Architecture/Nickname | Dim | Max Parameters | Non-embed Parameters | Approximate duration | FLOPs | Tokens | Test loss | Average | ARC@25 | TruthfulQA@0 | WinoGrande@5 | HellaSwag@10 | MMLU@5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Llama-p-small | 768 | 369m | 101m | 11h | 1.412x10^18 | 2.2x10^9 | 2.483 | 28.882 | 21.67 | 46.89 | 51.22 | 28.27 | 25.24 |
Llama-like (no embeds, llama-like composition of ffn & attn) | 768 | 119m | 95m | 7h | 1.246x10^18 | 2.2x10^9 | 2.503 | 28.608 | 20.65 | 46.14 | 50.28 | 27.72 | 26.86 |
Some (More) Things that Didn't Work Very Well
Virtual Token Models
I also had an idea to split each token embedding into multiple "virtual token embeddings" that run side-by-side (eight 16d tokens could become sixteen 8d tokens), to increase the amount of information we're working with per natural-language token while keeping the same size of transformer. We can't do that without a complete transformation of the token embeddings (any residual transformation here gives a really bad loss), so I first add an `nn.Linear(max_dimension, max_dimension)` operation and then a custom `Split(virtual_tokens)` class after the `WTE`, plus a similar linear and a 'recombine' operation before the `LM_HEAD`. I have to test small here, because training a VTM's transformer on `d` natural-language tokens is essentially training a "normal" transformer on `d * virtual_tokens` tokens, so I decrease the number of virtual tokens by making the first linear transformation go from `max_dimension` to a much smaller dimension that can still be split into multiple tokens. Each virtual token uses the same `position_id` as its origin natural-language token.
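A rough sketch of the embedding-side plumbing (`Split`/`Recombine` here are illustrative stand-ins for my actual classes, and the linears around them are left out):

```python
import torch
import torch.nn as nn

class Split(nn.Module):
    """Turn every natural-language token into `virtual_tokens` smaller tokens along the sequence axis."""
    def __init__(self, virtual_tokens: int):
        super().__init__()
        self.virtual_tokens = virtual_tokens

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (batch, seq, dim)
        b, s, d = x.shape
        assert d % self.virtual_tokens == 0
        return x.reshape(b, s * self.virtual_tokens, d // self.virtual_tokens)

class Recombine(nn.Module):
    """Inverse of Split: merge each group of `virtual_tokens` back into one token embedding."""
    def __init__(self, virtual_tokens: int):
        super().__init__()
        self.virtual_tokens = virtual_tokens

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (batch, seq * v, dim / v)
        b, sv, dv = x.shape
        return x.reshape(b, sv // self.virtual_tokens, dv * self.virtual_tokens)

# Every virtual token reuses its origin token's position, e.g.:
# position_ids = torch.arange(seq_len).repeat_interleave(virtual_tokens)
```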
Architecture/Nickname | Dim | Maximum Parameters | Approximate duration | FLOPs | Tokens | Test loss |
---|---|---|---|---|---|---|
ParaLlama-p-micro | 256 | 273m | 5.8h | 1.384x10^17 | 2.2x10^9 | 3.11 |
2VTM-micro | . | 344m | 8h | 1.083x10^18 | 2.2x10^9 | 3.018 |
3VTM-micro | . | 349m | 10h | 1.143x10^18 | 2.2x10^9 | 2.996 |
4VTM-micro | . | 353m | 11h | 1.204x10^18 | 2.2x10^9 | 2.994 |
The VTM models end up slightly better than a model that uses the same dimension, but compared to models that use a similar number of FLOPs per forward pass they're worse; if parameter count is to be prioritized above all else, this is a viable architecture.
Mixture of Embeddings
A mixture-of-embeddings model is the ParaLlama architecture with a new Embedding class. This new Embedding class consists of a traditional embedding layer that turns a token into a vector whose size we'll call the proxy embedding dimension; then one projection per imported embedding is applied to the proxy embedding, from the proxy embedding dimension to the vocab size of that imported embedding. We also estimate, from the proxy embedding with a linear layer, an alpha for each resulting embedding, after they're projected to the model's dimension with another linear layer. We then just perform a weighted sum based on those alphas! The language modelling head is unchanged (a simple Linear layer without any warm start).
I only had enough time for one test of this architecture so the embeddings I used were from Llama2-70b, Pythia-12b, and UL2-20b. That's vocabs of size 32000 with 8192 dimensions, 50688 with 5120 dimensions, and 32128 with 4096 dimensions. A proxy embedding size of 8 was used with a hidden size of 768, 10 layers, and 12 attention heads.
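Roughly what the embedding class looks like in code (a simplified, illustrative sketch: each imported table is treated as a frozen "soft lookup" target, and details like weight normalization and device placement are glossed over):

```python
import torch
import torch.nn as nn

class MixtureOfEmbeddings(nn.Module):
    """Proxy embedding -> soft lookup into several pretrained embedding tables -> weighted sum."""
    def __init__(self, vocab_size: int, proxy_dim: int, d_model: int, imported_tables: list):
        super().__init__()
        # imported_tables: pretrained weight tensors, e.g. (32000, 8192), (50688, 5120), (32128, 4096)
        self.proxy = nn.Embedding(vocab_size, proxy_dim)                  # proxy_dim = 8 in the run above
        self.tables = [t.detach().float() for t in imported_tables]      # kept frozen
        self.to_vocab = nn.ModuleList(nn.Linear(proxy_dim, t.shape[0]) for t in imported_tables)
        self.to_model = nn.ModuleList(nn.Linear(t.shape[1], d_model) for t in imported_tables)
        self.alphas = nn.Linear(proxy_dim, len(imported_tables))          # one mixing weight per table

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        p = self.proxy(input_ids)                                         # (batch, seq, proxy_dim)
        a = self.alphas(p)                                                # (batch, seq, n_tables)
        out = 0.0
        for i, table in enumerate(self.tables):
            weights = self.to_vocab[i](p)                                 # (batch, seq, import_vocab)
            soft_embed = weights @ table                                  # (batch, seq, import_dim)
            out = out + a[..., i : i + 1] * self.to_model[i](soft_embed)  # project to d_model, scale by alpha
        return out
```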
Architecture/Nickname | Dim | Max Parameters | Non-embed Parameters | Approximate duration | FLOPs | Tokens | Test loss | Average | ARC@25 | TruthfulQA@0 | WinoGrande@5 | HellaSwag@10 | MMLU@5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ParaLlama-p-small | 768 | 369m | | 11h | 1.412x10^18 | 2.2x10^9 | 2.5163 | 28.435 | 20.14 | 47.26 | 49.64 | 27.94 | 25.63 |
MoEmbed-base | 768 | 787m | 119m | 16.5h | 1.760x10^18 | 2.2x10^9 | 2.754 | | | | | | |
This is without a mixture of language modelling heads; due to the time constraint I set, I didn't have enough time to implement and train one of those as well.
Fin
If you want a better benchmark score, using the embeddings of a model that gets a better benchmark score is... something you can do that won't do a lot. I suspect that these models, since they have access to a very large embedding space and only utilize a very small part of it, could be more effective when finetuned than "normal" base models, but I didn't have time to check.
Final (Extra) Notes
I didn't touch on flattening the models enough! Most of these models can have their embeddings "flattened" or "collapsed": since the embedding path is just a stack of linear operations, you can merge them all into the embedding layer itself, and it ends up the same size as a normal model's embedding layer. I did play around with this a very small amount and got what I think are floating-point errors, where the flattened model performs slightly worse on evals. In theory, though, you could merge the embedding layer's operations into one reasonably small layer without loss degradation if you wanted to serve the base model more cheaply. For finetuning, though, optimizing the full set of operations should give a far better loss than finetuning the flattened version.
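For the simple projected-embedding models, the flattening is just a couple of matrix multiplications (a sketch, assuming the layout described earlier: a down-projection after the big table and an up-projection before a language modelling head that lives in the 8192-d space):

```python
import torch

@torch.no_grad()
def flatten(wte_big: torch.Tensor, down_w: torch.Tensor, up_w: torch.Tensor, head_w: torch.Tensor):
    """Merge the linear ops around the 8192-d space into normal-sized layers.

    wte_big: (vocab, 8192)   pretrained embedding table
    down_w:  (d_model, 8192) weight of the post-embedding down-projection
    up_w:    (8192, d_model) weight of the pre-head up-projection
    head_w:  (vocab, 8192)   weight of the language modelling head
    """
    flat_wte = wte_big @ down_w.T    # (vocab, d_model) embedding table
    flat_head = head_w @ up_w        # (vocab, d_model) head weight, i.e. an nn.Linear(d_model, vocab)
    return flat_wte, flat_head
```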
The approximate durations may not be consistent across experiments; even little things like opening my window to let in a cold breeze speed up the Vector, so they're mainly there as a guess at how long each model would take to train on a machine with 2 NVIDIA RTX A6000 graphics cards and a Ryzen Threadripper PRO 5955WX (16 cores / 32 threads).
I used the Cerebras models as a reference for how much data to train with, with a minimum of 2.2GT, but that most likely is not "compute optimal"; new scaling laws would have to be estimated for training with specific pretrained embeddings.
I launched 93 training runs for this by hand! If you want this kind of dedication to a project, I'm available for hire, but don't hire me yet because I have an appointment I need my insurance for first. Think about it, jot my name down in a note, and come back to it in March ^-^ if you think I'd fit in your team, or if you want to give me money for fun (which you should do ^-^).
(2/14/2024) A correction was issued for the average benchmark scores: all GSM8K scores are now included in the averages, to match how the Pythia and Cerebras scores were calculated. The GSM8K scores for each of my custom models are 0, and they're not shown in the tables because I didn't make this site easily editable, oops.