author     Nikolay Zinov <nzinov@yandex-team.ru>   2022-06-26 09:47:50 +0300
committer  GitHub <noreply@github.com>             2022-06-26 09:47:50 +0300
commit     d921248abe3bbbd6cf2f7968dfa48a1e88612460 (patch)
tree       f0a8e795d0817697f73d356724e292ac358198eb
parent     deb045def565ad4d765341c08dbd73583bfab3d3 (diff)
Add training details in the README
-rw-r--r--  README.md  34
1 file changed, 34 insertions, 0 deletions
diff --git a/README.md b/README.md
index 4a19ed0..5998c74 100644
--- a/README.md
+++ b/README.md
@@ -31,3 +31,37 @@ You can start with the following scripts:
## License
The model is published under the Apache 2.0 license, which permits both research and commercial use; Megatron-LM is licensed under the [Megatron-LM license](megatron_lm/LICENSE).
+
+## Training details
+
+### Dataset composition
+
+The dataset used to train YaLM-100B comprises the following parts (rough percentages are measured in tokens seen by the model):
+
+* **25%** [The Pile](https://pile.eleuther.ai/) — an open English dataset by the EleutherAI team
+
+* **75%** Texts in Russian collected by our team (the sub-percentages below are given relative to the whole dataset):
+
+  * 49% Russian web pages from the Yandex Search index, filtered from ~100 TB down to ~1 TB by the following heuristics (a toy sketch of filters 2 and 3 follows the list):
+    1. LSH deduplication — clusters of similar texts were collapsed to a single text each
+    2. Length filtration — texts that were too short or too long, or that contained too few natural sentences, were discarded
+    3. Entropy filtration — texts with too high or too low entropy were discarded
+    4. Domain filtration — domains with repetitive texts (such as online retail) were discarded
+    5. Classifier filtration — a dataset of good texts was collected, in a manner similar to WebText, from pages linked in Russian-language tweets with at least one reply. A classifier was then trained to distinguish these good texts from random pages in the crawled dataset, and texts from the original crawl with low classifier scores were discarded
+
+  * 12% News from various sources in the Yandex Search index
+
+  * 10% Books from the dataset used in the [Russian Distributional Thesaurus](https://russe.nlpub.org/downloads/)
+
+  * 3% Misc texts from the [Taiga Dataset](https://tatianashavrina.github.io/taiga_site/)
+
+  * 1.5% Dialogues from social media, preprocessed in a manner similar to how Reddit is processed in The Pile
+
+  * 0.5% Russian portion of Wikipedia
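+
+As a rough illustration of heuristics 2 and 3, here is a minimal, self-contained sketch of a length-plus-entropy text filter. The helper names, thresholds, and the character-level entropy measure are illustrative assumptions, not the values or statistics used in the actual pipeline:
+
+```python
+import math
+from collections import Counter
+
+
+def char_entropy(text: str) -> float:
+    """Shannon entropy (bits per character) of the text's character distribution."""
+    counts = Counter(text)
+    total = len(text)
+    return -sum(c / total * math.log2(c / total) for c in counts.values())
+
+
+def passes_filters(text: str,
+                   min_chars: int = 500,       # illustrative threshold
+                   max_chars: int = 100_000,   # illustrative threshold
+                   min_entropy: float = 3.0,   # too low: repetitive boilerplate
+                   max_entropy: float = 6.0) -> bool:  # too high: likely noise
+    """Toy version of the length and entropy filtration steps above."""
+    if not (min_chars <= len(text) <= max_chars):
+        return False
+    return min_entropy <= char_entropy(text) <= max_entropy
+```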
+
+Some subsets were traversed up to 3 times during training.
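+
+As a rough sketch of how such a token-weighted mixture can be sampled (the dataset names are hypothetical labels and the sampling scheme is an assumption for illustration, not necessarily how the actual data pipeline mixes sources):
+
+```python
+import random
+from collections import Counter
+
+# Approximate token-share weights from the list above; they are rough and do not
+# sum to exactly 1.0, which is fine because random.choices normalizes the weights.
+MIXTURE = {
+    "the_pile_en": 0.25,
+    "ru_web_pages": 0.49,
+    "ru_news": 0.12,
+    "ru_books": 0.10,
+    "taiga_misc": 0.03,
+    "ru_social_dialogues": 0.015,
+    "ru_wikipedia": 0.005,
+}
+
+rng = random.Random(0)
+names, weights = zip(*MIXTURE.items())
+
+# Source of each of the next 100k training sequences; small sources are revisited
+# more often relative to their size, which is how a subset can be traversed
+# several times before training ends.
+draws = Counter(rng.choices(names, weights=weights, k=100_000))
+for name, count in draws.most_common():
+    print(f"{name:22s} {count / 100_000:.3f}")
+```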
+
+
+### Training process
+
+The model was trained on a cluster of 800 A100 GPUs for ~65 days, consuming 300B tokens in that time. You can see a TensorBoard with the LR and ramp-up schedule, training metrics, and our "thermometers" on the [HF page](https://huggingface.co/yandex/yalm-100b).
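+
+For a rough sense of scale, the reported figures imply the following average throughput (a back-of-the-envelope estimate only; it ignores downtime, restarts, and the ramp-up period):
+
+```python
+# Back-of-the-envelope throughput implied by the figures above.
+tokens = 300e9            # ~300B tokens consumed
+gpus = 800                # A100 GPUs in the cluster
+days = 65                 # approximate wall-clock training time
+
+seconds = days * 24 * 3600
+cluster_tps = tokens / seconds     # ~53,000 tokens/s across the cluster
+per_gpu_tps = cluster_tps / gpus   # ~67 tokens/s per GPU
+
+print(f"~{cluster_tps:,.0f} tokens/s cluster-wide, ~{per_gpu_tps:.0f} tokens/s per GPU")
+```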