awacke1 committed on
Commit 9b8e3a6 · 1 Parent(s): 7d02872

Update app.py

Files changed (1)
  1. app.py +6 -6
app.py CHANGED
@@ -70,22 +70,22 @@ https://paperswithcode.com/paper/bloom-a-176b-parameter-open-access
  ## ChatGPT Datasets - Details 📚
 
  - **WebText:** A dataset of web pages crawled from domains on the Alexa top 5,000 list. This dataset was used to pretrain GPT-2.
- - [WebText: A Large-Scale Unsupervised Text Corpus](https://arxiv.org/abs/1902.10197) by Radford et al.
+ - [WebText: A Large-Scale Unsupervised Text Corpus by Radford et al.](https://paperswithcode.com/dataset/webtext)
 
  - **Common Crawl:** A dataset of web pages from a variety of domains, which is updated regularly. This dataset was used to pretrain GPT-3.
- - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) by Brown et al.
+ - [Language Models are Few-Shot Learners](https://paperswithcode.com/dataset/common-crawl) by Brown et al.
 
  - **BooksCorpus:** A dataset of over 11,000 books from a variety of genres.
- - [Scalable Methods for 8 Billion Token Language Modeling](https://arxiv.org/abs/1907.05019) by Zhu et al.
+ - [Scalable Methods for 8 Billion Token Language Modeling](https://paperswithcode.com/dataset/bookcorpus) by Zhu et al.
 
  - **English Wikipedia:** A dump of the English-language Wikipedia as of 2018, with articles from 2001-2017.
- - [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) by Radford et al.
+ - [Improving Language Understanding by Generative Pre-Training](https://huggingface.co/spaces/awacke1/WikipediaUltimateAISearch?logs=build) Space for Wikipedia Search
 
  - **Toronto Books Corpus:** A dataset of over 7,000 books from a variety of genres, collected by the University of Toronto.
- - [Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond](https://arxiv.org/abs/1812.10464) by Schwenk and Douze.
+ - [Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond](https://paperswithcode.com/dataset/bookcorpus) by Schwenk and Douze.
 
  - **OpenWebText:** A dataset of web pages that were filtered to remove content that was likely to be low-quality or spammy. This dataset was used to pretrain GPT-3.
- - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) by Brown et al.
+ - [Language Models are Few-Shot Learners](https://paperswithcode.com/dataset/openwebtext) by Brown et al.
 
 
  ## Big Science Model 🚀
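
Several of the corpora referenced in this hunk are also mirrored on the Hugging Face Hub, so they can be inspected directly in Python. The snippet below is a minimal sketch, not part of this commit: the `Skylion007/openwebtext` dataset ID and the `datasets` library are assumptions about where OpenWebText is hosted, not something app.py itself loads.

```python
# Minimal sketch (assumption: OpenWebText is available on the Hub as
# "Skylion007/openwebtext"); swap the ID to browse one of the other corpora.
from datasets import load_dataset

# Stream the split so nothing is downloaded in full before inspecting it.
openwebtext = load_dataset("Skylion007/openwebtext", split="train", streaming=True)

# Print a short preview of the first three documents.
for i, record in enumerate(openwebtext):
    print(record["text"][:200].replace("\n", " "))
    if i >= 2:
        break
```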