sdk: static
pinned: false
---
Welcome - This organization holds examples and links for this session.

<h1><center>🥫Datasets📊</center></h1>
<div align="center">Curated Datasets: <a href="https://www.kaggle.com/datasets">Kaggle</a>. <a href="https://www.nlm.nih.gov/research/umls/index.html">NLM UMLS</a>. <a href="https://loinc.org/downloads/">LOINC</a>. <a href="https://www.cms.gov/medicare/icd-10/2022-icd-10-cm">ICD10 Diagnosis</a>. <a href="https://icd.who.int/dev11/downloads">ICD11</a>. <a href="https://paperswithcode.com/datasets?q=medical&v=lst&o=newest">Papers, Code, Datasets for SOTA in Medicine</a>. <a href="https://paperswithcode.com/datasets?q=mental&v=lst&o=newest">Mental</a>. <a href="https://paperswithcode.com/datasets?q=behavior&v=lst&o=newest">Behavior</a>. <a href="https://www.cms.gov/medicare-coverage-database/downloads/downloads.aspx">CMS Downloads</a>. <a href="https://www.cms.gov/medicare/fraud-and-abuse/physicianselfreferral/list_of_codes">CMS CPT and HCPCS Procedures and Services</a></div>

# 👋 Two easy ways to turbo boost your AI learning journey! 💻

# 🌐 AI Pair Programming

## Open 2 Browsers to:

1. __🌐 ChatGPT__ [URL](https://chat.openai.com/chat) or [URL2](https://platform.openai.com/playground) and
2. __🌐 Huggingface__ [URL](https://huggingface.co/awacke1) in separate browser windows.

Then:

1. 🤖 Use prompts to generate a streamlit program on Huggingface or locally to test it.
2. 🔧 For advanced work, add Python 3.10 and VSCode locally, and debug as gradio or streamlit apps.
3. 🚀 Use these two superpower processes to reduce the time it takes you to make a new AI program! ⏱️

Example Starter Prompt:

Write a streamlit program that demonstrates Data synthesis.
Synthesize data from multiple sources to create new datasets.
Use two datasets and demonstrate pandas dataframe query merge and join
with two datasets in python list dictionaries:
List of Hospitals that are over 1000 bed count by city and state, and
State population size and square miles.
Perform a calculated function on the merged dataset.
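
A minimal sketch of the data-handling core that prompt asks for, stripped of the Streamlit UI so it runs as plain Python. All hospital names, bed counts, and state figures below are illustrative placeholders, not real data.

```python
import pandas as pd

# Two datasets as Python lists of dictionaries, per the prompt.
hospitals = [
    {"hospital": "Mercy General", "city": "Springfield", "state": "IL", "beds": 1200},
    {"hospital": "St. Luke's", "city": "Houston", "state": "TX", "beds": 1500},
    {"hospital": "Bayview Medical", "city": "Dallas", "state": "TX", "beds": 1100},
]
states = [
    {"state": "IL", "population": 12_580_000, "sq_miles": 57_914},
    {"state": "TX", "population": 29_530_000, "sq_miles": 268_596},
]

df_hospitals = pd.DataFrame(hospitals)
df_states = pd.DataFrame(states)

# Query: keep hospitals over a 1000-bed count.
big = df_hospitals.query("beds > 1000")

# Merge (a SQL-style join) the filtered hospitals with state data on "state".
merged = big.merge(df_states, on="state", how="left")

# Calculated function on the merged dataset: beds per million residents.
merged["beds_per_million"] = merged["beds"] / merged["population"] * 1_000_000
print(merged[["hospital", "state", "beds", "beds_per_million"]])
```

Wrapping the same logic in `st.dataframe(merged)` inside a Streamlit script gives the interactive version the prompt describes.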

# 🎥 YouTube University Method:

1. 🏋️‍♀️ Plan two hours each weekday to exercise your body and brain.
2. 🎬 Make a playlist of videos you want to learn from on YouTube. Save the links to edit later.
3. 🚀 Try watching the videos at a faster speed while exercising, and sample the first five minutes of each video.
4. 📜 Reorder the playlist so the most useful videos are at the front, and take breaks to exercise.
5. 📝 Practice note-taking in markdown to instantly save what you want to remember. Share your notes with others!
6. 👥 Try AI pair programming with long-answer language models refined by human feedback.

## 🎥 2023 AI/ML Advanced Learning Playlists:

1. [2023 QA Models and Long Form Question Answering NLP](https://www.youtube.com/playlist?list=PLHgX2IExbFovrkkx8HMTLNgYdjCMNYmX_)
2. [FHIR Bioinformatics Development Using AI/ML and Python, Streamlit, and Gradio - 2022](https://www.youtube.com/playlist?list=PLHgX2IExbFovoMUC3hYXeFegpk_Y0Lz0Q)
3. [2023 ChatGPT for Coding Assistant Streamlit, Gradio and Python Apps](https://www.youtube.com/playlist?list=PLHgX2IExbFouOEnppexiKZVdz_k5b0pvI)
4. [2023 BigScience Bloom - Large Language Model for AI Systems and NLP](https://www.youtube.com/playlist?list=PLHgX2IExbFouqnsIqziThlPCX_miiDq14)
5. [2023 Streamlit Pro Tips for AI UI UX for Data Science, Engineering, and Mathematics](https://www.youtube.com/playlist?list=PLHgX2IExbFou3cP19hHO9Xb-cN8uwr5RM)
6. [2023 Fun, New and Interesting AI, Videos, and AI/ML Techniques](https://www.youtube.com/playlist?list=PLHgX2IExbFotoMt32SrT3Xynt5BXTGnEP)
7. [2023 Best Minds in AGI AI Gamification and Large Language Models](https://www.youtube.com/playlist?list=PLHgX2IExbFotmFeBTpyje1uI22n0GAkXT)
8. [2023 State of the Art for Vision Image Classification, Text Classification and Regression, Extractive Question Answering and Tabular Classification](https://www.youtube.com/playlist?list=PLHgX2IExbFotPcPu6pauNHOoZTTbnAQ2F)
9. [2023 AutoML DataRobot and AI Platforms for Building Models, Features, Test, and Transparency](https://www.youtube.com/playlist?list=PLHgX2IExbFovsY2oGbDwdEhPrakkC8i3g)

### Comparison of Large Language Models

| Model Name           | Model Size (in Parameters) |
| -------------------- | -------------------------- |
| BigScience-tr11-176B | 176 billion                |
| GPT-3                | 175 billion                |
| OpenAI's DALL-E 2.0  | 500 million                |
| NVIDIA's Megatron    | 8.3 billion                |
| Transformer-XL       | 250 million                |
| XLNet                | 210 million                |

## ChatGPT Datasets 📚

- WebText
- Common Crawl
- BooksCorpus
- English Wikipedia
- Toronto Books Corpus
- OpenWebText

## ChatGPT Datasets - Details 📚

- **WebText:** A dataset of web pages crawled from domains on the Alexa top 5,000 list. This dataset was used to pretrain GPT-2.
  - [WebText: A Large-Scale Unsupervised Text Corpus by Radford et al.](https://paperswithcode.com/dataset/webtext)
- **Common Crawl:** A dataset of web pages from a variety of domains, which is updated regularly. This dataset was used to pretrain GPT-3.
  - [Language Models are Few-Shot Learners](https://paperswithcode.com/dataset/common-crawl) by Brown et al.
- **BooksCorpus:** A dataset of over 11,000 books from a variety of genres.
  - [Aligning Books and Movies: Towards Story-like Visual Explanations](https://paperswithcode.com/dataset/bookcorpus) by Zhu et al.
- **English Wikipedia:** A dump of the English-language Wikipedia as of 2018, with articles from 2001-2017.
  - [Improving Language Understanding by Generative Pre-Training](https://huggingface.co/spaces/awacke1/WikipediaUltimateAISearch?logs=build) Space for Wikipedia Search
- **Toronto Books Corpus:** A dataset of over 7,000 books from a variety of genres, collected by the University of Toronto.
  - [Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond](https://paperswithcode.com/dataset/bookcorpus) by Schwenk and Douze.
- **OpenWebText:** An open-source recreation of GPT-2's WebText corpus, with web pages filtered to remove content likely to be low-quality or spammy.
  - [Language Models are Few-Shot Learners](https://paperswithcode.com/dataset/openwebtext) by Brown et al.

## Big Science Model 🚀

- 📜 Papers:
  1. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [Paper](https://arxiv.org/abs/2211.05100)
  2. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Paper](https://arxiv.org/abs/1909.08053)
  3. 8-bit Optimizers via Block-wise Quantization [Paper](https://arxiv.org/abs/2110.02861)
  4. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [Paper](https://arxiv.org/abs/2108.12409)
  5. [Other papers related to Big Science](https://huggingface.co/models?other=doi:10.57967/hf/0003)
  6. [217 other models optimized for use with Bloom](https://huggingface.co/models?other=bloom)

- 📚 Datasets:

1. **Universal Dependencies:** A collection of annotated corpora for natural language processing in a range of languages, with a focus on dependency parsing.
   - [Universal Dependencies official website.](https://universaldependencies.org/)
2. **WMT 2014:** The 2014 edition of the Workshop on Statistical Machine Translation, featuring shared tasks on translating between English and various other languages.
   - [WMT14 website.](http://www.statmt.org/wmt14/)
3. **The Pile:** An English language corpus of diverse text, sourced from various places on the internet.
   - [The Pile official website.](https://pile.eleuther.ai/)
4. **HumanEval:** A dataset of hand-written programming problems used to evaluate code generation by language models.
   - [HumanEval dataset page.](https://github.com/google-research-datasets/humaneval)
5. **FLORES-101:** A dataset of parallel sentences in 101 languages, designed for evaluating multilingual machine translation.
   - [FLORES-101 website.](https://flores101.opennmt.net/)
6. **CrowS-Pairs:** A dataset of sentence pairs designed for measuring social biases in masked language models.
   - [CrowS-Pairs repository.](https://github.com/stanford-cogsci/crows-pairs)
7. **WikiLingua:** A cross-lingual abstractive summarization dataset of article and summary pairs, sourced from WikiHow.
   - [WikiLingua dataset page.](https://arxiv.org/abs/2105.08031)
8. **MTEB:** The Massive Text Embedding Benchmark, which evaluates text embedding models across a broad set of tasks.
   - [MTEB repository.](https://github.com/google-research-datasets/mteb)
9. **xP3:** A multilingual collection of prompts and tasks across dozens of languages, used for crosslingual multitask finetuning.
   - [xP3 repository.](https://github.com/nyu-dl/xp3)
10. **DiaBLa:** An English-French dataset of bilingual dialogues for evaluating machine translation in informal settings.
    - [DiaBLa repository.](https://github.com/HLTCHKUST/DiaBLA)

- 📚 Dataset Papers with Code
  1. [Universal Dependencies](https://paperswithcode.com/dataset/universal-dependencies)
  2. [WMT 2014](https://paperswithcode.com/dataset/wmt-2014)
  3. [The Pile](https://paperswithcode.com/dataset/the-pile)
  4. [HumanEval](https://paperswithcode.com/dataset/humaneval)
  5. [FLORES-101](https://paperswithcode.com/dataset/flores-101)
  6. [CrowS-Pairs](https://paperswithcode.com/dataset/crows-pairs)
  7. [WikiLingua](https://paperswithcode.com/dataset/wikilingua)
  8. [MTEB](https://paperswithcode.com/dataset/mteb)
  9. [xP3](https://paperswithcode.com/dataset/xp3)
  10. [DiaBLa](https://paperswithcode.com/dataset/diabla)

# Deep RL ML Strategy 🧠

The AI strategies are:
- Language Model Preparation using Human Augmented with Supervised Fine Tuning 🤖
- Reward Model Training with Prompts Dataset Multi-Model Generate Data to Rank 🎁
- Fine Tuning with Reinforcement Reward and Distance Distribution Regret Score 🎯
- Proximal Policy Optimization Fine Tuning 🤝
- Variations - Preference Model Pretraining 🤔
- Use Ranking Datasets Sentiment - Thumbs Up/Down, Distribution 📊
- Online Version Getting Feedback 💬
- OpenAI - InstructGPT - Humans generate LM Training Text 🔍
- DeepMind - Advantage Actor Critic Sparrow, GopherCite 🦜
- Reward Model Human Preference Feedback 🏆
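
The reward-model ranking step above is commonly trained with a pairwise loss over human-ranked response pairs (the InstructGPT recipe). A minimal sketch in plain Python; the scores below are illustrative stand-ins for what a real reward model would output, not actual model values:

```python
import math

def pairwise_ranking_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-sigmoid of the score margin: the loss is small when the
    reward model scores the human-preferred response above the rejected one,
    and grows as the model's ranking disagrees with the human label."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative reward scores for two responses a human labeler ranked.
agree_loss = pairwise_ranking_loss(2.0, -1.0)     # model agrees with the label
disagree_loss = pairwise_ranking_loss(-1.0, 2.0)  # model disagrees
print(agree_loss, disagree_loss)
```

Minimizing this loss over many ranked pairs pushes the reward model toward the human preference ordering; the resulting scores then serve as the reward signal for the PPO fine-tuning stage listed above.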

For more information on specific techniques and implementations, check out the following resources:
- OpenAI's paper on [GPT-3](https://arxiv.org/abs/2005.14165), which details large-scale language model pretraining and few-shot learning
- The [Soft Actor-Critic (SAC)](https://arxiv.org/abs/1801.01290) paper by Haarnoja et al., which describes an off-policy actor-critic algorithm
- OpenAI's paper on [Reward Learning](https://arxiv.org/abs/1810.06580), which explains their approach to training Reward Models
- OpenAI's blog post on [GPT-3's fine-tuning process](https://openai.com/blog/fine-tuning-gpt-3/)