Spaces: awacke1/Multiplayer-RLHF-Evals

awacke1 committed on Mar 28, 2023

Commit eecfc12

1 Parent(s): f1cead7

Update app.py

Files changed (1)

app.py +223 -42

@@ -1,56 +1,237 @@
-# Import required libraries
-import nltk
-from nltk.tokenize import word_tokenize, sent_tokenize, regexp_tokenize, TweetTokenizer
-from nltk.tokenize.treebank import TreebankWordDetokenizer
-from nltk.tokenize import NLTKWordTokenizer
-nltk.download('punkt')
-# Test cases for NLTKWordTokenizer
-s1 = "On a $50,000 mortgage of 30 years at 8 percent, the monthly payment would be $366.88."
-print(word_tokenize(s1))
-s2 = "\"We beat some pretty good teams to get here,\" Slocum said."
-print(word_tokenize(s2))
-s3 = "Well, we couldn't have this predictable, cliche-ridden, \"Touched by an Angel\" (a show creator John Masius worked on) wanna-be if she didn't."
-print(word_tokenize(s3))
-s4 = "I cannot cannot work under these conditions!"
-print(word_tokenize(s4))
-s5 = "The company spent $30,000,000 last year."
-print(word_tokenize(s5))
-s6 = "The company spent 40.75% of its income last year."
-print(word_tokenize(s6))
-s7 = "He arrived at 3:00 pm."
-print(word_tokenize(s7))
-s8 = "I bought these items: books, pencils, and pens."
-print(word_tokenize(s8))
-s9 = "Though there were 150, 100 of them were old."
-print(word_tokenize(s9))
-s10 = "There were 300,000, but that wasn't enough."
-print(word_tokenize(s10))
-s11 = "It's more'n enough."
-print(word_tokenize(s11))
-s = '''Good muffins cost $3.88\nin New (York).  Please (buy) me\ntwo of them.\n(Thanks).'''
-expected = [(0, 4), (5, 12), (13, 17), (18, 19), (19, 23),
-(24, 26), (27, 30), (31, 32), (32, 36), (36, 37), (37, 38),
-(40, 46), (47, 48), (48, 51), (51, 52), (53, 55), (56, 59),
-(60, 62), (63, 68), (69, 70), (70, 76), (76, 77), (77, 78)]
-print(list(NLTKWordTokenizer().span_tokenize(s)) == expected)
-expected = ['Good', 'muffins', 'cost', '$', '3.88', 'in',
-'New', '(', 'York', ')', '.', 'Please', '(', 'buy', ')',
-'me', 'two', 'of', 'them.', '(', 'Thanks', ')', '.']
-print([s[start:end] for start, end in NLTKWordTokenizer().span_tokenize(s)] == expected)
-sx1 = '\xabNow that I can do.\xbb'
-expected = ['\xab', 'Now', 'that', 'I', 'can', 'do', '.', '\xbb']
-print(word_tokenize(sx1) == expected)

+import streamlit as st
+st.markdown("""
+# Thank you for contributing an eval! ♥️
+🚨 Please make sure your PR follows these guidelines, __failure to follow
+the guidelines below will result in the PR being closed automatically__.
+Note that even if the criteria are met, that does not guarantee the PR
+will be merged nor GPT-4 access granted. 🚨
+__PLEASE READ THIS__:
+In order for a PR to be merged, it must fail on GPT-4. We are aware that
+right now, users do not have access, so you will not be able to tell if
+the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep
+in mind as we run the eval, if GPT-4 gets higher than 90% on the eval,
+we will likely reject since GPT-4 is already capable of completing the
+task.
+We plan to roll out a way for users submitting evals to see the eval
+performance on GPT-4 soon. Stay tuned! Until then, you will not be able
+to see the eval performance on GPT-4. We encourage partial PRs with
+~5-10 examples that we can then run the evals on and share the results
+with you so you know how your eval does with GPT-4 before writing all
+100 examples.
+## Eval details 📑
+### Eval name
+which-is-heavier
+### Eval description
+This evaluation tests the physical reasoning of GPT by asking which of
+two quantities is heavier, where the quantities are assigned explicit
+weights (e.g., "5 kilograms" or "2 pounds") and there is a clear answer
+(e.g., Q: "Is 5 pounds of tissue paper heavier than 3 pounds of
+granite?" A: "Yes"). The catch is that, in each example, the heavier
+quantity is always associated with an item that is generally thought of
+as being light (e.g., feathers, tissue paper, cotton balls) while the
+lighter quantity is always associated with an item that is generally
+thought of as being heavy (e.g., granite, cast iron, plutonium). Humans
+can easily achieve 100% on this task, but they have to cognitively
+ignore what comprises the quantities and focus on just their weights.
+### What makes this a useful eval?
+ChatGPT 3.5 does a decent job of comparing weights (i.e., "Is 5 pounds
+greater than 3 pounds?" "Yes") as well as understanding common
+attributions of "light" and "heavy" to everyday objects like feathers
+and anvils, but when combining these two into a single comparison, it
+appears that the model's bias towards these colloquial attributes makes
+it difficult for the model to perform well.
+What's interesting is that the errors are not even logically consistent.
+Consider the following example (taken from the ChatGPT dialogue UX/UI
+using GPT-4):
+<img width="715" alt="Screen Shot 2023-03-21 at 4 45 44 PM"
+src="https://user-images.githubusercontent.com/11773823/226827152-f78c0fb4-1136-433c-9d55-65eb6c6c5407.png">
+According to GPT-4:
+- 15 pounds of hydrogen is heavier than 10 pounds of plutonium
+- 20 pounds of hydrogen is **not** heavier than 10 pounds of plutonium
+- 30 pounds of hydrogen is heavier than 10 pounds of plutonium
+GPT-4 appears to do better than GPT-3.5 in informal testing, but still
+fails on simple cases and gives inconsistent results.
+This task exposes how adversarial red herrings can potentially hamper
+performance in quantitative/physical reasoning tasks such as weight
+comparison.
+The other nice aspect of this task is that one can generate these
+examples programmatically. I can submit a dataset of thousands rather
+than just a few hundred if preferred.
+## Criteria for a good eval ✅
+Below are some of the criteria we look for in a good eval. In general,
+we are seeking cases where the model does not do a good job despite
+being capable of generating a good response (note that there are some
+things large language models cannot do, so those would not make good
+evals).
+Your eval should be:
+- [x] Thematically consistent: The eval should be thematically
+consistent. We'd like to see a number of prompts all demonstrating some
+particular failure mode. For example, we can create an eval on cases
+where the model fails to reason about the physical world.
+- [x] Contains failures where a human can do the task, but either GPT-4
+or GPT-3.5-Turbo could not.
+- [x] Includes good signal around what is the right behavior. This means
+either a correct answer for `Basic` evals or the `Fact` Model-graded
+eval, or an exhaustive rubric for evaluating answers for the `Criteria`
+Model-graded eval.
+- [x] Include at least 100 high quality examples (it is okay to only
+contribute 5-10 meaningful examples and have us test them with GPT-4
+before adding all 100)
+If there is anything else that makes your eval worth including, please
+document it below.
+### Unique eval value
+> Insert what makes your eval high quality that was not mentioned above.
+(Not required)
+## Eval structure 🏗️
+Your eval should
+- [x] Check that your data is in `evals/registry/data/{name}`
+- [x] Check that your yaml is registered at
+`evals/registry/evals/{name}.yaml`
+- [x] Ensure you have the right to use the data you submit via this eval
+(For now, we will only be approving evals that use one of the existing
+eval classes. You may still write custom eval classes for your own
+cases, and we may consider merging them in the future.)
+## Final checklist 👀
+### Submission agreement
+By contributing to Evals, you are agreeing to make your evaluation logic
+and data available under the same MIT license as this repository. You must have
+adequate rights to upload any data used in an Eval. OpenAI reserves the
+right to use this data in future service improvements to our product.
+Contributions to OpenAI Evals will be subject to our usual Usage
+Policies (https://platform.openai.com/docs/usage-policies).
+- [x] I agree that my submission will be made available under an MIT
+license and complies with OpenAI's usage policies.
+### Email address validation
+If your submission is accepted, we will be granting GPT-4 access to a
+limited number of contributors. Access will be given to the email
+address associated with the merged pull request.
+- [x] I acknowledge that GPT-4 access will only be granted, if
+applicable, to the email address used for my merged pull request.
+### Limited availability acknowledgement
+We know that you might be excited to contribute to OpenAI's mission,
+help improve our models, and gain access to GPT-4. However, due to the
+requirements mentioned above and high volume of submissions, we will not
+be able to accept all submissions and thus not grant everyone who opens
+a PR GPT-4 access. We know this is disappointing, but we hope to set the
+right expectation before you open this PR.
+- [x] I understand that opening a PR, even if it meets the requirements
+above, does not guarantee the PR will be merged nor GPT-4 access
+granted.
+### Submit eval
+- [x] I have filled out all required fields in the evals PR form
+- [x] (Ignore if not submitting code) I have run `pip install
+pre-commit; pre-commit install` and have verified that `black`, `isort`,
+and `autoflake` are running when I commit and push
+Failure to fill out all required fields will result in the PR being
+closed.
+### Eval JSON data
+Since we are using Git LFS, we are asking eval submitters to add in as
+many Eval Samples (at least 5) from their contribution here:
+<details>
+  <summary>View evals in JSON</summary>
+  ### Eval
+  ```jsonl
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Is 2 kilograms of eggshells heavier than 1 kilogram of gold bars? Answer Yes or No"}], "ideal": "Yes"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Is 9 pounds of soap foam heavier than 5 pounds of iron chains? Answer Yes or No"}], "ideal": "Yes"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Is 15 pounds of confetti heavier than 10 pounds of rebar? Answer Yes or No"}], "ideal": "Yes"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Is 1 pound of steel beams heavier than 2 pounds of dust particles? Answer Yes or No"}], "ideal": "No"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Is 10 pounds of cast iron heavier than 20 pounds of flakes? Answer Yes or No"}], "ideal": "No"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Is 10 kilograms of granite slabs heavier than 20 kilograms of balloons? Answer Yes or No"}], "ideal": "No"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Is 10 kilograms of bricks heavier than 15 kilograms of dust particles? Answer Yes or No"}], "ideal": "No"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Is 1 pound of bronze heavier than 2 pounds of snowflakes? Answer Yes or No"}], "ideal": "No"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Is 1 pound of cast iron heavier than 3 pounds of spider silk? Answer Yes or No"}], "ideal": "No"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Is 12 pounds of hydrogen heavier than 10 pounds of palladium? Answer Yes or No"}], "ideal": "Yes"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Is 6 kilograms of feathers heavier than 5 kilograms of gold bars? Answer Yes or No"}], "ideal": "Yes"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Is 7 kilograms of eggshells heavier than 5 kilograms of bronze? Answer Yes or No"}], "ideal": "Yes"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Is 10 kilograms of lead heavier than 20 kilograms of floating seeds? Answer Yes or No"}], "ideal": "No"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Is 30 pounds of feathers heavier than 10 pounds of rebar? Answer Yes or No"}], "ideal": "Yes"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Is 2 pounds of pencil shavings heavier than 1 pound of iron chains? Answer Yes or No"}], "ideal": "Yes"}
+  ```
+</details>
+""")
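
The eval description in the added text says that examples of this kind can be generated programmatically, but the commit does not include a generator. The following is a minimal sketch of how one might work, not the author's method. The item pools, the weight ranges, and the names `quantity`, `make_sample`, and `make_dataset` are all hypothetical; only the record shape (`input` messages plus an `ideal` answer) and the question wording are taken from the JSONL samples in the diff.

```python
import json
import random

# Hypothetical item pools; a few entries reuse items that appear in the samples.
LIGHT_ITEMS = ["feathers", "eggshells", "confetti", "soap foam"]
HEAVY_ITEMS = ["gold bars", "rebar", "cast iron", "granite slabs"]
UNITS = [("pound", "pounds"), ("kilogram", "kilograms")]


def quantity(amount, unit, item):
    """Render e.g. '1 pound of rebar' or '3 pounds of feathers'."""
    singular, plural = unit
    return f"{amount} {singular if amount == 1 else plural} of {item}"


def make_sample(rng):
    """Build one record where the 'light' item always has the larger weight."""
    unit = rng.choice(UNITS)  # same unit on both sides, as in the samples
    small = rng.randint(1, 10)
    large = small + rng.randint(1, 20)  # strictly greater than `small`
    light = quantity(large, unit, rng.choice(LIGHT_ITEMS))
    heavy = quantity(small, unit, rng.choice(HEAVY_ITEMS))
    # Randomize the order so the ideal answer is not always "Yes".
    if rng.random() < 0.5:
        first, second, ideal = light, heavy, "Yes"
    else:
        first, second, ideal = heavy, light, "No"
    question = f"Is {first} heavier than {second}? Answer Yes or No"
    return {
        "input": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
        "ideal": ideal,
    }


def make_dataset(n, seed=0):
    """Return `n` records; a fixed seed keeps the output reproducible."""
    rng = random.Random(seed)
    return [make_sample(rng) for _ in range(n)]


if __name__ == "__main__":
    # One JSON object per line, i.e. JSONL.
    for sample in make_dataset(3):
        print(json.dumps(sample))
```

Keeping both quantities in the same unit means the ideal answer follows directly from comparing the two numbers, which is what makes the task trivially checkable for a human.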