moshew committed · verified
Commit cde2343 · Parent: 96fc5f6

Add new SentenceTransformer model
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 384,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
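Of these pooling flags, only `pooling_mode_cls_token` is enabled, so the sentence embedding is simply the hidden state of the first (`[CLS]`) token. A minimal sketch of that operation in plain PyTorch — the tensor contents are made up, only the shapes follow the config above:

```python
import torch

# Hypothetical token embeddings from the transformer:
# (batch_size, seq_len, word_embedding_dimension) per the config above.
token_embeddings = torch.randn(2, 16, 384)

# CLS pooling: take the first token's hidden state as the sentence embedding.
sentence_embeddings = token_embeddings[:, 0]
print(sentence_embeddings.shape)  # torch.Size([2, 384])
```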
README.md ADDED
@@ -0,0 +1,393 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:200
+ - loss:CoSENTLoss
+ base_model: avsolatorio/GIST-small-Embedding-v0
+ widget:
+ - source_sentence: who is imf chief economist?
+   sentences:
+   - Metoprolol succinate is also known by the brand name Toprol XL. It is the extended-release
+     form of metoprolol. Metoprolol succinate is approved to treat high blood pressure,
+     chronic chest pain, and congestive heart failure.
+   - He wants to confirm if he is talking to Priya or Angel Priya (I.e., if he is really
+     talking to a girl or just a guy with fake profile) They are talking to you and
+     want to see how you look. I found it normal but would say, be careful about whom
+     do you share your picture with as they might misuse it. I hate this one.
+   - A Dependent Care Flexible Spending Account, or “FSA,” is a pre-tax benefit account
+     used to pay for dependent care services while you are at work. The money you contribute
+     to a Dependent Care FSA is not subject to payroll taxes, so you end up paying
+     less in taxes and taking home more of your paycheck.
+ - source_sentence: is it possible to get a false negative flu test?
+   sentences:
+   - The saying "a piece of cake" means something that's simple to accomplish. If a
+     school assignment is a piece of cake, it's so easy that you will barely have to
+     think about it. Other ways to say "it's a piece of cake" include no problem or
+     it's a breeze.
+   - This variation in ability to detect viruses can result in some people who are
+     infected with the flu having a negative rapid test result. (This situation is
+     called a false negative test result.)
+   - Unstable Wi-Fi is often caused by wireless congestion. Congestion problems are
+     common in apartment complexes or densely packed neighborhoods. The more people
+     using the internet, the greater the instability. When many people in the same
+     area are working from home, connectivity suffers.
+ - source_sentence: what are the requirements to become a health inspector?
+   sentences:
+   - You'll need an accredited health and safety qualification to become a health and
+     safety inspector. Many recruiters ask for a NEBOSH diploma as it's accredited
+     by the Institution of Occupational Health and Safety. This is a degree-level course
+     that you can study at a variety of institutions, as well as online.
+   - '[''Open a PDF file in Acrobat DC.'', ''Click on the “Export PDF” tool in the
+     right pane.'', ''Choose Microsoft Word as your export format, and then choose
+     “Word Document.”'', ''Click “Export.” If your PDF contains scanned text, the Acrobat
+     Word converter will run text recognition automatically.'']'
+   - '[''Remain calm. ... '', "Don''t take it personally. ... ", ''Use your best listening
+     skills. ... '', ''Actively sympathize. ... '', ''Apologize gracefully. ... '',
+     ''Find a solution. ... '', ''Take a few minutes on your own.'']'
+ - source_sentence: is toprol xl the same as metoprolol?
+   sentences:
+   - 'Carbs: 35 grams. Fiber: 11 grams. Folate: 88% of the DV. Copper: 50% of the DV.'
+   - A Dependent Care Flexible Spending Account, or “FSA,” is a pre-tax benefit account
+     used to pay for dependent care services while you are at work. The money you contribute
+     to a Dependent Care FSA is not subject to payroll taxes, so you end up paying
+     less in taxes and taking home more of your paycheck.
+   - Metoprolol succinate is also known by the brand name Toprol XL. It is the extended-release
+     form of metoprolol. Metoprolol succinate is approved to treat high blood pressure,
+     chronic chest pain, and congestive heart failure.
+ - source_sentence: how can i get copy of marriage license?
+   sentences:
+   - Probiotics can help with digestion Without probiotics, antibiotics can sometimes
+     wipe out the protective gut bacteria, which is no good for your digestive system.
+     Probiotics are thought to directly kill or inhibit the growth of harmful bacteria,
+     stopping them from producing toxic substances that can make you ill.
+   - Order in person You can order a certificate in person from Monday to Friday between
+     9am and 5pm. Please come to the register office at 45 Beavor Lane, Hammersmith,
+     London W6 9AR.
+   - Worms and ants are more related because spiders contain hair and ants do not.
+     Worms do not contain hair as well.
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ ---
+ 
+ # SentenceTransformer based on avsolatorio/GIST-small-Embedding-v0
+ 
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [avsolatorio/GIST-small-Embedding-v0](https://huggingface.co/avsolatorio/GIST-small-Embedding-v0). It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+ 
+ ## Model Details
+ 
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [avsolatorio/GIST-small-Embedding-v0](https://huggingface.co/avsolatorio/GIST-small-Embedding-v0) <!-- at revision 75e62fd210b9fde790430e0b2f040b0b00a021b1 -->
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 384 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+ 
+ ### Model Sources
+ 
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+ 
+ ### Full Model Architecture
+ 
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
+ 
+ ## Usage
+ 
+ ### Direct Usage (Sentence Transformers)
+ 
+ First install the Sentence Transformers library:
+ 
+ ```bash
+ pip install -U sentence-transformers
+ ```
+ 
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+ 
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("moshew/gist_small_ft_gooaq")
+ # Run inference
+ sentences = [
+     'how can i get copy of marriage license?',
+     'Order in person You can order a certificate in person from Monday to Friday between 9am and 5pm. Please come to the register office at 45 Beavor Lane, Hammersmith, London W6 9AR.',
+     'Worms and ants are more related because spiders contain hair and ants do not. Worms do not contain hair as well.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 384]
+ 
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
+ 
+ <!--
+ ### Direct Usage (Transformers)
+ 
+ <details><summary>Click to see the direct usage in Transformers</summary>
+ 
+ </details>
+ -->
+ 
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+ 
+ You can finetune this model on your own dataset.
+ 
+ <details><summary>Click to expand</summary>
+ 
+ </details>
+ -->
+ 
+ <!--
+ ### Out-of-Scope Use
+ 
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+ 
+ <!--
+ ## Bias, Risks and Limitations
+ 
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+ 
+ <!--
+ ### Recommendations
+ 
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+ 
+ ## Training Details
+ 
+ ### Training Dataset
+ 
+ #### Unnamed Dataset
+ 
+ * Size: 200 training samples
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
+ * Approximate statistics based on the first 200 samples:
+   |         | sentence1 | sentence2 | label |
+   |:--------|:----------|:----------|:------|
+   | type    | string    | string    | float |
+   | details | <ul><li>min: 8 tokens</li><li>mean: 11.8 tokens</li><li>max: 23 tokens</li></ul> | <ul><li>min: 18 tokens</li><li>mean: 61.8 tokens</li><li>max: 125 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.5</li><li>max: 1.0</li></ul> |
+ * Samples:
+   | sentence1 | sentence2 | label |
+   |:----------|:----------|:------|
+   | <code>how many days can i drive my car without mot?</code> | <code>If your car fails its MOT you can only continue to drive it if the previous year's MOT is still valid - which might occur if you submitted the car for its test two weeks early. You can still drive it away from the testing centre or garage if no 'dangerous' problems were identified during the MOT.</code> | <code>1.0</code> |
+   | <code>how many days can i drive my car without mot?</code> | <code>Low-FODMAP vegetables include: Bean sprouts, capsicum, carrot, choy sum, eggplant, kale, tomato, spinach and zucchini ( 7 , 8 ). Summary: Vegetables contain a diverse range of FODMAPs. However, many vegetables are naturally low in FODMAPs.</code> | <code>0.0</code> |
+   | <code>what are underlying shares of stock?</code> | <code>Underlying Shares means the shares of Common Stock issued and issuable upon conversion of the Preferred Stock, upon exercise of the Warrants and issued and issuable in lieu of the cash payment of dividends on the Preferred Stock in accordance with the terms of the Certificate of Designation.</code> | <code>1.0</code> |
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "pairwise_cos_sim"
+   }
+   ```
+ 
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+ 
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `num_train_epochs`: 1
+ - `warmup_ratio`: 0.1
+ - `seed`: 12
+ - `bf16`: True
+ - `half_precision_backend`: cpu_amp
+ - `dataloader_num_workers`: 4
+ 
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+ 
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: no
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 1
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 12
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: True
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: cpu_amp
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 4
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `tp_size`: 0
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`: 
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: proportional
+ 
+ </details>
+ 
+ ### Training Logs
+ | Epoch  | Step | Training Loss |
+ |:------:|:----:|:-------------:|
+ | 0.0769 | 1    | 0.2709        |
+ 
+ 
+ ### Framework Versions
+ - Python: 3.11.12
+ - Sentence Transformers: 4.1.0
+ - Transformers: 4.51.1
+ - PyTorch: 2.6.0+cu124
+ - Accelerate: 1.5.2
+ - Datasets: 3.5.0
+ - Tokenizers: 0.21.1
+ 
+ ## Citation
+ 
+ ### BibTeX
+ 
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+ 
+ #### CoSENTLoss
+ ```bibtex
+ @online{kexuefm-8847,
+     title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
+     author={Su Jianlin},
+     year={2022},
+     month={Jan},
+     url={https://kexue.fm/archives/8847},
+ }
+ ```
+ 
+ <!--
+ ## Glossary
+ 
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+ 
+ <!--
+ ## Model Card Authors
+ 
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+ 
+ <!--
+ ## Model Card Contact
+ 
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
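The card above trains with `CoSENTLoss` (scale 20.0, pairwise cosine similarity): for every pair of training examples where one label outranks the other, the loss penalizes the lower-labeled pair scoring a higher cosine, via log(1 + Σ exp(scale · (cos_lo − cos_hi))). A rough re-implementation of that idea — a sketch, not the library's exact code, which lives in `sentence_transformers.losses.CoSENTLoss`:

```python
import torch
import torch.nn.functional as F

def cosent_loss(emb1, emb2, labels, scale=20.0):
    """Sketch of the CoSENT objective over a batch of embedded pairs."""
    cos = F.cosine_similarity(emb1, emb2)          # (batch,) cosine per pair
    # diff[i, j] = scale * (cos_j - cos_i)
    diff = scale * (cos[None, :] - cos[:, None])
    # Only compare pairs where example i is labeled more similar than j.
    mask = labels[:, None] > labels[None, :]
    logits = diff[mask]
    zero = torch.zeros(1, device=logits.device, dtype=logits.dtype)
    # log(1 + sum(exp(logits))), computed stably via logsumexp.
    return torch.logsumexp(torch.cat([zero, logits]), dim=0)

# Toy usage: random 384-dim embeddings with 0/1 labels, as in the dataset above.
e1, e2 = torch.randn(8, 384), torch.randn(8, 384)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
print(cosent_loss(e1, e2, labels))
```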
config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 384,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 1536,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.1",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "4.1.0",
+     "transformers": "4.51.1",
+     "pytorch": "2.6.0+cu124"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
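`similarity_fn_name` is `cosine`, and because the pipeline ends in a `Normalize()` module the embeddings come out unit-length, so cosine similarity coincides with a plain dot product. A small sketch of checking that, assuming the repo id from the README:

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("moshew/gist_small_ft_gooaq")
emb = model.encode(["first sentence", "second sentence"], convert_to_tensor=True)

# model.similarity uses the configured "cosine"; with normalized embeddings
# it should match the raw dot product up to floating-point tolerance.
print(torch.allclose(model.similarity(emb, emb), emb @ emb.T, atol=1e-6))
```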
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:23d679ec64dddafb4555d84cd3972b310456c3f1d231b570f596c0f92979791c
+ size 133462128
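The entry above is a Git LFS pointer, not the weights themselves; the actual ~133 MB safetensors file is fetched on download. A sketch of verifying a downloaded copy against the recorded sha256, using `huggingface_hub`:

```python
import hashlib
from huggingface_hub import hf_hub_download

path = hf_hub_download("moshew/gist_small_ft_gooaq", "model.safetensors")
digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
# Should match the oid recorded in the LFS pointer above.
print(digest == "23d679ec64dddafb4555d84cd3972b310456c3f1d231b570f596c0f92979791c")
```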
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
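`modules.json` is what tells Sentence Transformers to chain the three modules at load time: the transformer (path `""`, the repo root), CLS pooling (`1_Pooling`), and L2 normalization (`2_Normalize`). A hedged sketch of assembling the same three-module pipeline by hand — note this builds from the base checkpoint, not the finetuned weights in this repo:

```python
from sentence_transformers import SentenceTransformer, models

transformer = models.Transformer("avsolatorio/GIST-small-Embedding-v0", max_seq_length=512)
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="cls")
normalize = models.Normalize()

model = SentenceTransformer(modules=[transformer, pooling, normalize])
print(model)  # mirrors the architecture printed in the README above
```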
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": true
+ }
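These two settings configure the Transformer module: inputs are lowercased, and anything beyond 512 tokens is truncated at encode time. The limit is also adjustable after loading; a small sketch, assuming the repo id from the README:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("moshew/gist_small_ft_gooaq")
print(model.max_seq_length)  # 512, per this config

# The limit can be lowered (e.g., to trade context for speed), but raising it
# past 512 would exceed the backbone's max_position_embeddings.
model.max_seq_length = 256
```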
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
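The `added_tokens_decoder` ids above (0, 100, 101, 102, 103) are the classic BERT special-token slots, and `do_lower_case: true` matches the uncased backbone. A quick sketch of what the tokenizer produces, assuming the repo id from the README:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("moshew/gist_small_ft_gooaq")
ids = tokenizer("Who is IMF chief economist?")["input_ids"]
tokens = tokenizer.convert_ids_to_tokens(ids)

# BertTokenizer lowercases the input and brackets it with special tokens:
print(tokens[0], tokens[-1])  # [CLS] [SEP]
print(ids[0], ids[-1])        # 101 102, exactly as mapped above
```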
vocab.txt ADDED
The diff for this file is too large to render. See raw diff