---
tags:
- mteb
- sentence-transformers
- transformers
- sentence-similarity
language:
- en
- zh
license: apache-2.0
---

# Conan-Embedding-v2

## What's New?

- **Performance** Conan-Embedding-v2 has achieved SOTA performance on the MTEB leaderboards for both Chinese and English.
- **Cross-Lingual Retrieval between Chinese and English** Conan-Embedding-v2 supports cross-lingual retrieval between Chinese and English samples.
- **Longer Context Support** Conan-Embedding-v2 now supports a context length of 32,768 tokens.
- **Conan-1.4B Model Trained from Scratch** Both the vocabulary and the 1.4B-parameter language model were trained from scratch, yielding a pre-trained model and vocabulary tailored to the embedding scenario and delivering stronger performance. The Conan-1.4B base model will be open-sourced so that the community can train their own embedding models on top of it.

## Performance

Performance of Conan-Embedding-v2 on MTEB for English and Chinese (parenthesized numbers are the task counts per category):

![MTEB Result](./src/mteb_res_v2.png)

**English**

| Embedding Task / Metric | Class. Acc. (12) | Clust. V-Meas. (11) | PairClass. AP (3) | Rerank. MAP (4) | Retri. nDCG@10 (15) | STS Spear. (12) | Summ. Spear. (1) | Avg. (56) |
|:-----------------------:|:----------------:|:-------------------:|:-----------------:|:---------------:|:-------------------:|:---------------:|:----------------:|:---------:|
| bge-multilingual-gemma2 | 88.08 | 54.65 | 85.97 | 59.72 | 59.24 | 83.88 | 31.20 | 69.88 |
| e5-mistral-7b-instruct | 79.89 | 51.44 | 88.42 | 49.78 | 57.62 | 84.32 | **36.57** | 67.98 |
| gte-Qwen2-7B-instruct | 86.58 | 56.92 | 85.90 | **61.42** | 59.11 | 83.06 | 31.35 | 69.95 |
| stella-en-1.5B-v5 | 87.63 | 57.69 | 88.07 | 61.21 | 61.01 | 84.51 | 31.49 | 71.19 |
| bge-en-icl | 88.95 | 57.89 | 88.14 | 59.86 | 62.16 | 84.24 | 30.77 | 71.67 |
| NV-Embed-v2 | **90.37** | 58.46 | 88.67 | 60.65 | 62.65 | 84.31 | 30.70 | 72.31 |
| **Conan-embedding-v2** | 90.15 | **60.86** | **93.47** | 60.89 | **66.40** | **85.73** | 28.08 | **74.22** |

**Chinese**

| Embedding Task / Metric | Class. Acc. (9) | Clust. V-Meas. (4) | PairClass. AP (2) | Rerank. MAP (4) | Retri. nDCG@10 (8) | STS Spear. (8) | Avg. (35) |
|:-----------------------:|:---------------:|:------------------:|:-----------------:|:---------------:|:------------------:|:--------------:|:---------:|
| e5-mistral-7b-instruct | 72.96 | 52.30 | 72.19 | 61.86 | 61.75 | 48.34 | 59.92 |
| gte-Qwen2-1.5B-instruct | 72.53 | 54.61 | 86.91 | 68.21 | 71.86 | 60.05 | 67.12 |
| bge-multilingual-gemma2 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 | 67.64 |
| gte-Qwen2-7B-instruct | 75.77 | 66.06 | 87.48 | 68.92 | 75.71 | 65.20 | 71.62 |
| xiaobu-embedding-v2 | 76.53 | 65.17 | 91.87 | 72.58 | 76.50 | 64.18 | 72.36 |
| Conan-embedding-v1 | **76.77** | 66.33 | 91.66 | 72.76 | 76.67 | 63.67 | 72.50 |
| **Conan-embedding-v2** | 76.47 | **68.84** | **92.44** | **74.41** | **78.31** | **65.48** | **74.24** |
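The scores above follow the MTEB evaluation protocol. As a reference, the sketch below shows the typical way to score an embedding model on an MTEB task with the `mteb` package; the Hugging Face repo id, the `trust_remote_code` flag, and the task choice are illustrative assumptions, not the exact setup behind these tables.

```python
# Illustrative MTEB evaluation sketch (not the exact setup used for the
# tables above). The repo id and trust_remote_code are assumptions.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("TencentBAC/Conan-embedding-v2", trust_remote_code=True)

# Evaluate on a single STS task; pass more task names to cover a full suite.
evaluation = MTEB(tasks=["STSBenchmark"])
evaluation.run(model, output_folder="results/conan-embedding-v2")
```

Each task writes its scores as a JSON file under `output_folder`, which can then be aggregated per category.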
## Model Detail

### Model Structure

**Conan-Embedding-v2 structure:**

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 32768, 'do_lower_case': False}) with Transformer model: ConanEmbedModel,
  (1): Pooling({'word_embedding_dimension': 3584, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True}),
  (2): Dense({'in_features': 3584, 'out_features': 3584, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
```

**Key specifications of Conan-1.4B (Transformer):**

- Number of parameters (excluding the Dense layer): 1.48B
- Vocabulary size: 150,000
- Number of layers: 8
- Hidden dimension: 3584
- Number of attention heads (GQA): 32 for Q and 8 for KV
- Intermediate dimension of the FFN layer: 8192
- Maximum context window: 32,768 tokens

For more model details, please refer to `model/modeling_conan.py` and `config.json`, or stay tuned for the upcoming open-source release of the Conan-1.4B base model.

### Tokenizer

We trained the tokenizer on a large-scale multilingual dataset, building a standard BBPE (Byte-level Byte Pair Encoding) tokenizer with a vocabulary size of 150,000.

## Technical Report

We will release our technical report soon.

## Using Conan-Embedding-v2

Use `model/conan_api_client.py` to access our test API. A sample call is as follows:

```python
import os

# The client class lives in model/conan_api_client.py.
from conan_api_client import ConanClient

# Credentials are read from environment variables.
AK = os.getenv("CONAN_AK")
SK = os.getenv("CONAN_SK")

client = ConanClient(ak=AK, sk=SK, url="https://ai.om.qq.com/api/conan/v2")
res = client.embed("Hello!")
print(res)
```

This is a temporary calling solution; please contact us to obtain an access token. In the future, we will provide a high-performance, cost-effective, and reliable embedding service on Tencent Cloud.

---

**About**

Created by the Tencent BAC Group. All rights reserved.
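As a complement to the API client above, here is a minimal local-inference sketch of the Chinese-English cross-lingual retrieval highlighted in "What's New?". The repo id and the `trust_remote_code=True` flag (for the custom ConanEmbedModel) are assumptions; until the weights are published, use the API client instead.

```python
# Minimal local sketch (repo id and trust_remote_code are assumptions).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("TencentBAC/Conan-embedding-v2", trust_remote_code=True)

# Cross-lingual retrieval: an English query against Chinese passages.
query = "What is the capital of France?"
passages = ["法国的首都是巴黎。", "长城是中国古代的防御工程。"]

query_emb = model.encode(query)        # shape: (3584,)
passage_embs = model.encode(passages)  # shape: (2, 3584)

# Cosine similarity between the query and each passage; the first
# (Paris) passage should score highest.
print(util.cos_sim(query_emb, passage_embs))
```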