   - binary
   - sentence-similarity
   - feature-extraction
+---
+# BinSeek: Cross-modal Retrieval Models for Stripped Binary Analysis
+BinSeek is the first two-stage cross-modal retrieval framework specifically designed for stripped binary code analysis. It bridges the semantic gap between natural language queries and binary code (decompiled pseudocode), enabling effective retrieval of relevant binary functions from large-scale codebases.
+BinSeek uses a two-stage retrieval strategy:
+- **BinSeek-Embedding**: An embedding model trained to learn the semantic relevance between binary code and natural language descriptions, used for efficient first-stage candidate retrieval.
+- **BinSeek-Reranker**: A reranking model that carefully judges the relevance of candidate code to the description with calling context augmentation for more precise results.
+<p align="center">
+  <img src="https://raw.githubusercontent.com/XingTuLab/BinSeek/main/assets/binseek.png" alt="Overview of BinSeek" width="95%">
+</p>
+## Model Information
+| Model                                                              | Domain | Parameters | Embedding Dim | Max Tokens |
+|:-------------------------------------------------------------------|:------:|:----------:|:-------------:|:----------:|
+| [🤗 BinSeek-Embedding](https://huggingface.co/XingTuLab/BinSeek-Embedding) | Binary |    0.3B    |     1024      |    4096    |
+| [🤗 BinSeek-Reranker](https://huggingface.co/XingTuLab/BinSeek-Reranker)   | Binary |    0.6B    |       /       |   16384    |
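+In the binary code retrieval evaluation below, BinSeek-Embedding (0.3B) outperforms Qwen3-Embedding-8B on all three metrics, and the full two-stage pipeline obtains the highest score in every column: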
+| Model                    | Model Size | Recall@1 | Recall@3 | MRR@3  |
+|:-------------------------|:----------:|:--------:|:--------:|:------:|
+| Qwen3-Embedding-8B       | 8B         |  57.50   |  65.00   | 60.75  |
+| BinSeek-Embedding        | 0.3B       |  67.00   |  80.50   | 72.83  |
+| Qwen3-Reranker-8B        | 8B         |  62.50   |  80.50   | 70.83  |
+| BinSeek-Reranker         | 0.6B       |  61.50   |  83.00   | 70.50  |
+| BinSeek (Emb + Rerank)   | /          |  76.75   |  84.50   | 80.25  |
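
For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute Recall@k and MRR@k. It assumes each query has exactly one ground-truth function; the evaluation protocol in the paper may differ, and the values in the table are presumably these fractions scaled by 100. The function names and the toy data are illustrative only.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[Sequence[int]], gold_ids: Sequence[int], k: int) -> float:
    """Fraction of queries whose gold item appears among the top-k results."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


def mrr_at_k(ranked_ids: Sequence[Sequence[int]], gold_ids: Sequence[int], k: int) -> float:
    """Mean reciprocal rank of the gold item, counting 0 when it is outside the top-k."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        top = list(ranked[:k])
        if gold in top:
            total += 1.0 / (top.index(gold) + 1)
    return total / len(gold_ids)


# Toy data (assumption): three queries, each with a ranked list of corpus indices.
ranked = [[4, 1, 7], [2, 9, 3], [5, 6, 8]]
gold = [4, 3, 0]  # the third query's gold item is never retrieved

print(f"Recall@1 = {recall_at_k(ranked, gold, 1):.4f}")
print(f"Recall@3 = {recall_at_k(ranked, gold, 3):.4f}")
print(f"MRR@3    = {mrr_at_k(ranked, gold, 3):.4f}")
```

In this toy case the first query hits at rank 1, the second at rank 3, and the third misses, so Recall@1 is 1/3, Recall@3 is 2/3, and MRR@3 is (1 + 1/3 + 0) / 3.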
+## Model Usage
+### Dependencies
+```bash
+pip install torch "sentence-transformers>=5.1.2" "transformers>=4.57.1"
+```
+The models can be used with either Sentence-Transformers or Transformers. We recommend the **two-stage pipeline** (Embedding + Reranker) for the best retrieval performance.
+### Sentence-Transformers
+```python
+import torch
+from sentence_transformers import SentenceTransformer, CrossEncoder
+# Query and Corpus
+query = "A function that implements XTEA encryption algorithm"
+# Binary pseudocode corpus (decompiled by IDA Pro)
+corpus = [
+'''char *__fastcall sub_100000924(__int64 a1, __int64 a2, unsigned int a3)
+{
+  unsigned int i; // [xsp+1Ch] [xbp-34h]
+  char *v5; // [xsp+20h] [xbp-30h]
+  unsigned int v6; // [xsp+2Ch] [xbp-24h]
+  __int64 v9; // [xsp+40h] [xbp-10h] BYREF
+  v6 = a3;
+  v9 = 0;
+  if ( a3 % 8 )
+    v6 = a3 + 8 - a3 % 8;
+  v5 = (char *)malloc(v6);
+  __memset_chk(v5, 0, v6, -1);
+  for ( i = 0; i < v6; i += 8 )
+  {
+    v9 = *(_QWORD *)(a1 + (int)i);
+    sub_100000A68(32, (unsigned int *)&v9, a2);
+    __memcpy_chk(&v5[i], &v9, 8, -1);
+  }
+  return v5;
+}''',
+'''void *__fastcall sub_401000(size_t size){
+    void *ptr = malloc(size);
+    if (!ptr) { perror("malloc failed"); exit(1); }
+    return ptr;
+}''',
+'''int __fastcall sub_402000(char *s1, char *s2){
+    return strcmp(s1, s2);
+}''',
+# ... more functions in your corpus
+]
+# Context for each corpus function, index-aligned with `corpus`: the selected callees, concatenated into a single string (see our paper for details)
+corpus_context = [
+'''__int64 __fastcall sub_100000A68(__int64 result, unsigned int *a2, __int64 a3)
+{
+  unsigned int v3; // [xsp+8h] [xbp-28h]
+  unsigned int v4; // [xsp+Ch] [xbp-24h]
+  unsigned int v5; // [xsp+10h] [xbp-20h]
+  unsigned int i; // [xsp+14h] [xbp-1Ch]
+  v5 = *a2;
+  v4 = a2[1];
+  v3 = 0;
+  for ( i = 0; i < (unsigned int)result; ++i )
+  {
+    v5 += (((v4 >> 5) ^ (16 * v4)) + v4) ^ (v3 + *(_DWORD *)(a3 + 4LL * (v3 & 3)));
+    v3 -= 1640531527;
+    v4 += (((v5 >> 5) ^ (16 * v5)) + v5) ^ (v3 + *(_DWORD *)(a3 + 4LL * ((v3 >> 11) & 3)));
+  }
+  *a2 = v5;
+  a2[1] = v4;
+  return result;
+}''',
+"",
+"",
+# ... more context functions in your corpus
+]
+# Embedding-based Retrieval
+embedding_model = SentenceTransformer(
+    "XingTuLab/BinSeek-Embedding",
+    model_kwargs={"dtype": torch.bfloat16},
+    trust_remote_code=True
+)
+query_embeddings = embedding_model.encode([query])
+corpus_embeddings = embedding_model.encode(corpus, batch_size=64)
+similarity_matrix = embedding_model.similarity(query_embeddings, corpus_embeddings)
+scores = similarity_matrix[0].cpu().float().numpy()
+top_k = 10  # Number of candidates to retrieve
+top_k_indices = scores.argsort()[::-1][:top_k]
+candidates = [corpus[i] for i in top_k_indices]
+print("=== Stage 1: Embedding Retrieval Results ===")
+for i, idx in enumerate(top_k_indices):
+    print(f"Rank {i+1}: Score={scores[idx]:.4f}, Corpus Index={idx}")
+def build_candidates_with_context(candidates_ids):
+    candidates_with_context = []
+    for candidate_id in candidates_ids:
+        data = f"<pseudocode>\n{corpus[candidate_id]}\n</pseudocode>\n<context>\n{corpus_context[candidate_id]}\n</context>"
+        candidates_with_context.append(data)
+    return candidates_with_context
+candidates_with_context = build_candidates_with_context(top_k_indices)
+# Reranking for Precise Results
+reranker = CrossEncoder(
+    "XingTuLab/BinSeek-Reranker",
+    model_kwargs={"dtype": torch.bfloat16},
+    trust_remote_code=True
+)
+reranked_results = reranker.rank(query, candidates_with_context)
+print("\n=== Stage 2: Reranking Results ===")
+print(f"Query: {query}")
+for rank_pos, result in enumerate(reranked_results, start=1):
+    original_idx = top_k_indices[result['corpus_id']]
+    print(f"Rank {rank_pos}: Score={result['score']:.4f}, Corpus Index={original_idx}")
+```
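
The example above keeps `corpus` and `corpus_context` as two parallel lists, so an off-by-one between them would silently pair a function with the wrong context. A minimal sketch of one alternative is to store both in a single record. The `FunctionRecord` class, its fields, and the toy pseudocode are hypothetical; only the `<pseudocode>`/`<context>` template is taken from `build_candidates_with_context` above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionRecord:
    """One corpus entry: a decompiled function plus its (possibly empty) context."""
    name: str
    pseudocode: str
    context: str = ""

    def reranker_input(self) -> str:
        # Same tag template as build_candidates_with_context in the example above.
        return (
            f"<pseudocode>\n{self.pseudocode}\n</pseudocode>\n"
            f"<context>\n{self.context}\n</context>"
        )


# Toy records (assumption): short stand-ins for real decompiler output.
records = [
    FunctionRecord("sub_401000", "void *sub_401000(size_t size) { return malloc(size); }"),
    FunctionRecord(
        "sub_402000",
        "int sub_402000(char *s1, char *s2) { return sub_402100(s1, s2); }",
        context="int sub_402100(char *a, char *b) { return strcmp(a, b); }",
    ),
]

# Stage 1 encodes only the pseudocode; stage 2 receives pseudocode plus context.
corpus = [r.pseudocode for r in records]
top_k_indices = [1, 0]  # stand-in for the indices returned by embedding retrieval
candidates_with_context = [records[i].reranker_input() for i in top_k_indices]

print(candidates_with_context[0])
```

The resulting `candidates_with_context` list can be passed to the reranker in the same way as in the example above.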
+### Transformers
+```python
+import torch
+import numpy as np
+from transformers import AutoModel, AutoTokenizer, AutoModelForSequenceClassification
+# Query and Corpus
+query = "A function that implements XTEA encryption algorithm"
+# Binary pseudocode corpus (decompiled by IDA Pro)
+corpus = [
+'''char *__fastcall sub_100000924(__int64 a1, __int64 a2, unsigned int a3)
+{
+  unsigned int i; // [xsp+1Ch] [xbp-34h]
+  char *v5; // [xsp+20h] [xbp-30h]
+  unsigned int v6; // [xsp+2Ch] [xbp-24h]
+  __int64 v9; // [xsp+40h] [xbp-10h] BYREF
+  v6 = a3;
+  v9 = 0;
+  if ( a3 % 8 )
+    v6 = a3 + 8 - a3 % 8;
+  v5 = (char *)malloc(v6);
+  __memset_chk(v5, 0, v6, -1);
+  for ( i = 0; i < v6; i += 8 )
+  {
+    v9 = *(_QWORD *)(a1 + (int)i);
+    sub_100000A68(32, (unsigned int *)&v9, a2);
+    __memcpy_chk(&v5[i], &v9, 8, -1);
+  }
+  return v5;
+}''',
+'''void *__fastcall sub_401000(size_t size){
+    void *ptr = malloc(size);
+    if (!ptr) { perror("malloc failed"); exit(1); }
+    return ptr;
+}''',
+'''int __fastcall sub_402000(char *s1, char *s2){
+    return strcmp(s1, s2);
+}''',
+# ... more functions in your corpus
+]
+# Context for each corpus function, index-aligned with `corpus`: the selected callees, concatenated into a single string (see our paper for details)
+corpus_context = [
+'''__int64 __fastcall sub_100000A68(__int64 result, unsigned int *a2, __int64 a3)
+{
+  unsigned int v3; // [xsp+8h] [xbp-28h]
+  unsigned int v4; // [xsp+Ch] [xbp-24h]
+  unsigned int v5; // [xsp+10h] [xbp-20h]
+  unsigned int i; // [xsp+14h] [xbp-1Ch]
+  v5 = *a2;
+  v4 = a2[1];
+  v3 = 0;
+  for ( i = 0; i < (unsigned int)result; ++i )
+  {
+    v5 += (((v4 >> 5) ^ (16 * v4)) + v4) ^ (v3 + *(_DWORD *)(a3 + 4LL * (v3 & 3)));
+    v3 -= 1640531527;
+    v4 += (((v5 >> 5) ^ (16 * v5)) + v5) ^ (v3 + *(_DWORD *)(a3 + 4LL * ((v3 >> 11) & 3)));
+  }
+  *a2 = v5;
+  a2[1] = v4;
+  return result;
+}''',
+"",
+"",
+# ... more context functions in your corpus
+]
+# Embedding-based Retrieval
+embed_tokenizer = AutoTokenizer.from_pretrained(
+    "XingTuLab/BinSeek-Embedding",
+    trust_remote_code=True
+)
+embed_model = AutoModel.from_pretrained(
+    "XingTuLab/BinSeek-Embedding",
+    dtype=torch.bfloat16,
+    trust_remote_code=True
+).eval().cuda()
+def get_embeddings(texts, tokenizer, model, max_length=4096):
+    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=max_length)
+    inputs = {k: v.cuda() for k, v in inputs.items()}
+    with torch.no_grad():
+        outputs = model(**inputs)
+        # Last-token pooling: use attention_mask to find the last valid token position
+        # (this index computation assumes right-padded inputs)
+        attention_mask = inputs["attention_mask"]
+        last_token_indices = attention_mask.sum(dim=1) - 1  # (batch_size,)
+        batch_indices = torch.arange(outputs.last_hidden_state.size(0), device=outputs.last_hidden_state.device)
+        embeddings = outputs.last_hidden_state[batch_indices, last_token_indices, :]
+        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
+    return embeddings.cpu().float().numpy()
+query_embedding = get_embeddings([query], embed_tokenizer, embed_model)
+corpus_embeddings = get_embeddings(corpus, embed_tokenizer, embed_model)
+scores = np.dot(query_embedding, corpus_embeddings.T)[0]
+top_k = 10
+top_k_indices = np.argsort(scores)[::-1][:min(top_k, len(corpus))]
+candidates = [corpus[i] for i in top_k_indices]
+print("=== Stage 1: Embedding Retrieval Results ===")
+for i, idx in enumerate(top_k_indices):
+    print(f"Rank {i+1}: Score={scores[idx]:.4f}, Corpus Index={idx}")
+def build_candidates_with_context(candidates_ids):
+    candidates_with_context = []
+    for candidate_id in candidates_ids:
+        data = f"<pseudocode>\n{corpus[candidate_id]}\n</pseudocode>\n<context>\n{corpus_context[candidate_id]}\n</context>"
+        candidates_with_context.append(data)
+    return candidates_with_context
+candidates_with_context = build_candidates_with_context(top_k_indices)
+# Reranking for Precise Results
+rerank_tokenizer = AutoTokenizer.from_pretrained(
+    "XingTuLab/BinSeek-Reranker",
+    trust_remote_code=True
+)
+rerank_model = AutoModelForSequenceClassification.from_pretrained(
+    "XingTuLab/BinSeek-Reranker",
+    dtype=torch.bfloat16,
+    trust_remote_code=True
+).eval().cuda()
+def rerank(query, candidates, tokenizer, model, max_length=16384):
+    pairs = [[query, cand] for cand in candidates]
+    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors="pt", max_length=max_length)
+    inputs = {k: v.cuda() for k, v in inputs.items()}
+    with torch.no_grad():
+        logits = model(**inputs).logits.squeeze(-1)
+        scores = torch.sigmoid(logits).float().cpu().numpy()  # Apply sigmoid activation
+    return scores
+rerank_scores = rerank(query, candidates_with_context, rerank_tokenizer, rerank_model)
+reranked_order = np.argsort(rerank_scores)[::-1]
+print("\n=== Stage 2: Reranking Results ===")
+print(f"Query: {query}")
+for i, idx in enumerate(reranked_order):
+    original_idx = top_k_indices[idx]
+    print(f"Rank {i+1}: Score={rerank_scores[idx]:.4f}, Corpus Index={original_idx}")
+```
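
The `get_embeddings` helper above locates the last token as `attention_mask.sum(dim=1) - 1`, which is only correct when padding is on the right. Which side the BinSeek tokenizer pads on is not stated here, so the NumPy sketch below shows an index computation that works for either side. The function names and toy masks are illustrative and not part of any library.

```python
import numpy as np


def last_token_indices(attention_mask: np.ndarray) -> np.ndarray:
    """Index of the last non-padding token per row, for left or right padding."""
    seq_len = attention_mask.shape[1]
    # Reverse each row and find the first 1; map that back to the original position.
    return seq_len - 1 - np.argmax(attention_mask[:, ::-1], axis=1)


def last_token_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the last valid hidden state per sequence and L2-normalize it.

    hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    idx = last_token_indices(attention_mask)
    pooled = hidden[np.arange(hidden.shape[0]), idx]
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)


right_padded = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])
left_padded = np.array([[0, 1, 1, 1], [0, 0, 1, 1]])

print(last_token_indices(right_padded))  # same result as mask.sum(axis=1) - 1
print(last_token_indices(left_padded))   # mask.sum(axis=1) - 1 would land on padding here

# Toy hidden states (assumption): random values standing in for model outputs.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))
embeddings = last_token_pool(hidden, left_padded)
print(embeddings.shape)
```

With right padding the indices agree with the helper above; with left padding the last valid token is always the final position, which the `sum - 1` shortcut would miss.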
+## License
+This project is released under the GPL-3.0 License and is intended for research purposes only. Please use it responsibly and in accordance with applicable laws and regulations.
+## Citation
+If you find our work helpful, please consider citing it.
+```bibtex
+@misc{chen2025BinSeek,
+      title={Cross-modal Retrieval Models for Stripped Binary Analysis},
+      author={Guoqiang Chen and Lingyun Ying and Ziyang Song and Daguang Liu and Qiang Wang and Zhiqi Wang and Li Hu and Shaoyin Cheng and Weiming Zhang and Nenghai Yu},
+      year={2025},
+      eprint={2512.10393},
+      archivePrefix={arXiv},
+      primaryClass={cs.SE},
+      url={https://arxiv.org/abs/2512.10393},
+}
+```