Ch3nYe committed
Commit 26d5ad1 · verified · 1 Parent(s): b78aa10

Update README.md

  1. README.md +324 -1
README.md CHANGED
@@ -12,4 +12,327 @@ tags:
  - binary
  - sentence-similarity
  - feature-extraction
---

# BinSeek: Cross-modal Retrieval Models for Stripped Binary Analysis

BinSeek is the first two-stage cross-modal retrieval framework designed specifically for stripped binary code analysis. It bridges the semantic gap between natural language queries and binary code (decompiled pseudocode), enabling effective retrieval of relevant binary functions from large-scale codebases.

BinSeek uses a two-stage retrieval strategy:

- **BinSeek-Embedding**: An embedding model trained to learn the semantic relevance between binary code and natural language descriptions, used for efficient first-stage candidate retrieval.
- **BinSeek-Reranker**: A reranking model that judges the relevance of each candidate to the description, augmented with the candidate's calling context, for more precise results.

<p align="center">
  <img src="https://raw.githubusercontent.com/XingTuLab/BinSeek/main/assets/binseek.png" alt="Overview of BinSeek" width="95%">
</p>

## Model Information

| Model | Domain | Parameters | Embedding Dim | Max Tokens |
|:---------------------------------------------------------------------------|:------:|:----------:|:-------------:|:----------:|
| [🤗 BinSeek-Embedding](https://huggingface.co/XingTuLab/BinSeek-Embedding) | Binary | 0.3B | 1024 | 4096 |
| [🤗 BinSeek-Reranker](https://huggingface.co/XingTuLab/BinSeek-Reranker) | Binary | 0.6B | / | 16384 |

On binary code retrieval, BinSeek-Embedding outperforms the far larger Qwen3-Embedding-8B, BinSeek-Reranker is on par with Qwen3-Reranker-8B, and the full two-stage pipeline scores highest on every metric (all values in %):

| Model | Model Size | Recall@1 | Recall@3 | MRR@3 |
|:-------------------------|:----------:|:--------:|:--------:|:------:|
| Qwen3-Embedding-8B | 8B | 57.50 | 65.00 | 60.75 |
| BinSeek-Embedding | 0.3B | 67.00 | 80.50 | 72.83 |
| Qwen3-Reranker-8B | 8B | 62.50 | 80.50 | 70.83 |
| BinSeek-Reranker | 0.6B | 61.50 | 83.00 | 70.50 |
| BinSeek (Emb + Rerank) | / | 76.75 | 84.50 | 80.25 |

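
For readers unfamiliar with the metrics in the table above, here is a minimal sketch of how Recall@k and MRR@k are commonly computed when each query has exactly one relevant function. The helper names (`recall_at_k`, `mrr_at_k`) and the toy rankings are illustrative assumptions; this is not the evaluation code used for BinSeek.

```python
def recall_at_k(rankings, gold, k):
    """Fraction of queries whose gold item appears in the top-k results."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)


def mrr_at_k(rankings, gold, k):
    """Mean reciprocal rank, counting only hits within the top-k results."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        top = ranked[:k]
        if g in top:
            total += 1.0 / (top.index(g) + 1)
    return total / len(gold)


# Toy data (hypothetical): ranked corpus indices returned for three queries
rankings = [[4, 2, 7], [1, 9, 3], [5, 6, 8]]
gold = [4, 3, 0]  # the single relevant corpus index for each query

print(f"Recall@1 = {recall_at_k(rankings, gold, 1):.4f}")
print(f"Recall@3 = {recall_at_k(rankings, gold, 3):.4f}")
print(f"MRR@3    = {mrr_at_k(rankings, gold, 3):.4f}")
```

Multiplying these fractions by 100 gives percentages on the same scale as the table.
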
## Model Usage

### Dependencies

```bash
pip install torch "sentence-transformers>=5.1.2" "transformers>=4.57.1"
```

Our models are compatible with the frameworks below. We recommend the **two-stage pipeline** (Embedding + Reranker) for the best retrieval performance.

### Sentence-Transformers

```python
import torch
from sentence_transformers import SentenceTransformer, CrossEncoder

# Query and corpus
query = "A function that implements XTEA encryption algorithm"

# Binary pseudocode corpus (decompiled by IDA Pro)
corpus = [
    '''char *__fastcall sub_100000924(__int64 a1, __int64 a2, unsigned int a3)
{
  unsigned int i; // [xsp+1Ch] [xbp-34h]
  char *v5; // [xsp+20h] [xbp-30h]
  unsigned int v6; // [xsp+2Ch] [xbp-24h]
  __int64 v9; // [xsp+40h] [xbp-10h] BYREF

  v6 = a3;
  v9 = 0;
  if ( a3 % 8 )
    v6 = a3 + 8 - a3 % 8;
  v5 = (char *)malloc(v6);
  __memset_chk(v5, 0, v6, -1);
  for ( i = 0; i < v6; i += 8 )
  {
    v9 = *(_QWORD *)(a1 + (int)i);
    sub_100000A68(32, (unsigned int *)&v9, a2);
    __memcpy_chk(&v5[i], &v9, 8, -1);
  }
  return v5;
}''',
    '''void *__fastcall sub_401000(size_t size){
  void *ptr = malloc(size);
  if (!ptr) { perror("malloc failed"); exit(1); }
  return ptr;
}''',
    '''int __fastcall sub_402000(char *s1, char *s2){
  return strcmp(s1, s2);
}''',
    # ... more functions in your corpus
]
# Context for each corpus function, in the same order as `corpus`: the pseudocode of
# selected callees, concatenated into a single string. See our paper for details.
corpus_context = [
    '''__int64 __fastcall sub_100000A68(__int64 result, unsigned int *a2, __int64 a3)
{
  unsigned int v3; // [xsp+8h] [xbp-28h]
  unsigned int v4; // [xsp+Ch] [xbp-24h]
  unsigned int v5; // [xsp+10h] [xbp-20h]
  unsigned int i; // [xsp+14h] [xbp-1Ch]

  v5 = *a2;
  v4 = a2[1];
  v3 = 0;
  for ( i = 0; i < (unsigned int)result; ++i )
  {
    v5 += (((v4 >> 5) ^ (16 * v4)) + v4) ^ (v3 + *(_DWORD *)(a3 + 4LL * (v3 & 3)));
    v3 -= 1640531527;
    v4 += (((v5 >> 5) ^ (16 * v5)) + v5) ^ (v3 + *(_DWORD *)(a3 + 4LL * ((v3 >> 11) & 3)));
  }
  *a2 = v5;
  a2[1] = v4;
  return result;
}''',
    "",
    "",
    # ... more context functions in your corpus
]

# Stage 1: embedding-based retrieval
embedding_model = SentenceTransformer(
    "XingTuLab/BinSeek-Embedding",
    model_kwargs={"dtype": torch.bfloat16},
    trust_remote_code=True
)

query_embeddings = embedding_model.encode([query])
corpus_embeddings = embedding_model.encode(corpus, batch_size=64)

similarity_matrix = embedding_model.similarity(query_embeddings, corpus_embeddings)
scores = similarity_matrix[0].cpu().float().numpy()
top_k = 10  # Number of candidates to retrieve
top_k_indices = scores.argsort()[::-1][:top_k]
candidates = [corpus[i] for i in top_k_indices]

print("=== Stage 1: Embedding Retrieval Results ===")
for i, idx in enumerate(top_k_indices):
    print(f"Rank {i+1}: Score={scores[idx]:.4f}, Corpus Index={idx}")

def build_candidates_with_context(candidates_ids):
    # Wrap each candidate and its calling context in the tags the reranker expects
    candidates_with_context = []
    for candidate_id in candidates_ids:
        data = f"<pseudocode>\n{corpus[candidate_id]}\n</pseudocode>\n<context>\n{corpus_context[candidate_id]}\n</context>"
        candidates_with_context.append(data)
    return candidates_with_context

candidates_with_context = build_candidates_with_context(top_k_indices)

# Stage 2: reranking for precise results
reranker = CrossEncoder(
    "XingTuLab/BinSeek-Reranker",
    model_kwargs={"dtype": torch.bfloat16},
    trust_remote_code=True
)

# rank() returns dicts with 'corpus_id' and 'score', sorted by descending score
reranked_results = reranker.rank(query, candidates_with_context)

print("\n=== Stage 2: Reranking Results ===")
print(f"Query: {query}")
for i, result in enumerate(reranked_results, start=1):
    # 'corpus_id' indexes into the candidate list, so map it back to the corpus
    original_idx = top_k_indices[result["corpus_id"]]
    print(f"Rank {i}: Score={result['score']:.4f}, Corpus Index={original_idx}")
```
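
The `corpus_context` entries in the example above are plain strings, one per corpus function. As a rough sketch of how such strings might be assembled, the snippet below assumes a hypothetical `callees` map from corpus index to callee pseudocode and a simple "keep the first few callees" rule. Both `build_context` and that rule are illustrative assumptions; the actual selection strategy is described in the paper and may differ. Only the `<pseudocode>`/`<context>` wrapper mirrors the example above.

```python
def build_context(callee_pseudocode, max_callees=2):
    """Concatenate up to `max_callees` callee bodies into one context string.

    Hypothetical rule: keep callees in call order and truncate the list.
    """
    return "\n\n".join(callee_pseudocode[:max_callees])


def format_candidate(pseudocode, context):
    """Wrap a function and its context in the tags used for reranker input."""
    return (
        f"<pseudocode>\n{pseudocode}\n</pseudocode>\n"
        f"<context>\n{context}\n</context>"
    )


# Toy inputs (hypothetical, heavily shortened pseudocode)
functions = [
    "int sub_1(int a) { return sub_2(a) + sub_3(a); }",
    "int sub_4(void) { return 0; }",
]
callees = {
    0: ["int sub_2(int a) { return a * 2; }", "int sub_3(int a) { return a + 1; }"],
    1: [],  # leaf function: no callees, so its context is the empty string
}

corpus_context = [build_context(callees[i]) for i in range(len(functions))]
candidates = [format_candidate(f, c) for f, c in zip(functions, corpus_context)]
print(candidates[0])
```

Keeping `corpus_context` index-aligned with the corpus is what lets the candidate builder look up context by corpus index.
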

### Transformers

```python
import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer, AutoModelForSequenceClassification

# Query and corpus
query = "A function that implements XTEA encryption algorithm"

# Binary pseudocode corpus (decompiled by IDA Pro)
corpus = [
    '''char *__fastcall sub_100000924(__int64 a1, __int64 a2, unsigned int a3)
{
  unsigned int i; // [xsp+1Ch] [xbp-34h]
  char *v5; // [xsp+20h] [xbp-30h]
  unsigned int v6; // [xsp+2Ch] [xbp-24h]
  __int64 v9; // [xsp+40h] [xbp-10h] BYREF

  v6 = a3;
  v9 = 0;
  if ( a3 % 8 )
    v6 = a3 + 8 - a3 % 8;
  v5 = (char *)malloc(v6);
  __memset_chk(v5, 0, v6, -1);
  for ( i = 0; i < v6; i += 8 )
  {
    v9 = *(_QWORD *)(a1 + (int)i);
    sub_100000A68(32, (unsigned int *)&v9, a2);
    __memcpy_chk(&v5[i], &v9, 8, -1);
  }
  return v5;
}''',
    '''void *__fastcall sub_401000(size_t size){
  void *ptr = malloc(size);
  if (!ptr) { perror("malloc failed"); exit(1); }
  return ptr;
}''',
    '''int __fastcall sub_402000(char *s1, char *s2){
  return strcmp(s1, s2);
}''',
    # ... more functions in your corpus
]
# Context for each corpus function, in the same order as `corpus`: the pseudocode of
# selected callees, concatenated into a single string. See our paper for details.
corpus_context = [
    '''__int64 __fastcall sub_100000A68(__int64 result, unsigned int *a2, __int64 a3)
{
  unsigned int v3; // [xsp+8h] [xbp-28h]
  unsigned int v4; // [xsp+Ch] [xbp-24h]
  unsigned int v5; // [xsp+10h] [xbp-20h]
  unsigned int i; // [xsp+14h] [xbp-1Ch]

  v5 = *a2;
  v4 = a2[1];
  v3 = 0;
  for ( i = 0; i < (unsigned int)result; ++i )
  {
    v5 += (((v4 >> 5) ^ (16 * v4)) + v4) ^ (v3 + *(_DWORD *)(a3 + 4LL * (v3 & 3)));
    v3 -= 1640531527;
    v4 += (((v5 >> 5) ^ (16 * v5)) + v5) ^ (v3 + *(_DWORD *)(a3 + 4LL * ((v3 >> 11) & 3)));
  }
  *a2 = v5;
  a2[1] = v4;
  return result;
}''',
    "",
    "",
    # ... more context functions in your corpus
]

# Stage 1: embedding-based retrieval
embed_tokenizer = AutoTokenizer.from_pretrained(
    "XingTuLab/BinSeek-Embedding",
    trust_remote_code=True
)
embed_model = AutoModel.from_pretrained(
    "XingTuLab/BinSeek-Embedding",
    dtype=torch.bfloat16,
    trust_remote_code=True
).eval().cuda()

def get_embeddings(texts, tokenizer, model, max_length=4096):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=max_length)
    inputs = {k: v.cuda() for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    # Last-token pooling: take the hidden state of each sequence's last valid token
    hidden = outputs.last_hidden_state
    attention_mask = inputs["attention_mask"]
    if bool(attention_mask[:, -1].all()):
        # Left padding (or no padding): the final position is always a real token
        embeddings = hidden[:, -1, :]
    else:
        # Right padding: locate the last valid token via the attention mask
        last_token_indices = attention_mask.sum(dim=1) - 1  # (batch_size,)
        batch_indices = torch.arange(hidden.size(0), device=hidden.device)
        embeddings = hidden[batch_indices, last_token_indices, :]
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings.cpu().float().numpy()

query_embedding = get_embeddings([query], embed_tokenizer, embed_model)
corpus_embeddings = get_embeddings(corpus, embed_tokenizer, embed_model)

# Embeddings are L2-normalized, so the dot product is the cosine similarity
scores = np.dot(query_embedding, corpus_embeddings.T)[0]
top_k = 10  # Number of candidates to retrieve
top_k_indices = np.argsort(scores)[::-1][:min(top_k, len(corpus))]
candidates = [corpus[i] for i in top_k_indices]

print("=== Stage 1: Embedding Retrieval Results ===")
for i, idx in enumerate(top_k_indices):
    print(f"Rank {i+1}: Score={scores[idx]:.4f}, Corpus Index={idx}")

def build_candidates_with_context(candidates_ids):
    # Wrap each candidate and its calling context in the tags the reranker expects
    candidates_with_context = []
    for candidate_id in candidates_ids:
        data = f"<pseudocode>\n{corpus[candidate_id]}\n</pseudocode>\n<context>\n{corpus_context[candidate_id]}\n</context>"
        candidates_with_context.append(data)
    return candidates_with_context

candidates_with_context = build_candidates_with_context(top_k_indices)

# Stage 2: reranking for precise results
rerank_tokenizer = AutoTokenizer.from_pretrained(
    "XingTuLab/BinSeek-Reranker",
    trust_remote_code=True
)
rerank_model = AutoModelForSequenceClassification.from_pretrained(
    "XingTuLab/BinSeek-Reranker",
    dtype=torch.bfloat16,
    trust_remote_code=True
).eval().cuda()

def rerank(query, candidates, tokenizer, model, max_length=16384):
    # Score each (query, candidate) pair jointly with the cross-encoder
    pairs = [[query, cand] for cand in candidates]
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors="pt", max_length=max_length)
    inputs = {k: v.cuda() for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze(-1)
    scores = torch.sigmoid(logits).float().cpu().numpy()  # Map logits to (0, 1)
    return scores

rerank_scores = rerank(query, candidates_with_context, rerank_tokenizer, rerank_model)
reranked_order = np.argsort(rerank_scores)[::-1]

print("\n=== Stage 2: Reranking Results ===")
print(f"Query: {query}")
for i, idx in enumerate(reranked_order):
    # idx indexes into the candidate list, so map it back to the corpus
    original_idx = top_k_indices[idx]
    print(f"Rank {i+1}: Score={rerank_scores[idx]:.4f}, Corpus Index={original_idx}")
```
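
The last-token pooling step in `get_embeddings` relies on knowing where each sequence's final real token sits, and that depends on which side the tokenizer pads. The pure-Python sketch below uses toy nested lists in place of tensors to show that the index `attention_mask.sum(dim=1) - 1` is only the last valid position under right padding. The helper names are illustrative, and no claim is made here about which padding side the BinSeek tokenizer uses.

```python
def last_token_index(mask_row):
    """Index of the last position whose attention-mask value is 1."""
    for pos in range(len(mask_row) - 1, -1, -1):
        if mask_row[pos] == 1:
            return pos
    raise ValueError("sequence has no valid tokens")


def last_token_pool(hidden, attention_mask):
    """Pick each sequence's hidden vector at its last valid token."""
    return [seq[last_token_index(m)] for seq, m in zip(hidden, attention_mask)]


# Toy hidden states: 2 sequences x 4 positions x 2 dims; padding rows are zeros
right_padded = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [0.0, 0.0]],  # 3 real tokens + 1 pad
    [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0], [7.0, 7.0]],  # 4 real tokens
]
right_mask = [[1, 1, 1, 0], [1, 1, 1, 1]]

left_padded = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],  # 1 pad + 3 real tokens
    [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0], [7.0, 7.0]],
]
left_mask = [[0, 1, 1, 1], [1, 1, 1, 1]]

pooled_right = last_token_pool(right_padded, right_mask)
pooled_left = last_token_pool(left_padded, left_mask)
print(pooled_right)  # mask-aware pooling selects the same vectors...
print(pooled_left)   # ...for both padding layouts

# The `sum(mask) - 1` shortcut picks the wrong position under left padding
naive_left = [seq[sum(m) - 1] for seq, m in zip(left_padded, left_mask)]
print(naive_left)
```

When in doubt, inspecting `tokenizer.padding_side` before pooling is a cheap sanity check.
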

## License

This project is released under the GPL-3.0 License and is intended for research purposes only. Please use it responsibly and in accordance with applicable laws and regulations.

## Citation

If you find our work helpful, please cite our paper:

```bibtex
@misc{chen2025BinSeek,
      title={Cross-modal Retrieval Models for Stripped Binary Analysis},
      author={Guoqiang Chen and Lingyun Ying and Ziyang Song and Daguang Liu and Qiang Wang and Zhiqi Wang and Li Hu and Shaoyin Cheng and Weiming Zhang and Nenghai Yu},
      year={2025},
      eprint={2512.10393},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2512.10393},
}
```