---
license: apache-2.0
language:
- en
pipeline_tag: question-answering
---

# litert-community/Gecko-110m-en

This model provides a few variants of the embedding model published in the [Gecko paper](https://arxiv.org/abs/2403.20327) that are ready for deployment on Android or iOS using the [LiteRT stack](https://ai.google.dev/edge/litert) or the [Google AI Edge RAG SDK](https://ai.google.dev/edge/mediapipe/solutions/genai/rag).

## Use the models

### Android

* Try out the Gecko embedding model with the [Google AI Edge RAG SDK](https://ai.google.dev/edge/mediapipe/solutions/genai/rag). You can find the SDK on [GitHub](https://github.com/google-ai-edge/ai-edge-apis/tree/main/local_agents/rag) or follow our [Android guide](https://ai.google.dev/edge/mediapipe/solutions/genai/rag/android) to install it directly from Maven. We have also published a [sample app](https://github.com/google-ai-edge/ai-edge-apis/tree/main/examples/rag).
* Use the SentencePiece model as the tokenizer for the Gecko embedding model; the sketch below shows how the two fit together.

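As a quick reference, here is a minimal Kotlin sketch of wiring this model into the RAG SDK's embedder interface, following the patterns in the Android guide linked above. The device paths and the GPU flag are assumptions for illustration; adjust them to wherever you push the model files.

```kotlin
import com.google.ai.edge.localagents.rag.models.Embedder
import com.google.ai.edge.localagents.rag.models.GeckoEmbeddingModel
import java.util.Optional

// Assumed device paths: push the .tflite embedding model and the
// SentencePiece tokenizer to the device first (e.g. via adb push).
const val GECKO_MODEL_PATH = "/data/local/tmp/gecko.tflite"
const val TOKENIZER_MODEL_PATH = "/data/local/tmp/sentencepiece.model"
const val USE_GPU_FOR_EMBEDDINGS = true

// GeckoEmbeddingModel implements the SDK's Embedder<String> interface: it
// tokenizes input text with the SentencePiece model, then runs the LiteRT
// embedding model on CPU or GPU.
val embedder: Embedder<String> = GeckoEmbeddingModel(
    GECKO_MODEL_PATH,
    Optional.of(TOKENIZER_MODEL_PATH),
    USE_GPU_FOR_EMBEDDINGS,
)
```

The resulting `embedder` can then be plugged into the SDK's retrieval components, as demonstrated in the sample app.
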
## Performance

### Android

All benchmark stats below were measured on a Samsung S23 Ultra.

<table border="1">
  <tr>
   <th>Quantization</th>
   <th>Backend</th>
   <th>Max sequence length</th>
   <th>Init time (ms)</th>
   <th>Inference time (ms)</th>
   <th>Memory (RSS in MB)</th>
   <th>Model size (MB)</th>
  </tr>
  <tr>
   <td><p style="text-align: right">dynamic_int8</p></td>
   <td><p style="text-align: right">GPU</p></td>
   <td><p style="text-align: right">256</p></td>
   <td><p style="text-align: right">1306.06</p></td>
   <td><p style="text-align: right">76.2</p></td>
   <td><p style="text-align: right">604.5</p></td>
   <td><p style="text-align: right">114</p></td>
  </tr>
  <tr>
   <td><p style="text-align: right">dynamic_int8</p></td>
   <td><p style="text-align: right">GPU</p></td>
   <td><p style="text-align: right">512</p></td>
   <td><p style="text-align: right">1363.38</p></td>
   <td><p style="text-align: right">173.2</p></td>
   <td><p style="text-align: right">604.6</p></td>
   <td><p style="text-align: right">120</p></td>
  </tr>
  <tr>
   <td><p style="text-align: right">dynamic_int8</p></td>
   <td><p style="text-align: right">GPU</p></td>
   <td><p style="text-align: right">1024</p></td>
   <td><p style="text-align: right">1419.87</p></td>
   <td><p style="text-align: right">397</p></td>
   <td><p style="text-align: right">871.1</p></td>
   <td><p style="text-align: right">145</p></td>
  </tr>
  <tr>
   <td><p style="text-align: right">dynamic_int8</p></td>
   <td><p style="text-align: right">CPU</p></td>
   <td><p style="text-align: right">256</p></td>
   <td><p style="text-align: right">11.03</p></td>
   <td><p style="text-align: right">147.6</p></td>
   <td><p style="text-align: right">126.3</p></td>
   <td><p style="text-align: right">114</p></td>
  </tr>
  <tr>
   <td><p style="text-align: right">dynamic_int8</p></td>
   <td><p style="text-align: right">CPU</p></td>
   <td><p style="text-align: right">512</p></td>
   <td><p style="text-align: right">30.04</p></td>
   <td><p style="text-align: right">353.1</p></td>
   <td><p style="text-align: right">225.6</p></td>
   <td><p style="text-align: right">120</p></td>
  </tr>
  <tr>
   <td><p style="text-align: right">dynamic_int8</p></td>
   <td><p style="text-align: right">CPU</p></td>
   <td><p style="text-align: right">1024</p></td>
   <td><p style="text-align: right">79.17</p></td>
   <td><p style="text-align: right">954</p></td>
   <td><p style="text-align: right">619.5</p></td>
   <td><p style="text-align: right">145</p></td>
  </tr>
</table>

*   Model size: measured as the size of the .tflite flatbuffer (the serialization format for LiteRT models).
*   Memory: an indicator of peak RAM usage.
*   Inference on CPU is accelerated via the LiteRT [XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads; a configuration sketch follows this list.
*   Inference on GPU is accelerated via the LiteRT GPU delegate.
*   Benchmarks were run with the XNNPACK cache enabled.
*   dynamic_int8: a quantized model with int8 weights and float activations.
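
If you want to reproduce a similar setup outside the RAG SDK, here is a minimal Kotlin sketch using the TensorFlow Lite Java `Interpreter` API (still supported by LiteRT). The model path is an assumption and tensor I/O handling is omitted; this only shows the CPU (4-thread, XNNPACK) and GPU delegate configurations described above.

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import java.io.File

// Assumed device path for the downloaded .tflite model.
val modelFile = File("/data/local/tmp/gecko.tflite")

// CPU: XNNPACK is applied by default in recent releases; 4 threads
// matches the CPU benchmark configuration above.
val cpuInterpreter = Interpreter(
    modelFile,
    Interpreter.Options().setNumThreads(4)
)

// GPU: route inference through the GPU delegate (requires the
// tensorflow-lite-gpu dependency).
val gpuInterpreter = Interpreter(
    modelFile,
    Interpreter.Options().addDelegate(GpuDelegate())
)
```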