Some questions about the model
- I checked and found negative scores in my rerank list; is the score not normalized to 0-1? If I want a threshold to filter out completely irrelevant chunks, which score would you suggest?
- There is no special token between the first Q and D; if the query is too long, will it interfere with the first doc_emb?
- Can I change the prompt if I only use it in Chinese?
Glad to hear from you. Regarding your concerns:
- The returned score is the cosine similarity (bounded in [-1, 1]) between the query and document embeddings. For the threshold, I would suggest you choose a positive value (you can usually start with > 0.2 by default) based on your own data; see the sketch below for a simple filter.
- I could not follow your question. What do you mean by "no special token in first Q and D"?
- The current prompt works multilingually; you don't need to change it.
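For example, here is a minimal sketch of such a filter. `rerank_scores` is a hypothetical stand-in for the actual reranker inference call (not part of the library's API); it is assumed to return one cosine score in [-1, 1] per document.

```python
# Minimal sketch of score-based filtering. `rerank_scores` is a hypothetical
# placeholder for the actual reranker inference call; it should return one
# cosine score in [-1, 1] per document for the given query.
from typing import List, Tuple

RELEVANCE_THRESHOLD = 0.2  # suggested starting point; tune it on your own data


def rerank_scores(query: str, docs: List[str]) -> List[float]:
    raise NotImplementedError("replace with your actual reranker inference call")


def filter_relevant(query: str, docs: List[str],
                    threshold: float = RELEVANCE_THRESHOLD) -> List[Tuple[str, float]]:
    scores = rerank_scores(query, docs)
    # keep only documents whose cosine score clears the threshold
    return [(doc, score) for doc, score in zip(docs, scores) if score >= threshold]
```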
About question 1, I found that for a given query the score of a document often differs depending on its position in the document list, especially in short-document scenarios.
query_a -> [doc_a, doc_b, doc_c]: score_a=0.15
query_a -> [doc_b, doc_c, doc_a]: score_a=0.28
It seems there is no position-bias correction in the loss function or training recipe. How, then, can we determine a threshold for an irrelevant-data filtering task?
In your case, do you think document a should be relevant or irrelevant?
I'm working on a query-tag matching project where tags are typically shorter than 5 words, and I'm trying to use jina-reranker as a relevance filter on mined query-tag pairs. The scores of positive samples here are quite low (around 0.1-0.2) and unstable with respect to position in the document list (differing in the first decimal place).
A workaround I am using now is to calculate the query-doc (tag) relevance score one pair at a time (sketched below), but that is not very efficient for massive-scale inference.
BTW, computing query-tag relevance scores with a dual-tower retrieval model (BGE-m3 here) does not work well either, since the score ranges of positive and negative samples overlap heavily in my dataset.
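The workaround looks roughly like this, reusing the hypothetical `rerank_scores` placeholder from the sketch above:

```python
# Rough sketch of the one-pair-at-a-time workaround, reusing the hypothetical
# `rerank_scores(query, docs)` placeholder from the earlier sketch. Scoring each
# tag in its own single-document list removes any dependence on list order,
# at the cost of one forward pass per query-tag pair.
from typing import List


def score_tags_independently(query: str, tags: List[str]) -> List[float]:
    return [rerank_scores(query, [tag])[0] for tag in tags]
```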
Very interesting use case. Intuitively, we could try the following two approaches for more stable output:
- Rephrase the "tag" to be longer and more descriptive. I believe this would help the model be more confident in its output.
- Repeat the inference two times with randomly shuffled input order and average the scores (a rough sketch follows below).
Yes, there is no free lunch; you need to consider the tradeoff between efficiency and accuracy. Again, your feedback gives us a lot of inspiration, and we will keep cooking up better models.
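A rough sketch of the second suggestion, again using the hypothetical `rerank_scores` placeholder from the first sketch (the default `n_runs=2` matches the "two times" above):

```python
# Shuffle-and-average sketch: run the listwise reranker several times with the
# documents in random order and average each document's scores across runs.
# Uses the hypothetical `rerank_scores(query, docs)` placeholder from above.
# Assumes document strings are unique, since they are used as dict keys.
import random
from collections import defaultdict
from typing import Dict, List


def averaged_scores(query: str, docs: List[str], n_runs: int = 2,
                    seed: int = 0) -> Dict[str, float]:
    rng = random.Random(seed)
    totals: Dict[str, float] = defaultdict(float)
    for _ in range(n_runs):
        shuffled = docs[:]
        rng.shuffle(shuffled)  # randomize input order for this run
        for doc, score in zip(shuffled, rerank_scores(query, shuffled)):
            totals[doc] += score
    return {doc: totals[doc] / n_runs for doc in docs}
```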
Hi, sorry, I didn't state it clearly with "No special token in first Q and D, if the query too long, it will interference first doc_emb?" What I mean is: in the v3 input, the query and Document1 are concatenated like this: xxxyyyy, with no special token in between such as xxx<|doc_emb|>yyy. If the query is much longer than Document1, e.g. xxxxxxxxxxxxxxxxyy, will the information of D1 in its embedding be extremely compressed by the query embedding?
Also, why use causal attention instead of bidirectional attention?
> Will the information of D1 in its embedding be extremely compressed by the query embedding?
Tbh, I'm not sure; it depends on our training data, since the model learns from the data. I would say the model handles this length bias to some extent.
Because most advanced LMs are causal-based, we don't want to destroy the well-learned capabilities of the pre-trained model.