Solr as a Search Platform: Must-Know Scenarios from Practice

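Introduction:

[Note]


This page collects requirements that I ran into on the mailing list and while turning search into a general-purpose platform. To stay clear of the obligatory "professional conduct, non-compete" and similar restrictions, the representative requirements here were gathered and compiled from the public internet. Together they serve as a reference for Solr-based search "strategies" and for query solutions in search applications. For performance questions, however, and especially for advanced usage on large data volumes, always run load tests so that you know where you stand.


Most of the methods given here rely on Solr's interfaces and configuration. Deep customization is not covered in detail, and you will not find answers about it here; if you are interested, I am happy to discuss it privately.


This collection is only meant to start the discussion. The individual points have not been argued systematically or comprehensively, and the content comes mostly from the web. The overall direction and the main points are sound. If you find a detail that is wrong, please point it out. Thanks!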


Contents

1 Scoring issues in Solr 3.4.0
2 Configuration
3 Problems and requirements
4 Payload issues
5 Custom sort (score + custom value)
6 BoostQParserPlugin
7 how can I limit by score before sorting in a solr query
8 Score filter
9 Boost score for early matches
10 Solr: How can I get all documents ordered by score with a list of keywords?
11 Solr changes document's score when its random field value altered
12 Relevance Customization
13 Modify SOLR scoring
14 Change order before returning data
15 limiting the total number of documents matched

 

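Scoring issues in Solr 3.4.0

(7) Scoring factors can be tuned, but adding new factors or extending the scoring formula cannot be done directly through Solr configuration. You can, however, extend Lucene code or parameters (for example SpanQuery), write a new Query, and plug it into Solr; this takes somewhat more work. In addition, the community provides ranking patches such as BM25 and PageRank, which can be used directly once you have some understanding of Lucene.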

 

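(16) For ranking, there is no ready-made support yet for deduplication or for time-based dynamics. Deduplication here refers to the case where the top few results share exactly the same value in one field, or in several fields, so that the leading results appear clustered around some related fields. For some applications this is not ideal.

Time-based dynamics are not supported directly either; they can only be approximated indirectly by sorting on time.

This is arguably not something Lucene/Solr should have to address; it more likely stems from the particular needs of the application.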


Configuration


Global configuration
 schema.xml

Similarity

A (global) <similarity/> declaration can be used to specify a custom Similarity implementation that you want Solr to use when dealing with your index. A Similarity can be specified either by referring directly to the name of a class with a no-arg constructor...

 

 

<similarity class="org.apache.lucene.search.similarities.DefaultSimilarity"/>

...or by referencing a
SimilarityFactory implementation, which may take
optional init params....

<similarity class="solr.DFRSimilarityFactory">
  <str name="basicModel">P</str>
  <str name="afterEffect">L</str>
  <str name="normalization">H2</str>
  <float name="c">7</float>
</similarity>

Beginning with Solr 4.0, Similarity factories such as SchemaSimilarityFactory can also support specifying specific Similarity implementations on individual field types...

 

<types>
  <fieldType name="text_dfr" class="solr.TextField">
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
    <similarity class="solr.DFRSimilarityFactory">
      <str name="basicModel">I(F)</str>
      <str name="afterEffect">B</str>
      <str name="normalization">H2</str>
    </similarity>
  </fieldType>
  <fieldType name="text_ib" class="solr.TextField">
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
    <similarity class="solr.IBSimilarityFactory">
      <str name="distribution">SPL</str>
      <str name="lambda">DF</str>
      <str name="normalization">H2</str>
    </similarity>
  </fieldType>
  ...
</types>
<similarity class="solr.SchemaSimilarityFactory"/>

If no (global) <similarity/> is configured in the schema.xml file, an implicit instance of DefaultSimilarityFactory is used.

 


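Problems and requirements

Three possible orderings: by DefaultComputerValue alone; by CustomScore, then by DefaultComputerValue; or by the weighted sum CustomScore*fa + DefaultComputerValue*fb (fa = 0.8 and fb = 0.2 in the rows below).

Doc    CustomScore   DefaultComputerValue   CustomScore*fa + DefaultComputerValue*fb
Doc1   10            100                    10*0.8 + 100*0.2 = 28
Doc2   1             99                     1*0.8 + 99*0.2 = 20.6
Doc3   3             98                     3*0.8 + 98*0.2 = 22
Doc4   20            50                     20*0.8 + 50*0.2 = 26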
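The weighted-sum ordering can be sketched in plain Java, independent of Solr. The class and method names below (WeightedScoreDemo, combine, rank) are made up for illustration; only the formula CustomScore*fa + DefaultComputerValue*fb and the four sample rows come from the text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: ranks documents by CustomScore*fa + DefaultComputerValue*fb. */
final class WeightedScoreDemo {

    /** One table row: document id plus its two raw values. */
    static final class Doc {
        final String id;
        final double customScore;
        final double defaultValue;

        Doc(String id, double customScore, double defaultValue) {
            this.id = id;
            this.customScore = customScore;
            this.defaultValue = defaultValue;
        }
    }

    /** Weighted sum of the two values. */
    static double combine(Doc d, double fa, double fb) {
        return d.customScore * fa + d.defaultValue * fb;
    }

    /** Returns document ids ordered by combined score, highest first. */
    static List<String> rank(List<Doc> docs, double fa, double fb) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingDouble((Doc d) -> combine(d, fa, fb)).reversed());
        List<String> ids = new ArrayList<>();
        for (Doc d : sorted) {
            ids.add(d.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
                new Doc("Doc1", 10, 100),
                new Doc("Doc2", 1, 99),
                new Doc("Doc3", 3, 98),
                new Doc("Doc4", 20, 50));
        System.out.println(rank(docs, 0.8, 0.2));
    }
}
```

Changing fa and fb shifts how much the custom score dominates. With fa = 0 the ordering falls back to DefaultComputerValue alone.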

 

Solr 3.4.0 scoring code analysis

abstract class SimilarityFactory

Abstract method:
  public abstract Similarity getSimilarity();

 

Payload issues

http://wiki.apache.org/lucene-java/Payloads

Scoring payloads involves
overriding the Similarity.scorePayload() method. For example, if
one has implemented storing a Float payload, it could be used for
scoring in the following way:

  public float scorePayload(byte [] payload, int offset, int length) {
    assert length == 4;
    int accum = ((payload[0+offset]&0xff)) |
                ((payload[1+offset]&0xff)<<8) |
                ((payload[2+offset]&0xff)<<16)  |
                ((payload[3+offset]&0xff)<<24);
    return Float.intBitsToFloat(accum);
  }

Don't forget to activate
your Similarity implementation using IndexSearcher.setSimilarity().
Also, note that even then not all queries will actually make use of
your method. For example, you will need to use BoostingTermQuery
instead of TermQuery. QueryParser currently (Lucene 2.3.2) always
uses TermQuery and you will need to extend QueryParser and
overwrite getFieldQuery().

Note that this is just one possible way of scoring a payload. Payloads are application specific. For example payload Token Filters, see the payload package in the contrib/Analysis module.
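The decoder in the scorePayload() example reads four bytes with the least significant byte first, so whatever writes the payload at index time has to use the same byte order. A minimal sketch of a matching encoder and decoder in plain Java follows; FloatPayloadCodec is a made-up name and the code does not depend on Lucene.

```java
/** Sketch: little-endian float payload encoding that matches the decoder quoted above. */
final class FloatPayloadCodec {

    /** Encodes a float as 4 bytes, least significant byte first. */
    static byte[] encode(float value) {
        int bits = Float.floatToIntBits(value);
        return new byte[] {
            (byte) bits,
            (byte) (bits >>> 8),
            (byte) (bits >>> 16),
            (byte) (bits >>> 24)
        };
    }

    /** Same byte arithmetic as the scorePayload() example. */
    static float decode(byte[] payload, int offset, int length) {
        if (length != 4) {
            throw new IllegalArgumentException("expected a 4-byte payload");
        }
        int accum = (payload[offset] & 0xff)
                | ((payload[offset + 1] & 0xff) << 8)
                | ((payload[offset + 2] & 0xff) << 16)
                | ((payload[offset + 3] & 0xff) << 24);
        return Float.intBitsToFloat(accum);
    }

    public static void main(String[] args) {
        byte[] payload = encode(2.5f);
        System.out.println(decode(payload, 0, payload.length));
    }
}
```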

Custom sort (score + custom value)

http://grokbase.com/t/lucene/solr-user/08b25j6ked/custom-sort-score-custom-value

Hi,

I want to implement a custom sort in Solr based on a combination of relevance (Solr already gives me this as the score) and a custom value I've calculated previously for each document. I see two options:

1. Use a function query (I'm using a DisMaxRequestHandler).
2. Create a component that sets SortSpec with a sort that has a custom ComparatorSource (similar to QueryElevationComponent).

The first option has the problem:
While the relevance value changes for
every query, my custom value is constant for each doc. It implies
queries
with documents that have high relevance are less affected with my
custom
value. On the other hand, queries with low relevance are affected a
lot with my custom value. Can it be proportional with a function
query? (i.e. docs with low relevance are less affected by my custom
value).

 

The second option has the problem:
Solr score isn't normalized. I need it normalized in order to apply my custom value in the sortValue function in ScoreDocComparator.

What do you think? What's the best option in that case? Another option?

Thank you in advance,

George

BoostQParserPlugin

http://lucene.apache.org/solr/api-4_0_0-BETA/org/apache/solr/search/BoostQParserPlugin.html

org.apache.solr.search

Class
BoostQParserPlugin
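The Javadoc linked above gives the example {!boost b=log(popularity)}foo, in which the score of the main query foo is multiplied by the function query. This bears on the question of whether a custom value can act proportionally to relevance. The sketch below only illustrates the arithmetic in plain Java: BoostStyles and its methods are made-up names, and popularity is the Javadoc's example field, not a field from this page.

```java
/** Illustrative arithmetic: additive function query versus multiplicative boost. */
final class BoostStyles {

    /** Additive combination: relevance plus a per-document constant. */
    static double additive(double relevance, double customValue) {
        return relevance + customValue;
    }

    /** Multiplicative combination: relevance scaled by a per-document factor. */
    static double multiplicative(double relevance, double customValue) {
        return relevance * customValue;
    }

    /** Builds the local-params form {!boost b=<function>}<query>. */
    static String boostQuery(String boostFunction, String mainQuery) {
        return "{!boost b=" + boostFunction + "}" + mainQuery;
    }

    public static void main(String[] args) {
        // Two documents with custom values 2.0 and 1.0, compared at low and high relevance.
        for (double relevance : new double[] {0.5, 50.0}) {
            System.out.println("relevance=" + relevance
                    + " additive ratio=" + additive(relevance, 2.0) / additive(relevance, 1.0)
                    + " multiplicative ratio="
                    + multiplicative(relevance, 2.0) / multiplicative(relevance, 1.0));
        }
        System.out.println(boostQuery("log(popularity)", "foo"));
    }
}
```

With the multiplicative form the ratio between the two documents stays the same at every relevance level, while with the additive form the custom value dominates whenever relevance scores are small.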

 

http://stackoverflow.com/questions/3035831/solr-lucene-scorer

Scorers are part of Lucene Queries via the query's 'weight' method.

In short, the framework calls Query.weight(..).scorer(..). Have a look at

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Query.html

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Weight.html

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Scorer.html

To use your own Query class in Solr, you'll need to implement your own Solr QParserPlugin that uses your own QParser, which generates your previously implemented Lucene Query. You can then use it in Solr as specified here:

http://wiki.apache.org/solr/SolrPlugins#QParserPlugin

This part of the implementation should stay simple, as it is just some glue code.

Enjoy hacking
Solr!


answered Jun 14 '10 at
10:33

 

 

You can override the logic
solr scorer uses. Solr uses DefaultSimilarity class for scoring. 1)
make a class extending DefaultSimilarity. 2) override the functions
tf(), idf() etc according to your need.

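// Lucene 3.x package; in Lucene 4.x the class is
// org.apache.lucene.search.similarities.DefaultSimilarity and the method signatures differ
import org.apache.lucene.search.DefaultSimilarity;

public class CustomSimilarity extends DefaultSimilarity {

  public CustomSimilarity() {
    super();
  }

  // term frequency factor
  public float tf(int freq) {
    // your code
    return 1.0f;
  }

  // inverse document frequency factor
  public float idf(int docFreq, int numDocs) {
    // your code
    return 1.0f;
  }

}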

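3) After creating the class, compile it and build a jar. 4) Put the jar in the lib folder of the corresponding index or core. 5) Change the schema.xml of the corresponding index to declare it: <similarity class="….CustomSimilarity"/>, with the class's package name in place of the ellipsis.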

You can check out various
factors affecting score here

For your requirement you can
create buckets if your score is in specific range. Also read about
field boosting, document boosting etc. That might be helpful in
your case.

 

http://stackoverflow.com/questions/11748487/how-can-i-filter-solr-results-by-custom-score


How can I filter SOLR results by custom score

I'm using solr function
queries to generate my own custom score. I achieve this using
something along these lines:

q=_val_:"my_custom_function()"

This populates the score
field as expected, but it also includes documents that score 0. I
need a way to filter the results so that scores below zero are not
included.

I realize that I'm using
score in a non-standard way and that normally the score that
lucene/solr produce is not absolute. However, producing my own
score works really well for my needs.

I've tried using {!frange
l=0} but this causes the score for all documents to be
"1.0".

I suspect pseudo-fields
could be used, but since solr 4 is still alpha, I'm looking for a
way to do it using Solr 3.1.


how can I limit by score
before sorting in a solr query

I am searching "product
documents". In other words, my solr documents are product records.
I want to get say the top 50 matching products for a query. Then I
want to be able to sort the top 50 scoring documents by name or
price. I'm not seeing much on how to do this, since sorting by
score, then by name or price won't really help, since scores are
floats.

I wouldn't mind if I could
do something like map the scores to ranges (like a score of
8.0-8.99 would go in the 8 bucket score), then sort by range, then
by names, but since there is basically no normalization to scoring,
this would still make things a bit harder.

Tl;dr How do I exclude low scoring documents from the solr result set before sorting?

asked Dec 7 '10 at 22:21

 

3 Answers

You can use frange to
achieve this, as long as you don't want to sort on score (in which
case I guess you could just do the filtering on the client side).
Your query would be something along the lines of:

q={!frange l=5}query($qq)&qq=[awesome product]&sort=price asc

Set the l argument in the
q-frange-parameter to the lower bound you want to filter score on,
and replace the qq parameter with your user query.
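When these parameters are sent over HTTP they have to be URL-encoded, which is easy to get wrong with the braces and the $ in the frange expression. The following is a minimal sketch using only the JDK; FrangeQueryBuilder and its build method are made-up names, and the parameter layout simply mirrors the query shown above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Sketch: builds the URL-encoded q/qq/sort parameters for a score lower bound. */
final class FrangeQueryBuilder {

    /**
     * @param userQuery the user's query, passed as the qq parameter
     * @param minScore  lower bound on the score (the l argument of frange)
     * @param sort      sort clause, for example "price asc"
     */
    static String build(String userQuery, double minScore, String sort) {
        String q = "{!frange l=" + minScore + "}query($qq)";
        return "q=" + enc(q) + "&qq=" + enc(userQuery) + "&sort=" + enc(sort);
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build("awesome product", 5, "price asc"));
    }
}
```

The resulting string is appended to the select URL of the Solr core in question. As the answer says, a useful lower bound depends on the query, so minScore would normally come from a first request sorted by score.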

answered Dec 8 '10 at 10:23 by Karl Johansson

 

thanks, since I can get a
reasonable frange from the first time the results are displayed
sorted by score alone, this works great! – Zak Dec 9 '10 at
18:40

I don't think you can simply exclude low scoring documents from the solr result set before sorting, because the relevance score is only meaningful for a given combination of search query and resulting document list. I.e. scores are only meaningful within a given search and you cannot set some threshold for all searches.

If you were using Java (or
PHP) you could get the top 50 documents and then re-sort this list
in your programming language but I don't think you can do it with
just SOLR.
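The client-side approach described here can be sketched as follows: request the results sorted by score, keep the first n, then re-sort that subset by another field. The Product class and the method name are made up for illustration, and the sketch assumes the hits have already been read from the Solr response.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch: keep the n best-scoring hits, then order that subset by price. */
final class TopNResort {

    /** Minimal stand-in for a product document returned by the search. */
    static final class Product {
        final String name;
        final double price;
        final float score;

        Product(String name, double price, float score) {
            this.name = name;
            this.price = price;
            this.score = score;
        }
    }

    /** Keeps the n highest-scoring hits, then sorts those by price ascending. */
    static List<Product> topByScoreThenPrice(List<Product> hits, int n) {
        List<Product> byScore = new ArrayList<>(hits);
        byScore.sort(Comparator.comparingDouble((Product p) -> p.score).reversed());
        List<Product> top = new ArrayList<>(byScore.subList(0, Math.min(n, byScore.size())));
        top.sort(Comparator.comparingDouble((Product p) -> p.price));
        return top;
    }
}
```

For the question's case n would be 50. If the hits already arrive sorted by score from Solr, the first sort is redundant and only the truncation and the second sort are needed.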

Anyway, I would recommend
you don't go down this route of re-sorting the results from SOLR,
as it will simply confuse the user. People expect search results to
be like Google (and most other search engines), where results come
back in some form of TFIDF ranking.

Having said that, you could
use some other criteria to separate documents with the same
relevance scores by adding an index-time boost factor based on a
price range scale.

I'd suggest you use SOLR to
its strengths and use facets. Provide a price range facet on the
left (like Ebay, Amazon, et al.) and/or a product category facet,
etc. Also provide a "sort" widget to allow the results to be sorted
by product name, if the user wants it.

[EDIT] this question might
also be useful:

Digg-like search result
ranking with Lucene / Solr ?

As observed by Karl
Johansson, you could do the filtering on the client side: load the
first 50 rows of the response (sorted by score desc) and then
manipulate them in JS for example.

The jQuery DataTables plugin
works fantastically for that kind of thing: sorting, sorting on
multiple columns, dynamic filtering, etc. -- and with only 50 rows
it would be very fast too, so that users can "play" with the
sorting and filtering until they find what they want.

Score filter

http://lucene.472066.n3.nabble.com/score-filter-td493438.html

Hello, Is there a way to set a score filter? I tried
"+score:[1.2 TO *]" but it did not work.
Many thanks,

What's the motivation for wanting to do this? The reason I ask is that score is a relative thing determined by Lucene based on your index statistics. It is only meaningful for comparing the results of a specific query with a specific instance of the index. In other words, it isn't useful to filter on b/c there is no way of knowing what a good cutoff value would be. So, you won't be able to do score:[1.2 TO *] because score is not an actual Field.

That being said, you probably could implement a HitCollector at the Lucene level and somehow hook it into Solr to do what you want. Or, of course, just stop processing the results in your app after you see a score below a certain value. Naturally, this still means you have to retrieve the results.

-Grant

Re: score filter

In my case, for example searching for a book, some of the returned documents have high relevance (score > 3), but some documents with a low score (<0.01) are useless.

Without a "score filter", I have to go through each document to find out the number of documents I'm interested in (score > nnn). This causes some problems for pagination. For example, if I only need to display the first 10 records, I need to retrieve all 1000 documents to figure out the number of meaningful documents which have score > nnn.

Thx,

Kevin

Re: score filter

At what point do you draw
the line? 
0.01 is too low, but what about 0.5 or 0.3?  In fact, there may be
queries where 0.01 is relevant.

 

Relevance is a tricky thing
and putting in arbitrary cutoffs is usually not a good thing. An
alternative might be to instead look at the difference between
scores and see if the gap is larger than some delta, but even that
is subject to the vagaries of scoring.
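The gap-based alternative mentioned here could look roughly like the sketch below, applied in the application to scores that are already sorted in descending order. ScoreGapCutoff and the delta value are made up for illustration, and as the reply cautions, even this heuristic is subject to the vagaries of scoring.

```java
/** Sketch: cut a descending score list at the first gap larger than delta. */
final class ScoreGapCutoff {

    /**
     * @param scoresDescending scores sorted from highest to lowest
     * @param delta            largest tolerated drop between neighbouring scores
     * @return number of leading results to keep
     */
    static int cutoffIndex(float[] scoresDescending, float delta) {
        for (int i = 1; i < scoresDescending.length; i++) {
            if (scoresDescending[i - 1] - scoresDescending[i] > delta) {
                return i;
            }
        }
        return scoresDescending.length;
    }

    public static void main(String[] args) {
        float[] scores = {3.2f, 3.0f, 2.9f, 0.4f, 0.3f};
        System.out.println(cutoffIndex(scores, 1.0f));
    }
}
```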

 

What kind of relevance testing have you done so far to come up with those values?  See also

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-in-Search/

 

Re: score filter

Just did some research. It
seems that it's doable with additional code added to Solr but not
out of box. Thank you, Grant.

 


 

Re: score filter

Don't bother doing this. It
doesn't work.

This seems like a good idea,
something that would be useful for almost every Lucene
installation, but it isn't in Lucene because it does not work in
the real world.

 

A few problems:

* Some users want every
match and don't care how many pages of results they look
at.

 

* Some users are very bad at creating queries that match their information needs. Others are merely bad, not very bad. The good matches for their query are on top, but the good matches for their information need are on the third page.

* Misspellings can put the right match (partial match) at the bottom. I did this yesterday at my library site, typing "Katherine Kerr" instead of the correct "Katharine Kerr". Their search engine showed no matches (grrr), so I had to search again with "Kerr".

 

* Most users do not know how
to repair their queries, like I did with "Katherine Kerr", changing
it to "Kerr". Even if they do, you shouldn't make them. Just show
the weakly relevant results.

 

* Documents have errors,
just like queries. I find bad data on our site about once a month,
and we have professional editors. We still haven't fixed our entry
for "Betty Page" to read "Bettie Page".

 

* People may use non-title
words in the query, like searching for "batman" when they want "The
Dark Knight".

 

So, don't do this. If you
are forced to do it, make sure that you measure your search quality
before and after it is implemented, because it will get worse. Then
you can stop doing it.

wunder

 

Re: score filter

+1.  Of course it is doable, but that doesn't mean you should, which is what I was trying to say before (but I was typing on my iPod so it wasn't fast) and which Walter has now said.  It is entirely conceivable to me that someone could search for a very common word such that the scores of all relevant (and thus, "good") documents are below your predefined threshold.

 

At any rate, proceed at your
own peril. 
To implement it, look into the SearchComponent
functionality.

 

Re: score filter

Hello Grant,

I need to frame a query that
is a combination of two query parts and I use a 'function' query to
prepare the same. Something like:

q={!type=func q.op=AND df=text}product(query($uq,0.0),query($cq,0.1))

 

where $uq and $cq are two
queries.

 

Now, I want a search result returned only if I get a hit on $uq. So, I specify the default value of the $uq query as 0.0 in order for the final score to be zero in cases where $uq doesn't record a hit. Even though the scoring works as expected (i.e., documents that don't match $uq have a score of zero), all the documents are returned as search results. Is there a way to filter out search results that have a score of zero?

 

Thanks for your
help,

Debdoot

 

Re: score filter


 

a) you could wrap your query in {!frange} .. but that will make everything that does have a value > 0.0 get the same final score

b) you could use an fq={!frange} that refers back to your original $q

c) you could just use an fq that refers directly to your $uq since that's what you say you actually want to filter on in the first place..

uq=...
cq=...
q={!type=func q.op=AND df=text}product(query($uq,0.0),query($cq,0.1))
fq={!v=$uq}

Boost score for early matches


Solr - How to boost score for early matches?


How can I boost the score
for documents in which my query matches a particular field earlier.
For example, searching for "super man" should give "super man
returns" a higher score than "there is my super man". Is this
possible?

 

Uh, store the first few
words explicitly in another field, and boost matches on this field.
– aitchnyu Aug 22 at 9:45

 

The problem there is that
the size of the query can vary from say 3 characters to say 100
characters, and so determining how many words/chars to index
separately can be difficult. – techfoobar Aug 22 at 9:49

 

Secondly, suppose i index
the first 25 characters, and one record has "my super man blah.."
and another record has "super man returns blah.." - both will match
the query "super man" and both will be boosted when i boost this
secondary field. – techfoobar Aug 22 at 9:50

 

2 Answers

 

Thank you for the answer.
But i solved it today by using the approach i've outlined in my
answer. – techfoobar Aug 22 at 18:33

 

But this is not going to
work if the words do not occur at the very start. May want to check
out payloads as well where u can add index time suggestions as laid
down in the second option. – Jayendra Aug 22 at 18:35

 

Will check that out as well.
However, the current solution can be made to work to a large extent
by fine tuning the ps parameter to make it more lenient. I
currently use 2 (dist between 2 terms in the pf) and it seems to be
working quite well for my medium sized data set (1000s of records,
greatly varying in content). Will check out your point and let you
know if it helped. – techfoobar Aug 22 at 18:38

Accepted answer:

Solved it myself after reading a LOT about this online. What specifically helped me was a reply on nabble which goes like this (I used dismax, so explaining that here):

1. Create a separate field named, say, 'nameString' which stores the value as "_START_ " followed by the original field value

2. Change the search query to "_START_ " followed by the original search query

3. Add the new field nameString as one of the fields to look in in the query fields param (qf)

4. While searching, use the parameter pf (phrase field) as the new field nameString with a phrase slop of 1 or 2 (lower values would mean stricter searching)

Your final query params will be something like:

q=_START_ <user query>

defType=dismax

qf=name nameString

pf=nameString

ps=2
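The idea behind this answer is that a fixed marker token is prepended both to the indexed value and to the query, so a phrase match on the marker plus the query terms only succeeds (within the slop) when the terms occur at the start of the field. A minimal sketch of the two string transformations and the resulting parameter set follows; StartMarkerQuery is a made-up helper, while the field names name and nameString come from the answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: prepend a start marker at index time and at query time. */
final class StartMarkerQuery {

    static final String MARKER = "_START_";

    /** Value to store in the nameString field for a given name. */
    static String indexValue(String name) {
        return MARKER + " " + name;
    }

    /** dismax parameters for a user query, following the answer above. */
    static Map<String, String> dismaxParams(String userQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", MARKER + " " + userQuery);
        params.put("defType", "dismax");
        params.put("qf", "name nameString");
        params.put("pf", "nameString");
        params.put("ps", "2");
        return params;
    }

    public static void main(String[] args) {
        System.out.println(indexValue("super man returns"));
        System.out.println(dismaxParams("super man"));
    }
}
```

For this to work the analyzer of the nameString field has to keep the marker as a token, which is worth checking on the analysis page of the Solr admin interface.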

 


Solr: How can I get all documents ordered by score with a list of keywords?

I have a Solr 3.1 database containing Emails with two fields:

* datetime

* text

For the query I have two parameters:

* date of today

* keyword array("important thing", "important too", "not so important, but more than average")

Is it possible to create a query to

1. get ALL documents of this day AND

2. sort them by relevancy, ordering them so that the email which contains most of my keywords (important things) scores best?

The part with the date is not very complicated:

fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]

I know that you can boost the keywords this way:

q=text:"first keyword"^5 OR text:"second one"^2 OR text:"minus scoring"^0.5 OR text:"*"

But how do I only use the keywords to sort this list and get ALL entries instead of doing a real query and getting only a few entries back?

Thanks for help!

 

2 Answers

You need to specify your terms in the main query and then change your date query to be a filter query on these results by adding the following.

fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]

So you should have something like this:

q=<your terms>&fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]

Edit: A little more about filter queries (as suggested by rfeak).

From Solr Wiki - FilterQuery
Guidance - "Now, what is a filter query? It is simply a part of a
query that is factored out for special treatment. This is achieved
in Solr by specifying it using the fq (filter query) parameter
instead of the q (main query) parameter. The same result could be
achieved leaving that query part in the main query. The difference
will be in query efficiency. That's because the result of a filter
query is cached and then used to filter a primary query result
using set intersection."

These should be sorted by
relevancy score already, that is just the default behavior of Solr.
You can see the score by adding that field.

fl=*,score

If you use the Full Interface for Make A Query on the Admin Interface of your Solr installation at http:////admin/form.jsp you will see where you can specify the filter query, fields, and other options. You can check out the Solr Wiki for more details on the options and how they are used.

I hope that this helps
you.

 

+1 The filter query is an excellent suggestion. You may consider adding a bit about the advantage of using the filter query there. – rfeak May 27 '11 at 14:55

 

Thank you! The filter query is working as expected. But unfortunately I still don't know how to handle the keywords, because they filter the emails instead of only sorting them. – Daniel May 27 '11 at 16:06

Sorting by relevance is the default behavior in Solr/Lucene. If your results are unsatisfactory, try putting the keywords in quotes.

//Edit: Following the answer from Paige Cook, use something like this:

q="important thing"&fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]

//2nd update. Thinking about this answer again: quotes are not a good idea, because in this case you will only receive "important thing" mails, but no "important too".

The point is which keywords you are using. Searching for -- important thing -- results in the highest scores for "important thing" mails. But Lucene does not know how to score "important too" or "not so important, but more than average" in relation to your keywords. Another idea would be searching only for "important". But the field values "important thing" and "important too" give nearly the same score, because 50% of the searched keywords (in this case: "important") are part of the field value. So you probably have to change your keywords. It could work after changing "important too" into "also an important mail", to get the best ratio of the search word "important" to the field value, so that the shortest mail description gets the highest score.

 

Thanks for your answer! You point exactly to my problem, because the keywords filter the documents instead of only sorting them all and influencing the relevancy score. I do not know how to handle this. – Daniel May 27 '11 at 16:13
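One possible way around the filtering effect, not taken from the thread, is to OR the boosted keyword clauses with the match-all query *:*, so that every document of the day matches and the keywords only influence the score. The sketch below builds such a query string; KeywordBoostQuery is a made-up helper, the boosts are arbitrary, and the approach should be checked with debugQuery on real data before relying on it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Sketch: boosted phrase clauses OR'ed with a match-all clause. */
final class KeywordBoostQuery {

    /**
     * @param field         field to search, for example "text"
     * @param keywordBoosts phrase to boost; phrases must not contain double quotes
     */
    static String build(String field, Map<String, Double> keywordBoosts) {
        StringJoiner q = new StringJoiner(" OR ");
        for (Map.Entry<String, Double> e : keywordBoosts.entrySet()) {
            q.add(field + ":\"" + e.getKey() + "\"^" + e.getValue());
        }
        q.add("*:*"); // matches every document, so keywords only affect ranking
        return q.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("important thing", 5.0);
        boosts.put("important too", 2.0);
        System.out.println(build("text", boosts));
    }
}
```

The date restriction stays in the fq parameter as in the answers above, so it limits the result set without affecting the score.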



Solr changes document's score when its random field value altered

http://stackoverflow.com/questions/6254587/solr-changes-documents-score-when-its-random-field-value-altered


I need to navigate forth and
back in Solr results set ordered by score viewing documents one by
one. To visualise that, first a list of document titles is
presented to user, then he or she can click one of the title to see
more details and then needs to have an opportunity to move to the
next document in the original list without getting back and
clicking another title.

During viewing, documents get changed: their dynamic field is modified (or created if it does not exist yet) to mark that the document has already been viewed (used in another search).

The problem I face is that when the document is altered and re-indexed to keep those changes, sometimes (and not always, which is very disturbing) its place in the results set for the same query changes (in other words, its score changes, as that doesn't happen when browsing results sorted by one of the documents' fields). So, "Previous" / "Next" navigation doesn't work properly.

I'm not using any custom
weighting or boosters on fields for score calculation. Also, that
dynamic field changed during browsing doesn't participate in the
query used to get the record set browsed.

So, the questions are: can
the modification of the document's field not included in the query
change its relevance score? And if it can, then how can I control
that?

UPDATE

I did some tests and can add
the following:

1. Document changes its place in the result set even if no field is amended - just requesting the document and re-indexing it without any changes to its fields makes it take another place next time the same query over the same index is executed.

2. That happens even if the result set is sorted explicitly ("first_name DESC"), so score (which depends on the update date) is not involved. The document stays the same, the field its result set is sorted by is the same, yet its position changes.

Still have no idea how to
avoid that.

 

2 Answers

In Solr, if your field is "indexed", it will have an effect on the relevancy ranking ("stored" fields show up in search results but are not necessarily searchable). If the fields in question aren't marked as indexed then you are good to go. Note that "indexed" and "stored" are not necessarily the same, hence your confusion about result lists changing even though not all fields are shown (a field can be "indexed" and not "stored" as well).

In this case I think you
want your "viewed" field to be "stored" but not "indexed". If you
really want to control the query, you can use copyField to copy the
relevant results into a single searchable field. You can also boost
terms or documents so that certain fields are "less important" to
the search query.

If you want to see how the
relevancy rankings are calculated, you can add "debugQuery=on" to
the end of your Solr Query (see the Relevancy FAQ for more
info).

However, all that being
said, I would recommend you cache your search result query (at
least for the first page for your results), since you will always
have results changing (documents added, removed by other users,
etc). Your best bet is to design a UI that anticipates this, or at
least batches a user's query.

 

Thanks, for some reason I
was sure changes to fields not participating in the query don't
affect the calculated score. In my case it is necessary to have
this field indexed as there is another query where I need to filter
documents searching only viewed or only not viewed before. Caching
is also not suitable as users is supposed to navigate through the
whole result set, not only through the page (well, caching still
possible and to be honest bearable in terms of resources but just
not elegant). I'll try to boost the field being searched and tell
if that works. – Yuriy Jun 7 '11 at 7:45

 

Just noticed that it also
happens when the results are sorted by other field than score. How
that's possible? I thought if ordering is specified and score is
not in the clause explicitly (say, ordering is like "first_name
DESC"), it doesn't influence the ordering. However, it seems it
does. How can I get rid of that? – Yuriy Jun 8 '11 at
14:11

 

Okay, looks like boosting
works, but has no effect. If I boost the field I am searching in,
all the matches are boosted equally and still the recently
re-indexed documents get some delta in their relevance which makes
difference. There should be a way to exclude the date of last
update from the ordering completely but I can't find it yet... –
Yuriy Jun 8 '11 at 14:50

 

I've found the solution which doesn't eliminate the problem completely but makes it much less likely to happen.

So the problem happens when
the documents are sorted by some field and there is a number of
them with the same value in this field (e.g. result set is sorted
by first name, and there are 100 entries for "John").

This is when the indexed
time gets involved - apparently Solr uses it to sort the documents
when their main sorting fields are identical. To make this case
much less probable, you need to add more sorting fields, e.g.
"first_name desc" should become "first_name desc, last_name desc,
register_date asc".

Also, adding document's
unique id as the last sorting field should remove the problem
completely (the set of sorting fields will never be identical for
any two documents in the index).
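The tie-breaking advice above can be sketched as a small helper that appends the unique key to whatever sort clause the application builds. This is only a minimal sketch: the function name `with_tiebreaker` and the field name `id` are assumptions, so substitute the `uniqueKey` declared in your own schema.

```python
def with_tiebreaker(sort_clause, unique_key="id", direction="asc"):
    """Append the unique key as the last sort criterion unless it is already there."""
    # Split "a desc, b asc" into individual criteria, dropping empty pieces.
    parts = [p.strip() for p in sort_clause.split(",") if p.strip()]
    # The field name is the first token of each criterion.
    fields = [p.split()[0] for p in parts]
    if unique_key not in fields:
        parts.append(f"{unique_key} {direction}")
    return ", ".join(parts)


# Request parameters for a hypothetical query on first names.
params = {
    "q": "first_name:John",
    "sort": with_tiebreaker("first_name desc, last_name desc, register_date asc"),
}
print(params["sort"])
```

With the unique key last, two documents can never compare as equal, so re-indexing a document should no longer change its position among otherwise identical sort values.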


 

Relevance Customization

http://lucene.472066.n3.nabble.com/Relevance-Customization-td501310.html

Hi all.

I want to know if it's possible to customize the Solr relevance,
something like this:

1 - I create a static score
for each document and index it.

2 - I change the relevance
to Score(Solr) + Score(Static) where the solr score is equal to 30%
of the total score. Mixing the two scores into only one.

 

This is different from sorting by my static score first and then by
the Solr score, because I don't want to discard the Solr score, just
give it a little less importance.

Is there a way to do this?

Thanks

 

Re: Relevance Customization

It can be done with
something like q=yourQuery _val_:yourStaticScoreField

http://wiki.apache.org/solr/FunctionQuery#fieldvalue

 

But this adds the Solr score to the static score. I am not sure how
to make the Solr score count for 30%. Maybe something like this?

q=yourQuery^0.3 _val_:yourStaticScoreField^0.7
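A minimal sketch of how an application might assemble the query string suggested in the reply. The helper name `weighted_query` and the field name `static_score` are assumptions made for illustration. Note that the boosts are relative weights, not exact percentages, because Lucene scores are not normalized to a fixed range.

```python
def weighted_query(user_query, static_field, text_weight=0.3, static_weight=0.7):
    """Combine a text query and a stored static score with relative boosts."""
    # The user query is grouped so the boost applies to the whole clause,
    # and the static field is pulled in through the _val_ function-query hook.
    return f'({user_query})^{text_weight} _val_:"{static_field}"^{static_weight}'


q = weighted_query("lcd tv", "static_score")
print(q)
```

Whether the resulting mix feels like "30% text relevance" depends on the typical magnitude of both scores, so it is worth inspecting the output of `debugQuery=true` on real data before settling on the weights.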

Modify SOLR scoring

Hi everybody,

I'm using SOLR with a schema (for example) like this:

parutiondate: date, indexed, not stored

fulltext: stemmed, indexed, not stored

I know it's possible to order by one or more fields, but I want to
order by score and modify the "score" formula. I want to keep the SOLR
score but add a new parameter to the formula to boost the score of
the most recent documents.

What is the best way to do this?

Thanks.

Excuse my English.

 

RE: modify SOLR scoring

I believe you can use a
function query to do this:

http://wiki.apache.org/solr/FunctionQuery

if you embed the following
in your query, you should get a boost for more recent date
values:

_val_:"ord(dateField)"

Where "dateField" is the
field name of the date you want to use.
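Besides `ord`, Solr's function query documentation describes `recip(x,m,a,b)`, computed as a/(m*x+b), and commonly pairs it with `ms(NOW,dateField)` for date boosts, for example `recip(ms(NOW,dateField),3.16e-11,1,1)`. The sketch below only reproduces that arithmetic in Python to show how the boost decays with document age; the constants are the ones from that example, and the helper names are assumptions.

```python
MS_PER_YEAR = 365 * 24 * 3600 * 1000  # milliseconds in a non-leap year


def recip(x, m, a, b):
    """Reciprocal function as documented for Solr function queries: a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms, m=3.16e-11, a=1.0, b=1.0):
    """Boost for a document that is age_ms milliseconds old."""
    return recip(age_ms, m, a, b)


for years in (0, 1, 5):
    print(years, round(recency_boost(years * MS_PER_YEAR), 3))
```

With m set to roughly one over the number of milliseconds in a year, a brand-new document gets a boost of 1 and a one-year-old document about half of that, which is usually a gentler curve than the raw ordinal returned by `ord`.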

 

Re: modify SOLR scoring

http://lucene.472066.n3.nabble.com/modify-SOLR-scoring-td497348.html

I am interested in a topic very similar to yours. I want to modify
the field named "score" and the document boost without reindexing
all the fields, since that would take too much power.

Please let me know if you
find a solution to this.

Kindly


Change order before returning data

http://stackoverflow.com/questions/4965172/change-order-before-returning-data

 

Is there any way to change the order of results in SOLR? E.g. when I
query SOLR I will get the 1000 records with the highest score, then
within those 1000 records I will use my own function to change the
order again and keep just 10 of those records. I can get the 1000
records and process them in PHP or Java, but then I have to transfer
1000 records from the SOLR server to the web server, and I don't want
that. I just want to get 10 records after changing the order, and to
use paging. Does SOLR support this kind of custom function?

 

 Answers

If your function can be applied when the records are initially
indexed, you can do it there and add the result as a value on the
record. Then sort the result set by the precalculated value. If not,
I haven't worked with it directly, but this thread seems to have the
answer you're looking for.

 

Hi. My case is very special: I already have a pre-indexed score in
the database. Let me give one example. I have a shopping site; when I
search for "TV LCD 32 inch", I get many results from different brands
like LG, Toshiba ... and many results for LG appear consecutively. I
want to separate them, e.g. I don't want 3 results for LG sitting
next to each other. Currently I get the 1000 best records (based on
score) and change the order again using PHP. Now I want to move this
job to SOLR (I don't want to transfer too much data between SOLR and
the web server, I just need 10 records to display) –
user612433 Feb 11 '11 at 3:45

 

Yes, you can create a column with the info you want to be taken into
account in the score.

For example, for a "popularity" column, your query would be:

your query && _val_:"popularity"^0.7

0.7 being the boost factor in the final score. You can also filter
the result set to get fewer results, by passing a filter query as a
separate request parameter:

q=your query&fq=popularity:[10 TO *]
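Neither answer addresses the brand-interleaving request from the comment directly. As a point of comparison, here is a minimal client-side sketch of that reordering: a greedy pass over score-ordered results that never lets more than `max_run` consecutive items share a brand. The function name `diversify`, the `brand` key, and the sample data are assumptions for illustration; it expects `max_run` to be at least 1.

```python
def diversify(docs, key="brand", max_run=2, limit=10):
    """Greedily reorder ranked docs so at most max_run neighbours share docs[key]."""
    remaining = list(docs)  # still in original score order
    result = []
    while remaining and len(result) < limit:
        tail = [d[key] for d in result[-max_run:]]
        # A value is blocked when the last max_run picks all carry it.
        blocked = tail[0] if len(tail) == max_run and len(set(tail)) == 1 else None
        # Take the best-ranked doc that is not blocked; if every remaining
        # doc is blocked, fall back to the best-ranked one.
        pick = next((i for i, d in enumerate(remaining) if d[key] != blocked), 0)
        result.append(remaining.pop(pick))
    return result


ranked = [
    {"id": 1, "brand": "LG"},
    {"id": 2, "brand": "LG"},
    {"id": 3, "brand": "LG"},
    {"id": 4, "brand": "Toshiba"},
    {"id": 5, "brand": "LG"},
]
print([d["id"] for d in diversify(ranked)])
```

To keep the transfer small, the candidate set can be requested with only the id and brand fields (the `fl` parameter), and the full records fetched afterwards for just the ten documents that survive.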

 

 

limiting the total number of documents matched

http://search-lucene.com/m/4AHNF17wIJW1/

 

Re: limiting the total number of documents matched

Yonik Seeley 2010-07-17, 00:55

On Wed, Jul 14, 2010 at 5:46 PM, Paul <[EMAIL PROTECTED]> wrote:

I thought of another way to do it, but I still have one thing I
don't know how to do. I could do the search without sorting for the
50th page, then look at the relevancy score on the first item on that
page, then repeat the search, but add score > that relevancy as a
parameter. Is it possible to do a search with "score:[5 to *]"?
It didn't work in my first attempt.

 

frange could possibly help (range query on an arbitrary function).

http://www.lucidimagination.com/blog/tag/frange/

 

So perhaps something like

q={!frange l=0.85}query($qq)

qq=

 

where 0.85 is the lower
bound you want for scores and qq is the normal relevancy
query

-Yonik

http://www.lucidimagination.com
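A minimal sketch of how a client might assemble the request parameters for the `frange` suggestion above. The helper name `score_floor_params` and the sample query `title:solr` are assumptions; the thread leaves the value of `qq` open.

```python
from urllib.parse import urlencode


def score_floor_params(relevancy_query, min_score, rows=10):
    """Build parameters that keep only documents scoring at least min_score."""
    return {
        # frange filters on the value of query($qq), i.e. the relevancy score.
        "q": f"{{!frange l={min_score}}}query($qq)",
        # The actual relevancy query, dereferenced by $qq above.
        "qq": relevancy_query,
        "rows": rows,
    }


params = score_floor_params("title:solr", 0.85)
print(urlencode(params))
```

Because raw scores vary from query to query, a fixed lower bound such as 0.85 is only meaningful once it has been checked against the score distribution of representative queries.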

 

On Wed, Jul 14, 2010 at 5:34 PM, Paul <[EMAIL PROTECTED]> wrote:

 I was hoping for a way
to do this purely by configuration and making the correct GET
requests, but if there is a way to do it by creating a custom
Request Handler, I suppose I could plunge into that. Would that
yield the best results, and would that be particularly
difficult?

 

On Wed, Jul 14, 2010 at 4:37 PM, Nagelberg, Kallin wrote:
So you want to take the top
1000 sorted by score, then sort those by another field. It's a
strange case, and I can't think of a clean way to accomplish it.
You could do it in two queries, where the first is by score and you
only request your IDs to keep it snappy, then do a second query
against the IDs and sort by your other field. 1000 seems like a lot
for that approach, but who knows until you try it on your
data.

-Kallin Nagelberg
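The two-query approach described in this reply can be sketched as two parameter sets. This is a hedged illustration: the helper names, the `id` field, and the use of `*:*` with a filter query in the second step are assumptions, and the ids are assumed to need no query-syntax escaping.

```python
def top_ids_params(query, n=1000):
    """Step 1: fetch only the ids of the n most relevant documents."""
    return {"q": query, "fl": "id", "rows": n, "sort": "score desc"}


def resort_params(ids, sort, id_field="id", start=0, rows=10):
    """Step 2: restrict to those ids and sort them by another field."""
    id_filter = f"{id_field}:(" + " OR ".join(str(i) for i in ids) + ")"
    return {"q": "*:*", "fq": id_filter, "sort": sort, "start": start, "rows": rows}


first = top_ids_params("common term")
# Suppose step 1 returned these ids (hypothetical values).
second = resort_params([42, 7, 19], "title asc")
print(second["fq"])
```

A filter of 1000 OR-ed ids sits close to Solr's default `maxBooleanClauses` limit of 1024, so larger cut-offs would need that setting raised or the ids sent in batches.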

 

Subject: limiting the total number of documents matched

I'd like to limit the total
number of documents that are returned for a search, particularly
when the sort order is not based on relevancy. In other words, if
the user searches for a very common term, they might get tens of
thousands of hits, and if they sort by "title", then very high
relevancy documents will be interspersed with very low relevancy
documents. I'd like to set a limit to the 1000 most relevant
documents, then sort those by title. Is there a way to do
this?

 

I guess I could always
retrieve the top 1000 documents and sort them in the client, but
that seems particularly inefficient. I can't find any other way to
do this, though.
