A Comprehensive Look at TrieField: Concepts, Understanding, and Usage

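To supplement the understanding of TrieField, the documents linked below cover the topic systematically and thoroughly. See the links; no further commentary is added here.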


http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/NumericRangeQuery.html

http://blog.csdn.net/fancyerii/article/details/7256379

http://hadoopcn.iteye.com/blog/1550402

http://rdc.taobao.com/team/jm/archives/1699

 


 
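The following is excerpted from the NumericRangeQuery Javadoc (Lucene 3.5.0, first link above).

NumericRangeQuery extends MultiTermQuery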

A Query that matches numeric values within a specified range. To use
this, you must first index the numeric values using NumericField
(expert: NumericTokenStream). If your terms are instead textual, you
should use TermRangeQuery. NumericRangeFilter is the filter equivalent
of this query.

You create a new NumericRangeQuery with the static factory
methods, e.g.:

  Query q = NumericRangeQuery.newFloatRange("weight", 0.03f, 0.10f, true, true);

matches all documents whose float valued "weight" field ranges
from 0.03 to 0.10, inclusive.

The performance of NumericRangeQuery is much better than the
corresponding TermRangeQuery because the number of terms that
must be searched is usually far fewer, thanks to trie indexing,
described below.

You can optionally specify a precisionStep when creating this query.
This is necessary if you've changed this configuration from its
default (4) during indexing. Lower values consume more disk space but
speed up searching. Suitable values are between 1 and 8. A
good starting point to test is 4, which is the default value
for all Numeric* classes. See below for details.

This query defaults to
MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT for 32 bit
(int/float) ranges with precisionStep ≤8 and 64 bit (long/double)
ranges with precisionStep ≤6. Otherwise it uses
MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE as the number of
terms is likely to be high. With precision steps of ≤4, this query
can be run with one of the BooleanQuery rewrite methods without
changing BooleanQuery's default max clause count.

How it works

See the publication about panFMP, where this algorithm was described
(referred to as TrieRangeQuery):

Schindler, U, Diepenbroek, M, 2008.
Generic XML-based Framework for Metadata Portals.
Computers & Geosciences 34 (12), 1947-1955. doi:10.1016/j.cageo.2008.02.023

A quote from this paper: Because Apache Lucene is a
full-text search engine and not a conventional database, it cannot
handle numerical ranges (e.g., field value is inside user defined
bounds, even dates are numerical values). We have developed an
extension to Apache Lucene that stores the numerical values in a
special string-encoded format with variable precision (all
numerical values like doubles, longs, floats, and ints are
converted to lexicographic sortable string representations and
stored with different precisions; for a more detailed description
of how the values are stored, see NumericUtils). A range is then
divided recursively
into multiple intervals for searching: The center of the range is
searched only with the lowest possible precision in the
trie, while the boundaries are matched more exactly. This
reduces the number of terms dramatically.
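
The "lexicographic sortable" conversion mentioned in the quote can be illustrated for doubles. The sketch below is a minimal, self-contained approximation of the idea and is not Lucene's actual NumericUtils code; the class name SortableBits and its methods are hypothetical, and the final string encoding step is left out. It maps a double to a long whose signed order matches the numeric order, then drops low bits to obtain a reduced-precision prefix.

```java
final class SortableBits {

    // Maps a double to a long so that comparing the longs (signed) gives the
    // same result as comparing the doubles (NaN handling is ignored here).
    static long doubleToSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // Negative doubles have the sign bit set, so they are negative longs.
        // Flipping the other 63 bits reverses their order, which makes a
        // larger magnitude sort lower, as it should for negative numbers.
        if (bits < 0) {
            bits ^= 0x7fffffffffffffffL;
        }
        return bits;
    }

    // Inverse of doubleToSortableLong (the XOR leaves the sign bit untouched,
    // so the same test applies).
    static double sortableLongToDouble(long bits) {
        if (bits < 0) {
            bits ^= 0x7fffffffffffffffL;
        }
        return Double.longBitsToDouble(bits);
    }

    // Drops the lowest `shift` bits. Values that share the result fall into
    // the same bucket at this reduced precision; the arithmetic shift keeps
    // the ordering monotonic for signed values.
    static long prefix(long sortable, int shift) {
        return sortable >> shift;
    }
}
```

With such a mapping, a numeric range on doubles becomes an ordinary range on longs, and each reduced-precision prefix stands for a whole block of neighbouring values.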

For the variant that stores long values in 8 different
precisions (each reduced by 8 bits) that uses a lowest precision of
1 byte, the index contains only a maximum of 256 distinct values in
the lowest precision. Overall, a range could consist of a
theoretical maximum of 7*255*2 + 255 = 3825 distinct
terms (when there is a term for every distinct value of an
8-byte-number in the index and the range covers almost all of them;
a maximum of 255 distinct values is used because it would always be
possible to reduce the full 256 values to one term with degraded
precision). In practice, we have seen up to 300 terms in most cases
(index with 500,000 metadata records and a uniform value
distribution).
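
The recursive splitting described above can be sketched as follows. This is a simplified illustration restricted to non-negative values of at most 32 bits; the class TrieRangeSplit and its SubRange type are hypothetical names chosen for this example, and the real implementation (which also covers signed 64-bit values and the term encoding) lives in Lucene's NumericUtils.

```java
import java.util.ArrayList;
import java.util.List;

final class TrieRangeSplit {

    /** All values whose bits above `shift` form a prefix in [minPrefix, maxPrefix]. */
    static final class SubRange {
        final int shift;
        final long minPrefix;
        final long maxPrefix;

        SubRange(int shift, long minPrefix, long maxPrefix) {
            this.shift = shift;
            this.minPrefix = minPrefix;
            this.maxPrefix = maxPrefix;
        }

        /** Upper bound on the number of distinct index terms this sub-range can match. */
        long termCount() {
            return maxPrefix - minPrefix + 1;
        }
    }

    /** Splits [min, max] (inclusive) into sub-ranges of decreasing precision. */
    static List<SubRange> split(int valSize, int precisionStep, long min, long max) {
        if (valSize < 1 || valSize > 32 || precisionStep < 1 || precisionStep > valSize) {
            throw new IllegalArgumentException("unsupported valSize/precisionStep");
        }
        if (min < 0 || max >= (1L << valSize)) {
            throw new IllegalArgumentException("bounds outside the value space");
        }
        List<SubRange> out = new ArrayList<>();
        if (min > max) {
            return out; // empty range
        }
        for (int shift = 0; ; shift += precisionStep) {
            long mask = ((1L << precisionStep) - 1L) << shift; // bits of the current level
            long diff = 1L << (shift + precisionStep);         // size of one block on the next level
            boolean hasLower = (min & mask) != 0L;   // min does not start a next-level block
            boolean hasUpper = (max & mask) != mask; // max does not end a next-level block
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            if (shift + precisionStep >= valSize || nextMin > nextMax) {
                // Lowest precision reached, or no whole block left in the middle.
                out.add(new SubRange(shift, min >>> shift, max >>> shift));
                break;
            }
            // Boundaries are matched at the current (more exact) precision ...
            if (hasLower) {
                out.add(new SubRange(shift, min >>> shift, (min | mask) >>> shift));
            }
            if (hasUpper) {
                out.add(new SubRange(shift, (max & ~mask) >>> shift, max >>> shift));
            }
            // ... and the center moves on to the next, lower precision.
            min = nextMin;
            max = nextMax;
        }
        return out;
    }
}
```

For example, split(8, 4, 0x13, 0x5A) yields three sub-ranges: 0x13..0x1F and 0x50..0x5A at full precision, and the prefixes 0x2..0x4 at the lower precision. That is at most 27 terms instead of the 72 distinct values in the range.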

Precision Step

You can choose any precisionStep when encoding
values. Lower step values mean more precisions and so more terms in
the index (and the index gets larger). On the other hand, the maximum
number of terms to match is reduced, which optimizes query speed. The
formula to calculate the maximum term count is:

  n = [ (bitsPerValue/precisionStep - 1) * (2^precisionStep - 1) * 2 ] + (2^precisionStep - 1)

(this formula is only correct when
bitsPerValue/precisionStep is an integer; in other
cases, the value must be rounded up and the last summand must
contain the modulo of the division as precision step). For
longs stored using a precision step of 4, n = 15*15*2 + 15 = 465,
and for a precision step of 2, n = 31*3*2 + 3 = 189. But the
faster search speed is reduced by more seeking
in the term enum of the index. Because of this, the ideal
precisionStep value can only be found out by testing.
Important: You can index with a lower precision step value
and test search speed using a multiple of the original step
value.
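
The maximum term count described above can be checked with a few lines of code. The following is a small sketch; the class TrieTermBound and the method maxTerms are hypothetical names, and the handling of the non-integer case follows the parenthetical note (round the division up and use the remainder as the precision step of the last summand).

```java
final class TrieTermBound {

    /** Theoretical maximum number of terms a range can expand to. */
    static long maxTerms(int bitsPerValue, int precisionStep) {
        if (bitsPerValue < 1 || bitsPerValue > 64 || precisionStep < 1 || precisionStep > 32) {
            throw new IllegalArgumentException("unsupported bitsPerValue/precisionStep");
        }
        // Number of precision levels, rounded up when the division is not exact.
        int levels = (bitsPerValue + precisionStep - 1) / precisionStep;
        // Bits covered by the lowest-precision level: the remainder of the
        // division, or a full precisionStep when the division is exact.
        int rest = bitsPerValue % precisionStep;
        int lastStep = (rest == 0) ? precisionStep : rest;

        long perBoundary = (1L << precisionStep) - 1L; // terms per level on one boundary
        long center = (1L << lastStep) - 1L;           // terms for the center of the range
        return (levels - 1) * perBoundary * 2L + center;
    }
}
```

For longs this gives 3825 with a precision step of 8, 465 with a step of 4, and 189 with a step of 2, matching the figures quoted in the text.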

Good values for precisionStep depend on usage and data type:

  • The default for all data types is 4, which is used when
    no precisionStep is given.
  • The ideal value in most cases for 64 bit data types
    (long, double) is 6 or 8.
  • The ideal value in most cases for 32 bit data types
    (int, float) is 4.
  • For low cardinality fields, larger precision steps are good. If
    the cardinality is < 100, it is fair to use Integer.MAX_VALUE
    (see below).
  • Steps ≥64 for long/double and ≥32 for int/float produce one
    token per value in the index, and querying is as slow as a
    conventional TermRangeQuery. But this can be used to produce
    fields that are solely used for sorting (in this case simply use
    Integer.MAX_VALUE as precisionStep). Using NumericFields for
    sorting is ideal, because building the field cache is much faster
    than with text-only numbers. These fields have one term per value
    and therefore also work with term enumeration for building
    distinct lists (e.g. facets / preselected values to search for).
    Sorting is also possible with range query optimized fields using
    one of the above precisionSteps.
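
The index-size side of this trade-off can be estimated by counting how many terms one value produces. The sketch below assumes that one term is written for each shift 0, precisionStep, 2*precisionStep, ... below the value size, which matches the description above (8 precisions for a long with step 8, one token for steps ≥64); the class TrieTokenCount and the method termsPerValue are hypothetical names.

```java
final class TrieTokenCount {

    /** Number of terms indexed for a single value. */
    static int termsPerValue(int bitsPerValue, int precisionStep) {
        if (bitsPerValue < 1 || precisionStep < 1) {
            throw new IllegalArgumentException("bitsPerValue and precisionStep must be >= 1");
        }
        int count = 0;
        // `shift` is a long so that a step of Integer.MAX_VALUE cannot overflow.
        for (long shift = 0; shift < bitsPerValue; shift += precisionStep) {
            count++;
        }
        return count;
    }
}
```

A long indexed with the default step of 4 produces 16 terms, with a step of 8 it produces 8, and with Integer.MAX_VALUE it produces a single term, which is why that setting suits fields used only for sorting.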

Comparisons of the different types of RangeQueries on an index
with about 500,000 docs showed that TermRangeQuery in boolean rewrite
mode (with raised BooleanQuery clause count) took about 30-40 secs
to complete, TermRangeQuery in constant score filter rewrite
mode took 5 secs, and executing this class took <100ms to
complete (on an Opteron64 machine, Java 1.5, 8 bit precision step).
This query type was developed for a geographic portal, where the
performance for e.g. bounding boxes or exact date/time stamps is
important.

Since: 2.9