lucene.net - Lucene - custom analyzer/tokenizer to index JSON key pair values -

June 15, 2011

I'm aiming to store and index JSON key/value pairs. Ideally I'd store them under a constant field name (for simplicity's sake, "grades").

An example of an incoming JSON object:

    "data": [
        { "key": "dp01", "value": "excellent" },
        { "key": "dp02", "value": "average" },
        { "key": "dp03", "value": "negative" }
    ]

The JSON object is serialized and stored as is, but I want to index it in a way that enables me to search within the same field by key and value. The main idea is to search for multiple values within the same Lucene field.

Any suggestions on how to structure the indexing? Let's imagine, for example, a search using the following query:

[grades: "key:dp01 uniqueidasdelimiter value:excellent"]

How could a custom analyzer/tokenizer achieve this?

Edit: an attempt to depict my goal more accurately.

Think of a typical relational type of structure (for simplicity's sake).

Each document is a website.
A website can have multiple images (and other important metadata).

Each image has multiple sets of free key/value pair properties:

    { "key": "scenery", "value": "nature" },
    { "key": "style", "value": "vintage" }

Another set:

    { "key": "scenery", "value": "industrial" },
    { "key": "style", "value": "vintage" }

My challenge is to come up with a similar type of structure and index it in a way that enables me to build queries such as:

A website with an image of scenery:industrial and style:vintage.

I'm probably taking the wrong approach, as indicated by Andy Pook. Any ideas on how to efficiently flatten out these properties?

A common "problem" is to think of indexes and documents as having a consistent set of fields. They are not the same as relational database tables with a fixed set of columns.

In a previous life I had an entity with a set of "attributes", which was a key/value collection (much like your grades).

Each document was created with fields named after each attribute, i.e. "attr-thing", with the value added as "not_analyzed".

So, in your example I'd create fields like

    new Field("grade-" + gradeId, grade, Field.Store.NO, Field.Index.NOT_ANALYZED)

Then you can search with the query "grade-dp01:excellent".
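As a rough illustration of that per-attribute naming, here is a minimal sketch in plain Java with no Lucene dependency. The class and method names (`GradeFieldFlattener`, `flatten`, `termQuery`) are hypothetical helpers, not part of any Lucene API; they only show how each key could be turned into its own field name before the fields are handed to the indexer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Lucene class): maps each key/value pair
// to its own field name, e.g. "dp01" -> field "grade-dp01".
final class GradeFieldFlattener {
    static final String PREFIX = "grade-";

    // Field name derived from the attribute key.
    static String fieldName(String key) {
        return PREFIX + key;
    }

    // Field name -> raw value; each entry would become one
    // not-analyzed field on the document.
    static Map<String, String> flatten(Map<String, String> grades) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : grades.entrySet()) {
            fields.put(fieldName(e.getKey()), e.getValue());
        }
        return fields;
    }

    // Query text in "field:term" form for a single key/value pair.
    static String termQuery(String key, String value) {
        return fieldName(key) + ":" + value;
    }
}
```

Each entry of the returned map would then be added to the document as one not-analyzed field, in the same way as the `new Field(...)` call above.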

Alternatively, you can have a fixed field name (similar to @cris-almodovar's suggestion) and set the value to "id=grade". Again not_analyzed. Then search with "grade:dp01=excellent".
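The fixed-field-name variant could be sketched the same way, again in plain Java with hypothetical names (`GradeTokens`, `token`, `tokens`). Every pair becomes one "key=value" term, and all terms are added under the single field name "grade", one field instance per term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Lucene class): encodes each pair as a single
// "key=value" term to be stored under one fixed field name.
final class GradeTokens {
    static final String FIELD = "grade";

    // One indexable term per pair, e.g. "dp01=excellent".
    static String token(String key, String value) {
        return key + "=" + value;
    }

    // All terms for one document; each would be added as a separate
    // not-analyzed instance of the "grade" field.
    static List<String> tokens(Map<String, String> grades) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : grades.entrySet()) {
            out.add(token(e.getKey(), e.getValue()));
        }
        return out;
    }

    // Query text in "field:term" form.
    static String termQuery(String key, String value) {
        return FIELD + ":" + token(key, value);
    }
}
```

Because the values are not analyzed, the query side has to produce exactly the same term. Building a term query directly, rather than running the text through an analyzer that might split or lowercase it, is probably the safest way to do that.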

Either will work. I've used both approaches with success, but I typically prefer the first.

Additional, in response to your edit...

I think I understand the problem... If you had "scenery=industrial style=vintage" and "scenery=nature style=modern", you wouldn't want a match if you searched for "nature and vintage", right?

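You could add an "imagetype" field for each set, with a value like "scenery=industrial style=vintage abc=xyz", analyzed with a WhitespaceAnalyzer (which just splits on whitespace; a KeywordAnalyzer would keep the whole value as one token).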

Then search with imagetype:"scenery=industrial style=vintage"~2. Using a phrase with slop guarantees that the values are in the same field, and the slop allows the order to be different or other values to sit in between. The number is something you'd have to figure out based on how many properties you expect in each field. Simplistically, if you expect a maximum of n values, then the slop should be n too.
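A small sketch of how the field value and the sloppy phrase query text might be assembled, in plain Java with hypothetical names (`ImageTypePhrase`, `fieldValue`, `sloppyQuery`). It only builds strings; passing the slop in as a parameter is an assumption made here so that it can be tuned to the expected number of properties.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Lucene class): builds the whitespace-separated
// "imagetype" value for one property set and the matching phrase query text.
final class ImageTypePhrase {
    static final String FIELD = "imagetype";

    // "k1=v1 k2=v2 ..." for one property set of an image.
    static String fieldValue(Map<String, String> props) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            parts.add(e.getKey() + "=" + e.getValue());
        }
        return String.join(" ", parts);
    }

    // Phrase query text with slop, e.g. imagetype:"a=b c=d"~2.
    static String sloppyQuery(Map<String, String> wanted, int slop) {
        return FIELD + ":\"" + fieldValue(wanted) + "\"~" + slop;
    }
}
```

For the two wanted properties scenery=industrial and style=vintage with a slop of 2, `sloppyQuery` produces the query shown above.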
