Document

Mtas can produce statistics on the terms used in each of the listed documents. To get this information in a Solr request, provide the following parameter in addition to the parameter that enables the Mtas query component.


| Parameter | Value | Obligatory |
|---|---|---|
| mtas.document | true | yes |


Multiple document results can be produced within the same request. To distinguish them, a unique identifier has to be provided for each of the required document results.


| Parameter | Value | Info | Obligatory |
|---|---|---|---|
| mtas.document.<identifier>.key | <string> | key used in response | no |
| mtas.document.<identifier>.field | <string> | Mtas field | yes |
| mtas.document.<identifier>.prefix | <string> | prefix | yes |
| mtas.document.<identifier>.number | <double> | create list with specified number of most frequent items | no |
| mtas.document.<identifier>.type | <string> | required type of statistics | no |
| mtas.document.<identifier>.regexp | <string> | regular expression condition on term | no |
| mtas.document.<identifier>.ignoreRegexp | <string> | regular expression condition for terms that have to be ignored | no |
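
As an illustration of how these parameters combine, the following minimal sketch assembles the query-string fragment for one document result. The helper name `document_params` is hypothetical and not part of Mtas; the parameter names and the `mtas=true` switch are taken from the tables and examples on this page.

```python
from urllib.parse import urlencode


def document_params(identifier, field, prefix, **optional):
    """Build the mtas.document.* parameters for one document result.

    `identifier` distinguishes multiple document results in one request;
    `field` and `prefix` are obligatory, everything else is optional.
    This helper is illustrative only and not part of Mtas.
    """
    base = f"mtas.document.{identifier}"
    params = {
        "mtas": "true",            # enable the Mtas query component
        "mtas.document": "true",   # enable document statistics
        f"{base}.field": field,
        f"{base}.prefix": prefix,
    }
    # Optional settings such as key, type, number, regexp, ignoreRegexp.
    for name, value in optional.items():
        params[f"{base}.{name}"] = value
    return params


params = document_params(0, "text", "t", key="words", type="all")
query_string = urlencode(params)
print(query_string)
```

The resulting fragment still has to be combined with the regular Solr parameters (`q`, `fq`, `rows`, and so on) before it is sent.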


List

A list can be provided, specifying the set of terms to consider when computing the result.


| Parameter | Value | Info | Obligatory |
|---|---|---|---|
| mtas.document.<identifier>.list | <string> | comma separated list of values | yes |
| mtas.document.<identifier>.listRegexp | <boolean> | list of values are to be interpreted as regular expressions | no |
| mtas.document.<identifier>.listExpand | <boolean> | expand the matches on values from list | no |
| mtas.document.<identifier>.listExpandNumber | <double> | number of expansions of matches on values from list | no |
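
Because the list is passed as a single comma separated string, the commas end up percent-encoded in the request. A small sketch, assuming identifier `0` and a literal (non-regexp) word list; the variable names are illustrative only:

```python
from urllib.parse import urlencode

# Words to restrict the statistics to; joined into one comma separated value.
words = ["koe", "paard", "schaap", "geit", "kip"]

base = "mtas.document.0"  # identifier 0 is an arbitrary choice
list_params = {
    f"{base}.list": ",".join(words),
    f"{base}.listRegexp": "false",  # treat the values as literal terms
    f"{base}.listExpand": "false",  # no expansion of the matches
}

encoded = urlencode(list_params)
print(encoded)
```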


Ignore list

An ignore list can also be provided, specifying the set of terms not to consider when computing the result.


| Parameter | Value | Info | Obligatory |
|---|---|---|---|
| mtas.document.<identifier>.ignoreList | <string> | comma separated list of values | yes |
| mtas.document.<identifier>.ignoreListRegexp | <boolean> | list of values are to be interpreted as regular expressions | no |


Examples


Basic : Statistics on unique words for each document

Regexp : Most frequent words containing only letters a-z and minimum length 5

List : Statistics for a provided list of words

Ignore : Statistics for a provided list of regular expressions, ignoring another list of regular expressions


Basic

Example

Statistics for the set of unique tokens with prefix t (words) for each listed document.

Request and response

fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5B%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=text&mtas.document.0.prefix=t&mtas.document.0.key=words&mtas.document.0.type=all&fl=*&start=0&rows=2&wt=json&indent=true
"mtas":{
    "document":[{
        "key":"words",
        "list":[{
            "documentKey":"4115a95c-011c-11e4-b0ff-51bcbd7c379f",
            "sumsq":113964.0,
            "populationvariance":126.5639231447591,
            "max":166.0,
            "sum":3336.0,
            "kurtosis":92.19837080635624,
            "standarddeviation":11.257199352433314,
            "n":789,
            "quadraticmean":12.01836364230935,
            "min":1.0,
            "median":1.0,
            "variance":126.72453726042504,
            "mean":4.228136882129286,
            "geometricmean":1.9285975498109995,
            "sumoflogs":518.209740627951,
            "skewness":8.377350653392202},
          {
            "documentKey":"4115aac4-011c-11e4-b0ff-51bcbd7c379f",
            "sumsq":25489.0,
            "populationvariance":35.695641666666134,
            "max":77.0,
            "sum":1563.0,
            "kurtosis":72.57030420433823,
            "standarddeviation":5.979568021426876,
            "n":600,
            "quadraticmean":6.517796151051877,
            "min":1.0,
            "median":1.0,
            "variance":35.75523372287092,
            "mean":2.6050000000000004,
            "geometricmean":1.5249529474773036,
            "sumoflogs":253.1781332820801,
            "skewness":7.70682353088895}]}]}
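
To show how such a response might be consumed, the sketch below parses a trimmed copy of the output above and recomputes the mean from `sum` and `n`. The reading that `n` counts the unique tokens and `sum` their total number of occurrences is an assumption based on the example description, not something the response itself states. The fragment is wrapped in braces to make it valid standalone JSON.

```python
import json
import math

# Trimmed copy of the response above, keeping only n, sum and mean.
response_text = """
{"mtas": {"document": [{"key": "words", "list": [
  {"documentKey": "4115a95c-011c-11e4-b0ff-51bcbd7c379f",
   "n": 789, "sum": 3336.0, "mean": 4.228136882129286},
  {"documentKey": "4115aac4-011c-11e4-b0ff-51bcbd7c379f",
   "n": 600, "sum": 1563.0, "mean": 2.6050000000000004}
]}]}}
"""

response = json.loads(response_text)

checked = []
for result in response["mtas"]["document"]:
    for doc in result["list"]:
        # The reported mean should equal sum / n up to rounding.
        recomputed = doc["sum"] / doc["n"]
        consistent = math.isclose(recomputed, doc["mean"], rel_tol=1e-9)
        checked.append((doc["documentKey"], consistent))
        print(result["key"], doc["documentKey"], recomputed, consistent)
```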

  
Regexp

Example

The most frequent tokens with prefix t (words) that contain only the letters a-z and have a minimum length of 5, for each listed document.

Regexp


[a-z]{5,}

Request and response

fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5B%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=NLContent_mtas&mtas.document.0.prefix=t&mtas.document.0.key=list+of+words&mtas.document.0.type=n%2Csum%2Cmean&mtas.document.0.regexp=%5Ba-z%5D%7B5%2C%7D&mtas.document.0.number=5&fl=%2A&start=0&rows=2&wt=json&indent=true
"mtas":{
    "document":[{
        "key":"list of words",
        "list":[{
            "documentKey":"c0c4200c-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":471,
                "key":"zijne"},
              {
                "sum":317,
                "key":"eenen"},
              {
                "sum":304,
                "key":"zegde"},
              {
                "sum":249,
                "key":"hebben"},
              {
                "sum":229,
                "key":"welke"}],
            "mean":4.552402402402403,
            "sum":30319,
            "n":6660},
          {
            "documentKey":"c0c453d8-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":348,
                "key":"heeft"},
              {
                "sum":243,
                "key":"hebben"},
              {
                "sum":199,
                "key":"prins"},
              {
                "sum":173,
                "key":"vader"},
              {
                "sum":161,
                "key":"komen"}],
            "mean":4.641632967456191,
            "sum":24104,
            "n":5193}]}]}
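
The request above is hard to read in its URL-encoded form. As a reading aid, this sketch decodes a few of its parameters with the Python standard library; only a shortened part of the query string is repeated here.

```python
from urllib.parse import parse_qs

# Shortened part of the request above (document statistics parameters only).
request = (
    "mtas.document.0.key=list+of+words"
    "&mtas.document.0.type=n%2Csum%2Cmean"
    "&mtas.document.0.regexp=%5Ba-z%5D%7B5%2C%7D"
    "&mtas.document.0.number=5"
)

# parse_qs returns a list per parameter; each name occurs once here.
decoded = {name: values[0] for name, values in parse_qs(request).items()}

for name, value in decoded.items():
    print(f"{name} = {value}")
```

Decoded, the `regexp` value is the expression shown under Regexp, and `type` is the comma separated list `n,sum,mean`.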

  
List

Example

Statistics for a provided list of words for each listed document.

List

koe,paard,schaap,geit,kip

Request and response

fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5Bt_lc%3D%5C%22koe%5C%22%7Ct_lc%3D%5C%22paard%5C%22%7Ct_lc%3D%5C%22schaap%5C%22%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=text&mtas.document.0.prefix=t_lc&mtas.document.0.key=list+of+words&mtas.document.0.type=n%2Csum%2Cmean&mtas.document.0.list=koe%2Cpaard%2Cschaap%2Cgeit%2Ckip&mtas.document.0.listRegexp=false&mtas.document.0.listExpand=false&mtas.document.0.number=100&fl=%2A&start=0&rows=2&wt=json&indent=true
"mtas":{
    "document":[{
        "key":"list of words",
        "list":[{
            "documentKey":"c0c46b7a-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":3,
                "key":"paard"},
              {
                "sum":2,
                "key":"schaap"}],
            "mean":2.5,
            "sum":5,
            "n":2},
          {
            "documentKey":"c0c453d8-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":31,
                "key":"paard"},
              {
                "sum":1,
                "key":"kip"}],
            "mean":16.0,
            "sum":32,
            "n":2}]}]}
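
The `fq` parameter in the request above restricts the listed documents with an `mtas_cql` filter. The following sketch rebuilds that filter from a word list; the helper names `cql_any_of` and `mtas_cql_filter` are hypothetical, and the output omits the three spaces that precede the closing brace in the original request.

```python
from urllib.parse import quote_plus


def cql_any_of(prefix, words):
    """Build a CQL token that matches any of the given words.

    The quotes are backslash-escaped because the CQL expression is
    itself embedded in a double-quoted local parameter.
    """
    alternatives = "|".join(f'{prefix}=\\"{word}\\"' for word in words)
    return f"[{alternatives}]"


def mtas_cql_filter(field, cql):
    """Wrap a CQL expression in the mtas_cql local-params syntax."""
    return f'{{!mtas_cql field="{field}" query="{cql}"}}'


cql = cql_any_of("t_lc", ["koe", "paard", "schaap"])
fq = mtas_cql_filter("text", cql)
encoded_fq = quote_plus(fq)

print(fq)
print(encoded_fq)
```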

  
Ignore

Example

Statistics for a provided list of regular expressions, ignoring another list of regular expressions, for each listed document.

Regexp

[a-z]{7,}

Ignore

[a-z]{10,}

List

een.*,.*heid

Ignore list

een.*heid,ee.*nheid

Request and response

fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5Bt_lc%3D%5C%22eenheid%5C%22%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=text&mtas.document.0.prefix=t_lc&mtas.document.0.key=advanced+list+of+words&mtas.document.0.type=n%2Csum%2Cmean&mtas.document.0.regexp=%5Ba-z%5D%7B7%2C%7D&mtas.document.0.list=een.%2A%2C.%2Aheid&mtas.document.0.listRegexp=true&mtas.document.0.listExpand=true&mtas.document.0.listExpandNumber=3&mtas.document.0.ignoreRegexp=%5Ba-z%5D%7B10%2C%7D&mtas.document.0.ignoreList=een.%2Aheid%2Cee.%2Anheid&mtas.document.0.ignoreListRegexp=true&mtas.document.0.number=10&fl=text_numberOfPositions%2CNLCore_NLIdentification_nederlabID%2CNLProfile_name%2CNLTitle_title&start=0&rows=2&wt=json&indent=true
"mtas":{
    "document":[{
        "key":"advanced list of words",
        "list":[{
            "documentKey":"c0c41486-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":166,
                "list":{
                  "droefheid":{
                    "sum":36},
                  "godheid":{
                    "sum":22},
                  "waarheid":{
                    "sum":22}},
                "key":".*heid"},
              {
                "sum":93,
                "list":{
                  "eenigen":{
                    "sum":46},
                  "eensklaps":{
                    "sum":32},
                  "eenigste":{
                    "sum":3}},
                "key":"een.*"}],
            "mean":5.886363636363637,
            "sum":259,
            "n":44},
          {
            "documentKey":"c0c453d8-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":36,
                "list":{
                  "afscheid":{
                    "sum":12},
                  "hoogheid":{
                    "sum":4},
                  "bezigheid":{
                    "sum":3}},
                "key":".*heid"},
              {
                "sum":24,
                "list":{
                  "eenvoudig":{
                    "sum":15},
                  "eenzame":{
                    "sum":3},
                  "eenmaal":{
                    "sum":2}},
                "key":"een.*"}],
            "mean":3.1578947368421053,
            "sum":60,
            "n":19}]}]}
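
The way the four conditions interact can be approximated outside Mtas. The sketch below uses Python's `re` module as a stand-in and assumes that every expression has to match the whole term; that assumption is consistent with the response above but is not a statement about how Mtas evaluates the conditions internally. The function name `is_counted` is hypothetical.

```python
import re

REGEXP = r"[a-z]{7,}"            # mtas.document.0.regexp
IGNORE_REGEXP = r"[a-z]{10,}"    # mtas.document.0.ignoreRegexp
LIST = [r"een.*", r".*heid"]     # mtas.document.0.list (as regexps)
IGNORE_LIST = [r"een.*heid", r"ee.*nheid"]  # mtas.document.0.ignoreList


def is_counted(term):
    """Approximate whether a term would contribute to the statistics."""
    if not re.fullmatch(REGEXP, term):
        return False  # fails the general condition on the term
    if re.fullmatch(IGNORE_REGEXP, term):
        return False  # explicitly ignored by ignoreRegexp
    if any(re.fullmatch(pattern, term) for pattern in IGNORE_LIST):
        return False  # explicitly ignored by ignoreList
    # Finally, the term has to match at least one list entry.
    return any(re.fullmatch(pattern, term) for pattern in LIST)


for term in ["droefheid", "eenigen", "eenheid", "eenvoudigheid", "paard"]:
    print(term, is_counted(term))
```

Under these assumptions `droefheid` and `eenigen` are counted, `eenheid` is dropped by the ignore list, `eenvoudigheid` by the ignore regexp, and `paard` by the general regexp.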


Lucene

To get statistics on the terms used in the listed documents directly in Lucene, use ComponentDocument together with the provided collect method.