
Document

Mtas can produce statistics on the terms used in each of the listed documents. To get this information in a Solr request, the following parameter should be provided in addition to the parameter that enables the Mtas query component.

| Parameter | Value | Obligatory |
|-----------|-------|------------|
| mtas.document | true | yes |

Multiple document results can be produced within the same request. To distinguish them, a unique identifier has to be provided for each of the required document results.

| Parameter | Value | Info | Obligatory |
|-----------|-------|------|------------|
| mtas.document.<identifier>.key | <string> | key used in response | no |
| mtas.document.<identifier>.field | <string> | Mtas field | yes |
| mtas.document.<identifier>.prefix | <string> | prefix | yes |
| mtas.document.<identifier>.number | <double> | create list with specified number of most frequent items | no |
| mtas.document.<identifier>.type | <string> | required type of statistics | no |
| mtas.document.<identifier>.regexp | <string> | regular expression condition on term | no |
| mtas.document.<identifier>.ignoreRegexp | <string> | regular expression condition for terms that have to be ignored | no |
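
As a rough sketch of how these parameters combine into a request, the Python snippet below assembles the query string for a single document result. The helper name `mtas_document_params` and the sample values are illustrative assumptions and not part of Mtas; only the parameter names are taken from the list above.

```python
from urllib.parse import urlencode


def mtas_document_params(identifier, field, prefix, **options):
    """Build the mtas.document.* parameters for one document result.

    Hypothetical helper: only the parameter names come from the
    documentation; the function itself is not part of Mtas.
    """
    base = f"mtas.document.{identifier}"
    # field and prefix are obligatory for every document result
    params = {f"{base}.field": field, f"{base}.prefix": prefix}
    # optional settings such as key, type, number, regexp, ignoreRegexp
    for name, value in options.items():
        params[f"{base}.{name}"] = value
    return params


# Enable the Mtas component and the document statistics
params = {"q": "*:*", "mtas": "true", "mtas.document": "true"}
params.update(
    mtas_document_params(0, "text", "t", key="words", type="n,sum,mean", number=5)
)

# urlencode percent-encodes reserved characters such as ',' and '*'
query_string = urlencode(params)
print(query_string)
```

Using a different identifier for each call (for example `0`, `1`, `2`) is what allows several document results to be requested at once.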

List

A list can be provided, specifying the set of terms to consider when computing the result.

| Parameter | Value | Info | Obligatory |
|-----------|-------|------|------------|
| mtas.document.<identifier>.list | <string> | comma separated list of values | yes |
| mtas.document.<identifier>.listRegexp | <boolean> | list of values are to be interpreted as regular expressions | no |
| mtas.document.<identifier>.listExpand | <boolean> | expand the matches on values from list | no |
| mtas.document.<identifier>.listExpandNumber | <double> | number of expansions of matches on values from list | no |

Ignore list

An ignore list can also be provided, specifying the set of terms to exclude when computing the result.

| Parameter | Value | Info | Obligatory |
|-----------|-------|------|------------|
| mtas.document.<identifier>.ignoreList | <string> | comma separated list of values | yes |
| mtas.document.<identifier>.ignoreListRegexp | <boolean> | list of values are to be interpreted as regular expressions | no |

Examples

  1. Basic: Statistics on unique words for each document
  2. Regexp: Most frequent words containing only the letters a-z with a minimum length of 5
  3. List: Statistics for a provided list of words
  4. Ignore: Statistics for a provided list of regular expressions, ignoring another list of regular expressions

Basic

Example
Statistics for the set of unique tokens with prefix t (words) for each listed document.

Request and response
fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5B%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=text&mtas.document.0.prefix=t&mtas.document.0.key=words&mtas.document.0.type=all&fl=*&start=0&rows=2&wt=json&indent=true

"mtas":{
    "document":[{
        "key":"words",
        "list":[{
            "documentKey":"4115a95c-011c-11e4-b0ff-51bcbd7c379f",
            "sumsq":113964.0,
            "populationvariance":126.5639231447591,
            "max":166.0,
            "sum":3336.0,
            "kurtosis":92.19837080635624,
            "standarddeviation":11.257199352433314,
            "n":789,
            "quadraticmean":12.01836364230935,
            "min":1.0,
            "median":1.0,
            "variance":126.72453726042504,
            "mean":4.228136882129286,
            "geometricmean":1.9285975498109995,
            "sumoflogs":518.209740627951,
            "skewness":8.377350653392202},
          {
            "documentKey":"4115aac4-011c-11e4-b0ff-51bcbd7c379f",
            "sumsq":25489.0,
            "populationvariance":35.695641666666134,
            "max":77.0,
            "sum":1563.0,
            "kurtosis":72.57030420433823,
            "standarddeviation":5.979568021426876,
            "n":600,
            "quadraticmean":6.517796151051877,
            "min":1.0,
            "median":1.0,
            "variance":35.75523372287092,
            "mean":2.6050000000000004,
            "geometricmean":1.5249529474773036,
            "sumoflogs":253.1781332820801,
            "skewness":7.70682353088895}]}]}

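The relationships between the reported statistics can be checked with a few lines of Python. The sketch below assumes that the field names carry their usual statistical meaning (the source does not define them) and recomputes some of them from `n`, `sum`, `sumsq` and `sumoflogs` for the first document in the response above.

```python
import math

# Values copied from the first document in the response above
stats = {
    "n": 789,
    "sum": 3336.0,
    "sumsq": 113964.0,
    "sumoflogs": 518.209740627951,
    "mean": 4.228136882129286,
    "populationvariance": 126.5639231447591,
    "variance": 126.72453726042504,
    "standarddeviation": 11.257199352433314,
    "quadraticmean": 12.01836364230935,
    "geometricmean": 1.9285975498109995,
}

n = stats["n"]
mean = stats["sum"] / n
# Assumed definition: population variance = E[x^2] - (E[x])^2
population_variance = stats["sumsq"] / n - mean**2

derived = {
    "mean": mean,
    "populationvariance": population_variance,
    # Sample variance applies the n / (n - 1) correction
    "variance": population_variance * n / (n - 1),
    "quadraticmean": math.sqrt(stats["sumsq"] / n),
    "geometricmean": math.exp(stats["sumoflogs"] / n),
}
derived["standarddeviation"] = math.sqrt(derived["variance"])

# Compare each derived value with the value reported by Mtas
for name, value in derived.items():
    consistent = math.isclose(value, stats[name], rel_tol=1e-6)
    print(f"{name}: derived {value:.6f}, reported {stats[name]:.6f}, consistent={consistent}")
```

Under these assumptions, `n` is the number of unique tokens in the document and `sum` the total number of token occurrences, so `mean` is the average number of occurrences per unique token.
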
Regexp

Example
The most frequent tokens with prefix t (words) that contain only the letters a-z and have a minimum length of 5, for each listed document.

Regexp

[a-z]{5,}

Request and response
fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5B%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=NLContent_mtas&mtas.document.0.prefix=t&mtas.document.0.key=list+of+words&mtas.document.0.type=n%2Csum%2Cmean&mtas.document.0.regexp=%5Ba-z%5D%7B5%2C%7D&mtas.document.0.number=5&fl=%2A&start=0&rows=2&wt=json&indent=true

"mtas":{
    "document":[{
        "key":"list of words",
        "list":[{
            "documentKey":"c0c4200c-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":471,
                "key":"zijne"},
              {
                "sum":317,
                "key":"eenen"},
              {
                "sum":304,
                "key":"zegde"},
              {
                "sum":249,
                "key":"hebben"},
              {
                "sum":229,
                "key":"welke"}],
            "mean":4.552402402402403,
            "sum":30319,
            "n":6660},
          {
            "documentKey":"c0c453d8-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":348,
                "key":"heeft"},
              {
                "sum":243,
                "key":"hebben"},
              {
                "sum":199,
                "key":"prins"},
              {
                "sum":173,
                "key":"vader"},
              {
                "sum":161,
                "key":"komen"}],
            "mean":4.641632967456191,
            "sum":24104,
            "n":5193}]}]}

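To make the meaning of `regexp`, `number` and the `n`, `sum` and `mean` statistics more concrete, the following Python sketch mimics the computation on a small made-up token list. It is an illustration under assumptions, not the Mtas implementation: the tokens are invented, and the expression is assumed to have to match the whole term.

```python
import re
from collections import Counter

# Made-up token stream for illustration; not taken from any Mtas index
tokens = ["zijne", "de", "zijne", "hebben", "een", "welke",
          "hebben", "zijne", "Prins", "huis9"]

pattern = re.compile(r"[a-z]{5,}")

# Assumption: the expression has to match the whole term, so fullmatch is used
counts = Counter(t for t in tokens if pattern.fullmatch(t))

n = len(counts)               # number of distinct matching terms
total = sum(counts.values())  # total number of occurrences of those terms
mean = total / n              # average number of occurrences per term
top = counts.most_common(2)   # analogous to mtas.document.<identifier>.number=2

print(n, total, mean, top)
```

The response above is consistent with this reading: for the first document, 30319 / 6660 gives the reported mean of about 4.5524.
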
List

Example
Statistics for a provided list of words for each listed document.

List
koe,paard,schaap,geit,kip

Request and response
fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5Bt_lc%3D%5C%22koe%5C%22%7Ct_lc%3D%5C%22paard%5C%22%7Ct_lc%3D%5C%22schaap%5C%22%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=text&mtas.document.0.prefix=t_lc&mtas.document.0.key=list+of+words&mtas.document.0.type=n%2Csum%2Cmean&mtas.document.0.list=koe%2Cpaard%2Cschaap%2Cgeit%2Ckip&mtas.document.0.listRegexp=false&mtas.document.0.listExpand=false&mtas.document.0.number=100&fl=%2A&start=0&rows=2&wt=json&indent=true

"mtas":{
    "document":[{
        "key":"list of words",
        "list":[{
            "documentKey":"c0c46b7a-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":3,
                "key":"paard"},
              {
                "sum":2,
                "key":"schaap"}],
            "mean":2.5,
            "sum":5,
            "n":2},
          {
            "documentKey":"c0c453d8-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":31,
                "key":"paard"},
              {
                "sum":1,
                "key":"kip"}],
            "mean":16.0,
            "sum":32,
            "n":2}]}]}

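On the client side, a response like the one above can be reduced to a per-document mapping of term to count. The Python sketch below is one possible way to do this; it embeds the response fragment from above, wrapped in an outer pair of braces so that it parses as JSON, and the variable names are illustrative.

```python
import json

# Response fragment from above, wrapped in braces to make it valid JSON
response_text = """
{"mtas":{
    "document":[{
        "key":"list of words",
        "list":[{
            "documentKey":"c0c46b7a-1eee-11e5-b891-f48ce0be173a",
            "list":[{"sum":3,"key":"paard"},{"sum":2,"key":"schaap"}],
            "mean":2.5,"sum":5,"n":2},
          {
            "documentKey":"c0c453d8-1eee-11e5-b891-f48ce0be173a",
            "list":[{"sum":31,"key":"paard"},{"sum":1,"key":"kip"}],
            "mean":16.0,"sum":32,"n":2}]}]}}
"""

response = json.loads(response_text)

# Map each documentKey to a {term: number of occurrences} dictionary
per_document = {}
for result in response["mtas"]["document"]:
    for doc in result["list"]:
        per_document[doc["documentKey"]] = {
            item["key"]: item["sum"] for item in doc["list"]
        }

for document_key, term_counts in per_document.items():
    print(document_key, term_counts)
```

Only terms from the provided list that actually occur in a document appear in its result, which is why `koe` and `geit` are absent here.
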
Ignore

Example
Statistics for a provided list of regular expressions, ignoring another list of regular expressions, for each listed document.

Regexp
[a-z]{7,}

Ignore
[a-z]{10,}

List
een.*,.*heid

Ignore list
een.*heid,ee.*nheid

Request and response
fq=%7B%21mtas_cql+field%3D%22text%22+query%3D%22%5Bt_lc%3D%5C%22eenheid%5C%22%5D%22+++%7D&q=%2A%3A%2A&mtas=true&mtas.document=true&mtas.document.0.field=text&mtas.document.0.prefix=t_lc&mtas.document.0.key=advanced+list+of+words&mtas.document.0.type=n%2Csum%2Cmean&mtas.document.0.regexp=%5Ba-z%5D%7B7%2C%7D&mtas.document.0.list=een.%2A%2C.%2Aheid&mtas.document.0.listRegexp=true&mtas.document.0.listExpand=true&mtas.document.0.listExpandNumber=3&mtas.document.0.ignoreRegexp=%5Ba-z%5D%7B10%2C%7D&mtas.document.0.ignoreList=een.%2Aheid%2Cee.%2Anheid&mtas.document.0.ignoreListRegexp=true&mtas.document.0.number=10&fl=text_numberOfPositions%2CNLCore_NLIdentification_nederlabID%2CNLProfile_name%2CNLTitle_title&start=0&rows=2&wt=json&indent=true

"mtas":{
    "document":[{
        "key":"advanced list of words",
        "list":[{
            "documentKey":"c0c41486-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":166,
                "list":{
                  "droefheid":{
                    "sum":36},
                  "godheid":{
                    "sum":22},
                  "waarheid":{
                    "sum":22}},
                "key":".*heid"},
              {
                "sum":93,
                "list":{
                  "eenigen":{
                    "sum":46},
                  "eensklaps":{
                    "sum":32},
                  "eenigste":{
                    "sum":3}},
                "key":"een.*"}],
            "mean":5.886363636363637,
            "sum":259,
            "n":44},
          {
            "documentKey":"c0c453d8-1eee-11e5-b891-f48ce0be173a",
            "list":[{
                "sum":36,
                "list":{
                  "afscheid":{
                    "sum":12},
                  "hoogheid":{
                    "sum":4},
                  "bezigheid":{
                    "sum":3}},
                "key":".*heid"},
              {
                "sum":24,
                "list":{
                  "eenvoudig":{
                    "sum":15},
                  "eenzame":{
                    "sum":3},
                  "eenmaal":{
                    "sum":2}},
                "key":"een.*"}],
            "mean":3.1578947368421053,
            "sum":60,
            "n":19}]}]}

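The way the four conditions in this example interact can be sketched as a single predicate. The Python snippet below is a guess at the semantics inferred from the response above, not the Mtas implementation: it assumes that every expression must match the whole term, and it uses Python's `re` syntax, which differs in details from the regular expression syntax used by Lucene.

```python
import re

# Conditions taken from the example above
regexp = re.compile(r"[a-z]{7,}")
ignore_regexp = re.compile(r"[a-z]{10,}")
list_patterns = [re.compile(p) for p in "een.*,.*heid".split(",")]
ignore_list_patterns = [re.compile(p) for p in "een.*heid,ee.*nheid".split(",")]


def is_counted(term):
    """Return True if the term passes all four conditions (assumed semantics)."""
    if not regexp.fullmatch(term):
        return False  # fails the regexp condition
    if ignore_regexp.fullmatch(term):
        return False  # excluded by ignoreRegexp
    if not any(p.fullmatch(term) for p in list_patterns):
        return False  # matches no entry of the list
    if any(p.fullmatch(term) for p in ignore_list_patterns):
        return False  # excluded by the ignore list
    return True


for term in ["droefheid", "eenigen", "eenheid", "gerechtigheid", "waarom"]:
    print(term, is_counted(term))
```

Under these assumptions, `eenheid` is dropped by the ignore list and `gerechtigheid` by `ignoreRegexp`, which agrees with the expanded terms in the response all having between 7 and 9 letters.
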
Lucene

To get statistics on the terms used in the listed documents directly in Lucene, use ComponentDocument together with the provided collect method.