Search engines

What is search?

In general

hash_table.png

Linear search

This is the rough let’s-get-it-done search algorithm: it gets the job done, but it’s not very efficient.

def linear_search(array, key)
  # Scan every element in order until we find the key.
  array.each_with_index do |element, index|
    return index if element == key
  end
  -1
end


arr = [7, 6, 25, 19, 8, 14, 3, 16, 2, 0]
key = 3

p linear_search(arr, key)

Binary search

def binary_search(array, key)
  # Assumes array is sorted.
  low, high = 0, array.length - 1
  while low <= high
    mid = (low + high) >> 1
    case key <=> array[mid]
    when 1  then low = mid + 1
    when -1 then high = mid - 1
    else return mid
    end
  end
  -1
end

arr = [1,3,4,12,16,21,34,45,55,76,99,101]
key = 3
p binary_search(arr, key)

Compare

require 'benchmark'; require './searches'
# ruby 2.6
arr = (1..).step(5).take(1000000)
key = 1000

Benchmark.bm do |x|
  x.report('linear') { linear_search(arr, key) }
  x.report('binary') { binary_search(arr, key) }
end
#        user     system      total        real
# linear  0.006069   0.000000   0.006069 (  0.006106)
# binary  0.000012   0.000000   0.000012 (  0.000011)

Search also shows up in compilers, e.g. in the lexer/tokenizer.

Regex

/W[aeiou]rd/.match("Word")
# => #<MatchData "Word">

grep, ag, ripgrep, etc.

  • We use these utilities every day, combined with cat or similar
cat smt | rg "something useful"

Analyze

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

# [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]

Tokenize

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html

POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

# [ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]

E.g.

Change the text into a stream of tokens:

  • Change case to a uniform one
  • Remove "stop words" such as "the", "a"
  • Remove whitespace
  • Optionally remove plural endings
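A minimal sketch of these steps in Ruby (a toy, assuming simple whitespace tokenization and a very naive plural-stripping rule; real analyzers, like Lucene's, are far more sophisticated):

```ruby
# Hypothetical toy analyzer illustrating the steps above.
STOP_WORDS = %w[the a an].freeze

def analyze(text)
  text.downcase                                      # normalize case
      .gsub(/[^a-z0-9\s']/, ' ')                     # strip punctuation
      .split                                         # whitespace -> tokens
      .reject { |token| STOP_WORDS.include?(token) } # drop stop words
      .map { |token| token.sub(/'s\z/, '').sub(/(es|s)\z/, '') } # crude stemming
end

p analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
```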

Let's create a search engine

What we could do

  • We could loop through every word and store findings in some state
  • We could sort our text
  • We could tokenize our text
  • Under the hood, a search engine parses text into a tree structure and uses a search algorithm to retrieve data
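The ideas above can be sketched as a tiny inverted index in Ruby (a hypothetical toy mapping each token to the documents that contain it, not how a real engine stores data):

```ruby
# Toy inverted index: token => list of document ids.
def build_index(docs)
  index = Hash.new { |hash, key| hash[key] = [] }
  docs.each_with_index do |text, id|
    text.downcase.split.uniq.each { |token| index[token] << id }
  end
  index
end

def search(index, token)
  index.fetch(token.downcase, [])
end

docs = ["the quick brown fox", "the lazy dog", "quick lazy fox"]
index = build_index(docs)
p search(index, "quick")  # ids of documents containing "quick"
```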

Lucene

  • The fundamental concepts in Lucene are index, document, field and term.
  • An index contains a sequence of documents.
  • Document is a sequence of fields.
  • Field is a named sequence of terms.
  • Term is a sequence of bytes.
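Purely as an illustration, this hierarchy can be sketched with Ruby structs (the class names here are ours, not Lucene's API):

```ruby
# Term: a sequence of bytes; Field: a named sequence of terms;
# Document: a sequence of fields; Index: a sequence of documents.
Term     = Struct.new(:bytes)
Field    = Struct.new(:name, :terms)
Document = Struct.new(:fields)
Index    = Struct.new(:documents)

doc   = Document.new([Field.new("title", [Term.new("hello"), Term.new("world")])])
index = Index.new([doc])
p index.documents.first.fields.first.name
```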

Elasticsearch

Distributed, RESTful search and analytics.

Based on Lucene

That means they both use the same format for indexing.

Near Real time

Cluster

Each cluster has a single master node which is chosen automatically by the cluster and which can be replaced if the current master node fails.

Replicas

Each primary shard can have zero or more replicas. A replica is a copy of the primary shard, and has two purposes:

  • increase failover: a replica shard can be promoted to a primary shard if the primary fails
  • increase performance: get and search requests can be handled by primary or replica shards. By default, each primary shard has one replica, but the number of replicas can be changed dynamically on an existing index. A replica shard will never be started on the same node as its primary shard.

Shards

A shard is a single Lucene instance. It is a low-level “worker” unit which is managed automatically by Elasticsearch. An index is a logical namespace which points to primary and replica shards.

Setup

$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.5.4.tar.gz
$ tar xvf elasticsearch-6.5.4.tar.gz
$ cd elasticsearch-6.5.4 && ./bin/elasticsearch

Kibana

kibana_ui.png

Kibana

kibana_dash.png

Logstash

Logstash is the central dataflow engine in the Elastic Stack for gathering, enriching, and unifying all of your data regardless of format or schema.

JRuby

Plugins are just Ruby gems

Document Based

An event is just a JSON document
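For example, an event flowing through the pipeline might look like this (a hypothetical sample; the exact fields depend on your inputs and filters, though `@timestamp` and `@version` are standard):

```json
{
  "@timestamp": "2019-01-15T12:00:00.000Z",
  "@version": "1",
  "message": "GET /index.html 200",
  "host": "web-1"
}
```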

Node

One instance of Elasticsearch

Queries

Just JSON in the body of an HTTP request

REST API interface

  • Check your cluster, node, and index health, status, and statistics
  • Administer your cluster, node, and index data and metadata
  • Perform CRUD (Create, Read, Update, and Delete) and search operations against your indexes
  • Execute advanced search operations such as paging, sorting, filtering, scripting, aggregations, and many others

Cluster Health

curl -X GET "localhost:9200/_cat/health?v"
epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1475247709 17:01:49  elasticsearch green           1         1      0   0    0    0        0             0                  -                100.0%

Index

# Create

curl -X PUT "localhost:9200/customer?pretty"
# Show

curl -X GET "localhost:9200/_cat/indices?v"

health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   customer 95SQ4TSUT7mWBT7VNHH67A   5   1          0            0       260b           260b

Index (continued)

# Delete

curl -X DELETE "localhost:9200/customer?pretty"
curl -X GET "localhost:9200/_cat/indices?v"

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size

Create

# Add document
curl -X PUT "localhost:9200/customer/_doc/1?pretty" \
-H 'Content-Type: application/json' -d'
{
  "name": "John Doe"
}
'

Show

# Show with id 1
curl -X GET "localhost:9200/customer/_doc/1?pretty"
{
  "_index" : "customer",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : { "name": "John Doe" }
}

Update

curl -X POST "localhost:9200/customer/_doc/1/_update?pretty" \
     -H 'Content-Type: application/json' -d'
{
  "doc": { "name": "Jane Doe", "age": 20 }
}
'
curl -X POST "localhost:9200/customer/_doc/1/_update?pretty" \
     -H 'Content-Type: application/json' -d'
{
  "script" : "ctx._source.age += 5"
}
'

Batch processing

curl -X POST "localhost:9200/customer/_doc/_bulk?pretty" \
     -H 'Content-Type: application/json' -d'
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}
'

Searching

curl -X GET "localhost:9200/bank/_search" \
     -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ]
}
'

curl -X GET "localhost:9200/bank/_search" \
     -H 'Content-Type: application/json' -d'
{
  "query": { "match": { "account_number": 20 } }
}
'

Filtering

curl -X GET "localhost:9200/bank/_search" \
     -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
	"range": {
	  "balance": {
	    "gte": 20000,
	    "lte": 30000
	  }
	}
      }
    }
  }
}
'

Solr

Based on Lucene

Useless errors

solr_shit.png

UI

solr_dash.png

Response

solr_response.png

Query

curl -X GET http://localhost:8983/solr/books/query -d '
{
  "query": {
    "bool": {
      "must_not": "{!frange u:3.0}ranking"
    }
  },
  "filter": [
    "title:solr",
    { "lucene": { "df": "content", "query": "lucene solr" } }
  ]
}'

Manipulation through the REST API

https://lucene.apache.org/solr/guide/7_5/json-query-dsl.html

solr_collection/(create|update|delete)

Enterprise level search solution

Tantivy (Rust)

https://github.com/tantivy-search/tantivy
https://github.com/voloyev/actix_tantivy

Like Lucene, but a crate (a library in the Rust ecosystem)

Can be integrated into your app

We could create libraries with FFI to get data from Tantivy

Or just use it as an API