Project author: willf

Project description:
A simple in-memory inverted index in Python
Primary language: Python
Repository: git://github.com/willf/inverted_index.git
Created: 2016-07-28T19:18:56Z
Project page: https://github.com/willf/inverted_index

License: BSD 2-Clause "Simplified" License


Inverted Index

A simple in-memory inverted index system, with a modest query language.

    import inverted_index

    i = inverted_index.Index()
    i.index(1, "this is the day they give babies away with half a pound of tea")
    i.index(1, "if you know any ladies who need any babies just send them round to ")
    i.index(2, "babies are born in the circle of the sun")
    results, err = i.query("babies")
    print(results)
    # {1, 2}
    results, err = i.query("babies AND ladies")
    print(results)
    # {1}
    i.index(3, "WHERE ARE THE BABIES", tokenizer=lambda s: s.lower().split())
    results, err = i.query("babies")
    print(results)
    # {1, 2, 3}
    i.unindex(3)
    results, err = i.query("babies")
    print(results)
    # {1, 2}

Any hashable object can be the “document”, and a tokenizer can be specified to tokenize the
text to index. There are also add_token and add_tokens methods to directly index on individual
tokens.
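
To make the idea concrete, here is a minimal sketch of what an in-memory inverted index does conceptually: it maps each token to the set of documents that contain it. This is not this library's implementation. `TinyIndex` and its `lookup` method are hypothetical names chosen for illustration, and the whitespace-splitting default tokenizer is an assumption.

```python
from collections import defaultdict


class TinyIndex:
    """Hypothetical illustration only; not the inverted_index API."""

    def __init__(self):
        # token -> set of documents that contain the token
        self.postings = defaultdict(set)

    def index(self, doc, text, tokenizer=str.split):
        # Any hashable object can serve as the document key.
        for token in tokenizer(text):
            self.postings[token].add(doc)

    def unindex(self, doc):
        # Remove the document from every posting set, dropping empty ones.
        for token in list(self.postings):
            self.postings[token].discard(doc)
            if not self.postings[token]:
                del self.postings[token]

    def lookup(self, token):
        # Return a copy so callers cannot mutate the index by accident.
        return set(self.postings.get(token, ()))


idx = TinyIndex()
idx.index(1, "this is the day they give babies away")
# A tuple works as a document key because it is hashable.
idx.index(("report", 2016), "WHERE ARE THE BABIES",
          tokenizer=lambda s: s.lower().split())
print(idx.lookup("babies"))
idx.unindex(1)
print(idx.lookup("babies"))
```

Because each posting list is a plain `set`, looking up a single term is one dictionary access, and combining terms reduces to set operations.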

The query language is very simple: it understands AND, OR, NOT, and parentheses. For example:

    term OR term
    term AND term OR term
    (term AND term) OR term
    NOT term
    NOT term AND (term OR term)

AND, OR, and NOT have equal precedence, so use parentheses to disambiguate.
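
To see why the grouping matters, the sketch below evaluates differently parenthesized queries with plain Python sets standing in for posting lists. The sets `a`, `b`, `c`, and `all_docs` are made-up data, and treating NOT as the complement relative to all indexed documents is an assumption about the semantics, not something taken from the library's source.

```python
# Hypothetical posting sets: the documents containing each term.
all_docs = {1, 2, 3, 4}
a = {1, 2}
b = {2, 3}
c = {3, 4}

# (a AND b) OR c
left_grouped = (a & b) | c
# a AND (b OR c)
right_grouped = a & (b | c)

# NOT modeled as set complement against all documents (an assumption).
# (NOT a) AND (b OR c)
not_binds_to_a = (all_docs - a) & (b | c)
# NOT (a AND (b OR c))
not_binds_to_rest = all_docs - (a & (b | c))

print(sorted(left_grouped))       # [2, 3, 4]
print(sorted(right_grouped))      # [2]
print(sorted(not_binds_to_a))     # [3, 4]
print(sorted(not_binds_to_rest))  # [1, 3, 4]
```

The same three terms produce different result sets under each grouping, so writing the parentheses explicitly avoids depending on how an unparenthesized query happens to be read.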

I’m pretty sure you don’t want to use this in production code :)