Project author: wangcongcong123

Project description:
A web app: Retrieving contents within a site
Primary language: Jupyter Notebook
Repository: git://github.com/wangcongcong123/insite_retrieve.git
Created: 2020-07-18T21:34:14Z
Project community: https://github.com/wangcongcong123/insite_retrieve

License: MIT License


InsiteRetriever: A web app for retrieving contents within a site.

Motivation

This app is inspired by the fact that many bloggers who are skilled in one field or another maintain sites to share their knowledge, yet those sites often provide no search box. More broadly, the app is not limited to blogging sites and can be applied to any other type of site.

Because it lacks the resources of big search engines like Google, the app supports only in-site search, performed at the paragraph level. Hopefully this helps information seekers get domain-specific knowledge efficiently from an expert's (e.g., a blogger's) site or any other type of site.

Demo (Notebook)

demo

Features

Quick Start

  1. git clone https://github.com/wangcongcong123/insite_retrieve.git
  2. cd insite_retrieve
  3. pip install -r requirements.txt
  4. apt-get install maven -qq
  5. git clone https://github.com/castorini/anserini.git
  6. cd anserini
  7. mvn clean package appassembler:assemble -DskipTests -Dmaven.javadoc.skip=true
  8. cd ..
  9. streamlit run app.py

Todo ideas

Let me know if you have any questions; feedback is welcome. Contributions and pull requests are highly encouraged. Below are some todo ideas.

  • The system currently supports only the sparse retrieval-based BM25 model, so further work could extend it to dense retrieval-based techniques such as the recent RetriBERT.
  • The system currently includes only a word cloud image indicating what a site's topics are generally about, so further work could add features such as topic modelling on the site's contents.
  • The presentation of retrieved results could be improved, for example by showing more supplementary data (title, original location of the paragraph), highlighting exactly matched words, and better rendering.
  • The contents of each site are currently extracted coarsely; more filtering strategies are needed to improve the quality of the extracted contents.
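
For readers unfamiliar with the sparse BM25 model mentioned in the first item, here is a minimal pure-Python sketch of BM25 scoring over paragraphs. It is an illustration only, not the app's implementation (the app builds on Anserini); the names `tokenize` and `bm25_rank`, the simple regex tokenizer, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, paragraphs, k1=1.2, b=0.75):
    """Return (score, paragraph_index) pairs, best match first.

    k1 and b are common textbook values, assumed here for illustration.
    """
    docs = [tokenize(p) for p in paragraphs]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: number of paragraphs that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scores = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always non-negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturated by k1 and normalised by paragraph length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((score, idx))
    # Sort by descending score; ties keep the original paragraph order.
    return sorted(scores, key=lambda s: (-s[0], s[1]))


if __name__ == "__main__":
    site_paragraphs = [
        "BM25 is a ranking function used by search engines.",
        "Word clouds summarise the topics of a site.",
        "Dense retrieval encodes text into vectors.",
    ]
    for score, idx in bm25_rank("bm25 ranking", site_paragraphs):
        print(round(score, 3), site_paragraphs[idx])
```

Because each paragraph is scored independently of the page it came from, this matches the paragraph-level retrieval unit described in the Motivation section. A dense retriever would replace the term-matching score with a similarity between learned vectors.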