
Semantic Funding Search (SeFuSe)

Semantic search for funding programs in the Federal Funding Database (Förderdatenbank des Bundes).

👉 Check it out on GitHub


Overview


SeFuSe is a tool for semantic search of funding programs in the Federal Funding Database.

The idea: users enter their project description into a web interface and automatically receive matching funding programs, including a short description and a direct link to the funding database.

A short demo is available here: ▶ YouTube Video


How It Works

The system is based on an embedding model and a vector database, which is regularly populated with new programs from the funding database.

Pipeline

  1. Retrieval of funding programs from the funding database (regularly updated).
  2. Extraction & preprocessing of short descriptions for semantic search.
  3. User input: A project description is entered via the web interface.
  4. Semantic search: The system identifies relevant funding programs.
  5. Output: Matching programs with links to the corresponding funding database entries.
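Steps 1 and 2 hinge on two small operations: cleaning the scraped descriptions and assigning each program a stable ID so repeated imports update entries instead of duplicating them. A hedged sketch; the project's actual cleaner.py and uuid_generator.py may differ in detail:

```python
import re
import uuid

def clean_description(raw_html: str) -> str:
    """Strip HTML tags and collapse whitespace (simplified stand-in for the real cleaner)."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return re.sub(r"\s+", " ", text).strip()

def program_uuid(url: str) -> str:
    """Deterministic ID derived from a program's URL: re-runs produce the same ID."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, url))

print(clean_description("<p>Funding for   solar research</p>"))  # Funding for solar research
```

Deriving the ID from the program URL (rather than a random UUID) is what makes the regular refresh idempotent.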

Motivation

Comparable projects rely on OpenAI's Custom GPTs, which means project ideas are regularly sent to commercial providers such as OpenAI. With SeFuSe, the entire setup runs locally: your data remains on your own server.


Docker Compose Installation & Configuration

SeFuSe is designed to run as a fully self-contained, local AI system using Docker Compose. It orchestrates four services:

Service     Role
Qdrant      Vector database for storing and searching embeddings
Ollama      Local LLM runtime for generating embeddings
FastAPI     Backend API for data processing, embedding, and search
Streamlit   Web UI for semantic search

All services communicate over Docker's internal network using service names (e.g. qdrant, ollama, fastapi).


Service Breakdown

Qdrant (Vector Database)

qdrant:
  image: qdrant/qdrant:latest
  ports:
    - "6333:6333"
  volumes:
    - ./data/qdrant:/qdrant/storage

Qdrant stores all embedding vectors and metadata.

  • Persistent storage: ./data/qdrant
  • Port 6333: Used by FastAPI for similarity search

This ensures that embeddings survive container restarts.
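On top of the persistent volume, a restart policy helps the database come back up after a host reboot. This is an optional addition, not part of the stock compose file:

```yaml
qdrant:
  image: qdrant/qdrant:latest
  restart: unless-stopped   # optional: recover automatically after reboots
  ports:
    - "6333:6333"
  volumes:
    - ./data/qdrant:/qdrant/storage
```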


Ollama (Local Embedding Model)

ollama:
  build: ./ollama
  ports:
    - "11434:11434"
  volumes:
    - ./ollama/data:/root/.ollama
  environment:
    - MODEL=nomic-embed-text

Ollama runs the embedding model locally. The model is downloaded and cached in ./ollama/data.

Environment variables

Variable   Purpose
MODEL      Name of the embedding model to load (e.g. nomic-embed-text)

This value must match the MODEL used by FastAPI and Streamlit.
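FastAPI requests vectors from Ollama over HTTP. A minimal client sketch using only the standard library, assuming Ollama's /api/embeddings endpoint (POST with a model name and a prompt, returning an embedding list); the default URLs mirror this compose setup:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # inside Docker: http://ollama:11434
MODEL = "nomic-embed-text"             # must match the MODEL env variable

def build_embed_request(text: str, model: str = MODEL) -> dict:
    """Request body for Ollama's /api/embeddings endpoint."""
    return {"model": model, "prompt": text}

def embed(text: str, base_url: str = OLLAMA_URL) -> list:
    """Fetch one embedding vector from a running Ollama instance."""
    req = urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=json.dumps(build_embed_request(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]
```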


FastAPI (Backend & Scheduler)

fastapi:
  build: ./fastapi
  ports:
    - "8000:8000"
  volumes:
    - ./data/funding_data:/app/data
    - ./data_processing/src:/app/data_processing
  depends_on:
    - qdrant
    - ollama
  environment:
    - MODEL=nomic-embed-text
    - TOKENIZER=nomic-ai/nomic-embed-text-v1.5
    - CRON_TRIGGER_DATA_PROCESSING=0
    - CRON_TRIGGER_EMBEDDING=4
    - DOWNLOAD_FILE=https://...
    - OLLAMA_URL=http://ollama:11434
    - VECTOR_DB_HOST=qdrant
    - QDRANT_PORT=6333

FastAPI is the brain of the system. It:

  • Downloads funding data
  • Processes and cleans it
  • Generates embeddings via Ollama
  • Stores and queries vectors in Qdrant
  • Exposes APIs for Streamlit

Environment variables

Variable                      Meaning
MODEL                         Embedding model name (must match Ollama + Streamlit)
TOKENIZER                     HuggingFace tokenizer used for chunking text
CRON_TRIGGER_DATA_PROCESSING  Hour (0–23) when funding data is refreshed
CRON_TRIGGER_EMBEDDING        Hour (0–23) when new embeddings are generated
DOWNLOAD_FILE                 URL of the funding dataset (Parquet ZIP)
OLLAMA_URL                    Internal Ollama API endpoint
VECTOR_DB_HOST                Qdrant hostname inside Docker
QDRANT_PORT                   Qdrant service port

Example:

CRON_TRIGGER_DATA_PROCESSING=0  → run at midnight
CRON_TRIGGER_EMBEDDING=4        → run at 04:00 AM
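Reading the trigger hours in the backend can be sketched with a small hypothetical helper (not the project's actual code) that validates the 0–23 range:

```python
import os

def cron_hour(var_name: str, default: int) -> int:
    """Read an hour-of-day trigger (0-23) from the environment, with validation."""
    hour = int(os.environ.get(var_name, str(default)))
    if not 0 <= hour <= 23:
        raise ValueError(f"{var_name} must be an hour between 0 and 23, got {hour}")
    return hour

# Mirroring the compose values above:
os.environ["CRON_TRIGGER_DATA_PROCESSING"] = "0"
os.environ["CRON_TRIGGER_EMBEDDING"] = "4"
print(cron_hour("CRON_TRIGGER_DATA_PROCESSING", 0))  # 0 -> midnight
print(cron_hour("CRON_TRIGGER_EMBEDDING", 4))        # 4 -> 04:00
```

Failing fast on an out-of-range hour surfaces configuration mistakes at startup rather than as silently skipped jobs.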

Streamlit (Web UI)

streamlit:
  build: ./streamlit
  ports:
    - "8501:8501"
  volumes:
    - ./data/funding_data:/app/data
  depends_on:
    - fastapi
  environment:
    - MODEL=nomic-embed-text
    - FASTAPI_URL=http://fastapi:8000

Streamlit provides the user interface where users enter project descriptions and view matching funding programs.

Environment variables

Variable     Purpose
MODEL        Must match FastAPI and Ollama
FASTAPI_URL  Internal URL of the FastAPI service
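The UI only needs to forward the user's project description to FastAPI and render the response. A minimal client-side sketch; the /search path and the request field names are assumptions for illustration, not the project's documented API:

```python
import json
import os
import urllib.request

FASTAPI_URL = os.environ.get("FASTAPI_URL", "http://localhost:8000")

def build_query(text: str, top_k: int = 5) -> dict:
    """Request body for the backend search endpoint (field names assumed)."""
    return {"query": text, "top_k": top_k}

def search_programs(text: str, top_k: int = 5, base_url: str = FASTAPI_URL):
    """POST a project description to the backend; '/search' is a placeholder path."""
    req = urllib.request.Request(
        f"{base_url}/search",
        data=json.dumps(build_query(text, top_k)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```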

How the System Works Together

  1. Ollama runs the embedding model
  2. FastAPI sends text chunks to Ollama and receives vectors
  3. FastAPI stores vectors in Qdrant
  4. Streamlit sends search queries to FastAPI
  5. FastAPI performs vector search in Qdrant
  6. Results are returned to Streamlit
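The six steps above can be simulated end to end with in-memory stand-ins for Ollama and Qdrant (real embeddings and the real vector store replace the stubs in production):

```python
from math import sqrt

def fake_embed(text):
    """Stand-in for Ollama: a crude letter-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

store = {}  # stand-in for a Qdrant collection: id -> (vector, payload)

def upsert(doc_id, text, payload):  # steps 1-3: embed the text, store the vector
    store[doc_id] = (fake_embed(text), payload)

def search(query, limit=1):  # steps 4-6: embed the query, rank, return payloads
    qv = fake_embed(query)
    ranked = sorted(store.values(), key=lambda vp: cosine(qv, vp[0]), reverse=True)
    return [payload for _, payload in ranked[:limit]]

upsert(1, "solar energy research funding", {"title": "Solar grant"})
upsert(2, "rural broadband expansion", {"title": "Broadband subsidy"})
print(search("solar research project"))  # best match: Solar grant
```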

All data and models persist on disk through Docker volumes.


Start the System

From the project root:

docker-compose up --build

Then open http://localhost:8501 in your browser for the Streamlit search UI; the FastAPI backend listens on http://localhost:8000.


Project Structure

This repository is organized with a clear separation between data storage, data processing, backend services, and user-facing applications.

Core Directories

  • data/ Central location for persisted data used across services.

  • funding_data/ – Raw and processed funding datasets.

  • qdrant/ – Persistent storage for the Qdrant vector database.

  • data_processing/ Contains the data ingestion and transformation pipeline responsible for preparing funding data.

  • src/ – Application source code following a clean src layout.

    • config/ – Centralized configuration handling.
    • processing/ – Core data transformation logic (cleaning, UUID generation, value extraction).
    • utils/ – Helper utilities for downloading and extracting data.
    • main.py – Entry point for running the data processing workflow.
    • requirements.txt – Python dependencies for the data processing service.
  • fastapi/ FastAPI-based backend service that exposes APIs and interacts with Qdrant and processed data.

  • src/ – Backend application code.

    • main.py – API entry point.
    • utils/ – FastAPI and Qdrant helper utilities.
    • requirements.txt – Backend dependencies.
    • Dockerfile – Container definition for the API service.
  • streamlit/ Streamlit-powered frontend application for exploring and visualizing funding data.

  • src/ – Streamlit application code.

    • app.py – Main dashboard entry point.
    • utils/ – UI and data access helpers.
    • requirements.txt – Frontend dependencies.
    • Dockerfile – Container definition for the Streamlit app.
  • ollama/ Contains Docker configuration and initialization scripts for local model serving.

  • init_models.sh – Script for downloading and initializing models.

  • data/ – Persistent model data.

  • docs/ Project documentation built with MkDocs, structured to mirror the codebase.

./
├── .dockerignore
├── .gitignore
├── .gitlab-ci.yml
├── LICENSE
├── README.md
├── THIRD_PARTY_LICENSES.txt
├── data
│   ├── funding_data
│   │   └── .gitkeep
│   └── qdrant
│       └── .gitkeep
├── data_processing
│   ├── data
│   │   └── .gitkeep
│   ├── requirements.txt
│   └── src
│       ├── config
│       │   ├── __init__.py
│       │   └── config.py
│       ├── main.py
│       ├── processing
│       │   ├── __init__.py
│       │   ├── cleaner.py
│       │   ├── uuid_generator.py
│       │   └── value_extractor.py
│       └── utils
│           ├── __init__.py
│           ├── downloader.py
│           └── extractor.py
├── docker-compose.yml
├── docs
│   ├── data_processing
│   │   ├── config
│   │   │   └── config.md
│   │   ├── main.md
│   │   ├── processing
│   │   │   ├── cleaner.md
│   │   │   ├── uuid_generator.md
│   │   │   └── value_extractor.md
│   │   └── utils
│   │       ├── downloading.md
│   │       └── extractor.md
│   ├── fastapi
│   │   ├── main.md
│   │   └── utils
│   │       ├── fastapi_utils.md
│   │       └── qdrant_utils.md
│   └── streamlit
│       ├── app.md
│       └── utils
│           └── utils.md
├── fastapi
│   ├── Dockerfile
│   ├── data
│   │   └── .gitkeep
│   ├── requirements.txt
│   └── src
│       ├── main.py
│       └── utils
│           ├── __init__.py
│           ├── fastapi_utils.py
│           └── qdrant_utils.py
├── mkdocs.yml
├── ollama
│   ├── Dockerfile
│   ├── data
│   │   └── .gitkeep
│   └── init_models.sh
└── streamlit
    ├── Dockerfile
    ├── data
    │   └── .gitkeep
    ├── requirements.txt
    └── src
        ├── app.py
        └── utils
            ├── __init__.py
            └── utils.py

Acknowledgements

This project builds upon the data collected by jstet and pr130, creators of the
Funding Scraper project.

Their work on scraping and providing structured funding data forms the foundation of this project.