# Semantic Funding Search (SeFuSe)

Semantic search for funding programs in the Federal Funding Database (Förderdatenbank des Bundes).

Check out on GitHub
## Overview
SeFuSe is a tool for semantic search of funding programs in the Federal Funding Database.
The idea: users enter their project description into a web interface and automatically receive matching funding programs, including a short description and a direct link to the funding database.
A short demo is available here: ▶ YouTube Video
## How It Works
The system is based on an embedding model and a vector database, which is regularly populated with new programs from the funding database.
### Pipeline
- Retrieval of funding programs from the funding database (regularly updated).
- Extraction & preprocessing of short descriptions for semantic search.
- User input: A project description is entered via the web interface.
- Semantic search: The system identifies relevant funding programs.
- Output: Matching programs with links to the corresponding funding database entries.
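The search step of this pipeline can be illustrated with a toy in-memory sketch. It substitutes a naive bag-of-words embedding and cosine similarity for the real embedding model and vector database; the function names and sample programs below are illustrative only:

```python
import math
from collections import Counter

def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy bag-of-words embedding; the real system uses an embedding model via Ollama."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query: str, programs: list[str], vocab: list[str], top_k: int = 1) -> list[str]:
    """Rank program descriptions by similarity to the query, as the vector DB would."""
    query_vec = embed(query, vocab)
    ranked = sorted(programs, key=lambda p: cosine(query_vec, embed(p, vocab)), reverse=True)
    return ranked[:top_k]

programs = [
    "grant for solar energy research projects",
    "funding for digital education startups",
]
vocab = sorted({word for p in programs for word in p.split()})
```

A query such as `search("solar research project", programs, vocab)` ranks the solar energy program first. The production pipeline replaces the toy embedding with a learned model, so matches no longer require literal word overlap.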
## Motivation

Comparable projects rely on OpenAI's Custom GPTs, which means project ideas are regularly sent to commercial providers such as OpenAI. With SeFuSe, you can run the entire setup locally: your data remains on your own server.
## Docker Compose Installation & Configuration
SeFuSe is designed to run as a fully self-contained, local AI system using Docker Compose. It orchestrates four services:
| Service | Role |
|---|---|
| Qdrant | Vector database for storing and searching embeddings |
| Ollama | Local LLM runtime for generating embeddings |
| FastAPI | Backend API for data processing, embedding, and search |
| Streamlit | Web UI for semantic search |
All services communicate over Docker's internal network using service names (e.g. `qdrant`, `ollama`, `fastapi`).
### Service Breakdown

#### Qdrant (Vector Database)

```yaml
qdrant:
  image: qdrant/qdrant:latest
  ports:
    - "6333:6333"
  volumes:
    - ./data/qdrant:/qdrant/storage
```
Qdrant stores all embedding vectors and metadata.

- Persistent storage: `./data/qdrant`
- Port 6333: used by FastAPI for similarity search

This ensures that embeddings survive container restarts.
#### Ollama (Local Embedding Model)

```yaml
ollama:
  build: ./ollama
  ports:
    - "11434:11434"
  volumes:
    - ./ollama/data:/root/.ollama
  environment:
    - MODEL=nomic-embed-text
```
Ollama runs the embedding model locally.
The model is downloaded and cached in ./ollama/data.
**Environment variables**

| Variable | Purpose |
|---|---|
| `MODEL` | Name of the embedding model to load (e.g. `nomic-embed-text`) |

This value must match the `MODEL` used by FastAPI and Streamlit.
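As a reference for how a service can talk to Ollama over the internal Docker network, here is a minimal sketch against Ollama's `/api/embeddings` endpoint. The `build_request` and `embed` helpers are illustrative, not the project's actual client code:

```python
import json
import urllib.request

OLLAMA_URL = "http://ollama:11434"  # service name on the Docker network
MODEL = "nomic-embed-text"          # must match the MODEL env var of FastAPI and Streamlit

def build_request(text: str) -> dict:
    """Request body for Ollama's /api/embeddings endpoint."""
    return {"model": MODEL, "prompt": text}

def embed(text: str) -> list[float]:
    """Send the text to Ollama and return the embedding vector."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=json.dumps(build_request(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]
```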
#### FastAPI (Backend & Scheduler)

```yaml
fastapi:
  build: ./fastapi
  ports:
    - "8000:8000"
  volumes:
    - ./data/funding_data:/app/data
    - ./data_processing/src:/app/data_processing
  depends_on:
    - qdrant
    - ollama
  environment:
    - MODEL=nomic-embed-text
    - TOKENIZER=nomic-ai/nomic-embed-text-v1.5
    - CRON_TRIGGER_DATA_PROCESSING=0
    - CRON_TRIGGER_EMBEDDING=4
    - DOWNLOAD_FILE=https://...
    - OLLAMA_URL=http://ollama:11434
    - VECTOR_DB_HOST=qdrant
    - QDRANT_PORT=6333
```
FastAPI is the brain of the system. It:
- Downloads funding data
- Processes and cleans it
- Generates embeddings via Ollama
- Stores and queries vectors in Qdrant
- Exposes APIs for Streamlit
**Environment variables**

| Variable | Meaning |
|---|---|
| `MODEL` | Embedding model name (must match Ollama + Streamlit) |
| `TOKENIZER` | HuggingFace tokenizer used for chunking text |
| `CRON_TRIGGER_DATA_PROCESSING` | Hour (0–23) when funding data is refreshed |
| `CRON_TRIGGER_EMBEDDING` | Hour (0–23) when new embeddings are generated |
| `DOWNLOAD_FILE` | URL of the funding dataset (Parquet ZIP) |
| `OLLAMA_URL` | Internal Ollama API endpoint |
| `VECTOR_DB_HOST` | Qdrant hostname inside Docker |
| `QDRANT_PORT` | Qdrant service port |
Example:

- `CRON_TRIGGER_DATA_PROCESSING=0` → run at midnight
- `CRON_TRIGGER_EMBEDDING=4` → run at 04:00
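A sketch of how such hour-of-day triggers might be read and validated inside the backend. The `cron_hour` helper and the scheduler calls in the comments are hypothetical; the actual scheduler code may differ:

```python
import os

def cron_hour(var: str, default: int) -> int:
    """Read an hour-of-day trigger (0-23) from the environment, falling back to a default."""
    value = int(os.environ.get(var, default))
    if not 0 <= value <= 23:
        raise ValueError(f"{var} must be between 0 and 23, got {value}")
    return value

# Example usage with a job scheduler such as APScheduler (illustrative):
# scheduler.add_job(refresh_data, "cron", hour=cron_hour("CRON_TRIGGER_DATA_PROCESSING", 0))
# scheduler.add_job(embed_new_docs, "cron", hour=cron_hour("CRON_TRIGGER_EMBEDDING", 4))
```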
#### Streamlit (Web UI)

```yaml
streamlit:
  build: ./streamlit
  ports:
    - "8501:8501"
  volumes:
    - ./data/funding_data:/app/data
  depends_on:
    - fastapi
  environment:
    - MODEL=nomic-embed-text
    - FASTAPI_URL=http://fastapi:8000
```
Streamlit provides the user interface where users enter project descriptions and view matching funding programs.
**Environment variables**

| Variable | Purpose |
|---|---|
| `MODEL` | Must match FastAPI and Ollama |
| `FASTAPI_URL` | Internal URL of the FastAPI service |
## How the System Works Together
- Ollama runs the embedding model
- FastAPI sends text chunks to Ollama and receives vectors
- FastAPI stores vectors in Qdrant
- Streamlit sends search queries to FastAPI
- FastAPI performs vector search in Qdrant
- Results are returned to Streamlit
All data and models persist on disk through Docker volumes.
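The FastAPI-to-Qdrant hop can be sketched against Qdrant's REST search endpoint (`POST /collections/{name}/points/search`). The collection name `funding_programs` is a placeholder, not necessarily what the project uses:

```python
import json
import urllib.request

QDRANT_URL = "http://qdrant:6333"  # VECTOR_DB_HOST and QDRANT_PORT on the Docker network
COLLECTION = "funding_programs"    # placeholder collection name

def search_body(vector: list[float], limit: int = 5) -> dict:
    """Request body for Qdrant's points/search endpoint."""
    return {"vector": vector, "limit": limit, "with_payload": True}

def search(vector: list[float], limit: int = 5) -> list[dict]:
    """Return the top-scoring points; program metadata lives in each point's payload."""
    req = urllib.request.Request(
        f"{QDRANT_URL}/collections/{COLLECTION}/points/search",
        data=json.dumps(search_body(vector, limit)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]
```

Setting `with_payload` to true makes Qdrant return each hit's stored metadata (title, description, link) alongside its score, which is what Streamlit ultimately renders.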
## Start the System

From the project root:

```shell
docker-compose up --build
```
Then open:
- Streamlit UI: http://localhost:8501
- FastAPI Docs: http://localhost:8000/docs
- Qdrant UI: http://localhost:6333/dashboard
## Project Structure

This repository is organized with a clear separation between data storage, data processing, backend services, and user-facing applications.

### Core Directories
- `data/` – Central location for persisted data used across services.
  - `funding_data/` – Raw and processed funding datasets.
  - `qdrant/` – Persistent storage for the Qdrant vector database.
- `data_processing/` – Data ingestion and transformation pipeline responsible for preparing funding data.
  - `src/` – Application source code following a clean `src` layout.
    - `config/` – Centralized configuration handling.
    - `processing/` – Core data transformation logic (cleaning, UUID generation, value extraction).
    - `utils/` – Helper utilities for downloading and extracting data.
    - `main.py` – Entry point for running the data processing workflow.
  - `requirements.txt` – Python dependencies for the data processing service.
- `fastapi/` – FastAPI-based backend service that exposes APIs and interacts with Qdrant and processed data.
  - `src/` – Backend application code.
    - `main.py` – API entry point.
    - `utils/` – FastAPI and Qdrant helper utilities.
  - `requirements.txt` – Backend dependencies.
  - `Dockerfile` – Container definition for the API service.
- `streamlit/` – Streamlit-powered frontend application for exploring and visualizing funding data.
  - `src/` – Streamlit application code.
    - `app.py` – Main dashboard entry point.
    - `utils/` – UI and data access helpers.
  - `requirements.txt` – Frontend dependencies.
  - `Dockerfile` – Container definition for the Streamlit app.
- `ollama/` – Docker configuration and initialization scripts for local model serving.
  - `init_models.sh` – Script for downloading and initializing models.
  - `data/` – Persistent model data.
- `docs/` – Project documentation built with MkDocs, structured to mirror the codebase.
```
./
├── .dockerignore
├── .gitignore
├── .gitlab-ci.yml
├── LICENSE
├── README.md
├── THIRD_PARTY_LICENSES.txt
├── data
│   ├── funding_data
│   │   └── .gitkeep
│   └── qdrant
│       └── .gitkeep
├── data_processing
│   ├── data
│   │   └── .gitkeep
│   ├── requirements.txt
│   └── src
│       ├── config
│       │   ├── __init__.py
│       │   └── config.py
│       ├── main.py
│       ├── processing
│       │   ├── __init__.py
│       │   ├── cleaner.py
│       │   ├── uuid_generator.py
│       │   └── value_extractor.py
│       └── utils
│           ├── __init__.py
│           ├── downloader.py
│           └── extractor.py
├── docker-compose.yml
├── docs
│   ├── data_processing
│   │   ├── config
│   │   │   └── config.md
│   │   ├── main.md
│   │   ├── processing
│   │   │   ├── cleaner.md
│   │   │   ├── uuid_generator.md
│   │   │   └── value_extractor.md
│   │   └── utils
│   │       ├── downloading.md
│   │       └── extractor.md
│   ├── fastapi
│   │   ├── main.md
│   │   └── utils
│   │       ├── fastapi_utils.md
│   │       └── qdrant_utils.md
│   └── streamlit
│       ├── app.md
│       └── utils
│           └── utils.md
├── fastapi
│   ├── Dockerfile
│   ├── data
│   │   └── .gitkeep
│   ├── requirements.txt
│   └── src
│       ├── main.py
│       └── utils
│           ├── __init__.py
│           ├── fastapi_utils.py
│           └── qdrant_utils.py
├── mkdocs.yml
├── ollama
│   ├── Dockerfile
│   ├── data
│   │   └── .gitkeep
│   └── init_models.sh
└── streamlit
    ├── Dockerfile
    ├── data
    │   └── .gitkeep
    ├── requirements.txt
    └── src
        ├── app.py
        └── utils
            ├── __init__.py
            └── utils.py
```
## Acknowledgements
This project builds upon the data collected by jstet and pr130, creators of the
Funding Scraper project.
Their work on scraping and providing structured funding data forms the foundation of this project.