Processing documents—especially large ones—via an API like Gemini’s can get expensive fast. Every call involves uploading files, generating responses and burning through tokens. If you’re dealing with the same documents repeatedly (say, a user asking multiple questions about a PDF), reprocessing them from scratch each time is wasteful. So I set out to create a small service that’s smart about caching: it remembers what it’s seen, reuses it when possible and scales gracefully.

I wanted a caching system that was dynamic—capable of handling both ephemeral in-memory needs and persistent storage across sessions. It had to be smart enough to recognize when a document’s already been processed, flexible enough to work with URLs and robust enough to recover from failures. This led me to a hybrid approach: an in-memory cache for speed, paired with Elasticsearch for persistence and metadata management.
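To make the two-tier idea concrete, here is a minimal sketch of the lookup logic. A plain dict stands in for Elasticsearch, and the class and method names are my own illustration, not the service's actual code:

```python
import hashlib


class HybridCache:
    """Two-tier cache: a fast in-memory dict backed by a persistent store.

    The persistent store here is a plain dict standing in for Elasticsearch;
    in the real service it would be an Elasticsearch index.
    """

    def __init__(self, persistent_store=None):
        self._memory = {}                          # hot tier: process-local, fastest
        self._persistent = persistent_store if persistent_store is not None else {}

    @staticmethod
    def key_for(document_bytes: bytes) -> str:
        # Content hash, so the same document always maps to the same cache entry.
        return hashlib.sha256(document_bytes).hexdigest()

    def get(self, key: str):
        if key in self._memory:                    # 1) in-memory hit
            return self._memory[key]
        if key in self._persistent:                # 2) persistent hit: warm memory tier
            value = self._persistent[key]
            self._memory[key] = value
            return value
        return None                                # 3) miss: caller must reprocess

    def put(self, key: str, value) -> None:
        # Write through both tiers so the entry survives a restart.
        self._memory[key] = value
        self._persistent[key] = value
```

Hashing the document content (rather than, say, its filename) is what lets the service recognize a document it has already processed, even when it arrives under a different name or URL.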

To use this kind of infrastructure, you will need two things: a Gemini API key and a running Elasticsearch instance.

You can get an API key by following Google's guide:

Get a Gemini API key  |  Google AI for Developers

To run Elasticsearch in your local environment for development purposes, follow:

Install Elasticsearch with Docker | Elasticsearch Guide [8.17] | Elastic
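For reference, a minimal single-node setup based on that guide might look like the following. The container name and image tag are illustrative, and security is disabled only for convenience on localhost, never in production:

```shell
# Start a single-node Elasticsearch 8.17 container for local development.
docker run -d --name es-dev -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.17.0

# Verify it is up:
curl http://localhost:9200
```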

With these two prerequisites in place, we can move on to the implementation, but I would like to mention one more important point first. When you store caches on Gemini's own infrastructure and have a billing account on GCP, you will be charged the following rates once you exceed the free-tier limits:

| | Free Tier | Paid Tier, per 1M tokens in USD |
| --- | --- | --- |
| Input price | Free of charge | $0.10 (text / image / video), $0.70 (audio) |
| Output price | Free of charge | $0.40 |
| Context caching price | Free of charge | $0.025 / 1,000,000 tokens (text / image / video), $0.175 / 1,000,000 tokens (audio); available April 15, 2025 |
| Context caching (storage) | Free of charge, up to 1,000,000 tokens of storage per hour (available April 15, 2025) | $1.00 / 1,000,000 tokens per hour (available April 15, 2025) |
| Tuning price | Not available | Not available |
| Grounding with Google Search | Free of charge, up to 500 RPD | 1,500 RPD (free), then $35 / 1,000 requests |
| Used to improve our products | Yes | No |
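As a quick sanity check on what paid-tier caching would cost, the storage line above reduces to a simple formula. This helper is just my illustration of the table's listed rate, not part of the service:

```python
def context_cache_storage_cost(tokens: int, hours: float,
                               usd_per_million_token_hour: float = 1.00) -> float:
    """Estimate paid-tier context-cache storage cost.

    Defaults to the table's listed rate of $1.00 per 1M tokens per hour;
    the free tier covers up to 1M tokens of storage per hour at no charge.
    """
    return tokens / 1_000_000 * hours * usd_per_million_token_hour


# For example, keeping a 500k-token document cached for 24 hours:
# context_cache_storage_cost(500_000, 24) == 12.0 (USD)
```

Numbers like this are why the hybrid approach matters: keeping your own persistent layer in Elasticsearch lets you decide how long a remote cache is worth paying for.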

Steps to Implement

Now we can talk about implementation! The code lives in three main files: main.py (the core logic), config.py (settings) and prompts.py (question formatting). In this post, I’ll break it down module by module, explain the implementation, and share the rationale behind my choices, especially why we use a hybrid approach with Elasticsearch.

Step 1: Setting the Stage with config.py