Learn how to add fast, meaning-aware caching to your local LLM workflows using pgvector, embeddings, and a clean proxy wrapper pattern. A practical, hands-on tutorial for Java developers.
The dynamic proxy approach is particularly elegant here. Using InvocationHandler to intercept AI service calls without modifying the interface keeps everything clean and maintainable. One thing I'd be curious about is how the system handles concurrent requests with the same prompt: do you see any benefit from a short-lived in-memory cache layer before hitting pgvector?
I mean, sure. Memory is just a lot faster.
Regarding concurrent requests, I haven't had a lot of time to load test it. Would indeed be interesting.
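A minimal sketch of what such a short-lived front cache could look like, layered on the same InvocationHandler pattern. The interface and cache names here are illustrative, not the article's actual classes, and the pgvector-backed semantic cache is reduced to an interface:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical AI service interface, standing in for the article's proxy target.
interface AiAssistant {
    String chat(String prompt);
}

// Hypothetical semantic cache backed by pgvector (embedding + similarity lookup).
interface SemanticCache {
    String lookup(String prompt);           // null on cache miss
    void store(String prompt, String answer);
}

// Dynamic proxy that checks a short-lived exact-match in-memory cache first,
// then the pgvector-backed semantic cache, then the real model.
class CachingInvocationHandler implements InvocationHandler {

    private record Entry(String value, Instant expiresAt) {}

    private final AiAssistant delegate;
    private final SemanticCache semanticCache;
    private final Duration ttl = Duration.ofSeconds(30);
    private final Map<String, Entry> memoryCache = new ConcurrentHashMap<>();

    CachingInvocationHandler(AiAssistant delegate, SemanticCache semanticCache) {
        this.delegate = delegate;
        this.semanticCache = semanticCache;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        if (!"chat".equals(method.getName())) {
            return method.invoke(delegate, args);
        }
        String prompt = (String) args[0];

        // 1. Short-lived exact-match cache: absorbs bursts of identical prompts.
        Entry cached = memoryCache.get(prompt);
        if (cached != null && Instant.now().isBefore(cached.expiresAt())) {
            return cached.value();
        }

        // 2. Semantic cache (pgvector similarity search).
        String semanticHit = semanticCache.lookup(prompt);
        if (semanticHit != null) {
            memoryCache.put(prompt, new Entry(semanticHit, Instant.now().plus(ttl)));
            return semanticHit;
        }

        // 3. Real model call; populate both layers afterwards.
        String answer = (String) method.invoke(delegate, args);
        semanticCache.store(prompt, answer);
        memoryCache.put(prompt, new Entry(answer, Instant.now().plus(ttl)));
        return answer;
    }

    static AiAssistant wrap(AiAssistant delegate, SemanticCache semanticCache) {
        return (AiAssistant) Proxy.newProxyInstance(
                AiAssistant.class.getClassLoader(),
                new Class<?>[] { AiAssistant.class },
                new CachingInvocationHandler(delegate, semanticCache));
    }
}
```

Note that two identical prompts missing at the same instant can still both reach the model; if that matters under load, the in-memory layer could hold a CompletableFuture per prompt (via computeIfAbsent) so concurrent callers share a single in-flight call.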
How do we externalize the system message? That would make it configurable without code changes or redeployment. I tried to intercept the requests to RegisterAiService, similar to what is done here, but unfortunately LangChain4j does not allow modifying the chat message request. Is a system message provider a viable option? I'd appreciate any insight you could share.
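For reference, a minimal sketch of what the system-message-provider route could look like with LangChain4j's plain AiServices builder (as opposed to the Quarkus @RegisterAiService path). The interface, environment variable, and file path are made up for illustration, and builder method names such as chatLanguageModel and systemMessageProvider have shifted across LangChain4j versions, so treat this as a direction rather than a drop-in:

```java
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.service.AiServices;

import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical AI service interface; note there is no hard-coded @SystemMessage.
interface Assistant {
    String chat(String userMessage);
}

class ExternalSystemMessage {

    // Load the system prompt from a file whose location comes from an environment
    // variable, so it can change without touching code or redeploying.
    static String load() {
        String path = System.getenv()
                .getOrDefault("SYSTEM_PROMPT_FILE", "config/system-prompt.txt");
        try {
            return Files.readString(Path.of(path));
        } catch (Exception e) {
            return "You are a helpful assistant."; // fallback default
        }
    }

    static Assistant build(ChatLanguageModel model) {
        return AiServices.builder(Assistant.class)
                .chatLanguageModel(model)
                // The provider is consulted when the request is assembled,
                // keyed by the chat memory id, so the text stays external.
                .systemMessageProvider(memoryId -> load())
                .build();
    }
}
```

Because the system message is resolved through a function rather than an annotation, the prompt text lives outside the compiled code and can be swapped by pointing the environment variable at a different file.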