At Gentrace, we help companies test and evaluate their AI systems at scale.
As our platform grew more sophisticated, we found ourselves managing an increasingly complex microservices architecture, ranging from our core API and web application to specialized components like WebSocket servers and task runners. With enterprise customers requiring self-hosted deployments and strict security controls, we needed a robust way to secure service-to-service communication that would work reliably in their own infrastructure.
The Initial Challenge
Our Kubernetes-based architecture included several components that needed to be secured when deployed to customer environments:
- Frontend application
- API server
- WebSocket server for real-time updates
- Task runners for evaluation jobs
- Task scheduler for evaluation jobs
- Multiple databases (PostgreSQL, ClickHouse)
- Kafka for event streaming
Initially, we relied on Kubernetes' built-in networking and basic TLS termination at the ingress level. But as we onboarded more enterprise customers, we faced increasing demands for:
- End-to-end encryption between services (mTLS)
- Fine-grained access control
- Certificate management
- Network policy enforcement
- Service-to-service traffic visibility
Enter Istio
After evaluating various options, we chose Istio as our service mesh solution because it provides robust security features while remaining transparent to our application code.
For example, Istio automatically encrypted traffic between services without any code changes. Our applications simply connected to hostnames like kafka:9092 or postgres:5432, while Istio handled the security behind the scenes.
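The automatic encryption described above relies on Istio's mutual TLS between sidecars. By default Istio accepts both plaintext and mTLS traffic (permissive mode), so a deployment that must reject unencrypted connections would typically add a PeerAuthentication policy. The following is a minimal sketch, not taken from our chart; the namespace name gentrace is a hypothetical placeholder.

```yaml
# Sketch only: enforce mTLS for every workload in one namespace.
# "gentrace" is a hypothetical namespace name; substitute your own.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: gentrace
spec:
  mtls:
    # STRICT rejects plaintext; PERMISSIVE accepts both during migration.
    mode: STRICT
```

Starting in permissive mode and switching to STRICT once every pod has a sidecar is a common way to avoid breaking traffic mid-rollout.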
The implementation was surprisingly straightforward. Here's what it involved:
- Installing Istio and its Custom Resource Definitions (CRDs) in the Kubernetes cluster
- Labeling the namespace for automatic sidecar injection
- Adding conditional annotations to our services
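For the namespace step above, Istio's injection webhook looks for the istio-injection label on the namespace. A minimal sketch of such a manifest follows, assuming a hypothetical namespace called gentrace:

```yaml
# Sketch only: a namespace opted in to automatic sidecar injection.
# The name "gentrace" is a hypothetical placeholder.
apiVersion: v1
kind: Namespace
metadata:
  name: gentrace
  labels:
    # Pods created in this namespace get an Istio sidecar by default.
    istio-injection: enabled
```

Pods that already exist are not modified; they pick up the sidecar the next time they are recreated.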
Here's how we integrated Istio into our Helm chart:
# values.yaml
istio:
  # -- Enable Istio integration
  enabled: true
  # -- Istio injection label value
  injection: "true"
  # -- Additional Istio annotations
  annotations: {}
For each deployment, we added a conditional block:
    metadata:
      labels:
        app: {{ .Values.app.name }}
      {{- if .Values.istio.enabled }}
      annotations:
        sidecar.istio.io/inject: {{ .Values.istio.injection | quote }}
        {{- with .Values.istio.annotations }}
        {{- toYaml . | nindent 8 }}
        {{- end }}
      {{- end }}
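To make the effect of the conditional concrete, here is roughly what the block renders to. This assumes the default values shown earlier (istio.enabled: true, injection: "true", empty annotations) and a hypothetical app.name of api:

```yaml
    # Rendered with istio.enabled=true; "api" is a hypothetical app name.
    metadata:
      labels:
        app: api
      annotations:
        sidecar.istio.io/inject: "true"
```

With istio.enabled set to false, the annotations key is omitted entirely and only the labels remain.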
Making It Optional
A key requirement was making Istio optional for customers who might want to use their own service mesh or none at all. Our Helm chart makes this straightforward:
- Istio can be enabled/disabled via a single value
- All Istio-specific configurations are conditionally applied
- The application works identically whether Istio is enabled or not
For example, here's how we handle the API deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  labels:
    app: api
spec:
  template:
    metadata:
      labels:
        app: api
      {{- if .Values.istio.enabled }}
      annotations:
        sidecar.istio.io/inject: {{ .Values.istio.injection | quote }}
      {{- end }}
The Benefits
While zero-trust security with mTLS encryption was our primary goal, Istio provided additional benefits that delighted both us and our customers:
- Simplified Certificate Management: Istio handles certificate generation, rotation, and distribution automatically.
- Fine-grained Access Control: We can define precise rules about which services can communicate with each other.
- Observability: The Istio sidecar proxies automatically collect detailed metrics about service-to-service communication, which feed into Prometheus for monitoring. Prometheus is available as an Istio addon.
- Service Topology Visualization: Istio integrates with Kiali, a visualization tool that teams can use to monitor and audit service interactions in real time. With the Kiali addon installed, running istioctl dashboard kiali brings up an interactive view of your service mesh.
- Easy Customer Adoption: Customers can enable Istio with a single configuration value:
istio:
  enabled: true
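As an illustration of the fine-grained access control mentioned in the list above, an Istio AuthorizationPolicy can restrict which workloads may reach a given service. The sketch below is not from our chart. It assumes a hypothetical gentrace namespace, a PostgreSQL pod labeled app: postgres, and an API pod running under a service account named api.

```yaml
# Sketch only: allow just the API's service account to reach PostgreSQL.
# Namespace, labels, and service account name are hypothetical.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: postgres-allow-api
  namespace: gentrace
spec:
  selector:
    matchLabels:
      app: postgres
  action: ALLOW
  rules:
    - from:
        - source:
            # Identity comes from the caller's mTLS certificate.
            principals:
              - cluster.local/ns/gentrace/sa/api
      to:
        - operation:
            ports: ["5432"]
```

Once an ALLOW policy selects a workload, requests that match no rule are denied, so every legitimate caller needs to be listed. Matching on principals also requires mTLS between the two workloads.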
Implementation Details
Our implementation covers all components of our stack:
- Core Services: API, Web App, and WebSocket server
- Data Services: PostgreSQL, ClickHouse, and Kafka
- Background Workers: Task runners and schedulers
- Object Storage: Built-in MinIO for S3-compatible storage, with support for customer-provided S3 credentials. This handles file ingestion needs for AI evaluations, including PDFs and other documents.
Each component gets Istio sidecar injection when enabled:
  template:
    metadata:
      labels:
        app: {{ .Values.taskrunner.name }}
      {{- if .Values.istio.enabled }}
      annotations:
        sidecar.istio.io/inject: {{ .Values.istio.injection | quote }}
      {{- end }}
Lessons Learned
- Keep It Optional: Not every customer needs or wants a service mesh, especially since sidecars consume extra CPU and memory. Making it optional via configuration was crucial.
- Consistent Implementation: Apply the same pattern across all services to maintain predictability.
- Documentation Matters: Clear documentation about prerequisites and setup steps helps customers succeed.
- Resource Planning: Account for the additional resources needed by Istio sidecars.
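One way to handle the resource planning point is to size the sidecar explicitly through Istio's per-pod proxy annotations. This could be passed through the istio.annotations value shown earlier, for templates that include that pass-through block. The sketch below is an assumption about how a customer might override values; the numbers are illustrative, not recommendations.

```yaml
# values override (sketch only; figures are illustrative)
istio:
  enabled: true
  injection: "true"
  annotations:
    # Requests reserved for the sidecar container in each pod.
    sidecar.istio.io/proxyCPU: "100m"
    sidecar.istio.io/proxyMemory: "128Mi"
    # Upper bounds for the sidecar container.
    sidecar.istio.io/proxyCPULimit: "500m"
    sidecar.istio.io/proxyMemoryLimit: "256Mi"
```

Multiplying the requests by the number of pods in the release gives a rough lower bound on the extra cluster capacity the mesh needs.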
Looking Forward
The modular nature of our Helm chart means we can easily extend our service mesh capabilities. Future enhancements might include:
- Custom Istio configurations for specific deployments (e.g., traffic policies, network rules, and routing behaviors to meet IT department requirements)
- Integration with external certificate authorities and X.509 certificate management systems like HashiCorp Vault for automated certificate signing and distribution
Info: Want to try this yourself? Check out our self-hosted repository with complete Helm charts and documentation. You'll need to contact [email protected] for access tokens to our Quay registry.
Join Us
We're always looking for engineers who enjoy solving complex systems challenges. If you:
- Build secure, scalable architectures
- Care about AI safety and reliability
Check out our open positions at gentrace.ai/eng
The journey to securing microservices communication doesn't have to be complex. With tools like Istio and thoughtful implementation, you can achieve robust security while maintaining simplicity for your users.