
Staff Software Development Engineer (Data Engineer)
Job Description
You are a retrieval-oriented engineer with deep expertise in high-dimensional data, relational structures, and large-scale knowledge representation. You thrive on the challenge of bridging the gap between raw data and semantic understanding, building the backbone for next-generation AI and discovery systems. You are passionate about data topology, latent space optimization, and the performance tuning of complex query engines. You constantly strive to reduce "time-to-insight" and maximize the precision of information retrieval at scale.
The pace of our growth is incredible. If you want to tackle the foundational challenges of RAG (Retrieval-Augmented Generation), knowledge graphs, and semantic search at a global scale, join us!
Key Responsibilities:
- Lead the design and development of hybrid retrieval architectures combining vector similarity search with structured graph traversals.
- Architect scalable data pipelines for the ingestion, embedding, and indexing of massive, multi-modal datasets.
- Innovate and prototype advanced retrieval techniques, including multi-stage re-ranking, graph-tooling for LLMs, and dynamic metadata filtering.
- Design and implement schemas for complex knowledge graphs, ensuring high-performance relationship mapping and ontological integrity.
- Build automated data validation and drift detection systems to monitor the quality of embeddings and the health of the vector space.
- Drive technical implementation of "Memory" systems for AI agents, focusing on long-term persistence, observability, and sub-second latency.
- Champion data organization standards, ensuring that disparate data sources are unified into a coherent, searchable knowledge base.
- Collaborate with AI Research and Product teams to evaluate emerging database technologies (e.g., HNSW optimizations, GraphRAG) and integrate them into production.
Skills and Attributes for Success:
- 7+ years of experience in data engineering or backend systems with a focus on high-performance data retrieval and storage.
- BE/B.Tech in Computer Science, Mathematics, or equivalent. MS or PhD in a related field is a plus.
- Expert proficiency in Python, Java, or Go, with a strong grasp of distributed system design patterns.
- Deep understanding of Vector Databases, including indexing strategies (HNSW, IVFFlat, PQ) and distance metrics (Cosine, Euclidean, Dot Product). Experience with Pinecone, Milvus, Weaviate, or Qdrant.
- Strong background in Graph Databases (Neo4j, AWS Neptune, or ArangoDB) and query languages like Cypher or Gremlin.
- Experience with Data Modeling and organization, specifically in building semantic layers, ontologies, and taxonomies.
- Hands-on experience with LLM orchestration frameworks (LangChain, LlamaIndex) and embedding models (OpenAI, HuggingFace, Cohere).
- Proficiency in large-scale data processing using Spark, Flink, or Kafka for real-time indexing and ETL.
- Understanding of Information Retrieval (IR) fundamentals, including BM25, TF-IDF, and reciprocal rank fusion.
- Experience with cloud-native infrastructure (AWS/GCP/Azure) and container orchestration (Kubernetes).
Preferred Education and Experience:
- Bachelor's/Master's degree in Computer Science or a related field, with 7-9 years of professional experience.