Job Title: DevOps Specialist & Data Engineer
Location: Remote
Type: Full-time
Experience Level: Senior
Industry: Generative AI / Artificial Intelligence / Machine Learning
Reports To: Head of Engineering / CTO
About Us
Ready to join a cutting-edge AI company? We’re on a mission to become the OpenAI of the spicy content industry, building a full-spectrum ecosystem of revolutionary AI infrastructure and products. Our platform, OhChat, features digital twins of real-world personalities and original AI characters, enabling users to interact with lifelike AI-generated characters through text, voice, and images, with a roadmap that includes agentic superModels, API integrations, and video capabilities.
Role Overview
We are looking for a Senior DevOps Specialist with a strong Python and data engineering background to support our R&D and tech teams by designing, building, and maintaining robust infrastructure and data pipelines across AWS and GCP. You will be instrumental in ensuring our systems are scalable, observable, cost-effective, and secure. This role is hands-on, cross-functional, and central to our product and research success.
Key Responsibilities
DevOps & Infrastructure
* Design, implement, and maintain infrastructure on AWS and Google Cloud Platform (GCP) to support high-performance computing workloads and scalable services.
* Collaborate with R&D teams to provision and manage compute environments for model training and experimentation.
* Maintain and monitor systems, implement observability solutions (e.g., logging, metrics, tracing), and proactively resolve infrastructure issues.
* Manage CI/CD pipelines for rapid, reliable deployment of services and models.
* Ensure high availability, disaster recovery, and robust security practices across environments.
Data Engineering
* Build and maintain data processing pipelines for model training, experimentation, and analytics.
* Work closely with machine learning engineers and researchers to understand data requirements and workflows.
* Design and implement solutions for data ingestion, transformation, and storage using tools such as Scrapy, Playwright, agentic crawling workflows (e.g., crawl4ai), or equivalent.
* Optimize and benchmark AI training, inference, and data workflows to ensure high performance, scalability, cost efficiency, and an exceptional customer experience.
* Maintain data quality, lineage, and compliance across multiple environments.
Key Requirements
* 5+ years of experience in DevOps, Site Reliability Engineering, or Data Engineering roles.
* Deep expertise with AWS and GCP, including services such as EC2, S3, Lambda, IAM, GKE, and BigQuery.
* Strong proficiency in infrastructure-as-code tools (e.g., Terraform, Pulumi, CloudFormation).
* Extensive hands-on experience with Docker, Kubernetes, and CI/CD tools such as GitHub Actions, Bitbucket Pipelines, or Jenkins, with a strong ability to optimize CI/CD workflows as well as AI training and inference pipelines for performance and reliability.
* Exceptional programming skills in Python. You are expected to write clean, efficient, and production-ready code. You should be highly proficient with modern Python programming paradigms and tooling.
* Proficiency in data-centric programming and scripting languages beyond Python (e.g., SQL, Bash).
* Proven experience designing and maintaining scalable ETL/ELT pipelines.
* Focused, sharp, and results-oriented: You are decisive, work with a high degree of autonomy, and consistently deliver high-quality results. You are quick to understand and solve the core of a problem and know how to summarize it efficiently for stakeholders.
* Effective communicator and concise in reporting: You should be able to communicate technical insights in a clear and actionable manner, both verbally and in written form. Your reports should be precise, insightful, and aligned with business objectives.
Nice to Have
* Experience supporting AI/ML model training infrastructure (e.g., GPU orchestration, model serving) for both diffusion and LLM pipelines.
* Familiarity with data lake architectures and tools like Delta Lake, LakeFS, or Databricks.
* Knowledge of security and compliance best practices (e.g., SOC2, ISO 27001).
* Exposure to MLOps platforms or frameworks (e.g., MLflow, Kubeflow, Vertex AI).
What We Offer
* Competitive salary + equity
* Flexible work environment and remote-friendly culture
* Opportunities to work on cutting-edge AI/ML technology
* Fast-paced environment with high impact and visibility
* Professional growth support and resources