2023-07-14 06:14:43
2023-07-14
Architect (7+ Yrs)
W2 - Permanent
India
Experience: 7-10 Years
Location: Bangalore
Job description:
- Our Team: The Site Reliability and Engineering group focuses on producing mission-critical infrastructure, tools, and processes that ensure the highest levels of availability and reliability for all our websites.
- SREs drive standardization and service-focused instrumentation, provide subject matter expertise, resolve break/fix scenarios (engaging broader teams as necessary), and partner with or lead others to achieve continuous improvement.
- In addition, SREs contribute to command-and-control activities focused on the rapid restoration of service during complex outages.
- Site Reliability Engineers are hybrid systems and software engineers who take ownership of reliability, scalability, automation, and other issues related to the uptime and availability of the e-commerce/Retail and Enterprise platform.
- Our goal is to build, scale, and guard the systems that delight our customers.
- Your Opportunity: You are right for the job if you are comfortable with system design, architecture, deep technical Linux and networking topics, and distributed architectures.
- You will work cross-functionally amongst a variety of teams and be a core contributor in every significant engineering service or solution that we deliver to our stakeholders.
- You will excel if you have enthusiasm for digging deep and a flair for sharp technical communication, prioritization, and organization.
- You will work directly with our Software Engineering teams to build our next generation “always up” cloud-based e-commerce/Retail and Enterprise platform.
Your Responsibility:
- On Call responsibilities to help minimize MTTD and MTTR
- Experience with containerization and container platforms (e.g., Docker, Kubernetes, Docker EE, OpenShift, Mesosphere).
- Should be able to interpret debugging information, drain traffic away from a cluster, roll back a bad software push, block or rate-limit unwanted traffic, bring up additional serving capacity through autoscaling features, and use the monitoring systems (for alerting and dashboards).
- Engage with enterprise and business/infrastructure functions to establish, track, and optimize operational metrics and targets in line with SRE principles (SLOs/SLIs, latency percentiles, error budgets, tech debt, and alerting guidelines).
- Work with observability tools and enterprise monitoring solutions such as Dynatrace, AppDynamics, New Relic, Prometheus, Graphite, Grafana, Nagios, Sensu, and Splunk. Should be able to write PromQL and Splunk queries.
- Programming/tooling and automation experience in one or more of the following languages: Golang, Java, Python, TypeScript, Node.js, and Shell.
- Good understanding of Kafka internals, SQL/NoSQL databases such as Cassandra, Elasticsearch, and PostgreSQL, and in-memory caching frameworks such as Memcached.
- Influence, design and create new architectures, standards, and methods for large-scale enterprise systems.
- Design, write and build tools to improve the reliability, latency, availability and scalability of e-commerce/Retail and Enterprise products.
- Engender reliability and availability starting with metrics and measurements.
- Enable scaling by providing tools, developing training and/or augmenting processes.
- Build tools and automation to prevent the recurrence of problems in mission-critical products/services.
- Augment existing instrumentation to build a cohesive picture of the characteristics of our systems with special attention to points of failure.
- Participate in capacity planning, demand forecasting, software performance analysis and system tuning.
- Develop a deep understanding of the numerous services and applications that come together to deliver e-commerce/Retail and Enterprise products
- Perform root-cause analysis of complex problems involving multiple parties, networks, hardware, and software that relate to scaling and performance.
- Secure the system from issues, be they real, perceived, or notional.
Additional responsibilities may include:
- Scripting and development responsibilities: Develop software in several modern languages. Develop large, complex database-backed systems and understand DB schema and query performance. Apply professional best practices in day-to-day work, such as revision control and unit testing. Apply statistical data analysis techniques.
- Networking responsibilities: Understand and perform packet captures with tcpdump, snoop, and other network sniffers. Understand and apply knowledge of most protocols (TCP/IP, HTTP, UDP, etc.).
- Lead end-to-end audit of monitors and alarms based on subsystem knowledge.
- Use time management and project management skills to lead the resolution of issues in a timely and organised manner, effectively communicating necessary information. May consult directly with developers or third-party vendors and provide subject matter expertise.
Your Qualifications:
- Bachelor's or Master's degree in Computer Science or a related field, with 8+ years of experience.
- Proficient in at least one programming language such as Java or Golang.
- Experience in designing, investigating, analysing, and troubleshooting large-scale enterprise systems.
- Methodical and systematic problem-solving approach, combined with a solid awareness of ownership, initiative, and drive.
- Fluency with running services at scale; in-depth understanding of Unix system internals and networking.
- Networking knowledge and in-depth understanding of network concepts, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
- Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way. Experience administering Linux systems in a production environment.
- Experience with distributed version control such as Git or similar.
- Experience with IaaS and PaaS providers such as AWS, Azure, OpenStack, and GCP.
- Experience with containerisation and container platforms (e.g., Docker, Kubernetes, Docker EE, OpenShift, Mesosphere).
- Experience with enterprise monitoring solutions such as AppDynamics, New Relic, Prometheus, Graphite, Grafana, Nagios, Sensu, and Splunk.
About us:
At Cloudely, we work with a single mission: Transform the way clients experience Product & Implementation, Development, and Support. Growth is a journey and never a destination. We are constantly striving to grow by gaining the trust of clients globally, offering services across Salesforce, Oracle, Robotic Process Automation, DevOps, Web, and Mobile Programming, to name a few. And we are just getting started! We have fabulous opportunities for you to grow along with us! At Cloudely, you will get what you are looking for: the scope to learn, prove, and grow. The way to your dream job and organization is just a click away. Share your resume at [email protected]. To know more about us, please visit www.cloudely.com.