We met Omri during one of our early customer interactions and were quickly impressed – Omri has been working with Linux almost as long as he can remember. At the age of 13 he was already using BBSes to download and try out new kinds of software. “It was mainly games back then, but soon I started reading about a new free operating system, Linux. It was horrible to set up – it took 30 1.44-inch disks and a few long days to install. Then the Internet came along. My brother was connected with a few people who had accounts at universities around the world, and I learned about phreaking – we had phone numbers with which we could dial out to anywhere in the world for free. Until the Internet cost over 100 NIS ($25) per month, I didn’t pay for it at all.”
Omri quickly immersed himself in the inner workings of the Internet: protocols, networks, security, batch scripts. “I was a real script kiddie,” he recalls.
What drew you to Linux?
“When I was 13, my brother (16 at the time) wrote a system called BASS (Bulk Auditing Security Scanner, part of the IAP) – one of the first Internet security scanning systems ever written.
He scanned the whole web for security issues (around 36 million servers, circa 1999) and published the results online. It generated quite a lot of buzz, including an article in “7 Days” (a top Israeli magazine at the time). It was cool, and I was hooked.”
Omri developed his hobby until he was 18, when he had to join the IDF (Israel has mandatory military service). Surprisingly, he was recruited to the artillery corps and not an intelligence or technological unit.
“After the army I traveled the world a bit and decided to gain experience in other fields I was less familiar with such as sales. I took my girlfriend at the time and we went to Vegas to try our hands at sales – she sold cosmetics and I sold mobile phones. It was an amazing time for me – I learned a lot while doing it.” This was in 2006, just before the iPhone came out and changed this business completely.
So when did your hobby become a job?
“After exploring the world, I came back to Israel and started working for Interbit – a training, consulting and software development company. I started as the IT manager, responsible for the maintenance and support of a couple hundred PCs and servers. When I got there, everything was manual – the guy I replaced had manually updated the software on every machine using CDs. It was a catastrophe and simply unmaintainable.
My manager was really great – she gave me total freedom to do what I thought was needed. I automated all the day-to-day processes by creating a Linux-based remote install server I could manage remotely. This change pushed the organization from a state where it took half a day to install and configure new software to it happening in a matter of minutes, with a click of a button from my mobile phone while sitting at home. By automating everything, I had time to leverage the immense amount of resources, classrooms and instructors I worked with to learn and expand my knowledge.
This was before Puppet, Hadoop, OpenStack, etc., but I did take advanced Java classes, app performance tuning, Solaris and more. I was there for 2.5 years.” After the first year Omri started instructing Linux courses, and after two years he started teaching Java.
What happened next?
“After that, I joined a friend who owned a consulting and training company, and started offering our services to large and small companies, while still instructing for Interbit and John Bryce.“
In 2009 Omri started working as a consultant for Liveperson. “I saw an opportunity working for Liveperson, and understood that I could make a difference there, so when they offered me a full-time position I decided to accept the challenge. At that time they were still using old-fashioned installation scripts and methods; they were using Solaris as the operating system and manual deployment procedures, with some scripts and release documentation. Things didn’t work well and broke often. We realized we needed to automate everything.
We started a small test implementing Puppet, and even though some in the company resisted the change, its advantages were obvious and eventually it became the standard way of deploying code and applications in production.
In addition to Puppet, we also started testing Hadoop for our big data efforts, and I had the chance to play with very cool technologies, some of which became the backbone of the company’s technological stack – to this day they still manage to remain cutting edge in many fields.”
Their monitoring solution was HPOV (HP OpenView) for alerts and Graphite for metrics. Log analysis was done using Splunk. They had a four-person team doing monitoring only, and their own server farm with approximately 4,000 physical machines.
What do you do at Avantis?
I manage the operations team, which leads the work on infrastructure, monitoring, uptime management, backup plans, efficiency and more. We are also highly involved in choosing which technologies the company is going to work with to enable integration and overall stability.
How is your team structured?
- Uri, senior operations – focused on longer term projects such as configuring new systems.
- Amir, junior operations – focused on the ongoing tasks the team needs to handle.
- Yaara, NOC – gets alerted 24/7 via SMS and passes only critical events to the relevant team member according to our escalation policy.
- Leonid, consultant – working on strategic projects such as Docker integration.
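The escalation policy Omri describes – the NOC absorbing noise and forwarding only critical events to the right owner – can be sketched in a few lines. This is an illustrative mock-up, not Avantis’s actual tooling; the subsystem names, severity levels and contacts below are all assumptions:

```python
# Hypothetical sketch of a NOC escalation policy: non-critical alerts
# are absorbed, critical ones are routed to an owner per subsystem.
# All names here are illustrative, not real Avantis systems or roles.

CRITICAL = "critical"

# Assumed mapping of subsystems to their escalation contact.
ESCALATION_POLICY = {
    "hadoop": "senior-ops",
    "web": "junior-ops",
    "docker": "consultant",
}

def route_alert(alert):
    """Return who (if anyone) should be paged for this alert.

    Non-critical alerts stay with the NOC and return None; critical
    alerts on an unmapped subsystem fall back to the ops manager.
    """
    if alert["severity"] != CRITICAL:
        return None  # NOC handles it; no escalation
    return ESCALATION_POLICY.get(alert["subsystem"], "ops-manager")
```

The point of codifying the policy, even this simply, is that the routing decision stops depending on whoever happens to be awake.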
So what is DevOps for you?
DevOps engineers are what I call “Integrators Plus”. I see developers that go into operations and people from operations that enter the developers’ world. At Avantis, most DevOps personnel come from an operations background; they have experience in scripting, Puppet, and configuration management systems, with a strong focus on Linux, because this is what we need. Most developers fall short in operations – they lack the experience, knowledge and understanding of how things actually work at the operating system level – while people with operations experience find it easier to get the job done. At Avantis, the focus for the DevOps team is on integrating things and making sure they work together.
DevOps is a lot about working with data – I believe in collecting as much data as possible, even if you are not sure how to use it yet. This tactic has proven itself a number of times in the past. It’s cheap to collect metrics so if you are already collecting some things why not collect all of it? I believe in measuring everything. You need to know how your app behaves in different situations – try to bend it and see where it breaks. We are doing this for critical parts of our apps, trying to avoid affecting our actual service (although in some cases we are OK with affecting our service as part of a test, as we believe it’s supremely important to find our pain-points sooner rather than later).
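The “collect everything, it’s cheap” philosophy works because recording a metric can be made almost free. As a minimal sketch (an in-process toy, not a real metrics library – the class and metric names are assumptions):

```python
import time
from collections import defaultdict

# Toy illustration of "measure everything": recording a data point is a
# single append, so there is little reason not to collect a metric even
# before you know how you will use it.

class MetricStore:
    def __init__(self):
        # metric name -> list of (timestamp, value) samples
        self._series = defaultdict(list)

    def record(self, name, value):
        self._series[name].append((time.time(), value))

    def latest(self, name):
        return self._series[name][-1][1] if self._series[name] else None

    def count(self, name):
        return len(self._series[name])

store = MetricStore()
store.record("app.request_ms", 12.5)
store.record("app.request_ms", 48.1)
```

A production setup would ship these samples to something like Graphite rather than hold them in memory, but the cost model is the same: appending a number is trivial, while the anomaly you later find in that series can be priceless.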
In my experience, pure DevOps can mostly only be found in small teams, such as a startup with a couple of founders who are also writing the code themselves and therefore look at development and operations as a holistic part of the business. That being said, there are a few companies like Google, Facebook and Netflix who are already doing it right – they do not look at a server as a single disconnected unit but as part of a bigger system. They also always aim for automation and treat infrastructure as code.
Another important aspect of DevOps is continuous delivery. The idea is that a developer’s commit moves automatically through staging, QA and production as one holistic flow. Most companies I know are not there yet. This is the Holy Grail of DevOps.
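The “one holistic flow” can be pictured as a series of gates a commit must pass, halting at the first failure. This sketch is purely illustrative – the stage names and checks are assumptions, not a description of any real pipeline:

```python
# Illustrative continuous-delivery flow: a commit is promoted through
# ordered stages, stopping at the first gate it fails. Stage names and
# gate conditions are assumptions for the sake of the example.

def promote(commit, stages):
    """Run a commit through ordered (name, gate) stages.

    Returns the list of stage names the commit reached.
    """
    reached = []
    for name, gate in stages:
        if not gate(commit):
            break  # failed gate: the pipeline stops here
        reached.append(name)
    return reached

stages = [
    ("staging", lambda c: c["builds"]),
    ("qa", lambda c: c["tests_pass"]),
    ("production", lambda c: c["approved"]),
]
```

The Holy Grail Omri describes is the case where every gate is automated well enough that a healthy commit reaches the last stage with no human in the loop.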
Do you do continuous integration?
No, we are not fully automated yet. We do a bit of continuous integration, but we don’t do enough testing, as it takes a lot of time to accomplish. This is actually absurd: the time you don’t invest now in writing tests, you pay for later in many hours of maintenance, bug fixes, etc. The issue is that this is hard to quantify. It’s much easier to quantify development time – say, cutting it from one month to one week – than to quantify the cost of monitoring and supporting the code once it is in production. The business would rather cut the development time to a week by skipping the tests, but then the problems start surfacing in production and might cost more in the long run. This is a common tension in many companies.
How do you measure your success?
My key responsibility is to enable our company to focus on the business: enabling the business people to focus on monetizing our technology, and the developers to focus on improving it. Once you get to a point where you are able to stabilize things, it becomes very obvious that you are doing the right things.
We are not working toward a specific SLA goal, and we are not measuring it actively – the time and effort it would take to make these measurements is not worth it. My job is to optimize uptime against the costs of maintaining it; we are trying to strike the right balance between uptime and the resources needed to achieve it. In some cases getting to a 100% SLA is not worth the investment, since you would need to exponentially increase your investment, both in time and in money, to get that extra bit of uptime. If you actually look at what this incremental uptime is worth financially, you might find that it’s just not worth it.
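The numbers behind this tradeoff are easy to work out: each extra “nine” of availability cuts the allowed downtime by a factor of ten, while the effort to achieve it typically grows much faster than that.

```python
# Worked numbers for the uptime-vs-cost tradeoff: allowed downtime
# per year at different availability levels.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability):
    """Allowed downtime per year for a given availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR

for a in (0.99, 0.999, 0.9999):
    print(f"{a:.2%} uptime allows {downtime_minutes_per_year(a):,.1f} min/year of downtime")
```

Going from 99.9% to 99.99% buys back roughly eight hours of downtime a year; whether that is worth the engineering cost is exactly the financial question Omri raises.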
We are also measuring efficiency in many other areas – ensuring the servers are working in the optimal capacity, for example.
What are your key challenges in your current role?
The key challenge is what I like to call “People Compliance” – even if you define a very organized and structured way to do things, when the development team has business goals to achieve in a short period of time, people tend to cut corners. In the short term you might deploy an updated version sooner, but in the long term you will always pay for it in some other way. For example, when you try to scale things out, you might encounter problems. One of these might be that one of your applications was written in PHP. Why? Because you have developers who write in PHP… The cost of rewriting the application in another language was X, but the cost of not rewriting it was another 3–4 servers, which is Y. It’s very difficult to calculate the cost of each decision, and to predict which is the wiser investment in the long run.
This also relates to choosing the right technology stacks. One of our challenges is making sure the whole organization is working with the technology stack that is best for the company.
Another challenge is implementing continuous delivery. It’s not easy to execute a strategy – it requires people to comply with a defined “way of working”. The way we tackle this is by continuously implementing small changes in methodology instead of trying to change everything all at once. This change is led by the DevOps team. One of the changes we’ve implemented to help promote continuous delivery is constantly analyzing our application metrics before deploying code.
Another challenge is automation – many DevOps engineers I know did a good job at integrating but did not focus enough on automation. Whatever you do on one server should scale to a thousand servers fairly simply.
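The key property that lets one-server work scale to a thousand servers is idempotence: an operation that is safe to run any number of times, converging to the same state. A minimal sketch in the desired-state spirit of tools like Puppet (the helper and config line are illustrative, not a real Puppet API):

```python
# Sketch of an idempotent operation: ensure a line exists in a config
# file. Running it once or a hundred times, on one server or a
# thousand, produces the same end state. The helper is illustrative.

def ensure_line(path, line):
    """Ensure `line` is present in the file at `path`.

    Returns True if the file was changed, False if already converged.
    """
    try:
        with open(path) as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        lines = []
    if line in lines:
        return False  # already converged: nothing to do
    with open(path, "a") as f:
        f.write(line + "\n")
    return True
```

Because the second run is a no-op, a fleet-wide rollout can simply run the same step everywhere, retry failures, and trust the result.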
Legacy applications are also a challenge. This is even more challenging for companies with a fast churn rate of employees. Pretty soon you are left with applications with no “owner”, and by the time you need to make changes, the code has become unmaintainable. Our strategy is simple – kill all the legacy applications that don’t have an owner.
Another challenge is to reduce the number of interruptions my team members are dealing with. If anything isn’t working, my team is the first place a developer or business analyst will check. This can be very time-consuming. For example, we work with Puppet, so any time developers need to make changes in the configuration, they come to us. But there is no reason for this to be the case – the developers should make these changes themselves. On the other hand, we MUST remain the gatekeepers, so we approve changes before they go to production.
Lastly, another challenge is to balance between working with procedures and moving quickly. I personally dislike bureaucracy and hate “red-tape” mentality. I tend to trust the people I work with, but in some cases – especially when you are working with less experienced people – you need to apply a little procedure.
Where are you investing most of your resources?
We are the most “resource intensive” team in the company. The hardware and software we purchase are expensive – to give you a taste, a Hadoop farm capable of holding 1.5 petabytes of storage costs us ~$1M.
What are the key resources you are using to build your knowledge and expertise?
Community – friends and colleagues are my number one resource. I maintain relationships with many of my colleagues and hear about their experience working with new technologies and solutions. Another great resource is conferences. When you’ve been in a company for a long time you start getting used to the way things work there. Conferences are a great way to inspire yourself to do things differently. Velocity, for example, is an amazing conference I love attending. It’s already created a lot of value for me.
Which solutions are you using today for data analysis?
- We are using ELK – Elasticsearch, Logstash and Kibana. We’re thinking of using it to replace our existing business dashboard. It’s fairly easy to customize the dashboards according to specific business questions and KPIs. I like this stack because it is very powerful, open source, flexible, performs very well (apart from Logstash, which does have some performance issues), and it’s pretty easy to customize.
- Anodot – We are trying them for our anomaly detection metrics.
- MSSQL – mainly for analysts who want to export the data into XLS files.
- We are also working on bringing in new tools such as Tableau and Vertica (HP) – this need is coming from the BI team.
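To make the ELK pipeline above concrete: the job Logstash does is turning unstructured log lines into structured documents that Elasticsearch can index and Kibana can chart. In spirit (the log format and field names below are assumptions, and real Logstash uses grok patterns rather than Python):

```python
import re

# Toy version of Logstash-style parsing: extract structured fields
# from a common-log-format access line. The pattern and field names
# are illustrative assumptions, not a real grok configuration.

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)'
)

def parse_access_line(line):
    """Parse one access-log line into a dict, or None if it doesn't match."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    doc = m.groupdict()
    doc["status"] = int(doc["status"])
    doc["bytes"] = int(doc["bytes"])
    return doc

sample = '10.0.0.1 - - [01/Jan/2016:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512'
```

Once logs are structured this way, “show me the 500-error rate per path” becomes an Elasticsearch aggregation and a Kibana panel instead of a grep session.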
What is missing in data analytics today?
Most tools out there are just not good enough. The big tools, such as Tableau, Vertica (HP) and Splunk, are very expensive, while our analysts’ needs are relatively simple – the ability to find an anomaly in the data and then find correlations between that anomaly and other data points. One issue is scale – we have a terabyte of data coming into our system every day. When you look at over a billion events, even after you aggregate them, it’s too much data for the existing tools to deal with. In that sense, AI Joe‘s solution is great for us – not only does it automatically find anomalies, it also “cuts” the manual analysis time by bundling them with insights that are relevant to our analysts.
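The core of the analysts’ need – “find an anomaly in the data” – can be illustrated with the simplest possible rule, a z-score threshold. Real products (including the anomaly-detection tools mentioned above) are far more sophisticated; this sketch only shows the shape of the problem:

```python
import statistics

# Minimal anomaly detection: flag points that deviate strongly from
# the rest of the series. A plain z-score rule, shown for illustration
# only; production systems handle seasonality, trends and scale.

def anomalies(values, threshold=3.0):
    """Return indices of values more than `threshold` stdevs from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # a flat series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]
```

The hard part at a terabyte a day is not this arithmetic but doing it across millions of series and then correlating the flagged points with other metrics, which is exactly where the big tools strain.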