Interview with Stefan Schmid, Prof. at University of Vienna

We spoke to Stefan Schmid, Professor at the Faculty of Computer Science of the University of Vienna, Austria. It was a pleasure to discuss with him the NetTest project, which was partially funded by ZOIS.

ZOIS: Thank you for taking the time for this interview. Before we start, can you tell us how you got in contact with the Zero Outage Association and what your intentions are in cooperating with us?

Stefan Schmid: I think that I was pointed to it by the Dean of the faculty of computer science at the University of Vienna. He approached me knowing that my work is related to critical infrastructure and he thought I should apply for this grant. My research is very much related to reliable networks and reliable infrastructures. This encouraged me to apply.

ZOIS: If you look at our association, we have four main streams. One is a Platform stream, which is more technology based, then we have People, Processes and Security. Where do you think your work fits in the best?

Stefan Schmid: In our research group, we’re mostly doing technical work, which includes theory for reliable systems and building systems. That’s why I think the Platform stream is the relevant one for me. We also conduct research on security, and I am actually considering applying for funding in the security space as well.

ZOIS: If you look at your project and its content, and had to explain it to somebody who has no idea about IT at all, who does not know about networks, the stack and computer systems, how would you describe your work?

Stefan Schmid: We are currently seeing a strong push toward digitalisation in our society, now also because of the virus, which makes us dependent on the digital infrastructure. Most applications and services today are distributed, and often based on the cloud. Hence the interconnecting networks have become a critical infrastructure in our society. This project is about making these networks reliable.

ZOIS: Great, thank you for this explanation. Can you please tell us in short what your project is about?

Stefan Schmid: Well, it turns out that most of the outages that we have today in the networks are actually due to human errors. There have been a lot of incidents recently and even tech savvy companies like GitHub and others are struggling to provide reliable network services. The network complexity and the frequency of human errors led us to believe that we should make networks and network operations more automated. It’s one of the hottest topics in networking research. The main challenge that we address is to make the network reliable, policy compliant and secure, even under failures. The bigger the networks, the higher the probability of a failure. Indeed, in big data centers today link failures are common and cannot be avoided, which is why we need software that can cope with and mask such failures so services are not affected.

The only thing that we can do is to build software or something on a higher layer that deals with these failures when they happen, ensuring that the service is still available. The main contribution we make in this project is to ensure that the policy and the network configurations are such that, even under failures, networks provide dependability, security guarantees and policy compliance. Our vision is that networks should take care of themselves on a software layer.

ZOIS: Thank you very much. So I understand it is something also touching the people area because human beings make mistakes and in case of an outage or failure, the network will look after itself and will kind of heal itself automatically. It looks like this is how the internet is designed. If there’s one half collapsing, automatically, it gets rerouted to somewhere else.

Stefan Schmid: The technology we envision is actually reminiscent of self-driving cars, which optimise routes automatically, accounting for congestion. An interesting question here is also which control can be automated and which control needs to stay with humans. For example, in a self-driving car, situations may occur where the automated logic realises that things are not going as expected. The question arises: at which point should we give control back to the driver? It is the same with the network.

In this project we mostly consider ISP networks. The case study that we ran is with a provider called NORDUnet, which connects different countries in Scandinavia with each other and to the cloud. As you said, the internet has been designed so that it in a way repairs itself. There have always been protection mechanisms, and this property led to the success of the internet in the first place: since it was invented, there have hardly been any outages.

Nowadays the challenge is that the requirements we place on the network have also risen. When the internet was built, security was a much smaller concern than it is now, and the number of policy requirements has increased. For example, you may have a traffic policy saying that traffic going from this country to that country should not be routed through another country. So it has become more complex than simple reachability. We want to support much richer policies with this project. For example, when we send traffic from place A to place B, we may want the traffic explicitly to go via place C before it reaches B, because C performs certain security checks. This was our main motivation in building this tool. It is optimised especially for MPLS and segment routing networks.

ZOIS: Thank you. Let’s go a little bit more in the direction of your tool. I guess the tool cannot heal networks, but what can the tool do?

Stefan Schmid: Right. The tool of course cannot fix physical failures. We accept physical failures, but existing networks have protection mechanisms that allow them to quickly change the route in case of a physical failure. Still, checking whether these configurations are policy compliant is something that is very difficult to do. Our contribution is that we build software that allows the network to check its configurations itself and compare them to the network policy. If there is a mismatch, it can raise an alert that something has to be changed.

ZOIS: So if you allow me to use a metaphor: if I want to travel with my car from A to B, and I want to be as fast as possible, your tool will tell me where there are traffic lights or roadworks that I cannot pass.

Stefan Schmid: Right, using your analogy, you can see it like this: if you want to travel with your car from A to B, and you want to be as fast as possible, our tool will allow you to determine the best route, even accounting for the traffic jams and roadworks which could occur in the worst case. Basically, it is what we call ‘what-if’ analysis.
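
To make the idea of a ‘what-if’ analysis concrete, here is a minimal Python sketch. It is an illustration only and not a description of how the project's tool works internally: it simply tries every combination of up to a given number of failed links, assumes traffic then follows a shortest surviving path (a simplification standing in for the network's real protection mechanisms), and reports the scenarios in which B becomes unreachable or the required waypoint C is bypassed. The topology and the names `shortest_path` and `what_if` are hypothetical.

```python
from collections import deque
from itertools import combinations


def shortest_path(links, src, dst):
    """Breadth-first search over undirected links; returns a node list or None."""
    neighbours = {}
    for u, v in links:
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def what_if(links, src, dst, waypoint, max_failures):
    """Return (failed_links, problem) for every scenario that violates the policy."""
    violations = []
    for k in range(max_failures + 1):
        for failed in combinations(links, k):
            surviving = [link for link in links if link not in failed]
            path = shortest_path(surviving, src, dst)
            if path is None:
                violations.append((failed, "unreachable"))
            elif waypoint not in path:
                violations.append((failed, "waypoint bypassed"))
    return violations


# Hypothetical topology: primary route A-C-B, backup route A-D-E-B.
LINKS = [("A", "C"), ("C", "B"), ("A", "D"), ("D", "E"), ("E", "B")]

if __name__ == "__main__":
    for failed, problem in what_if(LINKS, "A", "B", waypoint="C", max_failures=1):
        print(failed, problem)
```

In this toy topology the policy holds when nothing fails, but a single failure of either link on the primary route sends traffic over the backup route, which skips C. Enumerating every scenario like this quickly becomes too expensive as networks grow, which is one reason dedicated verification tools are needed.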

ZOIS: So if we would like to use this navigation system for MPLS networks, where do we have to look and where do we get it?

Stefan Schmid: If you want to have a look at it, you should check out our website. You will not only find a description of the main ideas and theory behind the tool, but you will also be able to try the tool demo.

ZOIS: You work at the university, which is a scientific environment, and the zero outage industry is a very economy-driven industry. So how do you see a cooperation with the Zero Outage Association?

Stefan Schmid: This is exactly one of the additional reasons why I wanted to apply for this grant. We would like to get exposure and also connections in the industry. We were already in touch with Stefan Kasulke, who put me in touch with some industry leaders who are actually interested in this technology. Having these possibilities to connect and to get feedback on our technology is a very valuable aspect of this project.

ZOIS: Thank you. What would be the outlook? When this project is done, can we expect more from you?

Stefan Schmid: Of course, I’m happy to continue the collaboration. This is the first step; the work of making networks self-driving is not completed with this project. We have a lot of additional ideas and there are additional networks that need to be supported. We also need to further improve the performance. It is also in our interest to involve machine learning approaches to complement our worst-case guarantees, leveraging predictions about possible failure scenarios. This is a research project that we will be working on for the next couple of years, so please keep following us, and it will be nice to continue the cooperation.

ZOIS: Good. Thank you very much, Stefan, for the interview.

Stefan Schmid: Thank you.