Defending DRAM for data safety and security in the cloud with Dr. Stefan Saroiu

Episode 120 | July 8, 2020

Dynamic random-access memory – or DRAM – is the most popular form of volatile computer memory in the world but it’s particularly susceptible to Rowhammer, an adversarial attack that can cause data loss and security exploits in everything from smart phones to the cloud.

Today, Dr. Stefan Saroiu, a Senior Principal Researcher in MSR’s Mobility and Networking group, explains why DRAM remains vulnerable to Rowhammer attacks, even after several years of mitigation efforts, and then tells us how a new approach involving bespoke extensibility mechanisms for DRAM might finally hammer Rowhammer in the fight to keep data safe and secure.

Transcript

Stefan Saroiu: So our philosophy is that what we would like to see in the standard, rather than describing the solution for Rowhammer, what we would like to see is describing extensibility mechanisms that companies, hardware vendors, can implement their favorite form of mitigations, the one that works best for their particular type of memory by leveraging these extensions.

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: Dynamic random-access memory – or DRAM – is the most popular form of volatile computer memory in the world but it’s particularly susceptible to Rowhammer, an adversarial attack that can cause data loss and security exploits in everything from smart phones to the cloud.

Today, Dr. Stefan Saroiu, a Senior Principal Researcher in MSR’s Mobility and Networking group, explains why DRAM remains vulnerable to Rowhammer attacks, even after several years of mitigation efforts, and then tells us how a new approach involving bespoke extensibility mechanisms for DRAM might finally hammer Rowhammer in the fight to keep data safe and secure. That and much more on this episode of the Microsoft Research Podcast.

Host: Stefan Saroiu, welcome to the podcast.

Stefan Saroiu: Thank you, Gretchen. It’s great to be here.

Host: Some of my favorite people on the planet are working on making things work for us and you’re one of those people. So, first, thanks. As we begin though, let’s talk about your people for a minute. You’re a Senior Principal Researcher in the Mobility and Networking group, which isn’t totally separate from Systems and Networking, but they’re not totally the same either. So, give us a verbal Venn diagram of the two groups, why they exist, where they’re different and where they overlap, and how, in broad strokes, each of them is working to make our lives better.

Stefan Saroiu: Yes, thank you for the kind words, Gretchen. So back in the day, Microsoft Research had a single Systems and Networking group and as the group got larger, the group split into several smaller groups like the Systems group, the Security group, the Distributed Systems group and the Mobility and Networking group. But we’re all systems researchers at the end of the day, whether we work on operating systems or networks or mobile systems or on distributed systems. So, I’m part of the Mobility and Networking group. But over my research career, my work has focused on systems, both in terms of mobile systems and networking systems. And for the past couple of years, these systems that I’ve been working on aim to improve the security of users and the security of infrastructure.

Host: Let’s get specific and talk about the work you do within the Mobility and Networking group now. So, sort of in general, what big problems are you trying to solve as a researcher and, and maybe more importantly, why does the world need you to solve them? What gets you up in the morning?

Stefan Saroiu: So, I do two kinds of work. The first kind is creative work because I really value creativity very highly and I believe it’s very difficult to come up with a truly creative idea. The second kind of work that I do is driven by intellectual curiosity and by revisiting assumptions or turning them on their head. And I strongly believe that the role of an expert is to break preconceived assumptions and rules. Unfortunately, you have to be an expert first. In fact, trying to break assumptions before deeply understanding an area and a problem is a very bad idea. So, I’ve been working on secure systems research for almost a decade now. We built a secure network tracing system that offers very strong privacy. So, for example, network operators can monitor their networks in a way that all the sensitive data is locked down without anybody being able to subvert it or use it in any ways other than originally intended. We built sensors that can attest that information is correct and has not been manipulated or changed. So, as a simple example, consider a photo where one can check whether the photo has been photoshopped or is indeed captured by a proper camera. Then I worked on a secure payment system called Zero-Effort Payments, that was a little like a precursor of the Amazon Go store. So, our system was a little different in that you’d pick up the food and you’d go through a cashier who’d ring you up, but you’d not have to do any explicit thing to actually pay. The system would know who you are, since you’d have to pre-register with the system, and the payment would be processed. So, I’ve worked on all these things. I also worked on a firmware TPM, which brings trusted computing to mobile devices and it works in millions of smart phones and tablets today.

Host: Mmm.

Stefan Saroiu: But for the past couple of years, I’ve worked on Azure security in particular, and we started a project called Project STEMA. STEMA stands for Secure, Trustworthy, and Enhanced Memory for Azure and we’ve been focusing a lot on Rowhammer attacks.

Host: Well, let’s talk about memory and computer memory specifically…

Stefan Saroiu: Yeah.

Host: … since it’s a foundational storage unit for digital data, but there are many kinds of containers, as you well know, so let’s do a quick primer for the flavor that we’re really most interested in today, which is DRAM. So, how does it work physically? What are its vulnerabilities, both internally and externally? And you don’t need to get ridiculously granular here because I saw your hundred-and-fourteen-page deck and a hundred pages of it is explaining DRAM! No, I’m kidding. Don’t be afraid to get as technical as you need to set the problem up.

Stefan Saroiu: Okay. So, DRAM is the world’s most popular form of volatile memory. Pretty much every form of computing out there has DRAM. You can find DRAM in smart phones, in tablets, in PCs… You can find DRAM in cars. You can find DRAM in washing machines. And a DRAM cell stores a zero or a one, and it does that by using a very simple circuit with one capacitor. And a capacitor can be charged or discharged, and that can mean a one or a zero. So, for example, if you want to store the value one/zero/one/zero, you just sort of have four cells and you have one charged capacitor and one discharged capacitor, one charged and one discharged, and you encode one/zero/one/zero that way. Now, capacitors leak over time. They sort of lose their charge over time. So, DRAM has to continuously refresh these capacitors. And the cells are built to maintain their charge for a small period of time, say something like sixty-four milliseconds, and the contract is that the hardware has to make sure that every single cell in its DRAM is refreshed once within sixty-four milliseconds and in that way the cell maintains its data, its charge. Now, DRAM cells are organized in rows and columns and when you read a value from DRAM, you read by row, and the way you read this is by switching some transistors in such a way that the capacitors are then coupled with some sensors. So, the sensors sense whether these capacitors are charged or discharged and then they can translate that into data. Now, unfortunately what’s happening is that when you actually sense the data on the capacitors, it turns out that rows located in the vicinity, in the adjacency, of this row you’re trying to read, those capacitors also get affected and they get affected by having them discharge faster than normal. And this phenomenon is called a DRAM Disturbance Error because, by causing them to discharge faster within a sixty-four-millisecond period, you lose the content of that cell and in some sense the bit flips that way. And the bit that flips is one that you never actually meant to read or access before. Maybe you don’t even have control over it. Maybe it’s some other software component that controls it.
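
The refresh contract and the disturbance effect described above can be illustrated with a toy model. This is a minimal sketch, not a model of real DRAM: the sixty-four-millisecond window comes from the conversation, while the leak rate, the per-activation disturbance, and the read threshold are made-up numbers chosen only to make the effect visible.

```python
# Toy model of one DRAM cell inside a single refresh window.
# Only REFRESH_WINDOW_MS comes from the conversation; every other
# constant is an illustrative assumption, not a measured value.
REFRESH_WINDOW_MS = 64             # each cell is refreshed once per window
BASE_LEAK_PER_MS = 0.005           # assumed natural charge loss per millisecond
DISTURB_PER_ACTIVATION = 0.00001   # assumed extra loss per activation of a neighboring row
READ_THRESHOLD = 0.5               # charge below this is sensed as a zero


def charge_after_window(neighbor_activations: int) -> float:
    """Charge left in a fully charged cell (1.0) at the end of one refresh window."""
    natural_loss = BASE_LEAK_PER_MS * REFRESH_WINDOW_MS
    disturbance_loss = DISTURB_PER_ACTIVATION * neighbor_activations
    return max(0.0, 1.0 - natural_loss - disturbance_loss)


def reads_back_as_one(neighbor_activations: int) -> bool:
    """True if a stored one is still sensed as a one when the window ends."""
    return charge_after_window(neighbor_activations) >= READ_THRESHOLD


if __name__ == "__main__":
    for activations in (0, 10_000, 50_000):
        print(activations, reads_back_as_one(activations))
```

With these assumed constants, the cell survives the window when its neighbors are idle or lightly used, and it flips once a neighboring row is activated often enough before the next refresh arrives.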

Host: Okay.

Stefan Saroiu: So that’s where sort of the concern lies.

Host: Right.

Stefan Saroiu: In the DRAM space, there is this Rowhammer attack and the contract, from day one when you build any system, any software system, anything you want, any computer, the contract is that if you give me a piece of memory, when I write something to it, I want to be able to read what I wrote. And with Rowhammer, you violate this very simple contract. You read a different value than the one you wrote. And, doing that, you can basically exploit systems in ways that were unimaginable before.

Host: Well, since we’re talking about Rowhammer right now, let’s move into it. As you put it once, it’s one of the hottest research topics in the security research community. So, give us a level set. What is Rowhammer, specifically, and why, particularly, does it make cloud providers and server farmers nervous?

Stefan Saroiu: So, I described this DRAM Disturbance Errors effect and this effect gets worse as DRAM gets denser and denser and we want DRAM to get denser and denser because that’s how we store more capacity, that’s how we build better DRAM. But the phenomenon gets worse, this DRAM Disturbance Error. A Rowhammer attack, it’s an attack in which an adversary generates a workload that exploits disturbance errors to flip the value of bits that have critical importance to the security of the system. Like for example, bits that form a secret key. And cloud vendors are very nervous because the entire business model is one where you have multiple parties share your hardware. In particular, in this case, they share your memory. They share your DRAM. Well, what if one of these customers becomes rogue? They themselves get exploited through some other attack. Can they attack other customers by flipping bits in their memory? And yes, they can attack it in very devastating ways.

Host: How did Rowhammer get its name? Is it because of the rows in the DRAM?

Stefan Saroiu: I was explaining how, when you access a row, an adjacent row gets affected by that, and the attack, in order to create this disturbance error, what you have to do is you have to keep accessing that row over and over and over again and that’s the term hammering the row. And the attack got this name Rowhammer.

Host: So, if I’m an attacker, am I trying to do something specific or am I just trying to mess you up?

Stefan Saroiu: Oh, that’s a great question. Depends, right? So, as a cloud provider, cloud providers are nervous about both scenarios. A Rowhammer attack in general refers to flipping a security-critical bit, so by flipping that bit, I’m trying to target something specifically. I’m trying to exploit something. However, the simpler, and in fact, the likelier form of attack is one where I’m just messing up the bits. And systems, actually, today in the cloud, they’re pretty good at detecting when these bits are messed up, but if the bits are messed up, there is just very little we can do about that. I see these bits and I’ve encoded enough redundancy in the data to know that they’re messed up, but I can’t recover to where I was before.

Host: Right.

Stefan Saroiu: And there is really not a good way to solve that problem. Once the bits have flipped, it’s like I can’t go back, and the best thing I can hope for is maybe reboot the server and let’s start all over again. And that’s also very, very bad for a cloud customer because there are a lot of workloads in the cloud that have a lot of data in memory, they do a lot of computation, maybe they train a machine learning model and then, you know, for many, many days, and at some point you say, oh, sorry, guys, you have to start all over again because we’ve messed it up.
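
The point about detecting flipped bits without being able to undo them can be sketched with a plain checksum. This is only an illustration under stated assumptions: the conversation does not say which redundancy scheme is used, and the helper names `store`, `flip_bit`, and `is_intact` are hypothetical. A CRC-32 is used here because it is in the Python standard library and shows the same asymmetry: it tells you that the data changed, but not which bit changed.

```python
import zlib


def store(data: bytes) -> tuple:
    """Keep a mutable copy of the data together with its CRC-32 checksum."""
    return bytearray(data), zlib.crc32(data)


def flip_bit(buf: bytearray, bit_index: int) -> None:
    """Simulate a disturbance error by flipping one bit in place."""
    buf[bit_index // 8] ^= 1 << (bit_index % 8)


def is_intact(buf: bytearray, checksum: int) -> bool:
    """Detect corruption; the checksum alone cannot say which bit to repair."""
    return zlib.crc32(bytes(buf)) == checksum


if __name__ == "__main__":
    memory, checksum = store(b"days of training state")
    print(is_intact(memory, checksum))   # data is untouched
    flip_bit(memory, 5)
    print(is_intact(memory, checksum))   # corruption is detected, not repaired
```

Once `is_intact` reports a mismatch, the only safe options in this sketch are to discard the buffer or rebuild it from scratch, which mirrors the "reboot and start all over again" outcome described above.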

(music plays)

Host: Before we dive into the technical aspects of your research on the Rowhammer threat, I find this whole drama really fascinating and I think it would be good to set the stage and the cast of characters. We’ve talked about cloud providers. Who are the other players? Who provides to the providers and what’s their motivation? Who sets and guards the standards, and finally, who’s got an eye on everybody?

Stefan Saroiu: It’s a fascinating landscape. Microsoft is a cloud provider and I started with cloud providers because part of our role at Microsoft Research is also to make the cloud better. There are several other players, and one big group of players is the companies that sell DRAM. And when the attack was first described or published, which was in 2014, the hardware vendors jumped to quickly dismiss these concerns. It’s, we knew about Rowhammer, but that’s a problem that the older type of memory has. The newer type of memory, it doesn’t have this problem anymore. And in fact, there are quotes online where vendors claim that DDR4, which is the memory that we always use today, is Rowhammer-free. And of course, researchers have shown over and over again that DDR4 is not Rowhammer-free and then they said, oh, yes, but then you should buy this newer DDR4 that has a form of defense called TRR and just this year, earlier, there was a wonderful paper from an academic group in the Netherlands that showed a huge number of DDR4 DRAM with this form of defense TRR being vulnerable by slightly changing the form of the attack, okay? So basically what the vendors have done is they’ve patched the old way of mounting the attack, but they haven’t told anyone how they patched it and you just have to sort of try different things until one of the new things clicks and then you can bypass those defenses. So, now we have the cloud providers, the DRAM vendors and then the security and research community. And there is this feeding loop where the DRAM vendors say, oh, yeah, we knew about it. The new memory is safe. Give it a year or two. The security and research community says, oh, it’s not safe… and sort of the cloud providers, and the smart phone manufacturers as well, are caught in the middle.

Host: So Rowhammer is a problem. It’s a big one, and it hasn’t been ignored, but it hasn’t been solved either. So, when we talked before, you said that more than forty papers…

Stefan Saroiu: Oh, yeah.

Host: …have been published on this subject and DRAM still remains as vulnerable as ever. So, what has the academic community done to date to try to solve the Rowhammer problem, and what, to date, have they got right and got wrong?

Stefan Saroiu: Right, so there’s sort of two bodies of work in academia. One is the security research community. And then there is the computer architecture community, and to give them credit, actually, the computer architecture community were the first ones to show this problem to sort of raise the flag saying, hey, we have a DRAM Disturbance Error. And the architecture community has been very good at putting forward a whole bunch of Rowhammer mitigation proposals. However, all these proposals, they come with trade-offs and implementing one of these mitigations inside of DRAM will make that DRAM ultimately more expensive in some way. Maybe it will decrease the density, maybe the DRAM vendors will have to add extra memory or extra sort of counters to keep track of who’s accessing what. And the market forces in the DRAM world are in such a way that they need to use every single piece of real estate they have to just cram more and more cells. And that’s where the security research community comes in where they keep sort of reverse engineering and trying different things and they’ve found ways to go around those mitigations and show new forms of attack. So, to be fair to them, it’s sort of also a business thing. Like I said, people knew about Rowhammer and there were discussions in their sort of standardization body – that is an organization called JEDEC where a lot of hardware vendors and software companies actually participate – there’s been a lot of discussion over the years on implementing a solution and in fact, that’s what they’re doing now. There won’t be a single solution for Rowhammer that will work for every single type of memory out there and for which every single hardware vendor will be willing to actually implement. 
So our philosophy is that what we would like to see in the standard, rather than describing the solution for Rowhammer, what we would like to see is describing extensibility mechanisms that companies, hardware vendors, can implement their favorite form of mitigations, the one that works best for their particular type of memory by leveraging these extensions. So that’s what we’re trying to sort of change and shift.

Host: In light of all that, Stefan, tell us about your most recent work that involves what you called an end-to-end methodology to help cloud providers determine if they’re susceptible to Rowhammer because that’s that upstream approach that you’re talking about instead of the patch afterwards that’s impossible. So in the context of our cast of characters, and against the backdrop of computer memory solutions that have trust issues, tell us how you are attacking this, what your methodology is, how successful it is, and what are the key challenges that you face?

Stefan Saroiu: So, what we tried to do was we tried to help the software company by building a systematic and scalable testing methodology to test whether your DRAM is susceptible to Rowhammer attack. And to build such a methodology, you have to overcome two practical challenges. You have to devise a sequence of instructions that your processor executes, that hammers the memory at the fastest possible rate. You want to create what we call the “worst-case testing conditions” for memory. The second thing you want to do, you want to know where you’re hammering. Remember I was telling you how DRAM disturbance occurs to rows that are adjacent to the row you’re hammering. Well, the rows that are adjacent are the worst affected, but even rows that are nearby are affected. So, a row that’s sort of two rows away, like the next neighbor or something like that, can be affected, but it’s very difficult to affect a row that’s very far somewhere inside of your array. So, you have to actually know, what is the row-by-row layout of your DRAM chip? And this is, in fact, a trade secret. What we did was we built a hardware fault injector that allows us to… you can think of it like short-circuiting the memory in such a way that we can actually always create these Rowhammer attacks by not letting the memory refresh itself. So, if you hammer a row and the memory never refreshes, you’re going to flip bits eventually because the capacitors will lose their charge. Then you go and study the patterns of how these bits have flipped and that tells you about the layouts of the cells inside of the DRAM. Because guess what? The row you hammered, most of the bits that flipped are going to be in its adjacent rows. And there will be some bits flipped in the next to the adjacent rows, and then fewer bits and so on, so you create these kind of heat maps so you can reverse, really, row-by-row adjacency by this form of short-circuiting the memory or sort of suppressing refresh commands.
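
The heat-map idea described above can be sketched in a few lines. This is a minimal illustration, assuming you already have a list of the logical row addresses where bit flips were observed after hammering one row with refresh suppressed. The function name `infer_neighbors` and the sample counts are hypothetical and are not taken from the team's tooling or data.

```python
from collections import Counter


def infer_neighbors(flipped_rows, hammered_row, top_n=2):
    """Rank logical rows by observed flips; the top rows are the likely physical neighbors."""
    counts = Counter(row for row in flipped_rows if row != hammered_row)
    return [row for row, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    # Made-up observations: logical row 8 was hammered. Most flips land in
    # logical rows 7 and 11, which suggests those two are physically adjacent
    # to row 8 even though row 11 is not its neighbor by address.
    observed = [7] * 40 + [11] * 38 + [6] * 5 + [12] * 4
    print(infer_neighbors(observed, hammered_row=8))
```

Repeating this for every hammered row would give a row-by-row picture of which logical addresses sit next to each other, which is the kind of map the conversation describes as a trade secret.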

Host: Okay, so you’re reverse engineering to find out what’s going on?

Stefan Saroiu: Yes, we have a methodology… our methodology can reverse engineer every single DDR4 DIMM in the world. And what you actually end up discovering when you’re reverse engineering is that these maps change. They change from one vendor to another and they can also change from one DIMM to another depending on the DIMM’s revision. It’s called Post Packaging Repair. So, we can actually also measure how many fixes that DRAM has had before it was shipped to you.

Host: Okay. So this methodology has to attack, for lack of a better word…

Stefan Saroiu: Yes, yes.

Host: …every vendor’s particular proprietary chip…

Stefan Saroiu: Yes.

Host: …and within the vendors, there’s different chips as well. So, you’ve got a lot of things you have to be looking at. How’s it working so far?

Stefan Saroiu: Now, that’s a great question. It’s very difficult to test every single chip that a cloud provider has. So, instead what we’re doing is we are mapping the DRAM fabrication process for different DRAM devices and for different vendors. And then within those buckets, we sample and we test. And we actually, what we do, we look at the trends. And we want to make sure that the workloads that we see in the cloud will not generate activations that will actually start flipping bits.

Host: Right.

Stefan Saroiu: Because I was telling you, the DRAM gets worse over time, not better. So in fact, we can even, by waving our hands a little bit, we can predict how many years in the future it’s going to be until we’re going to see workloads reach a point where, just by using the memory, they’re going to start flipping bits. And it’s our job to influence sort of a more principled approach to fixing the problem rather than a band-aid approach and also to keep an eye, not just for Microsoft, but the entire cloud industry, as to at which point we’ll have to do something to make sure that the workloads are not going to actually start causing bit flips. So for example, one of the things you might want to do when you actually detect that a virtual machine starts accessing the memory in a way that might actually induce bit flips, you could try to slow it down or migrate it to a new DRAM or do something like that.
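
The "slow it down or migrate it" idea can be sketched as a simple policy check. Everything here is an assumption for illustration: the conversation does not describe how activation counts would be collected, and the names `HammerPolicy` and `decide`, along with the threshold values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HammerPolicy:
    """Hypothetical per-row activation limits within one refresh window."""
    throttle_at: int   # start slowing the virtual machine down
    migrate_at: int    # move the virtual machine to different DRAM


def decide(row_activations: dict, policy: HammerPolicy) -> str:
    """Pick an action from the most heavily activated row in the window."""
    hottest = max(row_activations.values(), default=0)
    if hottest >= policy.migrate_at:
        return "migrate"
    if hottest >= policy.throttle_at:
        return "throttle"
    return "allow"


if __name__ == "__main__":
    policy = HammerPolicy(throttle_at=10_000, migrate_at=40_000)
    print(decide({0x1A: 500, 0x1B: 900}, policy))
    print(decide({0x1A: 12_000}, policy))
    print(decide({0x1A: 50_000, 0x1B: 3}, policy))
```

Keying the decision on the single hottest row reflects the point made earlier in the conversation that the damage comes from repeatedly activating one row, not from overall memory traffic.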

Host: Yeah, so there’s more solutions than just fixing the chip?

Stefan Saroiu: Right.

Host: There’s other mitigations.

Stefan Saroiu: In fact, I think the solutions will have to span the entire stack. There’ll be some fixing the chip things, but those fixing the chips have to sort of be programmed or used by things higher up, both by the CPU and by the software.

Host: Well, that leads well into the next question. There’s a lot of people that need to be involved in this so, tell us a little bit about who you’re working with as partners and what kinds of cooperative expertise do you need to solve for X in the systems security equation?

Stefan Saroiu: None of these works I’ve been describing is mine alone and in Microsoft Research, I’ve worked with a small team of very talented people who have expertise that is very, very different than mine. You know, I come from a computer science background and I am not sort of equipped to short circuit memory or anything like that. And in fact, until a couple of years ago, I really didn’t understand how DRAM works very well. So, we have that sort of expertise in Microsoft Research to basically build hardware prototypes that actually can inject these sort of failures into hardware, into the memory.

Host: Okay.

Stefan Saroiu: And then I also collaborate very strongly with a group of wonderful engineers in Azure. There is a group called the Next Cloud System Architecture, or NCSA, and these folks have decades of expertise in understanding how DRAM works, and in working with JEDEC and with the memory vendors, and they’ve been very good in two ways. One was in describing to us how memory works in ways that go beyond what the manual can teach you, and sort of what the concerns are and the limitations, and the forces that act when you actually build these circuits in practice. And the second way that they’ve been very helpful was that, when we interact with JEDEC, and we try to sort of make the shift, they’ve been very good at coaching us on how to put forward that proposal in a way that’s more amenable.

Host: Who are your other kind of big partnership associations? Are you working with other academics? Are you working with other industry? Are you working with other cloud providers?

Stefan Saroiu: We’re lucky that we have very strong collaborations with two top academic places. One is ETH Zurich and the other one is at Max Planck in Germany. And we are working closely with companies that can be massively affected by Rowhammer. When security researchers actually go and sort of find a new way to actually attack the memory, they go through a process that’s called Responsible Disclosure. So, what that means is that they will not make their findings publicly available, but instead they’re going to reach out to all the industry involved and describe their findings and give some time to the industry to form a response. And once this period has ended, then the research becomes public. So when these new forms of attacks came about, there was a group of companies formed that started looking at these problems again and the first thing that they had on their mind is, like look, you know, so we have these research results, but can anybody go and independently validate them on their hardware? So, we at Project STEMA were the first to actually validate these findings on server-grade hardware… hardware that is run in the data centers.

Host: Right. Well, it sounds like you all have the same goals. You don’t want things to break. You want things to work well. Ultimately, you want customers to have safe data and things not to break…

Stefan Saroiu: We do, and in fact, I was mentioning how, for Rowhammer and for testing memory, understanding row by row adjacency is very, very important. And I also said how DRAM vendors do not want to reveal this information, okay? In fact, the extensibility mechanisms we proposed for people to build their own forms of Rowhammer mitigations, from the beginning, we designed them in a way where DRAM vendors do not have to tell anyone these adjacency maps. In fact, there is a large swath of Rowhammer mitigations that people have proposed over the past five or six years that all rest on the assumption that the software company will have complete access to this information and these companies are very reluctant, so rather than mandating that, or forcing them to do something they don’t want to do, instead, we designed this with the assumption that hey, you guys don’t have to tell us anything. And we think that this has a much better likelihood of being adopted in practice and then again, not mandating the solution, letting people build their favorite solution for the kind of hardware they want to use. You know, the Rowhammer solution that you build for the DRAM in a server running in Azure is very, very different than the Rowhammer solution that you should build for an IoT device that has a little bit of DRAM and runs a little bit of code.

(music plays)

Host: We’ve reached the “what could possibly go wrong?” segment of the podcast where I ask all my guests what keeps them up at night, and so, despite the fact that the majority of your work could actually be classified as a research response to what keeps us all up at night, sometimes the so-called solutions actually present new problems. So, do you have any concerns about the work you’re doing and if so, how are you addressing them?

Stefan Saroiu: My concern is, I was telling you how we have this hardware out there, whether it’s DRAM, whether it’s CPUs, whether it’s, you know, chip sets, whether it’s GPUs, so on and so forth, and this hardware is very, very complex. And one of the things that we’ve learned is that in the quest for higher performance, we have designed this hardware in ways that can be exploited, and these exploits are done in ways that we never thought possible before. And what keeps me up at night is, I don’t really understand everything that’s going on in a DRAM device. What if there’s another way out there that you can actually mount these attacks in very, very simple ways? And unfortunately, with a cloud and with the consolidation of the entire computing power into data centers, I’m concerned that we might have an event that wipes out an entire cloud, that wipes out, you know, a big part of our infrastructure, that shuts down the entire internet. And you know, we’re going to have exploits and little things here and there. We’ve always had those. We’re going to continue to have them. But we’ve really never had a single wipe out event. So, having a wipe out event would be quite, quite concerning.

Host: So how are you thinking about that as a researcher? I mean, it’s one of those giant problems which you think I can’t possibly solve this, but are there any strategies that come into your head, given the kind of work you do, that say, hey, this is how we might try to mitigate such an event?

Stefan Saroiu: It’s a very hard problem. You’re absolutely right. My part comes when it’s about DRAM, because I understand DRAM better than many people, and my part is to make sure that there won’t be a wipe out event happening because of DRAM… From the point of view of DRAM, I hope that my work plays a role in that.

Host: That’s a beautiful way to frame it, Stefan, because as you point out, there’s many fronts in this war against, you know, people that are working to keep things safe and secure and people that are working to tear things down. So, you say, hey, at least on my watch, the DRAM part is going to be good!

Stefan Saroiu: That’s right.

Host: I love that.

Stefan Saroiu: That’s right.

Host: Well, I don’t want to let you go before we talk a little bit about industry standards and the tension between gatekeepers and practitioners in a time when technical innovation is moving so fast that regulatory bodies have a hard time keeping up. So, what are the key challenges to organizations like JEDEC in your field and how would you frame the role of these kinds of gatekeepers in the future?

Stefan Saroiu: So, the industry is changing at a fantastic pace and the role of JEDEC was actually to standardize how memory is used. At least, let’s talk about DRAM. They actually standardize things other than DRAM, but DRAM is a big part of it. And in the 80s and in the 90s, we needed that because we were building PCs and we had a massive number of stakeholders, of people who were actually building all sorts of hardware components, and you wanted these hardware components, when you put them in a box, to all work together. Now, there is a little less of that. There’s more of data centers and there’s no need for the computers that Google puts in their data centers to make sure they work with the computers that Microsoft puts in their data centers. All they have to do is to offer a software platform that is common enough that people can actually use it. But because we see this consolidation, I believe there is less of a need for standardization happening because if a cloud provider buys memory from three different vendors, all four parties can agree on how they build their hardware in using their data center. So, I believe we’re going to see an increasing amount of fragmentation that way.

Host: Well, so, what does an organization like JEDEC have to do to stay alive?

Stefan Saroiu: I think JEDEC has to shift from viewing themselves as a body that specifies what the functionality of the hardware is to one that specifies mechanisms that are flexible enough to allow increasing amounts of innovation from the different stakeholders in place.

Host: Well, every researcher has a unique life story and it’s time to hear yours, but I want to preface this by noting that you’ve gone to the trouble of including what I would call a scholarly genealogy on your personal website, so we can trace your academic ancestors back to, on one side of the family, the 17th century! So, tell us your story, Stefan. Upon whose shoulders do you stand, both personally and academically, and how did you get where you are today from where you started back in your early years?

Stefan Saroiu: You know, when I look at it, it feels very humbling. Every once in a while, I go back and take a pause to reflect because you look at those names and you go, oh, gosh, those are some big shoes to fill! I had the opportunity to work with two different advisors in my graduate school and therefore I have two genealogies. And like you said, one goes back to the 17th century and includes names like Karl Jacobi and John McCarthy. And Jacobi was mostly known for math and things like elliptic functions and number theory, and John McCarthy is known for being one of the founders of AI. John’s student was Barbara Liskov, who just won the Turing Award a couple of years ago and she’s sort of a role model for me. So yeah, I feel very, very humbled and I look at that periodically to sort of remind myself as to where the bar is!

Host: Right? So, tell, tell us a little bit about you, then. You’ve come into this in the 20th century and have started to make your own mark. How did you get from A to B?

Stefan Saroiu: I was born and grew up behind the iron curtain in Bucharest, Romania, and I was very lucky to attend a high school that was very strong academically, especially in mathematics and computer science. So, in the late 80s and early 90s, I was sort of studying things like algorithms and graph theory and programming languages and so forth and we also did a lot of math. And in some sense, I was very lucky to have this training in math and computer science, but it came at a cost that I am terrible at anything else! And when I finished high school, my family decided to immigrate to Canada and they went to Calgary, Alberta, and when I got there, Canada had a program where immigrants who lacked English skills, they would be enrolled in free English as a Second Language kind of form of schooling and I spent three months going to English school with my mom and dad. And my mom and I were in the same class, actually. We were in Level 3 and my dad was in Level 1. And I remember… I remember very clearly, sort of, meeting my dad in the hallways during lunch and having lunch together. It’s – yeah, this was…

Host: I’m dying!

Stefan Saroiu: …I was… I was about nineteen, yeah, I was nineteen at the time. A couple of years later, I went to college at the University of Waterloo, which is a great school in computer science in Canada, and Waterloo has this wonderful co-op program where, as part of graduating college, you have to actually go do internships in industry. And I was lucky to do three internships with Microsoft back in the 90s. And I came to Seattle, and I saw Seattle, and when time came to go to graduate school, one of the places I applied to was University of Washington in Seattle. So, I came to U Dub, and once I graduated, I wanted to join academia. So, I went and took a job as a professor at the University of Toronto. And at Toronto I worked with some fabulous students. And then after a couple of years at Toronto, I kind of started missing the intensity of the west coast and the tech scene here. One of the things I didn’t realize before leaving Seattle is that the west coast is really one of the best places on earth to be a computer scientist because you meet a lot of people who understand what you’re doing and speak your language. And I talked to some of my friends and they offered me to come interview at MSR. I knew Seattle. And I came, and I never left, and I had a wonderful time!

Host: What is one interesting thing we might not know about you, maybe it’s a personality trait, a defining life moment, hobby, side quest that has impacted your life or career?

Stefan Saroiu: A lot of computer science researchers in the US who are not born here, they came here to do their PhDs. But I came much, much earlier, and I did not come with a plan to actually continue my education or pursue any advanced education. So, I have a lot of immigrant stories and I think a lot of those sort of have marked the way I think. People had a hard time working with me because I would insist that we work as hard as possible, and we never have any moment of relaxation or anything like that and I really, truly believe that that’s actually very detrimental to a researcher. A researcher has to be a little bit more balanced and I remember very clearly, Steve Gribble, one of my advisors, telling me over and over again, Stefan, it’s not just about working harder. It’s also about working smarter. And it took me a long time to understand what he meant. You have to let time allow you to have the flow of creativity and see things in ways that maybe others haven’t seen them before.

Host: As we close, I’d like you to take a shot at painting a picture of a future world in which you’ve been wildly successful. At the end of your career, what do you hope to have accomplished as a scientist and how will your research have made a difference in our lives?

Stefan Saroiu: Thank you for asking this question. I actually thought about this and I’m thinking quite a bit about it. If you had asked me this question ten years ago, my answer would have been, I want to make sure that my research work is being used by millions of people. And I was very fortunate to be able to accomplish that at MSR, not just once, but maybe a couple of times. So, going forward, again, I see my role as keeping an eye on making sure that we avoid a form of a wipe out event, at the very least, that exploits some form of DRAM. And if, you know, a decade or two from now, we manage to say, hey, yeah, you know, we’ve had compromises here and there, but for the most part, the internet worked really well, cloud computing worked really well, you know, AI worked really well… then I hope that at least a little part of that was due to my work as well.

Host: Stefan Saroiu, I for one am glad you’re doing the job you’re doing. Thank you for that and thank you for joining us today on the podcast.

Stefan Saroiu: Thank you and thank you for your insightful questions, Gretchen.

(music plays)

To learn more about Dr. Stefan Saroiu, and the ongoing fight against Rowhammer attacks, visit Microsoft.com/research
