Monday, October 5, 2015

Examining Linux Process Memory: Part 2

Hi everyone!  This is the final installment in our look at Linux process memory.  Before we dive in, I want to mention one thing.  Going forward, I think the cadence of these posts is going to change a bit.  Due to some things going on, I am going to try to post once every two weeks for a little while.  This blog is not my day job, so I can only spend spare time on it.

With that out of the way, let's pick up where we left off last week.

Last week, we talked about procfs and how we can get information about processes running on a Linux box.  I took some of those ideas and wrote a Python script that looks in the dynamically allocated memory for a process (or set of processes) and extracts what we might find interesting (specified using regular expressions).  When I say dynamically allocated memory, I am talking about heaps and stacks (dynamically allocated relative to the program).  If you recall from last week, /proc/<pid>/maps also contains files that are loaded into memory to support the execution of the program.  Since examining the process memory of any process (yours or not) requires root on all of the systems I have tried this on, I figured it was not worth the effort to scan files loaded into memory.  In theory, if you are root, you can read those files through conventional means, so why spend cycles reading them with this script?  I have not found a good place to host the script (maybe GitHub or GitLab), so I will talk about the relevant pieces of the code for now and update the post with a link to the script wherever it will be hosted.

EDIT: Uploaded the code to Gitlab here.  Take a look!

There are two main things the program has to do; I will discuss each and then walk through the code:
  • Figure out what regions of memory are mapped for a given process
  • Look through those regions for the artifacts we want based on the regular expressions we have defined

Mapping a Process

In order to figure out which regions of memory are mapped for a given process, we have to look in /proc/<pid>/maps.  We could take everything we find, but remember we want to be somewhat efficient and only scan the regions which have the best chance of having what we are looking for.  To do that, we will look only for regions that have "[heap]" or "[stack" in them.  The brackets are important because we do not want to catch files with the names "heap" or "stack" in them, and leaving the closing bracket off of "[stack" also catches per-thread stacks, which show up as "[stack:<tid>]" on the kernels I have tried.

011ed000-012ac000 rw-p 00000000 00:00 0                                  [heap] 
7f659dbdf000-7f659dbf4000 r-xp 00000000 08:02 25303334                   /usr/lib/
7ffcc891d000-7ffcc893e000 rw-p 00000000 00:00 0                          [stack]

In the example above, we only want regions of memory that are heaps or stacks, and not the libraries that the program loads.  Let's take a look at the code that does this:

def FindMemoryRegionsForPID(self, pid, regions):
    """This function reads /proc/<pid>/maps and determines what regions we are going to read.

    pid: the PID we want to look at
    regions: the dictionary of regions we will update if necessary
    """
    # (assumes "import time" at the top of the script)
    print('[{0}] Scanning PID {1} for memory changes...'.format(
        time.strftime('%H:%M.%S'), pid))
    with open('/proc/{0}/maps'.format(pid), 'r') as regionInfo:
        for region in regionInfo:
            # Right now we are focusing on the dynamically allocated
            # sections of memory.  We could do every region, but since
            # many regions are simply imports of files, we want to focus
            # on the heap and stack where dynamically assigned
            # information should be.  We could easily remove the
            # limiter to return all regions.
            regionSplit = region.split()
            # The path name is the last column
            if regionSplit[-1] == '[heap]' or regionSplit[-1].startswith('[stack'):
                regionStartEnd = regionSplit[0].split('-')
                # The addresses are in hex, so we cast them to base 16 (hex) ints
                startOfRegion = int(regionStartEnd[0], 16)
                endOfRegion = int(regionStartEnd[1], 16)
                if startOfRegion not in regions:
                    print('[{0}] Found a new region in PID {1}'.format(
                        time.strftime('%H:%M.%S'), pid))
                    regions[startOfRegion] = {'size': endOfRegion - startOfRegion,
                                              'hash': '',
                                              'lastHash': ''}
This code (and the code for the other main function of the script) are part of a class I wrote so that a thread can handle each identified PID.  So the basic idea here is that we read /proc/<pid>/maps, and for each line, we identify whether it is a heap or stack; if it is, we add it to a dictionary that records the regions we are interested in.  Since the number of stacks can grow over time, we need a way to store the ones we want so that we can pass the list off to the function that reads through the memory.  The most interesting lines here are the two where the start and end offsets are stored.  If you look at the example above, you can see the format of each line.  The two memory addresses are in the first segment, separated by a '-'.  If we cast those integers as is (without the 16), Python will try to interpret them as base 10 and fail because the letters A-F are not part of base 10.  By telling Python we will be giving it base 16 (hexadecimal) numbers, it can handle them appropriately.  Also notice that we are not storing the end of the region in our dictionary, but rather the size.  For our purposes, we do not need to explicitly know where a region ends, just where it starts and how large it is.  You will see why in the next section.
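To make the base-16 point concrete, here is a tiny standalone sketch of the parsing, using the [heap] line from the sample maps output above:

```python
# Parse the address range out of a /proc/<pid>/maps line.  The example
# line is the [heap] entry from the sample output earlier in the post.
line = '011ed000-012ac000 rw-p 00000000 00:00 0    [heap]'
addresses = line.split()[0]        # '011ed000-012ac000'
start, end = addresses.split('-')
startOfRegion = int(start, 16)     # base 16; the letters a-f would
endOfRegion = int(end, 16)         # break a plain base-10 int() call
print(hex(startOfRegion), endOfRegion - startOfRegion)  # → 0x11ed000 782336
```

Note that we print the size (end minus start), which is all the scanner needs alongside the start address.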

Finding The Artifacts

Now that we know where the regions we care about are, we have to read them.  This can be done by opening up the process memory (/proc/<pid>/mem), going to the start of the region we care about, and then reading a number of bytes equal to the size of the region.  After that, we will have a bunch of bytes that we need to make sense of.  Since regular expressions operate on strings, we have to decode the bytes into a string which we then pass to the regex.  Let's take a look at the code snippet:

with open('/proc/{0}/mem'.format(pid), 'rb', 0) as memory:
    memory.seek(regionStart)
    rawBytes = memory.read(regionData['size'])
    regionData['hash'] = hashlib.sha1(rawBytes).hexdigest()
    decodedBytes = rawBytes.decode('utf-8', errors='ignore')

results = re.findall(filter, decodedBytes)

The first line opens the memory "file" as unbuffered (the '0') bytes (the 'b').  We then seek (go to) the start of the region, read a number of bytes equal to the size of the region, take the SHA-1 hash of the region, and then decode the bytes.  I take the hash of the region because if it has not changed since the last pass, there is no reason to scan it again for artifacts.  When we are decoding the bytes, we want to make sure they are decoded as Unicode (in case there are Unicode strings we want to look for) and ignore any errors (characters that cannot be translated).  Finally, we find all of the instances that match the regex filter in the decoded memory region.  The actual code supports multiple regex patterns.
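The hash-gated, multi-pattern scan can be sketched as a small standalone helper.  This is my illustration of the idea, not the script's exact code; the function name and shape are hypothetical:

```python
import hashlib
import re

def scanRegion(rawBytes, regionData, filters):
    # Only rescan a region if its contents changed since the last pass.
    newHash = hashlib.sha1(rawBytes).hexdigest()
    if newHash == regionData['lastHash']:
        return []                  # region unchanged; skip the regexes
    regionData['lastHash'] = newHash
    decodedBytes = rawBytes.decode('utf-8', errors='ignore')
    results = []
    for pattern in filters:        # the real script takes multiple patterns
        results.extend(re.findall(pattern, decodedBytes))
    return results

# Example: look for something shaped like a social security number
regionData = {'lastHash': ''}
memoryBytes = b'\x00junk 123-45-6789 junk\xff'
ssn = r'\d{3}-\d{2}-\d{4}'
print(scanRegion(memoryBytes, regionData, [ssn]))  # → ['123-45-6789']
print(scanRegion(memoryBytes, regionData, [ssn]))  # → [] (unchanged, skipped)
```

The second call returns nothing because the hash matches, which is what keeps repeated passes over a mostly idle process cheap.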

That is basically it though.  Reading memory in Python is very easy, and surprisingly does not require the use of ptrace.  If we wanted to use ptrace, we could write the program in C, or use Python ctypes to create a binding for ptrace.  If you are interested in seeing that, let me know, and we could probably do that for a future post.

Why is this tool useful?

You might be wondering why this tool might be useful to you.  If you have root on a box, you could lay down a keylogger or read whatever file you want.  But what if you want to capture user input that originates on a remote computer?  Let's say you have a web server (Apache) running on your Linux box where people are transmitting sensitive information (like banking / financial information or their social security numbers).  Without getting access to the client's computer, you would not be able to capture that from the server unless you were reading memory.  You might be thinking that the data is surely encrypted when it is sent to the server, so this tool will only ever see ciphertext.  But for the data to be acted on (sent to another server, a back end database, whatever), it has to be decrypted at some stage.  If you are monitoring process memory and the timing is right, you should see that data.  So yes, this tool is of limited use, but it is an interesting way to get a sense of how memory allocation works in Linux.

What do you think?  Is my approach too naive?  Let me know in the comments.  Thanks for reading!
