PARSING: Using SAS When the Data Is Hiding in a Nonstandard Format

Just another WordPress site

PARSING: Using SAS When the Data Is Hiding in a Nonstandard Format

Ladies and gentlemen, welcome to SAS Global Forum 2020 virtual conference This is going to be a presentation of parsing when using SAS when the data’s hiding in a non-standard format Who am I? I am Andrew T. Kuligowski, and I have been using SAS for quite a long time I am listed as an independent consultant In fact, I only take assignments these days if I am interested, which means I get to spend a lot of time practicing all of this self-quarantine at home, and I have been doing it for a few months now just in case I had to do it for real I want to stress that this is a very abbreviated look Even by abbreviated standards, this is going to be an abbreviated look What is parsing? I spent a lot of time on the web looking at various definitions, and some of them came close, but I coined this one for the purposes of our presentation Parse, the analysis of a string of characters and subsequent breakdown of those characters into a group of components What does all that mean? Well, let’s take a look and find out We are going to have to do some human analysis first We have to determine, what in our file is considered noise, and what is considered useful data? A lot of people use the word, signal And I’m going to use the word, signal, here as well to represent the data that we want to keep However, I also want to point out that we have identifiers Identifiers help to determine the difference between data and noise This is what’s going to tell us, where does our signal start, where does our signal end? We want to reject the noise We want to keep the useful data We will reject the identifiers eventually but we can’t do it yet Here is a file Ironically, this file was done last year for a presentation in China And it has to do with medical records for a group of patients Of course, we had no idea how timely this would be, but we decided to keep the example for the purposes of this presentation Taking a look at this letter, we will see that the beginning with the salutation and the address and the ending, this is considered noise There is no data in here that we are going to store, report, analyze, whatever So our routine is going to be rejecting this The part that I have highlighted in yellow is useful data This is where the data exists, and this is where the indicators exist Let’s take a look at what we are going to do to find, identify, keep, and reject If we look at the signal part of the data, the useful data, we will see that we begin with the string, Patient A And we end with a blank line So we are going to write a routine that reads every line It rejects the lines until we get to Patient A, and then it keeps them until we get to the blank line Very simple routine Let’s take a look at a couple of the tricks that we are going to use in this routine Null input I used to say that null input was the fifth of four input statements Some people agree, some people disagree It used to be the only time you wanted to use null input was to close off a trailing at sign However, there is a second benefit to that It will populate and temporary SAS fail– variable, rather, excuse me– called _INFILE_ _INFILE_ will contain the line of data that was just read

by the input statement The beauty of this is that now we have a string that contains one line of data We don’t have to define that string We don’t have to worry about how long that string is We don’t have to worry about anything because all of this has occurred behind the scenes Rather than writing an input statement where we are keeping a variable that we create, we’re going to let SAS do it for us We’re also going to use the Index function to search for the string we want to keep We are going to search, quote unquote, a needle in a haystack Index is going to return the starting position of wherever it finds that string within the full string called haystack It will return a zero if the string is not found And as you can see in the code, we are defining this with an if statement using SAS’s internal 1-0 representation for true-false Little concept This is one of the best practices that I have employed for decades– I will admit it– decades _IND Any variable that I have in my code that ends with _IND I know is going to be a true-false indicator This way, when I am debugging my code later or if I have to enhance it, it helps me understand what it was I’m doing I do not know if any of you have ever had to go to your boss and say, look, I know you wanted me to fix this code but I have no idea what it is that they are doing And then had to say further, and I’m the one that wrote it That’s not a good thing So anything you can do to make your life easier is going to be of a big help to you And in fact, speaking of debug, something went wrong We only have one observation Normally, I would be sitting here asking for input from an audience Since we’re doing this virtually, I’m going to cheat I’ll tell you what it is There is a blank line right after Patient A. We did not account for this in our code Therefore, we told SAS to start recording, basically, when we hit Patient A and stop when you get to the blank line, which was the next character So we told SAS not to bother to read anything We can fix that We are going to make a few changes to the code Where we have RETAIN Output_IND=0, we are going to add a second variable– RETAIN Blankline_Cnt, and we’re also going to set that to 0 Then, down towards the end of the code where we’re watching for the presence of a blank line in INFILE, we are going to, instead of just outputting the line and changing the code to say, stop recording, we are going to put a whole if statement in there We’re going to keep track of our blanklines If the blankline count is 1, we’re going to increment it Either way, we’re going to increment it actually And if the blankline is 2, then we’re going to say, you know, we get the second blankline That’s the point where we are done But set the indicator to 0 and let’s stop recording Hopefully, if we run this code and we’ve done this correctly, we will now capture the paragraph that we were looking for By the way, I do want to point out, just as I have a variable called _IND, I have a variable called _CNT _CNT for count is another little internal thing I’ve done for years in which I say, any variable ending with this is

a counter So I know it is going to contain whole numbers keeping some kind of count or another And when we run this thing, I did not store the output, but trust me on it, it looked a lot better Here is that paragraph again that we’ve captured The next thing we are going to do is we are going to convert it to one observation per sentence Technically, we could probably write some code that would do a lot of parsing on these sentences and not have to do this But remember, we have to also be able to understand our code and debug our code, and I found that it’s much easier to do that if we are dealing with complete sentences Therefore, what we are going to do is we are going to use the period to denote the fact that we have the end of a sentence and we will begin another one Simple We have covered this in English class going back to when we were kids Is it so simple though? Let’s take a look We’re going to write some code And in this code, you’ll see that what we are going to do is we’re going to keep track of partial sentences, and we’re going to see if there is a period in the output line If there is a period in the output line, we are going to make a complete sentence We are going to take the partial sentence, and then we are going to append to the piece of the code starting at 1 and going until the position where we found the period If it is not, then we are going to append– well, actually, let’s take a step back Now that we’re done there, we are also going to say, take a look at the code, and we’re going to say, we have to start a new partial sentence Everything after the period is the next partial sentence And by the way, we better take a look and see if we found the period again, just in case we have a very short sentence Now what we’re going to do is we are going to take the partial sentence that we’ve got, we’re going to append the output line to it so we now have a longer partial sentence Why do we use two separate strings? Well that is, notice that we are only keeping complete sentences This way, we don’t have to worry about a bunch of observations containing partial strings I told you it might not be so simple Normally in a hands-on workshop I would let everybody discover the fact that this code would not work correctly I’ll tell you right now, though, save some time, there there’s a trick English class taught you about periods Math class taught you about decimals The problem is, they’re the same character If we do not allow for decimals, then our code is going to produce some very wonky results How are we going to do that? Simple All we have to do is look for a period followed by a space This way we know that we are dealing with the end of the sentence There’s a couple of people who say, yeah, but Trust me on this I will show you the code is simple enough right now There are some exceptions We’re not going to code for them right now, but I will denote them later Let’s take a look at some concepts here, some character functions that we might need Remember we talked about index before? But we also have a Find function The Find function is also going to search for a string, which we’re calling needle in a string, haystack And just like index it’s going to return a one or a zero Actually, it’s going to return a position or a zero– the position where the code started if it finds needle, or a zero if it doesn’t find needle

So why would we want to use Find instead of Index? There are some differences Find can actually look for multiple strings Find can trim trailing blanks And Find can also optionally include and ignore the case Trim I mentioned Trim earlier Trim removes blank characters from the end of a specified string And the double bar that’s in the code that I’ve mentioned is going to append the second string to the first string Why do we need Trim then? Because we have trailing blanks The trailing blanks are going to be– excuse me We have a cat down here Come on, guys, leave me alone, OK? We have a double bar, and that’s going to do the appending What’s going to happen, though, is that it’s going to append where the blanks end off The problem I have with those blanks is we don’t want to keep them We want to have a smooth transition So we’re going to have to trim that first string to get rid of the trailing blanks Somebody out there who’s probably paying attention is raising their hand saying, yeah, but what about all the CAT functions that Rick Lansing used to tell us about? I do mention the CAT functions in the extended version of this talk Hopefully, again, if we get to do the 90 minutes someday, we will talk about the CAT functions Substring I believe I mentioned Substring earlier Substring is going to take a character value called string We are going to take, starting at a start position, and go for a length of length, L– Len, we called it here If you do not specify len, then the Substring function will assume you want every character from start position to the end And of course, if you were to say, give me everything from 1 to len, you’d basically be saying an equivalent and you’re going to be taking some extra overhead Don’t do that Substring, by the way, is a very unique function It can be specified on the left-hand side of a character assignment So you can actually use Substring to change part of a string Normally, in most civilized coding languages, this is a much more complicated effort to have to change the middle of a string SAS allows us to do that by employing the Substring function on the left-hand side of the equal sign And here are the results You can see when we look at this string right now, we have separated everything into five rows We have one column because the only variable we kept is complete sentence I mentioned before, there were some exceptions The exceptions that I have are that the routine will not handle the period if it’s the last character of a string So if your sentence ends at the end of a string, this code will not handle it We will have to fix that if we ever do a more complex version of this code For the sake of a short presentation, though, we chose not to do it Again, now that we have everything put together in a string with complete sentences, we can now begin to actually parse out data The complete version of this talk will actually show us how to take each sentence, how to use indicators to determine where variable and values start, where variable and values end, and whether or not they are present in each sentence, just in case somebody seems to have forgotten one All right This, as I’ve mentioned, is an abbreviated talk We’re not going to have time to go into it, but what we would have preferred to do is to give you a lesson on many character functions

Each one of the variables that we are going to read in I tried to use a different technique so that we would get exposure to multiple versions and multiple character functions and multiple ways to use them In conclusion, sometimes your data is going to come in a non-standard format It might be necessary to even find the data within that format You can’t clean the data until you know what the data is You can’t process the data until you’ve cleaned it I would strongly suggest that before you begin a process like this that you have an understanding of character functions Character functions are the tool kit which is going to save your professional life, if you have to write something like this However, the internal browser– I always keep Chrome, I’m using Chrome for the moment– keep it up and running and pointing at the SAS documentation for the character functions whenever I do this because sometimes I want to use an option that I’m not too familiar with or sometimes I won’t even remember what the character function is that I’m trying to do, and I will see if there is maybe something that’s already written rather than my having to go dig for it One other thing I’ll add there is that if I can’t find a character function, lexjansen.com is a good place to go looking lexjansen.com is where all of the SAS presentations, the SAS papers that have been done at Global Forum, at the various regional conferences, and many other user-based SAS conferences can be found I will search for what I am trying to do there and see if it can be found Thank you for joining me during this short, abbreviated presentation If you’d like to get a hold of me, you have questions, please try to reach out at the email address listed below Thank you