So the next topic is the merging or joining of datasets. And our datasets are important as data frames, so it's the merging of data frames and this is obviously quite an important topic because often we have different data in different places and somehow we want to bring it together and the. The main command to use in pandas for this is merch. This is similar to the join. Operation in the SQL programming language, so the SQL database querying language. So if you go on to study SQL, learn about SQL, then the operations that we do is merge. So in in the merge outer merge left, right merge and so on. Things you can also do with the joint operation in SQL. So, um. Let's construct 2 dataframes, DF-1 and DF2 and merge them and see. See what that does. So we have a dataframe here. With two columns. We are constructing it form a dictionary so dictionaries have keys and values. And the key names become the columns. So we have two columns here, one called key, the other called data one. And. Key is a column going to be a column comprised of these string. These character string character strings. Um? And data one is going to be #0 to 6. Meanwhile, DF two is dataframe with again two columns has colon Corky like this one, but this time it has three. Strings 2 three string characters and. It has a range of values from zero to two as the other column. So the two data frames look like this when they're printed. And when we merge them the so when we use this pandas merge function. We put in the dataframes as the arguments and. It will merge by. Buy one of the columns in each data thing and by default it will find the one with the same name. If there is 1. And to merge on that so. Well, let's just run it. There are different types of merging. By default it's an inner merge, but so essentially I'd like to. Have two columns of data, so whichever way we merge would like to have two columns of data, one for data one and one for data too. And we'd like those to correspond to. The correct key. So the value 0 here. Is for a key value of a. So. If the key if the key value. Is a then data two should be zero and similarly if the key value is B. The value for data two should be one. So. Note here that D F1 comes first DF two comes second. So we're going to have a column with data 1 first and then data 2 second that's the way it'll workout. Um? But then how will this data for Day 2 the three rows of data for data to be incorporated with the data for data one? Well. We have keys AB&D here. For data 2. And in this column we don't have D, but we do have a B. Well, we do have A&B. We also have C for which there isn't any data too. So. For the key of B, the data. In this view, or for this entry, the data for data one is 0. The data for data to the value for data two is 1, so there's only one value for B in data 2, and that's the value one. So in this second column of data, so in the column of data for data two we would have a value 1 here. So again, we have a key B. This time we have a different value for data one. So one rather than zero but for data 2. There's just this one value, so we'll have one again. Forky of a. Data data one value is 2 here. And for data two, it's zero, so will have a 0 here. No. If we do an inner merge, then in a case like this where we have a key. Of see this that has a value in this state frame, but does not have a value in this data frame. Then that key will be excluded from the results of merging. So in other words, when we do this Merge D, F1 and F2. There will be no values for the key of CC will be removed. So we won't have see, but we move on to a then and he did have a value of two in the century, but in this century has a value of four. For data 2, there's only one possibility. It's a 0, so we will have zero here. We have another value for the QB for data one. Again, we'll just have a 0 for data to this value. And finally, this is a key B. It's had value zero and one note has a value of 6 for data one and four data to it always has this value one. And then there's the issue of this key D in the second data set. Well, there is no key of D in the first date set, so if we are in emerging then there will be no entry for the key of D. So we will simply have as it appears here. Keys and values. I mean values for the keys A&B because they are the only two keys that are common to both of the original data frames. So if we did an outer merge. Then this value this key of C. It has a value of three for data one it doesn't have a value for. For this data frame. So the stage frame doesn't include the key, see, so we would just get some missing entry. So similarly. We have a key of D here. In the second data frame. But we don't in the first data frame. So in the results we would have the key of D appearing, but it would have a value of two in the second column, but nothing in the first column because there's no key of D in DF-1. So we come to outer merge later. How to define that but? We also have left and right merge, so if we do a left merge where we entered the F1 first the F2 seconds. Then we will end up with two columns of data for data, one data 2, but only using the keys that appear in the first data frame. So we would have C would have a B&C. But in particular, we would have see, even though there's no data corresponding to that in the second data frame. But we wouldn't have T because D doesn't appear in the first one, and we're doing left much. So we take the keys from the first one. And similarly, if we do a right merge, then we take the key, we have keys AB&D, but we don't have see. So in a nutshell, that's emerging by emerge outer merge left and right. So another issue is that. We didn't specify a key when we did this, and I mentioned that it that the merge would be done on the. Column name that's common to both. So that's the default behavior when we just enter the two data frames that we want to merge so that the default behavior is to merge on the common column name and to do an inner merge. So generally we can also specify the key, so it's good practice to specify the column key. The key comp. Even so, in this case we have two dataframes with a key column called Key. And so we specify that the merge is too to be performed on this key. So you know common ones in in, in, in the macroeconomic context might be a country name or. In a microeconomics applied microeconomics context, it could be. A business name? And sometimes the key names. That we'd like to merge on are called slightly different things, but totally different. So we also need to be able to deal with situations where perhaps one of the. Columns that we'd like to merge on this call. Country ID in this way, with capital C, Capital ID and the other one is the same thing, but written in a different way. Because they could be completely different names as well. So in that case I'm going to. I'm going to talk about this, but you ascentia Lee. You specify a left on. And a right on. So the key name that you want to use from the first dataframe is left on column name. And so this is just the name that. Appears in here. And for the right data frame. The you have the right on column name. So. No, just like up here we had two dataframes with different columns of keys. But here we have the same key name. If these were different this would be the left on key. This would be the right on key. Rather than having you just on key. So we're introducing a dataframe, DF3 and DF4. And it's the same method of constructing dataframe. We use dictionary with column names and a list of values and another sequence of values here. And. There's just one that so. We have conveniently titled columns here. L key and R key, but you know they could be in practice. Something like this? And. We want to merge the two data frames on. On these keys, but they're called differently. They have different names for the columns. So like I said, we just specify that in DF three we're using this column for the keys and then DF four we're using. This column for the keys. So this is what I was saying earlier on. So the merge function by default takes the. Intersections of those for which there are values in the two data frames. So in other words, it's. It uses the keys that are common to both data frames. You can also specify this explicitly by saying that the. The method or the the? How is inner? Or you can set it to something else like outer or left or right. So before I go through this, I've realized that I didn't show the result of doing this up here, so we have. Emerging of this one and this one. They have three I4 where we are merging on these keys ABCD, but the columns containing the keys are called different things, so we have a left or right on. So if we run this cell, this is what we get. And so, um. Didn't really make it clear, but these. These values here are the same as we had for DF-1 and F2. So the results in terms of the data. Columns is the same as. Up here except that because we have two key column. Key columns we we end up with two key columns in the results, so we keep the. The L key column and the R key column because we can just drop one of the columns, it's not difficult to do that so. This can be done with what's what's mentioned down here, so the how equals outer left right? We can also do for this. Well, we have different column names for the keys as well. So I'm going back to DF-1 and DF2. We've seen the animerge that's the default. The outer merge looks like this, so this is just printing DF one and DF 2 so we can refer to it so. We have yeah, there's a slight difference in in how the result is ordered between the inner merge and the outer merge, so the in emerge would use the keys in the left column. I mean in the left dataframe first, so the. What I mean is that the order of the keys follows the left dataframe, but in the case of the outer merge, it's lexicographic ordering. Since we start with. So be here we do all of the bees, so we have all the bees and then the next key isn't a. So we do all of the A's. Then we have a C. So we do all the seas, but there's just one. And then finally we moved to the second data fameless another new key D and so we have this one here. So I believe that's the way that the data is ordered. Here I'll have to get back to you to confirm that, but the important thing is that the important thing is what data we have in the data frame. Because of course we can order the data anyway we like afterwards anyway. So what data do we have when we do an outer merge? We we have all the dates that we would have been in emerge. Plus the extra data for keys that were in one dataframe but not for. I mean for keys that had values in one dataframe but not values in the other dataframe. So you can see here like I mentioned earlier. In the first data frame, we have a key of C. But we have a value of three, but in the second dataframe we don't have the key and so therefore we don't have a value either. So when we do the outer merge, we still have the key of C, so we still have a row entry one or more ruin trees for C will have a value for the first column for data one, but we won't have a value for the second column data too. And so it just appears like this. So if you check back, this view was not included. When the inner merge was performed. And similarly, we have this view here for the key of D. Becausw. There's no key of D in the first dataframe. But there is, in the second dataframe, so we have a value for data 2. Corresponding to D, but we don't for data one. So the value for data one is missing. So we could also do a left merge. Then we have. Everything I mean, we have all of the data for keys in the first dataframe in the left dataframe DF-1. And. So we have value value for. For the key of C. But obviously when we include C as a key, then we don't have any data under this data two column. But note that we don't have the devo in here. Because when we do a left merge, we're only using keys that appear in the first data thing. OK, so the right merge is similar except that we use only the keys that appear in the second date thing, so we will have a B&D, but we won't have C. So you can see here we have. We have entries for B. A, but not for D. So for D data two has a value 2, but data one did not have a value. So if you look at here, there's no key of D, so no value. But we still include the Devo. So hopefully that makes sense. So we specify inner outer, left, right here. If the column names for the keys are called different things, we can specify those names using left, on and right on. And. Now this is all hopefully very useful for combining various data sets together. Something that you often have to do in practice because you're getting datasets from different locations and maybe you have country codes. That are called something different from different sources, but the same country code format. Things like that. Well, the final thing is to show that you can do this kind of merging when you have hierarchical indexing as well. So multiple index levels. And to do that we're revisiting the data sets on. Mean earnings across regions and also introducing the second data set, so this is on median. Median earnings and what I've done for data two is I've just taken the values for 2015, so it's a little artificial, but just to show the difference between inner and outer merge for. At this case with hierarchical indexing. So so when we do the inner merge, there are not going to be values for 2016 and 17. And when we do the outer merge. Well, we'll keep the values for. That we have for data one data one covers 2016, seventeen as well. But we will have missing entries for the median because this is only for 2015. So, um. I'm going to just run this cell so you can see the data. So this is what we had before for mean hours worked. And this is median hours worked, hours worked. So, um. You know when we do the merge? If we are doing a merge in a certain way, so we were merging on the two columns year and area. Then we will by default keep the entries where the keys are common to both data frames. So in that case, since in the second dataframe here we only have key pairs, 2015 England, 2015, England and Wales 2015 GB and so on, and we don't have 2016 England and 2016 England and Wales and so on. Will end up with just. Two columns of data mean I was a median hours, but only for these key pairs for 2015 and the the various areas. The others won't appear. Because they're not in both. So that's what will happen when we do this merge. So merging data one data 2. On these two columns. So we have two key columns now and we're doing the default in the merge. When we do the second one the the outer merge so everything is the same here except we we specify the how to be outer. Then we we will be keeping. All of these values here, but of course we don't have any values for the median corresponding to those keys, so we'll just get any ends, not a number. Markers for values. So you will have missing values, in other words. So it's just. Just very simple using previous ideas, but now specifying the two columns on which we are merging. And you see the difference here. So we have two outputs, two things printed, ones the emerge one the outer merge. This one is the print out of the inner merge and then we have the print out of the outer merge. Both of them have two columns, mean hours and median. But obviously for the inner merge we only have 2000, six 2015. Values. So. No, really, we only have these key pairs 2015 and and each of these, whereas for the outer merge we keep have all the key pairs. But we don't have these values. So that's it in terms of the content. The new content for lecture 7. Um? I recommend that you spend a bit of time just working through this and making sure that. You understand how all of these work and you don't have to remember absolutely everything. But you should know what's available and what you can do so that then you can come back to this when you need to. For example, when you're doing the coursework project so. Yeah, it's it's important to be aware of what you are able to do with pandas and to have at least quite a good idea of how you're going to do it. And then with practice you can start to. You know, write these things out straight away.