OK, so welcome to lecture seven of Econ two and five.
We are going to be looking some more pandas, so this is the third installment.
And in particular, these two topics we look at the joining of or merging of data frames, data sets and we look at data frames that have multiple index levels. So hierarchical indexing.
So this these are the topics we cover, hierarchical indexing and then we move onto.
At merging or joining datasets, and then finally we look at an example.
Where we merge data frames that have hierarchical indexing.
So we are using some data from the Office for National Statistics, the ONSA.
On average, hours worked by area of the UK.
And initially we will just have a look inside the data sets CSV format data set.
So if we do that, we can see that there are two lines of header.
Two rows of header.
And then we have year area.
And mean hours worked.
So the years we can see from here that we have 2015, sixteen, 17 and we have these different areas of the UK as well as the UK itself and another combination here.
OK so initially.
We can read that in in the way that we we would so far in the course with header equal to two because we have two rows of header.
And if you do that, this is what you get. So we now have these column names.
And well.
We can actually make what we've seen already in the course that we can set one of these columns to be an index, but we can actually set both of these columns here to be indexes so we can have a multi level index.
Where we have a level of index for the year, but we also have a level of index for area.
And the way to specify that is just by specifying the index columns to use.
So we're setting year and area to be the index levels.
And now this is what it looks like so slightly different to how the data frames have looked so far.
This is now all in bold in Jupiter notebook we have.
To look like 2 columns of index values so we have two levels of the index, one for the year, one for area.
And you can check this against the original data set. These are all.
These are all areas here and values.
I mean, these are all values corresponding to these areas.
And the year 2015. But of course, in the data set.
We would have a column. We have a column for the year with 2015 as the entry, but by setting the.
By setting the index levels to be year in area.
No.
It's been organized in this way where it's understood that all of these are now for the year 2015. All of these values.
And these fall 2016.
So to see this is slightly different.
In terms of the type of indexing that's being used.
Let's print the index to the screen.
So you can see now that the index is a multi index.
And.
It's representative.
Like toples
so we have.
2015 in England. 2015 England and Wales.
Then we have the 2016 and 17's but the index values are coupled together like this one for the year, one for the area.
So let's go back to the original data frame for a second. So if we just comment out the index attributes.
This is the data frame and one thing we can do in pandas is something called partial indexing.
So we can actually pick out the.
The part of the Dataframe corresponding to this outer index.
So by outer index.
We can select 2015, so this being the outer index this being the in the index.
And that will be selecting.
This part of the data frame here.
Yeah, so if we do that.
You can check it, see the bit that I just circled, so this is corresponding to 2015.
And then once you've got that, you can also pick out.
Pick out.
I mean, once you got that you can.
You can use the look index of again, so you know here we just have a single index and to get the value for England you just put in the England label.
On the thing that you know on this data frame that you've extracted from the original 1 so.
This first part gets us to here.
But then this is a data frame, so we can use a look indexer.
To get the role of data corresponding to the label England.
And the value of data corresponding to the label England is in this case just the value 39.2.
So that's what we get.
So let's go back to.
The original data set again.
But I'll do it here. So a comment out.
This part.
So from here, let's say if in one step we wanted to get to the value corresponding to year 2015 and to area England.
We could use this so.
We
have 2015 for the outer index and England for the inner index.
So separated by a comma.
Note that that's different to what's being done here. In the next example, where we're passing a list to look.
Here we're not passing a list into Luke, where passing 2, right? I mean 2.
Labels separated by comma, so this is for one index. This is for the other index you know index.
So this is going to get us this value 39.2 again.
So see, we get that, but if we pass in a list of index values.
Like this, but just going to get.
The they were selecting Pi the outer index.
But no two values of that out to index.
So let me.
Go back to the original data frame.
So the whole thing is this 2015 to 17. But by putting in a list.
Of.
Labels for the outer index. We get this part here.
So that's different to here where we're not putting in a list for putting in one value for the outer index and one value for the inner index.
OK, so now some further examples using series and data frame.
So we start with examples for series, so hierarchical indexing, using series and to begin with it, we're just making the making series so.
We can do that like this, so using the pandas series constructor.
This function creates a series for us. We put in the array of data here.
And new this is using the NUM PY library. We're just you are familiar with the random module as well.
And this is random normal variables.
Some.
Well, none of them so.
Then we also specify the index, and we're specifying a multi level index, so we'll have 9 numbers here.
Nine values in this NUM py array and we're going to specify an outer index.
And an inner index.
So the first number in here will be indexed by A and one, the second by A and to the third by A and three, and so on.
OK, so you see here that we definitely have a multi level index. We have eys and numbers bees and numbers, season numbers and so on and.
Well, we can check that this is a multi level index by looking at its index attributes. That's what's being done here so.
We can return that.
Multi level so it's a multi index and you see we have the values corresponding so each of these values here.
Correspond to a pair of index values now, so one of the letters.
It's one of the numbers so.
These
correspond to the values here.
K&M
let's print the data again.
So starting from this, we can use the same partial indexing that was used earlier on. So if we want to pick out.
This parts.
We can just use the outer index selection.
By putting in B.
For this label.
So you can check that that is the value. I mean that that is the data for be so minus nought .5 five 1.96.
So that's that's what we have here.
And we can use slicing so we can ask for everything from B.
But from that label be up to label C. So in other words, this bit here.
OK, and.
We can also your index things that we have selected using partial indexing, but so we can. I mean we can select using the Luke indexer from things that we have obtained via the Luke Indexes as a partial indexer.
So in this one we are initially.
Selecting the.
The.
The parts relevant to B and 2D.
So we're selecting by these outer labels, B&D.
And then from that.
We're selecting B, so again this will be selecting using the outer index and.
No, we do have. One could be we will have one. So the first part of this just selects the.
The the the part that has outer labels B&E.
Then the next part.
Selects the outer index, be formless.
So we've got back to this one.
But no, by first selecting B&E and then selecting B.
OK, I'm using the whole data again, we get this.
No, just like we can do this.
Which which selects using the outer index be. We can use a look and do the same thing.
Case so we can also do slicing, so starting from.
Starting from here, we can ask for.
Part of the data that corresponds to any outer index button. Inner index of two.
So this is all of the outer indexes, but the inner index must be too. So in other words.
We are allowed to have ABC or D as the outer index.
But the inner index must be here.
Here or here.
So we're only asking for these particular values.
So let's try to remember that .479 point 99.2 eight.
And if we go back.
You know that's the Four 7 and not nine and two 8.
So from the whole data set, again, here we're selecting using the outer and inner indexing so you can see there's a comma here separating a list of outer index labels and an inner index label. So we're asking for.
Well, out to index A&C.
The inner index must be 2, so it's this value. So A and two, and we can also have C.
And two.
So it's these two values.
So nought .47.
And nought point nought 9.
OK, so that was series and we could do very similar things for data thing.
So that's what we do now. So first of all we need to create a data frame and I mean we could read one in. But here we are just creating one.
We're using the NUM py of age.
A range of numbers from zero to 11.
And that's 12 numbers. And then we reshape that single array. So this is a 1 dimensional array into something that's four by three. So four rows and three columns.
Using the reshape method.
So once it's four by three.
Now we need to have labels for the four rows and these are AABB and labels for the columns.
So.
We got Ohio, Ohio and Colorado, but we also have inner indexes so.
No, this is the outer index for the rows and this is the outer index for the columns. But we also have inner indexes for the rows and for the columns.
So.
Each of the rows corresponds to a pair of values for the row index we have A and one.
A and two.
B and 1B and two and each of the values corresponds to two different column indexes. We have a higher in green, OH and red.
Weather in green.
So now we have this nice presentation of this so.
You see, you can see here looking at the columns, Ohio appears twice.
So we have a higher and green OH and red.
And this is represented like so. So the Ohio is for both the red and the green column.
And similarly with the A&B, so a.
Can be to a can also be three. I mean you can have A and two eight. I mean you can have A and one. You can also have A and two.
Be in one B and two.
So this is the multi level dataframe. We have multi level row indexing now. Also we have multi level column indexing.
We couldn't do that for series, all the C, Because series just has one column.
So.
We can pick out using the outer index again, so from here.
If we just select Ohio.
And we're going to get these two so.
These two columns.
And.
If we use the.
The look here we're going to get.
Everything for everything, for the row indexing.
And the column Ohio. So actually it's going to be the same.
So let's just.
Do it again so.
This is just selecting your higher. This is using the Luke indexer to say we want the entire row index, but we want Ohio four as the outer index as the outer column index.
So they're going to be the same.
We can also do this so.
Once you've got this.
From this one here. Then from that you can select.
One level, I mean, we only have one column index level now, so you can just.
Select columns in the way that you would normally for dataframe, so by passing in the string you get the the column for the data frame, so.
This gives us this thing one. One of the columns here is green.
The other one that if we put in green as the string we will be getting the green column.
K.
So, um.
They will cover more on hierarchical indexing later in lecture 9.
And also when we come to the final example, we will be merging on a hierarchical index.