OK, so we move on to pandas, and we look at some key pandas Series and DataFrame facilities.
So first of all we look at Series. We've already imported the pandas module as pd, and we are looking at the Series type.
Here we are creating a Series using pd.Series, starting with the range object 0 up to 3, with this index that we're specifying. So we have 0 up to 3 as the values of the Series and the index labels a, b, c, d.
First of all we just display the object, so a, b, c, d as the labels and 0 to 3 as the values, and what we're looking at here is just selection of items from the Series.
So for example, we can select the value corresponding to the label b, so that should be a 1.
And we can do this by number as well, so we get the same thing. This is just using the number index.
Then we can do slicing. Actually, I'll display this, so this is what we start with.
If we do the slicing from index position 2 up to but not including 4, then we get this. Counting the positions 0, 1, 2, 3, we just get these two.
And then here we are selecting the values corresponding to this list of labels, so we have labels b, a, d, and the return of this is just going to be the values corresponding to those labels.
We can check: b is 1, a is 0, d is 3. That's what we have here.
Next we are using numeric indexing, so using the default range index. We can always do that, so we have 1 and 3.
Position 1 is equivalent to label b, position 3 is equivalent to label d.
So there we go: 1, 3.
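As a minimal sketch of the selections just described, assuming the Series is called obj (the name used in the filtering example) and that the notebook code looks roughly like this. The recording suggests plain square brackets were used for the numeric cases; newer pandas releases discourage integer positions in plain brackets when the labels are strings, so this sketch uses iloc for those instead.

```python
import pandas as pd

# Series with values 0..3 and index labels a..d, as described in the lecture.
obj = pd.Series(range(4), index=["a", "b", "c", "d"])

by_label = obj["b"]                 # value for the label "b"
by_position = obj.iloc[1]           # the same value, selected by index position
sliced = obj.iloc[2:4]              # positions 2 and 3 (position 4 is excluded)
by_labels = obj[["b", "a", "d"]]    # values returned in the order requested
by_positions = obj.iloc[[1, 3]]     # positions 1 and 3, i.e. labels b and d

print(by_labels)
print(by_positions)
```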
Next, we're doing some filtering. We are selecting values in the Series obj where the values of obj are less than 2.
So this one and this one are both less than 2. This number is not less than 2, so we'll just be selecting this part of the Series.
And then here we're doing slicing by labels, so by string labels.
If you were to do slicing numerically, you would be going from this one up to, but not including, this one. But with labels you go from this one up to and including this one, so we'll have these two together.
And it's worth noting, I don't think this was done in the original lecture, but you can also do assignment this way. So we've just selected the part of the Series corresponding to b up to and including c.
We can actually set the values here both equal to 5 by assignment. So just as we can select values, we can assign values in the same way.
So you see that both are set equal to 5 now.
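A short sketch of the filtering, label slicing and assignment steps, again assuming the Series is called obj and holds 0 to 3 with labels a to d:

```python
import pandas as pd

obj = pd.Series(range(4), index=["a", "b", "c", "d"])

# Boolean filter: keeps only the entries whose value is less than 2.
small = obj[obj < 2]

# Label slice: unlike a positional slice, the end label "c" is included.
# The values are copied into a list here so they are not affected by the assignment.
label_values = obj["b":"c"].tolist()

# Assignment through the same label slice sets both entries to 5.
obj["b":"c"] = 5

print(small)
print(obj)
```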
So now we're moving on to DataFrames. First of all, let's just specify the DataFrame, setting the row index and the column labels.
The row labels are the states and the column labels are the numbers. If we select according to a string, pandas will assume that we mean a column.
So although the states are strings as well, if we put a string in here, it'll be looking for a column. If we were to put in Colorado or Ohio, then there would be an error, because it isn't one of the columns. But if we put in two, we'll get the column corresponding to two.
So you see, this is the column corresponding to two.
We can ask for a list of columns by putting the names of the columns here. Note that it's in the order that we asked.
And if we put in numbers, so if we ask for a slice starting from 0 up to 2, then pandas will assume that we want rows.
So this is going to be rows 0 up to but not including 2, so 0 and 1 but not index position 2. So just the first two rows. That's what we have here.
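Here is a rough sketch of those DataFrame selections. The frame itself is an assumed reconstruction: the values 0 to 15 and the column names one to four are consistent with the numbers read out in the lecture, and the fourth row label (New York) is a guess, since only Ohio, Colorado and Utah are named.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the lecture's frame: values 0..15, states as row labels.
data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

col_two = data["two"]            # a string selects a column, returned as a Series
cols = data[["three", "one"]]    # a list of names returns columns in that order
first_two_rows = data[:2]        # a numeric slice selects rows by position

# A row label in plain brackets is looked up as a column, so it fails.
try:
    data["Colorado"]
    colorado_is_column = True
except KeyError:
    colorado_is_column = False

print(first_two_rows)
```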
OK, so a couple of examples with filtering and assignment according to a logical condition.
We're still looking at this data DataFrame, so I'm going to display that.
So we're starting with this, and then in this code here we are selecting the part of the DataFrame where the values of the column three are greater than 5.
So this is the column three. It has values 2, 6, 10, 14.
By doing this we are asking for the rows in data where this column has values greater than 5.
The value 2 is not greater than 5, but 6, 10 and 14 all are, so we get these three rows.
So by doing this we're selecting rows of the DataFrame according to values in one of the columns.
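A minimal sketch of this row filter, using an assumed reconstruction of the data frame (values 0 to 15, with the fourth row label a guess):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Ohio", "Colorado", "Utah", "New York"],  # last label is an assumption
    columns=["one", "two", "three", "four"],
)

# Boolean Series, one entry per row: is the value in column "three" above 5?
mask = data["three"] > 5

# Passing the mask in square brackets keeps only the rows where it is True.
selected = data[mask]
print(selected)
```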
OK, so just like the example previously, you can do assignment via a condition like this as well. This is something that I didn't do in the initial lecture, so it's just a minor point. We're starting from the same data DataFrame, so I'm going to display that.
Then we're printing a DataFrame of logical values, so values that are True or False for the individual items in the data DataFrame, according to whether they are less than the number 5. So we get this table here.
So 0 is less than 5 and 4 is, so we get True and True. 8 isn't and 9 isn't, so we get False and False, and so on.
And then perhaps we want to set some of the values here equal to a certain number, depending on the value that they have.
In this case, if the values in the DataFrame are less than 5, we're going to assign the value 0 to them. This is similar to the example above with Series: just as we can do filtering, we can also do assignment according to the part of the DataFrame that we've filtered.
This here is filtering part of the DataFrame according to this condition. Once we've got that, we can assign a value to it.
So in all the cases where we have True here, we are going to assign the value 0, and that's what's being done here.
In all these cases we have the value 0 now.
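As a sketch of the conditional assignment, under the same assumption about what the data frame contains (values 0 to 15, fourth row label guessed):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Ohio", "Colorado", "Utah", "New York"],  # last label is an assumption
    columns=["one", "two", "three", "four"],
)

# Element-wise comparison gives a DataFrame of True/False values.
is_small = data < 5
print(is_small)

# Using the same condition on the left of an assignment overwrites
# every element where the condition is True.
data[data < 5] = 0
print(data)
```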
So now we move on to selection with loc and iloc. This is the preferred way of selecting now in pandas, I think.
So now we're going to print the data, but then select certain rows, so these two rows in a list, and these two columns.
Into the loc indexer we pass a list of row labels, then we have a comma, and then a list of column labels. So rows Colorado and Utah, columns two and three, and we just end up with these four numbers here.
Next, using the original DataFrame up here, we want all of the rows but just the column in index position 2.
So this is now an iloc example, so we're indexing essentially by the default range index, starting from zero for the rows and the columns.
If we put in a number, then that means the index position. So this is index position 2, in other words the third column in this case.
If we have the slicing operator here, but with nothing either side, it just means that we start at the beginning and finish at the end, so it's all of the rows in this case.
So all of the rows, and index position 2 means the column three, so that's what we have here.
And just like with loc, where we have seen that we can pass in the names of the labels in lists for the rows and columns, we can also just pass in index positions. So here we have a list of index positions for the rows and here we have a list of index positions for the columns.
So there we go: essentially the second row and the third row, and then the columns in positions 3, 0 and 1.
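A sketch of the loc and iloc selections described here, on a freshly built copy of the assumed data frame (values 0 to 15, fourth row label guessed). The row positions 1 and 2 are my reading of "second row and third row".

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Ohio", "Colorado", "Utah", "New York"],  # last label is an assumption
    columns=["one", "two", "three", "four"],
)

# loc: lists of row labels and column labels, separated by a comma.
by_label = data.loc[["Colorado", "Utah"], ["two", "three"]]

# iloc: all rows (bare colon), column at index position 2 (the third column).
third_col = data.iloc[:, 2]

# iloc: lists of row positions and column positions.
by_position = data.iloc[[1, 2], [3, 0, 1]]

print(by_label)
print(by_position)
```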
Well, now we're just looking at further examples. So first of all, printing the data.
Here we are actually using loc. We are selecting rows with row labels up to and including Utah, and then we have a comma, so then we're looking at the columns, and we're just giving a single column name, two.
So we're going from the beginning, so from Ohio. With string labels for the rows or columns we go up to and including the end label when we do the slicing, so we're going up to and including Utah, and then we're picking out the column called two.
So it's these values here: 0, 5, 9.
And then we have an iloc example. We are selecting all of the rows, and then the columns up to but not including the column in position 3, so it's 0, 1, 2 but not the final one.
So we just have all of this data here, but not the final column.
And here is another iloc example. We have all the rows again, and the columns up to the one in position 3. But then we're doing filtering, so we're saying, well, actually we only want the rows where the column called three is greater than 5. If we look at the column called three, where is it greater than 5?
We have numbers greater than 5 in these three rows, so we're just excluding this row, which isn't satisfying the condition.
And we go up to but not including 3, so 0, 1, 2. We don't include the final column.
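A sketch of these three selections. The values 0, 5, 9 read out for column two suggest the frame being used is the one where everything below 5 has already been set to 0, so the sketch applies that step first; the frame itself is an assumed reconstruction.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Ohio", "Colorado", "Utah", "New York"],  # last label is an assumption
    columns=["one", "two", "three", "four"],
)
data[data < 5] = 0  # the earlier assignment step

# loc with a label slice: rows from the start up to AND including "Utah".
up_to_utah = data.loc[:"Utah", "two"]

# iloc with a positional slice: columns 0, 1, 2 (position 3 is excluded).
first_three_cols = data.iloc[:, :3]

# The same slice, then filtered to rows where column "three" exceeds 5.
filtered = data.iloc[:, :3][data["three"] > 5]

print(up_to_utah)
print(filtered)
```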
So now we move on to descriptive statistics with DataFrames. First of all we're creating the DataFrame.
Note we're including some not-a-number values here, so some missing entries.
And we're creating the row index labels and the column names, so columns one and two, row index labels a to d.
And it's important to note that when you use pandas reductions or summary statistics, by default they exclude these missing, or not-a-number, entries. So if you were to calculate the average value of row a across columns one and two, then by default it would just be 1.4, ignoring the fact that the second value was missing.
If you were to compute the average for the column one, then it would compute the average of three entries, so the entries for rows a, b and d, and it would completely ignore this one. So it just computes the average over the three numbers that you have. That's the default behavior of the pandas DataFrame methods. If you're using the sum method, for example, that's what it'll do.
So let's look at that in more detail. We are printing the DataFrame just here, and then in this first one we are computing the sum.
And we're setting this axis option equal to 0. So 0 is for rows, 1 is for columns. If we set it equal to 0, that means that we're doing the sum over the rows, so essentially we're computing the sum for each column.
If we do that, we'll get a value for the column one and a value for the column two.
So you get this and this, and these ignore the not-a-number or missing entries. So this is a sum over these numbers here, and this is a sum over these numbers here.
So let's look at the second example, with the option set to axis equals 1. Then you're computing the sum over the columns, so the sum across each row, in other words.
The sum over this row is 1.4, the sum over this row is 2.6, and the sum over this row is minus 0.55, so that's what you get.
So axis equal to 0 means the operation is over the rows.
Axis equal to 1 means the operation is over the columns.
And then you can always override this ignoring of the not-a-number values, and only compute the sum, in this case, where you have all of the data available. So if you say not to skip, so skipna is False, then you get not-a-numbers in the answer.
Here we are computing the sum across the columns, so over the columns for each row, and you can see here we have 1.4 for the first column but nothing for the second, so the sum is not defined.
For this row it is, so we get a number.
For the third one it is not. We don't have anything there.
And for the fourth one it is, so we get a number for this one.
So this option can be used to override the default behavior of just excluding anything that's not defined.
We can also specify the axis by name, not just by 0 or 1. So axis equal to 0 means rows and axis equal to 1 means columns, but we can specify by rows or columns instead.
So up here we could use rows and columns instead.
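To make the axis and skipna behavior concrete, here is a small sketch. The lecture only reads out 1.4 and the row sums 1.4, 2.6 and minus 0.55, so the remaining entries (7.1, -4.5, 0.75, -1.3) are illustrative values chosen to reproduce those sums, and the name df is an assumption.

```python
import numpy as np
import pandas as pd

# Illustrative frame with missing entries; only some values come from the lecture.
df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

col_sums = df.sum(axis=0)                 # one value per column, NaN skipped
row_sums = df.sum(axis=1)                 # one value per row, NaN skipped
strict = df.sum(axis=1, skipna=False)     # NaN wherever any entry is missing
named = df.sum(axis="columns")            # same as axis=1, given by name

print(row_sums)
print(strict)
```

Note that with the default settings a row that is entirely missing (row c here) sums to 0.0 rather than NaN.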
There are many different pandas methods that we could use in addition to sum, so you can go back to the lecture to have a look. One of them is the mean method. Here we are taking the mean over the columns, so a row-by-row mean, in other words.
And we're not excluding the missing numbers, so with this option we're saying that if we don't have all of the numbers there, we are not willing to take the mean.
So here you can see we have a missing number, so we don't just say the mean is 1.4, we say the mean is missing or not defined. Here we have two numbers, so we can take the mean.
We get this, which is just a simple average of these two numbers.
In the third row we don't have numbers.
In the fourth row we have these two, and if you take the mean you get this.
So we've seen sum, we've seen mean. Another useful one is describe. If you have a pandas DataFrame, you can apply the describe method.
And this can give you a lot of useful information: how many well-defined items you have, the mean, the standard deviation and so on, the quantiles, the maximum, the minimum.
So it's quite useful. This is for the DataFrame, so it's for both columns.
But you can also pick out an individual Series. So here we are selecting an individual column, which will be a Series, and then applying it. So then you get these values for the individual Series.
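A sketch of mean with skipna switched off, and of describe on a DataFrame and on a single column. As before, the frame is illustrative: only 1.4 and the row sums are taken from the lecture, and the name df is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# Row-by-row mean; with skipna=False a row with any missing entry gives NaN.
row_means = df.mean(axis="columns", skipna=False)

# Summary statistics for every column (count, mean, std, min, quartiles, max).
summary = df.describe()

# The same summary for one column, which is a Series.
one_summary = df["one"].describe()

print(row_means)
print(summary)
```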
So next we're looking at file input and output, so just revising this basic reading and writing of data.
If we have a directory called examples and inside that a file called ex1.csv, then we can read this file into a pandas DataFrame using the read_csv function from the pandas library.
So there we go.
If you were using Windows, then you might have to do this, but like I said in a lecture, this doesn't seem to be necessary now. In Windows you typically specify a file path using the backslash, and in that case you need to convert the string to a raw string in Python, otherwise it'll interpret the backslash in a different way.
These days it seems that you can just specify the file path with a forward slash, and even if you're on Windows it works, so this doesn't seem to be necessary.
So I think you should be able to use this whichever operating system you're using.
So we are reading the data into a DataFrame df here.
It's worked pretty straightforwardly. We have numbers inside the DataFrame, we have columns called a, b, c, d and message, and it all seems to make sense.
But sometimes it may be that the first row in the data set is not the row of column names. In this case it has turned out to be, but we have to watch out for that, and there's a couple of examples later about that.
Another way to read in this data, instead of the read_csv function, is to use the read_table function. Then we just have to specify the separator.
So if it's comma-separated data, we just say that the separator is a comma, and then we can do the same thing.
So just two different ways of doing the same thing. The read_table function is more general than the read_csv function.
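A self-contained sketch of the two ways of reading the same file. The file contents are a small stand-in written to a temporary directory; only the column names a, b, c, d and message come from the lecture.

```python
import os
import tempfile

import pandas as pd

# Write a small stand-in for ex1.csv (rows are made up for illustration).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "ex1.csv")
with open(path, "w") as f:
    f.write("a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n")

# read_csv assumes a comma separator and takes the first row as column names.
df_csv = pd.read_csv(path)

# read_table needs the separator spelled out for comma-separated data.
df_table = pd.read_table(path, sep=",")

print(df_csv)
```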
So next we're doing the same, but with this file ex2.csv.
It is one of these cases now where the first row is not giving us column names. It doesn't look like a row of column names. It looks like it should be a row in the data part of the DataFrame.
What we can do is specify that we don't have a header in the data that we're reading in.
We want this row here to be a row of data. We're not saying that the first row in the file is the row of column names.
One thing to note is that we don't have column names anymore. We just use the default range index for the column names, because nothing's been specified here and we said that the header does not exist in the original CSV file.
What we can do, though, is specify the names of the columns to be a list that we specify here, so a to d and then the string message.
And also, along the way, we can specify which of these names to use as the index for the DataFrame, so we can specify that out of those the index column will be the one that's called message.
So this is this data up here, but we've assigned these names, and then we've made this one the index column and at the same time removed it as one of the columns of the DataFrame.
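A sketch of the header, names and index_col options, reading from an in-memory string instead of ex2.csv so that it runs on its own. The rows are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for a file with no header row.
raw = "1,2,3,4,hello\n5,6,7,8,world\n"

# header=None: the first row is data, and columns get default labels 0, 1, 2, ...
no_header = pd.read_csv(io.StringIO(raw), header=None)

# names=...: supply the column names ourselves.
col_names = ["a", "b", "c", "d", "message"]
named = pd.read_csv(io.StringIO(raw), names=col_names)

# index_col=...: use one of those columns as the row index.
indexed = pd.read_csv(io.StringIO(raw), names=col_names, index_col="message")

print(no_header)
print(indexed)
```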
Sometimes the data is not stored with comma-separated values. It's stored in a slightly different format, and if the file is not too large we can use this method just to have a little look at what's inside the file. We are using the Python open function, which returns a file object. It hasn't been given a name, but this is a file object.
We're converting it to a list, so each row in the file is represented as a separate item in the list.
And provided the file is not too large, this is a viable way to just have a little look to see what the header is and so on.
We can see here that we have a tabular data format where we appear to have column labels here, and here we have row labels.
We can then think about how the data here is separated. It seems to be separated by spaces, but then we have larger numbers of spaces here.
What we can do is use read_table and specify the separator to use, and the separator accepts regular expressions. This is an example of a regular expression where we're saying to separate by whitespace, and there are going to be one or more of those spaces. So we separate by one or more spaces.
And pandas is quite clever. It notes that we essentially have four columns of data but only these three names, so these values here are then interpreted as the row labels, whereas these are the column labels.
There are other ways of separating data, so just some common ones to account for: separation by tab, by comma, by semicolon and by the pipe symbol.
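A sketch of peeking at the raw lines and then reading whitespace-separated data with a regular-expression separator. The text is a made-up stand-in with three column names and four fields per data row; io.StringIO stands in for the file that the lecture opens with open.

```python
import io

import pandas as pd

# Stand-in for a whitespace-separated file: 3 names in the header, 4 fields per row.
raw = (
    "       A      B      C\n"
    "aaa  0.1    0.2    0.3\n"
    "bbb  1.5    2.5    3.5\n"
)

# Equivalent in spirit to list(open(path)): one string per line of the file.
lines = list(io.StringIO(raw))

# r"\s+" is a regular expression meaning "one or more whitespace characters".
result = pd.read_table(io.StringIO(raw), sep=r"\s+")

# Other delimiters are passed the same way, for example a semicolon.
semi = pd.read_csv(io.StringIO("x;y\n1;2\n"), sep=";")

print(lines)
print(result)
```

Because the header has one fewer name than the data rows have fields, the first field of each row is taken as the row label.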
So in addition to reading data, it's important to be able to write to CSV files.
Here we are reading in some data, using the read_csv function again just to read in this CSV file. But then we're going to use the to_csv method to write this DataFrame to a CSV file. So we're putting in here the path to the file, out.csv.
It's important that you actually have this directory on your computer, so you need to create a directory called examples, and then if you look inside that, you'll have a file called out.csv now, which will contain the DataFrame that we're calling data.
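A sketch of writing a DataFrame with to_csv. To keep it self-contained, the frame is built directly rather than read from a file, and the examples directory is created inside a temporary folder; the contents are made up.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the DataFrame called data.
data = pd.DataFrame({"a": [1, 5], "b": [2, 6], "message": ["hello", "world"]})

# to_csv does not create missing directories, so make sure "examples" exists.
out_dir = os.path.join(tempfile.mkdtemp(), "examples")
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, "out.csv")

data.to_csv(out_path)

# Read the file back, treating the first column as the saved row index.
round_trip = pd.read_csv(out_path, index_col=0)
print(round_trip)
```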
So the final piece of the relatively early material on pandas: it would be useful to have a look at setting the index of a DataFrame, a very common thing that is necessary to do when working with data.
Of course, you've seen far more advanced material on hierarchical indexing and reshaping DataFrames, and that material is very important. But first of all it's important to be able to set the index on the DataFrame. So we're looking at example one from Section B of the formative assessment exercise, the example with the arable land DataFrame.
So here we are importing pandas as pd, and first of all just having a little look at this file. Common on all operating systems is some summary data about the file, so perhaps we've looked at that and we've seen that the file is not so large, so we can look at it using this method.
In the final lecture, we'll be looking at chunking and alternative ways of looking at data for very large files, but let's assume this is quite a small file.
So we look at it like this.
And you see, when we print this, that it needs to have header equal to 3.
So let's see that. This part here is not giving us column names.
This is the first row of the file, this is the second row, then we have a third row with nothing at all, and then we have our column names here.
So after setting header equals 3 we'll just have from this part downwards. We'll be using this row to give us the column names, and then we'll have the data from this row downwards.
You can see here, because this is not just giving us the head of the file, it's giving us all of the file, that we only have two rows of data: one is for Australia and the other is for Austria. So we're creating a DataFrame with column names country name, 2005, 2006, and then we have two rows of data, one for Australia, one for Austria.
And so we read this data into a DataFrame called arable land, and the question is to set the index of this DataFrame to be the country name.
There are a number of methods, and the preferred one, which I'll go through, is the second one here. We use the set_index method on the DataFrame.
We have a column called country name, so we're setting the index to be the country name, and we also have an inplace option, so we don't have to do this updating of the arable land name. We can just directly set the index on the arable land DataFrame to be country name and do it in place, in one step.
If you do that, then we can display the updated arable land DataFrame and see that country name has been set as the index.
And with the to_csv method, you can check on your computer that this has been sent to a file called Unknown two dot TXT, with this new separator, the semicolon.
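A sketch of the header, set_index and to_csv steps on a made-up stand-in for the arable land file. The preamble lines, the numbers, the exact spelling "Country Name" and the variable name arable_land are all assumptions, and the output goes to an in-memory buffer rather than a named file. The third line holds only separators so that header=3 lands on the names row; pandas skips completely blank lines by default when counting towards header.

```python
import io

import pandas as pd

# Made-up stand-in: three preamble lines, then column names, then two data rows.
# The numbers are placeholders, not real arable land figures.
raw = (
    "Data Source,placeholder,\n"
    "Last Updated,placeholder,\n"
    ",,\n"
    "Country Name,2005,2006\n"
    "Australia,1.0,2.0\n"
    "Austria,3.0,4.0\n"
)

# header=3: the line at position 3 (counting from 0) supplies the column names.
arable_land = pd.read_csv(io.StringIO(raw), header=3)

# Set the index in place, so no reassignment of arable_land is needed.
arable_land.set_index("Country Name", inplace=True)

# Write with a semicolon separator (to a buffer here instead of a file).
buffer = io.StringIO()
arable_land.to_csv(buffer, sep=";")

print(arable_land)
print(buffer.getvalue())
```

The non-in-place alternative would be to reassign the name: arable_land = arable_land.set_index("Country Name").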
So that was just a review of the relatively early pandas material. In terms of Matplotlib, we've seen four main approaches. In approach one we've just used the plot function from the pyplot module.
In the second approach, we've created a figure object and added axes subplot objects, using the add_subplot method. Then we've plotted on those axes subplot objects.
As examples of the first two, have a look at question 2a, Section B of the formative assessment exercise for the first, and lecture 4 and lecture 6 for the second one.
Then we've also plotted directly from pandas DataFrame objects and from pandas Series objects, using the DataFrame or Series plot method.
So in the second type we are using the axes subplot's plot method. In the third case we are using the DataFrame or Series plot method.
For examples of this third type, see lab 3, tasks 2 and 3.
And then much more recently, in lecture 8, we've seen another more flexible method, using subplot2grid.
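As a rough side-by-side sketch of the four approaches, with made-up data and assuming Matplotlib is installed. The recording is unclear on the exact method name in approach two; add_subplot is my reading of it. The Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

# Approach 1: the plot function from the pyplot module.
plt.figure()
plt.plot(x, y)

# Approach 2: create a Figure, add subplots to it, and plot on each one.
fig = plt.figure()
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)
ax1.plot(x, y)
ax2.plot(x, y[::-1])

# Approach 3: the plot method of a DataFrame (or Series).
frame = pd.DataFrame({"y": y}, index=x)
ax3 = frame.plot()

# Approach 4: subplot2grid places axes on a grid, optionally spanning cells.
plt.figure()
wide = plt.subplot2grid((2, 2), (0, 0), colspan=2)
small = plt.subplot2grid((2, 2), (1, 0))
wide.plot(x, y)
small.plot(x, y)

plt.close("all")
```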
So I'm now going to move to tasks 5 and 6 of lab 4, and then after that the introduction to the Spyder IDE.
For task 5 in lab 4.