Indexing a DataFrame
How is a DataFrame indexed?
: and , in the Index
- The colon by itself (:) means "select all".
- The comma (,) separates row indexing from column indexing.
By default, when you use the index operator, [], on a DataFrame, it retrieves the column whose label (column name) matches the key, not the column at that position.
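As a minimal sketch of these rules, the example below builds a small made-up DataFrame. The column names (name, age) and the row labels (r1 to r3) are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "age": [31, 25, 40]},
    index=["r1", "r2", "r3"],
)

# [] on a DataFrame looks up a column by its label
ages = df["age"]

# Inside .loc, the comma separates the row part from the column part,
# and a bare colon means "select all"
all_rows_one_col = df.loc[:, "age"]   # every row, column "age"
one_row_all_cols = df.loc["r2", :]    # row "r2", every column
```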
Column Indexing
Column indexing in Pandas involves selecting one or more columns from a DataFrame.
1. Selecting a Single Column
We can select a single column by its column name:
selected_col = df['col1']
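Alternatively, we can select a single column by its integer position. Plain df[<index_val>] only works when the column labels themselves are integers, so use iloc instead:
selected_col = df.iloc[:, <index_val>]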
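We can also use dot notation, provided the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method: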
df.colName
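The three ways of selecting a single column can be compared side by side. This is a sketch on made-up data; the column names col1 and col2 are placeholders, not names from any real dataset.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

by_name = df["col1"]          # by column name
by_position = df.iloc[:, 0]   # by integer position (first column)
by_attribute = df.col1        # dot notation; needs an identifier-like name
```

All three return the same Series here. The bracket form is usually the safest choice because it works for any column name, including names with spaces.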
2. Selecting Multiple Columns
We can select multiple columns by passing a list of column names:
selected_cols = df[['col1', 'col2', ...]]
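A short sketch of the difference between passing one label and passing a list of labels, again on hypothetical columns (col1, col2, col3 are assumed names):

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})

single = df["col1"]                 # one label -> Series
subset = df[["col1", "col3"]]       # list of labels -> DataFrame
reordered = df[["col3", "col1"]]    # columns come back in the listed order
```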
Row Indexing
There are several methods for row indexing.
1. Using loc and iloc for Label and Integer-Based Indexing
loc is used for label-based indexing. It looks rows up by their index labels, which may be text, dates, or integers:
df.loc['label']
iloc is used for integer position-based indexing. Positions start at 0 and work regardless of the index labels, so it can be used even if you have defined your own index:
df.iloc[<row_index>]
To get multiple rows using loc or iloc, we can pass a list of labels or a list of row positions:
# label
df.loc[['label1', 'label2', ...]]
# integer
df.iloc[[1, 2, 4, ...]]
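The following sketch shows loc and iloc returning the same rows on a DataFrame with a custom text index. The data and the labels a to d are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with a custom text index
df = pd.DataFrame(
    {"score": [90, 85, 70, 60]},
    index=["a", "b", "c", "d"],
)

row_by_label = df.loc["b"]           # Series for the row labelled "b"
row_by_position = df.iloc[1]         # Series for the second row (position 1)
rows_by_label = df.loc[["a", "d"]]   # DataFrame with rows "a" and "d"
rows_by_position = df.iloc[[0, 3]]   # DataFrame with the first and fourth rows
```

A single label or position returns a Series, while a list returns a DataFrame, even if the list holds only one item.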
2. Select Rows By Slicing
This is done by passing a slice to the index operator to specify the range of row positions to retrieve:
df[n:m]
# n - First row selected (inclusive)
# m - End of range (exclusive)
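A minimal sketch of row slicing, assuming a made-up single-column DataFrame with the default integer index:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

middle = df[1:3]     # rows at positions 1 and 2; position 3 is excluded
first_two = df[:2]   # omitting n starts from the first row
```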
3. Boolean Masking
We can filter rows using boolean masking. This involves creating a boolean condition (a boolean Series) using the comparison operators and then using it to filter the rows in a DataFrame.
# Example 1
boolean_series = df['colName'] > value
# Example 2
boolean_series = df['colName'] == value
A boolean Series in Pandas is a Series of boolean values (True or False) that indicates whether each element in the original Series satisfies a certain condition. It acts as a mask in which each row is marked True or False depending on whether the condition is met.
Passing the mask to the index operator, df[boolean_series], returns a DataFrame that contains only the rows where the condition is True.
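Both steps, building the mask and applying it, can be sketched as follows. The city and temp columns and the threshold of 10 are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"city": ["Oslo", "Rome", "Lima"], "temp": [5, 22, 19]})

mask = df["temp"] > 10    # boolean Series: False, True, True
warm = df[mask]           # keep only the rows where the mask is True

is_rome = df["city"] == "Rome"
rome_only = df[is_rome]
```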
Subsetting a DataFrame Using Row and Column
We can subset the DataFrame by specifying both the row and column index using the .loc or .iloc indexer.
Using iloc
The .iloc indexer allows you to select rows and columns by their integer indices.
# Subset using row and column indices with .iloc
subset = df.iloc[row_start:row_end, col_start:col_end]
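As a concrete sketch of the pattern above, with made-up data and concrete positions standing in for row_start, row_end, col_start and col_end:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]}
)

# Rows at positions 1 and 2, columns at positions 0 and 1
subset = df.iloc[1:3, 0:2]

# A single cell: row position 0, column position 2
cell = df.iloc[0, 2]
```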
Using loc
Use loc[] to select rows and columns by their labels (names).
df.loc['row_labels', 'column_labels']
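A small sketch of label-based subsetting. The row labels (apple, banana, cherry) and column names (price, qty) are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with text row labels
df = pd.DataFrame(
    {"price": [1.5, 0.5, 2.0], "qty": [10, 20, 30]},
    index=["apple", "banana", "cherry"],
)

# One row label and one column label give a single value
one_value = df.loc["banana", "qty"]

# Lists of labels give a DataFrame
block = df.loc[["apple", "cherry"], ["price", "qty"]]
```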
Using the : slice operator
# label indexing (both endpoints are included)
df.loc[beginning_r_label:ending_r_label, beginning_c_label:ending_c_label]
# integer indexing (the end position is excluded)
df.iloc[beginning_r_ipos:ending_r_ipos, beginning_c_ipos:ending_c_ipos]
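One detail worth checking on real data is how the two indexers treat the end of a slice. In the sketch below (made-up labels r1 to r4 and c1 to c3), the loc slice includes its end label while the iloc slice stops before its end position, so the two expressions select the same block.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame(
    {"c1": [1, 2, 3, 4], "c2": [5, 6, 7, 8], "c3": [9, 10, 11, 12]},
    index=["r1", "r2", "r3", "r4"],
)

by_label = df.loc["r1":"r3", "c1":"c2"]   # label slices include both endpoints
by_position = df.iloc[0:3, 0:2]           # position slices exclude the end
```

Note that slices are written directly inside the indexer, without an extra pair of square brackets around them.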