Creating a DataFrame
How do we create a DataFrame?
To create a DataFrame, call the pd.DataFrame() constructor:
pd.DataFrame(data, index, columns, dtype, copy)
data: The data to be stored in the DataFrame. This can be a dictionary, a list, a Series, a NumPy array, or another DataFrame.
index: Defines custom labels for the rows of the DataFrame (i.e., how each row is identified). If not specified, a default integer index starting at 0 is used.
pd.DataFrame(data, index=['a', 'b', ...])
Alternatively, to use an existing column as the row index, call set_index() with that column's name after creating the DataFrame.
df = pd.DataFrame(data).set_index('colName')
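As a minimal sketch of the two ways to control row labels, the example below passes custom labels through index and, separately, promotes a column to the index with set_index(). The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, not taken from any real dataset
data = {'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]}

# Custom row labels supplied at construction time
df_labeled = pd.DataFrame(data, index=['a', 'b', 'c'])

# Promote an existing column to the row index after construction
df_by_name = pd.DataFrame(data).set_index('name')

print(df_labeled.loc['b', 'age'])    # look up a row by its custom label
print(df_by_name.loc['Bob', 'age'])  # look up a row by its 'name' value
```

Note that set_index() moves the column out of the regular columns, so 'name' is no longer listed in df_by_name.columns.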
columns: Specifies the column names of the DataFrame. For data that carries no column labels (e.g., a list of lists or a NumPy array), the columns default to the integers 0 to n - 1, where n is the number of columns.
DataType: array-like or Index object
df = pd.DataFrame(data, columns=['colName1', 'colName2', ...])
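The sketch below illustrates how columns behaves with different inputs, using made-up values: unlabeled rows get integer column labels unless names are passed, while for a dictionary the columns argument selects and orders the keys rather than renaming them.

```python
import pandas as pd

rows = [[1, 2], [3, 4]]

# No names given: columns are labeled 0 and 1
df_default = pd.DataFrame(rows)

# Names given: columns are labeled 'x' and 'y'
df_named = pd.DataFrame(rows, columns=['x', 'y'])

# With a dictionary, columns picks which keys to keep and in what order
record = {'x': [1, 3], 'y': [2, 4], 'z': [5, 6]}
df_subset = pd.DataFrame(record, columns=['z', 'x'])

print(list(df_default.columns))
print(list(df_named.columns))
print(list(df_subset.columns))
```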
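dtype: Forces a single data type on the whole DataFrame. If not specified, pandas infers the data type of each column.
DataType: a single data type (e.g., int, str, 'float64'). To set a different type for each column, create the DataFrame first and then pass a dictionary mapping column names to types ({'col1': 'str'}) to astype().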
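To make the difference concrete, here is a small sketch with illustrative column names: one dtype applied to every column through the constructor, and per-column types applied afterwards with astype().

```python
import pandas as pd

# Hypothetical sample data
data = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}

# One dtype for the whole DataFrame, set in the constructor
df_float = pd.DataFrame(data, dtype='float64')

# Different dtypes per column, set after construction with astype()
df_mixed = pd.DataFrame(data).astype({'col1': 'str', 'col2': 'float64'})

print(df_float.dtypes)
print(df_mixed.dtypes)
```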
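inplace Parameter
inplace is not an argument of pd.DataFrame(). It is a parameter of many DataFrame methods (e.g., drop(), sort_values(), rename()) that controls whether a change is applied directly to the existing DataFrame. With inplace=True, the method modifies the DataFrame and returns None.
Otherwise (inplace=False, the default), the method returns a new DataFrame with the changes and leaves the original untouched.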
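The behavior can be sketched with sort_values(), one of the methods that accepts inplace. The data is invented for the example.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'name': ['Bob', 'Alice'], 'age': [30, 25]})

# Default (inplace=False): a new, sorted DataFrame is returned
sorted_copy = df.sort_values('age')
order_before = df['name'].tolist()  # the original is still unsorted here

# inplace=True: df itself is sorted and the call returns None
result = df.sort_values('age', inplace=True)
order_after = df['name'].tolist()

print(order_before, order_after, result)
```

Because the in-place call returns None, assigning its result back to df (df = df.sort_values(..., inplace=True)) would replace the DataFrame with None.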
DataFrame from a dictionary
When creating a Pandas DataFrame using a dictionary, each key-value pair in the dictionary typically represents a column in the DataFrame.
- Keys are column names
- Values are the lists or arrays containing the data for the column
data = {
'colName': ['list', 'of', 'values', 'any', 'datatype'],
'colName2': [1, 23, 4, 6, 10],
# ...
}
df = pd.DataFrame(data)
If the columns have different lengths, pandas raises a ValueError. Make sure that all the lists or arrays you provide as values in the dictionary have the same length.
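The length requirement can be checked with a short sketch. The dictionaries below are illustrative; the second one deliberately has columns of unequal length so that the error is triggered and caught.

```python
import pandas as pd

# Equal-length columns: works
good = {'name': ['Alice', 'Bob'], 'age': [25, 30]}
df = pd.DataFrame(good)

# Unequal-length columns: pandas raises ValueError
bad = {'name': ['Alice', 'Bob'], 'age': [25]}
try:
    pd.DataFrame(bad)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(df.shape)
print(error_message)
```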
DataFrame from a list
When creating a DataFrame from a list of lists, each inner list represents one row of values. Make sure that the position of each value in an inner list corresponds to its column.
data = [['Alice', 25, 'New York'],
['Bob', 30, 'San Francisco'],
['Charlie', 35, 'Los Angeles']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City']) # columns is optional; without it the columns are labeled 0, 1, 2
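Since values are matched to columns purely by position, the number of names must match the width of the rows. The sketch below, using a shortened version of the rows above, shows a positional lookup and the ValueError raised when too few column names are passed.

```python
import pandas as pd

rows = [['Alice', 25, 'New York'],
        ['Bob', 30, 'San Francisco']]

# Third value of each row lands in the third column, 'City'
df = pd.DataFrame(rows, columns=['Name', 'Age', 'City'])
first_city = df.loc[0, 'City']

# Two names for three-value rows: pandas raises ValueError
try:
    pd.DataFrame(rows, columns=['Name', 'Age'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

print(first_city)
print(mismatch_error)
```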
DataFrame from a List of Dictionaries
When passing a list of dictionaries, each dictionary represents a row of values, where each key:value pair maps a value to its corresponding column.
data = [{'Name': 'Alice', 'Age': 25, 'City': 'New York'},
{'Name': 'Bob', 'Age': 30, 'City': 'San Francisco'},
{'Name': 'Charlie', 'Age': 35, 'City': 'Los Angeles'}]
df = pd.DataFrame(data)
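One consequence of mapping by key is that the dictionaries do not all need the same keys. In the sketch below (with made-up records), the second row lacks a 'City' key, so pandas fills that cell with a missing value (NaN).

```python
import pandas as pd

# The second record has no 'City' key
records = [{'Name': 'Alice', 'Age': 25, 'City': 'New York'},
           {'Name': 'Bob', 'Age': 30}]

df = pd.DataFrame(records)

# Column names come from the keys; the missing cell is filled with NaN
missing_city = pd.isna(df.loc[1, 'City'])

print(df)
print(missing_city)
```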
DataFrame from a NumPy Array
You can also create a DataFrame from a NumPy array:
import pandas as pd
import numpy as np
data = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]) # each inner list is a row
df = pd.DataFrame(data, columns=['A', 'B', 'C']) # columns is optional; the default labels are 0, 1, 2
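As a further sketch, a 2-D array can be combined with both columns and index, and to_numpy() converts the DataFrame back to an array. The labels 'r1' and 'r2' are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

# A 2 x 3 array of floats: [[0, 1, 2], [3, 4, 5]]
arr = np.arange(6, dtype='float64').reshape(2, 3)

# Label both the columns and the rows
df = pd.DataFrame(arr, columns=['A', 'B', 'C'], index=['r1', 'r2'])

# Convert back to a NumPy array
round_trip = df.to_numpy()

print(df)
print(round_trip.shape)
```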
DataFrame from an External File
To read data from an external source (e.g., a CSV file, an Excel file, or a SQL database):
import pandas as pd
# Reading from CSV
df_csv = pd.read_csv('filename.csv')
# Reading from Excel
df_excel = pd.read_excel('filename.xlsx')
# Reading from SQL (connection must be an open database connection, e.g. from sqlite3 or SQLAlchemy)
df_sql = pd.read_sql('SELECT * FROM table', connection)
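The snippet above assumes the files and the database already exist. For a version that runs without any external files, the sketch below writes a small, made-up DataFrame to an in-memory CSV string and an in-memory SQLite database, then reads both back. The table name 'people' is an assumption made for the example.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# CSV: write to a string, then read it back as if it were a file
csv_text = df.to_csv(index=False)
df_csv = pd.read_csv(io.StringIO(csv_text))

# SQL: write to an in-memory SQLite database, then query it
connection = sqlite3.connect(':memory:')
df.to_sql('people', connection, index=False)
df_sql = pd.read_sql('SELECT * FROM people', connection)
connection.close()

print(df_csv)
print(df_sql)
```

Passing index=False when writing keeps the row index out of the output, so it does not come back as an extra column when the data is read again.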