Creating a DataFrame

How do we create a DataFrame?

To create a DataFrame, the pd.DataFrame() function can be used:

pd.DataFrame(data, index, columns, dtype, ...)

Common Parameter: data

Represents the data to be stored in the DataFrame. This can be a dictionary, a list, a Series, a NumPy array, or another DataFrame.

Common Parameter: index

Used to define custom labels for the rows of the DataFrame (i.e., how we identify each row). If not specified, a default integer index starting at 0 is used.

pd.DataFrame(data, index=['a', 'b', ...])
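As a minimal sketch of the index parameter, the snippet below builds the same data with and without custom row labels. The column name score and its values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
data = {"score": [90, 85, 77]}

# No index given: rows are labeled 0, 1, 2
df_default = pd.DataFrame(data)

# Custom row labels; the list must have one label per row
df_labeled = pd.DataFrame(data, index=["a", "b", "c"])

# Rows can then be looked up by label with .loc
print(df_labeled.loc["b", "score"])
```

The number of labels has to match the number of rows, otherwise pandas raises an error.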

Alternatively, to use one of the DataFrame's columns as the row index, call set_index() after creating the DataFrame. (pd.DataFrame() has no index_label parameter; index_label belongs to output methods such as to_csv().)

df = pd.DataFrame(data).set_index('colName')
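One way to use an existing column as the row index is the set_index() method, sketched below. The column names name and age are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical sample data
data = {"name": ["Alice", "Bob"], "age": [25, 30]}
df = pd.DataFrame(data)

# set_index returns a new DataFrame whose row labels come from the 'name' column;
# the column is moved out of the regular columns and df itself is left unchanged
df_by_name = df.set_index("name")

print(df_by_name.loc["Bob", "age"])
```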

Common Parameter: columns

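Used to specify the column names in the DataFrame. If the data carries no column names (e.g., a list of lists or an array) and columns is not given, the columns are labeled with integers from 0 to n-1, where n is the number of columns.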

DataType: array-like or index object

df = pd.DataFrame(data, columns=['colName1', 'colName2', ...])
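To illustrate the default labels versus explicit names, here is a small sketch using a list of rows. The variable name rows and the values are made up for the example.

```python
import pandas as pd

# Hypothetical data: two rows, two columns
rows = [[1, 2], [3, 4]]

# Without columns, the column labels default to 0 and 1
df_unnamed = pd.DataFrame(rows)

# With columns, each position in a row gets the given name
df_named = pd.DataFrame(rows, columns=["colName1", "colName2"])

print(list(df_unnamed.columns))
print(list(df_named.columns))
```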

Common Parameter: dtype

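Used to force a single data type on all columns. If not specified, pandas infers the data type of each column.

DataType: a single data type (e.g., int, str, 'float64'). The constructor does not accept a dictionary here; to set a different type per column, call astype() with a dictionary mapping each column to its data type ({'col1': 'str'}).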
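The following sketch contrasts the two ways of controlling data types: the constructor's dtype argument, which takes one type for the whole DataFrame, and astype(), which accepts a per-column dictionary. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical integer data
data = {"col1": [1, 2, 3], "col2": [4, 5, 6]}

# dtype in the constructor applies one type to every column
df_float = pd.DataFrame(data, dtype="float64")

# astype with a dictionary converts only the listed columns
df_mixed = pd.DataFrame(data).astype({"col1": "float64"})

print(df_float.dtypes)
print(df_mixed.dtypes)
```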

Using the inplace Parameter

inplace is not a parameter of pd.DataFrame() itself. It is accepted by many DataFrame methods (e.g., sort_values(), drop(), rename()) and controls whether the changes are applied directly to the existing DataFrame, which we request with inplace=True.

Otherwise (the default, inplace=False), the method returns a new DataFrame with the changes and leaves the original untouched.
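As a rough sketch of the difference, the example below uses sort_values(), one of the DataFrame methods that accepts inplace. The column name value is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"value": [3, 1, 2]})

# Default (inplace=False): a new, sorted DataFrame is returned and df keeps its order
sorted_df = df.sort_values("value")
order_before = list(df["value"])

# inplace=True: df itself is reordered, so there is no new object to assign
df.sort_values("value", inplace=True)
order_after = list(df["value"])

print(order_before, order_after)
```

Assigning the returned DataFrame (df = df.sort_values(...)) achieves the same end result and is often easier to follow.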

DataFrame from a `dictionary`

When creating a Pandas DataFrame using a dictionary, each key-value pair in the dictionary typically represents a column in the DataFrame.

Keys are column names
Values are the lists or arrays containing the data for the column


data = {
    'colName': ['list', 'of', 'values', 'any', 'datatype'],
    'colName2': [1, 23, 4, 6, 10],
    # ...
}

df = pd.DataFrame(data)

What happens when the lengths of the lists (or arrays) don't match?

Pandas will raise a ValueError because it requires all columns to have the same length. Make sure that all the lists or arrays you provide as values in the dictionary have the same length.
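A quick way to see this behavior is to build a dictionary with mismatched lengths on purpose and catch the error, as in this sketch. The data is hypothetical.

```python
import pandas as pd

# Deliberately mismatched: three values in one column, two in the other
data = {"colName": ["a", "b", "c"], "colName2": [1, 2]}

error_message = None
try:
    df = pd.DataFrame(data)
except ValueError as err:
    # pandas refuses to build the DataFrame and explains why
    error_message = str(err)

print(error_message)
```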

DataFrame from a `list`

When creating a DataFrame from a list of lists, each inner list represents a row of values. Make sure that the position of each value in the inner list corresponds to its intended column.

data = [['Alice', 25, 'New York'],
        ['Bob', 30, 'San Francisco'],
        ['Charlie', 35, 'Los Angeles']]

df = pd.DataFrame(data, columns=['Name', 'Age', 'City']) # pass columns to name them; otherwise they are labeled 0, 1, 2

DataFrame from a List of Dictionaries

When passing a list of dictionaries, each dictionary represents a row of values, where each key:value pair maps a value to its corresponding column.

data = [{'Name': 'Alice', 'Age': 25, 'City': 'New York'},
        {'Name': 'Bob', 'Age': 30, 'City': 'San Francisco'},
        {'Name': 'Charlie', 'Age': 35, 'City': 'Los Angeles'}]

df = pd.DataFrame(data)
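Because the keys decide which column each value lands in, the dictionaries do not all need the same keys. The sketch below, with hypothetical records, shows that a key missing from one dictionary becomes a missing value (NaN) in that row.

```python
import pandas as pd

# Hypothetical records: the second one has no 'Age', the first has no 'City'
data = [{"Name": "Alice", "Age": 25},
        {"Name": "Bob", "City": "San Francisco"}]

df = pd.DataFrame(data)

# The columns are the union of all keys; gaps are filled with NaN
print(df)
print(df.isna())
```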

DataFrame from a NumPy Array

You can also create a DataFrame from a NumPy array:

import pandas as pd
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]]) # each inner list is a row

df = pd.DataFrame(data, columns=['A', 'B', 'C']) # pass columns to name them; otherwise they are labeled 0, 1, 2

DataFrame from an External File

To read data from an external source (e.g., a CSV file, an Excel file, or a SQL database):

import pandas as pd

# Reading from CSV
df_csv = pd.read_csv('filename.csv')

# Reading from Excel
df_excel = pd.read_excel('filename.xlsx')

# Reading from SQL (connection must be an open database connection, e.g., from sqlite3 or SQLAlchemy)
df_sql = pd.read_sql('SELECT * FROM table', connection)
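The snippet above assumes the files and the database already exist. For a version that can be run as-is, the sketch below substitutes in-memory stand-ins: a StringIO buffer in place of a CSV file and an in-memory SQLite database in place of a real connection. The table name people and the sample rows are hypothetical.

```python
import io
import sqlite3

import pandas as pd

# In-memory stand-in for a CSV file (hypothetical contents)
csv_text = "Name,Age\nAlice,25\nBob,30\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# In-memory SQLite database standing in for a real database connection
connection = sqlite3.connect(":memory:")

# Write the rows to a table so there is something to query
df_csv.to_sql("people", connection, index=False)

# read_sql runs the query and returns the result as a DataFrame
df_sql = pd.read_sql("SELECT * FROM people WHERE Age > 26", connection)
connection.close()

print(df_sql)
```

With a real file or database, only the first argument of read_csv and the connection object would change.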