Efficiently Iterating Over a Pandas DataFrame in Chunks of 100 Rows

Answered using GPT-4o by GPT4oMini.app

Question:

Iterate over the rows of a pandas dataframe, and get 100 rows each time

Answer:

Iterating over the rows of a pandas DataFrame in chunks of 100 can be done efficiently with positional slicing via the iloc indexer. Here's a step-by-step guide:

  1. Import pandas and create a DataFrame: If you haven't already, import pandas and create or load your DataFrame.

  2. Determine the number of rows: Get the total number of rows in the DataFrame.

  3. Iterate in chunks: Use a loop to iterate over the DataFrame in chunks of 100 rows.

Here's sample code demonstrating this:

import pandas as pd

# Sample DataFrame creation for demonstration
data = {
    'A': range(1, 1001),  # 1000 rows of data
    'B': range(1001, 2001)
}
df = pd.DataFrame(data)

# Determine the number of rows in the DataFrame
num_rows = len(df)

# Define the chunk size
chunk_size = 100

# Iterate over the DataFrame in chunks of 100 rows
for start in range(0, num_rows, chunk_size):
    end = min(start + chunk_size, num_rows)  # clamp so the last chunk's range is reported correctly
    chunk = df.iloc[start:end]
    # Process the chunk
    print(f"Processing rows {start} to {end - 1}")
    print(chunk)
    # Add your processing logic here

Explanation:

  1. Import pandas: The import pandas as pd statement imports the pandas library.

  2. Create a DataFrame: The data dictionary is used to create a sample DataFrame with 1000 rows.

  3. Determine the number of rows: num_rows = len(df) gets the total number of rows in the DataFrame.

  4. Define the chunk size: chunk_size = 100 sets the size of each chunk to 100 rows.

  5. Iterate over the DataFrame: The for loop steps through the DataFrame 100 rows at a time. start is the first positional index of the chunk, and end is the exclusive end index, clamped to num_rows so the final chunk is reported correctly even if it contains fewer than 100 rows. df.iloc[start:end] selects the rows from start up to, but not including, end.

  6. Process the chunk: Inside the loop, you can add your processing logic for each chunk. In this example, it simply prints the chunk and the range of rows being processed.

This approach ensures that you efficiently handle large DataFrames by processing them in manageable chunks.
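
If you need this pattern in more than one place, you can wrap it in a small generator so the calling code stays clean. Here is a minimal sketch, assuming you only need consecutive row-wise chunks; the helper name iter_chunks is just for illustration:

import pandas as pd

def iter_chunks(df: pd.DataFrame, chunk_size: int = 100):
    """Yield consecutive chunks of up to chunk_size rows from df."""
    for start in range(0, len(df), chunk_size):
        # iloc slicing is end-exclusive and safely truncates the final chunk
        yield df.iloc[start:start + chunk_size]

# Example usage with the same sample DataFrame as above
df = pd.DataFrame({'A': range(1, 1001), 'B': range(1001, 2001)})
for chunk in iter_chunks(df, 100):
    print(len(chunk))  # 100 for every chunk here, since 1000 divides evenly by 100

Each yielded chunk is an ordinary DataFrame, so any per-chunk processing logic from the loop above works unchanged.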