Run Python Scripts In Databricks: A Quick Guide

Hey guys! Ever wondered how to run your Python scripts seamlessly within a Databricks notebook? You're in the right place! Databricks notebooks are super powerful for collaborative data science and engineering, and knowing how to execute your Python code inside them is a crucial skill. This guide will walk you through various methods, from simple inline execution to running external .py files, ensuring you get the most out of your Databricks environment.

Why Run Python Scripts in Databricks?

Before diving into the how, let's quickly cover the why. Databricks provides a unified platform for data processing, analytics, and machine learning. Running Python scripts within this environment offers several advantages:

  • Collaboration: Databricks notebooks allow multiple users to work on the same code simultaneously, making team projects smoother.
  • Scalability: Databricks is built on Apache Spark, meaning your Python code can leverage distributed computing for handling large datasets.
  • Integration: Seamlessly integrate with other Databricks features like Delta Lake, MLflow, and various data sources.
  • Reproducibility: Notebooks preserve the execution history, making it easier to reproduce results and debug issues.

These advantages make Databricks an ideal platform for data scientists and engineers looking to build and deploy data-driven applications efficiently.
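
As a small illustration of that scalability, the spark session preconfigured in every Databricks notebook lets a plain Python cell hand work to the cluster. A minimal sketch (not a benchmark):

# spark is preconfigured in Databricks notebooks; this range is
# partitioned and counted across the cluster's workers
df = spark.range(1_000_000)
print(df.count())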

Method 1: Inline Execution in a Databricks Notebook

The simplest way to run Python code in a Databricks notebook is by directly typing it into a cell and executing it. This is perfect for quick tests, data exploration, and prototyping.

  1. Create a New Notebook:

    • In your Databricks workspace, click on the "Workspace" button in the sidebar.
    • Navigate to the folder where you want to create the notebook.
    • Click the dropdown button and select "Notebook".
    • Give your notebook a name, select Python as the default language, and click "Create".
  2. Write Your Python Code:

    • In the first cell of the notebook, type your Python code. For example:
    print("Hello, Databricks!")
    x = 10
    y = 20
    print(f"The sum of x and y is: {x + y}")
    
  3. Execute the Cell:

    • Click the "Run Cell" button (the play button) to execute the code in the cell. Alternatively, you can use the keyboard shortcut Shift + Enter.
    • The output of the code will be displayed directly below the cell.

Tips for Inline Execution:

  • Use %python Magic Command: Although Python is the default language, you can explicitly specify it using the %python magic command at the beginning of a cell. This is useful when you're switching between languages in the same notebook.

    %python
    print("This is Python code.")
    
  • Install Libraries: You can install Python libraries directly within the notebook using %pip or %conda. For example:

    %pip install pandas
    import pandas as pd
    

    This will install the pandas library and make it available for use in subsequent cells.

  • Displaying DataFrames: Databricks provides enhanced display capabilities for pandas and Spark DataFrames. Leaving a DataFrame as the last expression in a cell (or passing it to display()) renders it as a formatted table, whereas a plain print(df) only produces text output.

    import pandas as pd
    data = {'col1': [1, 2], 'col2': [3, 4]}
    df = pd.DataFrame(data)
    df
    

    This will display the DataFrame df in a nicely formatted table.
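
    For Spark DataFrames, the display() helper produces the same kind of interactive table. A minimal sketch, assuming the spark session Databricks preconfigures in every notebook:

    # Build a small Spark DataFrame from in-memory rows
    spark_df = spark.createDataFrame([(1, 3), (2, 4)], ["col1", "col2"])
    # display() renders an interactive, sortable table below the cell
    display(spark_df)
    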

By following these steps and tips, you can efficiently execute Python code inline within your Databricks notebooks, making it ideal for interactive data exploration and development.

Method 2: Running External Python Scripts

Sometimes, you might have your Python code organized in separate .py files. Databricks allows you to run these external scripts from within your notebook. Here’s how you can do it:

  1. Upload the Python Script to Databricks:

    • Using the Databricks UI:

      • Click on the "Workspace" button in the sidebar.
      • Navigate to the folder where you want to store the script.
      • Right-click in the folder and select "Create" -> "File".
      • Give your file a name (e.g., my_script.py).
      • You can either upload the file or paste the code directly into the file editor.
    • Using the Databricks CLI:

      • If you have the Databricks CLI installed, you can upload the file using the following command:
      databricks fs cp my_script.py dbfs:/path/to/your/script/my_script.py
      

      Replace /path/to/your/script/ with the desired path in the Databricks File System (DBFS).
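
      Once the upload finishes, you can confirm the file landed where you expect with a quick check from a notebook cell (using the same placeholder path):
      # List the target DBFS directory; my_script.py should appear here
      display(dbutils.fs.ls("dbfs:/path/to/your/script/"))
      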

  2. Run the Script in the Notebook:

    • Using the %sh Magic Command:

      • A note on %run: in Databricks, %run executes another notebook (referenced by its workspace path, without a file extension); it does not run arbitrary .py files from DBFS. To run a .py file stored in DBFS, invoke it as a subprocess through the /dbfs/ FUSE mount:
      %sh python /dbfs/path/to/your/script/my_script.py
      

      Replace /path/to/your/script/my_script.py with the actual path to your script in DBFS.
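
    • Using exec to Run the Script in the Notebook's Scope:

      • If you want the script's functions and variables to stay available after it runs, one minimal alternative (assuming the /dbfs FUSE mount is available on your cluster) is to read the file and execute it in the current Python process:
      # Read the script through the /dbfs FUSE mount and execute it
      # in this notebook's globals, so its definitions persist
      exec(open("/dbfs/path/to/your/script/my_script.py").read())
      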

    • Using dbutils.fs.head for Verification:

      • Before running the script, you can verify its content using dbutils.fs.head.
      script_content = dbutils.fs.head("/path/to/your/script/my_script.py")
      print(script_content)
      

      This prints the beginning of the file (by default dbutils.fs.head returns up to the first 65,536 bytes), letting you confirm that you're about to run the correct script.

Example: Running a Simple Script

Let's say you have a script named my_script.py with the following content:

# my_script.py
def greet(name):
    print(f"Hello, {name}!")

if __name__ == "__main__":
    greet("Databricks User")

In your Databricks notebook, you can run this script as follows:

%sh python /dbfs/path/to/your/script/my_script.py

This will execute the script and print "Hello, Databricks User!" in the notebook output.
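
If you instead execute the file with exec(open(...).read()) as shown earlier, greet stays defined in the notebook's scope, so a later cell can reuse it:

# greet was defined by the exec'd script and persists in this notebook
greet("Another User")  # prints "Hello, Another User!"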

Important Considerations:

  • File Paths: Ensure the path you provide is correct and matches the access method: dbutils.fs commands take dbfs:/-style paths, while %sh and open() read the same files through the /dbfs/ FUSE mount.
  • Dependencies: If your script has external dependencies, make sure they are installed in the Databricks environment using %pip install package_name (or %conda install package_name on ML runtimes that support it).
  • Scope: With exec(open(...).read()), variables and functions defined in the script become available in the notebook's scope; with %sh, the script runs in a separate process, so nothing it defines carries over.

By following these guidelines, you can seamlessly integrate external Python scripts into your Databricks workflows, promoting code reusability and organization.

Method 3: Using Modules and Packages

Organizing your code into modules and packages is a best practice for larger projects. Databricks supports importing and using your custom modules and packages. Here’s how:

  1. Structure Your Package:

    • Create a directory structure for your package. For example:
    my_package/
    β”œβ”€β”€ __init__.py
    β”œβ”€β”€ module1.py
    └── module2.py
    
    • __init__.py: This file marks the directory as a regular Python package. It can be empty or contain initialization code.
    • module1.py and module2.py: These are your Python modules containing functions, classes, and variables.
  2. Upload the Package to Databricks:

    • Zip the Package: Create a ZIP archive of your package directory (e.g., my_package.zip).
    • Upload to DBFS: Upload the ZIP file to DBFS using the Databricks UI or CLI, as described in Method 2.
  3. Install the Package in the Notebook:

    • Using %pip install:

      • If the archive includes packaging metadata (a setup.py or pyproject.toml at its root), you can install it directly from DBFS using %pip install.
      %pip install /dbfs/path/to/your/package/my_package.zip
      

      Replace /dbfs/path/to/your/package/my_package.zip with the actual path to your ZIP file in DBFS. If the ZIP is just the bare package directory with no packaging metadata, pip cannot install it; append the archive to sys.path instead, as the mymath example below demonstrates.

  4. Import and Use the Modules:

    • Once the package is installed, you can import and use its modules in your notebook.
    import my_package.module1
    import my_package.module2
    
    # Use functions from the modules
    my_package.module1.my_function()
    my_package.module2.another_function()
    
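Whichever route you take, a quick way to confirm that Python can locate the package (shown here with the hypothetical my_package name from above) is importlib:

import importlib.util

# find_spec returns None when Python cannot locate the package
spec = importlib.util.find_spec("my_package")
print("my_package found at:", spec.origin if spec else "NOT FOUND")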

Example: Creating and Using a Simple Package

Let's create a simple package named mymath with two modules: add.py and multiply.py.

mymath/
β”œβ”€β”€ __init__.py
β”œβ”€β”€ add.py
└── multiply.py

__init__.py (can be empty):

# __init__.py

add.py:

# add.py
def add(x, y):
    return x + y

multiply.py:

# multiply.py
def multiply(x, y):
    return x * y
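
As an optional refinement (not needed for the steps below), the __init__.py can re-export the two functions so callers can write mymath.add(5, 3) directly:

# __init__.py (optional convenience re-exports)
from mymath.add import add
from mymath.multiply import multiply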

  1. Zip the mymath directory so that mymath/ sits at the root of mymath.zip.
  2. Upload mymath.zip to DBFS (e.g., /dbfs/packages/mymath.zip).
  3. Make the package importable in the notebook. This ZIP has no setup.py, so rather than %pip install, add the archive to sys.path:
import sys
sys.path.append("/dbfs/packages/mymath.zip")
  4. Import and use the modules:
import mymath.add
import mymath.multiply

result_add = mymath.add.add(5, 3)
result_multiply = mymath.multiply.multiply(5, 3)

print(f"The sum is: {result_add}")
print(f"The product is: {result_multiply}")

By organizing your code into packages, you can create more maintainable and reusable codebases within Databricks.

Conclusion

Running Python scripts in Databricks notebooks is a fundamental skill for any data scientist or engineer using the platform. Whether you're executing code inline, running external scripts, or using modules and packages, Databricks provides the flexibility and power you need to tackle complex data challenges. By following the methods outlined in this guide, you can efficiently integrate your Python code into Databricks workflows and leverage the platform's collaborative and scalable environment. Keep experimenting and happy coding!