Python & HDF5 hackfest

The PyTables and h5py projects are holding a hackfest at Curtin University. The focus of the event is the overarching plan agreed on by the two teams at SciPy 2015: refactoring the Python and HDF5 stack by rebasing PyTables on top of the low-level API provided by h5py.

  • When: 8–11 August 2016
  • Where: Cisco IoE Innovation Centre, Curtin University, Perth, Australia

Background

HDF5 is an open-source, high-performance data storage and manipulation platform, capable of handling data scales from kilobytes to exabytes. As such, it is a critical component in many scientific computing projects, ranging from Astronomy & Astrophysics to Computational Biology to Nuclear Engineering.

Python is an open-source, general-purpose programming language with a deep and broad scientific computing ecosystem. Users who want to read and write HDF5 data from Python face a choice between two long-standing packages with overlapping capabilities: h5py and PyTables. Both packages can access and manipulate HDF5 data files, but they differ strongly in spirit: while h5py aims to stay as close as possible to the original HDF5 C API, PyTables adds sophisticated features such as indexing and out-of-core querying. h5py and PyTables suit slightly different use cases, and interactions with other software packages often end up requiring both packages to be used.

To resolve this confusing situation, developers from PyTables, h5py, and The HDF Group, as well as other community members, have drawn up a plan to simplify the stack for Python and HDF5. This plan contains the following components:

  • Refactor PyTables to depend on h5py for low-level operations
  • Extend h5py to cover all the HDF5 API needed by PyTables
  • Retain in PyTables its high-level abstractions

This plan will make h5py – PyTables interactions seamless and will benefit users, developers and the general ecosystem.

Events

Workshop: Python for Sciences

Anthony Scopatz (USC), Andrea Bedini (CIC)

  • When: Wednesday 10 August 2016, 9am to 12pm
  • Where: Building 405, Room 205, Bentley Campus, Curtin University
  • RSVP: To attend the workshop you need to register here

About the workshop: Python is one of the most popular languages for scientific computing today. Its applications range from testing out ideas on laptops to high-performance computing simulations running on tens of thousands of nodes. This workshop will provide an introduction to the basics of modern, open-source, scientific Python. The first hour will cover NumPy and array-oriented programming. The second hour will discuss the Pandas package and DataFrames. Lastly, the third hour will discuss symbolic mathematics with SymPy. Throughout this session, we will demonstrate the use of the Jupyter Notebook for interactive computing and matplotlib for the creation of publication-quality figures and plots.

Requirements: Please note that this is not an introduction to programming. Attendees are expected to have a basic familiarity with Python; Software Carpentry graduates should be able to follow. Participants must bring a laptop with a Mac, Linux, or Windows operating system (not a tablet, Chromebook, etc.) that they have administrative privileges on. Participants should have a working Python installation (we recommend installing Anaconda Python) and should be familiar with installing Python packages.

Seminar: Handling Big Data on Modern Computers. A Developer's View

Francesc Alted

  • When: Wednesday 10 August 2016, 1pm to 2pm, followed by afternoon tea
  • Where: CLT Learning Space, Building 105, Room 107
  • RSVP: For catering purposes, please email Linda Lilly by COB Monday 8 August

Abstract: During the last decade, the evolution of computers has been quite different from before. Instead of seeing acceleration in CPU clock speeds we are seeing more cores in CPUs, and instead of having plain simple architectures with a CPU, memory and hard disk, we are seeing computer facilities with several CPUs, several levels of caches and several persistent layers of storage. It is unfortunate that not many libraries are being designed nowadays with this shift in mind. My intention is to explain how to tackle this efficiently when building libraries for handling big datasets. We will see that the choice of a good language is important too, and how Python, complemented with others (like Cython, C or Julia), is a good match for this.

Confirmed participants

Francesc Alted

Francesc is an independent consultant with years of experience in Python and C programming and compression techniques for IO. He is the main author of open source projects like PyTables, bcolz, and the ultra-fast compressor Blosc.

Anthony Scopatz

Anthony is a computational scientist and long-time Python developer. He is currently an Assistant Professor of Nuclear Engineering at the University of South Carolina. He is the co-author of O'Reilly's "Effective Computation in Physics" and an author of xonsh.

Thomas Caswell

Thomas is a soft-matter physicist at Brookhaven National Laboratory who is committed to developing better software tools for scientists. He is co-lead developer of matplotlib.

Pablo Larraondo

Pablo is an HPC data analyst at the National Computational Infrastructure. He researches new methodologies to improve access to and storage of geospatial gridded data based on the HDF5 and NetCDF file formats. He uses Python, C and Go every day at work.

Andrea Bedini

Andrea is a data scientist and computer specialist at the Curtin Institute for Computation. As a researcher, he has worked in mathematical combinatorics, statistical physics, genomics and traffic flow modelling.

Sponsors

  • Curtin Institute for Computation
  • Cisco Systems, Inc.
  • NumFocus
  • University of South Carolina

Contact Us

Want to know more about this event? Interested in taking part in it? Send us a message.