Dataframe 1.0.0.0

(discourse.haskell.org)

101 points | by internet_points 14 hours ago

9 comments

  • octopoc 8 hours ago
    > There is now a DataFrame.Typed API that tracks the entire schema of the dataframe - column names, misapplied operations etc are now compile time failures and you can easily move between exploratory and pipeline work.

    This makes complex dashboards so much easier to build, because in Python you have to test everything in the dashboard to make sure a change to a common dataset didn’t break anything.
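    For readers unfamiliar with the idea, here is a minimal sketch of tracking a schema in the type, using stand-in names (this mirrors the shape described in the quote, not the actual DataFrame.Typed API):

```haskell
{-# LANGUAGE DataKinds, KindSignatures #-}
import Data.Kind (Type)
import GHC.TypeLits (Symbol)

-- Stand-in schema entry: a column name paired with its element type.
data Column (name :: Symbol) a

-- The schema lives only in the type (a phantom parameter);
-- the rows here are stored untyped for simplicity.
newtype TypedDataFrame (schema :: [Type]) = TypedDataFrame [[String]]

-- Dropping the last column changes the schema in the type, so any
-- caller that still expects a "comments" column fails to compile.
dropComments :: TypedDataFrame '[Column "price" Double, Column "comments" String]
             -> TypedDataFrame '[Column "price" Double]
dropComments (TypedDataFrame rows) = TypedDataFrame (map init rows)
```

    A misapplied operation (say, selecting "comments" after `dropComments`) becomes a type error rather than a runtime failure in the dashboard.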

    Is there a good web dashboard library like streamlit for Haskell I wonder?

    • mchav 17 minutes ago
      No, but something is in the works! We are building reactive notebooks, and we will eventually add export capabilities.

      You can try it from https://www.datahaskell.org/ under "try out our current stack"

  • brightball 9 hours ago
    If anybody is reading this and would like to submit a talk on it or Haskell itself to the Carolina Code Conference, please do so. Our call for speakers is open until the end of March and I've been hoping to get a Haskell talk in for the last couple of years.

    https://blog.carolina.codes/p/call-for-speakers-2026-is-open

    • mchav 8 hours ago
      Author here: Would have loved to but this is round about my wedding anniversary. Will ask some Haskell friends to submit though.
  • whateveracct 9 hours ago
    And there's more packed in here than just Dataframe.

    DataHaskell in general is revived and improving on multiple fronts. Exciting stuff!

  • mark_l_watson 8 hours ago
    This looks so cool, just put it on top of my todo list. My Haskell skills are mediocre but I love the language. I get by with a subset of the language.

    Strong typing and data science seems like a good combination.

    • steve_adams_86 4 hours ago
      In my experience it's tough to sell to some scientists (they like to work with R and Python here), but when it's tied to pipelines that ultimately publish materials (rather than to every step along the way), it's extremely helpful: it streamlines the QA process of ensuring correct data is packaged with publications.

      When I've audited some of the published data at our org, there are errors that would have been caught with even basic type-safety. That's how I got the green light to start harassing my team with type safety in our pipelines.

      Of course, as with all things in programming, it isn't a silver bullet. It adds a layer of rigor that can slow things down, and there are often (seemingly always) nuances which can't be caught easily by most type systems. Things like complex relations between values (like 'if in Y is in [range], X must be null, and Z must be one of [a, b, c]'). Even so, eliminating categories of errors is worthwhile, and makes it easier to focus on the more complex challenges.
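      Constraints like that usually end up as runtime checks next to the typed pipeline. A small hypothetical sketch (the field names and the rule are invented to mirror the example above):

```haskell
-- Hypothetical row type; the types alone can't express the rule below.
data Row = Row { yVal :: Int, xVal :: Maybe Double, zVal :: String }

-- "If y is in [10, 20], x must be null and z must be one of a, b, c."
-- The compiler sees only Int / Maybe Double / String, so the relation
-- between the fields has to be validated at runtime.
validRow :: Row -> Either String Row
validRow r@(Row y x z)
  | y >= 10 && y <= 20 =
      case (x, z `elem` ["a", "b", "c"]) of
        (Nothing, True) -> Right r
        _               -> Left "when y is in [10,20]: x must be null, z in {a,b,c}"
  | otherwise = Right r
```

      The type system rules out whole categories of errors; checks like this cover the relational ones it can't.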

      Overall I'd agree though, it's a good combination.

  • hambandit 9 hours ago
    I learned some Haskell as my hobby language a few years back. It was very cool and forced me to think about programming differently (and finally grok recursion). It never felt like a good language for data analysis to me though (maybe that's because this library wasn't around? lol). This isn't meant as a criticism of the library; rather, I'm curious what use cases the author (if you're around) or any users have in mind. Congrats on the v1 release!
    • mchav 8 hours ago
      Author here. At the time I worked in fraud detection and we needed to automate file generation for our BRMS. Initially created this to experiment with “models as dataframe expressions”, and Haskell is great for DSL-like stuff. That work is still ongoing: https://github.com/DataHaskell/symbolic-regression and dataframe has a native sparse oblique tree implementation.

      As it’s grown it’s been pretty cool to have transparent schema transformations: instead of every function just mapping a dataframe to a dataframe, you can have function signatures like:

      ```
      extract :: TypedDataFrame [Column "price" (Maybe Double), Column "quantity" Int, Column "comments" T.Text]
              -> TypedDataFrame [Column "price" (Maybe Double), Column "quantity" Int]
      -- body of extract

      transform :: TypedDataFrame [Column "price" (Maybe Double), Column "quantity" Int]
                -> TypedDataFrame [Column "price" Double, Column "quantity" Int]
      -- body of transform

      clean :: TypedDataFrame [Column "price" (Maybe Double), Column "quantity" Int, Column "comments" T.Text]
            -> TypedDataFrame [Column "price" Double, Column "quantity" Int]
      clean = transform . extract
      ```

      But you can also do the simple thing too and only worry about type safety if you prefer:

      ```
      df
        |> D.filterWhere (country_code .==. "JPN")
        |> D.select [F.name name]
        |> D.take 5
      ```

      Being able to work across that whole spectrum of type safety is pretty great.

  • ghc 8 hours ago
    I feel like I've been waiting for this to mature for a decade. I love that the vision has been realized despite the enthusiasm for functional programming languages cooling off somewhat.
  • huss-mo 1 hour ago
    Why choose a name that overlaps with pandas' DataFrame?
  • october8140 12 hours ago
    1.0.0.0.0.0.0.0
    • qrobit 11 hours ago
      Hackage recommends using Haskell's PVP[^1], but does not enforce it. That's why many Haskell packages have four-component version numbers: three required components, plus an optional (but popular) fourth that represents "other" changes, like documentation.

      [^1]: https://pvp.haskell.org/
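      Under the PVP, the first two components together act as the major version. A quick illustration using base's `Data.Version` (a sketch; the PVP specifies only the numbering scheme, not these helpers):

```haskell
import Data.Version (Version, makeVersion, versionBranch)

-- PVP: the first two components (A.B) together form the major
-- version; bumping either of them signals a breaking change.
pvpMajor :: Version -> [Int]
pvpMajor = take 2 . versionBranch

isBreaking :: Version -> Version -> Bool
isBreaking old new = pvpMajor old /= pvpMajor new
```

      So 1.0.0.0 to 1.0.0.1 (e.g. a documentation fix) is non-breaking, while 1.0.0.0 to 1.1.0.0 is a breaking change.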

      • whateveracct 9 hours ago
        Also, iirc PVP pre-dates SemVer. For anyone going to accuse Haskell of NIH :)

        Remember, everyone: Haskell is very old!

      • Y-bar 11 hours ago
        > A.B is known as the major version number

        Why are they requiring two numbers to represent one (semantic) number?

        • tikhonj 8 hours ago
          I rather like this. A represents major changes like a substantial redesign of the whole API, while B catches all other breaking changes. Tiny changes to the public API of a library may not be strictly backwards compatible, even if they don't affect most users of the package or require substantial work to address.

          A problem with Semver is that a jump from 101.1.2 to 102.0.0 might be a trivial upgrade, and then the jump to 103.0.0 requires rewriting half your code. With two major version numbers, that would be 1.101.1.2 to 1.102.0.0 to 2.0.0.0. That makes the difference immediately clear, and lets library authors push a 1.103.0.0 release if they really need to.

          In practice, with Semver, changes like this get reflected in the package name instead of the version number. (Like maybe you go from data-frames 101.1.2 to data-frames-2 1.0.0.) But there's no consistent convention for how this works, and it always felt awkward to me, especially if the intention is that everyone migrates to the new version of the API eventually.

          • Y-bar 8 hours ago
            You put into words why I appreciate SemVer so much! It is so much better at being deterministic and therefore gives me greater confidence in my version constraints.

            The author of a library has no idea how tightly coupled my code is to theirs, and should therefore only give a yes/no answer to "is this a breaking change?"

            For example, when a large ORM library I use changed a small thing like "no longer expose db tables for certain queries because not all db engines support it anyway" (i.e. moving a protected property to private), it required a two-week effort to restructure the code base.

            > In practice, with Semver, changes like this get reflected in the package name instead of the version number.

            Not once have I seen this happen. Any specific examples?

        • winwang 10 hours ago
          (no idea but) I feel like changing the first number carries psychological weight, but the 2nd number sometimes feels more important than just "minor". So may as well let the scheme set the mind free?
      • philipwhiuk 9 hours ago
        > MAY optionally have *any* number of additional components, for example 2.1.0.4

        Thus making the silly example possible.

    • torcete 5 hours ago
      I can't wait for version 1.0.0.0.0.0.0.1
    • nickpeterson 5 hours ago
      Risky, it feels like there is a chance you'll still need an extra .0 to cover something unexpected.
  • MoonWalk 2 hours ago
    Is what?