Forums  > Software  > keeping it all together  
     

Strange


Total Posts: 1475
Joined: Jun 2004
 
Posted: 2018-12-05 17:54
Not sure if this is a software question or a basic question. Maybe it's a sign of impending senility, but I seem to be more and more scatter-brained about my research process, doing something and then forgetting it, etc. How do you guys organize your research process (i.e. versions, naming conventions, libraries vs. code straight in the scripts/notebooks)?

I don't interest myself in 'why?'. I think more often in terms of 'when?'...sometimes 'where?'. And always how much?'

nikol


Total Posts: 576
Joined: Jun 2005
 
Posted: 2018-12-05 18:20
It is all about discipline.

All the garbage ideas that disturb my mind I dump into FreeMind. I also use it to analyse the structural relationships between various items and keep them visualized.
https://en.wikipedia.org/wiki/FreeMind

Force yourself to maintain the in-line documentation. Some people like extensive, user-guide-style comments (I did that in Fortran long ago).

Another way to force yourself to document things is to create a small team. That way you have to make sure that things are documented, but you still need discipline for it.

Markdown within Jupyter notebooks is very powerful, so I started to comment my research extensively between the cells.

goldorak


Total Posts: 1050
Joined: Nov 2004
 
Posted: 2018-12-05 20:19
I have always taken the opposite approach: not documenting, leaving things aside for a (long) while, forgetting, and needing a strong effort to come back to some undocumented code written to test ideas or to do research. The end result is usually way more satisfactory.

I guess it depends on people.

If you are not living on the edge you are taking up too much space.

Strange


Total Posts: 1475
Joined: Jun 2004
 
Posted: 2018-12-06 00:47
@goldorak I've always followed your approach and it seemed to work, but lately I feel like I am duplicating effort all over the place and not getting as much done.

I don't interest myself in 'why?'. I think more often in terms of 'when?'...sometimes 'where?'. And always how much?'

prikolno


Total Posts: 26
Joined: Jul 2018
 
Posted: 2018-12-06 03:32
I think there are two slightly orthogonal issues here. One is documentation and the other is reproducibility.

I don't think there's a "right way" to go about documentation. You can either buffer up a lot of documentation debt and then flush it all at once (like goldorak) or keep very strict practices; those are the two extremes. I feel the ideal approach varies over time with the size of the team, the nature and maturity of the strategy you're deploying, and the stylistic preferences of the specific people you're working with. For example, past a certain point, some strategies have such a complex state machine that it's probably not productive to document everything, because only one other person on your team will even barely understand your strategy.

As for everything else you've named (versions, naming conventions, code and data sets), I find that a huge part of it is a matter of reproducibility. I've gone through stages of the platform where things were so non-reproducible that it was extremely, extremely painful to context-switch for even one day and work on a different hypothesis. I've also been at stages where I could drop a project for six months and go back to it almost losslessly, because I could reproduce everything, from the derived data I had at the time, to the executables/binaries I used, to the exact functional forms of the signals that were implemented.

There's no one silver bullet that brought us from the former phase to the latter. A significant part of it was spending time on good config management and metadata that describes the project and its dependencies, and there was probably some 100k+ LOC dedicated to this endeavor. We also found an internal research wiki to be pretty useful.

Nowadays there are version control systems for ML projects, like DVC, which may fulfill some of your requirements or at least give you ideas. Making your exploratory environment more "interactive" and "preservable" through interactive plots and polyglot notebooks like Beaker may also help.
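
To make that concrete, here is a rough sketch of what a DVC setup can look like inside a git repo. The file and script names below are just placeholders, and the exact subcommands have changed between DVC versions (older releases use dvc run, newer ones dvc stage add plus dvc repro), so treat it as the shape of the thing rather than a recipe:

    # inside an existing git repo, with DVC installed (e.g. pip install dvc)
    dvc init

    # track a raw dataset: git keeps a small .dvc pointer, DVC caches the bytes
    dvc add data/raw_ticks.csv
    git add data/raw_ticks.csv.dvc .gitignore
    git commit -m "track raw data with DVC"

    # declare a pipeline stage with explicit dependencies and outputs
    dvc run -d src/build_signals.py -d data/raw_ticks.csv \
            -o data/signals.parquet \
            python src/build_signals.py

    # later, re-run only the stages whose inputs changed
    dvc repro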

jslade


Total Posts: 1148
Joined: Feb 2007
 
Posted: 2018-12-06 14:47
For R, I stash functions into collections/packages and work flows into Sweave files. I leave work histories in scripts, which are roughly the same as a Jupyter notebook. Honestly I almost never look at these and should probably throw them all away.

FWIW I also use FreeMind (Freeplane, actually) for brainstorming and large-project organization, and Emacs org-mode and physical notebooks for everything else. I'd like to throw away Sweave and just use org-mode (or some kind of mind map with org-mode), but alas, such a tool doesn't exist, and I don't want to write one.

Oh yeah, you are using source control, right? Private git repos are too easy to set up, even if you don't use GitHub; you might as well use one.
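
A private remote is really just a bare repo on any machine you can ssh into, roughly like this (the hostname and paths are placeholders):

    # create a bare repository on a box you control
    ssh me@mybox 'git init --bare ~/repos/research.git'

    # point your working copy at it and push
    cd ~/work/research
    git remote add origin me@mybox:repos/research.git
    git push -u origin master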

"Learning, n. The kind of ignorance distinguishing the studious."

Rashomon


Total Posts: 190
Joined: Mar 2011
 
Posted: 2018-12-06 21:07
1. Think it through before acting
2. Alias the shit out of things.



Re 2:
If there are multiple paths to a folder, ln -s. If there is a simple English word for function(x) { a(b(x))}, name it. I don’t worry about long function names, especially on first writing. Trying to create an ontology of the world on the fly when you’re trying to be creative will kill you.


If I accidentally type sl instead of ls, that goes in my .bash_aliases. Ditto if I try to call a function in R by the wrong name: it gets wrong -> right.
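
Concretely, the kind of thing that accumulates (the names and paths are made up):

    # ~/.bash_aliases: every recurring typo gets promoted to an alias
    alias sl='ls'
    alias gti='git'

    # multiple paths to the same folder? symlink the one you keep typing
    ln -s /data/research/2018/q4/experiments ~/experiments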




Re 1:
I use paper, touch, and cat >> for thoughts. (The Design of Everyday Things made the point strongly, with the kayak map, that things which are destroyed when they get wet are impermanent.) There's a tradeoff for shite thoughts between a computer folder I only look at when I'm looking at it and a physical folder I only sort through when I'm sorting through it. Recently I find paper for shite thoughts distracts† me less from the computer task at hand, and the need to sort/review shite thoughts is infrequent.

Paper has the huge advantage of being 2-dimensional, and you can draw nonverbal ideas on it. It's just not worth writing the parser to turn b -> c; a -> d; e -> c; f -> b into a picture, let alone use blobs or colors. So: a big table which gets cleaned periodically, markers/crayons, and a somewhat-filed yet somewhat-accessible place for when you want to return to (last quarter's / last year's) thoughts.

People are deeply opposed to thinking. Banks and "startups" maybe more than most: banks because of sportsbro culture + deadlines + trader quick-decision thinking; "startups" for reasons that are just self-inflicted. Bargain yourself the time, the down-time, the away-from-screen time, the quiet, and the non-work activities scheduled, so you can do actual thinking.




† Related to this: don't underestimate the power of closing your eyes for 30 seconds whenever frustration makes you want to switch tasks. 30–90 seconds of relaxing is better than 2–5 minutes on Facebook / SMS / whatever. For your body as well.



The silver searcher (ag, like ack-grep) is your friend.
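
For example (pattern and path are placeholders):

    ag 'order_id' ~/research/src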



-----

Oh, and versions should be dealt with via version control. I use hginit.com because Joel Spolsky could onboard a normal person for me very quickly. But man git-everyday and man git-tutorial may have eased the path since I learned git.

Strange


Total Posts: 1475
Joined: Jun 2004
 
Posted: 2018-12-07 04:17
@prikolno That's a pretty serious approach (re 100k lines of code); I doubt my entire code base is that much :)

@jslade: "Oh yeah, you are using source control right? Private git repos are too easy to set up, even if you don't use github; you might as well use one."

I do use Subversion, but I have a plan to switch to git, as it seems to be the state of the art. Is there an advantage to using GitHub (they do have commercial accounts that are not visible)?

@Rashomon re "people are opposed to thinking" - yup, most ideas are best conceived slowly, and I find discussing things ad nauseam is a good way of avoiding all sorts of problems. In fact, the ideal discussions are with someone who has a different way of thinking.

I don't interest myself in 'why?'. I think more often in terms of 'when?'...sometimes 'where?'. And always how much?'

jslade


Total Posts: 1148
Joined: Feb 2007
 
Posted: 2018-12-07 19:58
TBH subversion is fine. Arguably handles merges better than git.

github provides a service. You can also provide that service for yourself if you want to waste time configuring things; I do it because I am paranoid.

"Learning, n. The kind of ignorance distinguishing the studious."

prikolno


Total Posts: 26
Joined: Jul 2018
 
Posted: 2018-12-07 22:28
I can probably get 80% of the way there with the first 2k LOC. I think there's one right design principle, and it's not very different from how docker containers work. I think the difficult part's having enough structure in the model layer so that you can version things efficiently, but at the same time enough flexibility there so you can try new things.

I agree with jslade here, subversion isn't bad if you are a one-man team. I also agree with Rashomon, I think symlinks can help you.

EspressoLover


Total Posts: 349
Joined: Jan 2015
 
Posted: 2018-12-09 16:51
Disagree with the others. DVCS is a game changer. The only real upside to the centralized approach is better handling of binary assets, but those should live in artifact repositories anyway, not source control. When you move to git, what you realize is that commits and branches are super lightweight and local. With svn the tendency is to only commit "version changes" with sprawling footprints. With git you tend towards a separate commit every time you change a few lines of code. That just isn't feasible with svn, because each commit changes the repo for everyone else. With git, it doesn't matter if you break the build, because it's only local until you push.

Having much finer granularity on the commit history enables all kinds of productivity boosts. git revert essentially becomes Ctrl-Z in your local workspace. You can "git log -p | grep" to effortlessly find exactly when, where and why some change was made. git bisect is literally a one-button solution to diagnosing bugs. Not to mention source control functionality stays completely available even when you don't have an internet connection.
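
For example (the file, search pattern and tag below are placeholders):

    # undo a specific commit with a new local commit
    git revert HEAD

    # find exactly when/where/why a change was made
    git log -p -S 'max_position' -- src/risk.py

    # binary-search history for the commit that broke a test
    git bisect start
    git bisect bad HEAD
    git bisect good v1.2            # last known-good point
    git bisect run ./run_tests.sh   # git drives the search
    git bisect reset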

Same story with branching. With svn, branches are a pain in the ass, and you probably only use them for major version changes. With git, I'll use local branches as a way to isolate even the smallest change sets. Let's say I'm working on adding some feature X to the codebase when I notice some orthogonal refactoring Y that I want to do. Using git branch, you can easily toggle back and forth between each change, keeping the workspaces single-focused and the changelogs isolated. If you're collaborating with Alice on X, you can push that branch to her, then collaborate with Bob on Y. Neither has to worry about tasks outside their purview.
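
In practice the toggle is just something like this (branch names are placeholders):

    git checkout -b feature-x        # work on X
    git commit -am "wip: feature X"

    git checkout master
    git checkout -b refactor-y       # context-switch to Y cleanly
    git commit -am "refactor Y"

    git push origin feature-x        # Alice pulls this one
    git push origin refactor-y       # Bob pulls this one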

Specific to your question, it also makes lightweight experimentation simple. You can fork a "skunk works" repo to add some experimental features, keep it as long-lived as you like, merge downstream changes from master as needed, and selectively promote changes back into master. You can keep master hooked into a CI pipeline, so that you don't have to worry about untested experimental-fork changes accidentally polluting the stable codebase.
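
The branch-based version of the same idea looks roughly like this (the commit id is a placeholder; a full fork works the same way with an extra remote):

    git checkout -b skunkworks       # long-lived experimental branch
    # ...hack freely...

    git merge master                 # pull in changes from master as needed

    git checkout master              # promote only what graduated
    git cherry-pick <good-commit-sha>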

Good questions outrank easy answers. -Paul Samuelson

Maggette


Total Posts: 1067
Joined: Jun 2007
 
Posted: 2018-12-09 17:26
This might be considered crazy... but for my "side projects" (smaller projects I spend maybe one day a week on, plus private stuff) I use Scrum tools (picture me running for cover).

I had the same problems with these side projects at first. Work was started but often didn't lead to anything, I lost track of stuff, etc.

When I first encountered Scrum I was a hater, and to a certain extent I still am.

But I used some parts of it for these side projects. It's obvious to me that I get more done, even though you "waste" some of your time on planning, writing and estimating stories (or tasks), and refining stories. There is a trade-off here, of course.

But I do have a plan now. My forecasts about what I will get done in the next three weeks are shockingly accurate (and often depressing). But in the end I get more things done. The combination of Jira (I use https://www.openproject.org/de/jira-alternative/ ) and git is also a kind of poor man's documentation.

I think it is not important that you use Scrum or whatever. But spending time on planning and on creating and structuring tasks really helps.

Of course it's research and not a web-service application. But it is totally OK to write down after a task => nothing interesting came out of it => no follow-up activities.

To me it just feels like, by structuring it and doing it consciously, you get much more out of dead ends.

I came here and saw you and your people smiling, and said to myself: Maggette, screw the small talk, better let your fists do the talking...

EspressoLover


Total Posts: 349
Joined: Jan 2015
 
Posted: 2018-12-09 19:01
On the topic of organizing the research process, I'd say there are two major, separate challenges. One is code stability and maintainability while keeping experimentation low-overhead. The other is data provenance.

Code

From a software engineering perspective, research is fairly challenging. The vast majority of code produced in a research context gets thrown away or never used again. Most regular software, by contrast, is produced against a relatively fixed spec, i.e. there's a product spec that calls for X. When we write code to do X, it's likely that it, or some future version of it, will stick around for the life of the product. Against that, it's justifiable to keep strict requirements in terms of software quality: test coverage, coding standards, documentation, maintainability, code review, backwards compatibility, etc.

This isn't really what you want for research. The median line of research code gets written once, used a couple times in the same environment by the same person who wrote it, then forgotten about in a few days. Stability is a much lower priority than making it easy for researchers to quickly experiment with ad-hoc solutions without a lot of formal overhead. Then there's another twist, in that some unpredictable subset of research code will eventually be promoted into production systems.

Dealing with this isn't simple. It's easy to get lazy and avoid good software engineering standards by pretending that core production code is still research. Vice versa, once you've been burned you may go overboard with formal requirements, effectively shutting down innovation. Plus, keep in mind that in most orgs there's a power struggle between researchers and engineers. At the end of the day it takes honest actors with good judgement to decide when, where and how to vary the standards between different parts of the codebase.

One thing that does help is at least being explicit about it. Keep written standards for different levels of code, with everyone on the same page, and be clear about when code graduates from one level to another. Alice shouldn't have the gut feeling that this is still informal experimental code, while Bob is shipping it in a critical system.

I prefer to keep the division simple, two levels: "skunk works", which is the wild west, and "core", which should always be production-safe. Obviously only core should ever be called by core. But even once something's starting to get used across different places in skunk works, it should get promoted. Once a sub-project is used outside a single research team, or revisited outside its original sprint, or grows past a few thousand LoC, or splits into multiple layers of abstraction, then it should probably be promoted.

YMMV. Depending on your use cases and org's personality a different approach is justifiable. Maybe more granularity than just two levels, or different guidelines around promotion, or some other variation. I don't think the details are as important as just articulating a coherent philosophy that you can justify.

Data

Most research is just churning out all sorts of intermediate datasets and derived parameters. Some of these get used further downstream to make more datasets and parameters. Some get pushed right into production. Some get put in front of a human researcher who's trying to glean an insight or make a decision. It's often not really clear, when you're actually generating the data, what the end goal is or what it will wind up being used for. The way the data is generated depends on all kinds of subtle structure and logic.

Data isn't like code. It's not self-explanatory. It's just a blob of bytes, and how exactly we created those bytes is not inherently represented inside the data itself. The challenge is to keep a provenance of how the data was generated and what it actually represents, i.e. metadata. Making metadata useful is really tough, especially when the data comes from complex transformations. Metadata can potentially be much higher-dimensional than the underlying data itself. For fitted alpha coefficients, for example, you might have to track all the parameters used in cleaning and pre-processing, the version of libsvm used, the random seed used in the fit, the date range, the symbol set, all kinds of hyperparameters, etc.
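
One low-tech way to carry that provenance around is a sidecar file written at generation time. The field names and paths below are purely illustrative:

    # write a provenance sidecar next to the fitted artifact
    ARTIFACT=params/alpha_coeffs.json
    {
      echo "code_commit:   $(git rev-parse HEAD)"
      echo "generated_at:  $(date -u +%Y-%m-%dT%H:%M:%SZ)"
      echo "libsvm:        3.23"
      echo "random_seed:   42"
      echo "date_range:    2016-01-04..2018-11-30"
      echo "symbols:       sp500_members_2018q3"
      echo "preprocessing: winsorize_5bp, fwd_fill_gaps"
    } > "${ARTIFACT}.meta"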

The less data you need to provenance, the better. Ideally the only canon would be ground-truth data (e.g. raw capture logs) and code that's sub specie aeterni. Derivations of data (including parameters) are done on the fly as needed and discarded after they're finished. That doesn't mean you can't keep derived data cached, but it's treated as scratch work rather than a canonical source of truth. As soon as there's any question about where it came from (e.g. was this made with the most recent version of the library?), you just discard and regenerate, rather than trying to investigate the origins of the current dataset.

There may be certain barriers to this working. One is computational constraints: if it costs $500,000 in computer time and takes three weeks to fit some parameters, then generating on the fly won't work for you. Another is if one-button regeneration isn't practical. Maybe at some point in the pipe a human actually needs to use their judgement to make a decision, or your current software doesn't have the hooks for it (although something like Apache Airflow should make this easy, even if you're wrapping a bunch of disparate clunky systems). And most of the time actual production parameters should be stable, and not reset because of some minor commit in the fit library.

Even then, it's still helpful to focus on keeping the surface area for provenance small. If you're compute-constrained, only canonize the most upstream transformation that's past the compute barrier; anything downstream, derive on the fly when it's cheap. The fewer artifacts that need metadata, the simpler the schema you can use. Millions of artifacts in canon are going to require a machine-readable schema with every possible dimension included, to be on the safe side. But for a single production param set, the metadata can just be a plain-English changelog, and the strategist can use her personal judgement about when a refresh is needed.

Good questions outrank easy answers. -Paul Samuelson

Strange


Total Posts: 1475
Joined: Jun 2004
 
Posted: 2018-12-10 02:15
@jslade github provides a service. You can also provide that service for yourself if you want to waste time configuring things; I do it because I am paranoid.

Are there any secure services out there? I.e. someone who provides a DVCS but has no access to your codebase? As a minor point of paranoia, I do recall reading that some fed agency requested the source code of some company and got it directly from Bitbucket (or someone like that).

I don't interest myself in 'why?'. I think more often in terms of 'when?'...sometimes 'where?'. And always how much?'

Strange


Total Posts: 1475
Joined: Jun 2004
 
Posted: 2018-12-10 02:18
@EspressoLover Lots to think about. I was not thinking of changing my process, but now I might. A complicating addition is the fundamental/discretionary research notes and such.

I don't interest myself in 'why?'. I think more often in terms of 'when?'...sometimes 'where?'. And always how much?'

Rashomon


Total Posts: 190
Joined: Mar 2011
 
Posted: 2018-12-10 23:28
Great points as usual, EspressoLover. Rmd's / Jupyter / lhs help with data provenance. Stitching these together across a large team is about as well organized as lawyers e-mailing each other Word doc revisions, though.

Strange: Yes, Amazon and Bitbucket/GitHub/competitors all have secure cloud storage for government / HIPAA / etc. The instructions on how to set up a git server yourself are pretty simple, though, in schacon's Pro Git book. (It's one chapter; you can do it in half a day.) I do it not because I'm paranoid, but because I hate SV. I wouldn't pay for FiresideChat when IRC servers are free, given the claims I would make in any interview about my level of technological ability.
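
The Pro Git recipe boils down to roughly this on a Debian-ish box (user names and paths are placeholders; the book also restricts the account with git-shell):

    # on the server: a dedicated git user, collaborators' ssh keys, bare repos
    sudo adduser --disabled-password git
    sudo -u git mkdir -p /home/git/.ssh
    cat alice.pub bob.pub | sudo -u git tee -a /home/git/.ssh/authorized_keys
    sudo -u git git init --bare /home/git/research.git

    # on each client
    git clone git@yourbox:/home/git/research.git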

Strange


Total Posts: 1475
Joined: Jun 2004
 
Posted: 2018-12-11 14:26
@Rashomon I understand that running a git server is not rocket science, but it all adds up time-wise. E.g. if I run my own git server, my own DB instance, optimize my own servers, etc., two things would happen: first, I'd do it worse than someone who understands how to do it well, and second, I'd waste time that I should be spending searching for that elusive alpha. Outsourcing non-core stuff makes a lot of sense, I think.

One thing I realized reading the above ideas is that I should stop emailing myself and switch to something else.


I don't interest myself in 'why?'. I think more often in terms of 'when?'...sometimes 'where?'. And always how much?'

ronin


Total Posts: 385
Joined: May 2006
 
Posted: 2018-12-11 14:47
Meh. Keeping it all together is overrated. I don't think I have kept it all together since at least 1996.


"There is a SIX am?" -- Arthur